MR Fingers needs to upload documents to a system where users of the web app, mobile apps, and APIs can upload files quickly without compromising system security or file integrity. This project delivers a file upload system that is fast, secure, and usable across every area of the platform, with a set of policies that protects end users from having their files exposed, even within the organisation.
- The system should be scalable and initially handle:
  - Scale: 10M+ users, 50TB+ storage.
  - File size: mostly PDFs and images (average 2MB, max 20MB).
- Increase the upload speed and create seamless CX.
- Mitigate security breaches
- Keep file metadata database
- IAM User groups
- Authentication
- VPC / Subnets configuration
- Lambda warm strategy / provisioned concurrency
The client component is the door to the upload platform. It lets end users upload files via the web interface or the app. It can also be API-based, since any caller with a valid auth token can request presigned URLs via the API.
The API gateway is the gatekeeper between the client and the application, with the exception of the presigned URL uploads (which are handled directly by AWS S3). It should enforce rate-limit rules and check authorization via Cognito (which is not part of this TDD).
The gateway is a REST API with a single {proxy+} resource forwarding everything to one Lambda, which dispatches via chi internally. Three logical endpoints are exposed:
`POST /files`: Initiates a multipart upload. Body `{ "name": "...", "size": 12345 }`. Returns the file ID, multipart upload ID, chunk size, and a list of presigned PUT URLs (one per 5 MB chunk).
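To make the chunking concrete, here is a minimal Go sketch of how the part list for a multipart upload could be derived from the declared `size`. The 5 MB chunk size and 20 MB limit come from this document; the sketch assumes "5 MB" means 5 MiB, and `planParts` and `Part` are hypothetical names, not the project's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	chunkSize   = 5 * 1024 * 1024  // assumed: 5 MiB per part
	maxFileSize = 20 * 1024 * 1024 // 20 MB upload limit from this document
)

// Part describes one chunk the client must PUT (hypothetical type).
type Part struct {
	Number int   // 1-based part number
	Offset int64 // byte offset of the chunk within the file
	Size   int64 // chunk length in bytes; only the last part may be shorter
}

// planParts splits a file of the given size into chunkSize parts.
func planParts(size int64) ([]Part, error) {
	if size <= 0 || size > maxFileSize {
		return nil, errors.New("size must be between 1 byte and 20 MB")
	}
	var parts []Part
	for offset := int64(0); offset < size; offset += chunkSize {
		n := int64(chunkSize)
		if remaining := size - offset; remaining < n {
			n = remaining
		}
		parts = append(parts, Part{Number: len(parts) + 1, Offset: offset, Size: n})
	}
	return parts, nil
}

func main() {
	parts, err := planParts(12 * 1024 * 1024)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, p := range parts {
		fmt.Printf("part %d: offset=%d size=%d\n", p.Number, p.Offset, p.Size)
	}
}
```

Under these assumptions a 12 MiB file yields three parts (two full chunks and a 2 MiB tail), and each part would get its own presigned PUT URL.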
`POST /files/{id}/complete-upload`: Finalises the multipart upload after the client has PUT every chunk. Body `{ "parts": [{ "part_number": 1, "etag": "..." }] }`. Flips the DynamoDB row from pending_upload to pending_scan, after which GuardDuty Malware Protection takes over.
`GET /files/{id}`: Returns a presigned download URL, but only when the file's status is clean. Files in pending_scan, infected, or deleted return 409.
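The download rule above can be sketched as a default-deny check in Go. The status names are taken from this document; `downloadStatusCode` is a hypothetical helper, and treating `pending_upload` (or any unknown status) as 409 is an assumption, since the text only lists `pending_scan`, `infected`, and `deleted`.

```go
package main

import (
	"fmt"
	"net/http"
)

// File statuses named in this document.
const (
	StatusPendingUpload = "pending_upload"
	StatusPendingScan   = "pending_scan"
	StatusClean         = "clean"
	StatusInfected      = "infected"
	StatusDeleted       = "deleted"
)

// downloadStatusCode returns the HTTP status for GET /files/{id}.
// Only "clean" unlocks a download; every other status, including ones
// this sketch does not know about, falls through to 409 (assumption).
func downloadStatusCode(status string) int {
	if status == StatusClean {
		return http.StatusOK
	}
	return http.StatusConflict
}

func main() {
	for _, s := range []string{
		StatusPendingUpload, StatusPendingScan, StatusClean, StatusInfected, StatusDeleted,
	} {
		fmt.Printf("%-15s -> %d\n", s, downloadStatusCode(s))
	}
}
```

Writing the check as an allow-list of one status keeps the gate closed if a new status is added later.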
- API key required at the gateway (`x-api-key` header).
- Identity is bound to `requestContext.identity.apiKeyId`, never the secret value, and recorded as `uploaded_by` in DynamoDB.
- Usage plan: 10 rps / burst 20 / 1000 requests per day.
- 20 MB max file size; designed to scale for 10M+ users and 50TB+ storage.
Full setup as JSON-as-data in infra/api-gateway/.
S3 is our go-to choice: it is the market default and offers high scalability, speed (presigned links support streamlined chunk uploads), good support, a good development experience, and integration with other AWS products such as EventBridge. It needs a set of policies in place to achieve system security: a bucket policy denies s3:GetObject unless GuardDuty has tagged the object with NO_THREATS_FOUND, and compromised files are deleted outright by a remediation Lambda triggered via EventBridge.
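A rough sketch of what such a bucket policy statement might look like, generated from Go so the shape is easy to inspect. This is not the project's actual policy: the bucket name is a placeholder, the statement assumes GuardDuty's scan result tag key is `GuardDutyMalwareScanStatus`, and a real policy would likely need further statements (for example, exemptions for the scanning and remediation roles).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// statement and policy model just enough of an S3 bucket policy
// document for this sketch.
type statement struct {
	Sid       string                       `json:"Sid"`
	Effect    string                       `json:"Effect"`
	Principal string                       `json:"Principal"`
	Action    string                       `json:"Action"`
	Resource  string                       `json:"Resource"`
	Condition map[string]map[string]string `json:"Condition"`
}

type policy struct {
	Version   string      `json:"Version"`
	Statement []statement `json:"Statement"`
}

// denyUnscannedReads builds a Deny on s3:GetObject for every object whose
// scan tag is not NO_THREATS_FOUND. The tag key is an assumption.
func denyUnscannedReads(bucket string) policy {
	return policy{
		Version: "2012-10-17",
		Statement: []statement{{
			Sid:       "DenyReadUnlessScannedClean",
			Effect:    "Deny",
			Principal: "*",
			Action:    "s3:GetObject",
			Resource:  "arn:aws:s3:::" + bucket + "/*",
			Condition: map[string]map[string]string{
				"StringNotEquals": {
					"s3:ExistingObjectTag/GuardDutyMalwareScanStatus": "NO_THREATS_FOUND",
				},
			},
		}},
	}
}

func main() {
	// "example-upload-bucket" is a placeholder name.
	out, err := json.MarshalIndent(denyUnscannedReads("example-upload-bucket"), "", "  ")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```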
We decided to delete infected files outright instead of keeping a second quarantine bucket. Once GuardDuty flags THREATS_FOUND, the remediation Lambda performs an s3:DeleteObject (with VersionId, because the bucket has versioning enabled) and updates the DynamoDB row to status=deleted with a 30-day TTL. The DynamoDB row is the audit trail; the file bytes add no value once we have the metadata. A second bucket would only become necessary if forensic copies of malicious uploads were a hard requirement.
AWS Lambda (setup)
As this project is for personal use and for ramping up on Lambda architecture with Go, we followed a simple approach: a chi router inside a single monolithic Lambda. This is not a recommended approach for real businesses, but here we chose fast shipping, learning, and simple code organisation over the Terraform (IaC) learning curve and fine-grained permissions. We should also bear in mind other trade-offs such as cold starts and memory usage per endpoint (which are now coupled).
Future-proofing note: the current {proxy+} API Gateway resource routes POST /files, POST /files/{id}/complete-upload, and GET /files/{id} through the same safe-upload-create-file Lambda process. This keeps the MVP simple, but couples route-level concurrency, cold starts, memory, timeout, logs, metrics, alarms, deploys, and IAM permissions. Do not split this yet without evidence; consider separate API Lambdas only if one route starts starving another, route-level debugging becomes painful, or an endpoint needs materially different concurrency, timeout, memory, or IAM boundaries.

End-to-end coverage runs against the deployed dev stage as black-box scripts — no AWS credentials needed beyond the public API key.
| Script | What it proves |
|---|---|
| `tests/smoke.sh` | Happy path: POST /files → PUT bytes to presigned URL → complete-upload → GuardDuty scan completes → status clean → presigned download → bytes round-trip |
| `tests/security-smoke.sh` | Security path: EICAR signature uploaded → GuardDuty flags it → remediation Lambda deletes the object → GET /files/{id} stays 409 (never returns 200). EICAR pattern generated inline; no on-disk fixture required |
Run via make smoke and make security-smoke from repo root after exporting SAFE_UPLOAD_URL and SAFE_UPLOAD_KEY. See tests/README.md for env-var setup, tunable timeouts, and troubleshooting notes.
Two-tier defence-in-depth gate, with the same checks running locally before commit and in CI on pull requests.
ci-checks is a chain of five targets, ordered cheapest-first so failures surface fast.
What: lists Go source files whose formatting doesn't match the canonical style. Empty output = pass; any file listed = fail.
Why: Go has a single, opinionated formatter. Treating "is the code formatted?" as a check eliminates an entire category of code-review bikeshedding (tabs vs spaces, brace placement, import ordering) and makes diffs cleaner, so every formatting commit is intentional.
Cost: sub-second.
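The idea behind this check can be illustrated with the standard library's `go/format` package, which produces canonical gofmt-style output. This is a sketch of the concept, not what the Makefile target runs; `isFormatted` is a hypothetical helper.

```go
package main

import (
	"bytes"
	"fmt"
	"go/format"
)

// isFormatted reports whether src is already in canonical gofmt style by
// formatting it and comparing the result byte for byte.
func isFormatted(src []byte) (bool, error) {
	formatted, err := format.Source(src)
	if err != nil {
		return false, fmt.Errorf("parse source: %w", err)
	}
	return bytes.Equal(src, formatted), nil
}

func main() {
	good := []byte("package demo\n\nfunc Add(a, b int) int {\n\treturn a + b\n}\n")
	bad := []byte("package demo\n\nfunc Add(a,b int) int {\nreturn a+b\n}\n")

	for name, src := range map[string][]byte{"good.go": good, "bad.go": bad} {
		ok, err := isFormatted(src)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		if !ok {
			// A listing-style check would print the file name and fail.
			fmt.Println(name, "needs formatting")
		}
	}
}
```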
What: static analysis catching suspicious constructs the compiler accepts but are usually bugs — printf format-string mismatches, unreachable code, struct-tag typos, locks copied by value, etc.
Why: it's a bug detector, not a linter. `go test` itself runs a high-confidence subset of `go vet` checks and refuses to run the tests when they fail. Cheap, catches real bugs.
Limits: doesn't replace a real linter (golangci-lint); we may layer those in later if needed.
What: runs the Lambda unit test suite across lambda/.
Why: handler logic is tested with injected AWS client interfaces and hand-written mocks, so core behavior is checked without touching real AWS resources.
Limits: unit tests do not prove API Gateway mapping, S3 presigned URL behavior, GuardDuty scan timing, or EventBridge delivery — those stay in smoke tests.
What: cross-compiles both Lambda binaries (GOOS=linux GOARCH=arm64) and zips them ready for update-function-code.
Why: "compiles on my machine" ≠ "builds for the Lambda runtime." Same toolchain CI uses, same target architecture as production. Missing imports, broken module references, type errors all surface here.
Side-effect: the zips it produces are exactly what deploy.yml later uploads — no surprise rebuilds at deploy time.
What: scans the working tree and full git history for patterns matching known secret formats (AWS access keys, GitHub tokens, Stripe keys, generic high-entropy strings, etc.). Default ruleset has ~150 detectors.
Why: once a secret is committed it's effectively published — git log retains it, public repos are mirrored and indexed within minutes. Local dev hygiene cannot be the only line of defence.
Real catch: during setup gitleaks flagged lambda/output.json (committed in 5b0ff1f) containing two ASIA...-prefixed STS temporary tokens. They had a 15-minute expiry from March, so functionally dead — but gitleaks correctly flagged them. We deleted the file, gitignored it, and wrote a .gitleaks.toml allowlist for that path. History wasn't rewritten because the credentials were already expired.
--redact: prints "REDACTED" in place of the matched value so logs don't leak the secret further.
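To show the kind of pattern matching and redaction involved, here is a deliberately simplified Go sketch for AWS access key IDs. The regular expression is an illustration written for this document, not gitleaks' actual rule, and `redact` is a hypothetical helper. The sample keys are obviously fake placeholder values.

```go
package main

import (
	"fmt"
	"regexp"
)

// awsKeyID matches strings shaped like AWS access key IDs: the AKIA
// (long-lived) or ASIA (temporary STS) prefix followed by 16 uppercase
// alphanumerics. Simplified for illustration.
var awsKeyID = regexp.MustCompile(`\b(?:AKIA|ASIA)[A-Z0-9]{16}\b`)

// redact replaces every match with "REDACTED" and reports whether
// anything was found, so a caller can fail the check without echoing
// the secret into logs.
func redact(line string) (string, bool) {
	if !awsKeyID.MatchString(line) {
		return line, false
	}
	return awsKeyID.ReplaceAllString(line, "REDACTED"), true
}

func main() {
	lines := []string{
		`"AccessKeyId": "ASIAIOSFODNN7EXAMPLE"`,
		`aws_access_key_id = AKIAIOSFODNN7EXAMPLE`,
		`region = eu-west-1`,
	}
	for _, l := range lines {
		out, found := redact(l)
		fmt.Printf("found=%-5v %s\n", found, out)
	}
}
```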
What's deliberately outside ci-checks:
- Smoke tests: run post-deploy because they require the deployed API URL and API key.
- AWS drift checks: run with `make verify-aws` only when AWS read credentials are available.
- Optional security audit: `make security-audit` includes `govulncheck`; useful before releases, but not in pre-commit until the local dependency is part of the standard setup.
- Extra linters (`golangci-lint`, `gosec`): useful later, but should be baselined before becoming required gates.
- Local pre-commit hook (`scripts/pre-commit.sh`, installed via `make install-hooks`) runs `make ci-checks` on every `git commit`. Bypassable with `--no-verify`; CI is the backstop.
- Local pre-push hook (`scripts/pre-push.sh`, installed via `make install-hooks`) runs `make coderabbit-severity-gate` before `git push`, using `CODERABBIT_BASE=origin/main`, `CODERABBIT_TYPE=committed`, and `CODERABBIT_BLOCK_SEVERITIES="critical major"` by default. Bypass intentionally with `SKIP_CODERABBIT_REVIEW=1 git push` or `git push --no-verify`.
- PR gate (`.github/workflows/ci.yml`) runs the same checks on every pull request to `main`. Required status check; merge blocked until green.
- Deploy (`.github/workflows/deploy.yml`) re-runs `ci-checks` defensively, then deploys, then re-runs the smoke scripts.
- PR template: `.github/pull_request_template.md` forces scope, validation, security/AWS impact, cost, rollback, and docs review before merge.
- Dependabot: `.github/dependabot.yml` opens weekly Go module and GitHub Actions update PRs.
- CodeRabbit CLI: `make coderabbit-review` is the primary human-readable AI-review loop. It runs local CodeRabbit review against `CODERABBIT_BASE` (default `main`) and `CODERABBIT_TYPE` (default `all`), then passes `AGENTS.md`, `README.md`, and `.coderabbit.yaml` as review context. Use `make coderabbit-auth-login`, `make coderabbit-auth-status`, `make coderabbit-agent-auth-login`, `make coderabbit-agent-auth-status`, `make coderabbit-doctor`, and `make coderabbit-update` for setup and diagnostics. This is outside `ci-checks` because it requires local auth, network access, and consumes per-developer CLI review quota.
- CodeRabbit severity gate: `make coderabbit-severity-gate` uses structured `coderabbit review --agent` output and exits `1` only when a finding severity matches `CODERABBIT_BLOCK_SEVERITIES` (default `critical major`). Minor, trivial, and info findings are reported but do not block push. Structured mode uses the local authenticated CLI session from `make coderabbit-agent-auth-login`; `CODERABBIT_API_KEY` / `CODE_RABBIT_API_KEY` are optional local-only overrides and must never be committed.
- CodeRabbit agent skill: `.agents/skills/code-review/SKILL.md` was installed with `npx skills add coderabbitai/skills -a codex -s code-review -y`; `skills-lock.json` pins the installed skill source/hash so agents can restore it with `npx skills experimental_install`.
- CodeRabbit auth hygiene: run `make coderabbit-auth-login` in your own terminal. Do not paste CodeRabbit OAuth tokens or API keys into chat, PRs, logs, or committed files.
- CodeRabbit PR app: `.coderabbit.yaml` keeps the hosted PR reviewer non-blocking, enables free-tier behavior, skips dependency bot PRs, and gives project-specific review guidance. Treat it as a PR-side supplement to the local CLI, not the primary gate.
- Security audit: `make security-audit` runs `gitleaks` plus `govulncheck` when both tools are installed. It is not in pre-commit yet because `govulncheck` is a separate local dependency.
- AWS drift check: `make verify-aws` runs read-only AWS CLI checks against the deployed MVP. It verifies intended live state, but is intentionally outside `ci-checks` because it requires AWS credentials and read permissions.
The deploy workflow fires on push to main (i.e. PR merge) and updates both Lambda zips. To call AWS it needs credentials. We don't store any.
The "obvious" path is to create an IAM user, generate access keys (AKIA... + secret), store them as GitHub repo secrets, and let the workflow export them as env vars. This works, and it's how a lot of AWS credential leaks have happened. Problems:
- Long-lived: keys are valid until manually rotated. Leaks (compromised laptop, malicious workflow dep, ex-contributor) stay valid.
- Out of band: GitHub stores them; AWS doesn't know where they live or who's using them.
- Hard to scope and rotate: usually end up over-privileged because per-workflow IAM users are operationally painful.
OpenID Connect is a thin auth layer on OAuth 2.0. The model: an issuer signs short-lived JWTs proving "this caller is X." A relying party verifies the signature against the issuer's public keys and trusts the claims.
In our case:
- Issuer: GitHub's OIDC service at `https://token.actions.githubusercontent.com`. Mints a JWT for every workflow run, automatically.
- Relying party: AWS IAM. We told it (by creating an Identity Provider resource) that this issuer is trusted; AWS knows where to fetch its public keys from `/.well-known/openid-configuration`.
- Action: `sts:AssumeRoleWithWebIdentity`, which exchanges a verified JWT for short-lived AWS credentials.
1. `deploy.yml` starts because someone pushed to `main`.
2. The runner asks GitHub's OIDC service for a token. GitHub mints a signed JWT containing claims like:
   - `iss`: `https://token.actions.githubusercontent.com`
   - `aud`: `sts.amazonaws.com` (the audience the AWS credentials action requests; the workflow needs `permissions: id-token: write` to request a token at all)
   - `sub`: `repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main`
   - `repository`, `ref`, `actor`, `run_id`, `sha`, etc.
3. `aws-actions/configure-aws-credentials@v4` receives the token and calls `sts:AssumeRoleWithWebIdentity` against AWS, passing the JWT and the role ARN.
4. AWS STS verifies the JWT signature against GitHub's public keys, then evaluates the role's trust policy. The assume call is only allowed if the audience matches AND the subject is exactly that repo + branch:

       "Condition": {
         "StringEquals": {
           "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
           "token.actions.githubusercontent.com:sub": "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main"
         }
       }

5. STS issues short-lived credentials, typically valid for 1 hour. The action exports them as env vars (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`, all `***`-redacted in logs). They are never stored anywhere and expire on their own.
6. The rest of the workflow uses those credentials for `aws lambda update-function-code` etc.
The sub claim is the security boundary:
- Forked-repo attack. Someone forks the repo and tries to assume the role from a malicious workflow. Their `sub` is `repo:attacker/safe-upload-platform:...`, which doesn't match. Denied.
- Wrong-branch attack. A PR from a feature branch in our own repo runs malicious code. Their `sub` is `...:ref:refs/heads/feature-x`, which doesn't match `main`. Denied. (`deploy.yml` only fires on push to main anyway, so this is defence in depth.)
- Different repo. Some unrelated repo of yours tries to assume the role. Different `repository` claim, different `sub`. Denied.
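The three cases above reduce to an exact string comparison on two claims. The following Go sketch models only the `StringEquals` condition from the trust policy; signature verification and everything else STS does is out of scope, and `claims` and `allowAssume` are hypothetical names.

```go
package main

import "fmt"

// claims holds the two JWT claims the trust policy condition inspects.
type claims struct {
	Aud string
	Sub string
}

// Expected values, copied from the trust policy condition in this document.
const (
	wantAud = "sts.amazonaws.com"
	wantSub = "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main"
)

// allowAssume mirrors the StringEquals condition: both claims must match
// exactly, with no prefix or wildcard matching.
func allowAssume(c claims) bool {
	return c.Aud == wantAud && c.Sub == wantSub
}

func main() {
	cases := map[string]claims{
		"push to main":   {Aud: wantAud, Sub: wantSub},
		"forked repo":    {Aud: wantAud, Sub: "repo:attacker/safe-upload-platform:ref:refs/heads/main"},
		"feature branch": {Aud: wantAud, Sub: "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/feature-x"},
		"wrong audience": {Aud: "example.com", Sub: wantSub},
	}
	for name, c := range cases {
		fmt.Printf("%-15s allowed=%v\n", name, allowAssume(c))
	}
}
```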
Once assumed, the role's permission policy (infra/iam/github-actions-deploy-policy.json) is intentionally tiny:
    {
      "Action": [
        "lambda:UpdateFunctionCode",
        "lambda:GetFunction",
        "lambda:GetFunctionConfiguration"
      ],
      "Resource": "arn:aws:lambda:<AWS_REGION>:<AWS_ACCOUNT_ID>:function:safe-upload-*"
    }

Even with valid temporary credentials, the role cannot:
- Touch IAM, API Gateway, DynamoDB, or S3.
- Read CloudWatch logs.
- Invoke Lambdas (so a compromised workflow can't burn the AWS budget by spamming invocations).
- Create new functions; it can only update existing `safe-upload-*` ones.
Principle of least privilege made concrete.
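As a rough mental model of that permission policy, the Go sketch below checks an action and resource against the three allowed actions and the `safe-upload-*` resource pattern. It is a simplification of IAM evaluation (no explicit denies, no other policies), the region and account ID are placeholders, and `allowed` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedActions lists the only actions the deploy role grants.
var allowedActions = map[string]bool{
	"lambda:UpdateFunctionCode":       true,
	"lambda:GetFunction":              true,
	"lambda:GetFunctionConfiguration": true,
}

// resourcePrefix stands in for the policy's trailing-wildcard resource.
// The region and account ID are placeholders.
const resourcePrefix = "arn:aws:lambda:eu-west-1:123456789012:function:safe-upload-"

// allowed reports whether the request is covered by the policy. Anything
// not explicitly allowed is denied, as in IAM's implicit deny.
func allowed(action, resource string) bool {
	return allowedActions[action] && strings.HasPrefix(resource, resourcePrefix)
}

func main() {
	fn := resourcePrefix + "create-file"
	fmt.Println(allowed("lambda:UpdateFunctionCode", fn)) // allowed action and resource
	fmt.Println(allowed("lambda:InvokeFunction", fn))     // action not granted
	fmt.Println(allowed("lambda:CreateFunction", fn))     // action not granted
	fmt.Println(allowed("lambda:UpdateFunctionCode",
		"arn:aws:lambda:eu-west-1:123456789012:function:other-service")) // resource outside pattern
}
```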
- AWS side: the OIDC Identity Provider resource + the `safe-upload-github-deploy` role. Created once via console; templates in `infra/iam/github-actions-deploy-trust-policy.json` and `github-actions-deploy-policy.json`.
- GitHub side: one repo variable (`AWS_DEPLOY_ROLE_ARN`) telling the workflow which role to assume. No long-lived secret.
- Walkthrough: `infra/iam/github-actions-oidc-setup.md` documents the entire one-time setup.



