Rationale

MR Fingers needs a system through which users of its web, app, and API clients can upload documents quickly without compromising system security or file integrity. This project delivers a file upload system that is fast, secure, and reusable across the platform, with a set of policies that protect end users from having their files exposed, even within the organisation.

Goals

The system should be scalable and should initially:
- Handle the target scale: 10M+ users and 50TB+ of storage.
- Handle the expected files: mostly PDFs and images (average 2MB, max 20MB).
- Increase upload speed and create a seamless customer experience (CX).
- Mitigate security breaches.
- Keep a file metadata database.

Non-Goals

IAM User groups
Authentication
VPC / Subnets configuration
Lambda warm strategy / provisioned concurrency

Proposed architecture

User/Client

The client is the door to the upload platform. End users upload files through the web interface or the app; API clients can also request presigned URLs directly, provided they hold a valid auth token.

API Gateway

The API Gateway is the gatekeeper between the client and the application, with the exception of presigned URL uploads, which go straight to AWS S3. It should enforce rate-limit rules and check authorization via Amazon Cognito (which is not part of this TDD).

The gateway is a REST API with a single {proxy+} resource forwarding everything to one Lambda, which dispatches via chi internally. Three logical endpoints are exposed:

`POST /files`

Initiates a multipart upload. Body { "name": "...", "size": 12345 }. Returns the file ID, multipart upload ID, chunk size, and a list of presigned PUT URLs (one per 5 MB chunk).
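As a rough illustration of the "one presigned URL per 5 MB chunk" rule, the sketch below plans the parts for a given file size. The `Part` type and `PlanParts` function are hypothetical names, not taken from the repository, and the sketch assumes "5 MB" and "20 MB" mean 5 MiB and 20 MiB (S3 requires every part except the last to be at least 5 MiB).

```go
package main

import "fmt"

// ChunkSize is the part size named in the text, assumed here to be 5 MiB.
const ChunkSize int64 = 5 * 1024 * 1024

// MaxFileSize is the 20 MB upload limit, assumed here to be 20 MiB.
const MaxFileSize int64 = 20 * 1024 * 1024

// Part describes one chunk the client must PUT (hypothetical type).
type Part struct {
	Number int   // 1-based part number, as S3 multipart uploads expect
	Offset int64 // byte offset of the chunk within the file
	Length int64 // chunk length in bytes
}

// PlanParts splits a file of the given size into ChunkSize parts.
// Only the last part may be shorter than ChunkSize.
func PlanParts(size int64) ([]Part, error) {
	if size <= 0 || size > MaxFileSize {
		return nil, fmt.Errorf("size %d outside (0, %d]", size, MaxFileSize)
	}
	var parts []Part
	for off := int64(0); off < size; off += ChunkSize {
		length := ChunkSize
		if size-off < ChunkSize {
			length = size - off
		}
		parts = append(parts, Part{Number: len(parts) + 1, Offset: off, Length: length})
	}
	return parts, nil
}

func main() {
	parts, err := PlanParts(12 * 1024 * 1024)
	if err != nil {
		panic(err)
	}
	for _, p := range parts {
		fmt.Printf("part %d: offset=%d length=%d\n", p.Number, p.Offset, p.Length)
	}
}
```

Under these assumptions a file at the 20 MB limit needs four parts, so the response never carries more than four presigned URLs.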

`POST /files/{id}/complete-upload`

Finalises the multipart upload after the client has PUT every chunk. Body { "parts": [{ "part_number": 1, "etag": "..." }] }. Flips the DynamoDB row from pending_upload to pending_scan, after which GuardDuty Malware Protection takes over.
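The request body above could be decoded and checked along these lines before the handler calls S3. This is a minimal sketch, not the repository's handler: `CompletedPart` and `ParseCompleteRequest` are hypothetical names, and the rule that parts must be numbered 1..N without gaps is an assumption of this sketch (S3 itself only requires the part list to be in ascending order).

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sort"
)

// CompletedPart mirrors one entry of the "parts" array in the request body.
type CompletedPart struct {
	PartNumber int    `json:"part_number"`
	ETag       string `json:"etag"`
}

type completeRequest struct {
	Parts []CompletedPart `json:"parts"`
}

// ParseCompleteRequest decodes the body, sorts the parts in ascending order,
// and rejects empty lists, gaps in the numbering, and missing ETags.
func ParseCompleteRequest(body []byte) ([]CompletedPart, error) {
	var req completeRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return nil, fmt.Errorf("decode body: %w", err)
	}
	if len(req.Parts) == 0 {
		return nil, errors.New("parts must not be empty")
	}
	sort.Slice(req.Parts, func(i, j int) bool {
		return req.Parts[i].PartNumber < req.Parts[j].PartNumber
	})
	for i, p := range req.Parts {
		if p.PartNumber != i+1 {
			return nil, fmt.Errorf("expected part %d, got %d", i+1, p.PartNumber)
		}
		if p.ETag == "" {
			return nil, fmt.Errorf("part %d has no etag", p.PartNumber)
		}
	}
	return req.Parts, nil
}

func main() {
	body := []byte(`{"parts":[{"part_number":2,"etag":"b"},{"part_number":1,"etag":"a"}]}`)
	parts, err := ParseCompleteRequest(body)
	fmt.Println(parts, err)
}
```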

`GET /files/{id}`

Returns a presigned download URL — but only when the file's status is clean. Files in pending_scan, infected, or deleted return 409.
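The status values mentioned across the three endpoints form a small lifecycle. The sketch below is one possible way to model it, with assumptions stated plainly: the transition table (including infected leading to deleted) is inferred from the descriptions in this document, and treating pending_upload as 409 on download is a guess, since the text only lists pending_scan, infected, and deleted. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status is the file status stored in the DynamoDB row.
type Status string

const (
	PendingUpload Status = "pending_upload"
	PendingScan   Status = "pending_scan"
	Clean         Status = "clean"
	Infected      Status = "infected"
	Deleted       Status = "deleted"
)

// allowed lists the transitions assumed by this sketch.
var allowed = map[Status][]Status{
	PendingUpload: {PendingScan},     // complete-upload
	PendingScan:   {Clean, Infected}, // scan verdict
	Infected:      {Deleted},         // remediation
}

// CanTransition reports whether a row may move from one status to another.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

// DownloadStatusCode returns the HTTP status for GET /files/{id}:
// 200 only for clean files, 409 for every other known status.
func DownloadStatusCode(s Status) int {
	if s == Clean {
		return http.StatusOK
	}
	return http.StatusConflict
}

func main() {
	for _, s := range []Status{PendingUpload, PendingScan, Clean, Infected, Deleted} {
		fmt.Printf("%-14s -> %d\n", s, DownloadStatusCode(s))
	}
}
```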

Auth & limits

API key required at the gateway (x-api-key header).
Identity is bound to requestContext.identity.apiKeyId — never the secret value — and recorded as uploaded_by in DynamoDB.
Usage plan: 10 rps / burst 20 / 1000 requests per day.
20 MB max file size; designed to scale for 10M+ users and 50TB+ storage.

Full setup as JSON-as-data in infra/api-gateway/.
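To make the identity rule under Auth & limits concrete, here is a minimal sketch of reading requestContext.identity.apiKeyId from the REST proxy event using only the standard library. The `proxyEvent` struct and `UploaderID` function are hypothetical and mirror only the fields the sketch needs; the real handler may well use the event types from the AWS Lambda Go library instead.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// proxyEvent mirrors only the part of the API Gateway REST proxy event
// that carries the caller's API key information.
type proxyEvent struct {
	RequestContext struct {
		Identity struct {
			APIKey   string `json:"apiKey"`   // secret value: never stored
			APIKeyID string `json:"apiKeyId"` // stable ID: safe to record
		} `json:"identity"`
	} `json:"requestContext"`
}

// UploaderID returns the API key ID to record as uploaded_by.
// It deliberately ignores the secret apiKey value.
func UploaderID(rawEvent []byte) (string, error) {
	var ev proxyEvent
	if err := json.Unmarshal(rawEvent, &ev); err != nil {
		return "", fmt.Errorf("decode event: %w", err)
	}
	id := ev.RequestContext.Identity.APIKeyID
	if id == "" {
		return "", errors.New("missing requestContext.identity.apiKeyId")
	}
	return id, nil
}

func main() {
	raw := []byte(`{"requestContext":{"identity":{"apiKey":"s3cr3t","apiKeyId":"abc123"}}}`)
	id, err := UploaderID(raw)
	fmt.Println(id, err)
}
```

Recording the ID rather than the key means a leaked DynamoDB row cannot be replayed as a credential.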

S3 Bucket

S3 is our go-to choice: it is the market default and offers high scalability, speed (presigned links support streamlined chunked uploads), good support, a good development experience, and integration with other AWS products such as EventBridge. To keep the system secure it needs a set of policies: a bucket policy denies s3:GetObject unless GuardDuty has tagged the object with NO_THREATS_FOUND, and compromised files are deleted outright by a remediation Lambda triggered via EventBridge.

Trade-offs

We decided to delete infected files outright instead of keeping them in a second quarantine bucket. Once GuardDuty flags THREATS_FOUND, the remediation Lambda performs an s3:DeleteObject (with VersionId, since the bucket has versioning enabled) and updates the DynamoDB row to status=deleted with a 30-day TTL. The DynamoDB row is the audit trail; the file bytes add no value once we have the metadata. A second bucket would only become necessary if forensic copies of malicious uploads were a hard requirement.

Amazon EventBridge

Amazon GuardDuty

AWS Lambda (setup)

Trade-offs

Chi Router vs Terraform + IaC

This project is for personal use and for ramping up on Lambda architecture with Go, so we took the simple approach of using the chi router, which results in a monolithic Lambda. This is not recommended for real businesses, but here we chose fast shipping, the learning process, and code organisation over the Terraform learning curve and granular permissions. We should also bear in mind the other trade-offs, such as cold starts and memory usage, which are now coupled across endpoints.

Future-proofing note: the current {proxy+} API Gateway resource routes POST /files, POST /files/{id}/complete-upload, and GET /files/{id} through the same safe-upload-create-file Lambda process. This keeps the MVP simple, but couples route-level concurrency, cold starts, memory, timeout, logs, metrics, alarms, deploys, and IAM permissions. Do not split this yet without evidence; consider separate API Lambdas only if one route starts starving another, route-level debugging becomes painful, or an endpoint needs materially different concurrency, timeout, memory, or IAM boundaries.

DynamoDB

Testing

End-to-end coverage runs against the deployed dev stage as black-box scripts — no AWS credentials needed beyond the public API key.

| Script | What it proves |
| --- | --- |
| `tests/smoke.sh` | Happy path: POST /files → PUT bytes to presigned URL → complete-upload → GuardDuty scan completes → status `clean` → presigned download → bytes round-trip |
| `tests/security-smoke.sh` | Security path: EICAR signature uploaded → GuardDuty flags it → remediation Lambda deletes the object → GET /files/{id} stays 409 (never returns 200). EICAR pattern generated inline; no on-disk fixture required |

Run via make smoke and make security-smoke from repo root after exporting SAFE_UPLOAD_URL and SAFE_UPLOAD_KEY. See tests/README.md for env-var setup, tunable timeouts, and troubleshooting notes.

CI/CD

Two-tier defence-in-depth gate, with the same checks running locally before commit and in CI on pull requests.

The checks (`make ci-checks`)

ci-checks is a chain of five targets, ordered cheapest-first so failures surface fast.

1. `gofmt -l lambda/`

What: lists Go source files whose formatting doesn't match the canonical style. Empty output = pass; any file listed = fail. Why: Go has a single, opinionated formatter. Treating "is the code formatted?" as a check eliminates an entire category of code-review bikeshedding (tabs vs spaces, brace placement, import ordering) and makes diffs cleaner — every formatting commit is intentional. Cost: sub-second.

2. `go vet ./...`

What: static analysis catching suspicious constructs the compiler accepts but that are usually bugs: printf format-string mismatches, unreachable code, struct-tag typos, locks copied by value, etc. Why: it's a bug detector, not a linter. The Go toolchain takes these findings seriously enough that go test itself runs a high-confidence subset of vet checks and fails the package when they trip. Cheap, catches real bugs. Limits: doesn't replace a real linter (golangci-lint); we may layer those in later if needed.

3. `go test ./...`

What: runs the Lambda unit test suite across lambda/. Why: handler logic is tested with injected AWS client interfaces and hand-written mocks, so core behavior is checked without touching real AWS resources. Limits: unit tests do not prove API Gateway mapping, S3 presigned URL behavior, GuardDuty scan timing, or EventBridge delivery — those stay in smoke tests.
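The "injected AWS client interfaces and hand-written mocks" pattern might look roughly like this. Everything here is a hypothetical sketch rather than the repository's code: `ObjectDeleter`, `Remediate`, and `mockDeleter` are invented names, and the context.Context that a real AWS SDK call would take is left out to keep the example short.

```go
package main

import (
	"errors"
	"fmt"
)

// ObjectDeleter is the narrow slice of an S3 client the remediation logic
// needs. Production code would wrap the real client; tests inject a mock.
type ObjectDeleter interface {
	DeleteObject(bucket, key, versionID string) error
}

// Remediate deletes the object only when the scan result is THREATS_FOUND.
// It reports whether a delete was performed.
func Remediate(d ObjectDeleter, scanResult, bucket, key, versionID string) (bool, error) {
	if scanResult != "THREATS_FOUND" {
		return false, nil
	}
	if err := d.DeleteObject(bucket, key, versionID); err != nil {
		return false, fmt.Errorf("delete %s/%s: %w", bucket, key, err)
	}
	return true, nil
}

// mockDeleter is a hand-written mock: it records calls and returns a canned error.
type mockDeleter struct {
	calls []string
	err   error
}

func (m *mockDeleter) DeleteObject(bucket, key, versionID string) error {
	m.calls = append(m.calls, bucket+"/"+key+"@"+versionID)
	return m.err
}

// errInjected simulates an S3 failure in tests.
var errInjected = errors.New("injected failure")

func main() {
	mock := &mockDeleter{}
	deleted, err := Remediate(mock, "THREATS_FOUND", "example-bucket", "uploads/file-123", "v1")
	fmt.Println(deleted, err, mock.calls)
}
```

Keeping the interface this narrow is what makes the mock a few lines long; nothing in the test path needs AWS credentials or network access.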

4. `make build` → `lambda/Makefile build-all`

What: cross-compiles both Lambda binaries (GOOS=linux GOARCH=arm64) and zips them ready for update-function-code. Why: "compiles on my machine" ≠ "builds for the Lambda runtime." Same toolchain CI uses, same target architecture as production. Missing imports, broken module references, type errors all surface here. Side-effect: the zips it produces are exactly what deploy.yml later uploads — no surprise rebuilds at deploy time.

5. `gitleaks detect --source . --no-banner --redact`

What: scans the working tree and full git history for patterns matching known secret formats (AWS access keys, GitHub tokens, Stripe keys, generic high-entropy strings, etc.). Default ruleset has ~150 detectors. Why: once a secret is committed it's effectively published — git log retains it, public repos are mirrored and indexed within minutes. Local dev hygiene cannot be the only line of defence. Real catch: during setup gitleaks flagged lambda/output.json (committed in 5b0ff1f) containing two ASIA...-prefixed STS temporary tokens. They had a 15-minute expiry from March, so functionally dead — but gitleaks correctly flagged them. We deleted the file, gitignored it, and wrote a .gitleaks.toml allowlist for that path. History wasn't rewritten because the credentials were already expired. --redact: prints "REDACTED" in place of the matched value so logs don't leak the secret further.

What's deliberately outside ci-checks:

Smoke tests — run post-deploy because they require the deployed API URL and API key.
AWS drift checks — run with make verify-aws only when AWS read credentials are available.
Optional security audit — make security-audit includes govulncheck; useful before releases, but not in pre-commit until the local dependency is part of the standard setup.
Extra linters (golangci-lint, gosec) — useful later, but should be baselined before becoming required gates.

Where the checks run

Local pre-commit hook (scripts/pre-commit.sh, installed via make install-hooks) runs make ci-checks on every git commit. Bypassable with --no-verify; CI is the backstop.
Local pre-push hook (scripts/pre-push.sh, installed via make install-hooks) runs make coderabbit-severity-gate before git push, using CODERABBIT_BASE=origin/main, CODERABBIT_TYPE=committed, and CODERABBIT_BLOCK_SEVERITIES="critical major" by default. Bypass intentionally with SKIP_CODERABBIT_REVIEW=1 git push or git push --no-verify.
PR gate (.github/workflows/ci.yml) runs the same checks on every pull request to main. Required status check; merge blocked until green.
Deploy (.github/workflows/deploy.yml) re-runs ci-checks defensively, then deploys, then re-runs the smoke scripts.

SDLC support resources

PR template: .github/pull_request_template.md forces scope, validation, security/AWS impact, cost, rollback, and docs review before merge.
Dependabot: .github/dependabot.yml opens weekly Go module and GitHub Actions update PRs.
CodeRabbit CLI: make coderabbit-review is the primary human-readable AI-review loop. It runs local CodeRabbit review against CODERABBIT_BASE (default main) and CODERABBIT_TYPE (default all), then passes AGENTS.md, README.md, and .coderabbit.yaml as review context. Use make coderabbit-auth-login, make coderabbit-auth-status, make coderabbit-agent-auth-login, make coderabbit-agent-auth-status, make coderabbit-doctor, and make coderabbit-update for setup and diagnostics. This is outside ci-checks because it requires local auth, network access, and consumes per-developer CLI review quota.
CodeRabbit severity gate: make coderabbit-severity-gate uses structured coderabbit review --agent output and exits 1 only when a finding severity matches CODERABBIT_BLOCK_SEVERITIES (default critical major). Minor, trivial, and info findings are reported but do not block push. Structured mode uses the local authenticated CLI session from make coderabbit-agent-auth-login; CODERABBIT_API_KEY / CODE_RABBIT_API_KEY are optional local-only overrides and must never be committed.
CodeRabbit agent skill: .agents/skills/code-review/SKILL.md was installed with npx skills add coderabbitai/skills -a codex -s code-review -y; skills-lock.json pins the installed skill source/hash so agents can restore it with npx skills experimental_install.
CodeRabbit auth hygiene: run make coderabbit-auth-login in your own terminal. Do not paste CodeRabbit OAuth tokens or API keys into chat, PRs, logs, or committed files.
CodeRabbit PR app: .coderabbit.yaml keeps the hosted PR reviewer non-blocking, enables free-tier behavior, skips dependency bot PRs, and gives project-specific review guidance. Treat it as a PR-side supplement to the local CLI, not the primary gate.
Security audit: make security-audit runs gitleaks plus govulncheck when both tools are installed. It is not in pre-commit yet because govulncheck is a separate local dependency.
AWS drift check: make verify-aws runs read-only AWS CLI checks against the deployed MVP. It verifies intended live state, but is intentionally outside ci-checks because it requires AWS credentials and read permissions.

Auto-deploy on merge — OIDC

The deploy workflow fires on push to main (i.e. PR merge) and updates both Lambda zips. To call AWS it needs credentials. We don't store any.

The problem we're avoiding

The "obvious" path is to create an IAM user, generate access keys (AKIA... + secret), store them as GitHub repo secrets, and let the workflow export them as env vars. This works, and it's how a lot of AWS credential leaks have happened. Problems:

Long-lived: keys are valid until manually rotated. Leaks (compromised laptop, malicious workflow dep, ex-contributor) stay valid.
Out of band: GitHub stores them; AWS doesn't know where they live or who's using them.
Hard to scope and rotate: usually end up over-privileged because per-workflow IAM users are operationally painful.

What OIDC gives us instead

OpenID Connect is a thin auth layer on OAuth 2.0. The model: an issuer signs short-lived JWTs proving "this caller is X." A relying party verifies the signature against the issuer's public keys and trusts the claims.

In our case:

Issuer: GitHub's OIDC service at https://token.actions.githubusercontent.com. Mints a JWT for every workflow run, automatically.
Relying party: AWS IAM. We told it (by creating an Identity Provider resource) that this issuer is trusted; AWS locates the issuer's public signing keys through its /.well-known/openid-configuration discovery document.
Action: sts:AssumeRoleWithWebIdentity — exchanges a verified JWT for short-lived AWS credentials.

The runtime flow on every deploy

deploy.yml starts because someone pushed to main.
The runner asks GitHub's OIDC service for a token. GitHub mints a signed JWT containing claims like:
- iss: https://token.actions.githubusercontent.com
- aud: sts.amazonaws.com (the audience requested by aws-actions/configure-aws-credentials; permissions: id-token: write is what allows the job to request a token at all)
- sub: repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main
- repository, ref, actor, run_id, sha, etc.
aws-actions/configure-aws-credentials@v4 receives the token and calls sts:AssumeRoleWithWebIdentity against AWS, passing the JWT and the role ARN.

AWS STS verifies the JWT signature against GitHub's public keys, then evaluates the role's trust policy:

"Condition": {
  "StringEquals": {
    "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
    "token.actions.githubusercontent.com:sub": "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main"
  }
}

Only allow this assume call if the audience matches AND the subject is exactly that repo + branch.

STS issues short-lived credentials — typically valid for 1 hour. The action exports them as env vars (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN, all ***-redacted in logs). When the job ends, they expire on their own.
The rest of the workflow uses those credentials for aws lambda update-function-code etc.

Why the trust scope matters

The sub claim is the security boundary:

Forked-repo attack. Someone forks the repo and tries to assume the role from a malicious workflow. Their sub is repo:attacker/safe-upload-platform:... — doesn't match. Denied.
Wrong-branch attack. A run triggered from a feature branch in our own repo executes malicious code. Its sub is ...:ref:refs/heads/feature-x (or ...:pull_request for runs triggered by a pull_request event), which doesn't match main. Denied. (deploy.yml only fires on push to main anyway, so this is defence in depth.)
Different repo. Some unrelated repo of yours tries to assume the role. Different repository claim, different sub. Denied.
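The three cases above all reduce to an exact string comparison on the aud and sub claims. The sketch below imitates only that StringEquals step; it is not how STS is implemented and it skips JWT signature verification entirely. `Claims`, `trustConditions`, and `Allowed` are hypothetical names, and the condition keys are shortened by dropping the token.actions.githubusercontent.com: prefix.

```go
package main

import "fmt"

// Claims holds the subset of JWT claims the trust policy inspects.
type Claims map[string]string

// trustConditions mirrors the StringEquals block of the role's trust policy.
var trustConditions = map[string]string{
	"aud": "sts.amazonaws.com",
	"sub": "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main",
}

// Allowed reports whether every condition matches exactly.
// A missing claim counts as a mismatch.
func Allowed(c Claims) bool {
	for key, want := range trustConditions {
		if got, ok := c[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	cases := map[string]Claims{
		"main branch":    {"aud": "sts.amazonaws.com", "sub": "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/main"},
		"forked repo":    {"aud": "sts.amazonaws.com", "sub": "repo:attacker/safe-upload-platform:ref:refs/heads/main"},
		"feature branch": {"aud": "sts.amazonaws.com", "sub": "repo:wheresdiasd/safe-upload-platform:ref:refs/heads/feature-x"},
	}
	for name, claims := range cases {
		fmt.Printf("%-14s allowed=%v\n", name, Allowed(claims))
	}
}
```

Because the match is exact rather than a wildcard, there is no pattern for an attacker to satisfy with a cleverly named repository or branch.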

What the deploy role can actually do

Once assumed, the role's permission policy (infra/iam/github-actions-deploy-policy.json) is intentionally tiny:

{
  "Action": [
    "lambda:UpdateFunctionCode",
    "lambda:GetFunction",
    "lambda:GetFunctionConfiguration"
  ],
  "Resource": "arn:aws:lambda:<AWS_REGION>:<AWS_ACCOUNT_ID>:function:safe-upload-*"
}

Even with valid temporary credentials, the role cannot:

Touch IAM, API Gateway, DynamoDB, or S3.
Read CloudWatch logs.
Invoke Lambdas (so a compromised workflow can't burn the AWS budget by spamming invocations).
Create new functions; it can only update existing safe-upload-* ones.

Principle of least privilege made concrete.
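A simplified model of that policy shows why the denied cases fall out automatically: IAM denies anything that is not explicitly allowed. This sketch is not a real IAM evaluator. It handles only the trailing * in the resource pattern, and the region and account ID are made-up stand-ins for the <AWS_REGION> and <AWS_ACCOUNT_ID> placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

// allowedActions mirrors the Action list of the deploy policy.
var allowedActions = map[string]bool{
	"lambda:UpdateFunctionCode":       true,
	"lambda:GetFunction":              true,
	"lambda:GetFunctionConfiguration": true,
}

// resourcePattern uses a hypothetical region and account ID for illustration.
const resourcePattern = "arn:aws:lambda:eu-west-1:123456789012:function:safe-upload-*"

// Permits reports whether the action on the resource is explicitly allowed.
// Anything not matched here is denied, as with IAM's implicit deny.
func Permits(action, resource string) bool {
	if !allowedActions[action] {
		return false
	}
	prefix := strings.TrimSuffix(resourcePattern, "*")
	return strings.HasPrefix(resource, prefix)
}

func main() {
	fn := "arn:aws:lambda:eu-west-1:123456789012:function:safe-upload-create-file"
	fmt.Println(Permits("lambda:UpdateFunctionCode", fn))
	fmt.Println(Permits("lambda:InvokeFunction", fn))
	fmt.Println(Permits("lambda:CreateFunction", fn))
}
```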

Where everything lives

AWS side: the OIDC Identity Provider resource + the safe-upload-github-deploy role. Created once via console; templates in infra/iam/github-actions-deploy-trust-policy.json and github-actions-deploy-policy.json.
GitHub side: one repo variable (AWS_DEPLOY_ROLE_ARN) telling the workflow which role to assume. No long-lived secret.
Walkthrough: infra/iam/github-actions-oidc-setup.md documents the entire one-time setup.
