diff --git a/backend/Dockerfile b/backend/Dockerfile index 228601cc..b41106a6 100644 --- a/backend/Dockerfile +++ b/backend/Dockerfile @@ -56,6 +56,8 @@ ENV PYTHONUNBUFFERED=1 \ PYTHONPATH=/app \ PATH="/app/.venv/bin:$PATH" \ UV_SYSTEM_PYTHON=1 \ + UV_CACHE_DIR=/tmp/.cache/uv \ + HOME=/tmp \ ENV=${ENV} WORKDIR /app diff --git a/backend/app/scripts/README.md b/backend/app/scripts/README.md index 5a2bfffd..4e5b3229 100644 --- a/backend/app/scripts/README.md +++ b/backend/app/scripts/README.md @@ -23,7 +23,7 @@ scripts/ ### Domains -- **db/** — Database seeding and rollback utilities +- **db/** — Database migration, seeding, and rollback utilities - **users/** — Administrative and service account management --- diff --git a/backend/entrypoint.sh b/backend/entrypoint.sh index 327b336f..1ded5bc7 100755 --- a/backend/entrypoint.sh +++ b/backend/entrypoint.sh @@ -30,16 +30,6 @@ until pg_isready -d "${PG_URL}" -q; do done echo "✅ Database is ready" -# ----------------------------------------------------------- -# Run Alembic migrations -# ----------------------------------------------------------- -echo "🔄 Running Alembic migrations..." -if ! uv run alembic upgrade head; then - echo "❌ Alembic migrations failed" - exit 1 -fi -echo "✅ Alembic migrations complete" - # ----------------------------------------------------------- # Start application # ----------------------------------------------------------- @@ -47,7 +37,7 @@ if [ "$ENV" = "production" ]; then echo "🚀 Starting SimBoard backend (production mode)..." # In production, HTTPS is expected to be handled by a reverse proxy (e.g., Traefik). # Uvicorn is started without SSL options here; do not enable HTTPS at the app layer in production. - exec uv run uvicorn app.main:app --host 0.0.0.0 --port 8000 + exec uvicorn app.main:app --host 0.0.0.0 --port 8000 else echo "⚙️ Starting SimBoard backend (development mode with HTTPS + autoreload)..." 
@@ -58,7 +48,7 @@ else exit 1 fi - exec uv run uvicorn app.main:app \ + exec uvicorn app.main:app \ --host 0.0.0.0 \ --port 8000 \ --ssl-keyfile "${SSL_KEYFILE}" \ diff --git a/docs/README.md b/docs/README.md index 397d7587..6a624654 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,10 +8,12 @@ Documentation for the SimBoard project. ```bash docs/ -├── README.md # This file -└── cicd/ # CI/CD and deployment - ├── README.md # Quick start and overview - └── DEPLOYMENT.md # Complete reference guide +├── README.md # This file +├── cicd/ # CI/CD and deployment +│ ├── README.md # Quick start and overview +│ └── DEPLOYMENT.md # Complete reference guide +└── deploy/ # Environment-specific deployment runbooks + └── spin.md # Spin backend migration rollout + frontend/db/ingress config ``` --- @@ -25,6 +27,7 @@ docs/ **Need deployment details?** - [cicd/DEPLOYMENT.md](cicd/DEPLOYMENT.md) - Complete reference +- [deploy/spin.md](deploy/spin.md) - Spin backend/frontend/db/ingress workload runbook --- @@ -34,6 +37,7 @@ All CI/CD and deployment documentation is in the [`cicd/`](cicd/) directory: - **[cicd/README.md](cicd/README.md)** - Quick start, overview, and common operations - **[cicd/DEPLOYMENT.md](cicd/DEPLOYMENT.md)** - Complete deployment guide with workflows, Kubernetes examples, and troubleshooting +- **[deploy/spin.md](deploy/spin.md)** - Spin-specific backend migration-first plus frontend/db/ingress runbook --- diff --git a/docs/cicd/DEPLOYMENT.md b/docs/cicd/DEPLOYMENT.md index 1f1a506c..6e4e7236 100644 --- a/docs/cicd/DEPLOYMENT.md +++ b/docs/cicd/DEPLOYMENT.md @@ -11,6 +11,7 @@ Complete reference for CI/CD pipelines and NERSC Spin deployments. 
- [Image Tagging Strategy](#image-tagging-strategy) - [Development Deployment](#development-deployment) - [Production Release Process](#production-release-process) +- [Database Migrations](#database-migrations) - [Rollback Procedure](#rollback-procedure) - [Manual Builds](#manual-builds) - [Troubleshooting](#troubleshooting) @@ -244,6 +245,8 @@ Update the image tags in the [Rancher UI](https://rancher2.spin.nersc.gov/dashbo 4. Set **Pull Policy** to `IfNotPresent` 5. Click **Save** — Rancher will roll out the new version +For backend releases, migrations run automatically in a backend initContainer during rollout. See [Database Migrations](#database-migrations). + ### Step 5: Verify Production 1. In Rancher, check that pods are **Running** under **Workloads → Pods** in the prod namespace @@ -254,19 +257,63 @@ Update the image tags in the [Rancher UI](https://rancher2.spin.nersc.gov/dashbo ## Database Migrations -Alembic database migrations run **automatically** when the backend container starts. No manual migration step is required during deployment. +Database migrations are executed by a backend Deployment initContainer during rollout, not on backend app startup. + +### Runtime Behavior + +- Backend container starts the API directly and does not run migrations at startup. +- InitContainer runs before backend container start and executes: + - `test -n "$DATABASE_URL" || { echo "DATABASE_URL is required"; exit 1; }; alembic upgrade head` + +### Spin Workloads -### Startup Sequence +Reference runbook: -1. **Database readiness check** — the container waits (up to 30 seconds) for the PostgreSQL server to accept connections using `pg_isready`. -2. **`alembic upgrade head`** — applies any pending migrations. If the database is already up to date, this is a no-op. -3. **Application start** — Uvicorn launches only after migrations succeed. 
+- [`docs/deploy/spin.md`](../deploy/spin.md) -If either the database readiness check or migration step fails, the container exits immediately and does **not** start the application. +- Backend service/deployment baseline is defined for in-cluster API routing (`backend` on `8000`). +- Backend Deployment uses the image entrypoint directly (no app args required). +- Backend Deployment includes initContainer `migrate` using the same backend image tag to run Alembic before app start. +- Frontend service/deployment baseline is defined for UI routing (`frontend` on `80`). +- Frontend Deployment uses the frontend image default CMD (no explicit args). +- DB service/deployment baseline is defined for in-cluster Postgres (`db`). +- Ingress baseline (`lb`) terminates TLS via `simboard-tls-cert` and routes frontend/backend hosts. +- Backend and migration initContainer env values are sourced via `envFrom` from secret `simboard-backend-env`. +- DB container env values are sourced via `envFrom` from secret `simboard-db`. + +### Deployment Order (Required) + +1. Roll out backend deployment with the target image tag. +2. Wait for initContainer migration step to succeed. +3. Confirm backend pods become `Running` and `Ready`. + +If initContainer migration fails, backend pods will not become ready and rollout should be treated as failed. ### Concurrency Note -The current deployment assumes a **single backend replica**. If horizontal scaling is introduced, migration execution should be separated into a one-time init container or deployment job to avoid race conditions. +InitContainers run per pod. If more than one backend pod is created simultaneously, migrations may execute concurrently. 
+
+Use an explicit rollout strategy that guarantees only one new pod (and therefore one migration initContainer) is created at a time:
+
+```yaml
+spec:
+  replicas: 1
+  strategy:
+    type: RollingUpdate
+    rollingUpdate:
+      maxSurge: 0
+      maxUnavailable: 1
+```
+
+Why this is required: with default `RollingUpdate` settings, Kubernetes may create a surge pod during updates, which can run a second migration initContainer even when the steady-state replica count is `1`.
+
+If you need `replicas > 1`, use a DB-level migration lock so only one initContainer can run Alembic at a time. For PostgreSQL, wrap migration execution in a session-level advisory lock held on a single open connection (for example, `SELECT pg_advisory_lock(<lock-key>); ... alembic upgrade head ...; SELECT pg_advisory_unlock(<lock-key>);`, where `<lock-key>` is an application-chosen bigint and the locking session stays open while Alembic runs).
+
+Production-safe recommendation: apply both controls (serialized rollout strategy plus DB-level lock) for defense in depth.
+
+### Rollback Caveat
+
+Rolling back the backend container image does not roll back the database schema automatically. Use backward-compatible migrations (expand/contract pattern), and write a separate, explicit rollback migration only when needed.
 
 ## Rollback Procedure
 
@@ -296,13 +343,17 @@ Alternatively, use the built-in Rancher rollback:
 
 ## Manual Builds
 
-For testing or emergency builds:
+For testing or emergency builds, you can manually build and push images with Docker Buildx. This is not recommended for regular use, as it bypasses CI checks and versioning conventions.
+
+First, log in to the NERSC registry:
 
 ```bash
-# Login
 docker login registry.nersc.gov
+```
+
+### Backend
 
-# Backend
+```bash
 cd backend
 docker buildx build \
   --platform=linux/amd64,linux/arm64 \
@@ -311,7 +362,23 @@ docker buildx build \
   --push \
   .
-# Frontend +``` + +### Frontend (with API URL override) + +```bash +# Development +cd frontend +docker buildx build \ + --platform=linux/amd64,linux/arm64 \ + --build-arg VITE_API_BASE_URL=https://simboard-dev-api.e3sm.org \ + -t registry.nersc.gov/e3sm/simboard/frontend:manual \ + --push \ + . +``` + +```bash +# Production cd frontend docker buildx build \ --platform=linux/amd64,linux/arm64 \ diff --git a/docs/deploy/spin.md b/docs/deploy/spin.md new file mode 100644 index 00000000..661750f6 --- /dev/null +++ b/docs/deploy/spin.md @@ -0,0 +1,198 @@ +# NERSC Spin Workloads (Backend InitContainer Migrations) + +This runbook defines the NERSC Spin workload baseline and backend rollout flow using an initContainer for automatic Alembic migrations. +This runbook uses the Rancher UI as the primary deployment workflow. + +## Rancher UI Configs + +This document is the source of truth for Spin workload settings managed in Rancher UI. +No workload manifests are versioned under `deploy/spin/`. + +### Backend Deployment (`backend`) + +| Rancher field | Value | +| -------------------------- | ----------------------------------------------------------------------------------------------------- | +| Workload type | `Deployment` | +| Name | `backend` | +| Labels | `app=simboard-backend` | +| Replicas | `1` | +| Image pull secret | `registry-nersc` | +| Init container name | `migrate` | +| Init container image | `registry.nersc.gov/e3sm/simboard/backend:` | +| Init container command | `sh` | +| Init container args | `-c` | +| Init container script | `test -n "$DATABASE_URL" \|\| { echo "DATABASE_URL is required"; exit 1; }; alembic upgrade head` | +| Init envFrom secret | `simboard-backend-env` | +| App container name | `backend` | +| App container image | `registry.nersc.gov/e3sm/simboard/backend:` | +| App pull policy | `Always` | +| App command | leave empty (use image entrypoint) | +| App arguments | leave empty | +| Port | `8000/TCP` | +| App envFrom secret | 
`simboard-backend-env` | +| Container security context | `allowPrivilegeEscalation=false`, `privileged=false`, capabilities add `NET_BIND_SERVICE`, drop `ALL` | + +Canonical init container command/args to copy into Rancher: + +```sh +Command: sh +Args[0]: -c +Args[1]: test -n "$DATABASE_URL" || { echo "DATABASE_URL is required"; exit 1; }; alembic upgrade head +``` + +### Backend Service (`backend`) + +| Rancher field | Value | +| ---------------------- | -------------------------- | +| Service type | `ClusterIP` | +| Service name | `backend` | +| Service selector label | `app=simboard-backend` | +| Service port | `8000/TCP` (target `8000`) | + +### Mounting NERSC E3SM Performance Archive + +To mount the E3SM performance archive into backend pods, configure a bind mount in Rancher: + +| Rancher field | Value | +| ---------------------------- | -------------------------------------------- | +| Scope | Backend Deployment (`backend`) | +| Section | `Pod` -> `Storage` | +| Volume type | `Bind-Mount` | +| Volume name | `performance-archive` | +| Path on node | `/global/cfs/cdirs/e3sm/performance_archive` | +| The Path on the Node must be | `An existing directory` | + +Then mount that volume into the backend container (and only other containers that need it): + +| Rancher field | Value | +| ------------------------ | -------------------------------------------- | +| Scope | Backend container (`backend`) | +| Section | `Storage` | +| Volume | `performance-archive` | +| Mount path (recommended) | `/global/cfs/cdirs/e3sm/performance_archive` | +| Read only | `true` (recommended) | + +Security context requirements for NERSC global file system (NGF/CFS) mounts: + +- Set numeric `runAsUser` at pod/container level. +- If `runAsGroup` is set, also set `runAsUser`. +- Set `runAsGroup` and `fsGroup` to the appropriate numeric group ID. +- Keep Linux capabilities minimal (`drop: ALL`; only add what is required). 
+ +Source: [NERSC Spin Storage - NERSC Global File Systems](https://docs.nersc.gov/services/spin/storage/#nersc-global-file-systems). + +### Frontend Deployment (`frontend`) + +| Rancher field | Value | +| -------------------------- | ------------------------------------------------------------------------------------------------------------------------- | +| Workload type | `Deployment` | +| Name | `frontend` | +| Labels | `app=simboard-frontend` | +| Replicas | `1` | +| Container name | `frontend` | +| Image | `registry.nersc.gov/e3sm/simboard/frontend:` | +| Pull policy | `Always` for `:dev`; `IfNotPresent` for versioned tags | +| Command | leave empty (use image CMD) | +| Arguments | leave empty | +| Port | `80/TCP` | +| Image pull secret | `registry-nersc` | +| Container security context | `allowPrivilegeEscalation=false`, `privileged=false`, capabilities add `CHOWN,SETGID,SETUID,NET_BIND_SERVICE`, drop `ALL` | + +### Frontend Service (`frontend`) + +| Rancher field | Value | +| ---------------------- | ----------------------- | +| Service type | `ClusterIP` | +| Service name | `frontend` | +| Service selector label | `app=simboard-frontend` | +| Service port | `80/TCP` (target `80`) | + +### DB Service (`db`) + +| Rancher field | Value | +| ---------------------- | -------------------------- | +| Service type | `ClusterIP` | +| Service name | `db` | +| Service selector label | `app=simboard-db` | +| Service port | `5432/TCP` (target `5432`) | + +### DB Deployment (`db`) + +| Rancher field | Value | +| -------------------------- | ---------------------------------------------------------------------------------------------------------------------------- | +| Workload type | `Deployment` | +| Name | `db` | +| Labels | `app=simboard-db` | +| Replicas | `1` | +| Container name | `db` | +| Image | `postgres:17` | +| Pull policy | `Always` | +| Port | `5432/TCP` | +| EnvFrom secret | `simboard-db` (includes all required DB runtime vars) | +| Container security 
context | `allowPrivilegeEscalation=false`, `privileged=false`, capabilities add `CHOWN,DAC_OVERRIDE,FOWNER,SETGID,SETUID`, drop `ALL` | + +### TLS Secret (`simboard-tls-cert`) + +| Rancher field | Value | +| ------------- | --------------------------- | +| Resource type | `Secret` | +| Name | `simboard-tls-cert` | +| Secret type | `kubernetes.io/tls` | +| Data key | `tls.crt` (certificate PEM) | +| Data key | `tls.key` (private key PEM) | + +### Ingress (`lb`) + +| Rancher field | Value | +| ------------------- | -------------------------------------------------------------------------------------------------- | +| Resource type | `Ingress` | +| Name | `lb` | +| Ingress class | `nginx` | +| TLS secret | `simboard-tls-cert` | +| TLS hosts | `simboard-dev.e3sm.org`, `simboard-dev-api.e3sm.org`, `lb.simboard.development.svc.spin.nersc.org` | +| Rule | Host `simboard-dev.e3sm.org`, path `/`, service `frontend:80` | +| Rule | Host `simboard-dev-api.e3sm.org`, path `/`, service `backend:8000` | +| Optional host alias | `lb.simboard.development.svc.spin.nersc.org` | + +## Required Secrets + +Create a backend env secret (example: `simboard-backend-env`) with all backend runtime vars +consumed by both app and migration init container, including: + +- `ENV`, `ENVIRONMENT`, `PORT` +- `FRONTEND_ORIGIN`, `FRONTEND_AUTH_REDIRECT_URL`, `FRONTEND_ORIGINS` +- `DATABASE_URL`, `TEST_DATABASE_URL` +- `GITHUB_CLIENT_ID`, `GITHUB_CLIENT_SECRET`, `GITHUB_REDIRECT_URL`, `GITHUB_STATE_SECRET_KEY` +- `COOKIE_NAME`, `COOKIE_SECURE`, `COOKIE_HTTPONLY`, `COOKIE_SAMESITE`, `COOKIE_MAX_AGE` + +Create a DB env secret (example: `simboard-db`) with DB container runtime vars, including: + +- `POSTGRES_USER`, `POSTGRES_PASSWORD` +- `POSTGRES_DB`, `POSTGRES_PORT`, `POSTGRES_SERVER` +- `PGDATA`, `PGTZ` + +Create a TLS secret (example: `simboard-tls-cert`) with: + +- `tls.crt`: TLS certificate in PEM format +- `tls.key`: TLS private key in PEM format + +## Deploy Order + +1. 
Open the [Rancher UI](https://rancher2.spin.nersc.gov/dashboard/home) and select the target namespace.
+2. Ensure the DB service and deployment (`db`) are healthy in **Service Discovery → Services** and **Workloads → Deployments**.
+3. Update/redeploy the backend deployment with the target backend image tag.
+4. Watch the backend pod's init container logs (`migrate`) in Rancher to confirm the migration succeeded.
+5. Verify backend deployment health and pod status under **Workloads → Pods**.
+6. Verify ingress routing under **Service Discovery → Ingresses** for `lb` and confirm both the frontend and backend hosts resolve via HTTPS.
+
+The frontend deploys independently of the backend migration initContainer. For frontend releases, update/redeploy the `frontend` deployment in **Workloads → Deployments** with the target frontend image tag.
+
+## Failure Handling
+
+- If the backend init container `migrate` fails, the backend pod will not become Ready.
+- Fix the database connectivity or migration issue, then redeploy the backend.
+- Backend image rollback does not revert the schema automatically; handle schema rollback explicitly via Alembic when required.
+
+## Concurrency Note
+
+Migrations run once per new backend pod via the initContainer. During a rollout, more than one backend pod can exist at the same time (for example, with multiple replicas, or with a RollingUpdate strategy and `maxSurge > 0`), so multiple pods can attempt migrations concurrently. If your migration safety model depends on a single migrator, configure the backend deployment to use either a **Recreate** rollout strategy or a **RollingUpdate** strategy with `maxSurge=0` (and typically `maxUnavailable=1`), or ensure your migration tooling enforces a DB-level migration lock.
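The DB-level migration lock that this runbook recommends can be sketched in Python. Note that PostgreSQL's `pg_advisory_lock()` takes a `bigint` key and is session-scoped, so the locking connection must stay open for the entire migration. This is a minimal illustration under stated assumptions, not project code: the function names, the lock name, and the use of SQLAlchemy are all hypothetical, and the Alembic invocation is elided.

```python
import zlib


def advisory_lock_key(name: str) -> int:
    """Derive a stable integer key from an application-chosen name so that
    every pod passes the same argument to pg_advisory_lock().

    zlib.crc32 returns an unsigned 32-bit value, which fits comfortably in
    the bigint key pg_advisory_lock() expects.
    """
    return zlib.crc32(name.encode("utf-8"))


def run_migrations_locked(database_url: str, lock_name: str = "simboard-migrations") -> None:
    """Hold a session-level advisory lock on one open connection while the
    migration runs, so concurrent initContainers queue instead of racing.

    Hypothetical sketch: assumes SQLAlchemy is available in the backend
    image (it typically is alongside Alembic) and that Alembic is invoked
    programmatically from the same process.
    """
    from sqlalchemy import create_engine, text  # deferred: only needed at migration time

    key = advisory_lock_key(lock_name)
    engine = create_engine(database_url)
    with engine.connect() as conn:
        # Blocks until no other session holds the lock; the lock is tied to
        # this connection's session and survives across statements.
        conn.execute(text("SELECT pg_advisory_lock(:k)"), {"k": key})
        try:
            # ... run Alembic here while the locking session stays open,
            # e.g. alembic.command.upgrade(config, "head") ...
            ...
        finally:
            conn.execute(text("SELECT pg_advisory_unlock(:k)"), {"k": key})
```

With this in place, `replicas > 1` or a surge pod during rollout is safe: the second initContainer simply waits on `pg_advisory_lock()` until the first finishes, then finds the schema already at `head` and exits as a no-op.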