[DevOps]: Production-Safe Alembic Migration Strategy (NERSC Spin) #117

@tomvothecoder

Context

The backend currently runs alembic upgrade head automatically on container startup.

This works for single-replica, recreate-style deployments but is unsafe under:

  • Multiple replicas
  • Rolling deployments
  • Horizontal scaling

Deployment occurs on NERSC Spin (Kubernetes).

The database is behind a firewall and requires explicit service/network configuration to allow access from Spin workloads.

We must define a production-safe migration strategy that works within these network constraints.


Problem

Startup-based migrations introduce:

  • Race conditions when multiple replicas start simultaneously
  • Tight coupling between application boot and schema changes
  • Crash loops if migration fails
  • Implicit constraint of replicas=1
  • Unsafe behavior during rolling updates

Additional constraints:

  • Database access requires proper firewall/service configuration.
  • Migration logic must run from within an allowed network boundary (e.g., Spin namespace).

Required Outcome

Define and implement a production-safe migration strategy for NERSC Spin that:

  • Prevents concurrent schema migrations
  • Decouples schema changes from application startup
  • Supports multi-replica deployments
  • Works within NERSC firewall/network constraints
  • Clearly documents DB access requirements

Migration Strategies to Evaluate

Copilot should evaluate and propose one of the following:

  1. Dedicated Kubernetes Job in Spin namespace to run alembic upgrade head
  2. CI/CD migration step executed from within NERSC network boundary
  3. Explicit manual migration step from a controlled NERSC host
  4. Guarded startup migration using advisory DB locks (only if justified)
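
For reference, option 4 could look roughly like the following. This is a minimal sketch, not a recommendation: it assumes the database is PostgreSQL (the issue does not say) and a DB-API driver with `%s` placeholders such as psycopg2. `MIGRATION_LOCK_KEY`, `advisory_lock`, and `migrate_with_lock` are hypothetical names. `pg_advisory_lock` blocks until the lock is free, so replicas that start together would run the migration one at a time, and later ones would find the schema already at head.

```python
from contextlib import contextmanager

# Hypothetical application-wide key; any integer agreed on by all replicas works.
MIGRATION_LOCK_KEY = 117001


@contextmanager
def advisory_lock(conn, key=MIGRATION_LOCK_KEY):
    """Hold a session-level PostgreSQL advisory lock for the duration of the block.

    `conn` is assumed to be an open DB-API connection to PostgreSQL.
    """
    cur = conn.cursor()
    # Blocks until no other session holds the same key.
    cur.execute("SELECT pg_advisory_lock(%s)", (key,))
    try:
        yield
    finally:
        # Release explicitly; the lock would also be dropped when the session ends.
        cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
        cur.close()


def migrate_with_lock(conn, run_migration):
    """Run `run_migration()` while holding the advisory lock."""
    with advisory_lock(conn):
        run_migration()
```

In real use, `run_migration` would likely wrap Alembic's Python API, for example `alembic.command.upgrade(Config("alembic.ini"), "head")`. The sketch keeps that as a callable so the locking logic stays independent of how Alembic is invoked.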

The recommendation must:

  • Address firewall/service access requirements
  • Specify where migrations execute (Spin pod, login node, CI runner, etc.)
  • Include operational tradeoffs
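
To make option 1 concrete, the sketch below builds a Kubernetes Job manifest that runs alembic upgrade head once, inside the Spin namespace. It is an illustration under stated assumptions: the image, namespace, Job name, and the `db-credentials` Secret are all hypothetical placeholders, and the real manifest would need whatever Spin requires for security context and resource limits.

```python
import json


def migration_job_manifest(image, namespace, name="alembic-migrate",
                           db_secret="db-credentials"):
    """Return a batch/v1 Job manifest (as a dict) that runs the migration once.

    All argument values are placeholders chosen for illustration.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Do not retry automatically; a failed migration should be inspected.
            "backoffLimit": 0,
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": "migrate",
                        # Same image as the backend, so Alembic and the
                        # migration scripts match the code being deployed.
                        "image": image,
                        "command": ["alembic", "upgrade", "head"],
                        # Database settings come from a Secret (hypothetical name).
                        "envFrom": [{"secretRef": {"name": db_secret}}],
                    }],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = migration_job_manifest("registry.example/backend:TAG", "my-namespace")
    # kubectl accepts JSON as well as YAML, e.g. piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

A Job's pod template is immutable once created, so each release would need either a unique Job name or deletion of the previous Job before applying a new one.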

Network / Firewall Requirements

Document:

  • How Spin pods reach the database (service name, host, port)
  • Required firewall rules or network policies
  • Whether a dedicated migration Job requires a separate service account or network policy
  • Any changes required to expose or allow DB connectivity
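
When documenting the items above, a small reachability check run from a pod in the Spin namespace can confirm that firewall rules and network policies actually allow the connection. The sketch below is one way to do that; `can_reach` is a hypothetical helper, and the host and port would be whatever the database service exposes. It only tests that a TCP connection opens, not authentication or TLS.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Running the same check from the migration Job's pod and from an application pod would show whether the Job needs its own network policy.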

Acceptance Criteria

  • Selected migration strategy documented

  • Deployment flow clearly defined:

    build → migrate → deploy

  • Explicit scaling constraints documented (if any)

  • Firewall / service configuration documented

  • Kubernetes manifests updated if required

  • Startup-time migration removed or gated appropriately

  • Rollback strategy documented
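
If startup-time migration is gated rather than removed, the gate could be as simple as an opt-in environment variable that defaults to off. A minimal sketch, assuming a hypothetical variable name `RUN_MIGRATIONS_ON_STARTUP` and a hypothetical helper `should_migrate_on_startup`:

```python
import os


def should_migrate_on_startup(environ=None):
    """Return True only when startup migration is explicitly enabled.

    Reads the hypothetical RUN_MIGRATIONS_ON_STARTUP variable; unset means off.
    """
    env = os.environ if environ is None else environ
    value = env.get("RUN_MIGRATIONS_ON_STARTUP", "false")
    return value.strip().lower() in {"1", "true", "yes"}
```

With the default off, multi-replica production deployments would skip the migration at boot and rely on the separate migrate step, while a local single-container setup could still opt in.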


Deliverables

  • Code changes (if required)
  • Kubernetes manifest updates (Deployment / Job / NetworkPolicy)
  • README / ops documentation update
  • Clear summary of chosen strategy and rationale

Metadata

Labels

type: devops (DevOps task, e.g., CI/CD, Docker)
