
fix: align Databricks Bundle deployment config, resources, and documentation #188

@larsgeorge-db

Description


Problem Statement

Deploying Ontos as a Databricks App via databricks bundle deploy does not work reliably. The bundle configuration (src/databricks.yaml), runtime configuration (src/app.yaml), and documentation (README, CONFIGURING.md, pyproject.toml) are inconsistent with each other in multiple ways:

  1. Missing resources: The bundle only provisions sql-warehouse, but the app requires database (Lakebase), volume, and optionally serving-endpoint (LLM). Deployment succeeds but the app crashes at startup because valueFrom references in app.yaml resolve to nothing.

  2. Entry point mismatch: databricks.yaml uses uvicorn src.app:app (which fails without --app-dir backend), while app.yaml uses python backend/src/app.py. The bundle's app_config variable overrides app.yaml, creating a conflict.

  3. Config duality: The bundle defines a full app_config variable (command + env vars) that overrides app.yaml at deploy time, but it's incomplete (missing PYTHONPATH, PGSCHEMA, APP_ADMIN_DEFAULT_GROUPS, etc.). Neither file is authoritative.

  4. Wrong variable names in docs/scripts: README says --var="catalog=app_data" but the actual variable is catalog_name. Same issue in pyproject.toml deploy script. The README also shows a separate databricks apps deploy command that's redundant with bundle deploy.

  5. Dead configuration: FRONTEND_STATIC_DIR is set in app.yaml but never read by the backend (which hardcodes the static path).

  6. Local dev friction: README tells users to manually mkdir backend/static with no explanation of why.

  7. Lakebase chicken-and-egg for DAB deploys: The DAB database resource requires an opaque auto-generated ID path (e.g., projects/.../databases/db-8uv1-...), not a human-readable instance name. This ID is only available after the Lakebase instance is created, creating a chicken-and-egg problem for DAB deploys. Marketplace installs don't have this issue (users select the instance via UI). The backend has a LAKEBASE_INSTANCE_NAME config field but get_lakebase_instance_name() in database.py does not use it as a fallback when the app resource lookup returns None.

Solution

Establish a clear config responsibility split and fix all inconsistencies:

  • databricks.yaml = infrastructure authority (resources, targets, permissions, variable definitions)
  • app.yaml = runtime authority (command, env vars, resource references via valueFrom)
  • manifest.yaml = resource contract with platform (already correct, no changes)

Remove the bundle's app_config override entirely. Add missing resources to the bundle (volume, serving-endpoint). For the database resource, use the manifest.yaml spec for Marketplace installs but support a LAKEBASE_INSTANCE_NAME env var fallback for DAB deploys where the opaque ID isn't available. Use per-target config overrides in databricks.yaml only for environment-varying env vars. Fix all documentation to match.

Add a config consistency validation script to catch drift between these files.

Database Strategy: Two Deployment Paths

| Deployment Mode | Database Config | How It Works |
| --- | --- | --- |
| Marketplace install | User selects Lakebase instance in UI | Platform creates database resource with opaque ID → get_lakebase_instance_name() reads it via ws_client.apps.get() → works automatically |
| DAB deploy | Cannot declare database resource (opaque ID unknown) | Set LAKEBASE_INSTANCE_NAME env var in app.yaml/bundle targets → get_lakebase_instance_name() falls back to this → SDK resolves host + credentials |

Backend change needed: get_lakebase_instance_name() must check settings.LAKEBASE_INSTANCE_NAME as a fallback when the app resource lookup returns None.
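The fallback order described above can be sketched as follows. This is a minimal sketch, not the actual implementation in database.py; the parameter names and the shape of the SDK's app resource objects are assumptions:

```python
from typing import Optional

def get_lakebase_instance_name(ws_client, app_name: str, settings) -> Optional[str]:
    """Resolve the Lakebase instance name from the app's database resource
    (Marketplace path), falling back to LAKEBASE_INSTANCE_NAME (DAB path)."""
    # Marketplace path: the platform injects a database resource on the app.
    app = ws_client.apps.get(app_name)
    for resource in (app.resources or []):
        database = getattr(resource, "database", None)
        if database is not None and database.instance_name:
            return database.instance_name
    # DAB path: no database resource attached; use the env-var fallback.
    return settings.LAKEBASE_INSTANCE_NAME or None
```

Both deployment paths funnel through the same function: the Marketplace path short-circuits in the loop, and the DAB path falls through to the setting.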

User Stories

  1. As a developer, I want databricks bundle deploy -t dev to provision all required resources (SQL Warehouse, Volume) so that the app starts without missing resource errors.
  2. As a developer, I want databricks bundle deploy -t prod to provision all resources including the LLM serving endpoint so that AI features work in production.
  3. As a developer, I want to deploy without an LLM endpoint by setting LLM_ENABLED=False so that I can run the app in environments without a serving endpoint.
  4. As a developer, I want the README deployment instructions to use correct variable names (catalog_name, schema_name) so that copy-pasting commands actually works.
  5. As a developer, I want a single deploy command (databricks bundle deploy -t <target>) without needing a separate databricks apps deploy so that the deployment process is simple and predictable.
  6. As a developer, I want the bundle to automatically build the frontend during deployment (via npm run build auto-detection) so that I don't need manual build steps.
  7. As a developer, I want app.yaml to be the single source of truth for runtime configuration so that I only need to look in one place for command and env vars.
  8. As a developer, I want environment-specific values (DATABRICKS_CATALOG, PGSCHEMA) to be set per-target in the bundle so that dev and prod deployments use different catalogs/schemas automatically.
  9. As a developer, I want to run hatch -e dev run deploy-and-run with correct variable names so that the convenience script actually works.
  10. As a developer, I want the local development Quick Start in the README to explain the static directory requirement and ideally auto-create it so that onboarding is smooth.
  11. As a developer, I want a validation script that checks config consistency between databricks.yaml, app.yaml, and manifest.yaml so that drift is caught before deployment.
  12. As an operator, I want the CONFIGURING.md Lakebase deploy section to reference databricks bundle deploy (not databricks apps deploy) so that documentation matches the actual workflow.
  13. As a developer, I want dead configuration (FRONTEND_STATIC_DIR) removed from app.yaml so that the config is clean and doesn't mislead.
  14. As a developer deploying via DAB, I want to set LAKEBASE_INSTANCE_NAME as an env var so that I don't need the opaque database ID that's only available after Lakebase instance creation.
  15. As a Marketplace user installing Ontos, I want to select my existing Lakebase instance from the UI and have the app connect to it automatically without any manual configuration.
  16. As a developer, I want get_lakebase_instance_name() to fall back to the LAKEBASE_INSTANCE_NAME env var when no database resource is attached to the app, so that both DAB and Marketplace deployment paths work.

Implementation Decisions

Config Architecture

  • Remove the app_config variable and config: ${var.app_config} from databricks.yaml. The bundle should not define command or env vars — that is app.yaml's job.
  • Keep app.yaml entry point as python backend/src/app.py (uses the if __name__ block which calls uvicorn.run() internally). This is the current working pattern.
  • Add per-target config.env overrides in databricks.yaml targets for environment-varying values only: DATABRICKS_CATALOG, DATABRICKS_SCHEMA, PGSCHEMA.
  • Verify DAB merge behavior: does target-level config.env merge with app.yaml env by name, or replace entirely? This determines whether overrides work or need the full env list.
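A per-target override might look like the fragment below. This is a sketch only: the app key (ontos) and catalog/schema values are illustrative, and it assumes config.env merges by name, which is exactly the open question in the last bullet:

```yaml
# databricks.yaml (sketch) -- per-target env overrides only.
targets:
  dev:
    resources:
      apps:
        ontos:
          config:
            env:
              - name: DATABRICKS_CATALOG
                value: dev_catalog        # illustrative value
              - name: PGSCHEMA
                value: ontos_dev          # illustrative value
  prod:
    resources:
      apps:
        ontos:
          config:
            env:
              - name: DATABRICKS_CATALOG
                value: prod_catalog       # illustrative value
              - name: PGSCHEMA
                value: ontos              # illustrative value
```

If merge behavior turns out to be full-replace, this block would instead need to carry the complete env list from app.yaml.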

Resources in databricks.yaml

  • Required resources (in base config): sql-warehouse, volume
  • Optional resource (in base config with default): serving-endpoint with default name databricks-meta-llama-3-3-70b-instruct. Targets without LLM can override LLM_ENABLED=False in their config.env.
  • Database resource NOT declared in bundle — DAB cannot reference Lakebase by name, only by opaque ID. Instead, pass LAKEBASE_INSTANCE_NAME as an env var in bundle targets.
  • The database resource spec remains in manifest.yaml for Marketplace installs where users select the instance via UI.
  • New variables: serving_endpoint_name (default: databricks-meta-llama-3-3-70b-instruct), lakebase_instance_name (no default — must be set per target).
  • Resource names must match manifest.yaml names exactly: sql-warehouse, serving-endpoint, volume.
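The bullets above could translate into something like the following bundle fragment. The exact DAB app-resource schema should be verified against the Databricks documentation; ${var.warehouse_id} and the app key ontos are illustrative assumptions:

```yaml
# databricks.yaml (sketch) -- new variables and app resource wiring.
variables:
  serving_endpoint_name:
    default: databricks-meta-llama-3-3-70b-instruct
  lakebase_instance_name:
    description: "Set per target -- no default"

resources:
  apps:
    ontos:
      name: ontos
      resources:
        # Names must match manifest.yaml exactly.
        - name: sql-warehouse
          sql_warehouse:
            id: ${var.warehouse_id}       # hypothetical variable
            permission: CAN_USE
        - name: serving-endpoint
          serving_endpoint:
            name: ${var.serving_endpoint_name}
            permission: CAN_QUERY
        # volume entry analogous; no database resource here (DAB
        # cannot reference Lakebase by name, only by opaque ID).
```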

Database Connectivity: Dual-Path Support

The backend must support two database connection paths:

  1. Marketplace path (resource injection): get_lakebase_instance_name() reads the database resource's instance_name via ws_client.apps.get(app_name) → resolves host + credentials via SDK. This already works.

  2. DAB path (env var fallback): When no database resource exists on the app, get_lakebase_instance_name() falls back to settings.LAKEBASE_INSTANCE_NAME. The env var is set per-target in databricks.yaml.

Backend code change in src/backend/src/common/database.py:

  • Modify get_lakebase_instance_name() (line 67) to check settings.LAKEBASE_INSTANCE_NAME as fallback when app resource lookup returns None.
  • This also requires get_lakebase_instance_name() to accept settings as a parameter (or access it globally).
  • Additionally, PGHOST must be resolvable without the database resource. Currently get_db_url() requires settings.PGHOST — for the DAB path, the host must be derived from the instance name via SDK (e.g., ws_client.database.endpoints.get() or similar).
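The PGHOST resolution for the DAB path could be sketched as below. The SDK call and its read_write_dns field are assumptions to be verified against the installed databricks-sdk version, as the last bullet notes:

```python
def resolve_pg_host(ws_client, settings) -> str:
    """Return the Postgres host for the Lakebase connection.

    Marketplace path: PGHOST is already set (resource injection).
    DAB path: derive the host from LAKEBASE_INSTANCE_NAME via the SDK.
    """
    if settings.PGHOST:
        return settings.PGHOST
    instance_name = settings.LAKEBASE_INSTANCE_NAME
    if not instance_name:
        raise RuntimeError("Neither PGHOST nor LAKEBASE_INSTANCE_NAME is set")
    # Assumption: the SDK can look up an instance by name and expose
    # its read/write DNS endpoint.
    instance = ws_client.database.get_database_instance(name=instance_name)
    return instance.read_write_dns
```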

app.yaml Cleanup

  • Remove FRONTEND_STATIC_DIR (dead env var — backend hardcodes Path(__file__).parent.parent / "static").
  • Add DATABRICKS_CATALOG and DATABRICKS_SCHEMA with sensible defaults (will be overridden by bundle targets).
  • Add LAKEBASE_INSTANCE_NAME with empty default (set per-target in bundle, or auto-resolved via database resource for Marketplace).
  • Keep all other env vars as-is (PYTHONPATH, PGSCHEMA, APP_ADMIN_DEFAULT_GROUPS, etc.).
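After cleanup, app.yaml might look like this sketch. The env var names shown with valueFrom and the default values are illustrative, not the app's full list:

```yaml
# app.yaml (sketch) -- runtime authority after cleanup.
command: ["python", "backend/src/app.py"]
env:
  - name: PYTHONPATH
    value: "backend"
  - name: DATABRICKS_CATALOG
    value: "app_data"            # overridden per target by the bundle
  - name: DATABRICKS_SCHEMA
    value: "ontos"               # overridden per target by the bundle
  - name: LAKEBASE_INSTANCE_NAME
    value: ""                    # DAB targets set this; Marketplace
                                 # resolves via the database resource
  - name: DATABRICKS_WAREHOUSE_ID # hypothetical var name
    valueFrom: "sql-warehouse"   # resource reference
  # ... remaining env vars unchanged (PGSCHEMA, APP_ADMIN_DEFAULT_GROUPS, etc.)
  # FRONTEND_STATIC_DIR removed (dead config)
```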

Documentation Fixes

  • README: Fix variable names in deploy command, remove databricks apps deploy, add target flags, add note about auto frontend build, improve Quick Start local dev instructions.
  • CONFIGURING.md: Replace databricks apps deploy <app-name> with databricks bundle deploy -t prod, align app.yaml example, document the two database connectivity paths (Marketplace vs DAB).
  • pyproject.toml: Fix deploy script variable names (catalog → catalog_name, schema → schema_name), fix app name (app_ontos → ontos), add target flag.
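The before/after for the deploy command (catalog and schema values illustrative):

```
# Before (broken): "catalog" is not a defined bundle variable,
# and a second `databricks apps deploy` step was documented.
databricks bundle deploy --var="catalog=app_data"

# After: correct variable names plus an explicit target;
# no separate `databricks apps deploy` step is needed.
databricks bundle deploy -t dev --var="catalog_name=app_data" --var="schema_name=ontos"
```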

Config Consistency Validation

  • Add a script (e.g., src/scripts/validate_config.py) that:
    1. Parses databricks.yaml, app.yaml, and manifest.yaml
    2. Checks that every valueFrom reference in app.yaml has a matching resource name in both databricks.yaml and manifest.yaml
    3. Checks that databricks.yaml resource names match manifest.yaml resource spec names
    4. Optionally runs databricks bundle validate if CLI is available
  • Can be run in CI or manually before deploy.
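The core cross-check (step 2) fits in a few lines. This sketch assumes the three files are already parsed into dicts (e.g., with PyYAML) and that app.yaml's env entries carry valueFrom keys as described above; the actual key paths inside databricks.yaml and manifest.yaml may differ:

```python
def valuefrom_refs(app_cfg: dict) -> set:
    """Collect every valueFrom resource reference from app.yaml's env list."""
    return {e["valueFrom"] for e in app_cfg.get("env", []) if "valueFrom" in e}

def missing_refs(app_cfg: dict, bundle_names: set, manifest_names: set) -> set:
    """Return valueFrom references that lack a matching resource name in
    either databricks.yaml or manifest.yaml (both must declare it)."""
    refs = valuefrom_refs(app_cfg)
    return {r for r in refs if r not in bundle_names or r not in manifest_names}
```

A wrapper script would load the YAML files, build the two name sets, and exit non-zero when missing_refs() is non-empty, making it straightforward to wire into CI later.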

Testing Decisions

Good tests for this work verify external behavior (does the config parse correctly, are resources consistent, does the fallback logic work) not implementation details (specific YAML formatting).

What to test

  1. get_lakebase_instance_name() fallback logic: Unit test that when app resource lookup returns None, the function falls back to settings.LAKEBASE_INSTANCE_NAME. Test both paths: resource found (Marketplace) and resource not found + env var set (DAB).

  2. Config consistency check (src/scripts/validate_config.py): A Python script that parses all three YAML files and asserts:

    • Every valueFrom reference in app.yaml has a matching resource in databricks.yaml or manifest.yaml
    • Every resource in databricks.yaml has a matching spec in manifest.yaml
    • Resource names are consistent across all files
  3. Bundle validation: Run databricks bundle validate -t dev and databricks bundle validate -t prod to check YAML syntax and variable resolution.
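The fallback unit test (item 1) might look like the sketch below, with stub objects standing in for the SDK client and settings. The resolution helper is a stand-in for get_lakebase_instance_name()'s logic; a real test would exercise the actual function in database.py:

```python
class StubDatabase:
    def __init__(self, instance_name):
        self.instance_name = instance_name

class StubResource:
    def __init__(self, database=None):
        self.database = database

class StubApp:
    def __init__(self, resources):
        self.resources = resources

def resolve_instance_name(app, settings_instance_name):
    """Stand-in for get_lakebase_instance_name()'s resolution order."""
    for res in (app.resources or []):
        if res.database is not None and res.database.instance_name:
            return res.database.instance_name
    return settings_instance_name or None

def test_marketplace_path():
    # Resource found: the opaque-ID instance name wins over the env var.
    app = StubApp([StubResource(StubDatabase("projects/p/databases/db-1"))])
    assert resolve_instance_name(app, "ignored") == "projects/p/databases/db-1"

def test_dab_fallback():
    # No resource attached: fall back to LAKEBASE_INSTANCE_NAME.
    app = StubApp([])
    assert resolve_instance_name(app, "my-lakebase") == "my-lakebase"
```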

Prior art

  • The project has existing Python test infrastructure via pytest in src/backend/src/tests/
  • The validation script follows the pattern of src/scripts/build_static.sh — a standalone utility script in the scripts directory

Out of Scope

  • Lakebase provisioning automation: The PRD does not cover auto-creating the Lakebase database instance. Users still need to create it manually per CONFIGURING.md.
  • CI/CD pipeline setup: No GitHub Actions or CI pipeline changes. The validation script is meant to be run manually or integrated into CI separately.
  • Frontend build pipeline changes: The build_static.sh and vite.config.ts are correct and unchanged.
  • app.yaml variable interpolation: app.yaml does not support DAB variables. Environment-specific values must come from bundle target overrides.
  • Lobbying Databricks for name-based database references in DAB: The root cause of the chicken-and-egg problem is a platform limitation. This PRD works around it.

Further Notes

  • The DAB config.env merge behavior (merge-by-name vs full-replace) is a critical unknown. If it replaces the entire env list, the per-target override approach won't work and we'll need to duplicate all env vars in the bundle config. This must be verified against Databricks documentation or by testing before implementation.
  • The manifest.yaml and databricks.yaml serve complementary roles: manifest defines what resource types the app can consume (platform contract), while the bundle provisions specific instances of those resources. Both are needed.
  • The serving-endpoint default name (databricks-meta-llama-3-3-70b-instruct) should be validated against what's actually available in the target workspace. Consider making it a required variable with no default to force explicit configuration.
  • The Lakebase chicken-and-egg problem was reported by a DAB user. Their workaround was to skip the database resource entirely and use SDK-based auth. Our approach is similar but cleaner: keep the database spec in manifest.yaml for Marketplace, use LAKEBASE_INSTANCE_NAME env var as fallback for DAB.
  • PGHOST resolution for the DAB path needs investigation: can the SDK resolve the Lakebase host from the instance name alone, or do we need additional config? The workflows already use generate_database_credential() with instance names, so this pattern is proven.

Metadata

Labels: bug (Something isn't working), documentation (Improvements or additions to documentation)