Skip to content

feat(deploy): production deployment preparation, kernel fixes, and submodule sync#58

Draft
Tjemmmic wants to merge 107 commits into
masterfrom
misc-fixes
Draft

feat(deploy): production deployment preparation, kernel fixes, and submodule sync#58
Tjemmmic wants to merge 107 commits into
masterfrom
misc-fixes

Conversation

@Tjemmmic
Copy link
Copy Markdown
Contributor

Summary

Prepares the repo for production deployment: adds a production Docker Compose stack, Caddy reverse proxy, deploy script, and a CI validate workflow. Also fixes a backend kernel bug where list_nodes did not return the full descendant tree when depth was unspecified, bumps both submodule pointers to pull in a large batch of UI/UX fixes and the new FileSource::ByteContents browser file-transfer variant, and adds architecture / deployment / test documentation.

What's included

Production deployment infrastructure

  • docker-compose.production.yml — production compose stack
  • docker/caddy/Caddyfile — reverse proxy config
  • deploy.sh — deployment helper script
  • .env.example — environment template
  • Dockerfile updates for internal-service, ui, workspace-server
  • docker/workspace-server/kernel.toml additions

CI

  • .github/workflows/validate.yml — integration validation workflow

Backend fixes (citadel-workspace-server-kernel)

  • fix: return all tree descendants when depth is unspecified in list_nodes — kernel now walks the full subtree when no depth is passed instead of returning only direct children
  • Adjustments to async_kernel.rs, transaction/mod.rs, transaction/backend_ops_simple.rs, and handlers/domain/async_ops/async_node_ops.rs
  • Log level tuning (info) in main.rs

Submodule bumps

  • citadel-internal-service → pulls in the FileSource::ByteContents variant + typescript-client cleanup
  • citadel-workspaces → pulls in the large UI/UX fix branch (see its PR)

Documentation

  • docs/CODEMAP.md — full route/component inventory
  • docs/INTERACTION_MATRIX.md — navigation/feature matrix
  • docs/PRODUCTION_DEPLOYMENT.md — production readiness analysis
  • docs/TEST_RESULTS.md — systematic UI/UX test results
  • docs/test-screenshots/ — annotated screenshots

Commits

  • ceefaa3 chore: update caddyfile and bump workspace submodule
  • 0424227 wip!: production deployment preparation
  • e98060e chore: update citadel-workspaces submodule pointer
  • 6aa7be5 chore: update citadel-workspaces submodule pointer
  • fc1b508 docs: add codemap, interaction matrix, deployment analysis, and test results
  • f74e0d2 fix(ui): bump citadel-workspaces submodule with UI/UX fixes
  • 8cf1c10 chore: update submodule references
  • b15177b chore: update submodule references
  • 557bd93 fix: return all tree descendants when depth is unspecified in list_nodes
  • 45cc5c2 feat: change UI dev server port from 5173 to 5291 and fix circular dependency TDZ
  • b2eee43 chore: update citadel-workspaces submodule (feature audit fixes)
  • 646d76f chore: update citadel-workspaces submodule (feature hookups + UX fixes)
  • 23b9f76 fix: adjust log level to info, fix Dockerfile vite cache ownership, update submodule pointers

Tjemmmic added 13 commits March 2, 2026 13:29
Previously depth=None defaulted to 0 (direct children only), causing
room nodes (depth 2) to never be returned to the frontend. Now
depth=None returns all descendants at any depth, while depth=Some(0)
retains the direct-children-only behavior.
- citadel-internal-service: add ByteContents FileSource variant
- citadel-workspaces: add Playwright test infrastructure
- auto-claim first active session on direct URL navigation
- wire content header icons to actions in OfficeLayout
- enlarge avatar upload area and add brand mark to landing
- disable workspace-specific settings tabs when disconnected
- add password visibility toggle to login form
- add step progress indicator to registration wizard
- add Escape key dismissal to login and join overlays
- rename 'Add a New Workspace' to 'Join Workspace'
- constrain landing page to viewport height
- add Join Workspace CTA to empty Manage Accounts dialog
- wrap UserDirectory in AppLayout for consistent navigation
- improve login error messages with actionable feedback
- populate User Directory with workspace members
- persist workspace init modal dismissal across navigations
…results

- CODEMAP.md: complete route map and component inventory
- INTERACTION_MATRIX.md: exhaustive navigation and feature matrix
- PRODUCTION_DEPLOYMENT.md: production readiness analysis
- TEST_RESULTS.md: systematic UI/UX browser test results
- test-screenshots/: annotated screenshots from test runs
Points to UI/UX polish commits: placeholder removal, button casing,
accent color unification, TooltipProvider cleanup, and theme standardization.
Points to latest commits including MDX strikethrough fix,
localStorage server fallback, postAuthSetup consolidation,
UX permission feedback, and assorted bug fixes.
@Tjemmmic
Copy link
Copy Markdown
Contributor Author

@Tjemmmic

This comment was marked as outdated.

Tjemmmic added 10 commits April 19, 2026 22:06
Adds a visited set to the breadth-first search in `async_node_ops` to protect against genuine cycles and duplicate child references that could lead to exponential expansion and out-of-memory errors.
…store migration

Defers the backend migration sentinel initialization until `NodeRemote` is available, ensuring the real backend is initialized rather than the in-memory test storage.
Ensures that the per-workspace password key is also deleted from the backend when a workspace is removed, preventing secret leakage and potential re-association issues.
Introduces an `index_write_mutex` to serialize read-modify-write operations on index keys, preventing concurrent connections from racing and silently dropping entities from the index.
…ssing

Updates the production image validation in the CI workflow to fail if EITHER the server or internal-service image is missing, rather than only failing if BOTH are missing.
Removes the Caddyfile and updates deployment scripts and docker-compose configurations to expose the nginx UI directly on port 80.
@Tjemmmic

This comment was marked as outdated.

- Add `migration_tests` in `BackendTransactionManager` to verify the legacy-collection to per-entity-key migration.
- Update `ListNodes` documentation to explicitly document the depth semantics, especially that `None` equates to unlimited depth for the frontend's lazy-loading strategy.
@Tjemmmic

This comment was marked as outdated.

- Extract rate limiter from per-connection task local state into a
  shared `RateLimiter` component.
- This prevents a single actor from circumventing the budget by opening
  multiple concurrent connections (which previously granted N independent
  token buckets).
- Move state into `parking_lot::Mutex<HashMap<u64, Bucket>>` for fast,
  synchronous updates across connection tasks.
- Add comprehensive test suite covering multi-caller deduplication.
- Explain the ephemeral-by-default development contract for the backend config.
- Document the environment variables (`WORKSPACE_BACKEND`, `WORKSPACE_DATA_DIR`)
  required to enable persistent storage, which production deployments inject.
@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic

This comment was marked as outdated.

@Tjemmmic
Copy link
Copy Markdown
Contributor Author

🤖 AI Code Review

Summary

This PR prepares the Citadel Workspace for production deployment by adding persistent filesystem backend support (opt-in via env vars), a production Docker Compose with nginx-served UI and Cloudflare tunnel support, a deploy script with rolling restarts, iterative BFS tree building to replace recursive implementation, and submodule syncs. The changes are extensive but well-structured with thorough inline documentation, good security practices (password redaction in Debug, CSP headers, non-root containers, CHANGE_ME placeholder validation), and solid test coverage for the core algorithm changes.

Issues Found

5 total — 0 P1 (blocking) · 1 P2 (should fix) · 4 P3 (nice to have)

⚠️ P2 — No healthcheck on the ui service in docker-compose.production.yml

  • File: docker-compose.production.yml:L129-L154
  • Problem: The server and internal-service services both have healthchecks, but the ui service does not. The cloudflared service depends on ui with condition: service_started, which only confirms the container exists — not that nginx is actually serving traffic. A misconfigured nginx config, failed Vite build producing an empty dist/, or a port conflict would go undetected until an operator manually tests the endpoint.
  • Fix: Add a healthcheck to the ui service:
  ui:
    # ... existing config ...
    healthcheck:
      test: ["CMD", "wget", "-q", "-O", "/dev/null", "http://localhost:8080/"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 10s

Then update cloudflared to use condition: service_healthy instead of condition: service_started.


ℹ️ P3 — deploy.sh .env parser doesn't trim whitespace from values, diverging from docker-compose behavior

  • File: deploy.sh:L89-L111
  • Problem: The line IFS='=' read -r key value correctly splits on the first =, but if .env contains KEY = value (spaces around =), the resulting value will be value with a leading space. Docker Compose's env-file loader strips leading/trailing whitespace from values. The key is sanitized via ${key// /} but the value is exported as-is. This means INTERNAL_SERVICE_PORT = 12346 in .env would cause wait_for_port to probe 12346 (with leading space) while docker-compose uses 12346. The deploy would time out waiting for the port with a confusing error.
  • Fix: After line 102 (quote stripping), add whitespace trimming for the value:
    # Trim leading/trailing whitespace to match docker-compose behavior
    value="${value## }"
    value="${value%% }"

ℹ️ P3 — No resource limits on cloudflared service in production compose

  • File: docker-compose.production.yml:L175-L200
  • Problem: Every other service in docker-compose.production.yml has deploy.resources.limits (server: 2G/2CPU, internal-service: 2G/2CPU, ui: 256M/0.5CPU), but cloudflared has none. A memory-leaking or CPU-spinning cloudflared process would be unconstrained on the host.
  • Fix: Add resource limits to the cloudflared service:
    deploy:
      resources:
        limits:
          memory: 256M
          cpus: '0.5'

ℹ️ P3 — deploy.sh doesn't validate CHANGE_ME password before building images

  • File: deploy.sh:L82-L86
  • Problem: The server binary validates that workspace_master_password doesn't contain __CHANGE_ME__ at startup (main.rs:L52), but deploy.sh only checks that .env exists — it doesn't pre-validate the password. If the operator forgot to edit .env, the deploy would pull code, rebuild all images (minutes), start the server, and only THEN fail with a clear error. A pre-build check would save significant time on this common misconfiguration.
  • Fix: After the .env existence check at line 84, add:
if grep -q '__CHANGE_ME__' .env 2>/dev/null; then
    echo "ERROR: .env still contains the __CHANGE_ME__ placeholder."
    echo "  Replace WORKSPACE_MASTER_PASSWORD with a real value."
    exit 1
fi

ℹ️ P3 — select_backend_type duplicated across two crates with no mechanical enforcement of sync

  • File: citadel-workspace-internal-service/src/main.rs:L79-L113
  • Problem: The select_backend_type function is duplicated between citadel-workspace-internal-service/src/main.rs:79-113 and citadel-workspace-server-kernel/src/lib.rs:519-553. Both share identical logic (env > cli/config > default, empty-string-as-unset, same default "./data"). The "KEEP IN SYNC WITH" comments and mirrored test suites mitigate drift, but adding a third backend type or changing precedence would require updating both independently. The kernel has 9 tests while the internal service has 7, so edge-case coverage already has a parity gap.
  • Fix: Extract the shared backend selection logic into citadel-workspace-types (already a shared dependency of both crates) as a public function, and call it from both binaries. Alternatively, add a CI check that diffs the two implementations.

✅ APPROVE

No P1 issues found. One P2 (missing UI healthcheck) and four P3s. The core algorithm changes (iterative BFS, cycle protection) are well-tested with dedicated integration tests. Security practices are solid (password redaction, CSP, non-root containers, placeholder validation). The production infrastructure is thoughtfully structured with good comments explaining design rationale. The P2 is a low-risk observability gap that can be addressed in a follow-up.

Quick Reference

  • P2: No healthcheck on the ui service in docker-compose.production.yml
  • P3: deploy.sh .env parser doesn't trim whitespace from values, diverging from docker-compose behavior
  • P3: No resource limits on cloudflared service in production compose
  • P3: deploy.sh doesn't validate CHANGE_ME password before building images
  • P3: select_backend_type duplicated across two crates with no mechanical enforcement of sync

Reviewed by Sokuza AI (zai-coding-plan/glm-5.1, via opencode)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant