Cut AI inference cost 3–10× with routing intelligence, not bigger models. A tiny, practical framework that treats inference as triage + routing + verification:
Cache → Doorman (small model) → Redirector (policy) → Specialists (tools) → Big Model (fallback).
Most systems send every query to a large, expensive LLM. That’s wasteful:
- 60–90% of production queries are repetitive or easy (FAQs, simple lookups).
- Expensive models handle trivial work that a cache, a calculator, or a tiny model could answer.
- Result: high bills, higher latency, and avoidable environmental cost.
Doorman Architecture makes cheap, fast paths the default—and escalates only when needed.
- L0 – Cache & rules (instant): exact/semantic cache hit? return.
- L1 – Doorman AI (10–25ms): tiny model classifies intent, difficulty, confidence.
- L2 – Redirector (1ms): deterministic policy decides next hop based on L1.
- L3 – Specialists (tools): calculator, search/RAG, code runner, etc.
- L4 – Big model (rare): only for novel/ambiguous/high-stakes asks; result is cached.
Key idea: early exits + tool use + tiny models absorb most traffic; heavy models are for the tail.
```mermaid
flowchart LR
    Q[User Query] --> C{Cache fresh?}
    C -- yes --> A[Answer]
    C -- no --> D["Doorman (small LLM)<br/>intent / difficulty / confidence"]
    D --> R[Redirector Policy]
    R -->|faq, easy, high-conf| M["micro-LLM answer + verifier"]
    R -->|math/date/units| T["Tool (calc/regex)"]
    R -->|knowledge| G["RAG/retrieval + cite"]
    R -->|hard/low-conf/sensitive| B["Big Model + verifier"]
    M --> AC["Cache & return"]
    T --> AC
    G --> AC
    B --> AC
    AC --> A
```
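The Doorman in the diagram is just a tiny classifier with a strict output contract. A minimal sketch, assuming an OpenAI-compatible small model; the model name, intent labels, and JSON shape are illustrative, not part of the spec:

```python
# Hypothetical sketch of the L1 Doorman step: a small model returns a strict
# JSON triage verdict. Model name and label sets are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DOORMAN_PROMPT = (
    "Classify the user query. Reply with JSON only: "
    '{"intent": "faq|math|units|datetime|knowledge|code|policy_sensitive|other", '
    '"difficulty": "easy|medium|hard", "confidence": 0.0-1.0}'
)

def doorman_classify(query: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # any small, cheap model works here
        messages=[
            {"role": "system", "content": DOORMAN_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},  # force parseable JSON
        max_tokens=60,
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```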
- Create a new Python Repl.
- Add packages: `flask`, `openai`, `numpy`.
- Set Secrets: `OPENAI_API_KEY`, `ADMIN_KEY`.
- Paste the minimal demo from `examples/replit_demo/app.py` (this repo).
- Run → you get a public URL: `https://<your-repl>/v1/chat/completions`.
Test:

```bash
curl -s https://<your-repl>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'
```
Ask again → the JSON response should include `"cached": true`.
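If you prefer Python over curl, here is a small smoke-test sketch (it assumes the `requests` package is installed and that the demo returns a top-level `cached` field, as described above):

```python
# Hedged smoke test for the Replit demo: call the endpoint twice and check
# that the second call is served from cache. Replace the URL with your Repl's.
import requests

URL = "https://<your-repl>/v1/chat/completions"   # placeholder from the step above
payload = {"messages": [{"role": "user", "content": "What is 2+2?"}]}

first = requests.post(URL, json=payload, timeout=30).json()
second = requests.post(URL, json=payload, timeout=30).json()

print("first cached:", first.get("cached"))    # expected: false (or absent)
print("second cached:", second.get("cached"))  # expected: true
```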
The Replit demo is perfect for validation. For reliability & scale, use Option B.
- Neon (managed Postgres):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS cache (
  id          BIGSERIAL PRIMARY KEY,
  tenant_id   TEXT NOT NULL,
  query       TEXT NOT NULL,
  embedding   vector(1536) NOT NULL,
  response    TEXT NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  ttl_seconds INT NOT NULL DEFAULT 604800
);

CREATE INDEX IF NOT EXISTS cache_vec_idx
  ON cache USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

CREATE INDEX IF NOT EXISTS cache_tenant_time_idx
  ON cache (tenant_id, created_at DESC);
```
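For orientation, a hedged sketch of how a service might query that table for a semantic hit, using psycopg2 and pgvector's cosine-distance operator; the embedding model and function names are assumptions, not the repo's actual code:

```python
# Hypothetical semantic-cache lookup against the table above.
# Assumes DATABASE_URL, OPENAI_API_KEY, SIM_THRESHOLD from the Render setup.
import os
import psycopg2
from openai import OpenAI

client = OpenAI()
SIM_THRESHOLD = float(os.getenv("SIM_THRESHOLD", "0.85"))

def embed(text: str) -> list[float]:
    # text-embedding-3-small returns 1536-dim vectors, matching vector(1536)
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cache_lookup(tenant_id: str, query: str) -> str | None:
    vec = "[" + ",".join(str(x) for x in embed(query)) + "]"   # pgvector text format
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT response, 1 - (embedding <=> %s::vector) AS similarity
                FROM cache
                WHERE tenant_id = %s
                  AND created_at > now() - make_interval(secs => ttl_seconds)
                ORDER BY embedding <=> %s::vector
                LIMIT 1
                """,
                (vec, tenant_id, vec),
            )
            row = cur.fetchone()
    finally:
        conn.close()
    if row and row[1] >= SIM_THRESHOLD:
        return row[0]   # semantic hit: serve the cached response
    return None         # miss: fall through to the Doorman
```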
- Render (Web Service):
  - Repo contains `app.py` and `requirements.txt`.
  - Environment: `OPENAI_API_KEY`, `DATABASE_URL`, `SIM_THRESHOLD=0.85`, `DEFAULT_TTL_SECONDS=604800`, `ADMIN_KEY`.
  - Start command: `gunicorn app:app --preload --workers 2 --threads 4 --timeout 120`
Test:

```bash
curl -s https://<your-service>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: <your-tenant-key>" \
  -d '{"messages":[{"role":"user","content":"Explain HTTPS simply."}]}'
```
```python
def route(intent, difficulty, confidence):
    # fast lanes
    if intent in {"math", "units", "datetime"}:
        return "TOOL"
    if intent == "faq" and confidence >= 0.90:
        return "MICRO"
    # cautious lanes
    if difficulty == "hard" or confidence < 0.60 or intent in {"policy_sensitive", "medical", "legal"}:
        return "BIG"
    # default
    return "MICRO"
```
- MICRO → small/cheap model (128–256 tokens cap) + verifier
- TOOL → calculator/regex/date utils
- BIG → large model + verifier (cache result)
```python
# 1) Cache check
hit = cache.find_similar(query_emb, min_sim=SIM_THRESHOLD, fresh_within=TTL)
if hit:
    return hit.response

# 2) Doorman (tiny model)
intent, difficulty, confidence = doorman.classify(query)

# 3) Redirect
decision = route(intent, difficulty, confidence)

# 4) Execute
if decision == "TOOL":
    ans = run_tool(intent, query)
elif decision == "MICRO":
    ans = micro_llm.generate(query, cap=256)
    if not verifier.passes(ans, intent):  # schema/citation/tests
        ans = big_llm.generate(query, cap=1024)
else:  # BIG
    ans = big_llm.generate(query, cap=1024)

# 5) Cache & return
cache.store(query, query_emb, ans, ttl=TTL)
return ans
```
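The `run_tool` call above stands for whatever deterministic handlers you register for the TOOL lane. A hedged sketch with illustrative handlers for the `math` and `datetime` intents from `route()` (the query parsing is deliberately naive):

```python
# Hypothetical TOOL lane: deterministic handlers, no LLM involved.
import ast
import operator
from datetime import datetime, timezone

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow}

def _safe_eval(node):
    # Evaluate +, -, *, /, ** over numeric literals only (no names, no calls).
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_safe_eval(node.operand)
    raise ValueError("unsupported expression")

def run_tool(intent: str, query: str) -> str:
    if intent == "math":
        expr = query.rstrip("?").split("is")[-1].strip()   # naive extraction, e.g. "What is 2+2?"
        return str(_safe_eval(ast.parse(expr, mode="eval")))
    if intent == "datetime":
        return datetime.now(timezone.utc).isoformat()
    raise ValueError(f"no tool registered for intent {intent!r}")
```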
- JSON/schema checks for structured outputs.
- Math/units: compute and compare.
- Citations: require at least one source URL/id for RAG answers.
- Code: run unit tests in a sandbox.
- Safety: basic content filters on final outputs.
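A minimal sketch of two of these checks (schema and citations); the function names and intent labels are illustrative, not the repo's verifier API:

```python
# Hypothetical verifier helpers for structured and RAG outputs.
import json
import re

def verify_json_schema(draft: str, required_keys: set[str]) -> bool:
    """Structured outputs: the draft must be valid JSON with the expected keys."""
    try:
        obj = json.loads(draft)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj)

def verify_citations(draft: str) -> bool:
    """RAG answers: require at least one URL or [source-id] style reference."""
    return bool(re.search(r"https?://\S+|\[[\w\-]+\]", draft))

def verifier_passes(draft: str, intent: str) -> bool:
    if intent == "structured":
        return verify_json_schema(draft, {"answer"})
    if intent == "knowledge":
        return verify_citations(draft)
    return True  # other intents fall through to safety filters
```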
- Hit rate by tier: % served by Cache / Micro / Tools / Big.
- QACST: Quality-Adjusted Cost per Solved Task (north-star).
- P95 latency per intent.
- Verifier catch rate: % bad drafts blocked before user.
- Escalation precision/recall: are we escalating the right queries?
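The QACST formula isn't pinned down here, so the sketch below uses one plausible definition, total spend divided by quality-weighted solved tasks, over hypothetical per-request log fields (`tier`, `cost_usd`, `solved`, `quality`):

```python
# Illustrative metric computation from per-request logs; the log schema and
# the QACST definition are assumptions, not a fixed spec.
from collections import Counter

def tier_hit_rate(logs: list[dict]) -> dict:
    counts = Counter(r["tier"] for r in logs)   # e.g. cache / micro / tool / big
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}

def qacst(logs: list[dict]) -> float:
    """Quality-Adjusted Cost per Solved Task: total spend divided by
    quality-weighted solved tasks (quality in [0, 1])."""
    total_cost = sum(r["cost_usd"] for r in logs)
    solved_quality = sum(r["quality"] for r in logs if r["solved"])
    return total_cost / solved_quality if solved_quality else float("inf")
```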
```text
/
├─ README.md
├─ LICENSE
├─ app.py                 # simple Flask reference (DB-backed)
├─ requirements.txt
├─ examples/
│  └─ replit_demo/
│     └─ app.py           # 1-file file-cache demo
├─ docs/
│  ├─ doorman_policy.md   # routing policies & thresholds
│  ├─ verifiers.md        # schema checks, tests, safety
│  └─ metrics.md          # QACST & dashboards
└─ diagrams/
   └─ doorman.mmd         # mermaid source
```
- Doorman Architecture (memorable)
- Hierarchical LLM Routing (descriptive)
- Tagline: “Smarter paths, smaller bills.”
v0.1 (this repo)
- Cache + Doorman + Redirector + sample verifiers
- Replit & Render quickstarts
- OpenAI-compatible `/v1/chat/completions` endpoint
v0.2
- Multi-tenant API keys + basic rate limits
- Savings & hit-rate metrics endpoint
- Pluggable tool registry (calc, RAG, code)
v0.3
- Router “learning from escalations”
- Pinned retrieval snapshots & TTL policies per intent
- Starter dashboards (simple HTML or CSV export)
Hosted option (optional, parallel)
- Drop-in proxy + savings dashboard; shared-savings pricing
- Apache 2.0 (permissive, encourages adoption).
- If you ship a hosted version, keep enterprise bits (SSO, SLAs, advanced dashboards) separate.
- PRs: new verifiers, tools, routing policies, benchmarks.
- Issues: include a minimal repro + intent/difficulty/confidence if possible.
- We welcome production stories (cost saved, latency gains) → helps guide defaults.
**Isn’t this what big labs will do anyway?** Maybe. This project is vendor-agnostic, runs today, and focuses on operational excellence (routing + verification + governance) that vendors often under-prioritize.
**Does this reduce quality?** No: the verifier + escalation path preserves quality. Easy questions take cheap lanes; hard ones still hit the big model.
**What if my queries need personalization?** Cache content separately from style; have a micro-LLM rephrase cached content in the user's tone (see the sketch below).
**What about staleness?** Use TTL + provenance; nightlies can revalidate top cached answers.
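A hedged sketch of that personalization pattern: cache neutral content and let a micro model restyle it per user (the model name is illustrative):

```python
# Hypothetical restyling step: the cache stores neutral content,
# and a small model rewrites it in the user's tone without changing facts.
from openai import OpenAI

client = OpenAI()

def personalize(cached_answer: str, user_tone: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # any cheap micro model works here
        messages=[
            {"role": "system",
             "content": f"Rewrite the answer in a {user_tone} tone. "
                        "Do not add or remove facts."},
            {"role": "user", "content": cached_answer},
        ],
        max_tokens=256,
        temperature=0.3,
    )
    return resp.choices[0].message.content
```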
This project stands on the shoulders of the broader AI community exploring sparse activation, test-time compute, RAG, and verification. The contribution here is a practical, production-oriented framing and a set of minimal, copy-pasteable building blocks.
Plant the flag. If you share the belief that efficiency beats brute force, please ⭐ the repo, try the quickstart, and open an issue with your use case.