Cut AI inference cost 3–10× with routing intelligence, not bigger models. A tiny, practical framework that treats inference as triage + routing + verification:
Cache → Doorman (small model) → Redirector (policy) → Specialists (tools) → Big Model (fallback).
Most systems send every query to a large, expensive LLM. That’s wasteful:
- 60–90% of production queries are repetitive or easy (FAQs, simple lookups).
- Expensive models handle trivial work that a cache, a calculator, or a tiny model could answer.
- Result: high bills, higher latency, and avoidable environmental cost.
Doorman Architecture makes cheap, fast paths the default—and escalates only when needed.
- L0 – Cache & rules (instant): exact/semantic cache hit? return.
- L1 – Doorman AI (10–25ms): tiny model classifies intent, difficulty, confidence.
- L2 – Redirector (1ms): deterministic policy decides next hop based on L1.
- L3 – Specialists (tools): calculator, search/RAG, code runner, etc.
- L4 – Big model (rare): only for novel/ambiguous/high-stakes asks; result is cached.
Key idea: early exits + tool use + tiny models absorb most traffic; heavy models are for the tail.
```mermaid
flowchart LR
    Q[User Query] --> C{Cache fresh?}
    C -- yes --> A[Answer]
    C -- no --> D["Doorman (small LLM)<br/>intent / difficulty / confidence"]
    D --> R[Redirector Policy]
    R -->|faq, easy, high-conf| M["micro-LLM answer + verifier"]
    R -->|math/date/units| T["Tool (calc/regex)"]
    R -->|knowledge| G["RAG/retrieval + cite"]
    R -->|hard/low-conf/sensitive| B["Big Model + verifier"]
    M --> AC["Cache & return"]
    T --> AC
    G --> AC
    B --> AC
    AC --> A
```
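The Doorman in the diagram is just a tiny classifier with a strict output contract. A minimal sketch, assuming an OpenAI-compatible small model; the model name, intent labels, and JSON shape are illustrative, not part of the spec:

```python
# Hypothetical sketch of the L1 Doorman step: a small model returns a strict
# JSON triage verdict. Model name and label sets are illustrative only.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DOORMAN_PROMPT = (
    "Classify the user query. Reply with JSON only: "
    '{"intent": "faq|math|units|datetime|knowledge|code|policy_sensitive|other", '
    '"difficulty": "easy|medium|hard", "confidence": 0.0-1.0}'
)

def doorman_classify(query: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",                      # any small, cheap model works here
        messages=[
            {"role": "system", "content": DOORMAN_PROMPT},
            {"role": "user", "content": query},
        ],
        response_format={"type": "json_object"},  # force parseable JSON
        max_tokens=60,
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)
```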
- Create a new Python Repl.
- Add packages: `flask`, `openai`, `numpy`.
- Set Secrets: `OPENAI_API_KEY`, `ADMIN_KEY`.
- Paste the minimal demo from `examples/replit_demo/app.py` (this repo).
- Run → you get a public URL: `https://<your-repl>/v1/chat/completions`.
Test:

```bash
curl -s https://<your-repl>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}]}'
```
Ask again → the JSON response should include `"cached": true`.
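If you prefer Python over curl, here is a small smoke-test sketch (it assumes the `requests` package is installed and that the demo returns a top-level `cached` field, as described above):

```python
# Hedged smoke test for the Replit demo: call the endpoint twice and check
# that the second call is served from cache. Replace the URL with your Repl's.
import requests

URL = "https://<your-repl>/v1/chat/completions"   # placeholder from the step above
payload = {"messages": [{"role": "user", "content": "What is 2+2?"}]}

first = requests.post(URL, json=payload, timeout=30).json()
second = requests.post(URL, json=payload, timeout=30).json()

print("first cached:", first.get("cached"))    # expected: false (or absent)
print("second cached:", second.get("cached"))  # expected: true
```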
The Replit demo is perfect for validation. For reliability & scale, use Option B.
- Neon (managed Postgres):

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS cache (
  id          BIGSERIAL PRIMARY KEY,
  tenant_id   TEXT NOT NULL,
  query       TEXT NOT NULL,
  embedding   vector(1536) NOT NULL,
  response    TEXT NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now(),
  ttl_seconds INT NOT NULL DEFAULT 604800
);

CREATE INDEX IF NOT EXISTS cache_vec_idx
  ON cache USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 100);

CREATE INDEX IF NOT EXISTS cache_tenant_time_idx
  ON cache (tenant_id, created_at DESC);
```
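For orientation, a hedged sketch of how a service might query that table for a semantic hit, using psycopg2 and pgvector's cosine-distance operator; the embedding model and function names are assumptions, not the repo's actual code:

```python
# Hypothetical semantic-cache lookup against the table above.
# Assumes DATABASE_URL, OPENAI_API_KEY, SIM_THRESHOLD from the Render setup.
import os
import psycopg2
from openai import OpenAI

client = OpenAI()
SIM_THRESHOLD = float(os.getenv("SIM_THRESHOLD", "0.85"))

def embed(text: str) -> list[float]:
    # text-embedding-3-small returns 1536-dim vectors, matching vector(1536)
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

def cache_lookup(tenant_id: str, query: str) -> str | None:
    vec = "[" + ",".join(str(x) for x in embed(query)) + "]"   # pgvector text format
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    try:
        with conn.cursor() as cur:
            cur.execute(
                """
                SELECT response, 1 - (embedding <=> %s::vector) AS similarity
                FROM cache
                WHERE tenant_id = %s
                  AND created_at > now() - make_interval(secs => ttl_seconds)
                ORDER BY embedding <=> %s::vector
                LIMIT 1
                """,
                (vec, tenant_id, vec),
            )
            row = cur.fetchone()
    finally:
        conn.close()
    if row and row[1] >= SIM_THRESHOLD:
        return row[0]   # semantic hit: serve the cached response
    return None         # miss: fall through to the Doorman
```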
- Render (Web Service):
  - Repo contains `app.py` and `requirements.txt`.
  - Environment: `OPENAI_API_KEY`, `DATABASE_URL`, `SIM_THRESHOLD=0.85`, `DEFAULT_TTL_SECONDS=604800`, `ADMIN_KEY`.
  - Start command: `gunicorn app:app --preload --workers 2 --threads 4 --timeout 120`
Test:

```bash
curl -s https://<your-service>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Api-Key: <your-tenant-key>" \
  -d '{"messages":[{"role":"user","content":"Explain HTTPS simply."}]}'
```
```python
def route(intent, difficulty, confidence):
    # fast lanes
    if intent in {"math", "units", "datetime"}:
        return "TOOL"
    if intent == "faq" and confidence >= 0.90:
        return "MICRO"
    # cautious lanes
    if difficulty == "hard" or confidence < 0.60 or intent in {"policy_sensitive", "medical", "legal"}:
        return "BIG"
    # default
    return "MICRO"
```
- MICRO → small/cheap model (128–256 tokens cap) + verifier
- TOOL → calculator/regex/date utils
- BIG → large model + verifier (cache result)
```python
# 1) Cache check
hit = cache.find_similar(query_emb, min_sim=SIM_THRESHOLD, fresh_within=TTL)
if hit:
    return hit.response

# 2) Doorman (tiny model)
intent, difficulty, confidence = doorman.classify(query)

# 3) Redirect
decision = route(intent, difficulty, confidence)

# 4) Execute
if decision == "TOOL":
    ans = run_tool(intent, query)
elif decision == "MICRO":
    ans = micro_llm.generate(query, cap=256)
    if not verifier.passes(ans, intent):  # schema/citation/tests
        ans = big_llm.generate(query, cap=1024)
else:  # BIG
    ans = big_llm.generate(query, cap=1024)

# 5) Cache & return
cache.store(query, query_emb, ans, ttl=TTL)
return ans
```
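The `run_tool` call above stands for whatever deterministic handlers you register for the TOOL lane. A hedged sketch with illustrative handlers for the `math` and `datetime` intents from `route()` (the query parsing is deliberately naive):

```python
# Hypothetical TOOL lane: deterministic handlers, no LLM involved.
import ast
import operator
from datetime import datetime, timezone

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv, ast.Pow: operator.pow}

def _safe_eval(node):
    # Evaluate +, -, *, /, ** over numeric literals only (no names, no calls).
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_safe_eval(node.operand)
    raise ValueError("unsupported expression")

def run_tool(intent: str, query: str) -> str:
    if intent == "math":
        expr = query.rstrip("?").split("is")[-1].strip()   # naive extraction, e.g. "What is 2+2?"
        return str(_safe_eval(ast.parse(expr, mode="eval")))
    if intent == "datetime":
        return datetime.now(timezone.utc).isoformat()
    raise ValueError(f"no tool registered for intent {intent!r}")
```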
- JSON/schema checks for structured outputs.
- Math/units: compute and compare.
- Citations: require at least one source URL/id for RAG answers.
- Code: run unit tests in a sandbox.
- Safety: basic content filters on final outputs.
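A minimal sketch of two of these checks (schema and citations); the function names and intent labels are illustrative, not the repo's verifier API:

```python
# Hypothetical verifier helpers for structured and RAG outputs.
import json
import re

def verify_json_schema(draft: str, required_keys: set[str]) -> bool:
    """Structured outputs: the draft must be valid JSON with the expected keys."""
    try:
        obj = json.loads(draft)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and required_keys.issubset(obj)

def verify_citations(draft: str) -> bool:
    """RAG answers: require at least one URL or [source-id] style reference."""
    return bool(re.search(r"https?://\S+|\[[\w\-]+\]", draft))

def verifier_passes(draft: str, intent: str) -> bool:
    if intent == "structured":
        return verify_json_schema(draft, {"answer"})
    if intent == "knowledge":
        return verify_citations(draft)
    return True  # other intents fall through to safety filters
```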
- Hit rate by tier: % served by Cache / Micro / Tools / Big.
- QACST: Quality-Adjusted Cost per Solved Task (north-star).
- P95 latency per intent.
- Verifier catch rate: % bad drafts blocked before user.
- Escalation precision/recall: are we escalating the right queries?
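The QACST formula isn't pinned down here, so the sketch below uses one plausible definition, total spend divided by quality-weighted solved tasks, over hypothetical per-request log fields (`tier`, `cost_usd`, `solved`, `quality`):

```python
# Illustrative metric computation from per-request logs; the log schema and
# the QACST definition are assumptions, not a fixed spec.
from collections import Counter

def tier_hit_rate(logs: list[dict]) -> dict:
    counts = Counter(r["tier"] for r in logs)   # e.g. cache / micro / tool / big
    total = sum(counts.values())
    return {tier: n / total for tier, n in counts.items()}

def qacst(logs: list[dict]) -> float:
    """Quality-Adjusted Cost per Solved Task: total spend divided by
    quality-weighted solved tasks (quality in [0, 1])."""
    total_cost = sum(r["cost_usd"] for r in logs)
    solved_quality = sum(r["quality"] for r in logs if r["solved"])
    return total_cost / solved_quality if solved_quality else float("inf")
```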
```text
/
├─ README.md
├─ LICENSE
├─ app.py                 # simple Flask reference (DB-backed)
├─ requirements.txt
├─ examples/
│  └─ replit_demo/
│     └─ app.py           # 1-file file-cache demo
├─ docs/
│  ├─ doorman_policy.md   # routing policies & thresholds
│  ├─ verifiers.md        # schema checks, tests, safety
│  └─ metrics.md          # QACST & dashboards
└─ diagrams/
   └─ doorman.mmd         # mermaid source
```
- Doorman Architecture (memorable)
- Hierarchical LLM Routing (descriptive)
- Tagline: “Smarter paths, smaller bills.”
v0.1 (this repo)
- Cache + Doorman + Redirector + sample verifiers
- Replit & Render quickstarts
- OpenAI-compatible `/v1/chat/completions` endpoint
v0.2
- Multi-tenant API keys + basic rate limits
- Savings & hit-rate metrics endpoint
- Pluggable tool registry (calc, RAG, code)
v0.3
- Router “learning from escalations”
- Pinned retrieval snapshots & TTL policies per intent
- Starter dashboards (simple HTML or CSV export)
Hosted option (optional, parallel)
- Drop-in proxy + savings dashboard; shared-savings pricing
- Apache 2.0 (permissive, encourages adoption).
- If you ship a hosted version, keep enterprise bits (SSO, SLAs, advanced dashboards) separate.
- PRs: new verifiers, tools, routing policies, benchmarks.
- Issues: include a minimal repro + intent/difficulty/confidence if possible.
- We welcome production stories (cost saved, latency gains) → helps guide defaults.
**Isn’t this what big labs will do anyway?** Maybe. This project is vendor-agnostic, runs today, and focuses on operational excellence (routing + verification + governance) that vendors often under-prioritize.
**Does this reduce quality?** No: the verifier + escalation path preserves quality. Easy questions take cheap lanes; hard ones still hit the big model.
**What if my queries need personalization?** Cache content separately from style; have a micro-LLM rephrase cached content in the user's tone (see the sketch below).
**What about staleness?** Use TTL + provenance; nightlies can revalidate top cached answers.
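A hedged sketch of that personalization pattern: cache neutral content and let a micro model restyle it per user (the model name is illustrative):

```python
# Hypothetical restyling step: the cache stores neutral content,
# and a small model rewrites it in the user's tone without changing facts.
from openai import OpenAI

client = OpenAI()

def personalize(cached_answer: str, user_tone: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # any cheap micro model works here
        messages=[
            {"role": "system",
             "content": f"Rewrite the answer in a {user_tone} tone. "
                        "Do not add or remove facts."},
            {"role": "user", "content": cached_answer},
        ],
        max_tokens=256,
        temperature=0.3,
    )
    return resp.choices[0].message.content
```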
This project stands on the shoulders of the broader AI community exploring sparse activation, test-time compute, RAG, and verification. The contribution here is a practical, production-oriented framing and a set of minimal, copy-pasteable building blocks.
Plant the flag. If you share the belief that efficiency beats brute force, please ⭐ the repo, try the quickstart, and open an issue with your use case.