Flow-Next Token-Optimized Development Specification
Complete End-to-End System Design
Table of Contents
- Executive Summary
- System Overview
- Resource Inventory
- Phase 1: Interview
- Phase 2: Plan
- Phase 3: Work
- Phase 4: Review
- Local Model Infrastructure
- Multi-Source Orchestration
- Configuration Reference
- Implementation Roadmap
- Appendices
1. Executive Summary
1.1 The Problem
Flow-next's Ralph loops burn through premium model quotas (Codex 5.2, Opus 4.5) in 2 days instead of lasting a full week. Token consumption occurs across all phases:
| Phase | Current Token Burn | Primary Waste |
|---|---|---|
| Interview | Low (human-paced) | None significant |
| Plan | Medium (scout parallelism) | Redundant codebase scanning |
| Work | High (implementation) | Full context on every task |
| Review | Very High | Re-reviewing unchanged code, verbose prompts |
1.2 The Solution
Deploy GLM-4.7-Flash (59.2% SWE-Bench) locally to handle preprocessing across all phases while premium models (Opus 4.5, Codex 5.2, Gemini 3 Pro) make final decisions via multi-source quota pooling.
1.3 Key Principles
- Premium models always make final decisions — No quality compromise
- Local handles grunt work — Pre-screening, classification, context preparation
- Multi-source pooling — 4 premium quota pools (OpenAI, Claude, Copilot, Antigravity)
- No lossy compression — Full context for security/complex reviews
- Phase-appropriate optimization — Each phase optimized differently
1.4 Expected Outcomes
| Metric | Current | Target |
|---|---|---|
| Weekly quota duration | 2 days | 7+ days |
| Premium calls per task | 2-4 | 0.3-0.5 |
| Review quality | 100% | 100% (unchanged) |
| Local pre-screening catch rate | 0% | 60-80% |
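As a rough consistency check on these targets, the sketch below assumes that quota burn scales linearly with premium calls per task and that the size of each call stays constant. Neither assumption is stated in this spec, and the function names are illustrative only.

```python
# Sketch: sanity-check the targets in the table, assuming quota burn is
# proportional to premium calls per task (an assumption, not a measurement).

def required_burn_ratio(current_days: float, target_days: float) -> float:
    """Fraction of today's burn rate that stretches the quota to target_days."""
    return current_days / target_days


def call_ratio_range(
    current: tuple[float, float], target: tuple[float, float]
) -> tuple[float, float]:
    """Best and worst case for (target calls / current calls)."""
    best = target[0] / current[1]   # fewest target calls vs most current calls
    worst = target[1] / current[0]  # most target calls vs fewest current calls
    return best, worst


needed = required_burn_ratio(2, 7)                   # about 0.29
best, worst = call_ratio_range((2, 4), (0.3, 0.5))   # 0.075 and 0.25
print(f"needed <= {needed:.2f}, projected {best:.3f} to {worst:.2f}")
```

Under these assumptions, even the worst-case call ratio (0.25) sits below the roughly 0.29 needed to stretch a 2-day quota to 7 days.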
2. System Overview
2.1 End-to-End Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ FLOW-NEXT OPTIMIZED PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: INTERVIEW │
│ /flow-next:interview │
│ │
│ Human-paced, low token burn │
│ Local: Question generation, spec drafting │
│ Premium: Complex decision trees (when needed) │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: PLAN │
│ /flow-next:plan │
│ │
│ Scout parallelism, moderate token burn │
│ Local: repo-scout, practice-scout, docs-scout (GLM-4.7-Flash) │
│ Premium: Epic spec writing, architectural decisions │
│ Optimization: Scout result caching, incremental updates │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: WORK │
│ /flow-next:work │
│ │
│ Implementation, high token burn │
│ Local: Worker re-anchor summaries, implementation assistance │
│ Premium: Complex implementation decisions, unfamiliar patterns │
│ Optimization: Task-scoped context, memory system │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 4: REVIEW │
│ /flow-next:impl-review, /flow-next:plan-review │
│ │
│ Verification, very high token burn │
│ Local: Pre-screening, fingerprinting, classification │
│ Premium: Final SHIP/NEEDS_WORK judgment │
│ Optimization: Multi-source routing, lean prompts │
└─────────────────────────────────────────────────────────────────────────────┘
2.2 Model Allocation by Phase
| Phase | Local Model | Local Role | Premium Model | Premium Role |
|---|---|---|---|---|
| Interview | GLM-4.7-Flash | Question drafting | Opus 4.5 | Complex decisions |
| Plan | GLM-4.7-Flash | Scout agents | Opus 4.5 | Epic spec review |
| Work | GLM-4.7-Flash | Re-anchor, assistance | Opus 4.5 / Codex 5.2 | Implementation |
| Review | GLM-4.7-Flash | Pre-screen, classify | Codex 5.2 / Opus 4.5 | Final verdict |
2.3 Data Flow
┌─────────────────────────────────────────────────────────────────────────────┐
│ STATE MANAGEMENT │
└─────────────────────────────────────────────────────────────────────────────┘
.flow/
├── epics/                 # Epic metadata (JSON)
├── specs/                 # Epic specs (Markdown)
│   └── fn-N-xxx.md
├── tasks/                 # Task metadata (JSON) and task specs (Markdown)
│   └── fn-N-xxx.M.md
├── config.json            # Project configuration
├── meta.json              # Flow metadata
│
├── review_state.json      # NEW: Fingerprints, blockers
├── quota_state.json       # NEW: Multi-source quota tracking
├── metrics.json           # NEW: Telemetry
│
├── memory/                # Existing memory system
│   ├── pitfalls.md
│   ├── conventions.md
│   └── decisions.md
│
└── cache/                 # NEW: Scout result caching
    ├── repo-scout/
    ├── practice-scout/
    └── docs-scout/
3. Resource Inventory
3.1 Premium Sources
| Source | Subscription | Models | Best For | Access |
|---|---|---|---|---|
| OpenAI Pro | $200/mo | Codex 5.2, GPT-4o | Security review | API |
| Claude Max | $200/mo | Opus 4.5, Sonnet 4 | Complex reasoning | API, Claude Code |
| Copilot Pro+ | $39/mo | Opus 4.5, Codex 5.2, Sonnet 4 | Flexible routing | Coding Agent |
| Antigravity | ~$25/mo | Opus 4.5, Gemini 3 Pro | Additional capacity | CLI |
3.2 Local Hardware
| Component | Specification | Role |
|---|---|---|
| GPU | NVIDIA RTX 5090 32GB VRAM | Model inference |
| RAM | 128GB DDR5 | Extended context |
| Storage | Gen5 NVMe 14.9 GB/s | Fast model loading |
3.3 Local Models
GLM-4.7-Flash (Primary)
| Attribute | Value |
|---|---|
| Model | GLM-4.7-Flash-UD-Q4_K_XL |
| SWE-Bench | 59.2% (best local) |
| Parameters | 30B total, ~3.6B active (MoE) |
| Context | Up to 200K tokens |
| VRAM | 23-30GB (context dependent) |
| Speed | 60-100 tok/s |
VRAM by Context:
| Context | VRAM | 7B Colocate? |
|---|---|---|
| 32K | ~20 GB | ✅ Yes |
| 65K | ~23 GB | ✅ Yes |
| 131K | ~30 GB | ❌ No |
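The colocation column follows from simple addition against the 32 GB card. A minimal sketch, using the approximate figures from this section; the `headroom_gb` parameter is a hypothetical safety margin and not part of the spec:

```python
# Sketch: reproduce the "7B Colocate?" column from the figures in this section.
GPU_VRAM_GB = 32          # RTX 5090, from Section 3.2
PREFILTER_VRAM_GB = 8     # Qwen2.5-Coder-7B q8_0, from Section 3.3
GLM_VRAM_GB = {32_768: 20, 65_536: 23, 131_072: 30}  # approximate, per the table


def can_colocate(context_tokens: int, headroom_gb: float = 0.0) -> bool:
    """True if GLM at this context plus the 7B pre-filter fits in VRAM.

    headroom_gb is a hypothetical safety margin, not part of the spec.
    """
    total = GLM_VRAM_GB[context_tokens] + PREFILTER_VRAM_GB + headroom_gb
    return total <= GPU_VRAM_GB


for ctx in GLM_VRAM_GB:
    print(ctx, can_colocate(ctx))
```

At 65K context the two models total about 31 GB, which leaves roughly 1 GB spare, so any extra headroom requirement would flip that row to "No".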
Qwen2.5-Coder-7B (Pre-filter)
| Attribute | Value |
|---|---|
| Model | qwen2.5-coder:7b-instruct-q8_0 |
| Context | 32K tokens |
| VRAM | ~8 GB |
| Speed | 100-120 tok/s |
| Role | Fast pre-filter |
4. Phase 1: Interview
4.1 Overview
The interview phase is human-paced, so its token burn is low by nature. Optimization focuses on question quality and spec drafting assistance.
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTERVIEW OPTIMIZATION │
└─────────────────────────────────────────────────────────────────────────────┘
Current: All questions generated by premium model
Optimized: Local generates question candidates, premium refines
Token Savings: 30-50% (interview is already low burn)
Quality Impact: None (human validates all questions)
4.2 Optimization Strategy
4.2.1 Local Question Generation
GLM-4.7-Flash generates initial question batches based on input:
def generate_interview_questions(
    input_type: str,  # epic, task, file, idea
    content: str,
    category: str,  # scope, technical, edge_cases, etc.
) -> list[str]:
    """Generate interview questions locally."""
    prompt = f"""Generate interview questions for a {input_type}.
## Content
{content}
## Category: {category}
## Question Guidelines
- Dig deep on hidden complexity
- Surface assumptions
- Identify edge cases
- Group related questions (2-4 per batch)
Generate 5-10 questions for this category.
"""
    return run_glm47_flash(prompt)
4.2.2 Premium Refinement
Premium model reviews and refines questions only when:
- Questions seem superficial
- Domain requires specialized knowledge
- User requests deeper exploration
4.3 Interview Flow (Optimized)
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTERVIEW FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
Input (epic/task/file/idea)
│
▼
┌─────────────────┐
│ GLM-4.7-Flash │ ← LOCAL: Parse input, identify question categories
│ Question Gen │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Human Q&A Loop │ ← Questions via AskUserQuestion tool
│ (40+ questions) │
└────────┬────────┘
│
▼
┌─────────────────┐
│ GLM-4.7-Flash │ ← LOCAL: Draft refined spec from answers
│ Spec Drafting │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Human Review │ ← User approves/edits
└────────┬────────┘
│
▼
┌─────────────────┐
│ Write to │ ← flowctl epic set-plan / task set-spec
│ .flow/specs/ │
└─────────────────┘
4.4 Configuration
{
  "interview": {
    "question_generation": {
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "fallback_to_premium": true
    },
    "spec_drafting": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "premium_refinement": {
      "enabled": true,
      "trigger": "user_request_or_shallow_questions"
    }
  }
}
5. Phase 2: Plan
5.1 Overview
Planning runs parallel scout agents that analyze the codebase. The current implementation runs every scout on premium models. The optimized pipeline runs most scouts locally and reserves premium models for the final spec pass.
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLAN OPTIMIZATION │
└─────────────────────────────────────────────────────────────────────────────┘
Current Scout Agents (all premium):
├── repo-scout - Codebase analysis
├── practice-scout - Best practices research
├── docs-scout - Documentation analysis
├── github-scout - GitHub issues/PRs
├── epic-scout - Epic dependencies
└── docs-gap-scout - Documentation gaps
Optimized:
├── repo-scout - LOCAL (GLM-4.7-Flash)
├── practice-scout - LOCAL (GLM-4.7-Flash)
├── docs-scout - LOCAL (GLM-4.7-Flash)
├── github-scout - PREMIUM (needs API access)
├── epic-scout - LOCAL (GLM-4.7-Flash)
└── docs-gap-scout - LOCAL (GLM-4.7-Flash)
Token Savings: 60-70% of planning phase
Quality Impact: Minimal (scouts gather info, premium synthesizes)
5.2 Scout Agent Optimization
5.2.1 Local Scout Execution
from pathlib import Path


class LocalScoutRunner:
    """Run scout agents locally via GLM-4.7-Flash."""

    def __init__(self, glm_backend: LlamaCppBackend):
        self.glm = glm_backend
        self.cache_dir = Path(".flow/cache")

    def run_repo_scout(self, feature_description: str) -> ScoutResult:
        """Analyze codebase for relevant patterns."""
        # Check cache first
        cache_key = self._cache_key("repo", feature_description)
        if cached := self._get_cache(cache_key):
            return cached
        # Gather codebase info
        file_tree = self._get_file_tree()
        recent_commits = self._get_recent_commits(30)
        prompt = f"""Analyze this codebase for implementing: {feature_description}
## File Structure
{file_tree}
## Recent Commits
{recent_commits}
## Tasks
1. Identify files likely to be modified
2. Find similar patterns in the codebase
3. Note relevant imports/dependencies
4. Flag potential conflicts
Output structured findings.
"""
        result = self.glm.generate(prompt)
        self._set_cache(cache_key, result)
        return result

    def run_practice_scout(self, feature_description: str) -> ScoutResult:
        """Research best practices for the feature."""
        prompt = f"""Research best practices for: {feature_description}
Consider:
1. Common implementation patterns
2. Security considerations
3. Performance implications
4. Testing strategies
5. Error handling approaches
Base recommendations on industry standards and common patterns.
"""
        return self.glm.generate(prompt)

    def run_epic_scout(self, new_epic_id: str) -> ScoutResult:
        """Find dependencies on existing epics."""
        # Read all existing epics
        epics = self._load_all_epics()
        prompt = f"""Analyze dependencies for new epic {new_epic_id}.
## Existing Epics
{self._format_epics(epics)}
## New Epic
{self._load_epic(new_epic_id)}
Identify:
1. Which existing epics this depends on
2. Specific tasks/components used
3. Potential conflicts
"""
        return self.glm.generate(prompt)
5.2.2 Scout Result Caching
import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


class ScoutCache:
    """Cache scout results to avoid redundant analysis."""

    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.ttl_hours = 24  # Results valid for 24 hours

    def get(self, scout_type: str, key: str) -> Optional[ScoutResult]:
        cache_file = self.cache_dir / scout_type / f"{self._hash(key)}.json"
        if not cache_file.exists():
            return None
        data = json.loads(cache_file.read_text())
        # Check TTL
        cached_at = datetime.fromisoformat(data["cached_at"])
        if datetime.now() - cached_at > timedelta(hours=self.ttl_hours):
            return None
        # Check if codebase changed significantly
        cached_hash = data.get("git_hash")
        if cached_hash is None:
            return None  # No recorded commit to diff against; treat as stale
        if cached_hash != self._current_git_hash():
            # Invalidate if significant changes
            changed_files = self._get_changed_files(cached_hash)
            if self._significant_changes(changed_files, data.get("relevant_files", [])):
                return None
        return ScoutResult.from_dict(data["result"])

    def set(self, scout_type: str, key: str, result: ScoutResult):
        cache_file = self.cache_dir / scout_type / f"{self._hash(key)}.json"
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        data = {
            "cached_at": datetime.now().isoformat(),
            "git_hash": self._current_git_hash(),
            "relevant_files": result.relevant_files,
            "result": result.to_dict(),
        }
        cache_file.write_text(json.dumps(data, indent=2))
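The cache class calls `_hash` and `_significant_changes` without defining them. One possible shape for those helpers is sketched below as standalone functions; the names, the 16-character digest length, and the `max_unrelated` threshold are all illustrative assumptions rather than values from this spec.

```python
import hashlib


def cache_hash(key: str) -> str:
    """Stable, filename-safe digest for a cache key (length is arbitrary)."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


def significant_changes(
    changed_files: list[str],
    relevant_files: list[str],
    max_unrelated: int = 50,  # hypothetical threshold, not from the spec
) -> bool:
    """Invalidate when a relevant file changed, or when overall churn is large."""
    if set(changed_files) & set(relevant_files):
        return True
    return len(changed_files) > max_unrelated


print(cache_hash("repo:add login rate limiting"))
print(significant_changes(["src/api.py"], ["src/api.py", "src/db.py"]))
```

Treating any overlap with the scout's `relevant_files` as significant keeps the check cheap; a stricter policy could also weigh how much of each file changed.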
5.3 Planning Flow (Optimized)
┌─────────────────────────────────────────────────────────────────────────────┐
│ PLANNING FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
Input (feature idea or epic ID)
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PARALLEL SCOUT EXECUTION │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ repo-scout │ │ practice- │ │ docs-scout │ │ epic-scout │ │
│ │ LOCAL │ │ scout LOCAL │ │ LOCAL │ │ LOCAL │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │ │
│ └────────────────┴────────────────┴────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Scout Results │ │
│ │ Aggregation │ │
│ └────────┬────────┘ │
└───────────────────────────────────┼─────────────────────────────────────────┘
│
▼
┌─────────────────┐
│ GLM-4.7-Flash │ ← LOCAL: Draft epic spec
│ Spec Drafting │
└────────┬────────┘
│
▼
┌─────────────────┐
│ PREMIUM │ ← Opus 4.5: Architectural review
│ Spec Review │ (optional, on request)
└────────┬────────┘
│
▼
┌─────────────────┐
│ Task Breakdown │ ← LOCAL: Create task specs
│ GLM-4.7-Flash │
└────────┬────────┘
│
▼
┌─────────────────┐
│ Write to .flow/ │
└─────────────────┘
5.4 Plan Review Optimization
Plan reviews follow the same optimization as impl reviews:
def review_plan(epic_id: str) -> ReviewResult:
    """Review epic plan with optimized pipeline."""
    # 1. Local pre-screen
    prescreen = local_prescreen_plan(epic_id)
    if prescreen.verdict == "NEEDS_WORK":
        return prescreen  # Fix locally
    # 2. Classify complexity
    complexity = classify_plan_complexity(epic_id)
    # 3. Route to appropriate premium
    if complexity == "SIMPLE":
        # Sonnet 4 or Gemini 3 Pro
        return route_to_cheapest_premium(epic_id, "plan")
    else:
        # Opus 4.5 for complex architectural review
        return route_to_opus(epic_id, "plan")
5.5 Configuration
{
  "plan": {
    "scouts": {
      "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
      "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
      "docs_scout": {"backend": "local", "cache_ttl_hours": 24},
      "github_scout": {"backend": "premium", "model": "sonnet-4"},
      "epic_scout": {"backend": "local", "cache_ttl_hours": 24},
      "docs_gap_scout": {"backend": "local", "cache_ttl_hours": 24}
    },
    "spec_drafting": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "task_breakdown": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "plan_review": {
      "simple": {"backend": "sonnet-4"},
      "complex": {"backend": "opus-4.5"}
    }
  }
}
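The configuration gives each scout its own `cache_ttl_hours`, while the `ScoutCache` class in 5.2.2 hard-codes 24 hours. A minimal sketch of how a per-scout TTL could be read from this structure; `scout_ttl`, `is_fresh`, and the 24-hour default for scouts without a TTL are assumptions for illustration.

```python
from datetime import datetime, timedelta

# Subset of the plan configuration shown above.
PLAN_CONFIG = {"plan": {"scouts": {
    "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
    "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
    "github_scout": {"backend": "premium", "model": "sonnet-4"},
}}}


def scout_ttl(config: dict, scout: str, default_hours: int = 24) -> timedelta:
    """Look up a scout's cache TTL, falling back to an assumed default."""
    scout_cfg = config.get("plan", {}).get("scouts", {}).get(scout, {})
    return timedelta(hours=scout_cfg.get("cache_ttl_hours", default_hours))


def is_fresh(cached_at: datetime, ttl: timedelta, now: datetime) -> bool:
    """A cache entry is usable while its age does not exceed the TTL."""
    return now - cached_at <= ttl


now = datetime(2025, 1, 8, 12, 0)
three_days_ago = now - timedelta(days=3)
print(is_fresh(three_days_ago, scout_ttl(PLAN_CONFIG, "practice_scout"), now))
print(is_fresh(three_days_ago, scout_ttl(PLAN_CONFIG, "repo_scout"), now))
```

The longer TTL for `practice_scout` presumably reflects that best-practice research changes more slowly than the repository itself.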
6. Phase 3: Work
6.1 Overview
The Work phase is the second-largest token consumer, after Review, because of implementation complexity. Today, premium models handle every worker step with full context. The optimized flow uses local models for re-anchoring and routine implementation, while premium models handle complex decisions.
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORK OPTIMIZATION │
└─────────────────────────────────────────────────────────────────────────────┘
Current Worker Flow:
Premium spawns worker → Premium implements → Premium reviews
Optimized Worker Flow:
Local re-anchor → Local/Premium implements (based on complexity) → Local pre-screen → Premium reviews
Token Savings: 40-60% of work phase
Quality Impact: Minimal (premium for complex decisions)
6.2 Worker Agent Optimization
6.2.1 Complexity-Based Model Selection
class WorkerModelSelector:
    """Select model for worker based on task complexity."""

    def select_model(self, task_id: str) -> ModelConfig:
        task = load_task(task_id)
        epic = load_epic(task.epic_id)
        # Analyze task complexity
        complexity = self.analyze_complexity(task, epic)
        if complexity.level == "LOW":
            # Simple tasks: Local GLM-4.7-Flash
            return ModelConfig(
                backend="local",
                model="GLM-4.7-Flash",
                reason="Simple task, local model sufficient"
            )
        elif complexity.level == "MEDIUM":
            # Moderate tasks: Hybrid approach
            return ModelConfig(
                backend="hybrid",
                local_model="GLM-4.7-Flash",
                premium_model="sonnet-4",
                strategy="local_first_premium_fallback",
                reason="Moderate complexity, try local first"
            )
        else:  # HIGH
            # Complex tasks: Premium model
            return ModelConfig(
                backend="premium",
                model=self.select_premium_model(complexity),
                reason="Complex task requires premium reasoning"
            )

    def analyze_complexity(self, task: Task, epic: Epic) -> TaskComplexity:
        """Analyze task complexity for model selection."""
        indicators = {
            "security_sensitive": self.is_security_sensitive(task),
            "architectural": self.is_architectural(task),
            "concurrency": self.involves_concurrency(task),
            "unfamiliar_patterns": self.uses_unfamiliar_patterns(task),
            "large_scope": task.size in ["L", "XL"],
            "many_files": len(task.estimated_files) > 5,
            "cross_module": self.is_cross_module(task),
        }
        high_indicators = sum(indicators.values())
        if high_indicators >= 3 or indicators["security_sensitive"]:
            return TaskComplexity(level="HIGH", indicators=indicators)
        elif high_indicators >= 1:
            return TaskComplexity(level="MEDIUM", indicators=indicators)
        else:
            return TaskComplexity(level="LOW", indicators=indicators)
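To make the thresholds concrete, the same rule can be pulled out as a standalone function and applied to a few indicator sets. This is a sketch of the rule as written above; `complexity_level` is a hypothetical name, and it returns the level string directly instead of a `TaskComplexity` object.

```python
def complexity_level(indicators: dict[str, bool]) -> str:
    """Apply the rule above: security or 3+ indicators is HIGH, 1-2 is MEDIUM."""
    hits = sum(indicators.values())
    if hits >= 3 or indicators.get("security_sensitive", False):
        return "HIGH"
    if hits >= 1:
        return "MEDIUM"
    return "LOW"


examples = {
    "rename a helper": {},
    "large refactor": {"large_scope": True},
    "touch login flow": {"security_sensitive": True},
    "cross-module queue": {"concurrency": True, "cross_module": True, "many_files": True},
}
for name, flags in examples.items():
    print(f"{name}: {complexity_level(flags)}")
```

Note that a single `security_sensitive` flag is enough to force HIGH, which matches the principle that security decisions always go to a premium model.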
6.2.2 Local Re-Anchor Summaries
Workers re-anchor before implementation. Instead of running the full re-anchor on a premium model, the local model creates a focused summary:
def local_reanchor(task_id: str, epic_id: str) -> ReanchorContext:
    """Create re-anchor context using local model."""
    # Read specs
    task_spec = flowctl_cat(task_id)
    epic_spec = flowctl_cat(epic_id)
    # Read memory
    memory = read_memory_files()
    # Check git state
    git_status = run("git status")
    git_log = run("git log -5 --oneline")
    # Local model creates focused summary
    prompt = f"""Create implementation context summary for task {task_id}.
## Task Spec
{task_spec}
## Epic Context
{epic_spec}
## Memory (pitfalls, conventions)
{memory}
## Git State
{git_status}
{git_log}
Create a focused summary:
1. Key acceptance criteria (bullet points)
2. Files likely to modify
3. Relevant patterns from memory
4. Dependencies to be aware of
5. Test requirements
"""
    summary = run_glm47_flash(prompt)
    return ReanchorContext(
        task_id=task_id,
        summary=summary,
        task_spec=task_spec,  # Full spec still available
        epic_spec=epic_spec,
    )
6.2.3 Hybrid Implementation Strategy
class HybridWorker:
    """Worker that uses local for routine code, premium for complex decisions."""

    def implement(self, task_id: str, context: ReanchorContext) -> ImplementResult:
        model_config = self.model_selector.select_model(task_id)
        if model_config.backend == "local":
            return self.implement_local(task_id, context)
        elif model_config.backend == "hybrid":
            # Try local first
            result = self.implement_local(task_id, context)
            if result.confidence < 0.8 or result.needs_premium_decision:
                # Escalate to premium for specific decisions
                return self.escalate_to_premium(task_id, context, result)
            return result
        else:  # premium
            return self.implement_premium(task_id, context, model_config)

    def implement_local(self, task_id: str, context: ReanchorContext) -> ImplementResult:
        """Implement using GLM-4.7-Flash."""
        prompt = f"""Implement task {task_id}.
## Context
{context.summary}
## Full Task Spec (reference)
{context.task_spec}
## Instructions
1. Read relevant code files
2. Implement changes following existing patterns
3. Add tests if spec requires
4. Keep changes focused and minimal
If you encounter:
- Security-sensitive decisions → Flag for premium review
- Unfamiliar patterns → Flag for premium assistance
- Architectural choices → Flag for premium decision
Output your implementation plan, then implement.
"""
        return self.glm.implement(prompt)
6.3 Work Flow (Optimized)
┌─────────────────────────────────────────────────────────────────────────────┐
│ WORK FLOW │
└─────────────────────────────────────────────────────────────────────────────┘
/flow-next:work fn-1
│
▼
┌─────────────────┐
│ Setup Questions │
│ (branch, review)│
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TASK LOOP │
│ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ flowctl ready --epic fn-1 → next task │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ SPAWN WORKER (subagent) │ │
│ │ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ LOCAL RE-ANCHOR │ ← GLM-4.7-Flash creates focused summary │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ CLASSIFY TASK │ ← Determine LOCAL / HYBRID / PREMIUM │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ┌─────┴─────┐ │ │
│ │ ▼ ▼ │ │
│ │ ┌──────┐ ┌──────────┐ │ │
│ │ │LOCAL │ │PREMIUM │ │ │
│ │ │impl │ │impl │ │ │
│ │ └──┬───┘ └────┬─────┘ │ │
│ │ │ │ │ │
│ │ └─────┬─────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ COMMIT │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ REVIEW │ ← See Phase 4 │ │
│ │ │ (if enabled) │ │ │
│ │ └────────┬────────┘ │ │
│ │ │ │ │
│ │ ▼ │ │
│ │ ┌─────────────────┐ │ │
│ │ │ COMPLETE │ ← flowctl done │ │
│ │ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ Next task or Quality phase │
└──────────────────────────────────��──────────────────────────────────────────┘
6.4 Configuration
{
  "work": {
    "model_selection": {
      "strategy": "complexity_based",
      "low_complexity": {
        "backend": "local",
        "model": "GLM-4.7-Flash"
      },
      "medium_complexity": {
        "backend": "hybrid",
        "local_model": "GLM-4.7-Flash",
        "premium_model": "sonnet-4",
        "strategy": "local_first_premium_fallback"
      },
      "high_complexity": {
        "backend": "premium",
        "security": "codex-5.2",
        "architectural": "opus-4.5",
        "default": "opus-4.5"
      }
    },
    "reanchor": {
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "include_memory": true
    },
    "worker": {
      "subagent": true,
      "fresh_context": true
    }
  }
}
7. Phase 4: Review
7.1 Overview
Review phase has the highest token burn due to:
- Full context sent to premium models
- Multiple review iterations per task
- Re-reviewing unchanged code
- Verbose prompts
This is the primary optimization target.
7.2 Review Pipeline
┌─────────────────────────────────────────────────────────────────────────────┐
│ REVIEW PIPELINE │
└─────────────────────────────────────────────────────────────────────────────┘
Review Request
│
▼
┌─────────────────┐
│ STAGE 1: GATES │ ← FREE (deterministic)
│ lint/build/tsc │
└────────┬────────┘
│
▼
┌─────────────────┐
│ STAGE 2: │ ← FREE (GLM-4.7-Flash)
│ PRE-SCREEN │
│ │
│ Find obvious │
│ issues, fix │
│ locally │
└────────┬────────┘
│
▼
┌───────��─────────┐
│ STAGE 3: │ ← FREE (SHA-256)
│ FINGERPRINT │
│ │
│ Skip if diff │
│ unchanged │
└────────┬────────┘
│
▼
┌─────────────────┐
│ STAGE 4: │ ← FREE (GLM-4.7-Flash)
│ CLASSIFY │
│ │
│ SECURITY / │
│ COMPLEX / │
│ ROUTINE │
└────────┬────────┘
│
▼
┌─────────────────┐
│ STAGE 5: │ ← Multi-source pooling
│ ROUTE │
│ │
│ Select best │
│ available src │
└────────┬────────┘
│
┌───────────────────────┼───────────────────────┐
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ SECURITY │ │ COMPLEX │ │ ROUTINE │
│ │ │ │ │ │
│ Codex 5.2 │ │ Opus 4.5 │ │ Sonnet 4 │
│ preferred │ │ preferred │ │ or lowest │
└─────┬─────┘ └─────┬─────┘ └─────┬─────┘
│ │ │
└──────────────────────┴──────────────────────┘
│
▼
┌─────────────────┐
│ STAGE 6: │
│ PREMIUM REVIEW │
│ │
│ Full context │
│ Lean prompt │
│ Final verdict │
└────────┬────────┘
│
▼
SHIP / NEEDS_WORK
7.3 Stage Details
See the detailed specifications in Section 6 of the earlier review-focused spec. Key points:
7.3.1 Stage 2: Pre-Screen
GLM-4.7-Flash catches obvious issues:
- Missing imports
- Undefined variables
- Obvious null access
- Logic errors
- Missing returns
Expected catch rate: 40-60% of NEEDS_WORK issues
7.3.2 Stage 3: Fingerprint
Prevent re-reviewing unchanged code:
fingerprint = sha256(normalize(git diff base..HEAD))
# Skip if:
# - Same fingerprint as last NEEDS_WORK
# - Blocker files not touched
Expected skip rate: 20-40% of reviews
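The fingerprint pseudocode can be sketched in Python as follows. The normalization step (dropping `index` and `@@` lines and trailing whitespace) is an illustrative choice, since the spec does not say what `normalize` does. The skip rule is one reading of the two conditions above: skip when the fingerprint matches the last NEEDS_WORK verdict, or when recorded blocker files exist and none of them were touched.

```python
import hashlib
from typing import Optional


def normalize_diff(diff: str) -> str:
    """Drop diff lines that can change while the code itself does not."""
    kept = []
    for line in diff.splitlines():
        if line.startswith("index ") or line.startswith("@@"):
            continue  # blob hashes and hunk offsets shift on rebase
        kept.append(line.rstrip())
    return "\n".join(kept)


def fingerprint(diff: str) -> str:
    """SHA-256 of the normalized diff text."""
    return hashlib.sha256(normalize_diff(diff).encode("utf-8")).hexdigest()


def should_skip_review(
    diff: str,
    last_needs_work_fp: Optional[str],
    blocker_files: set[str],
    changed_files: set[str],
) -> bool:
    """Decide whether a premium review would only repeat the last verdict."""
    if last_needs_work_fp is None:
        return False  # no previous NEEDS_WORK verdict to compare against
    if fingerprint(diff) == last_needs_work_fp:
        return True   # nothing changed since the last rejection
    # Something changed, but not in any file that carried a blocker.
    return bool(blocker_files) and not (blocker_files & changed_files)
```

In practice `diff` would come from `git diff base..HEAD`, and the previous fingerprint and blocker list from `.flow/review_state.json`.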
7.3.3 Stage 4: Classification
Route to appropriate premium model:
| Type | Patterns | Model |
|---|---|---|
| SECURITY | auth, crypto, password, token | Codex 5.2 |
| COMPLEX | core, architecture, concurrent | Opus 4.5 |
| ROUTINE | tests, docs, simple fixes | Sonnet 4 |
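A deterministic first pass over changed paths could apply the pattern column directly, before or alongside the GLM-4.7-Flash classifier. The sketch below assumes simple case-insensitive substring matching on file paths, with SECURITY taking priority over COMPLEX; both choices are assumptions, and substring matching will over-match (for example `tokenizer.py` contains `token`).

```python
# Keywords and model names are taken from the classification table.
SECURITY_KEYWORDS = ("auth", "crypto", "password", "token")
COMPLEX_KEYWORDS = ("core", "architecture", "concurrent")
MODEL_FOR = {"SECURITY": "Codex 5.2", "COMPLEX": "Opus 4.5", "ROUTINE": "Sonnet 4"}


def classify_change(paths: list[str]) -> str:
    """Return the most sensitive class matched by any changed path."""
    lowered = [p.lower() for p in paths]
    if any(k in p for p in lowered for k in SECURITY_KEYWORDS):
        return "SECURITY"
    if any(k in p for p in lowered for k in COMPLEX_KEYWORDS):
        return "COMPLEX"
    return "ROUTINE"


for changed in (
    ["src/auth/login.py", "README.md"],
    ["src/core/engine.py"],
    ["tests/test_utils.py", "docs/guide.md"],
):
    kind = classify_change(changed)
    print(changed, "->", kind, "->", MODEL_FOR[kind])
```

Erring toward SECURITY on a false match costs some premium quota but never downgrades a sensitive review.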
7.3.4 Stage 5: Quota-Aware Routing
Select from 4 premium sources based on availability:
| Source | Models | Status Tracking |
|---|---|---|
| OpenAI Pro | Codex 5.2, GPT-4o | .flow/quota_state.json |
| Claude Max | Opus 4.5, Sonnet 4 | .flow/quota_state.json |
| Copilot | Opus, Codex, Sonnet | .flow/quota_state.json |
| Antigravity | Opus, Gemini 3 Pro | .flow/quota_state.json |
7.3.5 Stage 6: Premium Review
CRITICAL: No lossy compression. Premium sees full context:
- Changed files (full content)
- Direct imports (full content)
- Test files (full content)
- Diff (full)
Savings come from:
- Pre-screening (fewer reviews)
- Fingerprinting (skip unchanged)
- Lean prompts (~500 tokens vs 3000)
- Multi-source pooling (capacity)
7.4 Configuration
{
  "review": {
    "gates": {
      "enabled": true,
      "checks": ["lint", "build", "typecheck"]
    },
    "prescreen": {
      "enabled": true,
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "max_iterations": 3
    },
    "fingerprint": {
      "enabled": true,
      "skip_unchanged": true,
      "require_blocker_change": true
    },
    "classification": {
      "security_patterns": ["**/auth/**", "**/crypto/**", "**/*password*"],
      "complex_patterns": ["**/core/**", "**/architect*"]
    },
    "routing": {
      "security": ["openai_pro:codex-5.2", "copilot:codex-5.2"],
      "complex": ["claude_max:opus-4.5", "copilot:opus-4.5"],
      "routine": ["quota_lowest", "copilot:sonnet-4"]
    },
    "premium_sources": {
      "openai_pro": {"enabled": true, "api_key_env": "OPENAI_API_KEY"},
      "claude_max": {"enabled": true, "api_key_env": "ANTHROPIC_API_KEY"},
      "copilot": {"enabled": true, "access_method": "coding_agent"},
      "antigravity": {"enabled": true, "access_method": "cli"}
    }
  }
}
8. Local Model Infrastructure
8.1 Architecture
┌─────────────────────────────────────────────────────────────────────────────┐
│ LOCAL MODEL INFRASTRUCTURE │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ llama-swap (Port 8080) │
│ │
│ Proxy that manages model loading/unloading for llama.cpp │
│ - Routes requests to appropriate model │
│ - Frees GPU memory when idle │
│ - Enables model hot-swapping │
│ │
│ Endpoints: │
│ /v1/chat/completions → Anthropic/OpenAI compatible │
│ /completion → llama.cpp native │
│ /health → Health check │
│ /swap → Model swap control │
└─────────────────────────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ llama-server │ │ Ollama (Port 11434) │
│ GLM-4.7-Flash │ │ Qwen2.5-Coder-7B │
│ │ │ │
│ Context: 65K default │ │ Always hot (keep_alive) │
│ VRAM: ~23GB │ │ VRAM: ~8GB │
│ Speed: 60-100 tok/s │ │ Speed: 100+ tok/s │
└─────────────────────────┘ └─────────────────────────┘
8.2 llama-swap Configuration
Based on the tammam.io guide, llama-swap enables:
- Model hot-swapping without restart
- Memory management (free GPU when idle)
- Anthropic API compatibility (for Claude Code integration)
# llama-swap.yaml
listen: 0.0.0.0:8080
models:
  glm-4.7-flash:
    cmd: >
      llama-server
      -m /models/GLM-4.7-Flash-UD-Q4_K_XL.gguf
      -ngl 99
      -c 65536
      --cache-type-k q4_0
      --cache-type-v q4_0
      -fa
      --mmap
      -t 8
    ttl: 300  # Unload after 5 min idle
  glm-4.7-flash-large:
    cmd: >
      llama-server
      -m /models/GLM-4.7-Flash-UD-Q4_K_XL.gguf
      -ngl 99
      -c 131072
      --cache-type-k q4_0
      --cache-type-v q4_0
      -fa
      --mmap
      -t 8
    ttl: 300
aliases:
  claude-3-5-sonnet-20241022: glm-4.7-flash  # Claude Code compatibility
  gpt-4: glm-4.7-flash  # OpenAI compatibility
healthcheck:
  interval: 30
  timeout: 5
8.3 Startup Script
#!/bin/bash
# start-local-stack.sh
set -e
MODELS_DIR="${MODELS_DIR:-$HOME/models}"
LLAMA_SWAP_CONFIG="${LLAMA_SWAP_CONFIG:-./llama-swap.yaml}"
echo "=== Starting Local Model Stack ==="
# 1. Start Ollama for 7B pre-filter
echo "[1/3] Starting Ollama..."
if ! pgrep -x "ollama" > /dev/null; then
    ollama serve &
    sleep 2
fi
# Pull and warm up 7B model
ollama pull qwen2.5-coder:7b-instruct-q8_0 2>/dev/null || true
curl -s http://localhost:11434/api/generate -d '{
    "model": "qwen2.5-coder:7b-instruct-q8_0",
    "prompt": "ready",
    "keep_alive": "24h"
}' > /dev/null
echo " ✓ Ollama ready (7B pre-filter, ~8GB VRAM)"
# 2. Start llama-swap
echo "[2/3] Starting llama-swap..."
if ! pgrep -x "llama-swap" > /dev/null; then
    llama-swap -c "$LLAMA_SWAP_CONFIG" &
    sleep 2
fi
# Wait for llama-swap
until curl -s http://localhost:8080/health > /dev/null 2>&1; do
    sleep 1
done
echo " ✓ llama-swap ready (port 8080)"
# 3. Pre-warm GLM-4.7-Flash
echo "[3/3] Pre-warming GLM-4.7-Flash..."
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "ready"}], "max_tokens": 1}' \
    > /dev/null
echo " ✓ GLM-4.7-Flash ready (~23GB VRAM)"
echo ""
echo "=== Local Stack Ready ==="
echo " Pre-filter (7B): http://localhost:11434"
echo " GLM-4.7-Flash: http://localhost:8080"
echo ""
echo "VRAM Usage: ~31GB (7B: 8GB + GLM: 23GB)"
8.4 Integration with Claude Code
Using the Anthropic-compatible endpoint from llama-swap:
# Set environment for Claude Code to use local model
export ANTHROPIC_BASE_URL=http://localhost:8080/v1
export ANTHROPIC_API_KEY=local # Dummy key for llama-swap
# Now Claude Code uses local GLM-4.7-Flash
claude "Review this code..."
8.5 Integration with Codex CLI
# Codex can also use the OpenAI-compatible endpoint
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local
# For specific commands, use local
codex --api-base http://localhost:8080/v1 "..."
9. Multi-Source Orchestration
9.1 Quota Tracking
from datetime import datetime
from pathlib import Path


class QuotaTracker:
    """Track quota usage across all premium sources."""

    def __init__(self, state_file: Path = Path(".flow/quota_state.json")):
        self.state_file = state_file
        self.load()

    def get_status(self, source: str) -> QuotaStatus:
        """Get current quota status for a source."""
        state = self.state.get("sources", {}).get(source, {})
        return QuotaStatus(
            source=source,
            remaining_pct=state.get("remaining_pct", 1.0),
            is_throttled=state.get("is_throttled", False),
            last_request=datetime.fromisoformat(state.get("last_request", "1970-01-01")),
            requests_since_reset=state.get("requests_since_reset", 0),
        )

    def record_usage(self, source: str, tokens: int, was_throttled: bool = False):
        """Record usage for quota tracking."""
        if "sources" not in self.state:
            self.state["sources"] = {}
        if source not in self.state["sources"]:
            self.state["sources"][source] = {
                "remaining_pct": 1.0,
                "requests_since_reset": 0,
                "tokens_since_reset": 0,
            }
        s = self.state["sources"][source]
        s["last_request"] = datetime.now().isoformat()
        s["requests_since_reset"] = s.get("requests_since_reset", 0) + 1
        s["tokens_since_reset"] = s.get("tokens_since_reset", 0) + tokens
        s["is_throttled"] = was_throttled
        # Estimate remaining (heuristic based on usage patterns)
        s["remaining_pct"] = self._estimate_remaining(source, s)
        self.save()

    def get_best_for_task_type(self, task_type: str) -> list[tuple[str, str]]:
        """Get best available sources for task type, sorted by preference."""
        routing = self._get_routing_config()
        candidates = routing.get(task_type, [])
        available = []
        for candidate in candidates:
            if candidate == "quota_lowest":
                # Special: select source with most remaining quota
                best = self._get_highest_remaining()
                if best:
                    available.append(best)
            else:
                # Routing entries have the form "source:model"
                source, model = candidate.split(":", 1)
                status = self.get_status(source)
                # Skip sources that are throttled or have 5% or less quota left
                if not status.is_throttled and status.remaining_pct > 0.05:
                    available.append((source, model))
        return available
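The tracker above relies on a QuotaStatus type and on two helpers, _estimate_remaining and _get_highest_remaining, that are not shown. One way they might look is sketched below as free functions. The per-source token budgets are hypothetical placeholders (real limits are not given in this spec), and treating a throttled response as "at most 5% left" is an assumption chosen to line up with the 0.05 cutoff used in routing.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class QuotaStatus:
    """Snapshot of one source's quota state (fields mirror get_status)."""
    source: str
    remaining_pct: float  # fraction in [0.0, 1.0], despite the name
    is_throttled: bool
    last_request: datetime
    requests_since_reset: int


# Hypothetical per-reset token budgets; replace with observed limits.
ASSUMED_TOKEN_BUDGET = {
    "openai_pro": 5_000_000,
    "claude_max": 5_000_000,
    "copilot": 2_000_000,
    "antigravity": 2_000_000,
}


def estimate_remaining(source: str, state: dict) -> float:
    """Estimate the remaining quota fraction from tokens used since reset."""
    budget = ASSUMED_TOKEN_BUDGET.get(source)
    if not budget:
        return state.get("remaining_pct", 1.0)  # unknown source: keep prior estimate
    used = state.get("tokens_since_reset", 0)
    remaining = max(0.0, 1.0 - used / budget)
    # A throttled response suggests the pool is nearly empty.
    if state.get("is_throttled"):
        remaining = min(remaining, 0.05)
    return remaining


def highest_remaining(sources: dict, default_models: dict) -> Optional[tuple]:
    """Return (source, model) for the unthrottled source with the most quota left."""
    usable = {
        name: s for name, s in sources.items()
        if not s.get("is_throttled") and name in default_models
    }
    if not usable:
        return None
    best = max(usable, key=lambda name: usable[name].get("remaining_pct", 1.0))
    return best, default_models[best]
```

Because the estimate is only a heuristic, the throttle flag acts as the ground-truth signal: once a source reports throttling, it drops to the cutoff regardless of the token count.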
9.2 Backend Implementations
9.2.1 OpenAI Backend
import os

from openai import OpenAI


class OpenAIBackend(PremiumBackend):
    """OpenAI Pro backend (Codex 5.2, GPT-4o)."""

    source_name = "openai_pro"

    def __init__(self):
        self.api_key = os.environ.get("OPENAI_API_KEY")
        self.client = OpenAI(api_key=self.api_key)

    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        # Security reviews go to the stronger model; everything else to gpt-4o
        model = "gpt-5.2" if task_type == "security" else "gpt-4o"
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": context.prompt}],
            temperature=0.1,
            max_tokens=2048,
        )
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=response.choices[0].message.content,
            tokens_input=response.usage.prompt_tokens,
            tokens_output=response.usage.completion_tokens,
        )
9.2.2 Claude Backend
import os

from anthropic import Anthropic


class ClaudeBackend(PremiumBackend):
    """Claude Max backend (Opus 4.5, Sonnet 4)."""

    source_name = "claude_max"

    def __init__(self):
        self.api_key = os.environ.get("ANTHROPIC_API_KEY")
        self.client = Anthropic(api_key=self.api_key)

    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        # Opus for security and complex reviews, Sonnet for routine ones
        if task_type in ["security", "complex"]:
            model = "claude-opus-4-5-20250514"
        else:
            model = "claude-sonnet-4-20250514"
        response = self.client.messages.create(
            model=model,
            max_tokens=2048,
            messages=[{"role": "user", "content": context.prompt}],
        )
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=response.content[0].text,
            tokens_input=response.usage.input_tokens,
            tokens_output=response.usage.output_tokens,
        )
9.2.3 Copilot Backend
class CopilotBackend(PremiumBackend):
    """GitHub Copilot Pro+ backend."""

    source_name = "copilot"

    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        # Copilot accessed via coding agent
        model = self._select_model(task_type)
        # Use GitHub Copilot Coding Agent API
        result = self._invoke_coding_agent(context, model)
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=result.content,
            tokens_input=result.tokens_input,
            tokens_output=result.tokens_output,
        )

    def _select_model(self, task_type: str) -> str:
        if task_type == "security":
            return "codex-5.2"
        elif task_type == "complex":
            return "claude-opus-4-5"
        else:
            return "claude-sonnet-4"
9.2.4 Antigravity Backend
class AntigravityBackend(PremiumBackend):
    """Google Antigravity backend (Opus 4.5, Gemini 3 Pro)."""

    source_name = "antigravity"

    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        model = "claude-opus-4-5" if task_type in ["security", "complex"] else "gemini-3-pro"
        # Use Antigravity CLI
        result = self._invoke_antigravity(context, model)
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=result.content,
            tokens_input=result.tokens_input,
            tokens_output=result.tokens_output,
        )
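The four backends share one interface, so an orchestrator can walk the routed candidate list and fall through to the next source when one is throttled. A minimal sketch of that loop follows. It simplifies the interface to plain callables taking (model, prompt); QuotaExhausted and review_with_fallback are hypothetical names, and the record_usage argument stands in for QuotaTracker.record_usage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ReviewResult:
    source: str
    model: str
    content: str
    tokens_input: int
    tokens_output: int


class QuotaExhausted(Exception):
    """Hypothetical error a backend raises when its source is rate-limited."""


def review_with_fallback(
    candidates: list,      # [(source, model), ...] in preference order
    backends: dict,        # source name -> callable(model, prompt) -> ReviewResult
    prompt: str,
    record_usage: Callable[[str, int, bool], None],
) -> ReviewResult:
    """Try each (source, model) candidate in order until one succeeds."""
    for source, model in candidates:
        call = backends.get(source)
        if call is None:
            continue  # source not configured on this machine
        try:
            result = call(model, prompt)
        except QuotaExhausted:
            # Mark the source throttled so routing skips it next time.
            record_usage(source, 0, True)
            continue
        record_usage(source, result.tokens_input + result.tokens_output, False)
        return result
    raise RuntimeError("no premium source available for this review")
```

Raising when every candidate fails, rather than silently falling back to the local model, keeps the principle that premium models make the final decision.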
10. Configuration Reference
10.1 Complete .flow/config.json
{
  "schema_version": 2,
  "interview": {
    "question_generation": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "spec_drafting": {
      "backend": "local"
    }
  },
  "plan": {
    "scouts": {
      "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
      "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
      "docs_scout": {"backend": "local", "cache_ttl_hours": 24},
      "github_scout": {"backend": "premium", "model": "sonnet-4"},
      "epic_scout": {"backend": "local"},
      "docs_gap_scout": {"backend": "local"}
    },
    "spec_drafting": {"backend": "local"},
    "task_breakdown": {"backend": "local"}
  },
  "work": {
    "model_selection": {
      "strategy": "complexity_based",
      "low_complexity": {"backend": "local"},
      "medium_complexity": {"backend": "hybrid"},
      "high_complexity": {"backend": "premium"}
    },
    "reanchor": {"backend": "local"}
  },
  "review": {
    "gates": {
      "enabled": true,
      "checks": ["lint", "build", "typecheck"]
    },
    "prescreen": {
      "enabled": true,
      "backend": "local",
      "max_iterations": 3
    },
    "fingerprint": {
      "enabled": true,
      "skip_unchanged": true,
      "require_blocker_change": true
    },
    "classification": {
      "security_patterns": [
        "**/auth/**", "**/crypto/**", "**/security/**",
        "**/*password*", "**/*secret*", "**/*token*"
      ],
      "complex_patterns": [
        "**/core/**", "**/architect*", "**/concurrent*"
      ]
    },
    "routing": {
      "security": ["openai_pro:codex-5.2", "copilot:codex-5.2"],
      "complex": ["claude_max:opus-4.5", "copilot:opus-4.5", "antigravity:opus-4.5"],
      "routine": ["quota_lowest", "copilot:sonnet-4", "antigravity:gemini-3-pro"]
    }
  },
  "premium_sources": {
    "openai_pro": {
      "enabled": true,
      "api_key_env": "OPENAI_API_KEY",
      "models": ["gpt-5.2", "codex-5.2", "gpt-4o"]
    },
    "claude_max": {
      "enabled": true,
      "api_key_env": "ANTHROPIC_API_KEY",
      "models": ["claude-opus-4-5-20250514", "claude-sonnet-4-20250514"]
    },
    "copilot": {
      "enabled": true,
      "access_method": "coding_agent",
      "models": ["claude-opus-4-5", "codex-5.2", "claude-sonnet-4"]
    },
    "antigravity": {
      "enabled": true,
      "access_method": "cli",
      "models": ["claude-opus-4-5", "gemini-3-pro"]
    }
  },
  "local": {
    "primary_model": "GLM-4.7-Flash-UD-Q4_K_XL",
    "primary_endpoint": "http://localhost:8080",
    "context_default": 65536,
    "context_max": 131072,
    "prefilter_model": "qwen2.5-coder:7b-instruct-q8_0",
    "prefilter_endpoint": "http://localhost:11434"
  },
  "memory": {
    "enabled": true
  }
}
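The classification patterns in this config decide which routing list a review uses. The sketch below shows one way the matching could work, using Python's fnmatch as a stand-in glob engine. This is an assumption: fnmatch lets * cross directory separators, so ** behaves like *, and flow-next's actual glob semantics may differ. Checking security before complex is also assumed, and classify_change is a hypothetical name.

```python
from fnmatch import fnmatchcase

# Patterns copied from review.classification in the config above.
SECURITY_PATTERNS = [
    "**/auth/**", "**/crypto/**", "**/security/**",
    "**/*password*", "**/*secret*", "**/*token*",
]
COMPLEX_PATTERNS = ["**/core/**", "**/architect*", "**/concurrent*"]


def _matches(path: str, patterns: list) -> bool:
    # Prefix with "/" so "**/auth/**" also matches a top-level "auth/" directory.
    candidate = "/" + path.lstrip("/")
    return any(fnmatchcase(candidate, pattern) for pattern in patterns)


def classify_change(paths: list) -> str:
    """Return 'security', 'complex', or 'routine' for a set of changed files."""
    # Assumed precedence: any security-sensitive file wins over complexity.
    if any(_matches(p, SECURITY_PATTERNS) for p in paths):
        return "security"
    if any(_matches(p, COMPLEX_PATTERNS) for p in paths):
        return "complex"
    return "routine"


if __name__ == "__main__":
    print(classify_change(["src/core/engine.py", "lib/reset_password.py"]))
```

Note that broad patterns such as "**/*token*" also match unrelated paths like a tokenizer module, which routes them to the more expensive security list; narrowing the patterns is a cheap way to save quota.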
10.2 Environment Variables (config.env for Ralph)
# === FLOW-NEXT TOKEN-OPTIMIZED CONFIGURATION ===
# --- LOCAL MODELS ---
LOCAL_ENABLED=1
LOCAL_ENDPOINT=http://localhost:8080
LOCAL_MODEL=GLM-4.7-Flash-UD-Q4_K_XL
LOCAL_CONTEXT_DEFAULT=65536
PREFILTER_ENABLED=1
PREFILTER_ENDPOINT=http://localhost:11434
PREFILTER_MODEL=qwen2.5-coder:7b-instruct-q8_0
# --- PREMIUM SOURCES ---
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_PRO_ENABLED=1
CLAUDE_MAX_ENABLED=1
COPILOT_ENABLED=1
ANTIGRAVITY_ENABLED=1
# --- REVIEW PIPELINE ---
GATES_ENABLED=1
PRESCREEN_ENABLED=1
PRESCREEN_MAX_ITERATIONS=3
FINGERPRINT_ENABLED=1
FINGERPRINT_SKIP_UNCHANGED=1
# --- WORK PIPELINE ---
WORK_