Flow-Next Token-Optimized Development Specification #80

@clairernovotny

Flow-Next Token-Optimized Development Specification

Complete End-to-End System Design


Table of Contents

  1. Executive Summary
  2. System Overview
  3. Resource Inventory
  4. Phase 1: Interview
  5. Phase 2: Plan
  6. Phase 3: Work
  7. Phase 4: Review
  8. Local Model Infrastructure
  9. Multi-Source Orchestration
  10. Configuration Reference
  11. Implementation Roadmap
  12. Appendices

1. Executive Summary

1.1 The Problem

Flow-next's Ralph loops exhaust the weekly premium model quotas (Codex 5.2, Opus 4.5) in about 2 days instead of lasting the full week. Tokens are consumed in every phase, but unevenly:

| Phase | Current Token Burn | Primary Waste |
|---|---|---|
| Interview | Low (human-paced) | None significant |
| Plan | Medium (scout parallelism) | Redundant codebase scanning |
| Work | High (implementation) | Full context on every task |
| Review | Very High | Re-reviewing unchanged code, verbose prompts |

1.2 The Solution

Deploy GLM-4.7-Flash (59.2% SWE-Bench) locally to handle preprocessing across all phases while premium models (Opus 4.5, Codex 5.2, Gemini 3 Pro) make final decisions via multi-source quota pooling.

1.3 Key Principles

  1. Premium models always make final decisions — No quality compromise
  2. Local handles grunt work — Pre-screening, classification, context preparation
  3. Multi-source pooling — 4 premium quota pools (OpenAI, Claude, Copilot, Antigravity)
  4. No lossy compression — Full context for security/complex reviews
  5. Phase-appropriate optimization — Each phase optimized differently

1.4 Expected Outcomes

| Metric | Current | Target |
|---|---|---|
| Weekly quota duration | 2 days | 7+ days |
| Premium calls per task | 2-4 | 0.3-0.5 |
| Review quality | 100% | 100% (unchanged) |
| Local pre-screening catch rate | 0% | 60-80% |
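As a rough sanity check on these targets, the arithmetic below compares the reduction needed to stretch the quota from 2 to 7 days with the reduction implied by the "premium calls per task" row. It assumes quota consumption scales linearly with the number of premium calls and that tokens per call stay constant; both are simplifying assumptions, not measurements.

```python
# Sanity check on the targets above. Assumes quota burn is proportional
# to the number of premium calls (a simplification, not a measurement).

CURRENT_DAYS = 2       # quota currently lasts 2 days
TARGET_DAYS = 7        # target: a full week

# Reduction in premium usage needed to stretch 2 days to 7.
required_reduction = TARGET_DAYS / CURRENT_DAYS

# Reduction implied by the "premium calls per task" row.
current_calls = (2, 4)       # low, high
target_calls = (0.3, 0.5)    # low, high
min_implied = current_calls[0] / target_calls[1]   # most conservative case
max_implied = current_calls[1] / target_calls[0]   # most optimistic case

print(f"required: {required_reduction:.1f}x")
print(f"implied:  {min_implied:.1f}x to {max_implied:.1f}x")
```

Under that assumption, even the conservative end of the implied range exceeds the required reduction, so the two targets are consistent with each other.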

2. System Overview

2.1 End-to-End Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         FLOW-NEXT OPTIMIZED PIPELINE                         │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 1: INTERVIEW                                                          │
│ /flow-next:interview                                                        │
│                                                                             │
│ Human-paced, low token burn                                                 │
│ Local: Question generation, spec drafting                                   │
│ Premium: Complex decision trees (when needed)                               │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 2: PLAN                                                               │
│ /flow-next:plan                                                             │
│                                                                             │
│ Scout parallelism, moderate token burn                                      │
│ Local: repo-scout, practice-scout, docs-scout (GLM-4.7-Flash)              │
│ Premium: Epic spec writing, architectural decisions                         │
│ Optimization: Scout result caching, incremental updates                     │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 3: WORK                                                               │
│ /flow-next:work                                                             │
│                                                                             │
│ Implementation, high token burn                                             │
│ Local: Worker pre-anchor summaries, implementation assistance               │
│ Premium: Complex implementation decisions, unfamiliar patterns              │
│ Optimization: Task-scoped context, memory system                            │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PHASE 4: REVIEW                                                             │
│ /flow-next:impl-review, /flow-next:plan-review                              │
│                                                                             │
│ Verification, very high token burn                                          │
│ Local: Pre-screening, fingerprinting, classification                        │
│ Premium: Final SHIP/NEEDS_WORK judgment                                     │
│ Optimization: Multi-source routing, lean prompts                            │
└─────────────────────────────────────────────────────────────────────────────┘

2.2 Model Allocation by Phase

| Phase | Local Model | Local Role | Premium Model | Premium Role |
|---|---|---|---|---|
| Interview | GLM-4.7-Flash | Question drafting | Opus 4.5 | Complex decisions |
| Plan | GLM-4.7-Flash | Scout agents | Opus 4.5 | Epic spec review |
| Work | GLM-4.7-Flash | Re-anchor, assistance | Opus 4.5 / Codex 5.2 | Implementation |
| Review | GLM-4.7-Flash | Pre-screen, classify | Codex 5.2 / Opus 4.5 | Final verdict |

2.3 Data Flow

┌─────────────────────────────────────────────────────────────────────────────┐
│                              STATE MANAGEMENT                                │
└─────────────────────────────────────────────────────────────────────────────┘

.flow/
├── epics/                    # Epic metadata (JSON)
├── specs/                    # Epic specs (Markdown)
│   └── fn-N-xxx.md
├── tasks/                    # Task metadata (JSON) and task specs (Markdown)
│   └── fn-N-xxx.M.md
├── config.json               # Project configuration
├── meta.json                 # Flow metadata
│
├── review_state.json         # NEW: Fingerprints, blockers
├── quota_state.json          # NEW: Multi-source quota tracking
├── metrics.json              # NEW: Telemetry
│
├── memory/                   # Existing memory system
│   ├── pitfalls.md
│   ├── conventions.md
│   └── decisions.md
│
└── cache/                    # NEW: Scout result caching
    ├── repo-scout/
    ├── practice-scout/
    └── docs-scout/

3. Resource Inventory

3.1 Premium Sources

| Source | Subscription | Models | Best For | Access |
|---|---|---|---|---|
| OpenAI Pro | $200/mo | Codex 5.2, GPT-4o | Security review | API |
| Claude Max | $200/mo | Opus 4.5, Sonnet 4 | Complex reasoning | API, Claude Code |
| Copilot Pro+ | $39/mo | Opus 4.5, Codex 5.2, Sonnet 4 | Flexible routing | Coding Agent |
| Antigravity | ~$25/mo | Opus 4.5, Gemini 3 Pro | Additional capacity | CLI |

3.2 Local Hardware

| Component | Specification | Role |
|---|---|---|
| GPU | NVIDIA RTX 5090, 32 GB VRAM | Model inference |
| RAM | 128 GB DDR5 | Extended context |
| Storage | Gen5 NVMe, 14.9 GB/s | Fast model loading |

3.3 Local Models

GLM-4.7-Flash (Primary)

| Attribute | Value |
|---|---|
| Model | GLM-4.7-Flash-UD-Q4_K_XL |
| SWE-Bench | 59.2% (best local) |
| Parameters | 30B total, ~3.6B active (MoE) |
| Context | Up to 200K tokens |
| VRAM | 23-30 GB (context dependent) |
| Speed | 60-100 tok/s |

VRAM by Context:

| Context | VRAM | Can colocate 7B pre-filter? |
|---|---|---|
| 32K | ~20 GB | ✅ Yes |
| 65K | ~23 GB | ✅ Yes |
| 131K | ~30 GB | ❌ No |
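The colocation column follows from adding the pre-filter's footprint to the primary model's footprint and comparing against the GPU's 32 GB. The sketch below reproduces that arithmetic from the figures in this section; the names are illustrative, and it models no headroom for the OS, display, or allocator fragmentation.

```python
# VRAM budget check for colocating the 7B pre-filter with GLM-4.7-Flash.
# Figures come from the tables in this section; no headroom for the OS,
# display, or fragmentation is modelled (an assumption of this sketch).

GPU_VRAM_GB = 32          # RTX 5090
PREFILTER_VRAM_GB = 8     # Qwen2.5-Coder-7B at q8_0 (approximate)

GLM_VRAM_BY_CONTEXT = {   # context tokens -> approximate VRAM in GB
    32_000: 20,
    65_000: 23,
    131_000: 30,
}

def can_colocate(context_tokens: int) -> bool:
    """Return True if both models fit in GPU memory at this context size."""
    total = GLM_VRAM_BY_CONTEXT[context_tokens] + PREFILTER_VRAM_GB
    return total <= GPU_VRAM_GB

for ctx in GLM_VRAM_BY_CONTEXT:
    print(ctx, can_colocate(ctx))
```

At 65K context the combined footprint is about 31 GB, which leaves roughly 1 GB of margin, so that configuration may be tight in practice.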

Qwen2.5-Coder-7B (Pre-filter)

| Attribute | Value |
|---|---|
| Model | qwen2.5-coder:7b-instruct-q8_0 |
| Context | 32K tokens |
| VRAM | ~8 GB |
| Speed | 100-120 tok/s |
| Role | Fast pre-filter |

4. Phase 1: Interview

4.1 Overview

The Interview phase is human-paced, so its token burn is low by nature. Optimization here focuses on question quality and on assisting with spec drafting.

┌─────────────────────────────────────────────────────────────────────────────┐
│                         INTERVIEW OPTIMIZATION                               │
└─────────────────────────────────────────────────────────────────────────────┘

Current: All questions generated by premium model
Optimized: Local generates question candidates, premium refines

Token Savings: 30-50% (interview is already low burn)
Quality Impact: None (human validates all questions)

4.2 Optimization Strategy

4.2.1 Local Question Generation

GLM-4.7-Flash generates initial question batches based on input:

def generate_interview_questions(
    input_type: str,  # epic, task, file, idea
    content: str,
    category: str,  # scope, technical, edge_cases, etc.
) -> list[str]:
    """Generate interview questions locally."""
    
    prompt = f"""Generate interview questions for a {input_type}.

## Content
{content}

## Category: {category}

## Question Guidelines
- Dig deep on hidden complexity
- Surface assumptions
- Identify edge cases
- Group related questions (2-4 per batch)

Generate 5-10 questions for this category.
"""
    
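    # Assumes run_glm47_flash returns the raw completion as a string with one
    # question per line; split it so the result matches the declared list[str].
    raw = run_glm47_flash(prompt)
    return [line.strip().lstrip("-•* ").strip() for line in raw.splitlines() if line.strip()]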

4.2.2 Premium Refinement

Premium model reviews and refines questions only when:

  • Questions seem superficial
  • Domain requires specialized knowledge
  • User requests deeper exploration
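The three triggers above could be combined into a single gate, sketched below. The function name, its parameters, and especially the word-count heuristic for "superficial" are hypothetical stand-ins for illustration; the specification does not define how superficiality is detected.

```python
def needs_premium_refinement(
    questions: list[str],
    specialized_domain: bool = False,
    user_requested_depth: bool = False,
    min_avg_words: int = 8,   # hypothetical threshold for "superficial"
) -> bool:
    """Decide whether locally generated questions go to a premium model."""
    if user_requested_depth or specialized_domain:
        return True
    if not questions:
        return True  # nothing usable came back from the local model
    # Crude proxy: very short questions tend to be generic.
    avg_words = sum(len(q.split()) for q in questions) / len(questions)
    return avg_words < min_avg_words
```

A real implementation might instead ask the local model to self-rate its questions; the point of the sketch is only that the premium call is conditional rather than the default.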

4.3 Interview Flow (Optimized)

┌─────────────────────────────────────────────────────────────────────────────┐
│                         INTERVIEW FLOW                                       │
└─────────────────────────────────────────────────────────────────────────────┘

Input (epic/task/file/idea)
          │
          ▼
┌─────────────────┐
│ GLM-4.7-Flash   │  ← LOCAL: Parse input, identify question categories
│ Question Gen    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Human Q&A Loop  │  ← Questions via AskUserQuestion tool
│ (40+ questions) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ GLM-4.7-Flash   │  ← LOCAL: Draft refined spec from answers
│ Spec Drafting   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Human Review    │  ← User approves/edits
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Write to        │  ← flowctl epic set-plan / task set-spec
│ .flow/specs/    │
└─────────────────┘

4.4 Configuration

{
  "interview": {
    "question_generation": {
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "fallback_to_premium": true
    },
    "spec_drafting": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "premium_refinement": {
      "enabled": true,
      "trigger": "user_request_or_shallow_questions"
    }
  }
}

5. Phase 2: Plan

5.1 Overview

Planning involves parallel scout agents that analyze the codebase. The current implementation runs all six scouts on premium models. The optimized pipeline runs five of them locally (github-scout stays on premium), drafts the spec locally, and reserves premium models for spec review.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           PLAN OPTIMIZATION                                  │
└─────────────────────────────────────────────────────────────────────────────┘

Current Scout Agents (all premium):
├── repo-scout      - Codebase analysis
├── practice-scout  - Best practices research
├── docs-scout      - Documentation analysis
├── github-scout    - GitHub issues/PRs
├── epic-scout      - Epic dependencies
└── docs-gap-scout  - Documentation gaps

Optimized:
├── repo-scout      - LOCAL (GLM-4.7-Flash)
├── practice-scout  - LOCAL (GLM-4.7-Flash)
├── docs-scout      - LOCAL (GLM-4.7-Flash)
├── github-scout    - PREMIUM (needs API access)
├── epic-scout      - LOCAL (GLM-4.7-Flash)
└── docs-gap-scout  - LOCAL (GLM-4.7-Flash)

Token Savings: 60-70% of planning phase
Quality Impact: Minimal (scouts gather info, premium synthesizes)

5.2 Scout Agent Optimization

5.2.1 Local Scout Execution

from pathlib import Path


class LocalScoutRunner:
    """Run scout agents locally via GLM-4.7-Flash."""
    
    def __init__(self, glm_backend: LlamaCppBackend):
        self.glm = glm_backend
        self.cache_dir = Path(".flow/cache")
    
    def run_repo_scout(self, feature_description: str) -> ScoutResult:
        """Analyze codebase for relevant patterns."""
        
        # Check cache first
        cache_key = self._cache_key("repo", feature_description)
        if cached := self._get_cache(cache_key):
            return cached
        
        # Gather codebase info
        file_tree = self._get_file_tree()
        recent_commits = self._get_recent_commits(30)
        
        prompt = f"""Analyze this codebase for implementing: {feature_description}

## File Structure
{file_tree}

## Recent Commits
{recent_commits}

## Tasks
1. Identify files likely to be modified
2. Find similar patterns in the codebase
3. Note relevant imports/dependencies
4. Flag potential conflicts

Output structured findings.
"""
        
        result = self.glm.generate(prompt)
        self._set_cache(cache_key, result)
        return result
    
    def run_practice_scout(self, feature_description: str) -> ScoutResult:
        """Research best practices for the feature."""
        
        prompt = f"""Research best practices for: {feature_description}

Consider:
1. Common implementation patterns
2. Security considerations
3. Performance implications
4. Testing strategies
5. Error handling approaches

Base recommendations on industry standards and common patterns.
"""
        
        return self.glm.generate(prompt)
    
    def run_epic_scout(self, new_epic_id: str) -> ScoutResult:
        """Find dependencies on existing epics."""
        
        # Read all existing epics
        epics = self._load_all_epics()
        
        prompt = f"""Analyze dependencies for new epic {new_epic_id}.

## Existing Epics
{self._format_epics(epics)}

## New Epic
{self._load_epic(new_epic_id)}

Identify:
1. Which existing epics this depends on
2. Specific tasks/components used
3. Potential conflicts
"""
        
        return self.glm.generate(prompt)

5.2.2 Scout Result Caching

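import json
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


class ScoutCache: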
    """Cache scout results to avoid redundant analysis."""
    
    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.ttl_hours = 24  # Results valid for 24 hours
    
    def get(self, scout_type: str, key: str) -> Optional[ScoutResult]:
        cache_file = self.cache_dir / scout_type / f"{self._hash(key)}.json"
        
        if not cache_file.exists():
            return None
        
        data = json.loads(cache_file.read_text())
        
        # Check TTL
        cached_at = datetime.fromisoformat(data["cached_at"])
        if datetime.now() - cached_at > timedelta(hours=self.ttl_hours):
            return None
        
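        # Check if codebase changed significantly since the entry was written
        cached_hash = data.get("git_hash")
        if cached_hash is None:
            return None  # No hash recorded; treat the entry as stale
        if cached_hash != self._current_git_hash():
            # Invalidate only if the changes touch files the scout relied on
            changed_files = self._get_changed_files(cached_hash)
            if self._significant_changes(changed_files, data.get("relevant_files", [])):
                return None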
        
        return ScoutResult.from_dict(data["result"])
    
    def set(self, scout_type: str, key: str, result: ScoutResult):
        cache_file = self.cache_dir / scout_type / f"{self._hash(key)}.json"
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        
        data = {
            "cached_at": datetime.now().isoformat(),
            "git_hash": self._current_git_hash(),
            "relevant_files": result.relevant_files,
            "result": result.to_dict(),
        }
        
        cache_file.write_text(json.dumps(data, indent=2))
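ScoutCache calls `_hash` and `_significant_changes` without defining them. One plausible shape for those helpers is sketched below as free functions; the names and the "any overlap with relevant files" rule are assumptions for illustration, not the specified behavior.

```python
import hashlib

def cache_key_hash(key: str) -> str:
    """Stable, filesystem-safe name for a cache entry (SHA-256 hex digest)."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def significant_changes(changed_files: list[str], relevant_files: list[str]) -> bool:
    """Treat a change as significant only if it touches a file the scout relied on."""
    return bool(set(changed_files) & set(relevant_files))
```

With this rule, commits that touch unrelated files leave the cached result valid, which is what lets the cache survive ordinary development activity within the TTL window.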

5.3 Planning Flow (Optimized)

┌─────────────────────────────────────────────────────────────────────────────┐
│                           PLANNING FLOW                                      │
└─────────────────────────────────────────────────────────────────────────────┘

Input (feature idea or epic ID)
          │
          ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ PARALLEL SCOUT EXECUTION                                                    │
│                                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐       │
│  │ repo-scout  │  │ practice-   │  │ docs-scout  │  │ epic-scout  │       │
│  │   LOCAL     │  │ scout LOCAL │  │   LOCAL     │  │   LOCAL     │       │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘       │
│         │                │                │                │               │
│         └────────────────┴────────────────┴────────────────┘               │
│                                    │                                        │
│                                    ▼                                        │
│                          ┌─────────────────┐                               │
│                          │ Scout Results   │                               │
│                          │ Aggregation     │                               │
│                          └────────┬────────┘                               │
└───────────────────────────────────┼─────────────────────────────────────────┘
                                    │
                                    ▼
                          ┌─────────────────┐
                          │ GLM-4.7-Flash   │  ← LOCAL: Draft epic spec
                          │ Spec Drafting   │
                          └────────┬────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │ PREMIUM         │  ← Opus 4.5: Architectural review
                          │ Spec Review     │     (optional, on request)
                          └────────┬────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │ Task Breakdown  │  ← LOCAL: Create task specs
                          │ GLM-4.7-Flash   │
                          └────────┬────────┘
                                   │
                                   ▼
                          ┌─────────────────┐
                          │ Write to .flow/ │
                          └─────────────────┘
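The parallel scout stage at the top of this flow could be driven by a thread pool, as in the minimal sketch below. The function name and the stand-in scouts are hypothetical; real scouts would call the local model. Threads are used on the assumption that each scout mostly waits on the inference server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def run_scouts_in_parallel(
    scouts: dict[str, Callable[[str], str]],
    feature_description: str,
) -> dict[str, str]:
    """Run every scout concurrently and collect results by scout name."""
    with ThreadPoolExecutor(max_workers=max(1, len(scouts))) as pool:
        futures = {
            name: pool.submit(scout, feature_description)
            for name, scout in scouts.items()
        }
        # result() re-raises any exception from the scout, so failures surface here.
        return {name: future.result() for name, future in futures.items()}

# Stand-in scouts for illustration; real ones would call the local model.
demo_scouts = {
    "repo-scout": lambda desc: f"repo findings for {desc}",
    "docs-scout": lambda desc: f"docs findings for {desc}",
}
results = run_scouts_in_parallel(demo_scouts, "rate limiting")
```

Whether concurrent requests actually improve throughput against a single local model depends on how the inference server batches them, so the worker count may need tuning.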

5.4 Plan Review Optimization

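Plan reviews follow the same optimized pipeline as implementation reviews (see Phase 4): local pre-screen, complexity classification, then routing to a premium model: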

def review_plan(epic_id: str) -> ReviewResult:
    """Review epic plan with optimized pipeline."""
    
    # 1. Local pre-screen
    prescreen = local_prescreen_plan(epic_id)
    if prescreen.verdict == "NEEDS_WORK":
        return prescreen  # Fix locally
    
    # 2. Classify complexity
    complexity = classify_plan_complexity(epic_id)
    
    # 3. Route to appropriate premium
    if complexity == "SIMPLE":
        # Sonnet 4 or Gemini 3 Pro
        return route_to_cheapest_premium(epic_id, "plan")
    else:
        # Opus 4.5 for complex architectural review
        return route_to_opus(epic_id, "plan")

5.5 Configuration

{
  "plan": {
    "scouts": {
      "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
      "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
      "docs_scout": {"backend": "local", "cache_ttl_hours": 24},
      "github_scout": {"backend": "premium", "model": "sonnet-4"},
      "epic_scout": {"backend": "local", "cache_ttl_hours": 24},
      "docs_gap_scout": {"backend": "local", "cache_ttl_hours": 24}
    },
    "spec_drafting": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "task_breakdown": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "plan_review": {
      "simple": {"backend": "sonnet-4"},
      "complex": {"backend": "opus-4.5"}
    }
  }
}
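The configuration gives each scout its own `cache_ttl_hours`, whereas the ScoutCache shown earlier hard-codes 24 hours. A small lookup like the one below could bridge the two; the function name and the fallback default are assumptions of this sketch.

```python
import json
from typing import Optional

DEFAULT_TTL_HOURS = 24  # assumed fallback, matching ScoutCache's default

def scout_ttl_hours(config: dict, scout_name: str) -> Optional[int]:
    """Return the cache TTL for a scout, or None if it is not run locally."""
    scout = config["plan"]["scouts"].get(scout_name, {})
    if scout.get("backend") != "local":
        return None  # premium scouts carry no cache_ttl_hours in this config
    return scout.get("cache_ttl_hours", DEFAULT_TTL_HOURS)

# Abbreviated copy of the configuration above.
config = json.loads("""
{"plan": {"scouts": {
  "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
  "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
  "github_scout": {"backend": "premium", "model": "sonnet-4"}
}}}
""")
```

For example, `scout_ttl_hours(config, "practice_scout")` yields the one-week TTL, reflecting that best-practice research goes stale more slowly than repository analysis.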

6. Phase 3: Work

6.1 Overview

The Work phase is a high token consumer, second only to Review, because of implementation complexity. Currently, worker agents receive full context and run entirely on premium models. The optimized flow uses local models for re-anchoring and routine implementation, while premium models handle complex decisions.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           WORK OPTIMIZATION                                  │
└─────────────────────────────────────────────────────────────────────────────┘

Current Worker Flow:
  Premium spawns worker → Premium implements → Premium reviews

Optimized Worker Flow:
  Local re-anchor → Local/Premium implements (based on complexity) → Local pre-screen → Premium reviews

Token Savings: 40-60% of work phase
Quality Impact: Minimal (premium for complex decisions)

6.2 Worker Agent Optimization

6.2.1 Complexity-Based Model Selection

class WorkerModelSelector:
    """Select model for worker based on task complexity."""
    
    def select_model(self, task_id: str) -> ModelConfig:
        task = load_task(task_id)
        epic = load_epic(task.epic_id)
        
        # Analyze task complexity
        complexity = self.analyze_complexity(task, epic)
        
        if complexity.level == "LOW":
            # Simple tasks: Local GLM-4.7-Flash
            return ModelConfig(
                backend="local",
                model="GLM-4.7-Flash",
                reason="Simple task, local model sufficient"
            )
        
        elif complexity.level == "MEDIUM":
            # Moderate tasks: Hybrid approach
            return ModelConfig(
                backend="hybrid",
                local_model="GLM-4.7-Flash",
                premium_model="sonnet-4",
                strategy="local_first_premium_fallback",
                reason="Moderate complexity, try local first"
            )
        
        else:  # HIGH
            # Complex tasks: Premium model
            return ModelConfig(
                backend="premium",
                model=self.select_premium_model(complexity),
                reason="Complex task requires premium reasoning"
            )
    
    def analyze_complexity(self, task: Task, epic: Epic) -> TaskComplexity:
        """Analyze task complexity for model selection."""
        
        indicators = {
            "security_sensitive": self.is_security_sensitive(task),
            "architectural": self.is_architectural(task),
            "concurrency": self.involves_concurrency(task),
            "unfamiliar_patterns": self.uses_unfamiliar_patterns(task),
            "large_scope": task.size in ["L", "XL"],
            "many_files": len(task.estimated_files) > 5,
            "cross_module": self.is_cross_module(task),
        }
        
        high_indicators = sum(indicators.values())
        
        if high_indicators >= 3 or indicators["security_sensitive"]:
            return TaskComplexity(level="HIGH", indicators=indicators)
        elif high_indicators >= 1:
            return TaskComplexity(level="MEDIUM", indicators=indicators)
        else:
            return TaskComplexity(level="LOW", indicators=indicators)
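To make the thresholds in `analyze_complexity` concrete, the standalone function below applies the same rule to a plain dictionary of indicators. It is a simplified restatement for illustration (the name `classify_level` is hypothetical) and omits the Task and Epic inspection.

```python
def classify_level(indicators: dict[str, bool]) -> str:
    """Apply the thresholds from analyze_complexity to a dict of indicators."""
    count = sum(indicators.values())  # True counts as 1, False as 0
    if count >= 3 or indicators.get("security_sensitive", False):
        return "HIGH"
    if count >= 1:
        return "MEDIUM"
    return "LOW"
```

Note that a single security flag escalates straight to HIGH regardless of the count, while any other lone indicator only reaches MEDIUM.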

6.2.2 Local Re-Anchor Summaries

Workers re-anchor before implementation by re-reading the specs, memory files, and git state. Instead of spending premium tokens on this step, the local model condenses that material into a focused summary:

def local_reanchor(task_id: str, epic_id: str) -> ReanchorContext:
    """Create re-anchor context using local model."""
    
    # Read specs
    task_spec = flowctl_cat(task_id)
    epic_spec = flowctl_cat(epic_id)
    
    # Read memory
    memory = read_memory_files()
    
    # Check git state
    git_status = run("git status")
    git_log = run("git log -5 --oneline")
    
    # Local model creates focused summary
    prompt = f"""Create implementation context summary for task {task_id}.

## Task Spec
{task_spec}

## Epic Context
{epic_spec}

## Memory (pitfalls, conventions)
{memory}

## Git State
{git_status}
{git_log}

Create a focused summary:
1. Key acceptance criteria (bullet points)
2. Files likely to modify
3. Relevant patterns from memory
4. Dependencies to be aware of
5. Test requirements
"""
    
    summary = run_glm47_flash(prompt)
    
    return ReanchorContext(
        task_id=task_id,
        summary=summary,
        task_spec=task_spec,  # Full spec still available
        epic_spec=epic_spec,
    )
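`local_reanchor` relies on a `read_memory_files` helper that is not shown. A minimal sketch follows, assuming the three files listed under `.flow/memory/` in the state layout; the `memory_dir` parameter and the per-file headings are additions for illustration.

```python
from pathlib import Path

MEMORY_FILES = ("pitfalls.md", "conventions.md", "decisions.md")

def read_memory_files(memory_dir: Path = Path(".flow/memory")) -> str:
    """Concatenate the memory files that exist, each under its own heading."""
    sections = []
    for name in MEMORY_FILES:
        path = memory_dir / name
        if path.is_file():
            text = path.read_text(encoding="utf-8").strip()
            if text:  # skip empty files so the prompt stays compact
                sections.append(f"### {name}\n{text}")
    return "\n\n".join(sections)
```

Missing or empty files are skipped silently, so a fresh project with no recorded pitfalls yields an empty string rather than an error.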

6.2.3 Hybrid Implementation Strategy

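class HybridWorker:
    """Worker that uses local for routine code, premium for complex decisions."""
    
    def __init__(self, model_selector: WorkerModelSelector, glm_backend: LlamaCppBackend):
        # Dependencies used by implement() and implement_local() below
        self.model_selector = model_selector
        self.glm = glm_backend
    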
    def implement(self, task_id: str, context: ReanchorContext) -> ImplementResult:
        model_config = self.model_selector.select_model(task_id)
        
        if model_config.backend == "local":
            return self.implement_local(task_id, context)
        
        elif model_config.backend == "hybrid":
            # Try local first
            result = self.implement_local(task_id, context)
            
            if result.confidence < 0.8 or result.needs_premium_decision:
                # Escalate to premium for specific decisions
                return self.escalate_to_premium(task_id, context, result)
            
            return result
        
        else:  # premium
            return self.implement_premium(task_id, context, model_config)
    
    def implement_local(self, task_id: str, context: ReanchorContext) -> ImplementResult:
        """Implement using GLM-4.7-Flash."""
        
        prompt = f"""Implement task {task_id}.

## Context
{context.summary}

## Full Task Spec (reference)
{context.task_spec}

## Instructions
1. Read relevant code files
2. Implement changes following existing patterns
3. Add tests if spec requires
4. Keep changes focused and minimal

If you encounter:
- Security-sensitive decisions → Flag for premium review
- Unfamiliar patterns → Flag for premium assistance
- Architectural choices → Flag for premium decision

Output your implementation plan, then implement.
"""
        
        return self.glm.implement(prompt)

6.3 Work Flow (Optimized)

┌─────────────────────────────────────────────────────────────────────────────┐
│                            WORK FLOW                                         │
└─────────────────────────────────────────────────────────────────────────────┘

                        /flow-next:work fn-1
                               │
                               ▼
                    ┌─────────────────┐
                    │ Setup Questions │
                    │ (branch, review)│
                    └────────┬────────┘
                             │
                             ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ TASK LOOP                                                                   │
│                                                                             │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ flowctl ready --epic fn-1 → next task                               │   │
│  └──────────────────────────────────┬──────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│  ┌─────────────────────────────────────────────────────────────────────┐   │
│  │ SPAWN WORKER (subagent)                                             │   │
│  │                                                                     │   │
│  │  ┌─────────────────┐                                                │   │
│  │  │ LOCAL RE-ANCHOR │ ← GLM-4.7-Flash creates focused summary       │   │
│  │  └────────┬────────┘                                                │   │
│  │           │                                                         │   │
│  │           ▼                                                         │   │
│  │  ┌─────────────────┐                                                │   │
│  │  │ CLASSIFY TASK   │ ← Determine LOCAL / HYBRID / PREMIUM          │   │
│  │  └────────┬────────┘                                                │   │
│  │           │                                                         │   │
│  │     ┌─────┴─────┐                                                   │   │
│  │     ▼           ▼                                                   │   │
│  │  ┌──────┐  ┌──────────┐                                             │   │
│  │  │LOCAL │  │PREMIUM   │                                             │   │
│  │  │impl  │  │impl      │                                             │   │
│  │  └──┬───┘  └────┬─────┘                                             │   │
│  │     │           │                                                   │   │
│  │     └─────┬─────┘                                                   │   │
│  │           │                                                         │   │
│  │           ▼                                                         │   │
│  │  ┌─────────────────┐                                                │   │
│  │  │ COMMIT          │                                                │   │
│  │  └────────┬────────┘                                                │   │
│  │           │                                                         │   │
│  │           ▼                                                         │   │
│  │  ┌─────────────────┐                                                │   │
│  │  │ REVIEW          │ ← See Phase 4                                  │   │
│  │  │ (if enabled)    │                                                │   │
│  │  └────────┬────────┘                                                │   │
│  │           │                                                         │   │
│  │           ▼                                                         │   │
│  │  ┌─────────────────┐                                                │   │
│  │  │ COMPLETE        │ ← flowctl done                                 │   │
│  │  └─────────────────┘                                                │   │
│  └─────────────────────────────────────────────────────────────────────┘   │
│                                     │                                       │
│                                     ▼                                       │
│                          Next task or Quality phase                         │
└─────────────────────────────────────────────────────────────────────────────┘

6.4 Configuration

{
  "work": {
    "model_selection": {
      "strategy": "complexity_based",
      "low_complexity": {
        "backend": "local",
        "model": "GLM-4.7-Flash"
      },
      "medium_complexity": {
        "backend": "hybrid",
        "local_model": "GLM-4.7-Flash",
        "premium_model": "sonnet-4",
        "strategy": "local_first_premium_fallback"
      },
      "high_complexity": {
        "backend": "premium",
        "security": "codex-5.2",
        "architectural": "opus-4.5",
        "default": "opus-4.5"
      }
    },
    "reanchor": {
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "include_memory": true
    },
    "worker": {
      "subagent": true,
      "fresh_context": true
    }
  }
}
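The `high_complexity` block maps task traits to premium models, and `WorkerModelSelector` defers to a `select_premium_model` method that is not shown. One way that lookup might work is sketched below. It takes the indicator dictionary directly rather than a TaskComplexity object, and the precedence of security over architecture is an assumption.

```python
HIGH_COMPLEXITY = {   # copied from work.model_selection.high_complexity above
    "security": "codex-5.2",
    "architectural": "opus-4.5",
    "default": "opus-4.5",
}

def select_premium_model(
    indicators: dict[str, bool],
    table: dict[str, str] = HIGH_COMPLEXITY,
) -> str:
    """Pick a premium model; security takes precedence over architecture."""
    if indicators.get("security_sensitive"):
        return table["security"]
    if indicators.get("architectural"):
        return table["architectural"]
    return table["default"]
```

Keeping the mapping in configuration rather than code means the routing can change when a quota pool runs low without touching the selector.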

7. Phase 4: Review

7.1 Overview

Review phase has the highest token burn due to:

  1. Full context sent to premium models
  2. Multiple review iterations per task
  3. Re-reviewing unchanged code
  4. Verbose prompts

This is the primary optimization target.
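Item 3, re-reviewing unchanged code, can be avoided by fingerprinting each diff and skipping the premium call when the fingerprint has already been reviewed. The sketch below shows the idea with SHA-256. The function names, the whitespace normalization, and the in-memory set are assumptions; a real implementation would persist fingerprints, for instance in `.flow/review_state.json`.

```python
import hashlib

def diff_fingerprint(diff_text: str) -> str:
    """SHA-256 fingerprint of a diff, ignoring trailing whitespace per line."""
    normalized = "\n".join(line.rstrip() for line in diff_text.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def should_skip_review(diff_text: str, reviewed: set[str]) -> bool:
    """Skip the premium review if this exact diff was already reviewed."""
    return diff_fingerprint(diff_text) in reviewed

reviewed_fingerprints: set[str] = set()
diff = "+ added line\n- removed line\n"
first = should_skip_review(diff, reviewed_fingerprints)    # not seen yet
reviewed_fingerprints.add(diff_fingerprint(diff))
second = should_skip_review(diff, reviewed_fingerprints)   # seen, can skip
```

Because the check is a deterministic hash comparison, it costs no model tokens at all, local or premium.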

7.2 Review Pipeline

┌─────────────────────────────────────────────────────────────────────────────┐
│                         REVIEW PIPELINE                                      │
└─────────────────────────────────────────────────────────────────────────────┘

                              Review Request
                                    │
                                    ▼
                         ┌─────────────────┐
                         │ STAGE 1: GATES  │  ← FREE (deterministic)
                         │ lint/build/tsc  │
                         └────────┬────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ STAGE 2:        │  ← FREE (GLM-4.7-Flash)
                         │ PRE-SCREEN      │
                         │                 │
                         │ Find obvious    │
                         │ issues, fix     │
                         │ locally         │
                         └────────┬────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ STAGE 3:        │  ← FREE (SHA-256)
                         │ FINGERPRINT     │
                         │                 │
                         │ Skip if diff    │
                         │ unchanged       │
                         └────────┬────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ STAGE 4:        │  ← FREE (GLM-4.7-Flash)
                         │ CLASSIFY        │
                         │                 │
                         │ SECURITY /      │
                         │ COMPLEX /       │
                         │ ROUTINE         │
                         └────────┬────────┘
                                  │
                                  ▼
                         ┌─────────────────┐
                         │ STAGE 5:        │  ← Multi-source pooling
                         │ ROUTE           │
                         │                 │
                         │ Select best     │
                         │ available src   │
                         └────────┬────────┘
                                  │
          ┌───────────────────────┼───────────────────────┐
          ▼                       ▼                       ▼
    ┌───────────┐          ┌───────────┐          ┌───────────┐
    │ SECURITY  │          │ COMPLEX   │          │ ROUTINE   │
    │           │          │           │          │           │
    │ Codex 5.2 │          │ Opus 4.5  │          │ Sonnet 4  │
    │ preferred │          │ preferred │          │ or lowest │
    └─────┬─────┘          └─────┬─────┘          └─────┬─────┘
          │                      │                      │
          └──────────────────────┴──────────────────────┘
                                 │
                                 ▼
                         ┌─────────────────┐
                         │ STAGE 6:        │
                         │ PREMIUM REVIEW  │
                         │                 │
                         │ Full context    │
                         │ Lean prompt     │
                         │ Final verdict   │
                         └────────┬────────┘
                                  │
                                  ▼
                           SHIP / NEEDS_WORK

7.3 Stage Details

See the detailed specifications in Section 6 of the earlier review-focused spec. Key points:

7.3.1 Stage 2: Pre-Screen

GLM-4.7-Flash catches obvious issues:

  • Missing imports
  • Undefined variables
  • Obvious null access
  • Logic errors
  • Missing returns

Expected catch rate: 40-60% of NEEDS_WORK issues

7.3.2 Stage 3: Fingerprint

Prevent re-reviewing unchanged code:

fingerprint = sha256(normalize(git diff base..HEAD))

# Skip if:
# - Same fingerprint as last NEEDS_WORK
# - Blocker files not touched

Expected skip rate: 20-40% of reviews
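A minimal Python sketch of the fingerprint check follows. The normalization rules (dropping `index` lines and hunk offsets) and the helper names are assumptions, since the spec does not define `normalize`. The skip rule is one reading of the two conditions above: skip when the fingerprint matches the last NEEDS_WORK, or when the diff changed but none of the blocker files were touched.

```python
import hashlib
import re
from typing import Optional


def normalize_diff(diff: str) -> str:
    """Drop parts of a diff that vary without the code changing (assumed rules)."""
    kept = []
    for line in diff.splitlines():
        if line.startswith("index "):
            continue  # blob hashes change on rebase even if content does not
        if line.startswith("@@"):
            # Strip line offsets such as "@@ -1,2 +1,3 @@" but keep hunk context.
            line = re.sub(r"^@@ .*? @@", "@@", line)
        kept.append(line.rstrip())
    return "\n".join(kept)


def fingerprint(diff: str) -> str:
    return hashlib.sha256(normalize_diff(diff).encode("utf-8")).hexdigest()


def should_skip_review(
    diff: str,
    last_needs_work_fp: Optional[str],
    changed_files: list[str],
    blocker_files: list[str],
) -> bool:
    if last_needs_work_fp is None:
        return False  # no earlier NEEDS_WORK verdict to compare against
    if fingerprint(diff) == last_needs_work_fp:
        return True  # nothing changed since the last rejection
    # The diff changed, but not in any file the reviewer flagged as a blocker.
    return bool(blocker_files) and not (set(changed_files) & set(blocker_files))
```

In practice the `diff` argument would come from `git diff base..HEAD`; it is passed in as a string here so the logic can be exercised without a repository.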

7.3.3 Stage 4: Classification

Route to appropriate premium model:

| Type     | Patterns                       | Model     |
|----------|--------------------------------|-----------|
| SECURITY | auth, crypto, password, token  | Codex 5.2 |
| COMPLEX  | core, architecture, concurrent | Opus 4.5  |
| ROUTINE  | tests, docs, simple fixes      | Sonnet 4  |
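The pattern column can be approximated with a deterministic path check that runs before, or alongside, the local model's classification. This is a sketch under assumptions: the function name `classify` is hypothetical, and it uses plain substring matching on file paths rather than the glob patterns from the configuration.

```python
# Keywords taken from the table above; matching is case-insensitive.
SECURITY_KEYWORDS = ("auth", "crypto", "password", "token")
COMPLEX_KEYWORDS = ("core", "architecture", "concurrent")


def classify(changed_files: list[str]) -> str:
    """Return SECURITY, COMPLEX, or ROUTINE for a set of changed paths."""
    paths = [p.lower() for p in changed_files]
    # Security wins over complex: one sensitive file escalates the whole review.
    if any(k in p for p in paths for k in SECURITY_KEYWORDS):
        return "SECURITY"
    if any(k in p for p in paths for k in COMPLEX_KEYWORDS):
        return "COMPLEX"
    return "ROUTINE"
```

Substring matching over-triggers (for example, `author.py` contains `auth`), which errs toward the stricter reviewer rather than the cheaper one.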

7.3.4 Stage 5: Quota-Aware Routing

Select from 4 premium sources based on availability:

| Source      | Models              | Status Tracking        |
|-------------|---------------------|------------------------|
| OpenAI Pro  | Codex 5.2, GPT-4o   | .flow/quota_state.json |
| Claude Max  | Opus 4.5, Sonnet 4  | .flow/quota_state.json |
| Copilot     | Opus, Codex, Sonnet | .flow/quota_state.json |
| Antigravity | Opus, Gemini 3 Pro  | .flow/quota_state.json |

7.3.5 Stage 6: Premium Review

CRITICAL: No lossy compression. Premium sees full context:

  • Changed files (full content)
  • Direct imports (full content)
  • Test files (full content)
  • Diff (full)

Savings come from:

  • Pre-screening (fewer reviews)
  • Fingerprinting (skip unchanged)
  • Lean prompts (~500 tokens vs 3000)
  • Multi-source pooling (capacity)

7.4 Configuration

{
  "review": {
    "gates": {
      "enabled": true,
      "checks": ["lint", "build", "typecheck"]
    },
    "prescreen": {
      "enabled": true,
      "backend": "local",
      "model": "GLM-4.7-Flash",
      "max_iterations": 3
    },
    "fingerprint": {
      "enabled": true,
      "skip_unchanged": true,
      "require_blocker_change": true
    },
    "classification": {
      "security_patterns": ["**/auth/**", "**/crypto/**", "**/*password*"],
      "complex_patterns": ["**/core/**", "**/architect*"]
    },
    "routing": {
      "security": ["openai_pro:codex-5.2", "copilot:codex-5.2"],
      "complex": ["claude_max:opus-4.5", "copilot:opus-4.5"],
      "routine": ["quota_lowest", "copilot:sonnet-4"]
    },
    "premium_sources": {
      "openai_pro": {"enabled": true, "api_key_env": "OPENAI_API_KEY"},
      "claude_max": {"enabled": true, "api_key_env": "ANTHROPIC_API_KEY"},
      "copilot": {"enabled": true, "access_method": "coding_agent"},
      "antigravity": {"enabled": true, "access_method": "cli"}
    }
  }
}
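Because routing entries are plain `source:model` strings, a typo or a disabled source only surfaces at review time. A small startup check can catch this earlier. The sketch below assumes a hypothetical `validate_routing` helper and takes the `routing` and `premium_sources` objects as separate arguments.

```python
def validate_routing(routing: dict, sources: dict) -> list[str]:
    """Return human-readable problems found in the routing table."""
    problems = []
    for task_type, candidates in routing.items():
        for candidate in candidates:
            if candidate == "quota_lowest":
                continue  # resolved at run time from quota state
            source, sep, model = candidate.partition(":")
            if not sep or not model:
                problems.append(f"{task_type}: malformed entry {candidate!r}")
            elif source not in sources:
                problems.append(f"{task_type}: unknown source {source!r}")
            elif not sources[source].get("enabled", False):
                problems.append(f"{task_type}: source {source!r} is disabled")
    return problems
```

An empty list means every routing entry points at a known, enabled source; anything else can be logged or treated as a fatal configuration error.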

8. Local Model Infrastructure

8.1 Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                      LOCAL MODEL INFRASTRUCTURE                              │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│ llama-swap (Port 8080)                                                      │
│                                                                             │
│ Proxy that manages model loading/unloading for llama.cpp                    │
│ - Routes requests to appropriate model                                       │
│ - Frees GPU memory when idle                                                │
│ - Enables model hot-swapping                                                │
│                                                                             │
│ Endpoints:                                                                   │
│   /v1/chat/completions  → Anthropic/OpenAI compatible                       │
│   /completion           → llama.cpp native                                  │
│   /health               → Health check                                      │
│   /swap                 → Model swap control                                │
└─────────────────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│ llama-server            │     │ Ollama (Port 11434)     │
│ GLM-4.7-Flash           │     │ Qwen2.5-Coder-7B        │
│                         │     │                         │
│ Context: 65K default    │     │ Always hot (keep_alive) │
│ VRAM: ~23GB             │     │ VRAM: ~8GB              │
│ Speed: 60-100 tok/s     │     │ Speed: 100+ tok/s       │
└─────────────────────────┘     └─────────────────────────┘

8.2 llama-swap Configuration

Based on the tammam.io guide, llama-swap enables:

  • Model hot-swapping without restart
  • Memory management (free GPU when idle)
  • Anthropic API compatibility (for Claude Code integration)

# llama-swap.yaml
listen: 0.0.0.0:8080

models:
  glm-4.7-flash:
    cmd: >
      llama-server
      -m /models/GLM-4.7-Flash-UD-Q4_K_XL.gguf
      -ngl 99
      -c 65536
      --cache-type-k q4_0
      --cache-type-v q4_0
      -fa
      --mmap
      -t 8
    ttl: 300  # Unload after 5 min idle
    
  glm-4.7-flash-large:
    cmd: >
      llama-server
      -m /models/GLM-4.7-Flash-UD-Q4_K_XL.gguf
      -ngl 99
      -c 131072
      --cache-type-k q4_0
      --cache-type-v q4_0
      -fa
      --mmap
      -t 8
    ttl: 300

aliases:
  claude-3-5-sonnet-20241022: glm-4.7-flash  # Claude Code compatibility
  gpt-4: glm-4.7-flash  # OpenAI compatibility

healthcheck:
  interval: 30
  timeout: 5

8.3 Startup Script

#!/bin/bash
# start-local-stack.sh

set -e

MODELS_DIR="${MODELS_DIR:-$HOME/models}"
LLAMA_SWAP_CONFIG="${LLAMA_SWAP_CONFIG:-./llama-swap.yaml}"

echo "=== Starting Local Model Stack ==="

# 1. Start Ollama for 7B pre-filter
echo "[1/3] Starting Ollama..."
if ! pgrep -x "ollama" > /dev/null; then
    ollama serve &
    sleep 2
fi

# Pull and warm up 7B model
ollama pull qwen2.5-coder:7b-instruct-q8_0 2>/dev/null || true
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b-instruct-q8_0",
  "prompt": "ready",
  "keep_alive": "24h"
}' > /dev/null

echo "  ✓ Ollama ready (7B pre-filter, ~8GB VRAM)"

# 2. Start llama-swap
echo "[2/3] Starting llama-swap..."
if ! pgrep -x "llama-swap" > /dev/null; then
    llama-swap -c "$LLAMA_SWAP_CONFIG" &
    sleep 2
fi

# Wait for llama-swap
until curl -s http://localhost:8080/health > /dev/null 2>&1; do
    sleep 1
done

echo "  ✓ llama-swap ready (port 8080)"

# 3. Pre-warm GLM-4.7-Flash
echo "[3/3] Pre-warming GLM-4.7-Flash..."
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "glm-4.7-flash", "messages": [{"role": "user", "content": "ready"}], "max_tokens": 1}' \
  > /dev/null

echo "  ✓ GLM-4.7-Flash ready (~23GB VRAM)"

echo ""
echo "=== Local Stack Ready ==="
echo "  Pre-filter (7B):   http://localhost:11434"
echo "  GLM-4.7-Flash:     http://localhost:8080"
echo ""
echo "VRAM Usage: ~31GB (7B: 8GB + GLM: 23GB)"
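
The `until curl` loop in the script waits indefinitely if llama-swap never comes up. A bounded wait is one alternative; the Python sketch below assumes the same `/health` URL and uses hypothetical helper names. The probe is passed in as a callable so the retry logic can be tested without a running server.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """True if the URL answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool], attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller might use `wait_until_ready(lambda: http_ok("http://localhost:8080/health"))` and abort startup when it returns `False`.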

8.4 Integration with Claude Code

Using the Anthropic-compatible endpoint from llama-swap:

# Set environment for Claude Code to use local model
export ANTHROPIC_BASE_URL=http://localhost:8080  # no /v1 suffix: the client appends /v1/messages
export ANTHROPIC_API_KEY=local  # Dummy key for llama-swap

# Now Claude Code uses local GLM-4.7-Flash
claude "Review this code..."

8.5 Integration with Codex CLI

# Codex can also use the OpenAI-compatible endpoint
export OPENAI_BASE_URL=http://localhost:8080/v1
export OPENAI_API_KEY=local

# For specific commands, use local
codex --api-base http://localhost:8080/v1 "..."
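
Pipeline stages that do not go through a CLI can call the same OpenAI-compatible endpoint directly. The sketch below uses only the standard library and assumes the `glm-4.7-flash` model name and port from the llama-swap configuration; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "glm-4.7-flash",
    base_url: str = "http://localhost:8080/v1",
    max_tokens: int = 512,
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the prompt to the local model and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs), timeout=120) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Keeping request construction separate from the network call makes the payload easy to inspect and test while the local stack is offline.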

9. Multi-Source Orchestration

9.1 Quota Tracking

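import json
from dataclasses import dataclass
from datetime import datetime
from pathlib import Path


@dataclass
class QuotaStatus:
    """Snapshot of one source's quota; fields mirror what get_status() fills in."""
    source: str
    remaining_pct: float
    is_throttled: bool
    last_request: datetime
    requests_since_reset: int


class QuotaTracker:
    """Track quota usage across all premium sources."""
    
    def __init__(self, state_file: Path = Path(".flow/quota_state.json")):
        self.state_file = state_file
        self.load()
    
    def load(self):
        """Read persisted state; start empty if the file does not exist (minimal sketch)."""
        if self.state_file.exists():
            self.state = json.loads(self.state_file.read_text())
        else:
            self.state = {}
    
    def save(self):
        """Persist state as JSON, creating the parent directory if needed (minimal sketch)."""
        self.state_file.parent.mkdir(parents=True, exist_ok=True)
        self.state_file.write_text(json.dumps(self.state, indent=2))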
    
    def get_status(self, source: str) -> QuotaStatus:
        """Get current quota status for a source."""
        state = self.state.get("sources", {}).get(source, {})
        
        return QuotaStatus(
            source=source,
            remaining_pct=state.get("remaining_pct", 1.0),
            is_throttled=state.get("is_throttled", False),
            last_request=datetime.fromisoformat(state.get("last_request", "1970-01-01")),
            requests_since_reset=state.get("requests_since_reset", 0),
        )
    
    def record_usage(self, source: str, tokens: int, was_throttled: bool = False):
        """Record usage for quota tracking."""
        if "sources" not in self.state:
            self.state["sources"] = {}
        
        if source not in self.state["sources"]:
            self.state["sources"][source] = {
                "remaining_pct": 1.0,
                "requests_since_reset": 0,
                "tokens_since_reset": 0,
            }
        
        s = self.state["sources"][source]
        s["last_request"] = datetime.now().isoformat()
        s["requests_since_reset"] = s.get("requests_since_reset", 0) + 1
        s["tokens_since_reset"] = s.get("tokens_since_reset", 0) + tokens
        s["is_throttled"] = was_throttled
        
        # Estimate remaining (heuristic based on usage patterns)
        s["remaining_pct"] = self._estimate_remaining(source, s)
        
        self.save()
    
    def get_best_for_task_type(self, task_type: str) -> list[tuple[str, str]]:
        """Get best available sources for task type, sorted by preference."""
        routing = self._get_routing_config()
        candidates = routing.get(task_type, [])
        
        available = []
        for candidate in candidates:
            if candidate == "quota_lowest":
                # Special: "quota_lowest" means lowest quota usage, i.e. the
                # source with the most quota remaining
                best = self._get_highest_remaining()
                if best:
                    available.append(best)
            else:
                # Split on the first colon only, so model names may contain ":"
                source, model = candidate.split(":", 1)
                status = self.get_status(source)
                if not status.is_throttled and status.remaining_pct > 0.05:
                    available.append((source, model))
        
        return available
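
The tracker above relies on `_estimate_remaining` and `_get_highest_remaining`, which the spec does not define. One possible shape is sketched below as standalone functions. The per-source token budget is an assumed input (subscription plans do not necessarily publish one), and this version returns only the source name, leaving model choice to the caller.

```python
from typing import Optional


def estimate_remaining(tokens_since_reset: int, token_budget: int) -> float:
    """Fraction of an assumed token budget still unused, clamped to [0, 1]."""
    if token_budget <= 0:
        return 0.0
    used = tokens_since_reset / token_budget
    return max(0.0, min(1.0, 1.0 - used))


def highest_remaining(sources: dict[str, dict]) -> Optional[str]:
    """Name of the unthrottled source with the most quota left, or None."""
    usable = {
        name: state.get("remaining_pct", 1.0)
        for name, state in sources.items()
        if not state.get("is_throttled", False)
    }
    if not usable:
        return None
    return max(usable, key=usable.get)
```

A throttling response from a provider is a stronger signal than any estimate, which is why throttled sources are excluded outright rather than merely ranked lower.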

9.2 Backend Implementations

9.2.1 OpenAI Backend

import os

from openai import OpenAI


class OpenAIBackend(PremiumBackend):
    """OpenAI Pro backend (Codex 5.2, GPT-4o)."""
    
    source_name = "openai_pro"
    
    def __init__(self):
        self.api_key = os.environ.get("OPENAI_API_KEY")
        self.client = OpenAI(api_key=self.api_key)
    
    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        model = "gpt-5.2" if task_type == "security" else "gpt-4o"
        
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": context.prompt}],
            temperature=0.1,
            max_tokens=2048,
        )
        
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=response.choices[0].message.content,
            tokens_input=response.usage.prompt_tokens,
            tokens_output=response.usage.completion_tokens,
        )

9.2.2 Claude Backend

import os

from anthropic import Anthropic


class ClaudeBackend(PremiumBackend):
    """Claude Max backend (Opus 4.5, Sonnet 4)."""
    
    source_name = "claude_max"
    
    def __init__(self):
        self.api_key = os.environ.get("ANTHROPIC_API_KEY")
        self.client = Anthropic(api_key=self.api_key)
    
    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        # Opus 4.5 for security/complex reviews, Sonnet 4 otherwise
        model = "claude-opus-4-5-20251101" if task_type in ["security", "complex"] else "claude-sonnet-4-20250514"
        
        response = self.client.messages.create(
            model=model,
            max_tokens=2048,
            messages=[{"role": "user", "content": context.prompt}],
        )
        
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=response.content[0].text,
            tokens_input=response.usage.input_tokens,
            tokens_output=response.usage.output_tokens,
        )

9.2.3 Copilot Backend

class CopilotBackend(PremiumBackend):
    """GitHub Copilot Pro+ backend."""
    
    source_name = "copilot"
    
    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        # Copilot accessed via coding agent
        model = self._select_model(task_type)
        
        # Use GitHub Copilot Coding Agent API
        result = self._invoke_coding_agent(context, model)
        
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=result.content,
            tokens_input=result.tokens_input,
            tokens_output=result.tokens_output,
        )
    
    def _select_model(self, task_type: str) -> str:
        if task_type == "security":
            return "codex-5.2"
        elif task_type == "complex":
            return "claude-opus-4-5"
        else:
            return "claude-sonnet-4"

9.2.4 Antigravity Backend

class AntigravityBackend(PremiumBackend):
    """Google Antigravity backend (Opus 4.5, Gemini 3 Pro)."""
    
    source_name = "antigravity"
    
    def review(self, context: ReviewContext, task_type: str) -> ReviewResult:
        model = "claude-opus-4-5" if task_type in ["security", "complex"] else "gemini-3-pro"
        
        # Use Antigravity CLI
        result = self._invoke_antigravity(context, model)
        
        return ReviewResult(
            source=self.source_name,
            model=model,
            content=result.content,
            tokens_input=result.tokens_input,
            tokens_output=result.tokens_output,
        )
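
The four backends share one `review` interface, so the orchestrator can walk the ordered candidate list from the quota tracker and fall through on failure. The sketch below assumes a hypothetical `BackendError` exception and a `call` function that dispatches to the matching backend; neither is defined by the spec.

```python
from typing import Callable


class BackendError(Exception):
    """Raised when a premium source fails or is throttled (assumed exception type)."""


def review_with_fallback(
    candidates: list[tuple[str, str]],
    call: Callable[[str, str], str],
) -> tuple[str, str, str]:
    """Try each (source, model) in order; return the first successful verdict."""
    errors = []
    for source, model in candidates:
        try:
            return source, model, call(source, model)
        except BackendError as exc:
            # Record the failure and move on to the next source.
            errors.append(f"{source}:{model}: {exc}")
    raise BackendError("all premium sources failed: " + "; ".join(errors))
```

Raising when every source fails keeps the principle that premium models make the final decision: the caller can retry later instead of silently accepting a local verdict.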

10. Configuration Reference

10.1 Complete .flow/config.json

{
  "schema_version": 2,
  
  "interview": {
    "question_generation": {
      "backend": "local",
      "model": "GLM-4.7-Flash"
    },
    "spec_drafting": {
      "backend": "local"
    }
  },
  
  "plan": {
    "scouts": {
      "repo_scout": {"backend": "local", "cache_ttl_hours": 24},
      "practice_scout": {"backend": "local", "cache_ttl_hours": 168},
      "docs_scout": {"backend": "local", "cache_ttl_hours": 24},
      "github_scout": {"backend": "premium", "model": "sonnet-4"},
      "epic_scout": {"backend": "local"},
      "docs_gap_scout": {"backend": "local"}
    },
    "spec_drafting": {"backend": "local"},
    "task_breakdown": {"backend": "local"}
  },
  
  "work": {
    "model_selection": {
      "strategy": "complexity_based",
      "low_complexity": {"backend": "local"},
      "medium_complexity": {"backend": "hybrid"},
      "high_complexity": {"backend": "premium"}
    },
    "reanchor": {"backend": "local"}
  },
  
  "review": {
    "gates": {
      "enabled": true,
      "checks": ["lint", "build", "typecheck"]
    },
    "prescreen": {
      "enabled": true,
      "backend": "local",
      "max_iterations": 3
    },
    "fingerprint": {
      "enabled": true,
      "skip_unchanged": true,
      "require_blocker_change": true
    },
    "classification": {
      "security_patterns": [
        "**/auth/**", "**/crypto/**", "**/security/**",
        "**/*password*", "**/*secret*", "**/*token*"
      ],
      "complex_patterns": [
        "**/core/**", "**/architect*", "**/concurrent*"
      ]
    },
    "routing": {
      "security": ["openai_pro:codex-5.2", "copilot:codex-5.2"],
      "complex": ["claude_max:opus-4.5", "copilot:opus-4.5", "antigravity:opus-4.5"],
      "routine": ["quota_lowest", "copilot:sonnet-4", "antigravity:gemini-3-pro"]
    }
  },
  
  "premium_sources": {
    "openai_pro": {
      "enabled": true,
      "api_key_env": "OPENAI_API_KEY",
      "models": ["gpt-5.2", "codex-5.2", "gpt-4o"]
    },
    "claude_max": {
      "enabled": true,
      "api_key_env": "ANTHROPIC_API_KEY",
      "models": ["claude-opus-4-5-20251101", "claude-sonnet-4-20250514"]
    },
    "copilot": {
      "enabled": true,
      "access_method": "coding_agent",
      "models": ["claude-opus-4-5", "codex-5.2", "claude-sonnet-4"]
    },
    "antigravity": {
      "enabled": true,
      "access_method": "cli",
      "models": ["claude-opus-4-5", "gemini-3-pro"]
    }
  },
  
  "local": {
    "primary_model": "GLM-4.7-Flash-UD-Q4_K_XL",
    "primary_endpoint": "http://localhost:8080",
    "context_default": 65536,
    "context_max": 131072,
    "prefilter_model": "qwen2.5-coder:7b-instruct-q8_0",
    "prefilter_endpoint": "http://localhost:11434"
  },
  
  "memory": {
    "enabled": true
  }
}
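
Since the file carries a `schema_version`, a loader can refuse versions it does not understand and expose nested settings with defaults. This is a minimal sketch with hypothetical helper names (`load_config`, `get_path`); it takes the JSON text as input so reading `.flow/config.json` stays with the caller.

```python
import json
from typing import Any

SUPPORTED_SCHEMA = 2  # matches "schema_version" in the file above


def load_config(text: str) -> dict:
    """Parse config JSON and reject unsupported schema versions."""
    cfg = json.loads(text)
    version = cfg.get("schema_version")
    if version != SUPPORTED_SCHEMA:
        raise ValueError(f"unsupported schema_version: {version!r}")
    return cfg


def get_path(cfg: dict, dotted: str, default: Any = None) -> Any:
    """Look up a nested key such as 'review.prescreen.max_iterations'."""
    node: Any = cfg
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node
```

For instance, `get_path(cfg, "review.prescreen.max_iterations", 3)` returns the configured value or falls back to 3 when the key is absent.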

10.2 Environment Variables (config.env for Ralph)

# === FLOW-NEXT TOKEN-OPTIMIZED CONFIGURATION ===

# --- LOCAL MODELS ---
LOCAL_ENABLED=1
LOCAL_ENDPOINT=http://localhost:8080
LOCAL_MODEL=GLM-4.7-Flash-UD-Q4_K_XL
LOCAL_CONTEXT_DEFAULT=65536

PREFILTER_ENABLED=1
PREFILTER_ENDPOINT=http://localhost:11434
PREFILTER_MODEL=qwen2.5-coder:7b-instruct-q8_0

# --- PREMIUM SOURCES ---
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

OPENAI_PRO_ENABLED=1
CLAUDE_MAX_ENABLED=1
COPILOT_ENABLED=1
ANTIGRAVITY_ENABLED=1

# --- REVIEW PIPELINE ---
GATES_ENABLED=1
PRESCREEN_ENABLED=1
PRESCREEN_MAX_ITERATIONS=3
FINGERPRINT_ENABLED=1
FINGERPRINT_SKIP_UNCHANGED=1

# --- WORK PIPELINE ---
WORK_
