
Conversation


@kantord kantord commented Dec 19, 2025

Summary

This PR adds E2E tests for the AI assistant chat feature using Ollama with a tiny local model (qwen2.5:1.5b), proving that real integration testing of LLM-powered features is feasible in CI without API keys or mocking.

This is an experiment to validate that we can test the chat feature end-to-end with a real (but small) language model, ensuring the full streaming pipeline works correctly.

Changes

  • Add ai-sdk-ollama provider with conditional switching in /api/chat/route.ts (sketched after this list)
  • Add Playwright tests for assistant chat (tests/e2e/assistant.spec.ts)
  • Configure E2E workflow to install and cache Ollama binary + models
  • Add POC workflow for benchmarking Ollama in GitHub Actions
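As a rough sketch of what the conditional switching could look like (the ai-sdk-ollama and OpenRouter provider calls are paraphrased from their documented APIs, the env var names mirror those discussed in this PR, and the production model id is hypothetical; the actual route.ts may differ):

```typescript
// route.ts (sketch): pick the model provider based on USE_OLLAMA.
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { ollama } from "ai-sdk-ollama";

const OLLAMA_MODEL = "qwen2.5:1.5b";
const useOllama = process.env.USE_OLLAMA === "true";

// Ollama serves requests locally (default port 11434), so no API key is
// needed; the OpenRouter path keeps production behavior unchanged.
const model = useOllama
  ? ollama(OLLAMA_MODEL)
  : createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY ?? "" }).chat(
      "anthropic/claude-3.5-sonnet", // hypothetical production model id
    );
```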

Won't this make tests super slow?

No. With proper caching, Ollama adds minimal overhead:

  • Binary install: Cached, ~0s on subsequent runs
  • Model download: Cached (qwen2.5:1.5b is ~1GB), ~0s on subsequent runs
  • Model warmup: 2-8s (loading into memory)
  • Inference per test: 2-5s for simple prompts

Total added time: ~10-20s per CI run after caches are warm. Not done in this PR, but this should be 100% parallelizable with other test cases, so effectively it should have no harmful effect on CI performance, especially when used sparingly for a few important test cases.

Won't this make tests super flaky?

The tests are designed to minimize flakiness:

  • Low temperature (0): More deterministic outputs
  • Flexible assertions: Tests check for patterns (e.g., Hello.*username) rather than exact strings
  • Warmup phase: Model is loaded before tests run, avoiding cold-start/performance-related flakiness

The model is small but capable enough for simple instruction-following. Tests don't rely on specific wording, just that the model echoes back identifiers we provide.
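As an illustration of the warmup phase, a minimal sketch (the /api/generate endpoint and payload follow Ollama's standard API; the Playwright structure here is assumed rather than copied from assistant.spec.ts):

```typescript
import { test, request } from "@playwright/test";

// Warm the model once per suite so the first real test doesn't absorb the
// cold-start cost. Ollama loads the model into memory on the first generate
// call and keeps it resident for subsequent requests.
test.beforeAll(async () => {
  const ctx = await request.newContext();
  await ctx.post("http://localhost:11434/api/generate", {
    data: { model: "qwen2.5:1.5b", prompt: "ping", stream: false },
    timeout: 120_000, // generous: cold start can be slow on CI runners
  });
  await ctx.dispose();
});
```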

What about testing actual tool calls?

This PR intentionally skips MCP tool fetching in Ollama mode to focus on proving the concept first. However, testing real tool calls is absolutely possible: it should be easy to craft prompts that do not cause test flakes but still prove the health of the tool-call workflow.
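For example, a smoke test along these lines could exercise tool calls deterministically. Everything here is illustrative: the fixtures module, sendMessage helper, and echo tool are hypothetical, not part of this PR.

```typescript
import { test, expect } from "./fixtures"; // hypothetical: exports the authenticatedPage fixture
import { sendMessage } from "./helpers"; // hypothetical chat-input helper

test("model invokes a simple echo tool", async ({ authenticatedPage }) => {
  // Naming the tool and a unique marker in the prompt makes even a small
  // model call it reliably; the assertion only checks that the marker
  // round-trips through the tool-call path.
  await sendMessage(authenticatedPage, "Use the echo tool to repeat: marker-4711");
  await expect(authenticatedPage.getByText(/marker-4711/)).toBeVisible({
    timeout: 60_000,
  });
});
```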

Once this approach is approved, extending it to test the full MCP integration would be straightforward.

[screenshot: screenshot-2025-12-19_20-01-34]

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Dec 19, 2025
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Dec 19, 2025

kantord commented Dec 19, 2025

@claude can you please do a deep review of this pr

claude bot commented Dec 19, 2025

Claude finished @kantord's task.


Deep Review of PR #199

Todo List:

  • Read project guidelines (CLAUDE.md)
  • Review screenshot to understand UI context
  • Review changed workflow files
  • Review package.json and lock file changes
  • Review Playwright config changes
  • Review API route changes
  • Review new E2E test file
  • Analyze adherence to project standards
  • Provide comprehensive feedback

Overview

This PR adds Ollama integration for E2E testing of the AI assistant feature, enabling deterministic, API-key-free testing in CI. The implementation is well-structured and solves a real problem. Below is my detailed analysis.

🎯 Purpose & Value

What this PR accomplishes:

  • ✅ Enables E2E testing of the assistant chat feature without requiring OpenRouter API keys
  • ✅ Uses local Ollama (qwen2.5:1.5b model) for deterministic, reproducible test results
  • ✅ Implements aggressive caching to minimize CI run time and avoid repeated model downloads
  • ✅ Provides both a POC workflow and integration into the main E2E workflow

Business value:

  • Removes external API dependencies from CI pipeline
  • Reduces test flakiness from external API rate limits/failures
  • Enables offline development and testing
  • Significantly cheaper than cloud-based AI API calls in CI

📋 Code Quality Assessment

✅ Strengths

  1. Excellent caching strategy: The workflow caches both Ollama binary and models separately with clear versioning
  2. Good separation of concerns: POC workflow is separate from main E2E workflow
  3. Proper conditional logic: USE_OLLAMA flag cleanly switches between Ollama and OpenRouter
  4. Temperature control: Uses temperature: 0 in test mode for deterministic responses
  5. Clean test structure: Tests include warmup phase and proper async handling
  6. Good error handling: Proper timeout handling for model warmup

⚠️ Issues & Concerns

🔴 CRITICAL: Type Safety Violation

File: src/app/api/chat/route.ts (line 67)

tools[toolName] = toolDef as ToolSet[string];

Problem: Using as type assertion bypasses TypeScript's type checking. This violates the project's strict "no any" rule in spirit (type assertions are similar to any - they tell TypeScript "trust me").

Why this matters: If toolDef doesn't actually match ToolSet[string], this could cause runtime errors that TypeScript won't catch.

Recommendation:

// Option 1: Add proper type guard
function isValidToolDef(value: unknown): value is ToolSet[string] {
  // Add runtime checks here
  return typeof value === 'object' && value !== null && 'execute' in value;
}

for (const [toolName, toolDef] of Object.entries(serverTools)) {
  if (isValidToolDef(toolDef)) {
    tools[toolName] = toolDef;
  } else {
    console.warn(`Invalid tool definition for ${toolName}`);
  }
}

// Option 2: Fix the type at the source
// If serverTools is known to be ToolSet, update the McpClient type definition

Note: This issue exists in the base code, not introduced by this PR, but should be addressed.


🟡 MODERATE: Missing Dependencies

File: package.json

Problem: The PR adds ai-sdk-ollama but doesn't show that the ollama binary is properly installed/available in dev environments.

Recommendation:

  • Add documentation to README/CLAUDE.md about installing Ollama locally for development
  • Add a check in pnpm dev to warn if Ollama isn't available but USE_OLLAMA=true (see the sketch after this list)
  • Consider adding a pnpm ollama:install script that provides instructions
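A sketch of that availability check (hypothetical helper, e.g. run from a predev script; Ollama's root endpoint responds when the server is up):

```typescript
// Warn early in dev when tests are pointed at Ollama but no server is running.
async function warnIfOllamaMissing(): Promise<void> {
  if (process.env.USE_OLLAMA !== "true") return;
  try {
    const res = await fetch("http://localhost:11434/");
    if (!res.ok) throw new Error(`unexpected status ${res.status}`);
  } catch {
    console.warn(
      "[dev] USE_OLLAMA=true but no Ollama server at localhost:11434 - run `ollama serve`",
    );
  }
}
```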

🟡 MODERATE: Incomplete Test Coverage

File: tests/e2e/assistant.spec.ts

Observations:

  1. Only tests basic chat functionality - doesn't test MCP tool integration
  2. No tests for error scenarios (e.g., Ollama not available)
  3. Tests rely on regex matching which might be fragile across different model responses

Current tests:

  • ✅ Basic greeting response (checks for username in response)
  • ✅ Streaming response (checks for numbers)

Missing tests:

  • ❌ MCP server selection and tool usage (though this is skipped in Ollama mode, should still have tests)
  • ❌ Error handling (API failures, timeouts)
  • ❌ Message history and context preservation
  • ❌ Model selection UI
  • ❌ Clear chat functionality

Recommendation:

test("handles missing Ollama gracefully", async ({ authenticatedPage }) => {
  // Test error state when Ollama is unavailable
});

test("preserves conversation context", async ({ authenticatedPage }) => {
  // Send multiple messages and verify context is maintained
});

test("allows model selection", async ({ authenticatedPage }) => {
  // Test the model selector dropdown
});

🟡 MODERATE: Hardcoded Values

Files: Multiple

Problem: Model name qwen2.5:1.5b is hardcoded in several places:

  • .github/workflows/e2e.yml (line 30)
  • .github/workflows/ollama-poc.yml (lines 36, 60)
  • playwright.config.mts (line 53)
  • src/app/api/chat/route.ts (line 20)
  • tests/e2e/assistant.spec.ts (line 4)
  • package.json (lines 19, 20)

Recommendation:
Create a constants file or centralize in one config location:

// config/testing.ts
export const E2E_TEST_CONFIG = {
  OLLAMA_MODEL: "qwen2.5:1.5b",
  OLLAMA_BASE_URL: "http://localhost:11434",
  OLLAMA_CACHE_VERSION: "v3",
} as const;

Then import from this single source of truth.
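Usage would then look something like this (import paths and the ollama call shape are assumptions):

```typescript
import { ollama } from "ai-sdk-ollama";
import { E2E_TEST_CONFIG } from "@/config/testing";

// One place to bump the model or cache version for every consumer.
const model = ollama(E2E_TEST_CONFIG.OLLAMA_MODEL);
```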


🟢 MINOR: Workflow Efficiency

File: .github/workflows/e2e.yml

Observation: The workflow pulls the model on every run, even with caching. While the cache works, it would be cleaner to check whether the model exists before pulling.

Current approach:

- name: Pull Ollama model
  run: OLLAMA_MODELS=/home/runner/.ollama/models ollama pull qwen2.5:1.5b

Suggested improvement:

- name: Pull Ollama model (if needed)
  run: |
    OLLAMA_MODELS=/home/runner/.ollama/models ollama list | grep -q "qwen2.5:1.5b" || \
    OLLAMA_MODELS=/home/runner/.ollama/models ollama pull qwen2.5:1.5b

This is minor since ollama pull is smart about not re-downloading, but explicit checks are clearer.


🟢 MINOR: Test Timeout Strategy

File: tests/e2e/assistant.spec.ts

Observation: Tests use 60-second timeouts for AI responses. While reasonable, this could be optimized.

Suggestions:

  1. Consider making timeout configurable based on model size
  2. Add timeout monitoring to detect when model performance degrades
  3. Document expected response times in comments
// Expected: 5-15s for first response (model already warmed)
// Timeout: 60s to account for CI resource variance
await expect(
  authenticatedPage.getByText(new RegExp(`Hello.*${testUsername}`, "i")),
).toBeVisible({
  timeout: 60_000,
});

🟢 MINOR: Console Logging

File: src/app/api/chat/route.ts

Observation: Uses console.log() and console.error() directly. The project should standardize on a logging approach.

Current:

console.log(`[Chat API] Using Ollama model: ${OLLAMA_MODEL}`);
console.error("[Chat API] OPENROUTER_API_KEY not configured");

Suggestion: Consider a structured logging library or at least consistent prefixing throughout the codebase.
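A minimal sketch of consistent prefixing (illustrative only; a structured logger such as pino would be the heavier-weight alternative):

```typescript
// Tiny scoped logger that keeps the existing "[Chat API]" prefix convention
// without adding a dependency.
const createLogger = (scope: string) => ({
  info: (msg: string) => console.log(`[${scope}] ${msg}`),
  error: (msg: string) => console.error(`[${scope}] ${msg}`),
});

const log = createLogger("Chat API");
const OLLAMA_MODEL = "qwen2.5:1.5b";
log.info(`Using Ollama model: ${OLLAMA_MODEL}`);
log.error("OPENROUTER_API_KEY not configured");
```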


🏗️ Architecture & Design

✅ Good Architectural Decisions

  1. Conditional execution: Using USE_OLLAMA env var is clean and doesn't affect production code paths
  2. Fallback logic: Gracefully handles missing MCP servers in test mode
  3. Temperature control: Setting temperature: 0 for deterministic test results is correct
  4. Model warmup: Pre-warming the model in beforeAll is smart and prevents first-test timeouts

🤔 Architectural Questions

  1. Why skip MCP tools entirely in Ollama mode?

    • The code skips MCP tool fetching when USE_OLLAMA=true
    • This means tests don't exercise the full integration path
    • Consider: Could you mock MCP servers for more complete E2E coverage?
  2. Is qwen2.5:1.5b the right model choice?

    • Pro: Small, fast, good for CI
    • Con: May not represent production behavior accurately
    • Consider: Document why this model was chosen and what limitations it has
  3. Should Ollama integration be more than just testing?

    • Could this be a production feature? (e.g., for local/air-gapped deployments)
    • If yes: The current implementation is too test-focused
    • If no: Current design is appropriate

📚 Documentation & Maintainability

Missing Documentation

  1. No README/docs updates: The CLAUDE.md wasn't updated to mention Ollama setup
  2. No comments explaining the POC workflow purpose: Why have both e2e.yml and ollama-poc.yml?
  3. No migration guide: What happens to existing E2E tests? Do they still work with OpenRouter?

Recommended Documentation Additions

In CLAUDE.md:

### E2E Testing with Ollama

For E2E testing of the AI assistant feature, we use Ollama with the qwen2.5:1.5b model:

**Installation (for local development):**

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the test model
ollama pull qwen2.5:1.5b

# Start Ollama server (usually starts automatically)
ollama serve
```

**Running tests with Ollama:**

```bash
pnpm test:e2e:ollama     # Run all E2E tests with Ollama
pnpm test:e2e:ollama:ui  # Open Playwright UI with Ollama
```

**Why Ollama?**

- No API keys required
- Deterministic responses with temperature=0
- Faster and cheaper than cloud APIs
- Works offline
🔒 Security Considerations

✅ Secure Practices

  1. Test-only credentials are clearly marked (e.g., test-only-not-a-real-secret)
  2. No production secrets in workflow files
  3. Proper authentication checks maintained in API route

🤔 Considerations

  1. Model supply chain: Ollama downloads models from the internet. Consider:
     • Verifying model checksums
     • Documenting the source of truth for model versions
     • Having a backup plan if Ollama's registry is unavailable

🧪 Testing Strategy

Current Test Philosophy

The tests are behavioral/integration tests rather than unit tests. This is appropriate for E2E, but consider:

  1. Test pyramid: These E2E tests are slow and expensive. Ensure you also have:
     • Unit tests for individual functions
     • Integration tests for API routes
     • Component tests for UI elements
  2. Test data: The tests use dynamic usernames and simple prompts. Consider:
     • Creating fixtures with known-good responses
     • Testing edge cases (very long messages, special characters, etc.)

🚀 Performance Impact

CI Performance

Positive:

  • Caching is well-implemented (binary + model caching)
  • Model warmup happens once per test suite
  • Parallel test execution should work fine

Concerns:

  • First run (cache miss) will be slow (~2-3 minutes to download model)
  • Model size (qwen2.5:1.5b) is ~900MB - ensure GitHub Actions has sufficient storage
  • Each workflow run starts the Ollama service - consider the cost of this overhead

Measurements from screenshot:

  • Test passed in 4.9s (after warmup)
  • Warmup took 1.3s (with warm cache)
  • Overall reasonable performance ✅

📊 Comparison: e2e.yml vs ollama-poc.yml

| Aspect | e2e.yml | ollama-poc.yml |
|--------|---------|----------------|
| Purpose | Main E2E tests (all features) | POC/demo for Ollama integration |
| Trigger | Push/PR to any branch | Manual + when workflow file changes |
| Timeout | 15 minutes | 10 minutes |
| Tests | All E2E tests | Only assistant.spec.ts |
| Artifacts | On failure only | On failure only |
| Extras | None | Benchmark output, debug info |

Recommendation: Clarify the purpose of having both workflows:

  • Is ollama-poc.yml temporary (for this PR only)?
  • Should it be merged into e2e.yml after testing?
  • Or should it remain as a separate "smoke test" workflow?

✅ Adherence to Project Guidelines

Checking against CLAUDE.md requirements:

| Guideline | Status | Notes |
|-----------|--------|-------|
| No any types | ⚠️ PARTIAL | Uses as assertion (see Critical Issues above) |
| Async/await over promises | ✅ PASS | All async code uses async/await |
| Server Components first | ✅ N/A | No component changes in this PR |
| Use hey-api hooks | ✅ N/A | No API client changes |
| Error handling | ✅ PASS | Proper try/catch and error responses |
| TypeScript strict mode | ⚠️ PARTIAL | Type assertion violates spirit of strict mode |
| Run linter | ❓ UNKNOWN | Should verify with pnpm lint |
| Write tests | ✅ PASS | New tests added |
| JSDoc for complex functions | ⚠️ PARTIAL | getMcpTools and connectToMcpRemote are well-documented, but the warmup function could use JSDoc |

🎨 Code Style & Consistency

✅ Good Practices

  1. Consistent naming conventions (useOllama, OLLAMA_MODEL)
  2. Clear function names (warmupOllama, connectToMcpRemote)
  3. Proper async error handling
  4. Good use of TypeScript interfaces

🟡 Minor Style Issues

  1. Inconsistent comment style:

// Some comments are full sentences with periods.
// Some comments don't have periods

  2. Magic numbers without constants:

await expect(...).toBeVisible({ timeout: 60_000 }); // Why 60s specifically?
🔄 Suggested Improvements Priority

🔴 Must Fix (Before Merge)

  1. Address type assertion in connectToMcpRemote - add proper type guard or fix at source
  2. Run linter/type-check - Ensure pnpm lint and pnpm type-check pass
  3. Clarify workflow strategy - Document whether ollama-poc.yml is temporary or permanent

🟡 Should Fix (Near Term)

  1. Add documentation - Update CLAUDE.md with Ollama setup instructions
  2. Centralize constants - Extract hardcoded model names to config
  3. Expand test coverage - Add tests for error scenarios and edge cases
  4. Add JSDoc - Document complex functions, especially test helpers

🟢 Nice to Have (Future)

  1. Optimize workflow - Add explicit model existence checks
  2. Structured logging - Replace console.log with proper logging
  3. Test data fixtures - Create known-good test cases
  4. Model validation - Add checksum verification for downloaded models

🎯 Final Recommendation

Overall Assessment: This is a solid proof-of-concept that adds real value to the testing infrastructure. The core implementation is sound, but there are some issues to address before merging.

Verdict: ⚠️ APPROVE WITH CHANGES

Required Changes:

  1. Fix the type assertion issue (Critical)
  2. Run and pass pnpm lint and pnpm type-check (Critical)
  3. Add basic documentation to CLAUDE.md (Should Fix)
  4. Clarify the two-workflow strategy (Should Fix)

After These Changes:

This PR will be ready to merge and will provide a strong foundation for E2E testing without external API dependencies.


💬 Questions for the Author

  1. Is ollama-poc.yml intended to stay long-term, or should it be merged into e2e.yml after validation?
  2. Have you tested this on a clean CI environment (cache miss) to verify download times are acceptable?
  3. Are there plans to add MCP mock servers for more complete E2E coverage?
  4. What's the long-term vision - is Ollama just for testing, or could it become a production feature?

Great work on this implementation! The caching strategy is particularly well done, and the separation of concerns is clean. With the suggested fixes, this will be a valuable addition to the project. 🚀



@kantord kantord changed the title from "poc" to "test: end-to-end testing for chat features" Dec 19, 2025
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/M Medium PR: 300-599 lines changed labels Dec 19, 2025