
Conversation


@kantord kantord commented Dec 19, 2025

Summary

This PR adds E2E tests for the AI assistant chat feature using Ollama with a tiny local model (qwen2.5:1.5b), proving that real integration testing of LLM-powered features is feasible in CI without API keys or mocking.

This is an experiment to validate that we can test the chat feature end-to-end with a real (but small) language model, ensuring the full streaming pipeline works correctly.

Changes

  • Add ai-sdk-ollama provider with conditional switching in /api/chat/route.ts (sketched after this list)
  • Add Playwright tests for assistant chat (tests/e2e/assistant.spec.ts)
  • Configure E2E workflow to install and cache Ollama binary + models
  • Add POC workflow for benchmarking Ollama in GitHub Actions
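As a rough sketch of what the conditional switching could look like (the ai-sdk-ollama and OpenRouter provider calls are paraphrased from their documented APIs, the env var names mirror those discussed in this PR, and the production model id is hypothetical; the actual route.ts may differ):

```typescript
// route.ts (sketch): pick the model provider based on USE_OLLAMA.
import { createOpenRouter } from "@openrouter/ai-sdk-provider";
import { ollama } from "ai-sdk-ollama";

const OLLAMA_MODEL = "qwen2.5:1.5b";
const useOllama = process.env.USE_OLLAMA === "true";

// Ollama serves requests locally (default port 11434), so no API key is
// needed; the OpenRouter path keeps production behavior unchanged.
const model = useOllama
  ? ollama(OLLAMA_MODEL)
  : createOpenRouter({ apiKey: process.env.OPENROUTER_API_KEY ?? "" }).chat(
      "anthropic/claude-3.5-sonnet", // hypothetical production model id
    );
```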

Won't this make tests super slow?

No. With proper caching, Ollama adds minimal overhead:

  • Binary install: Cached, ~0s on subsequent runs
  • Model download: Cached (qwen2.5:1.5b is ~1GB), ~0s on subsequent runs
  • Model warmup: 2-8s (loading into memory)
  • Inference per test: 2-5s for simple prompts

Total added time: ~10-20s per CI run after caches are warm. Not done in this PR, but this should be 100% parallelizable with other test cases, so effectively it should have no harmful effect on CI performance, especially when used sparingly for a few important test cases.

Won't this make tests super flaky?

The tests are designed to minimize flakiness:

  • Low temperature (0): More deterministic outputs
  • Flexible assertions: Tests check for patterns (e.g., Hello.*username) rather than exact strings
  • Warmup phase: Model is loaded before tests run, avoiding cold-start/performance-related flakiness

The model is small but capable enough for simple instruction-following. Tests don't rely on specific wording, just that the model echoes back identifiers we provide.
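As an illustration of the warmup phase, a minimal sketch (the /api/generate endpoint and payload follow Ollama's standard API; the Playwright structure here is assumed rather than copied from assistant.spec.ts):

```typescript
import { test, request } from "@playwright/test";

// Warm the model once per suite so the first real test doesn't absorb the
// cold-start cost. Ollama loads the model into memory on the first generate
// call and keeps it resident for subsequent requests.
test.beforeAll(async () => {
  const ctx = await request.newContext();
  await ctx.post("http://localhost:11434/api/generate", {
    data: { model: "qwen2.5:1.5b", prompt: "ping", stream: false },
    timeout: 120_000, // generous: cold start can be slow on CI runners
  });
  await ctx.dispose();
});
```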

What about testing actual tool calls?

This PR intentionally skips MCP tool fetching in Ollama mode to focus on proving the concept first. However, testing real tool calls is absolutely possible: it should be easy to craft prompts that do not cause test flakes but still prove the health of the tool-call workflow.
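For example, a smoke test along these lines could exercise tool calls deterministically. Everything here is illustrative: the fixtures module, sendMessage helper, and echo tool are hypothetical, not part of this PR.

```typescript
import { test, expect } from "./fixtures"; // hypothetical: exports the authenticatedPage fixture
import { sendMessage } from "./helpers"; // hypothetical chat-input helper

test("model invokes a simple echo tool", async ({ authenticatedPage }) => {
  // Naming the tool and a unique marker in the prompt makes even a small
  // model call it reliably; the assertion only checks that the marker
  // round-trips through the tool-call path.
  await sendMessage(authenticatedPage, "Use the echo tool to repeat: marker-4711");
  await expect(authenticatedPage.getByText(/marker-4711/)).toBeVisible({
    timeout: 60_000,
  });
});
```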

Once this approach is approved, extending it to test the full MCP integration would be straightforward.

[screenshot: screenshot-2025-12-19_20-01-34]

@github-actions github-actions bot added the size/S Small PR: 100-299 lines changed label Dec 19, 2025
@github-actions github-actions bot added size/M Medium PR: 300-599 lines changed and removed size/S Small PR: 100-299 lines changed labels Dec 19, 2025

kantord commented Dec 19, 2025

@claude can you please do a deep review of this pr

claude bot commented Dec 19, 2025

Claude finished @kantord's task.


Deep Review of PR #199

Todo List:

  • Read project guidelines (CLAUDE.md)
  • Review screenshot to understand UI context
  • Review changed workflow files
  • Review package.json and lock file changes
  • Review Playwright config changes
  • Review API route changes
  • Review new E2E test file
  • Analyze adherence to project standards
  • Provide comprehensive feedback

Overview

This PR adds Ollama integration for E2E testing of the AI assistant feature, enabling deterministic, API-key-free testing in CI. The implementation is well-structured and solves a real problem. Below is my detailed analysis.

🎯 Purpose & Value

What this PR accomplishes:

  • ✅ Enables E2E testing of the assistant chat feature without requiring OpenRouter API keys
  • ✅ Uses local Ollama (qwen2.5:1.5b model) for deterministic, reproducible test results
  • ✅ Implements aggressive caching to minimize CI run time and avoid repeated model downloads
  • ✅ Provides both a POC workflow and integration into the main E2E workflow

Business value:

  • Removes external API dependencies from CI pipeline
  • Reduces test flakiness from external API rate limits/failures
  • Enables offline development and testing
  • Significantly cheaper than cloud-based AI API calls in CI

📋 Code Quality Assessment

✅ Strengths

  1. Excellent caching strategy: The workflow caches both Ollama binary and models separately with clear versioning
  2. Good separation of concerns: POC workflow is separate from main E2E workflow
  3. Proper conditional logic: USE_OLLAMA flag cleanly switches between Ollama and OpenRouter
  4. Temperature control: Uses temperature: 0 in test mode for deterministic responses
  5. Clean test structure: Tests include warmup phase and proper async handling
  6. Good error handling: Proper timeout handling for model warmup

⚠️ Issues & Concerns

🔴 CRITICAL: Type Safety Violation

File: src/app/api/chat/route.ts (line 67)

tools[toolName] = toolDef as ToolSet[string];

Problem: Using as type assertion bypasses TypeScript's type checking. This violates the project's strict "no any" rule in spirit (type assertions are similar to any - they tell TypeScript "trust me").

Why this matters: If toolDef doesn't actually match ToolSet[string], this could cause runtime errors that TypeScript won't catch.

Recommendation:

// Option 1: Add proper type guard
function isValidToolDef(value: unknown): value is ToolSet[string] {
  // Add runtime checks here
  return typeof value === 'object' && value !== null && 'execute' in value;
}

for (const [toolName, toolDef] of Object.entries(serverTools)) {
  if (isValidToolDef(toolDef)) {
    tools[toolName] = toolDef;
  } else {
    console.warn(`Invalid tool definition for ${toolName}`);
  }
}

// Option 2: Fix the type at the source
// If serverTools is known to be ToolSet, update the McpClient type definition

Note: This issue exists in the base code, not introduced by this PR, but should be addressed.


🟡 MODERATE: Missing Dependencies

File: package.json

Problem: The PR adds ai-sdk-ollama but doesn't show that the ollama binary is properly installed/available in dev environments.

Recommendation:

  • Add documentation to README/CLAUDE.md about installing Ollama locally for development
  • Add a check in pnpm dev to warn if Ollama isn't available but USE_OLLAMA=true (see the sketch after this list)
  • Consider adding a pnpm ollama:install script that provides instructions
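A sketch of that availability check (hypothetical helper, e.g. run from a predev script; Ollama's root endpoint responds when the server is up):

```typescript
// Warn early in dev when tests are pointed at Ollama but no server is running.
async function warnIfOllamaMissing(): Promise<void> {
  if (process.env.USE_OLLAMA !== "true") return;
  try {
    const res = await fetch("http://localhost:11434/");
    if (!res.ok) throw new Error(`unexpected status ${res.status}`);
  } catch {
    console.warn(
      "[dev] USE_OLLAMA=true but no Ollama server at localhost:11434 - run `ollama serve`",
    );
  }
}
```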

🟡 MODERATE: Incomplete Test Coverage

File: tests/e2e/assistant.spec.ts

Observations:

  1. Only tests basic chat functionality - doesn't test MCP tool integration
  2. No tests for error scenarios (e.g., Ollama not available)
  3. Tests rely on regex matching which might be fragile across different model responses

Current tests:

  • ✅ Basic greeting response (checks for username in response)
  • ✅ Streaming response (checks for numbers)

Missing tests:

  • ❌ MCP server selection and tool usage (though this is skipped in Ollama mode, should still have tests)
  • ❌ Error handling (API failures, timeouts)
  • ❌ Message history and context preservation
  • ❌ Model selection UI
  • ❌ Clear chat functionality

Recommendation:

test("handles missing Ollama gracefully", async ({ authenticatedPage }) => {
  // Test error state when Ollama is unavailable
});

test("preserves conversation context", async ({ authenticatedPage }) => {
  // Send multiple messages and verify context is maintained
});

test("allows model selection", async ({ authenticatedPage }) => {
  // Test the model selector dropdown
});

🟡 MODERATE: Hardcoded Values

Files: Multiple

Problem: Model name qwen2.5:1.5b is hardcoded in several places:

  • .github/workflows/e2e.yml (line 30)
  • .github/workflows/ollama-poc.yml (lines 36, 60)
  • playwright.config.mts (line 53)
  • src/app/api/chat/route.ts (line 20)
  • tests/e2e/assistant.spec.ts (line 4)
  • package.json (lines 19, 20)

Recommendation:
Create a constants file or centralize in one config location:

// config/testing.ts
export const E2E_TEST_CONFIG = {
  OLLAMA_MODEL: "qwen2.5:1.5b",
  OLLAMA_BASE_URL: "http://localhost:11434",
  OLLAMA_CACHE_VERSION: "v3",
} as const;

Then import from this single source of truth.
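Usage would then look something like this (import paths and the ollama call shape are assumptions):

```typescript
import { ollama } from "ai-sdk-ollama";
import { E2E_TEST_CONFIG } from "@/config/testing";

// One place to bump the model or cache version for every consumer.
const model = ollama(E2E_TEST_CONFIG.OLLAMA_MODEL);
```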


🟢 MINOR: Workflow Efficiency

File: .github/workflows/e2e.yml

Observation: The workflow pulls the model on every run, even with caching. While the cache works, it would be cleaner to check whether the model exists before pulling.

Current approach:

- name: Pull Ollama model
  run: OLLAMA_MODELS=/home/runner/.ollama/models ollama pull qwen2.5:1.5b

Suggested improvement:

- name: Pull Ollama model (if needed)
  run: |
    OLLAMA_MODELS=/home/runner/.ollama/models ollama list | grep -q "qwen2.5:1.5b" || \
    OLLAMA_MODELS=/home/runner/.ollama/models ollama pull qwen2.5:1.5b

This is minor since ollama pull is smart about not re-downloading, but explicit checks are clearer.


🟢 MINOR: Test Timeout Strategy

File: tests/e2e/assistant.spec.ts

Observation: Tests use 60-second timeouts for AI responses. While reasonable, this could be optimized.

Suggestions:

  1. Consider making timeout configurable based on model size
  2. Add timeout monitoring to detect when model performance degrades
  3. Document expected response times in comments
// Expected: 5-15s for first response (model already warmed)
// Timeout: 60s to account for CI resource variance
await expect(
  authenticatedPage.getByText(new RegExp(`Hello.*${testUsername}`, "i")),
).toBeVisible({
  timeout: 60_000,
});

🟢 MINOR: Console Logging

File: src/app/api/chat/route.ts

Observation: Uses console.log() and console.error() directly. The project should standardize on a logging approach.

Current:

console.log(`[Chat API] Using Ollama model: ${OLLAMA_MODEL}`);
console.error("[Chat API] OPENROUTER_API_KEY not configured");

Suggestion: Consider a structured logging library or at least consistent prefixing throughout the codebase.
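A minimal sketch of consistent prefixing (illustrative only; a structured logger such as pino would be the heavier-weight alternative):

```typescript
// Tiny scoped logger that keeps the existing "[Chat API]" prefix convention
// without adding a dependency.
const createLogger = (scope: string) => ({
  info: (msg: string) => console.log(`[${scope}] ${msg}`),
  error: (msg: string) => console.error(`[${scope}] ${msg}`),
});

const log = createLogger("Chat API");
const OLLAMA_MODEL = "qwen2.5:1.5b";
log.info(`Using Ollama model: ${OLLAMA_MODEL}`);
log.error("OPENROUTER_API_KEY not configured");
```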


🏗️ Architecture & Design

✅ Good Architectural Decisions

  1. Conditional execution: Using USE_OLLAMA env var is clean and doesn't affect production code paths
  2. Fallback logic: Gracefully handles missing MCP servers in test mode
  3. Temperature control: Setting temperature: 0 for deterministic test results is correct
  4. Model warmup: Pre-warming the model in beforeAll is smart and prevents first-test timeouts

🤔 Architectural Questions

  1. Why skip MCP tools entirely in Ollama mode?

    • The code skips MCP tool fetching when USE_OLLAMA=true
    • This means tests don't exercise the full integration path
    • Consider: Could you mock MCP servers for more complete E2E coverage?
  2. Is qwen2.5:1.5b the right model choice?

    • Pro: Small, fast, good for CI
    • Con: May not represent production behavior accurately
    • Consider: Document why this model was chosen and what limitations it has
  3. Should Ollama integration be more than just testing?

    • Could this be a production feature? (e.g., for local/air-gapped deployments)
    • If yes: The current implementation is too test-focused
    • If no: Current design is appropriate

📚 Documentation & Maintainability

Missing Documentation

  1. No README/docs updates: The CLAUDE.md wasn't updated to mention Ollama setup
  2. No comments explaining the POC workflow purpose: Why have both e2e.yml and ollama-poc.yml?
  3. No migration guide: What happens to existing E2E tests? Do they still work with OpenRouter?

Recommended Documentation Additions

In CLAUDE.md:

### E2E Testing with Ollama

For E2E testing of the AI assistant feature, we use Ollama with the qwen2.5:1.5b model:

**Installation (for local development):**

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull the test model
ollama pull qwen2.5:1.5b

# Start Ollama server (usually starts automatically)
ollama serve
```

**Running tests with Ollama:**

```bash
pnpm test:e2e:ollama     # Run all E2E tests with Ollama
pnpm test:e2e:ollama:ui  # Open Playwright UI with Ollama
```

**Why Ollama?**

- No API keys required
- Deterministic responses with temperature=0
- Faster and cheaper than cloud APIs
- Works offline
🔒 Security Considerations

✅ Secure Practices

  1. Test-only credentials are clearly marked (e.g., test-only-not-a-real-secret)
  2. No production secrets in workflow files
  3. Proper authentication checks maintained in API route

🤔 Considerations

  1. Model supply chain: Ollama downloads models from the internet. Consider:
     • Verifying model checksums
     • Documenting the source of truth for model versions
     • Having a backup plan if Ollama's registry is unavailable

🧪 Testing Strategy

Current Test Philosophy

The tests are behavioral/integration tests rather than unit tests. This is appropriate for E2E, but consider:

  1. Test pyramid: These E2E tests are slow and expensive. Ensure you also have:
     • Unit tests for individual functions
     • Integration tests for API routes
     • Component tests for UI elements
  2. Test data: The tests use dynamic usernames and simple prompts. Consider:
     • Creating fixtures with known-good responses
     • Testing edge cases (very long messages, special characters, etc.)

🚀 Performance Impact

CI Performance

Positive:

  • Caching is well-implemented (binary + model caching)
  • Model warmup happens once per test suite
  • Parallel test execution should work fine

Concerns:

  • First run (cache miss) will be slow (~2-3 minutes to download model)
  • Model size (qwen2.5:1.5b) is ~900MB - ensure GitHub Actions has sufficient storage
  • Each workflow run starts the Ollama service - consider the cost of this overhead

Measurements from screenshot:

  • Test passed in 4.9s (after warmup)
  • Warmup took 1.3s (with warm cache)
  • Overall reasonable performance ✅

📊 Comparison: e2e.yml vs ollama-poc.yml

| Aspect | e2e.yml | ollama-poc.yml |
|--------|---------|----------------|
| Purpose | Main E2E tests (all features) | POC/demo for Ollama integration |
| Trigger | Push/PR to any branch | Manual + when workflow file changes |
| Timeout | 15 minutes | 10 minutes |
| Tests | All E2E tests | Only assistant.spec.ts |
| Artifacts | On failure only | On failure only |
| Extras | None | Benchmark output, debug info |

Recommendation: Clarify the purpose of having both workflows:

  • Is ollama-poc.yml temporary (for this PR only)?
  • Should it be merged into e2e.yml after testing?
  • Or should it remain as a separate "smoke test" workflow?

✅ Adherence to Project Guidelines

Checking against CLAUDE.md requirements:

| Guideline | Status | Notes |
|-----------|--------|-------|
| No any types | ⚠️ PARTIAL | Uses as assertion (see Critical Issues above) |
| Async/await over promises | ✅ PASS | All async code uses async/await |
| Server Components first | ✅ N/A | No component changes in this PR |
| Use hey-api hooks | ✅ N/A | No API client changes |
| Error handling | ✅ PASS | Proper try/catch and error responses |
| TypeScript strict mode | ⚠️ PARTIAL | Type assertion violates spirit of strict mode |
| Run linter | ❓ UNKNOWN | Should verify with pnpm lint |
| Write tests | ✅ PASS | New tests added |
| JSDoc for complex functions | ⚠️ PARTIAL | getMcpTools and connectToMcpRemote are well-documented, but the warmup function could use JSDoc |

🎨 Code Style & Consistency

✅ Good Practices

  1. Consistent naming conventions (useOllama, OLLAMA_MODEL)
  2. Clear function names (warmupOllama, connectToMcpRemote)
  3. Proper async error handling
  4. Good use of TypeScript interfaces

🟡 Minor Style Issues

  1. Inconsistent comment style:

// Some comments are full sentences with periods.
// Some comments don't have periods

  2. Magic numbers without constants:

await expect(...).toBeVisible({ timeout: 60_000 }); // Why 60s specifically?
🔄 Suggested Improvements Priority

🔴 Must Fix (Before Merge)

  1. Address type assertion in connectToMcpRemote - add proper type guard or fix at source
  2. Run linter/type-check - Ensure pnpm lint and pnpm type-check pass
  3. Clarify workflow strategy - Document whether ollama-poc.yml is temporary or permanent

🟡 Should Fix (Near Term)

  1. Add documentation - Update CLAUDE.md with Ollama setup instructions
  2. Centralize constants - Extract hardcoded model names to config
  3. Expand test coverage - Add tests for error scenarios and edge cases
  4. Add JSDoc - Document complex functions, especially test helpers

🟢 Nice to Have (Future)

  1. Optimize workflow - Add explicit model existence checks
  2. Structured logging - Replace console.log with proper logging
  3. Test data fixtures - Create known-good test cases
  4. Model validation - Add checksum verification for downloaded models

🎯 Final Recommendation

Overall Assessment: This is a solid proof-of-concept that adds real value to the testing infrastructure. The core implementation is sound, but there are some issues to address before merging.

Verdict: ⚠️ APPROVE WITH CHANGES

Required Changes:

  1. Fix the type assertion issue (Critical)
  2. Run and pass pnpm lint and pnpm type-check (Critical)
  3. Add basic documentation to CLAUDE.md (Should Fix)
  4. Clarify the two-workflow strategy (Should Fix)

After These Changes:

This PR will be ready to merge and will provide a strong foundation for E2E testing without external API dependencies.


💬 Questions for the Author

  1. Is ollama-poc.yml intended to stay long-term, or should it be merged into e2e.yml after validation?
  2. Have you tested this on a clean CI environment (cache miss) to verify download times are acceptable?
  3. Are there plans to add MCP mock servers for more complete E2E coverage?
  4. What's the long-term vision - is Ollama just for testing, or could it become a production feature?

Great work on this implementation! The caching strategy is particularly well done, and the separation of concerns is clean. With the suggested fixes, this will be a valuable addition to the project. 🚀



@kantord kantord changed the title from "poc" to "test: end-to-end testing for chat features" Dec 19, 2025
@github-actions github-actions bot added size/S Small PR: 100-299 lines changed and removed size/M Medium PR: 300-599 lines changed labels Dec 19, 2025