LongMemEval evaluation suite (ICLR 2025) for AI agent memory systems with MTEB and BEIR benchmark support.
Evaluates memory retrieval systems by loading conversation datasets, indexing them into vector stores, retrieving relevant contexts, generating LLM answers, and computing comprehensive metrics across 5 memory abilities: Information Extraction (IE), Multi-Session Reasoning (MR), Temporal Reasoning (TR), Knowledge Updates (KU), and Abstention (ABS).
- LongMemEval Pipeline: End-to-end evaluation across 5 memory abilities
- Dual Retrieval: ChromaDB (local) or Engram search service (hybrid search with reranking)
- 100+ LLMs: LiteLLM integration (OpenAI, Anthropic, Google, Ollama, etc.)
- Comprehensive Metrics: QA accuracy, retrieval quality (MRR, NDCG, Recall), abstention, latency, RAGAS
- Extended Benchmarks: MTEB (embedding) and BEIR (retrieval) evaluation
- Async Execution: Concurrent operations with configurable parallelism
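As an illustration of the "configurable parallelism" feature above, here is a minimal sketch of bounded concurrency using `asyncio.Semaphore`. The names `run_bounded`, `fake_answer`, and `max_concurrency` are hypothetical and are not taken from this package; the actual implementation may be structured differently.

```python
import asyncio


async def run_bounded(items, worker, max_concurrency=4):
    """Apply an async worker to every item, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(item):
        # Each task waits here until a slot is free
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


async def fake_answer(question):
    # Hypothetical stand-in for a retrieval + generation call
    await asyncio.sleep(0.01)
    return question.upper()


if __name__ == "__main__":
    results = asyncio.run(run_bounded(["q1", "q2", "q3"], fake_answer, max_concurrency=2))
    print(results)
```

A semaphore keeps the number of in-flight requests below a fixed limit, which is one common way to stay under provider rate limits while still overlapping I/O.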
cd packages/benchmark && uv sync && source .venv/bin/activate
# For MTEB/BEIR support
uv pip install -e ".[mteb]"
# Validate dataset
engram-benchmark validate data/longmemeval_oracle.json
# Run LongMemEval with ChromaDB (local)
engram-benchmark run \
--dataset data/longmemeval_oracle.json \
--model openai/gpt-4o-mini \
--embedding-model BAAI/bge-base-en-v1.5 \
--top-k 10 --limit 50
# Run with Engram retriever (production, hybrid search + reranking)
engram-benchmark run --retriever engram --search-strategy hybrid --rerank
# Run MTEB benchmark
engram-benchmark mteb --model BAAI/bge-base-en-v1.5 --tasks Banking77Classification
# Run BEIR benchmark
engram-benchmark beir --model BAAI/bge-base-en-v1.5 --datasets nfcorpus
# Run tests
uv run pytest --cov=engram_benchmark
# Lint and format
uv run ruff check src tests
uv run ruff format src tests
# Type check
uv run mypy src/engram_benchmark
Pipeline: Dataset → Indexing (ChromaDB/Engram) → Retrieval (top-k) → LLM Generation → Evaluation
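The pipeline stages named above can be sketched as a toy, in-memory flow. This is only an assumption-laden illustration: a token-overlap ranker stands in for the vector store, a stub stands in for the LLM, and every name here (`Example`, `index`, `retrieve`, `generate`, `evaluate`, `run_pipeline`) is hypothetical rather than part of the package's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    question: str
    answer: str
    sessions: List[str]


def index(sessions):
    # Indexing: store each session alongside its lowercase token set
    return [(text, set(text.lower().split())) for text in sessions]


def retrieve(store, question, top_k):
    # Retrieval: rank sessions by token overlap with the question (toy scorer)
    query_tokens = set(question.lower().split())
    ranked = sorted(store, key=lambda entry: len(query_tokens & entry[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def generate(question, contexts):
    # Generation: stub for an LLM call; simply echoes the top-ranked context
    return contexts[0] if contexts else "I don't know"


def evaluate(prediction, gold):
    # Evaluation: lenient match, true if the gold answer appears in the prediction
    return gold.lower() in prediction.lower()


def run_pipeline(dataset, top_k=2):
    # Dataset -> Indexing -> Retrieval (top-k) -> Generation -> Evaluation
    correct = 0
    for example in dataset:
        store = index(example.sessions)
        contexts = retrieve(store, example.question, top_k)
        prediction = generate(example.question, contexts)
        correct += evaluate(prediction, example.answer)
    return correct / len(dataset)


if __name__ == "__main__":
    data = [
        Example(
            question="Where does Alice work?",
            answer="Acme",
            sessions=["Alice said she works at Acme", "Bob likes hiking on weekends"],
        )
    ]
    print(run_pipeline(data))
```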
Core Modules:
- longmemeval/: Pipeline orchestration, retrieval, reader, temporal reasoning
- metrics/: QA, retrieval (MRR, NDCG, Recall), abstention, latency, RAGAS
- providers/: LiteLLM (100+ models), embeddings (sentence-transformers), Engram API
- benchmarks/: MTEB and BEIR evaluation
Key Metrics:
- QA: Exact match or LLM-based accuracy per memory ability (IE, MR, TR, KU, ABS)
- Retrieval: MRR, NDCG@k, Recall@k, MAP, turn/session recall
- Abstention: Precision, Recall, F1 for unanswerable questions
- Latency: P50/P90/P95/P99 for retrieval and generation
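For readers unfamiliar with the retrieval metrics listed above, the following is a minimal sketch of MRR, Recall@k, and NDCG@k under the assumption of binary relevance. The function names are hypothetical, and the package's own metrics module may use different definitions (for example, graded relevance).

```python
import math


def reciprocal_rank(ranked_ids, relevant_ids):
    # 1 / rank of the first relevant item, or 0.0 if none was retrieved
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    # runs: list of (ranked_ids, relevant_ids) pairs, one per query
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def recall_at_k(ranked_ids, relevant_ids, k):
    # Fraction of relevant items that appear in the top k results
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    # DCG with binary gains, normalized by the best achievable DCG at k
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


if __name__ == "__main__":
    ranked = ["d3", "d1", "d7", "d2"]
    relevant = {"d1", "d2"}
    print(reciprocal_rank(ranked, relevant))  # first relevant hit is at rank 2
    print(recall_at_k(ranked, relevant, 3))
    print(ndcg_at_k(ranked, relevant, 4))
```

MRR rewards placing the first relevant item early, Recall@k measures coverage of the relevant set, and NDCG@k discounts relevant items by their position in the ranking.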
See examples/ for usage patterns and pyproject.toml for full dependency list.