`langgraph-vs-crewai` ⚖️

The Open-Source Standard for Agentic Framework Benchmarking

"No opinions. No marketing. Just numbers." — Principal Engineer’s Manifesto

🔬 Project Overview

langgraph-vs-crewai is a production-grade, community-driven benchmark suite for objectively comparing AI agent frameworks (LangGraph, CrewAI, AutoGen, etc.) using a rigorous statistical methodology.

Unlike traditional "vibes-based" comparisons, this project uses Welch's t-tests, Cohen's d effect sizes, and multi-agent simulation to produce results that can withstand peer review.
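As a rough illustration of the two statistics named above, here is a minimal sketch using only the Python standard library. The function names `welch_t` and `cohens_d` and the sample latencies are assumptions made for this example, not the project's actual code or data. Turning the t statistic into a p-value requires the Student t distribution, which the standard library lacks; `scipy.stats.ttest_ind(a, b, equal_var=False)` is one common way to get it.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(a), len(b)
    # Squared standard error of each sample mean
    se2_a, se2_b = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Effect size using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


# Made-up latencies in seconds, for illustration only
framework_a = [1.0, 2.0, 3.0, 4.0, 5.0]
framework_b = [2.0, 3.0, 4.0, 5.0, 6.0]

t, df = welch_t(framework_a, framework_b)
d = cohens_d(framework_a, framework_b)
print(f"t = {t:.3f}, df = {df:.1f}, d = {d:.3f}")
```

Welch's test does not assume the two frameworks have equal variance, which is why the degrees of freedom are estimated from the data rather than fixed at `na + nb - 2`.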

Key Pillars:

Impartial Interception: A custom proxy layer captures raw API transactions, so token counts come from the actual traffic rather than from each framework's own reporting.
Scientific Rigor: Significance masking (p-value < 0.05) ensures that only statistically significant differences are reported, so conclusions are driven by data, not noise.
Cost Economics: Versioned pricing models correlate token usage with real-world USD costs.
Operational Isolation: Dockerized execution environments prevent "polluted" benchmarks caused by framework global state leaks.
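The cost pillar above can be sketched as a lookup into a versioned price table. This is a minimal sketch, not the project's implementation: the `Pricing` class, the `PRICING` table, the model name, and the prices are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices for one model under one pricing version (hypothetical schema)."""
    input_per_1k: float   # USD per 1,000 prompt tokens
    output_per_1k: float  # USD per 1,000 completion tokens


# Hypothetical price table keyed by (model, pricing version); values are made up
PRICING = {
    ("example-model", "2024-01"): Pricing(input_per_1k=0.50, output_per_1k=1.50),
}


def cost_usd(model, version, prompt_tokens, completion_tokens):
    """Convert intercepted token counts into a USD cost."""
    p = PRICING[(model, version)]
    return (prompt_tokens / 1000) * p.input_per_1k + (completion_tokens / 1000) * p.output_per_1k


total = cost_usd("example-model", "2024-01", prompt_tokens=2000, completion_tokens=1000)
print(f"${total:.2f}")
```

Keying the table by pricing version means an old run can be re-costed with the prices that applied when it was recorded.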

📊 The Task Registry (20 Total)

We measure frameworks across three distinct tiers:

🟢 Tier 1 (Simple - 7/7): Tool Call, JSON Extraction, Multi-Tool selection, Context Recall, Reasoning Logic.
🟡 Tier 2 (Moderate - 7/7): Multi-Step Research (Sequential), Long Context (50k+ tokens), Tool Use with Files.
🔴 Tier 3 (Complex - 6/6): Cascading Failures, Multi-Agent Debate, Multi-Day Scheduling, Human-in-the-loop.

🛠️ Quick Start

Installation

git clone https://github.com/Ismail-2001/langgraph-vs-crewai.git
cd langgraph-vs-crewai
pip install -r requirements.txt

Run a Benchmark

# Run 10 iterations of all Tier 1 tasks
python -m benchmark.cli run --n 10 --tasks tier-1 --output results/initial_launch

Review the Report

Once complete, check results/[output]/report.md for a comprehensive statistical comparison.

📐 Architecture

graph TD
    CLI[CLI Wrapper] --> Runner[Benchmark Runner]
    Runner --> Registry[Task Registry]
    Runner --> Adapter[Framework Adapter]
    Adapter --> Interceptor[LLM Interceptor]
    Interceptor --> LLM[Provider API]
    Runner --> DB[SQLite Persistence]
    DB --> Reporter[Markdown Reporter]
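
The last two hops of the diagram (SQLite persistence feeding a Markdown reporter) might look roughly like the sketch below. The `runs` table, its columns, and the `render_report` function are assumptions made for illustration, and the rows are made-up data rather than benchmark results.

```python
import sqlite3

# In-memory database standing in for the persisted results file
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE runs (framework TEXT, task TEXT, latency_s REAL, total_tokens INTEGER)"
)

# Made-up rows, for illustration only
rows = [
    ("langgraph", "tool_call", 1.0, 100),
    ("langgraph", "tool_call", 3.0, 300),
    ("crewai", "tool_call", 2.0, 200),
]
conn.executemany("INSERT INTO runs VALUES (?, ?, ?, ?)", rows)


def render_report(conn):
    """Aggregate stored runs per framework and render a Markdown table."""
    lines = [
        "| Framework | Runs | Mean latency (s) | Mean tokens |",
        "|---|---|---|---|",
    ]
    query = (
        "SELECT framework, COUNT(*), AVG(latency_s), AVG(total_tokens) "
        "FROM runs GROUP BY framework ORDER BY framework"
    )
    for framework, n, latency, tokens in conn.execute(query):
        lines.append(f"| {framework} | {n} | {latency:.2f} | {tokens:.0f} |")
    return "\n".join(lines)


report = render_report(conn)
print(report)
```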

🛡️ Methodology

Statistical Significance: We use Welch’s t-test to account for unequal variances between frameworks.
Fairness: Every LLM call is intercepted by monkeypatching the client layer, so that no framework-internal "hops" or retries are hidden from the final count.
Reproducibility: All raw traces are saved as JSON with unique UUIDs and high-fidelity timestamps.
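
The interception and trace-recording ideas above can be sketched as follows. This is a minimal sketch under stated assumptions: `FakeClient` is a hypothetical stand-in for a provider SDK client, and `install_interceptor` and `TRACES` are names invented for this example, not the project's API.

```python
import functools
import json
import time
import uuid


class FakeClient:
    """Hypothetical stand-in for a provider SDK client (not a real API)."""

    def complete(self, prompt):
        usage = {"prompt_tokens": len(prompt.split()), "completion_tokens": 1}
        return {"text": prompt.upper(), "usage": usage}


TRACES = []  # every intercepted call is appended here


def install_interceptor(cls, method_name):
    """Monkeypatch cls.method_name so each call is recorded before returning."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def wrapper(self, *args, **kwargs):
        started = time.time()
        response = original(self, *args, **kwargs)
        TRACES.append({
            "id": str(uuid.uuid4()),             # unique trace identifier
            "started_at": started,               # wall-clock start time
            "duration_s": time.time() - started,
            "usage": response["usage"],          # counts taken from the raw response
        })
        return response

    setattr(cls, method_name, wrapper)


install_interceptor(FakeClient, "complete")

client = FakeClient()
client.complete("hello world")
client.complete("retry of hello world")  # a retry is counted like any other call

trace_json = json.dumps(TRACES, indent=2)
print(len(TRACES), "calls recorded")
```

Because the patch sits on the client class itself, a retry issued inside a framework goes through the same wrapper and shows up as its own trace entry.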

🤝 Contributing

Contributions that improve the scientific validity or task coverage are welcome. Please ensure all new tasks include a clear rubric and expected JSON schema.

⚖️ License

MIT License. Created by Ismail Sajid.
