Open
38 commits
488fe6b
feat: added launch of new evaluation
Nov 9, 2025
6c0bd10
refactoring 1
Nov 9, 2025
66fc521
refactoring 2
Nov 9, 2025
0944ee5
refactoring 3
Nov 9, 2025
abf4bb7
feat: evaluating is ready for launching
Nov 9, 2025
dace504
feat: added f1 score tool evaluator
Nov 12, 2025
d0c10b0
feat: launch f1
Nov 12, 2025
862a964
feat: added statistics
Nov 15, 2025
3e05acb
feat: added working statistics calculating
Nov 16, 2025
df80de8
feat: jsonl -> json in memory
Nov 16, 2025
e4b8544
feat: refactoring 1
Nov 17, 2025
68953a5
feat: added parsing new data type
Nov 17, 2025
afe523e
feat: added color logs, shuffled launch
Nov 18, 2025
7e263fb
benchmark + new algorithms
Dec 25, 2025
093b202
feat: rewrite from string prompt to list of messages (only baseline d…
Dec 27, 2025
8d2c76e
feat: added DialogueWithWeights
Dec 28, 2025
e54a59e
fix: style
Jan 6, 2026
50b3e80
feat: added langchain messages to memory algs. also changes by mypy, …
Jan 7, 2026
9e371d8
fix: ruff and mypy
Jan 7, 2026
2c75ff8
fix: some important constants move to .env
Jan 7, 2026
186d0b8
fix: mypy & ruff
Jan 7, 2026
1d5223c
fix: pytest version
Jan 7, 2026
50f6bbd
feat: added separation into a main model and the model for memory; ad…
Jan 24, 2026
d8834a1
fix: deleted extra file
Jan 24, 2026
3798e9c
feat: added feature for using local models in the dialogue_baseline
Jan 24, 2026
3936d9f
fix: memory storage
Jan 28, 2026
8963da4
fix: by ruff
Jan 28, 2026
e38c3d4
fix: removed comments & hardcode
Jan 29, 2026
0b395ed
feat: added docs to the main classes and methods
Jan 29, 2026
93f50fe
feat: project restructured, added makefile (now it doesn't work), pro…
Jan 31, 2026
04fca3d
fix: refactored
Jan 31, 2026
ba33cbc
fix: removed vestige prompt
Feb 1, 2026
8bba95f
fix: removed vestige prompt
Feb 1, 2026
99c0474
feat: refactoring, added more tests
Feb 4, 2026
e05c4b4
feat: added semantic similarity for f1
Feb 10, 2026
88dd410
feat: added calculating by logs, args similarities, the stub od the l…
Feb 11, 2026
8d194f0
feat: llm as a judge template is updated
Feb 11, 2026
dd37671
feat: llm as a judge refreshed
Feb 14, 2026
Binary file added .DS_Store
Owner

Remove this file from git; it is doable.

Binary file not shown.
4 changes: 3 additions & 1 deletion .gitignore
@@ -2,11 +2,13 @@
.mypy_cache/
.pytest_cache/
__pycache__/
*.DS_Store
.ipynb_checkpoints/
.hypothesis/
*.egg-info/
build/
.venv/
uv.lock
.env
src/benchmarking/agent_chat/logs
src/benchmark/simple_benchmarking/agent_chat/logs
src/benchmark/tool_plan_benchmarking/logs
Comment on lines +13 to +14
Owner

Something like **/logs? Or *.log? I do not remember the correct syntax.

65 changes: 65 additions & 0 deletions AGENTS.md
@@ -0,0 +1,65 @@
# Contributor guide (AGENTS)

## Project structure

- `src/` – main package code.
- `src/algorithms/summarize_algorithms/` – dialogue summarization implementations (e.g. `memory_bank/`, `recsum/`, shared `core/`).
- `src/benchmark/` – evaluation scripts, metrics, log parsing, and plotting.
- `src/utils/` – small shared helpers (logging/config parsing).
- Entry point: `src/main.py` (also exposed as a script in `pyproject.toml`: `recapkt = "src.main:main"`).
- `tests/` – pytest suite (files follow `test_*.py`).
- `requirements.txt`, `requirements.dev.txt` – runtime/dev dependencies.

## Build, test, and development commands

This repo targets **Python >= 3.12** (see `pyproject.toml`). CI uses **uv**.

- Create env + install deps (recommended):
```bash
uv venv
uv pip install -r requirements.txt -r requirements.dev.txt
```
- Run the example entry point:
```bash
python -m src.main
# or
uv run recapkt
```
- Lint / format (Ruff):
```bash
ruff check .
ruff format .
```
- Type-check (Mypy):
```bash
uv run mypy
```
- Run tests:
```bash
uv run python -m pytest
```
- Tool-metrics benchmarking helper:
```bash
./run.sh <arg> # runs src/benchmark/tool_plan_benchmarking/run.py
```

## Code style and naming

- Formatting/linting: Ruff is the source of truth (line length **120**, double quotes).
- Typing: keep functions typed; the project configuration disallows untyped defs in `src/`.
- Naming:
  - modules/files: `snake_case.py`
  - classes: `CamelCase`
  - tests: `tests/test_<unit>.py`, test functions `test_<behavior>()`

## VCS: commits and pull requests

- Commit messages follow a lightweight Conventional Commits style seen in history: `feat: ...`, `fix: ...`.
- PRs should:
  - describe the change + rationale,
  - include how to reproduce/verify (commands or a minimal snippet),
  - keep CI green (GitHub Actions runs `ruff check`, `mypy`, `pytest` on PRs).

## Secrets and local config

- Don’t commit `.env`. If your change needs new settings, document them and keep defaults safe.
31 changes: 31 additions & 0 deletions Makefile
Owner

Why do you need this?

@@ -0,0 +1,31 @@
.PHONY: help run-main run-tool-plan run-tool-metrics run-tool-metrics-sh test

# Prefer local venv if present, fall back to system python.
# The venv path is absolute so it still resolves after a recipe changes directory.
PYTHON := $(shell [ -x .venv/bin/python ] && echo $(CURDIR)/.venv/bin/python || (command -v python3 >/dev/null 2>&1 && echo python3 || echo python))

# Optional args for some targets:
# make run-tool-metrics ARG=base_recsum
ARG ?= base_recsum

help:
	@echo "Available targets:"
	@echo " make run-main"
	@echo " make run-tool-plan"
	@echo " make run-tool-metrics ARG=<arg>"
	@echo " make run-tool-metrics-sh ARG=<arg>"
	@echo " make test"

run-main:
	$(PYTHON) -m src.main

# Mirrors run.sh: run the script by path with the repo root on PYTHONPATH
# (`-m run.py` is not a valid module name).
run-tool-plan:
	cd src/benchmark/tool_plan_benchmarking && PYTHONPATH=$(CURDIR) $(PYTHON) run.py

run-tool-metrics:
	$(MAKE) run-tool-plan

run-tool-metrics-sh:
	./run.sh $(ARG)

test:
	$(PYTHON) -m pytest -q
34 changes: 31 additions & 3 deletions pyproject.toml
@@ -8,8 +8,32 @@ authors = [
{ name = "Mikhail Kharlamov" }
]
readme = "README.md"
requires-python = ">=3.8"
dependencies = []
requires-python = ">=3.12"
dependencies = [
"colorlog~=6.10.1",
"dataclasses-json~=0.6.7",
"datasets==4.0.0",
"faiss-cpu==1.11.0",
"jinja2==3.1.6",
"langchain>=1.1.0",
"langchain-community>=0.4.1",
"langchain-core>=1.1.0",
"langchain-ollama>=1.0.1",
"langchain-openai>=1.0.0",
"langgraph>=1.0.0",
"load-dotenv>=0.1.0",
"matplotlib~=3.10.7",
"numpy>=1.26.2",
"openai~=1.109.1",
"pandas~=2.3.3",
"pydantic~=2.11.9",
"pytest>=9.0.2",
"python-dotenv~=1.1.1",
"scikit-learn==1.5.2",
"seaborn~=0.13.2",
"tiktoken==0.9.0",
"transformers>=4.57.6",
]

[project.scripts]
recapkt = "src.main:main"
@@ -48,5 +72,9 @@ warn_no_return = "False"
no_implicit_optional = "False"

[tool.pytest.ini_options]
pythonpath = ["src"]
testpaths = ["tests"]
addopts = ["--color=yes", "-s"]
addopts = ["--color=yes", "-s"]

[dependency-groups]
dev = []
26 changes: 19 additions & 7 deletions requirements.txt
@@ -1,14 +1,26 @@
langchain-core>=0.3.72,<1.0.0
langchain-openai==0.3.28
langgraph==0.5.3
langchain>=0.3.27,<0.4.0
langchain-openai>=1.0.0
langchain-core>=1.1.0,<2.0.0
langchain-openai>=0.3.28
langchain-ollama>=1.0.1
langgraph>=1.0.0
langchain>=1.1.0,<2.0.0
langchain-community>=0.4.1

tiktoken==0.9.0
transformers
datasets==4.0.0
numpy>=1.26.2
scikit-learn==1.5.2
faiss-cpu==1.11.0
langchain-community~=0.3.31
pydantic~=2.11.9
pytest~=8.3.4
dataclasses-json~=0.6.7
openai~=1.109.1
openai~=1.109.1
jinja2==3.1.6

python-dotenv~=1.1.1

colorlog~=6.10.1
seaborn~=0.13.2
pandas~=2.3.3
matplotlib~=3.10.7
load_dotenv
11 changes: 11 additions & 0 deletions run.sh
Owner

why do you need this?

@@ -0,0 +1,11 @@
#!/bin/bash

set -euo pipefail

ROOT_DIR="$(cd "$(dirname "$0")" && pwd)"

source "$ROOT_DIR/.venv/bin/activate"
export PYTHONPATH="$ROOT_DIR"

cd "$ROOT_DIR/src/benchmark/tool_plan_benchmarking"
python run.py "${1:-}"
Binary file added src/.DS_Store
Binary file not shown.
File renamed without changes.
22 changes: 22 additions & 0 deletions src/algorithms/dialogue.py
@@ -0,0 +1,22 @@
from typing import Any, Protocol

from src.algorithms.summarize_algorithms.core.models import DialogueState, Session


class Dialogue(Protocol):
    """
    Minimal public interface for a dialogue system used throughout benchmark.

    Any implementation must expose a `system_name` and provide `process_dialogue()` returning a `DialogueState`.
    """

    system_name: str

    def process_dialogue(
        self,
        sessions: list[Session],
        system_prompt: str,
        structure: dict[str, Any] | None = None,
        tools: list[dict[str, Any]] | None = None
    ) -> DialogueState:
        ...
31 changes: 31 additions & 0 deletions src/algorithms/simple_algorithms/dialog_short_tools.py
@@ -0,0 +1,31 @@
from typing import override

from langchain_core.messages import BaseMessage, ToolMessage

from src.algorithms.simple_algorithms.dialogue_baseline import DialogueBaseline
from src.algorithms.summarize_algorithms.core.models import Session


class DialogueWithShortTools(DialogueBaseline):
    """
    Baseline variant that shortens tool messages.

    Keeps tool call structure in the history but clears `ToolMessage.content` to reduce context length.
    """

    @override
    @staticmethod
    def _compress(sessions: list[Session]) -> list[BaseMessage]:
        """
        Compress sessions by clearing tool message contents.

        :param sessions: past sessions.
        :return: list[BaseMessage]: flattened history with shortened tool messages.
        """
        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)
Owner

super()._get_context? If the superclass changes, this will break.


        for message in messages:
            if isinstance(message, ToolMessage):
                message.content = ""
Comment on lines +27 to +29
Owner

Use <compressed content> instead of an empty string, to signal that something important was here.
Also, are tool call arguments being saved?


        return messages
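
One possible way to address both review comments on this file is sketched below, under stated assumptions. `Message`, `ToolMessage`, `Baseline`, and `WithShortTools` are hypothetical stdlib stand-ins for the langchain and project classes, so the snippet runs on its own. Zero-argument `super()` is not available inside a `@staticmethod`, so the sketch assumes the methods can become `@classmethod`s. It also builds new message objects rather than mutating the ones returned by `_get_context`, which matters if those are the sessions' own objects.

```python
from dataclasses import dataclass, replace

PLACEHOLDER = "<compressed content>"


@dataclass(frozen=True)
class Message:
    """Hypothetical stand-in for langchain_core's BaseMessage."""
    content: str


@dataclass(frozen=True)
class ToolMessage(Message):
    """Hypothetical stand-in for langchain_core's ToolMessage."""
    tool_call_id: str = ""


class Baseline:
    @classmethod
    def _get_context(cls, sessions: list[list[Message]]) -> list[Message]:
        # Flatten all sessions into one message history.
        return [message for session in sessions for message in session]

    @classmethod
    def _compress(cls, sessions: list[list[Message]]) -> list[Message]:
        return cls._get_context(sessions)


class WithShortTools(Baseline):
    @classmethod
    def _compress(cls, sessions: list[list[Message]]) -> list[Message]:
        # super() resolves the parent at call time, so the parent class
        # is not hard-coded by name here.
        messages = super()._get_context(sessions)
        # Build new objects instead of mutating the stored history, and
        # leave a visible marker where tool output was removed.
        return [
            replace(m, content=PLACEHOLDER) if isinstance(m, ToolMessage) else m
            for m in messages
        ]


history = [[Message("hi"), ToolMessage("very long tool output", tool_call_id="call-1")]]
compressed = WithShortTools._compress(history)
```

The placeholder keeps the tool message and its `tool_call_id` in place, so the call structure survives while the bulky output is dropped.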
49 changes: 49 additions & 0 deletions src/algorithms/simple_algorithms/dialog_with_weights.py
@@ -0,0 +1,49 @@
from decimal import Decimal
from math import ceil
from typing import override

from langchain_core.messages import BaseMessage, HumanMessage

from src.algorithms.simple_algorithms.dialogue_baseline import DialogueBaseline
from src.algorithms.summarize_algorithms.core.models import Session


class DialogueWithWeights(DialogueBaseline):
    """
    Baseline variant that compresses history by truncating message contents with a positional weight.

    Messages closer to the center of the conversation get truncated more aggressively (triangle-shaped coefficient).
    Human messages are preserved.
    """

    @override
    @staticmethod
    def _compress(sessions: list[Session]) -> list[BaseMessage]:
        """
        Compress sessions by truncating non-human messages based on their position.

        :param sessions: past sessions.
        :return: list[BaseMessage]: flattened history with weighted truncation applied.
        """
        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)
Owner

super?

        cropped_messages: list[BaseMessage] = []

        mid: int = (len(messages) - 1) // 2
        step: Decimal = Decimal(1) / Decimal(mid)
        coefficient: Decimal = Decimal(1)

        for i in range(len(messages)):
            if coefficient > 0 and i != 0:
                coefficient -= step
            else:
                coefficient += step

            message = messages[i]
            if isinstance(message, HumanMessage):
                cropped_messages.append(message)
                continue

            message.content = message.content[:ceil(len(message.content) * coefficient)]
            cropped_messages.append(message)
Comment on lines +46 to +47
Owner

We need to signal that cropping happened: insert ... at the end.


        return cropped_messages
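
The triangle-shaped coefficient described in the docstring can also be computed in closed form instead of being stepped in a loop. The sketch below is one possible formulation, not the PR's implementation. `triangle_coefficients` and `crop` are hypothetical helper names, and exact `Fraction` arithmetic is swapped in for `Decimal`. It guards the short-history case (in the diffed code `mid` is 0 for one or two messages, so `Decimal(1) / Decimal(mid)` would raise) and appends `...` whenever text is actually cut, as the review asks.

```python
from fractions import Fraction
from math import ceil


def triangle_coefficients(n: int) -> list[Fraction]:
    """Return keep-ratios: 1 at both ends, falling linearly to 0 at the center."""
    if n <= 1:
        # Nothing to weight; also avoids dividing by zero.
        return [Fraction(1)] * n
    # |i - mid| / mid with mid = (n - 1) / 2, written without halves.
    return [Fraction(abs(2 * i - (n - 1)), n - 1) for i in range(n)]


def crop(text: str, coefficient: Fraction, marker: str = "...") -> str:
    """Keep the leading share of text and mark the cut if anything was removed."""
    keep = ceil(len(text) * coefficient)
    if keep >= len(text):
        return text
    return text[:keep] + marker


weights = triangle_coefficients(5)
cropped = [crop("abcdefgh", w) for w in weights]
```

For five messages the weights are 1, 1/2, 0, 1/2, 1, so the first and last messages stay intact and the middle one is reduced to the marker alone. Human messages would still be skipped before calling `crop`, as in the diffed code.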