Benchmark for Chat History Compression Algorithms in Agentic Systems with Function Calling by mikhail-kharlamov · Pull Request #13 · emnigma/RecapKt

mikhail-kharlamov · 2025-11-09T18:21:36Z

This PR introduces a benchmarking framework and experimental results for comparing chat history compression algorithms within an agentic system context. The primary goal is to evaluate how different memory management strategies affect an LLM agent's ability to correctly plan and execute tool calls (function calling) over extended dialogue sessions.

Key Features & Implementations
The benchmark compares the following compression strategies against a baseline (Full Context):

RecSum: Recursive summarization of the dialogue history.
MemoryBank: A memory-bank approach for storing and retrieving dialogue segments.
RecSum + RAG: Recursive summarization augmented with retrieval (RAG) specifically for the history of tool/function calls.
MemoryBank + RAG: MemoryBank augmented with RAG for tool/function calls.
V-shape: An algorithm that compresses messages more aggressively toward the middle of the context and less aggressively toward its start and end.
Short-Tools: An algorithm that removes all tool execution results from the context.
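
A rough sketch of what a V-shaped schedule could look like. The PR description does not give the actual weighting, so the linear formula, the function name `v_shape_keep_ratios`, and the `min_keep` parameter are all assumptions made for illustration. Each value is the fraction of a message to keep, so the lowest value (strongest compression) falls in the middle of the history.

```python
from __future__ import annotations


def v_shape_keep_ratios(n_messages: int, min_keep: float = 0.2) -> list[float]:
    """Return a keep fraction per message: 1.0 at both ends, min_keep in the middle.

    Hypothetical linear schedule; the PR's real coefficients may differ.
    """
    if n_messages <= 0:
        return []
    if n_messages == 1:
        return [1.0]
    last = n_messages - 1
    # abs(2 * i / last - 1) is 1 at the edges and 0 at the center of the history.
    return [
        min_keep + (1.0 - min_keep) * abs(2 * i / last - 1)
        for i in range(n_messages)
    ]
```

A keep fraction like this can then be fed into the slicing used in `dialog_with_weights.py` (`message.content[:ceil(len(message.content) * coefficient)]`).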

Methodology

Data: 15 real-world user-agent interaction sessions collected from IDE tasks involving the Explyt agent.
Simulation: Since the proprietary agent code is restricted, the benchmark operates in a "planning mode." The LLM is tasked with generating a JSON-structured action plan based on a provided schema and the compressed context.
Metrics:
- F1 Score (Tool Names): Measures the accuracy of selected tools ignoring arguments.
- F1 Score (Strict): Measures the accuracy of selected tools including the correctness of their arguments.
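
To make the difference between the two metrics concrete, here is a minimal sketch. It assumes that a plan is a list of `{"name": ..., "args": ...}` dicts and that matching is order-insensitive multiset matching; neither detail is stated in the PR description, and the function names are hypothetical.

```python
from __future__ import annotations

import json
from collections import Counter


def _f1(predicted: Counter, expected: Counter) -> float:
    """F1 over two multisets of hashable items."""
    if not predicted and not expected:
        return 1.0  # assumption: two empty plans count as a perfect match
    overlap = sum((predicted & expected).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(expected.values())
    return 2 * precision * recall / (precision + recall)


def f1_tool_names(predicted: list[dict], expected: list[dict]) -> float:
    """Compare tool names only, ignoring arguments."""
    return _f1(
        Counter(call["name"] for call in predicted),
        Counter(call["name"] for call in expected),
    )


def f1_strict(predicted: list[dict], expected: list[dict]) -> float:
    """Compare tool names together with their arguments (exact match)."""

    def key(call: dict) -> tuple[str, str]:
        # Canonical JSON so that argument key order does not matter.
        return call["name"], json.dumps(call.get("args", {}), sort_keys=True)

    return _f1(Counter(map(key, predicted)), Counter(map(key, expected)))
```

With this definition, a plan that picks the right tools but gets one argument wrong keeps a perfect name-level score while losing strict-level score.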

The results of the experiments, including aggregated metrics tables and visualization graphs, are generated by the framework.

External link to results: https://github.com/mikhail-kharlamov/practice-benchmark-results


emnigma

last commit review

emnigma · 2026-01-29T11:05:56Z

src/benchmarking/tool_metrics/run.py

+            if count_of_sessions == 3:
+                print("ye")


emnigma · 2026-01-29T11:06:15Z

src/benchmarking/tool_metrics/run.py


-    #runner = Runner()
+    runner = Runner()
    #runner.run(sys.argv[1])


emnigma · 2026-01-29T11:06:53Z

src/summarize_algorithms/core/base_dialogue_system.py

+            #self.memory_llm = ChatOpenAI(
+            #    model=OpenAIModels.GPT_5_MINI.value,
+            #    api_key=SecretStr(api_key)
+            #)


emnigma · 2026-01-29T11:07:27Z

src/summarize_algorithms/core/base_dialogue_system.py

+            #)
+            self.memory_llm = ChatOllama(
+                model=LocalModels.QWEN_2_5_14_B.value,
+                temperature=0.7,


why 0.7? set 0

emnigma · 2026-01-29T11:08:23Z

src/summarize_algorithms/core/base_dialogue_system.py

+            self.memory_llm = ChatOllama(
+                model=LocalModels.QWEN_2_5_14_B.value,


ChatOllama is not under isLocal?


emnigma · 2026-02-05T09:45:31Z

src/algorithms/simple_algorithms/dialog_short_tools.py

+        for message in messages:
+            if isinstance(message, ToolMessage):
+                message.content = ""


<compressed content> to signal that this was important
also, are tool call arguments being saved?
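
One possible shape for this, as a sketch only. It uses a small stand-in `Message` dataclass rather than the real LangChain classes so that it runs on its own; the `role` and `tool_calls` fields and the function name are hypothetical. It replaces tool results with a placeholder, returns new objects instead of mutating the input, and leaves the assistant messages (where the tool call arguments live) untouched.

```python
from __future__ import annotations

from dataclasses import dataclass, replace

PLACEHOLDER = "<compressed content>"


@dataclass(frozen=True)
class Message:
    """Stand-in for a chat message; not the LangChain type."""

    role: str  # "human", "ai", or "tool"
    content: str
    tool_calls: tuple = ()  # (name, json_args) pairs on "ai" messages


def shorten_tool_results(messages: list[Message]) -> list[Message]:
    """Replace each tool result with a placeholder, keeping everything else."""
    return [
        replace(m, content=PLACEHOLDER) if m.role == "tool" else m
        for m in messages
    ]
```

In LangChain itself, the name and arguments of a call are carried in `tool_calls` on the `AIMessage`, while `ToolMessage` holds the result, so blanking only `ToolMessage.content` should leave the arguments in the history.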

emnigma · 2026-02-05T09:46:01Z

src/algorithms/simple_algorithms/dialog_short_tools.py

+        :param sessions: past sessions.
+        :return: list[BaseMessage]: flattened history with shortened tool messages.
+        """
+        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)


super()._get_context? if super changes, this will break
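
A minimal illustration of the suggestion, with messages reduced to plain strings. It assumes `_get_context` can be an instance method; zero-argument `super()` does not work inside a `@staticmethod`, so if the real method is static the override would need to become an instance or class method first. The class bodies here are simplified stand-ins, not the PR's code.

```python
from __future__ import annotations


class DialogueBaseline:
    def _get_context(self, sessions: list[list[str]]) -> list[str]:
        # Flatten all past sessions into one message list.
        return [message for session in sessions for message in session]


class DialogueShortTools(DialogueBaseline):
    def _get_context(self, sessions: list[list[str]]) -> list[str]:
        # super() follows the MRO, so renaming or swapping the base class
        # does not require touching this line.
        messages = super()._get_context(sessions)
        return [
            "tool: <compressed content>" if m.startswith("tool:") else m
            for m in messages
        ]
```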

emnigma · 2026-02-05T09:46:19Z

src/algorithms/simple_algorithms/dialog_with_weights.py

+        :param sessions: past sessions.
+        :return: list[BaseMessage]: flattened history with weighted truncation applied.
+        """
+        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)


emnigma · 2026-02-05T09:47:00Z

src/algorithms/simple_algorithms/dialog_with_weights.py

+            message.content = message.content[:ceil(len(message.content) * coefficient)]
+            cropped_messages.append(message)


we need to signal that the content was cropped. insert "..." at the end
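
A small sketch of the cropping with a marker, working on a plain string. The helper name `crop_with_marker` and the `marker` parameter are hypothetical; the slicing itself mirrors the line under review.

```python
from math import ceil


def crop_with_marker(text: str, coefficient: float, marker: str = "...") -> str:
    """Keep the first ceil(len * coefficient) characters; mark the cut if any."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient must be within [0, 1]")
    keep = ceil(len(text) * coefficient)
    if keep >= len(text):
        return text  # nothing was removed, so no marker is added
    return text[:keep] + marker
```

Returning a new string rather than assigning to `message.content` also avoids mutating the stored sessions.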

emnigma · 2026-02-05T09:51:39Z

src/algorithms/simple_algorithms/dialogue_baseline.py

+        @retry(
+            stop=stop_after_attempt(3),
+            wait=wait_exponential(multiplier=1, min=2, max=10),
+            retry=retry_if_exception_type(OutputParserException),


emnigma · 2026-02-05T10:04:46Z

src/algorithms/summarize_algorithms/core/memory_storage/models.py

+
+    @override
+    @classmethod
+    def from_block(cls, block: ToolCallBlock, session_id: int):


ToolMemoryFragment can be a generic with cls type. smth like ToolMemoryFragment(MemoryFragment[ToolBlock])

then override can also be generic and typed:

def from_block(block: ToolCallBlock, session_id: int) -> ToolBlock: ...
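
A sketch of how the generic version could look. The field names (`content`, `session_id`, `name`, `arguments`) and the formatting inside `from_block` are assumptions for illustration, and the return type here is the fragment class rather than the block type written above.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar

BlockT = TypeVar("BlockT")


@dataclass
class ToolCallBlock:
    name: str
    arguments: dict


@dataclass
class MemoryFragment(Generic[BlockT]):
    content: str
    session_id: int

    @classmethod
    def from_block(cls, block: BlockT, session_id: int) -> MemoryFragment[BlockT]:
        raise NotImplementedError


@dataclass
class ToolMemoryFragment(MemoryFragment[ToolCallBlock]):
    @classmethod
    def from_block(cls, block: ToolCallBlock, session_id: int) -> ToolMemoryFragment:
        # The block type is fixed by the generic parameter of the base class.
        return cls(
            content=f"{block.name}({block.arguments})",
            session_id=session_id,
        )
```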

emnigma · 2026-02-05T10:06:23Z

.DS_Store

remove from git. it is doable

emnigma · 2026-02-05T10:06:41Z

run.sh

why do you need this?

emnigma · 2026-02-05T10:07:10Z

Makefile

why do you need this?

emnigma · 2026-02-05T10:08:29Z

.gitignore

+src/benchmark/simple_benchmarking/agent_chat/logs
+src/benchmark/tool_plan_benchmarking/logs


smth like **/logs ? or *.log ? I do not remember correct syntax


michaelharlam added 19 commits November 9, 2025 16:26

feat: added launch of new evaluation

488fe6b

refactoring 1

6c0bd10

refactoring 2

66fc521

refactoring 3

0944ee5

feat: evaluating is ready for launching

abf4bb7

feat: added f1 score tool evaluator

dace504

feat: launch f1

d0c10b0

feat: added statistics

862a964

feat: added working statistics calculating

3e05acb

feat: jsonl -> json in memory

df80de8

feat: refactoring 1

e4b8544

feat: added parsing new data type

68953a5

feat: added color logs, shuffled launch

afe523e

benchmark + new algorithms

7e263fb

feat: rewrite from string prompt to list of messages (only baseline dialogue)

093b202

feat: added DialogueWithWeights

8d2c76e

fix: style

e54a59e

feat: added langchain messages to memory algs. also changes by mypy, by ruff and in tests

50b3e80

fix: ruff and mypy

9e371d8

mikhail-kharlamov changed the title ~~Benchmarking first stage~~ Benchmark for Chat History Compression Algorithms in Agentic Systems with Function Calling Jan 7, 2026

michaelharlam added 3 commits January 7, 2026 13:48

fix: some important constants move to .env

2c75ff8

fix: mypy & ruff

186d0b8

fix: pytest version

1d5223c

emnigma self-requested a review January 9, 2026 08:10

michaelharlam added 5 commits January 24, 2026 05:19

feat: added separation into a main model and the model for memory; added feature for using local models

50f6bbd

fix: deleted extra file

d8834a1

feat: added feature for using local models in the dialogue_baseline

3798e9c

fix: memory storage

3936d9f

fix: by ruff

8963da4

emnigma requested changes Jan 29, 2026


michaelharlam added 7 commits January 29, 2026 18:05

fix: removed comments & hardcode

e38c3d4

feat: added docs to the main classes and methods

0b395ed

feat: project restructured, added makefile (now it doesn't work), prompt is divided, all of system messages are assembled into the one

93f50fe

fix: refactored

04fca3d

fix: removed vestige prompt

ba33cbc

fix: removed vestige prompt

8bba95f

feat: refactoring, added more tests

99c0474

emnigma reviewed Feb 5, 2026


michaelharlam added 3 commits February 10, 2026 19:51

feat: added semantic similarity for f1

e05c4b4

feat: added calculating by logs, args similarities, the stub of the llm-as-a-judge

88dd410

feat: llm as a judge template is updated

8d194f0
