Benchmark for Chat History Compression Algorithms in Agentic Systems with Function Calling #13
Open
mikhail-kharlamov wants to merge 37 commits into emnigma:main from
added 19 commits
November 9, 2025 16:26
…by ruff and in tests
added 5 commits
January 24, 2026 05:19
emnigma requested changes Jan 29, 2026
src/benchmarking/tool_metrics/run.py (outdated)
Comment on lines 93 to 94

    if count_of_sessions == 3:
        print("ye")
src/benchmarking/tool_metrics/run.py (outdated)

    #runner = Runner()
    runner = Runner()
    #runner.run(sys.argv[1])
Comment on lines 93 to 96

    #self.memory_llm = ChatOpenAI(
    #    model=OpenAIModels.GPT_5_MINI.value,
    #    api_key=SecretStr(api_key)
    #)

    #)
    self.memory_llm = ChatOllama(
        model=LocalModels.QWEN_2_5_14_B.value,
        temperature=0.7,
Comment on lines 97 to 98

    self.memory_llm = ChatOllama(
        model=LocalModels.QWEN_2_5_14_B.value,
added 7 commits
January 29, 2026 18:05
…mpt is divided, all of system messages are assembled into the one
emnigma reviewed Feb 5, 2026
Comment on lines +27 to +29

    for message in messages:
        if isinstance(message, ToolMessage):
            message.content = ""

Owner
Use "<compressed content>" instead of an empty string, to signal that something important was here.
Also, are tool call arguments being saved?
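One way to act on both review points, sketched under assumptions: the AIMessage and ToolMessage dataclasses below are hypothetical, minimal stand-ins for LangChain's message classes (only content, tool_calls and tool_call_id are modeled), and compress_tool_outputs is a hypothetical helper name. Tool outputs are replaced by a "<compressed content>" marker, while the tool call arguments, which sit on the AI message that issued the call, are left intact.

```python
from dataclasses import dataclass, field


# Hypothetical stand-ins for LangChain's AIMessage / ToolMessage, reduced to the
# fields this sketch needs so it runs without LangChain installed.
@dataclass
class AIMessage:
    content: str
    tool_calls: list[dict] = field(default_factory=list)


@dataclass
class ToolMessage:
    content: str
    tool_call_id: str


PLACEHOLDER = "<compressed content>"


def compress_tool_outputs(messages: list) -> list:
    """Return a copy of the history with tool outputs replaced by a marker.

    Tool call arguments live on the AIMessage that issued the call, so those
    messages are passed through unchanged and the arguments stay visible.
    """
    compressed = []
    for message in messages:
        if isinstance(message, ToolMessage):
            # Keep tool_call_id so each tool result still pairs with its call.
            compressed.append(ToolMessage(content=PLACEHOLDER, tool_call_id=message.tool_call_id))
        else:
            compressed.append(message)
    return compressed
```

Building new tool messages instead of assigning message.content in place may also matter here: if the flattened history shares message objects with the stored sessions, in-place edits would leak into whatever reads those sessions next.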
        :param sessions: past sessions.
        :return: list[BaseMessage]: flattened history with shortened tool messages.
        """
        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)

Owner
Should this be super()._get_context(sessions)? If the base class changes, the hard-coded DialogueBaseline reference will break.
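A minimal sketch of the suggested change, assuming _get_context is a classmethod (the original call DialogueBaseline._get_context(sessions) passes no instance). The subclass name WeightedTruncation and the plain-string messages are hypothetical simplifications. Zero-argument super() relies on the method's first argument (cls or self), so it works in a classmethod or instance method but not in a staticmethod.

```python
class DialogueBaseline:
    @classmethod
    def _get_context(cls, sessions: list[list[str]]) -> list[str]:
        # Flatten past sessions into one chronological history.
        return [message for session in sessions for message in session]


class WeightedTruncation(DialogueBaseline):  # hypothetical subclass name
    @classmethod
    def _get_context(cls, sessions: list[list[str]]) -> list[str]:
        # super() resolves through the MRO, so renaming or re-parenting the
        # base class does not require editing this line.
        messages = super()._get_context(sessions)
        # Placeholder post-processing: keep the first 5 characters of each message.
        return [message[:5] for message in messages]
```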
        :param sessions: past sessions.
        :return: list[BaseMessage]: flattened history with weighted truncation applied.
        """
        messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)
Comment on lines +46 to +47

    message.content = message.content[:ceil(len(message.content) * coefficient)]
    cropped_messages.append(message)

Owner
We need to signal that cropping happened: insert "..." at the end.
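A minimal sketch of the requested marker, assuming message content is a plain string and coefficient lies in [0, 1] (LangChain message content can also be a list of content blocks, which this ignores). crop_with_marker is a hypothetical helper; it appends "..." only when characters were actually removed, so messages that were not cut are not mislabeled.

```python
from math import ceil

CROP_MARKER = "..."


def crop_with_marker(content: str, coefficient: float) -> str:
    """Keep the first ceil(len(content) * coefficient) characters of content.

    Appends CROP_MARKER when anything was cut, so the agent can tell the
    message was truncated rather than originally short.
    """
    keep = ceil(len(content) * coefficient)
    if keep >= len(content):
        # Nothing removed: return the message unchanged, without a marker.
        return content
    return content[:keep] + CROP_MARKER
```

In the loop above, the assignment would then become message.content = crop_with_marker(message.content, coefficient).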
    @retry(
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        retry=retry_if_exception_type(OutputParserException),
    @override
    @classmethod
    def from_block(cls, block: ToolCallBlock, session_id: int):

Owner
ToolMemoryFragment could be generic over the block type, something like ToolMemoryFragment(MemoryFragment[ToolBlock]).
Then the override can also be generic and typed:

    def from_block(block: ToolCallBlock, session_id: int) -> ToolBlock:
        ...
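A minimal sketch of the generic design the comment describes, using typing.Generic and a TypeVar bound to a block base class. Block, ToolCallBlock, MemoryFragment and ToolMemoryFragment here are simplified, hypothetical stand-ins for the PR's classes, whose real fields are not shown in this thread. The PR's @override decorator is omitted so the sketch also runs on Python versions before 3.12, where typing.override was added.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Generic, TypeVar


@dataclass
class Block:
    """Hypothetical base class for blocks extracted from a dialogue."""
    text: str


@dataclass
class ToolCallBlock(Block):
    """Hypothetical block produced by a single tool call."""
    tool_name: str


BlockT = TypeVar("BlockT", bound=Block)


@dataclass
class MemoryFragment(Generic[BlockT]):
    """A stored piece of memory, parameterised by the block type it wraps."""
    block: BlockT
    session_id: int

    @classmethod
    def from_block(cls, block: BlockT, session_id: int) -> MemoryFragment[BlockT]:
        return cls(block=block, session_id=session_id)


class ToolMemoryFragment(MemoryFragment[ToolCallBlock]):
    """Fixing the type argument makes from_block expect a ToolCallBlock."""

    @classmethod
    def from_block(cls, block: ToolCallBlock, session_id: int) -> ToolMemoryFragment:
        # A static type checker now flags non-tool blocks passed here.
        return cls(block=block, session_id=session_id)
```

The sketch annotates from_block as returning the fragment type, on the assumption that the factory builds a fragment from a block; if ToolBlock in the comment names that fragment type, the two agree.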
Comment on lines +13 to +14

    src/benchmark/simple_benchmarking/agent_chat/logs
    src/benchmark/tool_plan_benchmarking/logs

Owner
Something like **/logs, or *.log? I don't remember the correct syntax.
This PR introduces a benchmarking framework and experimental results for comparing chat history compression algorithms in agentic systems. The primary goal is to evaluate how different memory management strategies affect an LLM agent's ability to correctly plan and execute tool calls (function calling) over extended dialogue sessions.
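One plausible way to score "correctly plan and execute tool calls" is sketched below; it is an illustrative assumption, not the metric implemented in src/benchmarking/tool_metrics. ToolCall, make_call and score_plan are hypothetical names. A predicted plan is compared with a reference plan both as an exact ordered match and as the fraction of reference calls (same name and arguments) that the agent reproduced.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical hashable record of one tool call (argument values must be hashable)."""
    name: str
    args: tuple[tuple[str, object], ...]


def make_call(name: str, **kwargs: object) -> ToolCall:
    # Sort arguments by key so that keyword order does not affect equality.
    return ToolCall(name=name, args=tuple(sorted(kwargs.items())))


def score_plan(expected: list[ToolCall], predicted: list[ToolCall]) -> dict[str, float]:
    """Compare an agent's tool-call plan against a reference plan."""
    exact_match = 1.0 if predicted == expected else 0.0
    if not expected:
        # An empty reference plan is trivially fully recalled.
        return {"exact_match": exact_match, "call_recall": 1.0}
    # Multiset intersection: each reference call can be matched at most once.
    matched = Counter(expected) & Counter(predicted)
    return {"exact_match": exact_match, "call_recall": sum(matched.values()) / len(expected)}
```

Averaging such scores per compression strategy and comparing them with the Full Context baseline would yield one row of an aggregated results table.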
Key Features & Implementations
The benchmark compares the following compression strategies against a baseline (Full Context):
Methodology
The results of the experiments, including aggregated metrics tables and visualization graphs, are generated by the framework.