Benchmark for Chat History Compression Algorithms in Agentic Systems with Function Calling #13

Open
mikhail-kharlamov wants to merge 37 commits into emnigma:main from mikhail-kharlamov:benchmarking-first-stage

Conversation

@mikhail-kharlamov
Contributor

@mikhail-kharlamov mikhail-kharlamov commented Nov 9, 2025

This PR introduces a benchmarking framework and experimental results for comparing chat history compression algorithms within an agentic system context. The primary goal is to evaluate how different memory management strategies affect an LLM agent's ability to correctly plan and execute tool calls (function calling) over extended dialogue sessions.

Key Features & Implementations
The benchmark compares the following compression strategies against a baseline (Full Context):

  • RecSum: Recursive summarization of the dialogue history.
  • MemoryBank: A memory-bank approach for storing and retrieving dialogue segments.
  • RecSum + RAG: Recursive summarization augmented with retrieval (RAG) specifically for the history of tool/function calls.
  • MemoryBank + RAG: MemoryBank augmented with RAG for tool/function calls.
  • V-shape: An algorithm that increases the compression ratio toward the middle of the context and decreases it again from the middle to the end.
  • Short-Tools: An algorithm that removes the results of tool executions from the context.
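
As a rough illustration of the V-shape idea, the sketch below gives each message a keep fraction that falls linearly toward the middle of the context and rises again toward the end. The linear profile, the `min_keep` floor, and the name `v_shape_keep_fractions` are assumptions made for illustration; the PR's actual weighting may differ.

```python
def v_shape_keep_fractions(n_messages: int, min_keep: float = 0.2) -> list[float]:
    """Return one keep fraction per message (hypothetical helper).

    Messages at both ends keep everything (1.0); the middle message keeps
    only `min_keep` of its content. The linear shape is an assumption.
    """
    if n_messages <= 0:
        return []
    if n_messages == 1:
        return [1.0]
    middle = (n_messages - 1) / 2
    fractions = []
    for index in range(n_messages):
        # Normalised distance from the middle: 0.0 at the centre, 1.0 at either end.
        distance = abs(index - middle) / middle
        fractions.append(min_keep + (1.0 - min_keep) * distance)
    return fractions
```

Each fraction can then be used as the truncation coefficient for the message at that position.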

Methodology

  1. Data: 15 real-world user-agent interaction sessions collected from IDE tasks involving the Explyt agent.
  2. Simulation: Since the proprietary agent code is restricted, the benchmark operates in a "planning mode." The LLM is tasked with generating a JSON-structured action plan based on a provided schema and the compressed context.
  3. Metrics:
    • F1 Score (Tool Names): Measures the accuracy of selected tools ignoring arguments.
    • F1 Score (Strict): Measures the accuracy of selected tools including the correctness of their arguments.
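
The two metrics above can be sketched as follows. This assumes tool calls are compared as multisets and are represented as dicts with `name` and `args` keys; the PR does not state how matches are counted, so the helper names, the data shape, and the choice to score two empty plans as 1.0 are all assumptions.

```python
import json
from collections import Counter


def f1_score(predicted: list[str], expected: list[str]) -> float:
    """Multiset F1 between two lists of comparable keys (assumed definition)."""
    if not predicted and not expected:
        return 1.0  # assumption: two empty plans count as a perfect match
    overlap = sum((Counter(predicted) & Counter(expected)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(expected)
    return 2 * precision * recall / (precision + recall)


def name_key(call: dict) -> str:
    # "Tool Names" variant: arguments are ignored.
    return call["name"]


def strict_key(call: dict) -> str:
    # "Strict" variant: arguments must match too; sorted keys make the
    # comparison independent of argument order.
    return call["name"] + json.dumps(call.get("args", {}), sort_keys=True)


predicted = [
    {"name": "read_file", "args": {"path": "a.py"}},
    {"name": "run_tests", "args": {}},
]
expected = [
    {"name": "read_file", "args": {"path": "b.py"}},
    {"name": "run_tests", "args": {}},
]

names_f1 = f1_score([name_key(c) for c in predicted], [name_key(c) for c in expected])
strict_f1 = f1_score([strict_key(c) for c in predicted], [strict_key(c) for c in expected])
```

In this toy example both tool names match, while only one call matches once arguments are compared, so the strict score is lower than the name-only score.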

The results of the experiments, including aggregated metrics tables and visualization graphs, are generated by the framework.

@mikhail-kharlamov mikhail-kharlamov changed the title from "Benchmarking first stage" to "Benchmark for Chat History Compression Algorithms in Agentic Systems with Function Calling" Jan 7, 2026
@emnigma emnigma self-requested a review January 9, 2026 08:10
Owner

@emnigma emnigma left a comment

last commit review

Comment on lines 93 to 94
if count_of_sessions == 3:
    print("ye")
Owner

??


#runner = Runner()
runner = Runner()
#runner.run(sys.argv[1])
Owner

?

Comment on lines 93 to 96
#self.memory_llm = ChatOpenAI(
# model=OpenAIModels.GPT_5_MINI.value,
# api_key=SecretStr(api_key)
#)
Owner

?

#)
self.memory_llm = ChatOllama(
    model=LocalModels.QWEN_2_5_14_B.value,
    temperature=0.7,
Owner

why 0.7? set 0

Comment on lines 97 to 98
self.memory_llm = ChatOllama(
    model=LocalModels.QWEN_2_5_14_B.value,
Owner

chatOllama is not isLocal ?

Comment on lines +27 to +29
for message in messages:
    if isinstance(message, ToolMessage):
        message.content = ""
Owner

<compressed content> to signal that this was important
also, are tool call arguments being saved?
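
A minimal sketch of the placeholder suggestion, using a hypothetical `Message` dataclass in place of the LangChain message classes so that it runs on its own. The `role` field and the function name are assumptions; in the PR the check would remain `isinstance(message, ToolMessage)`.

```python
from dataclasses import dataclass, replace

PLACEHOLDER = "<compressed content>"


@dataclass(frozen=True)
class Message:
    # Hypothetical stand-in for BaseMessage / ToolMessage.
    role: str  # "human", "ai" or "tool"
    content: str


def shorten_tool_messages(messages: list[Message]) -> list[Message]:
    """Replace tool results with a placeholder instead of an empty string."""
    shortened = []
    for message in messages:
        if message.role == "tool" and message.content:
            # Keep the message in place, but mark that its body was dropped.
            message = replace(message, content=PLACEHOLDER)
        shortened.append(message)
    return shortened
```

Returning copies rather than mutating in place also keeps the original sessions intact for other algorithms that read the same data. On the second question: in LangChain, tool call arguments are normally carried on the AI message's `tool_calls` field rather than on the `ToolMessage`, so blanking the tool result should leave them untouched, though that is worth confirming against the benchmark data.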

:param sessions: past sessions.
:return: list[BaseMessage]: flattened history with shortened tool messages.
"""
messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)
Owner

super()._get_context? if super changes, this will break
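
A small sketch of the difference, assuming `_get_context` is a static method on the base class (the call style in the diff suggests this, but it is not confirmed). The subclass name `ShortToolsSketch`, the method `build_context`, and the string-based sessions are hypothetical.

```python
class DialogueBaseline:
    @staticmethod
    def _get_context(sessions: list[list[str]]) -> list[str]:
        # Flatten all sessions into a single message list.
        return [message for session in sessions for message in session]


class ShortToolsSketch(DialogueBaseline):
    def build_context(self, sessions: list[list[str]]) -> list[str]:
        # super() resolves through the MRO, so renaming the base class or
        # inserting an intermediate class does not require touching this line.
        return super()._get_context(sessions)
```

With the hard-coded `DialogueBaseline._get_context(...)` form, changing the parent class would silently keep calling the old implementation.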

:param sessions: past sessions.
:return: list[BaseMessage]: flattened history with weighted truncation applied.
"""
messages: list[BaseMessage] = DialogueBaseline._get_context(sessions)
Owner

super?

Comment on lines +46 to +47
message.content = message.content[:ceil(len(message.content) * coefficient)]
cropped_messages.append(message)
Owner

we need to signal that there was a cropping. insert ... at the end
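
One way to do this, sketched as a standalone helper. The name `crop_with_marker` is hypothetical, and `coefficient` is assumed to lie in the range 0 to 1; the slice itself mirrors the line under review.

```python
from math import ceil

ELLIPSIS = "..."


def crop_with_marker(content: str, coefficient: float) -> str:
    """Keep the first `coefficient` share of `content`, marking any cut."""
    keep = ceil(len(content) * coefficient)
    if keep >= len(content):
        # Nothing was removed, so no marker is needed.
        return content
    return content[:keep] + ELLIPSIS
```

Appending the marker only when something was actually cut avoids suggesting truncation on messages that were kept whole.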

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    retry=retry_if_exception_type(OutputParserException),
Owner

why?


@override
@classmethod
def from_block(cls, block: ToolCallBlock, session_id: int):
Owner

ToolMemoryFragment can be a generic with cls type. smth like ToolMemoryFragment(MemoryFragment[ToolBlock])

then override can also be generic and typed:

def from_block(block: ToolCallBlock, session_id: int) -> ToolBlock:
    ...
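
A fuller sketch of the generic approach, assuming `MemoryFragment` is an abstract base class parameterised by the block type it is built from. The fields on `ToolCallBlock` and `ToolMemoryFragment` are hypothetical stand-ins, and the `@override` decorator is omitted so the example runs on Python versions that lack `typing.override`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Generic, TypeVar

BlockT = TypeVar("BlockT")


@dataclass
class ToolCallBlock:
    # Hypothetical stand-in for the PR's ToolCallBlock.
    name: str
    arguments: dict


class MemoryFragment(ABC, Generic[BlockT]):
    @classmethod
    @abstractmethod
    def from_block(cls, block: BlockT, session_id: int) -> "MemoryFragment[BlockT]":
        """Build a fragment from a block of the parameterised type."""


@dataclass
class ToolMemoryFragment(MemoryFragment[ToolCallBlock]):
    content: str
    session_id: int

    @classmethod
    def from_block(cls, block: ToolCallBlock, session_id: int) -> "ToolMemoryFragment":
        # The block parameter is now typed by the generic argument, so a type
        # checker can flag a call that passes the wrong block type.
        return cls(content=f"{block.name}({block.arguments})", session_id=session_id)
```

With this shape, each fragment subclass states which block type it accepts in its base class declaration, and the override stays consistent with the base signature.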

Owner

remove from git. it is doable

Owner

why do you need this?

Owner

why do you need this?

Comment on lines +13 to +14
src/benchmark/simple_benchmarking/agent_chat/logs
src/benchmark/tool_plan_benchmarking/logs
Owner

smth like **/logs ? or *.log ? I do not remember correct syntax
