Description
Hi there,
I am currently attempting to reproduce the results reported in the SAGE paper. Following the instructions, I used DeepSeek-V3 as the judge model to evaluate several models, including gemini-2.5-pro, gpt-4o, and qwen3-8b.
However, the experimental results I obtained are highly counter-intuitive and do not align with the benchmarks reported in the paper. Specifically, the score for GPT-4o is unreasonably low, while the score for the smaller model (Qwen3-8b) seems unexpectedly high.
Reproduction Results
Here are the Sentient Scores we obtained:
| Model | Our Score |
|---|---|
| Gemini-2.5-Pro | 90.61 |
| GPT-4o | 42.74 (Too Low) |
| Qwen3-8b | 66.80 (Surprisingly High) |
Suspected Cause
We suspect this discrepancy is caused by the repository lacking model-specific call_llm implementations (or configurations) for these models. Without explicit handling, each provider's default generation hyperparameters (temperature, top_p, etc.) may be applied, making the model outputs unstable and skewing the evaluation results.
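To illustrate the kind of handling we mean, here is a minimal sketch of a per-model hyperparameter registry that call_llm could consult instead of falling back to provider defaults. The registry name, function name, and the concrete values are our own illustrative assumptions, not the paper's official settings:

```python
# Hypothetical sketch: explicit per-model sampling parameters, so no model
# silently inherits a provider's defaults. Names and values are illustrative.
MODEL_GENERATION_PARAMS = {
    "anthropic/claude-3.5-sonnet": {"temperature": 0.7, "top_p": 0.8},
    "deepseek/deepseek-chat": {"temperature": 0.0},  # deterministic judge
}

# Fallback used for any model not listed above.
DEFAULT_PARAMS = {"temperature": 0.0, "top_p": 1.0}

def generation_params(model_name: str) -> dict:
    """Return explicit sampling parameters for a model (never provider defaults)."""
    return {**DEFAULT_PARAMS, **MODEL_GENERATION_PARAMS.get(model_name, {})}
```

With something like this, the kwargs passed to client.chat.completions.create would be pinned per model, which would at least rule out default-hyperparameter drift as the cause of the score gap.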
Configuration Details
Below are the settings we used for the reproduction:
1. NPC LLM Configuration
We used anthropic/claude-3.5-sonnet via OpenRouter:
```python
CUSTOM_API_CONFIG = {
    "api_key": "...",
    "base_url": "https://openrouter.ai/api/v1",
    "model_name": "anthropic/claude-3.5-sonnet",
    "max_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.8,
    "presence_penalty": 1.5,
}
```
2. Judger LLM Configuration
We used deepseek-chat (DeepSeek-V3) as the judge:
```python
# judger llm
response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # DeepSeek-V3 on OpenRouter
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)
```
Request
Could you please clarify the exact hyperparameters (temperature, top_p, etc.) used for the evaluated models and the judge in the paper? Alternatively, are there model-specific configurations required in the call_llm function that we might have missed?
Any guidance on reproducing the official results would be greatly appreciated.
Thanks!