
[Question/Reproduction] Significant discrepancy in Sentient Scores compared to the paper (Abnormal results for GPT-4o and Qwen) #26

@2200xiaohu

Description

Hi there,

I am currently attempting to reproduce the results reported in the SAGE paper. Following the instructions, I used DeepSeek-V3 as the judge model to evaluate several models, including gemini-2.5-pro, gpt-4o, and qwen3-8b.

However, the results I obtained are highly counterintuitive and do not align with the numbers reported in the paper. Specifically, the score for GPT-4o is unreasonably low, while the score for the smaller model (Qwen3-8B) is unexpectedly high.

Reproduction Results
Here are the Sentient Scores we obtained:

| Model | Our Score |
| --- | --- |
| Gemini-2.5-Pro | 90.61 |
| GPT-4o | 42.74 (too low) |
| Qwen3-8B | 66.80 (surprisingly high) |

Suspected Cause
We suspect this discrepancy is caused by the lack of model-specific call_llm implementations (or configurations) for these models in the repository. Without model-specific handling, the default generation hyperparameters (temperature, top_p, etc.) may be destabilizing the model outputs and skewing the evaluation results. A sketch of what we mean follows below.
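
To illustrate, here is a minimal sketch of the kind of model-specific call_llm wrapper we expected to find. The function name, the MODEL_CONFIGS mapping, and the hyperparameter values are our own assumptions for illustration, not code from the SAGE repository or the paper:

```python
# Hypothetical sketch (not from the SAGE repo): a call_llm that pins
# generation hyperparameters per model instead of relying on API defaults.
from openai import OpenAI

# Placeholder per-model settings; not the values used in the paper.
MODEL_CONFIGS = {
    "openai/gpt-4o": {"temperature": 0.7, "top_p": 0.8, "max_tokens": 8192},
    "qwen/qwen3-8b": {"temperature": 0.7, "top_p": 0.8, "max_tokens": 8192},
}

client = OpenAI(api_key="...", base_url="https://openrouter.ai/api/v1")

def call_llm(model_name: str, prompt: str) -> str:
    # Use the model-specific settings if present; otherwise fall back to API defaults.
    params = MODEL_CONFIGS.get(model_name, {})
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        **params,
    )
    return response.choices[0].message.content
```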

Configuration Details
Below are the settings we used for the reproduction:

1. NPC LLM Configuration
We used anthropic/claude-3.5-sonnet via OpenRouter:

# NPC model, served through OpenRouter
CUSTOM_API_CONFIG = {
    "api_key": "...",  # OpenRouter API key
    "base_url": "https://openrouter.ai/api/v1",
    "model_name": "anthropic/claude-3.5-sonnet",
    "max_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.8,
    "presence_penalty": 1.5,
}
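
For completeness, this is roughly how we wire that config into the standard openai Python SDK pointed at OpenRouter (the wrapper below is our own wiring, not code from the repository):

```python
# Assumed wiring (ours, not the SAGE repo's): build a client from
# CUSTOM_API_CONFIG and forward its generation settings on every call.
from openai import OpenAI

npc_client = OpenAI(
    api_key=CUSTOM_API_CONFIG["api_key"],
    base_url=CUSTOM_API_CONFIG["base_url"],
)

def call_npc(messages):
    response = npc_client.chat.completions.create(
        model=CUSTOM_API_CONFIG["model_name"],
        messages=messages,
        max_tokens=CUSTOM_API_CONFIG["max_tokens"],
        temperature=CUSTOM_API_CONFIG["temperature"],
        top_p=CUSTOM_API_CONFIG["top_p"],
        presence_penalty=CUSTOM_API_CONFIG["presence_penalty"],
    )
    return response.choices[0].message.content
```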

2. Judger LLM Configuration
We used deepseek-chat (DeepSeek-V3) as the judge:

# Judger LLM; "client" is the standard openai SDK pointed at OpenRouter
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://openrouter.ai/api/v1")

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # DeepSeek-V3 on OpenRouter
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

Request
Could you please clarify the exact hyperparameters (temperature, top_p, etc.) used for the models and the judge in the paper? Alternatively, are there specific configurations required in the call_llm function that we might have missed?

Any guidance on reproducing the official results would be greatly appreciated.

Thanks!
