
[Question/Reproduction] Significant discrepancy in Sentient Scores compared to the paper (Abnormal results for GPT-4o and Qwen) #26

@2200xiaohu

Description

Hi there,

I am currently attempting to reproduce the results reported in the SAGE paper. Following the instructions, I used DeepSeek-V3 as the judge model to evaluate several models, including gemini-2.5-pro, gpt-4o, and qwen3-8b.

However, the results I obtained are highly counterintuitive and do not align with the numbers reported in the paper. Specifically, the score for GPT-4o is unreasonably low, while the score for the smaller model (Qwen3-8B) is unexpectedly high.

Reproduction Results
Here are the Sentient Scores we obtained:

| Model | Our Score |
| --- | --- |
| Gemini-2.5-Pro | 90.61 |
| GPT-4o | 42.74 (too low) |
| Qwen3-8B | 66.80 (surprisingly high) |

Suspected Cause
We suspect this discrepancy is caused by the lack of model-specific call_llm implementations (or configurations) for these models in the repository. Without model-specific handling, the default generation hyperparameters (temperature, top_p, etc.) may be destabilizing the model outputs and skewing the evaluation results. A sketch of what we mean follows below.
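
To illustrate, here is a minimal sketch of the kind of model-specific call_llm wrapper we expected to find. The function name, the MODEL_CONFIGS mapping, and the hyperparameter values are our own assumptions for illustration, not code from the SAGE repository or the paper:

```python
# Hypothetical sketch (not from the SAGE repo): a call_llm that pins
# generation hyperparameters per model instead of relying on API defaults.
from openai import OpenAI

# Placeholder per-model settings; not the values used in the paper.
MODEL_CONFIGS = {
    "openai/gpt-4o": {"temperature": 0.7, "top_p": 0.8, "max_tokens": 8192},
    "qwen/qwen3-8b": {"temperature": 0.7, "top_p": 0.8, "max_tokens": 8192},
}

client = OpenAI(api_key="...", base_url="https://openrouter.ai/api/v1")

def call_llm(model_name: str, prompt: str) -> str:
    # Use the model-specific settings if present; otherwise fall back to API defaults.
    params = MODEL_CONFIGS.get(model_name, {})
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],
        **params,
    )
    return response.choices[0].message.content
```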

Configuration Details
Below are the settings we used for the reproduction:

1. NPC LLM Configuration
We used anthropic/claude-3.5-sonnet via OpenRouter:

# NPC model, served through OpenRouter
CUSTOM_API_CONFIG = {
    "api_key": "...",  # OpenRouter API key
    "base_url": "https://openrouter.ai/api/v1",
    "model_name": "anthropic/claude-3.5-sonnet",
    "max_tokens": 8192,
    "temperature": 0.7,
    "top_p": 0.8,
    "presence_penalty": 1.5,
}
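
For completeness, this is roughly how we wire that config into the standard openai Python SDK pointed at OpenRouter (the wrapper below is our own wiring, not code from the repository):

```python
# Assumed wiring (ours, not the SAGE repo's): build a client from
# CUSTOM_API_CONFIG and forward its generation settings on every call.
from openai import OpenAI

npc_client = OpenAI(
    api_key=CUSTOM_API_CONFIG["api_key"],
    base_url=CUSTOM_API_CONFIG["base_url"],
)

def call_npc(messages):
    response = npc_client.chat.completions.create(
        model=CUSTOM_API_CONFIG["model_name"],
        messages=messages,
        max_tokens=CUSTOM_API_CONFIG["max_tokens"],
        temperature=CUSTOM_API_CONFIG["temperature"],
        top_p=CUSTOM_API_CONFIG["top_p"],
        presence_penalty=CUSTOM_API_CONFIG["presence_penalty"],
    )
    return response.choices[0].message.content
```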

2. Judger LLM Configuration
We used deepseek-chat (DeepSeek-V3) as the judge:

# Judger LLM; "client" is the standard openai SDK pointed at OpenRouter
from openai import OpenAI

client = OpenAI(api_key="...", base_url="https://openrouter.ai/api/v1")

response = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # DeepSeek-V3 on OpenRouter
    messages=[{"role": "user", "content": prompt}],
    temperature=0.7,
)

Request
Could you please clarify the exact hyperparameters (temperature, top_p, etc.) used for the models and the judge in the paper? Alternatively, are there specific configurations required in the call_llm function that we might have missed?

Any guidance on reproducing the official results would be greatly appreciated.

Thanks!
