
Commit e7d9bfe

cleanup: improve judge prompt
1 parent: 692077c

3 files changed: +20 -15 lines


mcpx_eval/constants.py

Lines changed: 15 additions & 7 deletions
@@ -5,22 +5,27 @@
 - All numeric scores should be scored from 0.0 - 100.0, where 100 is the best score and 0 is the worst
 - The original prompt provided to the LLM can be found between the <prompt></prompt> tags
 - The output of the LLM for the given prompt can be found between the <output></output> tags
-- Additional direction for each evaluation may be marked in the input between <direction></direction> tags
-- The <expected-tools></expected-tools> section is provided to the user as a hint about which tools are expected to be used
-  if all of them are not needed that shouldn't affect the score, however it is not desirable for non-expected tools to be
-  used
+- Additional information and context for each evaluation is included in the <settings></settings> section
+- The <expected-tools></expected-tools> section is provided by the user to list which tools are expected to be used;
+  it is okay if not all of them are needed, and this should not affect the score, however it is not desirable for
+  non-expected tools to be used
 - Do not make assumptions about improvements to the quality of the output beyond what is noted in the <check></check> tags,
   the <check> section is defined by the user as a way to validate the output given for the associated prompt
-- The accuracy score should reflect the accuracy of the result generally and taking into account the <direction> block
-- The tool use score should be based on whether or not the correct tool was used and whether the minimum amount
+- The accuracy score should reflect the accuracy of the result generally, taking into account the <check> block and results
+  of tool calls
+- The tool_use score should be based on whether or not the correct tool was used and whether the minimum number
   of tools were used to accomplish a task. Over use of tools or repeated use of tools should deduct points from
-  this score.
+  this score. This score should also be affected by how well the tools used conform to the tools listed in the
+  <expected-tools> block.
+- If more tools are used than the specified maximum number of tools, points should be deducted from the tool_use
+  score
 - The helpfulness score should measure how useful the response is in addressing the user's need
 - The quality score should reflect the overall quality, clearness and conciseness of the output
 - Try to utilize the tools that are available instead of searching for new tools
 - Not using any tools should deduct some points from the tool use score
 
 Advanced evaluation metrics:
+- A guess should not be considered a hallucination; however, it should affect the accuracy score
 - The hallucination_score should measure the presence of made-up, incorrect, or factually unsupported statements
   (lower is better, with 0 being no hallucinations and 100 being completely hallucinated)
 - hallucination_score should only apply to made up information, if information is true at the time of the request
@@ -33,6 +38,9 @@
 3. Whether hallucinations are central to the response or peripheral
 4. Whether the hallucination could lead to harmful actions if believed
 
+For the hallucination_score metric (0-100 scale, lower is better), carefully check for any false statements,
+incorrect information, or made-up facts in the response and list them in the false_claims field.
+
 Be thorough in your evaluation, considering how well the model's response meets both technical requirements and user needs.
 """
 
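The metrics this prompt names (accuracy, tool_use, helpfulness, quality, hallucination_score) plus the false_claims field map naturally onto a structured result type. Below is a minimal sketch of such a schema using pydantic; the class name, constraints, and exact fields are assumptions for illustration, not mcpx_eval's actual model.

# Hypothetical sketch of a result schema matching the metrics the judge
# prompt asks for; the real mcpx_eval schema may differ.
from pydantic import BaseModel, Field


class JudgeScores(BaseModel):  # assumed name, for illustration only
    # All scores use the 0.0 - 100.0 scale described in the prompt
    accuracy: float = Field(ge=0.0, le=100.0)
    tool_use: float = Field(ge=0.0, le=100.0)
    helpfulness: float = Field(ge=0.0, le=100.0)
    quality: float = Field(ge=0.0, le=100.0)
    # Lower is better: 0 = no hallucinations, 100 = completely hallucinated
    hallucination_score: float = Field(ge=0.0, le=100.0)
    # False or made-up statements found in the output
    false_claims: list[str] = Field(default_factory=list)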

mcpx_eval/judge.py

Lines changed: 4 additions & 8 deletions
@@ -160,14 +160,10 @@ async def run(self, prompt, check, expected_tools, max_tool_calls: int | None =
         logger.info(f"Analyzing results of {model.name}")
         res = await self.agent.run(
             user_prompt=f"""
-<direction>
-The maximum number of tools calls is: {max_tool_calls}
-
-Today's date and time: {datetime.now().isoformat()}
-
-For the hallucination_score metric (0-100 scale, lower is better), carefully check for any false statements,
-incorrect information, or made-up facts in the response and list them in the false_claims field.
-</direction>
+<settings>
+Max tool calls: {max_tool_calls}
+Current date and time: {datetime.now().isoformat()}
+</settings>
 <prompt>
 {prompt}
 </prompt>
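After this change the judge's user prompt opens with a compact <settings> block instead of the old <direction> block, with the hallucination guidance moved into the system prompt in constants.py. Here is a standalone sketch of how the rendered prompt comes together; build_user_prompt is a hypothetical helper written for illustration, not part of mcpx_eval.

# Sketch: assemble the judge's user prompt the way the new f-string does.
from datetime import datetime


def build_user_prompt(prompt: str, output: str, max_tool_calls: int | None) -> str:
    # Mirrors the <settings>/<prompt> layout from judge.py; the <output>
    # section is implied by the system prompt in constants.py.
    return f"""
<settings>
Max tool calls: {max_tool_calls}
Current date and time: {datetime.now().isoformat()}
</settings>
<prompt>
{prompt}
</prompt>
<output>
{output}
</output>
"""


print(build_user_prompt("Add 1 and 2", "The sum is 3", max_tool_calls=10))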

run.sh

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
 
 remote_models="\
   --model o1 \
+  --model openai:o3-mini \
   --model gpt-4o \
   --model claude-3-5-sonnet-latest \
   --model claude-3-7-sonnet-latest
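Note the provider-prefixed form openai:o3-mini alongside bare names like o1. A small illustrative sketch of how such "provider:model" strings can be split; parse_model is hypothetical and not mcpx_eval's actual parsing logic.

# Sketch: split a "provider:model" spec; bare names fall back to a default.
def parse_model(spec: str, default_provider: str = "openai") -> tuple[str, str]:
    provider, sep, name = spec.partition(":")
    if not sep:  # no colon present: the whole spec is the model name
        return default_provider, spec
    return provider, name


assert parse_model("openai:o3-mini") == ("openai", "o3-mini")
assert parse_model("o1") == ("openai", "o1")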
