 5 |  5 | - All numeric scores should be scored from 0.0 - 100.0, where 100 is the best score and 0 is the worst
 6 |  6 | - The original prompt provided to the LLM can be found between the <prompt></prompt> tags
 7 |  7 | - The output of the LLM for the given prompt can be found between the <output></output> tags
 8 |    | -- Additional direction for each evaluation may be marked in the input between <direction></direction> tags
 9 |    | -- The <expected-tools></expected-tools> section is provided to the user as a hint about which tools are expected to be used
10 |    | - if all of them are not needed that shouldn't affect the score, however it is not desirable for non-expected tools to be
11 |    | - used
   |  8 | +- Additional information and context for each evaluation is included in the <settings></settings> section
   |  9 | +- The <expected-tools></expected-tools> section is provided by the user to list which tools are expected to be used
   | 10 | + if all of them are not needed is okay and should not affect the score, however it is not desirable for non-expected
   | 11 | + tools to be used
12 | 12 | - Do not make assumptions about improvements to the quality of the output beyond what is noted in the <check></check> tags,
13 | 13 | the <check> section is defined by the user as a way to validate the output given for the associated prompt
14 |    | -- The accuracy score should reflect the accuracy of the result generally and taking into account the <direction> block
15 |    | -- The tool use score should be based on whether or not the correct tool was used and whether the minimum amount
   | 14 | +- The accuracy score should reflect the accuracy of the result generally and taking into account the <check> block and results
   | 15 | + of tool calls
   | 16 | +- The tool_use score should be based on whether or not the correct tool was used and whether the minimum amount
16 | 17 | of tools were used to accomplish a task. Over use of tools or repeated use of tools should deduct points from
17 |    | - this score.
   | 18 | + this score. This score should also be affected by how well the tools used conform to the tools listed in the
   | 19 | + <expected-tools> block.
   | 20 | +- If more tools are used than the number of max tools specified then points should be deducted from the tool_use
   | 21 | + score
18 | 22 | - The helpfulness score should measure how useful the response is in addressing the user's need
19 | 23 | - The quality score should reflect the overall quality, clearness and conciseness of the output
20 | 24 | - Try to utilize the tools that are available instead of searching for new tools
21 | 25 | - Not using any tools should deduct some points from the tool use score
22 | 26 |
23 | 27 | Advanced evaluation metrics:
   | 28 | +- A guess should not be considered a hallucination, however it should affect the accuracy score
24 | 29 | - The hallucination_score should measure the presence of made-up, incorrect, or factually unsupported statements
25 | 30 | (lower is better, with 0 being no hallucinations and 100 being completely hallucinated)
26 | 31 | - hallucination_score should only apply to made up information, if information is true at the time of the request
|
33 | 38 | 3. Whether hallucinations are central to the response or peripheral
34 | 39 | 4. Whether the hallucination could lead to harmful actions if believed
35 | 40 |
   | 41 | +For the hallucination_score metric (0-100 scale, lower is better), carefully check for any false statements,
   | 42 | +incorrect information, or made-up facts in the response and list them in the false_claims field.
   | 43 | +
36 | 44 | Be thorough in your evaluation, considering how well the model's response meets both technical requirements and user needs.
37 | 45 | """
38 | 46 |
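
The rubric in this prompt names several scores and a `false_claims` field, so a caller will likely want to sanity-check whatever the evaluator returns. The following is only a minimal sketch of such a check, not code from this change: the `EvaluationResult` class, its exact field set, and the `tool_use_flags` helper are hypothetical names chosen for illustration. The helper reports the conditions the prompt says should cost points on the tool_use score; it deliberately assigns no numeric penalty, because the prompt leaves that to the evaluator.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical field names, taken from the scores mentioned in the prompt text.
SCORE_FIELDS = ("accuracy", "tool_use", "helpfulness", "quality", "hallucination_score")


@dataclass
class EvaluationResult:
    """Hypothetical container for one evaluation; every score is on a 0.0-100.0 scale."""
    accuracy: float
    tool_use: float
    helpfulness: float
    quality: float
    hallucination_score: float  # lower is better, unlike the other scores
    false_claims: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject any score outside the range the prompt asks for.
        for name in SCORE_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 100.0:
                raise ValueError(f"{name} must be within 0.0-100.0, got {value}")


def tool_use_flags(used: List[str], expected: List[str],
                   max_tools: Optional[int] = None) -> Dict[str, object]:
    """Report the conditions the prompt says should deduct from the tool_use score."""
    return {
        "no_tools_used": len(used) == 0,
        "unexpected_tools": sorted(set(used) - set(expected)),
        "repeated_tools": sorted({t for t in used if used.count(t) > 1}),
        "over_max_tools": max_tools is not None and len(used) > max_tools,
    }
```

Note that skipping some of the expected tools is not flagged, which matches the prompt's statement that unused expected tools should not affect the score.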