
Commit 0a499b7: Add evaluation script

Parent: 8673270

File tree

3 files changed: +71 -1 lines changed

README.md (+1 -1)
```diff
@@ -25,7 +25,7 @@ The ChatGLM2-6B open-source model aims to advance large-model technology together with the open-source community
 Although every stage of training strives to ensure data compliance and accuracy, the relatively small scale of ChatGLM2-6B and the probabilistic randomness of generation mean that the accuracy of its output cannot be guaranteed, and the model is easily misled. **This project assumes no responsibility for data-security or public-opinion risks caused by the open-source model and code, nor for any risks and liabilities arising from the model being misled, abused, disseminated, or improperly exploited.**
 
 ## Evaluation Results
-We evaluated the model on a selection of typical Chinese and English datasets. Below are the results of ChatGLM2-6B on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (math), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English).
+We evaluated the model on a selection of typical Chinese and English datasets. Below are the results of ChatGLM2-6B on [MMLU](https://github.com/hendrycks/test) (English), [C-Eval](https://cevalbenchmark.com/static/leaderboard.html) (Chinese), [GSM8K](https://github.com/openai/grade-school-math) (math), and [BBH](https://github.com/suzgunmirac/BIG-Bench-Hard) (English). A script for evaluating on C-Eval is provided in [evaluation](./evaluation/README.md).
 
 ### MMLU
 
```

evaluation/README.md (+10)
First, download the preprocessed C-Eval dataset from [Tsinghua Cloud](https://cloud.tsinghua.edu.cn/f/e84444333b6d434ea7b0) and extract it into the `evaluation` directory. Then run

```shell
cd evaluation
python evaluate_ceval.py
```

This script runs prediction on the C-Eval validation set and prints the accuracy. To obtain results on the test set, change `./CEval/val/**/*.jsonl` in the code to `./CEval/test/**/*.jsonl`, then save the results in the format required by C-Eval and submit them on the [official website](https://cevalbenchmark.com/).

The reported results were produced with an internal parallel testing framework, so they may fluctuate slightly.
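For the test-set route, predictions must be collected into C-Eval's submission format rather than scored locally. As a rough sketch, assuming a `{subject: {question_index: letter}}` JSON layout (check the official website for the exact schema before submitting; `to_submission` is a hypothetical helper, not part of this commit):

```python
import json

def to_submission(preds_by_subject):
    # Hypothetical helper: convert per-subject predicted class indices (0-3)
    # into answer letters keyed by question index as strings.
    letters = "ABCD"
    return {
        subject: {str(i): letters[p] for i, p in enumerate(preds)}
        for subject, preds in preds_by_subject.items()
    }

# Example: three predictions for one subject, written to a JSON file.
submission = to_submission({"computer_network": [0, 2, 1]})
with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False)
```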

evaluation/evaluate_ceval.py (+60)
```python
import os
import glob
import re
import json
import torch
import torch.utils.data
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm

# Load ChatGLM2-6B in bfloat16 on the GPU.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm2-6b", trust_remote_code=True).bfloat16().cuda()

# First token id of each answer letter; used later to score only A/B/C/D.
choices = ["A", "B", "C", "D"]
choice_tokens = [tokenizer.encode(choice, add_special_tokens=False)[0] for choice in choices]


def build_prompt(text):
    # ChatGLM chat template: "[Round 1]\n\nQ: {text}\n\nA: ".
    return "[Round {}]\n\n问:{}\n\n答:".format(1, text)


# Appended after the model's free-form answer to elicit the final letter:
# "In summary, the correct option among A, B, C, D is:".
extraction_prompt = '综上所述,ABCD中正确的选项是:'

accuracy_dict, count_dict = {}, {}
with torch.no_grad():
    for entry in glob.glob("./CEval/val/**/*.jsonl", recursive=True):
        dataset = []
        with open(entry, encoding='utf-8') as file:
            for line in file:
                dataset.append(json.loads(line))
        correct = 0
        dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
        for batch in tqdm(dataloader):
            # Stage 1: let the model answer each question free-form.
            texts = batch["inputs_pretokenized"]
            queries = [build_prompt(query) for query in texts]
            inputs = tokenizer(queries, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
            outputs = model.generate(**inputs, do_sample=False, max_new_tokens=512)
            intermediate_outputs = []
            for idx in range(len(outputs)):
                # Keep only the newly generated tokens, not the prompt.
                output = outputs.tolist()[idx][len(inputs["input_ids"][idx]):]
                response = tokenizer.decode(output)
                intermediate_outputs.append(response)
            # Stage 2: append the extraction prompt and compare the logits of
            # the four choice tokens at the last position.
            answer_texts = [text + intermediate + "\n" + extraction_prompt for text, intermediate in
                            zip(texts, intermediate_outputs)]
            input_tokens = [build_prompt(answer_text) for answer_text in answer_texts]
            inputs = tokenizer(input_tokens, padding=True, return_tensors="pt", truncation=True, max_length=2048).to('cuda')
            outputs = model(**inputs, return_last_logit=True)
            logits = outputs.logits[:, -1]
            logits = logits[:, choice_tokens]
            preds = logits.argmax(dim=-1)
            correct += (preds.cpu() == batch["label"]).sum().item()
        accuracy = correct / len(dataset)
        print(entry, accuracy)
        accuracy_dict[entry] = accuracy
        count_dict[entry] = len(dataset)

# Overall accuracy, weighted by the number of questions in each file.
acc_total, count_total = 0.0, 0
for key in accuracy_dict:
    acc_total += accuracy_dict[key] * count_dict[key]
    count_total += count_dict[key]
print(acc_total / count_total)
```
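The key trick in the script above is restricted-choice scoring: rather than parsing the generated text for an answer letter, it compares the next-token logits of the four choice tokens and takes the argmax. A minimal, model-free sketch of that idea, using a plain dict of made-up logits (`pick_choice` is illustrative only, not part of the repo):

```python
def pick_choice(next_token_logits, choice_token_ids, choices="ABCD"):
    # Score only the tokens for A/B/C/D and return the highest-scoring letter.
    scores = [next_token_logits[t] for t in choice_token_ids]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return choices[best]

# Hypothetical logits over a tiny vocabulary; token ids 5..8 stand for A..D.
logits = {5: -1.2, 6: 3.4, 7: 0.1, 8: -0.5}
print(pick_choice(logits, [5, 6, 7, 8]))  # → B
```

This sidesteps brittle regex extraction of the answer and is why the script only needs the first token id of each letter.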
