Description
Thanks for open-sourcing this amazing work! I tried to reproduce the results using Llama3.1-8B-Instruct and achieved 91.5% on the 7,473 samples of the GSM8K training set. However, when I used LLMLingua.py to filter the formatted data with this line, it returned only 847 samples.
Is this the expected behavior when trying to reproduce the results?
It seems the filtering relies on a specific answer format, such as "The final answer is...", as shown in this line. However, there are other valid formats, such as `\n\n\boxed{}` and `\n\boxed`, that the current filtering logic might exclude.
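To illustrate the point, here is a minimal sketch of a more permissive answer extractor that accepts both the "The final answer is..." phrase and a `\boxed{}` format. The function name `extract_answer` and the exact regexes are hypothetical, not the repo's actual filtering code:

```python
import re

def extract_answer(output: str):
    """Hypothetical extractor accepting several common answer formats.

    The repo's current filter appears to match only the
    'The final answer is' phrase, dropping otherwise-valid samples.
    """
    # Format 1: "... The final answer is 18."
    m = re.search(r"The final answer is\s*\$?([\-0-9.,/]+)", output)
    if m:
        return m.group(1).rstrip(".")
    # Format 2: "... \boxed{72}"
    m = re.search(r"\\boxed\{([^}]*)\}", output)
    if m:
        return m.group(1)
    return None
```

With something along these lines, samples ending in `\boxed{}` would survive the filtering step instead of being discarded.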
Lastly, I was wondering why you didn't perform the same filtering and CoT/answer split for the Qwen model, and why it is necessary for LLaMA.
The processed Qwen dataset has the answer repeated at the end of each output.