
Filtering the formatted data returns only a few samples for Llama3.1 #6


Description

@YefanZhou

Thanks for open-sourcing this amazing work! I tried to reproduce the results with Llama3.1-8B-Instruct and achieved 91.5% on the 7,473 samples of the GSM8K training set. However, when I used LLMLingua.py to filter the formatted data via this line, it returned only 847 samples.

Is this the expected behavior when trying to reproduce the results?

It seems the filtering relies on a specific answer format such as “The final answer is...”, as shown in this line. However, there are other valid formats, like \n\n\boxed{} and \n\boxed, that the current filtering logic might exclude.
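For illustration, here is a minimal sketch of a more permissive extraction step that accepts both formats before discarding a sample. It assumes each sample is a dict with an `output` field (the model generation) and an `answer` field (the GSM8K gold answer); the field names and regexes are my own, not the repository's actual filter.

```python
import re

# Hypothetical patterns: accept either "The final answer is ..." or \boxed{...}.
FINAL_ANSWER_RE = re.compile(r"The final answer is[:\s]*\$?(-?[0-9][0-9,./-]*)")
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")

def extract_answer(output: str):
    """Return the predicted answer string, or None if no known format matches."""
    m = FINAL_ANSWER_RE.search(output)
    if m:
        return m.group(1).rstrip(".").replace(",", "")
    m = BOXED_RE.search(output)
    if m:
        return m.group(1).replace(",", "").strip()
    return None

def keep_sample(sample: dict) -> bool:
    """Keep a sample only if an answer can be parsed and matches the gold label."""
    pred = extract_answer(sample["output"])             # model generation
    gold = sample["answer"].split("####")[-1].strip()   # GSM8K gold answer
    return pred is not None and pred == gold.replace(",", "")

# filtered = [s for s in samples if keep_sample(s)]
```

With something like this, samples that end in \boxed{} instead of “The final answer is...” would still be kept, which might explain part of the gap between 7,473 and 847.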

Lastly, I was wondering why you didn't perform the same filtering and CoT/answer split for the Qwen model, and why it is necessary for Llama.
The processed Qwen dataset has the answer repeated at the end of each output.
