Seed data creation and data preprocessing improvements #4
base: main
Signed-off-by: Ali Maredia <[email protected]>
create_seed_dataset.py used in this notebook is heavily inspired by docprocessor.py from the sdg-hub repo. Co-authored-by: Abhishek B <[email protected]> Co-authored-by: shiv <[email protected]> Signed-off-by: Ali Maredia <[email protected]>
871297c to 0058f67
Just a couple nits, no merge blockers. LGTM.
chunk_files = path.glob('*.txt')

chunks_dict = {}
for file in chunk_files:
Is the file order important here? I don't think it is, but if it is, afaik glob
lists in arbitrary order.
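If the order ever does matter, wrapping the glob in `sorted()` is one way to make it deterministic. The sketch below is only an illustration: the helper name `read_chunks_sorted` and the choice of the file stem as the dictionary key are assumptions, not what the notebook actually does.

```python
from pathlib import Path


def read_chunks_sorted(path: Path) -> dict[str, str]:
    """Read every *.txt file directly under `path`, in sorted filename order.

    Hypothetical helper: the real notebook may key or load chunks differently.
    """
    chunks_dict = {}
    # Path.glob makes no ordering promise, so sort the results to get a
    # reproducible iteration order across machines and runs.
    for file in sorted(path.glob("*.txt")):
        chunks_dict[file.stem] = file.read_text(encoding="utf-8")
    return chunks_dict
```

Since Python dicts preserve insertion order, the keys of the returned dict follow the sorted filenames.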
    {"chunks": "document"}
)

# Only keep document greater than 100 tokens
Nit, it would be nice to have a comment about why we need to do this.
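For illustration, a commented version of the filter might look like the sketch below. Everything beyond the 100-token threshold is an assumption: the whitespace-based token count, the helper names, and the stated rationale are guesses, since the PR does not say how tokens are counted or why short documents are dropped.

```python
MIN_TOKENS = 100  # threshold taken from the comment in the diff


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for tokens here;
    # the notebook may use a real tokenizer instead.
    return len(text.split())


def keep_long_documents(records: list[dict], min_tokens: int = MIN_TOKENS) -> list[dict]:
    # Possible rationale (a guess, not stated in the PR): very short chunks
    # such as headings or stray fragments carry too little context to be
    # useful as seed documents, so only longer documents are kept.
    return [r for r in records if count_tokens(r["document"]) > min_tokens]
```

The comparison is strict, matching the "greater than 100 tokens" wording, so a document with exactly 100 tokens is dropped.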
Signed-off-by: Ali Maredia <[email protected]>
This PR 1) adds a notebook that combines qna.yaml files and chunks into the seed_data.jsonl consumed by notebooks like this one in sdg-hub, and 2) addresses issues from the previously merged conversion and chunking notebooks.