Seed data creation and data preprocessing improvements #4

Open: wants to merge 3 commits into base: main

alimaredia (Contributor)

This PR 1) adds a notebook that combines qna.yaml files and chunks into the seed_data.jsonl that is consumed by notebooks like this one in sdg-hub, and 2) addresses issues from the previously merged conversion and chunking notebooks.

alimaredia and others added 2 commits April 24, 2025 08:13
create_seed_dataset.py used in this notebook
is heavily inspired by docprocessor.py from the
sdg-hub repo.

Co-authored-by: Abhishek B <[email protected]>
Co-authored-by: shiv <[email protected]>
Signed-off-by: Ali Maredia <[email protected]>
@alimaredia force-pushed the seed-data-creation-and-data-preprocessing-improvements branch from 871297c to 0058f67 on April 24, 2025 12:22

@fabianofranz left a comment

Just a couple nits, no merge blockers. LGTM.

chunk_files = path.glob('*.txt')

chunks_dict = {}
for file in chunk_files:

Is the file order important here? I don't think it is, but if it is, afaik glob lists in arbitrary order.
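
If the order ever does matter, here is a minimal sketch of a deterministic variant, assuming the chunks are plain .txt files in a single directory. `read_chunks` and the demo file names are hypothetical and not taken from this PR; the only change relative to the snippet above is wrapping the glob in `sorted()`.

```python
from pathlib import Path
import tempfile


def read_chunks(chunk_dir: Path) -> dict[str, str]:
    """Read every .txt file in chunk_dir, keyed by file name (hypothetical helper)."""
    chunks = {}
    # Path.glob makes no ordering guarantee; sorted() makes the iteration
    # order reproducible across runs and filesystems.
    for file in sorted(chunk_dir.glob("*.txt")):
        chunks[file.name] = file.read_text(encoding="utf-8")
    return chunks


# Small self-contained demo with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    for name in ("b.txt", "a.txt", "c.txt"):
        (demo_dir / name).write_text(f"contents of {name}", encoding="utf-8")
    demo_chunks = read_chunks(demo_dir)
    demo_order = list(demo_chunks)  # dicts preserve insertion order
```

Since the chunks end up in a dict keyed by file, sorting only changes iteration order, so it is cheap insurance rather than a required fix.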

{"chunks": "document"}
)

# Only keep document greater than 100 tokens

@fabianofranz commented on Apr 25, 2025

Nit, it would be nice to have a comment about why we need to do this.
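
A rough sketch of what such a comment could look like, written in plain Python rather than the notebook's actual dataset API. Everything here is an assumption: the rationale in the comment is a guess that the author would need to confirm, whitespace splitting stands in for whatever tokenizer the notebook uses, and `keep_long_documents` is a hypothetical helper.

```python
MIN_TOKENS = 100  # threshold taken from the comment in the diff


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate the real token count.
    return len(text.split())


def keep_long_documents(records, min_tokens=MIN_TOKENS):
    """Drop documents at or below min_tokens tokens (hypothetical helper).

    Guessed rationale (not confirmed in the PR): very short chunks, such as
    stray headings or page footers, carry too little context to be useful
    as seed documents.
    """
    # "greater than" is read strictly, matching the wording in the diff.
    return [r for r in records if count_tokens(r["document"]) > min_tokens]


demo_records = [
    {"document": "short chunk"},
    {"document": " ".join(["word"] * 150)},
]
demo_kept = keep_long_documents(demo_records)
```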
