Faster ingestion of Zarr into BQ by converting the chunk into pd.Dataframe #415

Closed
wants to merge 4 commits into from

Conversation

DarshanSP19
Collaborator

@DarshanSP19 DarshanSP19 commented Oct 31, 2023

While processing Zarr in weather-mv, we work with small chunks of the dataset. These changes convert each chunk into a pandas DataFrame and then extract rows directly from that DataFrame. Because the chunk size is under our control, we can control memory consumption during the pipeline.
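As a rough illustration of the idea (not the code in this PR), the sketch below flattens one small xarray chunk into row dicts by going through a DataFrame. The helper name `chunk_to_rows`, the variable `temperature`, and the in-memory dataset standing in for a Zarr store are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr


def chunk_to_rows(chunk: xr.Dataset) -> list:
    """Flatten one small chunk into a list of row dicts (hypothetical helper)."""
    # to_dataframe() produces one row per coordinate combination,
    # indexed by the dataset's dimensions; reset_index() turns those
    # dimensions into ordinary columns.
    df = chunk.to_dataframe().reset_index()
    return df.to_dict(orient="records")


# Tiny in-memory dataset standing in for a Zarr store (made-up values).
ds = xr.Dataset(
    {"temperature": (("time", "level", "latitude"), np.arange(8.0).reshape(2, 2, 2))},
    coords={
        "time": pd.date_range("2023-10-31", periods=2),
        "level": [500, 850],
        "latitude": [10.0, 20.0],
    },
)

# Take one chunk of size 1 along "time" and "level".
chunk = ds.isel(time=slice(0, 1), level=slice(0, 1))
rows = chunk_to_rows(chunk)
print(len(rows))
```

Because only one chunk is materialized as a DataFrame at a time, peak memory scales with the chunk size rather than with the whole dataset.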

Points to Consider

  • Works only with Zarr datasets for now.
  • Support for all other dataset types will be added in the future.
  • Users can pass a CLI argument to open datasets with a specified chunking scheme.
    Example:
--input_chunks '{ "time": 1, "level": 1 }'
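A minimal sketch of how a JSON-valued flag like this could be parsed, assuming plain `argparse`; the helper `parse_chunks` and its validation rules are hypothetical and not taken from weather-mv. Only the flag name and the example value come from the description above.

```python
import argparse
import json


def parse_chunks(value: str) -> dict:
    """Parse a JSON object mapping dimension names to chunk sizes (hypothetical helper)."""
    chunks = json.loads(value)
    if not isinstance(chunks, dict):
        raise argparse.ArgumentTypeError("expected a JSON object")
    for dim, size in chunks.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(size, bool) or not isinstance(size, int) or size < 1:
            raise argparse.ArgumentTypeError(
                f"chunk size for {dim!r} must be a positive integer"
            )
    return chunks


parser = argparse.ArgumentParser()
parser.add_argument("--input_chunks", type=parse_chunks, default=None)

# Simulate the command line from the example above.
args = parser.parse_args(["--input_chunks", '{ "time": 1, "level": 1 }'])
print(args.input_chunks)
```

The resulting dict has the shape that xarray's `open_zarr` accepts for its `chunks` argument, which is presumably how the scheme is applied when the dataset is opened.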

Partially solves: #414

@DarshanSP19 DarshanSP19 self-assigned this Oct 31, 2023
@DarshanSP19 DarshanSP19 force-pushed the ar-bq-df branch 4 times, most recently from 2a1b730 to 0e0a907 Compare November 2, 2023 11:35
@mahrsee1997 mahrsee1997 changed the title Faster ingestion into BQ by converting the chunk into pd.Dataframe Faster ingestion of Zarr into BQ by converting the chunk into pd.Dataframe Nov 3, 2023
@DarshanSP19
Collaborator Author

The optimization changes were made in #473.

@DarshanSP19 DarshanSP19 closed this Sep 5, 2024
@DarshanSP19 DarshanSP19 deleted the ar-bq-df branch September 5, 2024 07:39