Unoptimized observations #7

Open
amotl opened this issue Mar 12, 2025 · 1 comment

@amotl

amotl commented Mar 12, 2025

About

The program is not optimized in any way yet, other than using the orjson and orjsonl packages for high-performance JSON marshalling.

Problem

When processing an 11 GB JSONL/NDJSON file with Python 3.11, CPU saturation looks reasonable, but memory usage peaks at 75 GB.

[Image]

After 30 minutes of processing time, the program aborts with an assertion failure:

Assertion failed: (p->curr_buf_pos == p->curr_buf_length), function jv_parser_next, file jv_parse.c, line 823.
Abort trap: 6

real	34m27.128s
user	8m5.955s
sys	18m25.432s
@amotl

amotl commented Mar 14, 2025

This patch adds streaming support for JSONL/NDJSON files, so the program no longer fails on large data inputs.
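For illustration, here is a minimal sketch of what line-by-line streaming of a JSONL/NDJSON file can look like. It is not the actual patch. The names `iter_jsonl` and `count_records` are hypothetical, and the fallback to the standard library `json` module is an assumption made only so the sketch runs without `orjson` installed.

```python
import json
from typing import Any, Callable, Iterator

# Prefer orjson when it is installed; fall back to the standard library.
# Both loaders accept a bytes object holding one JSON document.
try:
    import orjson

    loads: Callable[[bytes], Any] = orjson.loads
except ImportError:
    loads = json.loads


def iter_jsonl(path: str) -> Iterator[Any]:
    """Yield one decoded record per non-empty line (hypothetical helper).

    The file is read lazily, so only the current line is decoded at a time
    instead of the whole input being loaded into memory.
    """
    with open(path, "rb") as stream:
        for line in stream:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline at EOF
                yield loads(line)


def count_records(path: str) -> int:
    """Consume the stream without keeping any record around."""
    return sum(1 for _ in iter_jsonl(path))
```

Because `iter_jsonl` is a generator, a consumer can process each record and discard it, so memory usage should depend on the size of the largest line rather than on the size of the file.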
