[Parquet] >2GiB Memory Leak on reading single parquet metadata file #44599
Thanks for the report @jonded94. Two questions:
"system" memory pool:
=> also ~2.3GiB leak at first read, but a second read even increases it to ~2.7GiB. "jemalloc" memory pool:
"mimalloc" memory pool:
Then I tuned a bit the statment where I previously just called
Following datapoints resulted with this tuned version of the script. "system" memory pool:
=> Here, the memory actually seems to be released to the system again! "jemalloc" memory pool:
=> jemalloc memory pool doesn't care for the explicit cleanup call apparently. "mimalloc" memory pool (this was increasing & releasing memory on every iteration and spamming stdout, I had to limit output to cases where memory changed by more than 5MiB):
=> Memory leak also still seems to appear. |
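A minimal sketch of what the explicit per-read cleanup described above could look like, assuming plain pyarrow; the function name and the use of the `ARROW_DEFAULT_MEMORY_POOL` environment variable to switch allocators are assumptions, not the reporter's actual code:

```python
# Hypothetical "tuned" per-file read: drop the metadata object, force a GC pass,
# and explicitly ask the Arrow memory pool to hand unused memory back to the OS.
# The allocator under test can be selected via ARROW_DEFAULT_MEMORY_POOL
# (system / jemalloc / mimalloc) before the interpreter starts.
import gc

import pyarrow as pa
import pyarrow.parquet as pq

pool = pa.default_memory_pool()

def read_metadata_with_cleanup(path):
    md = pq.read_metadata(path)
    num_row_groups = md.num_row_groups   # use the metadata, then let it go
    del md                               # drop the Python reference to the FileMetaData
    gc.collect()                         # collect any reference cycles still holding buffers
    pool.release_unused()                # ask the allocator to return unused memory to the OS
    return num_row_groups
```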
Ok, so what this tells us is that there is no actual memory leak in Arrow (at least in this use case :-)), but the memory allocator may decide to hold onto some memory to make further allocations faster. Note that the memory is not necessarily unavailable to other applications. For example, the memory allocator might have marked these memory areas as reclaimable by the operating system.

This all shows how complicated it is to get accurate memory footprint measurements on modern OSes...
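A hedged sketch of how one might separate "Arrow still holds the memory" from "the allocator is merely keeping pages around", by comparing the pool's own accounting with the RSS reported by the OS (the use of psutil here is an assumption, not part of the discussion above):

```python
# Compare Arrow's own allocation accounting with the RSS reported by the OS.
# If bytes_allocated() drops back toward zero while RSS stays high, the remaining
# memory is held by the allocator / unreclaimed pages, not leaked by Arrow.
import psutil
import pyarrow as pa

pool = pa.default_memory_pool()
rss_mib = psutil.Process().memory_info().rss / 2**20

print(f"pool backend:         {pool.backend_name}")
print(f"pool.bytes_allocated: {pool.bytes_allocated() / 2**20:.1f} MiB")  # live Arrow allocations
print(f"pool.max_memory:      {pool.max_memory() / 2**20:.1f} MiB")       # high-water mark
print(f"process RSS:          {rss_mib:.1f} MiB")                          # what top/ps report
```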
I have a ~1.5 TiB, ~1.7k-file parquet dataset with an additional `_metadata.parquet` file containing the metadata of all row groups. The `_metadata` file was written with the mechanism described in the documentation. The `_metadata` file is ~390 MiB; the 1.7k parquet files are around 900 MiB each.

I have a script called `read_metadata.py` which can be used to iterate through all files in the dataset, get their metadata and simultaneously measure memory load (RSS).
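A minimal sketch of what such a script might look like, assuming a flat directory of parquet files; `DATASET_DIR`, the 1 MiB reporting threshold and the exact output format are assumptions, not taken from the original script:

```python
# read_metadata.py (sketch): read the Parquet metadata of every file in the
# dataset and report the process RSS whenever it changes.
import pathlib

import psutil
import pyarrow.parquet as pq

DATASET_DIR = pathlib.Path("/path/to/dataset")  # hypothetical location

def rss_mib() -> float:
    """Resident set size of the current process in MiB."""
    return psutil.Process().memory_info().rss / 2**20

last_rss = rss_mib()
for path in sorted(DATASET_DIR.glob("*.parquet")):
    md = pq.read_metadata(path)          # parses only the file's metadata footer
    current = rss_mib()
    if abs(current - last_rss) > 1:      # only print steps where memory load changed
        print(f"{path.name}: {md.num_row_groups} row groups, RSS {current:.0f} MiB")
        last_rss = current
```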
Doing that gives these results (note that only steps where there is a change in memory load are printed):

Memory load stays mostly constant, but as soon as the `_metadata.parquet` file is read, a huge memory leak of over 2 GiB appears. Reading that particular file multiple times does not lead to multiple memory leaks. There seems to be no way to reduce memory load back to normal levels again; not even `pool = pyarrow.default_memory_pool(); pool.release_unused()` or `gc.collect()` helps.

Component(s)
Parquet, Python