
[Parquet] >2GiB Memory Leak on reading single parquet metadata file #44599

Open
jonded94 opened this issue Oct 31, 2024 · 3 comments
jonded94 commented Oct 31, 2024

I have a ~1.5TiB, ~1.7k files parquet dataset with an additional _metadata.parquet file containing metadata of all row groups. The _metadata file was written with the mechanism described in the documentation.

The _metadata file is ~390MiB, the 1.7k parquet files are around 900MiB each.

I have a script, read_metadata.py, that iterates through all files in the dataset, reads their metadata, and measures memory usage (RSS) along the way:

import gc
import time
from contextlib import contextmanager
from pathlib import Path

import psutil
import pyarrow
import pyarrow.parquet

process = psutil.Process()

@contextmanager
def profiling(name: str):
    start = time.monotonic()
    start_mem = process.memory_info().rss / 1024**2
    yield
    end = time.monotonic()
    end_mem = process.memory_info().rss / 1024**2
    duration = end - start
    if end_mem - start_mem == 0:
        return
    print(
        f"{name}\n"
        f" took {duration:.5f} s, "
        f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]"
    )


def read_metadata(path: Path) -> None:
    pyarrow.parquet.read_metadata(path)
    return


if __name__ == "__main__":
    import argparse
    import random

    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("repeats", type=int)
    args = parser.parse_args()

    paths = args.files
    repeats = args.repeats

    paths = paths * repeats
    random.shuffle(paths)

    for path in paths:
        with profiling(f"load: {path}"):
            read_metadata(path)

        with profiling(f"gc:   {path}"):
            gc.collect()

Doing that gives these results (note that only steps where there is a change in memory load are printed):

$ python scripts/read_metadata.py repartition/* 3
load: repartition/part-52.parquet
 took 0.00132 s, mem diff 1.500MiB [start: 194.281MiB, end: 195.781MiB]
load: repartition/_metadata.parquet
 took 2.38347 s, mem diff 2082.062MiB [start: 195.781MiB, end: 2277.844MiB]
load: repartition/_metadata.parquet
 took 2.15518 s, mem diff 16.094MiB [start: 2277.844MiB, end: 2293.938MiB]
load: repartition/part-1587.parquet
 took 0.00099 s, mem diff 1.500MiB [start: 2293.938MiB, end: 2295.438MiB]
gc:   repartition/part-1299.parquet
 took 0.00217 s, mem diff -1.832MiB [start: 2295.438MiB, end: 2293.605MiB]
load: repartition/_metadata.parquet
 took 2.15186 s, mem diff 0.562MiB [start: 2293.605MiB, end: 2294.168MiB]
gc:   repartition/part-1112.parquet
 took 0.00220 s, mem diff -1.543MiB [start: 2294.168MiB, end: 2292.625MiB]
load: repartition/part-751.parquet
 took 0.00111 s, mem diff 1.500MiB [start: 2292.625MiB, end: 2294.125MiB]
load: repartition/part-321.parquet
 took 0.00113 s, mem diff 1.500MiB [start: 2294.125MiB, end: 2295.625MiB]

[figure: RSS over time, showing the ~2GiB jump when _metadata.parquet is read]

Memory usage stays mostly constant, but as soon as the _metadata.parquet file is read, over 2GiB of memory appears to leak. Reading that particular file multiple times does not leak additional memory each time.

There seems to be no way to bring memory usage back down to normal levels again; neither pool = pyarrow.default_memory_pool(); pool.release_unused() nor gc.collect() helps.

Component(s)

Parquet, Python

@pitrou pitrou changed the title >2GiB Memory Leak on reading single parquet metadata file [Parquet] >2GiB Memory Leak on reading single parquet metadata file Nov 12, 2024

pitrou commented Nov 12, 2024

Thanks for the report, @jonded94. Two questions:

  1. Which platform are you running this on?
  2. Did you try changing the default memory pool? See https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL

@jonded94 (Author)

which platform are you running this on?

$ uname -a
 6.6.30-flatcar #1 SMP PREEMPT_DYNAMIC Sun May 19 16:12:26 -00 2024 x86_64 GNU/Linux
$ python -c "import pyarrow; print(pyarrow.__version__)"
17.0.0

did you try changing the default memory pool?

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-332.parquet
 took 0.00124 s, mem diff 1.500MiB [start: 194.336MiB, end: 195.836MiB]
load: repartition/_metadata
 took 2.41739 s, mem diff 2077.484MiB [start: 195.836MiB, end: 2273.320MiB]
load: repartition/_metadata
 took 2.25331 s, mem diff 400.500MiB [start: 2273.320MiB, end: 2673.820MiB]
load: repartition/part-759.parquet
 took 0.00094 s, mem diff 1.500MiB [start: 2673.820MiB, end: 2675.320MiB]

=> Also a ~2.3GiB leak at the first read, but a second read increases it further to ~2.7GiB.

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-58.parquet
 took 0.00122 s, mem diff 1.500MiB [start: 198.316MiB, end: 199.816MiB]
load: repartition/_metadata
 took 2.40700 s, mem diff 2081.992MiB [start: 199.816MiB, end: 2281.809MiB]
load: repartition/part-40.parquet
 took 0.00101 s, mem diff 1.500MiB [start: 2281.809MiB, end: 2283.309MiB]
load: repartition/_metadata
 took 2.19111 s, mem diff 17.043MiB [start: 2283.309MiB, end: 2300.352MiB]
gc:   repartition/part-912.parquet
 took 0.00223 s, mem diff -2.000MiB [start: 2300.352MiB, end: 2298.352MiB]
load: repartition/_metadata
 took 2.16370 s, mem diff 0.629MiB [start: 2298.352MiB, end: 2298.980MiB]

"mimalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-887.parquet
 took 0.00150 s, mem diff 5.285MiB [start: 189.285MiB, end: 194.570MiB]
load: repartition/_metadata
 took 2.40356 s, mem diff 2078.855MiB [start: 194.570MiB, end: 2273.426MiB]
load: repartition/_metadata
 took 2.25088 s, mem diff 15.887MiB [start: 2273.426MiB, end: 2289.312MiB]
load: repartition/_metadata
 took 2.23642 s, mem diff -0.391MiB [start: 2289.312MiB, end: 2288.922MiB]

Then I tweaked the statement where I previously just called gc.collect() and added an explicit memory-pool cleanup:

        with profiling(f"gc:   {path}"):
            pyarrow.default_memory_pool().release_unused()
            gc.collect()

The following data points resulted from this tweaked version of the script.

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-419.parquet
 took 0.00142 s, mem diff 3.000MiB [start: 192.777MiB, end: 195.777MiB]
gc:   repartition/part-419.parquet
 took 0.00343 s, mem diff -1.695MiB [start: 195.777MiB, end: 194.082MiB]
load: repartition/_metadata
 took 2.40818 s, mem diff 2080.812MiB [start: 194.082MiB, end: 2274.895MiB]
gc:   repartition/_metadata
 took 0.17570 s, mem diff -2079.277MiB [start: 2274.895MiB, end: 195.617MiB]
load: repartition/_metadata
 took 3.15845 s, mem diff 2476.500MiB [start: 195.617MiB, end: 2672.117MiB]
gc:   repartition/_metadata
 took 0.20289 s, mem diff -2481.230MiB [start: 2672.117MiB, end: 190.887MiB]
load: repartition/_metadata
 took 3.10904 s, mem diff 2479.500MiB [start: 190.887MiB, end: 2670.387MiB]
gc:   repartition/_metadata
 took 0.18726 s, mem diff -2479.953MiB [start: 2670.387MiB, end: 190.434MiB]
load: repartition/part-1549.parquet
 took 0.00179 s, mem diff 1.500MiB [start: 190.434MiB, end: 191.934MiB]

=> Here, the memory actually seems to be released to the system again!

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-426.parquet
 took 0.00138 s, mem diff 3.000MiB [start: 193.805MiB, end: 196.805MiB]
gc:   repartition/part-426.parquet
 took 0.00635 s, mem diff 26.812MiB [start: 196.805MiB, end: 223.617MiB]
load: repartition/_metadata
 took 2.48826 s, mem diff 2082.836MiB [start: 223.617MiB, end: 2306.453MiB]
load: repartition/_metadata
 took 2.23224 s, mem diff 16.172MiB [start: 2306.453MiB, end: 2322.625MiB]
gc:   repartition/part-267.parquet
 took 0.00250 s, mem diff -1.762MiB [start: 2322.625MiB, end: 2320.863MiB]
gc:   repartition/part-839.parquet
 took 0.00231 s, mem diff -1.988MiB [start: 2320.863MiB, end: 2318.875MiB]
load: repartition/_metadata
 took 2.18774 s, mem diff 0.621MiB [start: 2318.875MiB, end: 2319.496MiB]

=> The jemalloc memory pool apparently ignores the explicit cleanup call.

"mimalloc" memory pool (this was increasing & releasing memory on every iteration and spamming stdout, I had to limit output to cases where memory changed by more than 5MiB):

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/_metadata
 took 2.37767 s, mem diff 2079.824MiB [start: 193.172MiB, end: 2272.996MiB]
load: repartition/_metadata
 took 2.14551 s, mem diff 17.660MiB [start: 2271.785MiB, end: 2289.445MiB]

=> The memory leak still appears here as well.

pitrou commented Nov 14, 2024

Ok, so what this tells us is that there is no actual memory leak in Arrow (at least in this use case :-)), but the memory allocator may decide to hold on to some memory to make further allocations faster.

Note that this memory is not necessarily unavailable to other applications. For example, the memory allocator might have marked these memory areas MADV_FREE, which lets the kernel reclaim the pages when it needs to, but not necessarily immediately.

This all shows how complicated it is to get accurate memory footprint measurements on modern OSes...
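To make the MADV_FREE point above concrete, here is a hedged, Linux-only sketch (the helper name is made up): pages an allocator has marked MADV_FREE still count toward RSS but show up under "LazyFree" in /proc/self/smaps_rollup, and the kernel may reclaim them under memory pressure without the process noticing.

```python
from pathlib import Path


def lazyfree_kib() -> int:
    """Return the process's lazy-free (MADV_FREE) memory in kB, 0 if unknown."""
    p = Path("/proc/self/smaps_rollup")  # available on Linux >= 4.14
    if not p.exists():
        return 0
    for line in p.read_text().splitlines():
        if line.startswith("LazyFree:"):
            return int(line.split()[1])  # field is reported in kB
    return 0


print(f"LazyFree: {lazyfree_kib()} kB")
```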
