
[Parquet] >2GiB Memory Leak on reading single parquet metadata file #44599

Open
jonded94 opened this issue Oct 31, 2024 · 3 comments
jonded94 commented Oct 31, 2024

I have a ~1.5TiB, ~1.7k files parquet dataset with an additional _metadata.parquet file containing metadata of all row groups. The _metadata file was written with the mechanism described in the documentation.

The _metadata file is ~390MiB, the 1.7k parquet files are around 900MiB each.

I have a script, read_metadata.py, that iterates through all files in the dataset, reads their metadata, and measures memory usage (RSS) along the way:

import gc
import time
from contextlib import contextmanager
from pathlib import Path

import psutil
import pyarrow
import pyarrow.parquet

process = psutil.Process()

@contextmanager
def profiling(name: str):
    start = time.monotonic()
    start_mem = process.memory_info().rss / 1024**2
    yield
    end = time.monotonic()
    end_mem = process.memory_info().rss / 1024**2
    duration = end - start
    if end_mem - start_mem == 0:
        return
    print(
        f"{name}\n"
        f" took {duration:.5f} s, "
        f"mem diff {end_mem - start_mem:.3f}MiB [start: {start_mem:.3f}MiB, end: {end_mem:.3f}MiB]"
    )


def read_metadata(path: Path) -> None:
    pyarrow.parquet.read_metadata(path)
    return


if __name__ == "__main__":
    import argparse
    import random

    parser = argparse.ArgumentParser()
    parser.add_argument("files", nargs="+")
    parser.add_argument("repeats", type=int)
    args = parser.parse_args()

    paths = args.files
    repeats = args.repeats

    paths = paths * repeats
    random.shuffle(paths)

    for path in paths:
        with profiling(f"load: {path}"):
            read_metadata(path)

        with profiling(f"gc:   {path}"):
            gc.collect()

Doing that gives these results (note that only steps where there is a change in memory load are printed):

$ python scripts/read_metadata.py repartition/* 3
load: repartition/part-52.parquet
 took 0.00132 s, mem diff 1.500MiB [start: 194.281MiB, end: 195.781MiB]
load: repartition/_metadata.parquet
 took 2.38347 s, mem diff 2082.062MiB [start: 195.781MiB, end: 2277.844MiB]
load: repartition/_metadata.parquet
 took 2.15518 s, mem diff 16.094MiB [start: 2277.844MiB, end: 2293.938MiB]
load: repartition/part-1587.parquet
 took 0.00099 s, mem diff 1.500MiB [start: 2293.938MiB, end: 2295.438MiB]
gc:   repartition/part-1299.parquet
 took 0.00217 s, mem diff -1.832MiB [start: 2295.438MiB, end: 2293.605MiB]
load: repartition/_metadata.parquet
 took 2.15186 s, mem diff 0.562MiB [start: 2293.605MiB, end: 2294.168MiB]
gc:   repartition/part-1112.parquet
 took 0.00220 s, mem diff -1.543MiB [start: 2294.168MiB, end: 2292.625MiB]
load: repartition/part-751.parquet
 took 0.00111 s, mem diff 1.500MiB [start: 2292.625MiB, end: 2294.125MiB]
load: repartition/part-321.parquet
 took 0.00113 s, mem diff 1.500MiB [start: 2294.125MiB, end: 2295.625MiB]

[figure: RSS over time, showing the ~2GiB jump when _metadata.parquet is read]

Memory usage stays mostly constant, but as soon as the _metadata.parquet file is read, over 2GiB of memory appears to leak. Reading that particular file multiple times does not leak additional memory each time.

There seems to be no way to bring memory usage back down to normal levels again; neither pool = pyarrow.default_memory_pool(); pool.release_unused() nor gc.collect() helps.

Component(s)

Parquet, Python

@pitrou pitrou changed the title >2GiB Memory Leak on reading single parquet metadata file [Parquet] >2GiB Memory Leak on reading single parquet metadata file Nov 12, 2024

pitrou commented Nov 12, 2024

Thanks for the report, @jonded94. Two questions:

  1. Which platform are you running this on?
  2. Did you try changing the default memory pool? See https://arrow.apache.org/docs/cpp/env_vars.html#envvar-ARROW_DEFAULT_MEMORY_POOL

@jonded94 (Author)

which platform are you running this on?

$ uname -a
 6.6.30-flatcar #1 SMP PREEMPT_DYNAMIC Sun May 19 16:12:26 -00 2024 x86_64 GNU/Linux
$ python -c "import pyarrow; print(pyarrow.__version__)"
17.0.0

did you try changing the default memory pool?

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-332.parquet
 took 0.00124 s, mem diff 1.500MiB [start: 194.336MiB, end: 195.836MiB]
load: repartition/_metadata
 took 2.41739 s, mem diff 2077.484MiB [start: 195.836MiB, end: 2273.320MiB]
load: repartition/_metadata
 took 2.25331 s, mem diff 400.500MiB [start: 2273.320MiB, end: 2673.820MiB]
load: repartition/part-759.parquet
 took 0.00094 s, mem diff 1.500MiB [start: 2673.820MiB, end: 2675.320MiB]

=> Also a ~2.3GiB leak at the first read, but a second read increases it further to ~2.7GiB.

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-58.parquet
 took 0.00122 s, mem diff 1.500MiB [start: 198.316MiB, end: 199.816MiB]
load: repartition/_metadata
 took 2.40700 s, mem diff 2081.992MiB [start: 199.816MiB, end: 2281.809MiB]
load: repartition/part-40.parquet
 took 0.00101 s, mem diff 1.500MiB [start: 2281.809MiB, end: 2283.309MiB]
load: repartition/_metadata
 took 2.19111 s, mem diff 17.043MiB [start: 2283.309MiB, end: 2300.352MiB]
gc:   repartition/part-912.parquet
 took 0.00223 s, mem diff -2.000MiB [start: 2300.352MiB, end: 2298.352MiB]
load: repartition/_metadata
 took 2.16370 s, mem diff 0.629MiB [start: 2298.352MiB, end: 2298.980MiB]

"mimalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-887.parquet
 took 0.00150 s, mem diff 5.285MiB [start: 189.285MiB, end: 194.570MiB]
load: repartition/_metadata
 took 2.40356 s, mem diff 2078.855MiB [start: 194.570MiB, end: 2273.426MiB]
load: repartition/_metadata
 took 2.25088 s, mem diff 15.887MiB [start: 2273.426MiB, end: 2289.312MiB]
load: repartition/_metadata
 took 2.23642 s, mem diff -0.391MiB [start: 2289.312MiB, end: 2288.922MiB]

Then I tweaked the statement where I previously just called gc.collect() and added an explicit memory-pool cleanup:

        with profiling(f"gc:   {path}"):
            pyarrow.default_memory_pool().release_unused()
            gc.collect()

The following data points resulted from this tweaked version of the script.

"system" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=system python scripts/read_metadata.py repartition/* 3
load: repartition/part-419.parquet
 took 0.00142 s, mem diff 3.000MiB [start: 192.777MiB, end: 195.777MiB]
gc:   repartition/part-419.parquet
 took 0.00343 s, mem diff -1.695MiB [start: 195.777MiB, end: 194.082MiB]
load: repartition/_metadata
 took 2.40818 s, mem diff 2080.812MiB [start: 194.082MiB, end: 2274.895MiB]
gc:   repartition/_metadata
 took 0.17570 s, mem diff -2079.277MiB [start: 2274.895MiB, end: 195.617MiB]
load: repartition/_metadata
 took 3.15845 s, mem diff 2476.500MiB [start: 195.617MiB, end: 2672.117MiB]
gc:   repartition/_metadata
 took 0.20289 s, mem diff -2481.230MiB [start: 2672.117MiB, end: 190.887MiB]
load: repartition/_metadata
 took 3.10904 s, mem diff 2479.500MiB [start: 190.887MiB, end: 2670.387MiB]
gc:   repartition/_metadata
 took 0.18726 s, mem diff -2479.953MiB [start: 2670.387MiB, end: 190.434MiB]
load: repartition/part-1549.parquet
 took 0.00179 s, mem diff 1.500MiB [start: 190.434MiB, end: 191.934MiB]

=> Here, the memory actually seems to be released to the system again!

"jemalloc" memory pool:

$ ARROW_DEFAULT_MEMORY_POOL=jemalloc python scripts/read_metadata.py repartition/* 3
load: repartition/part-426.parquet
 took 0.00138 s, mem diff 3.000MiB [start: 193.805MiB, end: 196.805MiB]
gc:   repartition/part-426.parquet
 took 0.00635 s, mem diff 26.812MiB [start: 196.805MiB, end: 223.617MiB]
load: repartition/_metadata
 took 2.48826 s, mem diff 2082.836MiB [start: 223.617MiB, end: 2306.453MiB]
load: repartition/_metadata
 took 2.23224 s, mem diff 16.172MiB [start: 2306.453MiB, end: 2322.625MiB]
gc:   repartition/part-267.parquet
 took 0.00250 s, mem diff -1.762MiB [start: 2322.625MiB, end: 2320.863MiB]
gc:   repartition/part-839.parquet
 took 0.00231 s, mem diff -1.988MiB [start: 2320.863MiB, end: 2318.875MiB]
load: repartition/_metadata
 took 2.18774 s, mem diff 0.621MiB [start: 2318.875MiB, end: 2319.496MiB]

=> The jemalloc memory pool apparently ignores the explicit cleanup call.

"mimalloc" memory pool (this was increasing & releasing memory on every iteration and spamming stdout, I had to limit output to cases where memory changed by more than 5MiB):

$ ARROW_DEFAULT_MEMORY_POOL=mimalloc python scripts/read_metadata.py repartition/* 3
load: repartition/_metadata
 took 2.37767 s, mem diff 2079.824MiB [start: 193.172MiB, end: 2272.996MiB]
load: repartition/_metadata
 took 2.14551 s, mem diff 17.660MiB [start: 2271.785MiB, end: 2289.445MiB]

=> The memory leak still appears here as well.

pitrou commented Nov 14, 2024

Ok, so what this tells us is that there is no actual memory leak in Arrow (at least in this use case :-)), but the memory allocator may decide to hold on to some memory to make further allocations faster.

Note that this memory is not necessarily unavailable to other applications. For example, the memory allocator might have marked these memory areas MADV_FREE, which lets the kernel reclaim the pages when it needs to, but not necessarily immediately.

This all shows how complicated it is to get accurate memory footprint measurements on modern OSes...
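To make the MADV_FREE point above concrete, here is a hedged, Linux-only sketch (the helper name is made up): pages an allocator has marked MADV_FREE still count toward RSS but show up under "LazyFree" in /proc/self/smaps_rollup, and the kernel may reclaim them under memory pressure without the process noticing.

```python
from pathlib import Path


def lazyfree_kib() -> int:
    """Return the process's lazy-free (MADV_FREE) memory in kB, 0 if unknown."""
    p = Path("/proc/self/smaps_rollup")  # available on Linux >= 4.14
    if not p.exists():
        return 0
    for line in p.read_text().splitlines():
        if line.startswith("LazyFree:"):
            return int(line.split()[1])  # field is reported in kB
    return 0


print(f"LazyFree: {lazyfree_kib()} kB")
```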
