bug: out of memory during OSV database load #4710
Comments
Disabling data sources OSV, GAD and RSD allowed the database to be created.
I think this is a duplicate of #4592, but I'm still not sure what the right fix is.
Actually, I'm going to re-open this because I think it's the most concisely described of the several issues related to this problem. Some stuff I know so far:
Some conjecture:
Next steps:
I'm open to more suggestions; most of that was from a quick brainstorming session this morning.
Update the database in stages for each data source instead of all at once.
@terriko I think the issue might be with the OSV database load, as removing this data source solved the problem. Committing every 1000 records or so, rather than one big commit at the end, may also be a useful improvement. Given the number of records and continued growth, I think we may be getting to the stage of rethinking the database architecture. That might be a bit too ambitious for GSoC 2025, but maybe some useful work could be done to move things along.
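The "commit every 1000 records" idea above could look roughly like the sketch below. This is an illustration only, assuming a plain sqlite3 connection; the table name, schema, and function name are hypothetical and do not reflect cve-bin-tool's actual code.

```python
import sqlite3


def insert_records_batched(db_path, records, batch_size=1000):
    """Insert (id, data) pairs, committing every batch_size rows.

    Periodic commits keep the pending transaction small instead of
    accumulating the entire data source in one uncommitted transaction.
    NOTE: table/column names are hypothetical, for illustration only.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS cve (id TEXT PRIMARY KEY, data TEXT)")
    for i, (cve_id, data) in enumerate(records, start=1):
        cur.execute("INSERT OR REPLACE INTO cve VALUES (?, ?)", (cve_id, data))
        if i % batch_size == 0:
            conn.commit()  # flush this batch rather than one big commit at the end
    conn.commit()  # commit the final partial batch
    conn.close()
```

Whether this actually reduces peak memory depends on where the memory is going; if the records are all parsed into a Python list before insertion, batched commits alone won't help, so this would pair with streaming the records in.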
I'm flagging this for the hackathon folk: the problem, as far as we know, is happening during the OSV database load. We need some way to reduce the memory usage there. Ideas are above, but knowing the talent coming in for the hackathon, I suspect some of you may know better than I do how to fix this. The OSV data source code can be found here: https://github.com/intel/cve-bin-tool/blob/main/cve_bin_tool/data_sources/osv_source.py Note that unlike many of our other data sources, OSV uses a Google-based backend, so we're using gsutil. This may be a factor in why it's worse than other data sources, or it may just be the sheer amount of data as more people move away from NVD. I'm still game to have a pre-parser that allows us to mirror the OSV data on cve-b.in, then use our own mirrors as we do with NVD, if that seems like the best solution. But I won't be shocked if our code just needs some tweaks to handle memory more appropriately.
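One common memory-reduction pattern for a load like this is to stream advisories one file at a time with a generator, instead of collecting everything into one big in-memory list before writing to the database. A minimal sketch, assuming (hypothetically) that the downloaded OSV data is a directory of one-advisory-per-file JSON documents; the function name and layout are illustrative, not cve-bin-tool's actual implementation:

```python
import json
from pathlib import Path


def iter_osv_records(osv_dir):
    """Yield OSV advisories one at a time from a directory of JSON files.

    Only a single advisory is held in memory at once, so the consumer
    (e.g. a batched database insert) bounds peak memory regardless of
    how large the full dataset grows.
    NOTE: the one-file-per-advisory layout is an assumption for this sketch.
    """
    for path in sorted(Path(osv_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            yield json.load(f)  # parse lazily, on demand
```

A consumer can then iterate and insert in batches without ever materializing the whole dataset, which is the usual fix when a loader is killed by the OOM killer partway through.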
A note about the hackathon label: I've flagged a bunch of issues for folk participating in the Open Source Ecosystems Hackathon, March 3-7. Please leave these issues to hackathon participants. If they're not claimed after, say, March 10th, they're fair game for other people (including GSoC participants).
It seems that nobody has claimed this issue yet. I would like to work on it. |
Description
Attempting to create an initial database results in the cve-bin-tool process being killed with an out-of-memory message.
To reproduce
`cve-bin-tool -u now -n json-mirror afile`
Expected behaviour:
Database is created
Actual behaviour:
Process is killed part way through the database load and the database file is not created
Version/platform info
Version of CVE-bin-tool (e.g. output of `cve-bin-tool --version`): 3.4
Installed from pypi or github? pypi
Operating system: WSL2 on Windows 11
Python version (e.g. `python3 --version`): 3.10.12
Running in any particular CI environment we should know about? (e.g. GitHub Actions): Running in WSL2 (10GB RAM)