Create new aboutcode.federated library #747 #2006
This is an extensive rework of the utilities to compute federated paths using PURLs.
The design (copied from the script) comes out this way:
Federated data utilities to handle content-defined and hash-addressable Package
data keyed by PURL and stored in many Git repositories. This approach to
federating decentralized data is called FederatedCode.
Overview
The main design elements are:
Data Federation: A Data Federation represents a consistent, non-overlapping set
of data kind clusters (like scans, vulnerabilities or SBOMs) across many
package ecosystems, aka PURL types.
A Federation is similar to a traditional database.
Data Cluster: A Data Federation contains Data Clusters, where a Data Cluster's
purpose is to store the data of a single kind (like scans) across multiple PURL
types. The cluster name is the data kind name and is used as the prefix for
repository names. A Data Cluster is akin to a table in a traditional database.
Data Repository: A DataCluster contains one or more Git Data Repositories,
each storing datafiles of the cluster's data kind for a single PURL type,
spreading the datafiles across multiple Data Directories. The repo name is
data-kind+PURL-type+hashid. A Repository is similar to a shard or tablespace in
a traditional database.
Data Directory: In a Repository, a Data Directory contains the datafiles for
PURLs. The directory name is PURL-type+hashid.
Data File: A Data File of the DataCluster's Data Kind, stored in
subdirectories structured after the PURL components:
namespace/name/version/qualifiers/subpath.
A Data File can be, for instance, a JSON scan results file or a list of PURLs
in YAML.
For example, a list of PURLs as a Data Kind would be stored at the name
subdirectory level::
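    aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml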
Or a ScanCode scan as a Data Kind at the version subdirectory level (the
cluster, version dir and file names below are hypothetical illustrations)::
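    aboutcode-data/scans-gem-0000/gem-0107/rails/7.1.0/scan.json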
Design
The core approach is to distribute the many datafiles for a package in multiple
directories stored in multiple Git repositories, so that no directory or repo
gets too big or holds too many files, and files are spread roughly evenly
across all the directories and repositories.
At the same time, the design makes it possible to directly access a single
datafile across all these directories and Git repositories knowing only its
package PURL, resolving it to a URL to fetch the datafile directly through the
Git web interface (like on GitHub, GitLab or gitweb).
Why not use a single Git repo?
We need multiple Git repositories to avoid very big repositories that are
impractical to use. We want each repo to stay under the common limits of public
repository hosting services, like GitHub and its 5GB limit. Typically, a
maximum size of 5GB and a target size of about 1GB of compressed content makes
the most sense. We store text, and Git's combination of xdiff/xdelta and zlib
compression typically reduces the stored size by a factor of about 5, meaning
that a 1GB repo may contain about 5GB of actual uncompressed text.
Why not use a single dir in a repo?
Multiple directories are needed to store many package datafiles: putting too
many files in a single directory makes every filesystem's performance suffer.
Typically, a maximum of about 10,000 files per directory is a decent target.
Hash-based content distribution
To distribute files roughly evenly across repositories and directories while
still using the PURL as a key, we use a hashid derived from a hash computed on
the PURL string, and use that to generate repository and directory names.
It then becomes possible to distribute the data across many Git repositories and
directories evenly and compute a URL and path to access a datafile directly
from a PURL.
Object hierarchy
federation: defined by its name and a Git repo holding a config file with the
clusters configuration (data kind and PURL type parameters), pointing to
multiple repositories.
cluster: identified by the data kind name, which prefixes its data repo names
repo: data repo (Git) identified by data-kind+PURL-type+hashid
directory: dir in a repo, identified by PURL-type+PURL-hashid
PURL path: ns/name/version/extra_path derived from the PURL
datafile: file storing the data as text (JSON, YAML or XML)
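A rough sketch of this hierarchy as Python dataclasses (the class and field
names here are illustrative assumptions, not the actual library API)::

    from dataclasses import dataclass, field

    @dataclass
    class DataRepository:
        name: str       # data-kind + PURL-type + hashid, like "purls-gem-0000"
        purl_type: str  # like "gem"
        hashid: str     # first hashid of the range stored here, like "0000"

    @dataclass
    class DataCluster:
        data_kind: str      # like "purls"; prefixes all repo names
        dirs_per_repo: int  # cluster-defined number of dirs per Git repo
        repos: list[DataRepository] = field(default_factory=list)

    @dataclass
    class DataFederation:
        name: str  # like "aboutcode-data"; also the name of the config repo
        clusters: list[DataCluster] = field(default_factory=list)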
Example
For instance, in the aboutcode data federation, for a cluster about purl
versions, we would have:
data federation definition git repo, with its config file.
aboutcode-data/aboutcode-data
aboutcode-federation-config.yml
data cluster repos, whose name prefix is the data kind
aboutcode-data/purls
data repository git repo, with a purl sub dir tree and datafile.
The first repo name has a hashid of 0000, which is the first PURL hashid of the
range of PURL hashids stored in this repo's dirs.
aboutcode-data/purls-gem-0000/
data directory, with a purl sub dir tree and datafile. The dir name is
composed of type+hashid.
aboutcode-data/purls-gem-0000/gem-0107/
PURL subdirectory, and datafile, here a list of PURLs for the gem named rails:
aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml
In this example, the base URL for this cluster is the aboutcode-data GitHub
organization, so the URL to the purls.yml datafile is inferred this way, based
on the cluster config:
https://github.com/
aboutcode-data/purls-gem-0000/
raw/refs/heads/main/
gem-0107/rails/purls.yml
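As an illustration, such a URL could be assembled with a small helper (a
sketch assuming a GitHub-style raw URL layout and a main branch; the function
name and signature are hypothetical, not the library API)::

    def datafile_url(base_url, repo_name, dir_name, purl_path, filename,
                     branch="main"):
        # e.g., base_url="https://github.com/aboutcode-data",
        # repo_name="purls-gem-0000", dir_name="gem-0107",
        # purl_path="rails", filename="purls.yml"
        return (
            f"{base_url}/{repo_name}/raw/refs/heads/{branch}"
            f"/{dir_name}/{purl_path}/{filename}"
        )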
More Design details
The DataCluster and Data Kind design aligns with the needs of users: for
example, a user consuming only vulnerability data for Java and JavaScript may
not care about Haskell metadata, or may care only about another kind of data
like fingerprints.
DataCluster: A set of repos for only one data kind for many package types.
Data Kind: Identifier for the kind of data stored in the datafiles of a
DataCluster, like PURL versions, the original API metadata files, high-level
scans, scans with file details, reachability slices, fingerprints, or
vulnerability advisories and so on.
Repository: A repo is a Git repo that stores a group of Directories of a
DataCluster/data kind, like for all the npms with a PURL hash of 0000 to 1023,
where we store npm metadata files for each PURL. All repo names in a cluster
share the same data-kind prefix.
Directory: Named after a PURL type and PURL hashid, it stores the datafiles
for the PURLs that hash to that hashid.
Naming conventions
Federation: like aboutcode-data. Also the name of the config repo.
DataCluster name prefix: data kind stored in that cluster, like "purls" or "scancode"
For data repos: data kind + PURL type + PURL hashid like
purls-npm-0512 or purls-scancode-scans-0000
The PURL hashid in the repo name is the first hashid of the range of hashids
stored in that repo.
For data dirs in a repo: PURL type + PURL hashid like npm-0513 or pypi-0000.
The hashid is that of the PURLs whose datafiles are stored in that directory.
PURL Hashid
The PURL hashid is central to the design and is simply a number between 0 and
1023 (i.e., 1024 values, a power of two).
It could be increased up to 8192 in the future, but 1024 is good enough to
spread files across multiple dirs.
The Core PURL is a PURL without version, subpath and qualifiers. We hash this
Core PURL as UTF-8-encoded bytes using SHA256.
The first few bytes of the SHA256 binary digest are converted to an integer
using little-endian encoding, then reduced modulo 1024 to yield an integer that
is formatted as a 4-char, zero-padded string between 0000 and 1023.
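A minimal Python sketch of this computation, assuming the first 8 bytes of the
digest are used (the exact "first few bytes" count is an assumption here)::

    import hashlib

    def purl_hashid(core_purl, modulo=1024):
        # core_purl is a PURL string without version, qualifiers and
        # subpath, like "pkg:gem/rails"
        digest = hashlib.sha256(core_purl.encode("utf-8")).digest()
        # first few digest bytes -> little-endian int -> modulo 1024
        value = int.from_bytes(digest[:8], byteorder="little") % modulo
        # 4-char, zero-padded string between "0000" and "1023"
        return f"{value:04d}"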
Based on this hashid, the data kind and the PURL type, directories are grouped
in one or more Git repositories of a cluster, based on a cluster-defined number
of directories of a type per Git repo.
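For instance, the repo and directory names could be derived this way, reusing
the purl_hashid() sketch above (again a sketch; the function name and signature
are illustrative, not the library API)::

    def repo_and_dir_names(core_purl, data_kind, purl_type, dirs_per_repo):
        hashid = int(purl_hashid(core_purl))
        # the dir name depends only on the PURL type and the hashid
        dir_name = f"{purl_type}-{hashid:04d}"
        # the repo is named after the first hashid of the range it stores
        start = (hashid // dirs_per_repo) * dirs_per_repo
        repo_name = f"{data_kind}-{purl_type}-{start:04d}"
        return repo_name, dir_name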
Example of repo and dir names
With 4 dirs per repo, we get 256 repos, like these:
purls-npm-0000
    npm-0000
    npm-0001
    npm-0002
    npm-0003
purls-npm-0004
    npm-0004
    npm-0005
    npm-0006
    npm-0007
purls-npm-0008
    npm-0008
    ... and so on
And with 512 dirs per repo, we get 2 repos:
purls-npm-0000
    npm-0000
    npm-0001
    npm-0002
    ...
    npm-0511
purls-npm-0512
    npm-0512
    npm-0513
    ...
    npm-1023
Git repo sizing assumptions for each ecosystem
For small ecosystems with few packages, like luarocks or swift, a single Git
repo or a few repos may be enough to store all the data of a kind. There, a
luarocks cluster of repos will have a single Git repo, with 1024 root
directories.
At the other end of the spectrum, a package type with many packages like npm
may need 1024 Git repositories to store all the metadata. In this case an npm
cluster of repos will have 1024 Git repos, each with a single root directory.
We can start with reasonable assumptions wrt. the size of each cluster, as a
number of directories per Git repo and the volume of data we would store in
each, using starting values for each ecosystem (such as mlflow, pub, rpm or
cargo).
For instance, say we want a cluster to store all the npm PURLs. As of 2025-10,
npm hosts about 4M unique package names (and roughly 20 versions per name on
average, with ~80M updates in total per https://replicate.npmjs.com/). Storing
4M names takes about 100MB uncompressed. Adding versions would take about 2GB
uncompressed. This means that we can comfortably store all npm PURLs in a
single repository size-wise, but we may want to use more repositories anyway,
as storing 4M directories and purls.yml files in a single repo will not be a
happy event; using 32 repos with 32 dirs or 64 repos with 16 dirs each may be a
better approach.
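A back-of-the-envelope check of these estimates, assuming roughly 25 bytes per
PURL line (an assumption, not a measured value)::

    names = 4_000_000        # unique npm package names
    versions_per_name = 20   # rough average
    bytes_per_purl = 25      # assumed average PURL line length
    print(names * bytes_per_purl / 1e6, "MB")  # ~100 MB for names only
    print(names * versions_per_name * bytes_per_purl / 1e9, "GB")  # ~2 GB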
See also original post on the approach:
Rebalancing and splitting a DataCluster's repos
We can rebalance a cluster: for example, when we first store the data in a
cluster with a single Git repository for a given PURL type and later split this
repo into more repos, we can do so without losing the ability to address
datafiles directly knowing only a PURL, and without having to rename all the
files and directories.
In this design, the directory names are stable and do not change as long as we
keep the default 1024 hash values for the PURL hashid. The only things that
change are the repo names, when more repos are created from a split because the
size of a Git repo has grown too large.
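A minimal sketch of this invariant: for a fixed PURL hashid, only the repo-name
grouping changes with the number of directories per repo::

    hashid = 600  # a hypothetical PURL hashid
    for dirs_per_repo in (1024, 512, 256):
        start = (hashid // dirs_per_repo) * dirs_per_repo
        print(f"purls-npm-{start:04d} / npm-{hashid:04d}")
    # prints:
    #   purls-npm-0000 / npm-0600
    #   purls-npm-0512 / npm-0600
    #   purls-npm-0512 / npm-0600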
When a split occurs, we should perform these operations:
lock the cluster as "read-only" for the duration of the split operation. This
signals to processes and tools that are updating the cluster that they cannot
push new data there yet. This could be done by updating the cluster config or
the federation config.
copy existing Git repos to be split to new repos based on the new number of
directories per repo.
filter Git history in existing and new repos to keep only the history related
to the directories stored in a given repo.
update the cluster config file in the cluster Git repo with the new number of
directories
push the new and existing Git repos
unlock the cluster.
We may need to keep the old and new Clusters around too, and may need to add a
simple DataCluster version suffix in Cluster names, and a way to redirect from
an old frozen, inactive DataCluster to a new rebalanced one.
It may even be possible to continue writing to a cluster as long as writing is
done in two places until the split is completed. In practice, splits should be
reasonably rare and reasonably fast, making this a lesser issue.
It is also possible to change the PURL hashid range for a DataCluster, say
going from 1024 to 2048, 4096 or 8192. This would imply moving all the files
around, as the directory structure would change with the new hashids. This is
likely to be an exceptional operation.