Create new aboutcode.federated library #747 #2006
This is an extensive rework of the utilities to compute federated paths using PURLs.
The design (copied from the script) comes out this way:
Federated data utilities to handle content-defined and hash-addressable Package
data keyed by PURL and stored in many Git repositories. This approach to
federating decentralized data is called FederatedCode.
Overview
The main design elements are:
Data Federation: A Data Federation represents a consistent, non-overlapping set
of data kind clusters (like scans, vulnerabilities or SBOMs) across many
package ecosystems, aka PURL types.
A Federation is similar to a traditional database.
Data Cluster: A Data Federation contains Data Clusters, where a Data Cluster's
purpose is to store the data of a single kind (like scans) across multiple PURL
types. The cluster name is the data kind name and is used as the prefix for
repository names. A Data Cluster is akin to a table in a traditional database.
Data Repository: A DataCluster contains one or more Git Data Repositories,
each storing datafiles of the cluster's data kind for a single PURL type,
spreading the datafiles across multiple Data Directories. The repo name is
data-kind+PURL-type+hashid. A Repository is similar to a shard or tablespace in
a traditional database.
Data Directory: In a Repository, a Data Directory contains the datafiles for
PURLs. The directory name is PURL-type+hashid.
Data File: A Data File of the DataCluster's Data Kind, stored in
subdirectories structured after the PURL components:
namespace/name/version/qualifiers/subpath.
A Data File can be, for instance, a JSON scan results file or a list of PURLs
in YAML.
For example, a list of PURLs as a Data Kind would be stored at the name
subdirectory level::
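    aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml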
Or a ScanCode scan as a Data Kind at the version subdirectory level (the
cluster, version dir and file names below are hypothetical illustrations)::
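    aboutcode-data/scans-gem-0000/gem-0107/rails/7.1.0/scan.json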
Design
The core approach is to distribute the many datafiles for a package in multiple
directories stored in multiple Git repositories, so that no directory or repo
gets too big or holds too many files, and files are spread roughly evenly
across all the directories and repositories.
At the same time, the design makes it possible to directly access a single
datafile across all these directories and Git repositories knowing only its
package PURL, resolving it to a URL to fetch the datafile directly through the
Git web interface (like on GitHub, GitLab or gitweb).
Why not use a single Git repo?
We need multiple Git repositories to avoid very big repositories that are
impractical to use. We want each repo to stay under the common limits of public
repository hosting services, like GitHub and its 5GB limit. Typically, a
maximum size of 5GB and a target size of about 1GB of compressed content makes
the most sense. We store text, and Git's combination of xdiff/xdelta and zlib
compression typically reduces the stored size by a factor of about 5, meaning
that a 1GB repo may contain about 5GB of actual uncompressed text.
Why not use a single dir in a repo?
Multiple directories are needed to store many package datafiles: putting too
many files in a single directory makes every filesystem's performance suffer.
Typically, a maximum of about 10,000 files per directory is a decent target.
Hash-based content distribution
To distribute files roughly evenly across repositories and directories while
still using the PURL as a key, we use a hashid derived from a hash computed on
the PURL string, and use that to generate repository and directory names.
It then becomes possible to distribute the data across many Git repositories and
directories evenly and compute a URL and path to access a datafile directly
from a PURL.
Object hierarchy
federation: defined by its name and a Git repo holding a config file with the
clusters configuration (data kind and PURL type parameters), pointing to
multiple repositories.
cluster: identified by the data kind name, which prefixes its data repo names
repo: data repo (Git) identified by data-kind+PURL-type+hashid
directory: dir in a repo, identified by PURL-type+PURL-hashid
PURL path: ns/name/version/extra_path derived from the PURL
datafile: file storing the data as text (JSON, YAML or XML)
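A rough sketch of this hierarchy as Python dataclasses (the class and field
names here are illustrative assumptions, not the actual library API)::

    from dataclasses import dataclass, field

    @dataclass
    class DataRepository:
        name: str       # data-kind + PURL-type + hashid, like "purls-gem-0000"
        purl_type: str  # like "gem"
        hashid: str     # first hashid of the range stored here, like "0000"

    @dataclass
    class DataCluster:
        data_kind: str      # like "purls"; prefixes all repo names
        dirs_per_repo: int  # cluster-defined number of dirs per Git repo
        repos: list[DataRepository] = field(default_factory=list)

    @dataclass
    class DataFederation:
        name: str  # like "aboutcode-data"; also the name of the config repo
        clusters: list[DataCluster] = field(default_factory=list)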
Example
For instance, in the aboutcode data federation, for a cluster about purl
versions, we would have:
data federation definition git repo, with its config file.
aboutcode-data/aboutcode-data
aboutcode-federation-config.yml
data cluster repos, whose name prefix is the data kind
aboutcode-data/purls
data repository git repo, with a purl sub dir tree and datafile.
The first repo name has a hashid of 0000, which is the first PURL hashid of the
range of PURL hashids stored in this repo's dirs.
aboutcode-data/purls-gem-0000/
data directory, with a purl sub dir tree and datafile. The dir name is
composed of type+hashid.
aboutcode-data/purls-gem-0000/gem-0107/
PURL subdirectory, and datafile, here a list of PURLs for the gem named rails:
aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml
In this example, the base URL for this cluster is the aboutcode-data GitHub
organization, so the URL to the purls.yml datafile is inferred this way, based
on the cluster config:
https://github.com/
aboutcode-data/purls-gem-0000/
raw/refs/heads/main/
gem-0107/rails/purls.yml
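As an illustration, such a URL could be assembled with a small helper (a
sketch assuming a GitHub-style raw URL layout and a main branch; the function
name and signature are hypothetical, not the library API)::

    def datafile_url(base_url, repo_name, dir_name, purl_path, filename,
                     branch="main"):
        # e.g., base_url="https://github.com/aboutcode-data",
        # repo_name="purls-gem-0000", dir_name="gem-0107",
        # purl_path="rails", filename="purls.yml"
        return (
            f"{base_url}/{repo_name}/raw/refs/heads/{branch}"
            f"/{dir_name}/{purl_path}/{filename}"
        )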
More Design details
The DataCluster and Data Kind design aligns with the needs of users: for
example, a user consuming only vulnerability data for Java and JavaScript may
not care about Haskell metadata, or may care only about another kind of data
like fingerprints.
DataCluster: A set of repos for only one data kind for many package types.
Data Kind: Identifier for the kind of data stored in the datafiles of a
DataCluster, like PURL versions, the original API metadata files, high-level
scans, scans with file details, reachability slices, fingerprints, or
vulnerability advisories and so on.
Repository: A repo is a Git repo that stores a group of Directories of a
DataCluster/data kind, like for all the npms with a PURL hash of 0000 to 1023,
where we store npm metadata files for each PURL. All repo names in a cluster
share the same data-kind prefix.
Directory: Named after a PURL type and PURL hashid, it stores the datafiles
for the PURLs that hash to that hashid.
Naming conventions
Federation: like aboutcode-data. Also the name of the config repo.
DataCluster name prefix: data kind stored in that cluster, like "purls" or "scancode"
For data repos: data kind + PURL type + PURL hashid like
purls-npm-0512 or purls-scancode-scans-0000
The PURL hashid in the repo name is the first hashid of the range of hashids
stored in that repo.
For data dirs in a repo: PURL type + PURL hashid like npm-0513 or pypi-0000.
The hashid is that of the PURLs whose datafiles are stored in that directory.
PURL Hashid
The PURL hashid is central to the design and is simply a number between 0 and
1023 (i.e., 1024 values, a power of two).
It could be increased up to 8192 in the future, but 1024 is good enough to
spread files across multiple dirs.
The Core PURL is a PURL without version, subpath and qualifiers. We hash this
Core PURL as UTF-8-encoded bytes using SHA256.
The first few bytes of the SHA256 binary digest are converted to an integer
using little-endian encoding, then reduced modulo 1024 to yield an integer that
is formatted as a 4-char, zero-padded string between 0000 and 1023.
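A minimal Python sketch of this computation, assuming the first 8 bytes of the
digest are used (the exact "first few bytes" count is an assumption here)::

    import hashlib

    def purl_hashid(core_purl, modulo=1024):
        # core_purl is a PURL string without version, qualifiers and
        # subpath, like "pkg:gem/rails"
        digest = hashlib.sha256(core_purl.encode("utf-8")).digest()
        # first few digest bytes -> little-endian int -> modulo 1024
        value = int.from_bytes(digest[:8], byteorder="little") % modulo
        # 4-char, zero-padded string between "0000" and "1023"
        return f"{value:04d}"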
Based on this hashid, the data kind and the PURL type, directories are grouped
in one or more Git repositories of a cluster, based on a cluster-defined number
of directories of a type per Git repo.
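For instance, the repo and directory names could be derived this way, reusing
the purl_hashid() sketch above (again a sketch; the function name and signature
are illustrative, not the library API)::

    def repo_and_dir_names(core_purl, data_kind, purl_type, dirs_per_repo):
        hashid = int(purl_hashid(core_purl))
        # the dir name depends only on the PURL type and the hashid
        dir_name = f"{purl_type}-{hashid:04d}"
        # the repo is named after the first hashid of the range it stores
        start = (hashid // dirs_per_repo) * dirs_per_repo
        repo_name = f"{data_kind}-{purl_type}-{start:04d}"
        return repo_name, dir_name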
Example of repo and dir names
With 4 dirs per repo, we get 256 repos, like these:
purls-npm-0000
    npm-0000
    npm-0001
    npm-0002
    npm-0003
purls-npm-0004
    npm-0004
    npm-0005
    npm-0006
    npm-0007
purls-npm-0008
    npm-0008
    ... and so on
And with 512 dirs per repo, we get 2 repos:
purls-npm-0000
    npm-0000
    npm-0001
    npm-0002
    ...
    npm-0511
purls-npm-0512
    npm-0512
    npm-0513
    ...
    npm-1023
Git repo sizing assumptions for each ecosystem
For small ecosystems with few packages, like luarocks or swift, a single Git
repo or a few repos may be enough to store all the data of a kind. There, a
luarocks cluster of repos will have a single Git repo, with 1024 root
directories.
At the other end of the spectrum, a package type with many packages like npm
may need 1024 Git repositories to store all the metadata. In this case an npm
cluster of repos will have 1024 Git repos, each with a single root directory.
We can start with reasonable assumptions wrt. the size of each cluster, as a
number of directories per Git repo and the volume of data we would store in
each, using starting values for each ecosystem (such as mlflow, pub, rpm or
cargo).
For instance, say we want a cluster to store all the npm PURLs. As of 2025-10,
npm hosts about 4M unique package names (and roughly 20 versions per name on
average, with ~80M updates in total per https://replicate.npmjs.com/). Storing
4M names takes about 100MB uncompressed. Adding versions would take about 2GB
uncompressed. This means that we can comfortably store all npm PURLs in a
single repository size-wise, but we may want to use more repositories anyway,
as storing 4M directories and purls.yml files in a single repo will not be a
happy event; using 32 repos with 32 dirs or 64 repos with 16 dirs each may be a
better approach.
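A back-of-the-envelope check of these estimates, assuming roughly 25 bytes per
PURL line (an assumption, not a measured value)::

    names = 4_000_000        # unique npm package names
    versions_per_name = 20   # rough average
    bytes_per_purl = 25      # assumed average PURL line length
    print(names * bytes_per_purl / 1e6, "MB")  # ~100 MB for names only
    print(names * versions_per_name * bytes_per_purl / 1e9, "GB")  # ~2 GB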
See also original post on the approach:
Rebalancing and splitting a DataCluster's repos
We can rebalance a cluster: for example, when we first store the data in a
cluster with a single Git repository for a given PURL type and later split this
repo into more repos, we can do so without losing the ability to address
datafiles directly knowing only a PURL, and without having to rename all the
files and directories.
In this design, the directory names are stable and do not change as long as we
keep the default 1024 hash values for the PURL hashid. The only things that
change are the repo names, when more repos are created from a split because the
size of a Git repo has grown too large.
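A minimal sketch of this invariant: for a fixed PURL hashid, only the repo-name
grouping changes with the number of directories per repo::

    hashid = 600  # a hypothetical PURL hashid
    for dirs_per_repo in (1024, 512, 256):
        start = (hashid // dirs_per_repo) * dirs_per_repo
        print(f"purls-npm-{start:04d} / npm-{hashid:04d}")
    # prints:
    #   purls-npm-0000 / npm-0600
    #   purls-npm-0512 / npm-0600
    #   purls-npm-0512 / npm-0600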
When a split occurs, we should perform these operations:
lock the cluster as "read-only" for the duration of the split operation. This
signals to processes and tools that are updating the cluster that they cannot
push new data there yet. This could be done by updating the cluster config or
the federation config.
copy existing Git repos to be split to new repos based on the new number of
directories per repo.
filter Git history in existing and new repos to keep only the history related
to the directories stored in a given repo.
update the cluster config file in the cluster Git repo with the new number of
directories
push the new and existing Git repos
unlock the cluster.
We may need to keep the old and new Clusters around too, and may need to add a
simple DataCluster version suffix in Cluster names, and a way to redirect from
an old frozen, inactive DataCluster to a new rebalanced one.
It may even be possible to continue writing to a cluster as long as writing is
done in two places until the split is completed. In practice, splits should be
reasonably rare and reasonably fast, making this a lesser issue.
It is also possible to change the PURL hashid range for a DataCluster, say
going from 1024 to 2048, 4096 or 8192. This would imply moving all the files
around, as the directory structure would change with the new hashids. This is
likely to be an exceptional operation.