Conversation

pombredanne
Member

This is an extensive rework of the utilities to compute federated paths using PURLs.

The design (copied from the script) comes out this way:

Federated data utilities to handle content-defined and hash-addressable Package
data keyed by PURL and stored in many Git repositories. This approach to
federating decentralized data is called FederatedCode.

Overview

The main design elements are:

  1. Data Federation: A Data Federation is a database representing a consistent,
    non-overlapping set of data kind clusters (like scans, vulnerabilities or SBOMs)
    across many package ecosystems, aka PURL types.
    A Federation is similar to a traditional database.

  2. Data Cluster: A Data Federation contains Data Clusters; a Data Cluster's
    purpose is to store the data of a single kind (like scans) across multiple PURL
    types. The cluster name is the data kind name and is used as the prefix for
    repository names. A Data Cluster is akin to a table in a traditional database.

  3. Data Repository: A DataCluster contains one or more Git Data Repositories,
    each storing datafiles of the cluster's data kind and a single PURL type,
    spreading the datafiles across multiple Data Directories. The repository name
    is data-kind+PURL-type+hashid. A Repository is similar to a shard or
    tablespace in a traditional database.

  4. Data Directory: In a Repository, a Data Directory contains the datafiles for
    PURLs. The directory name is PURL-type+hashid.

  5. Data File: This is a Data File of the DataCluster's Data Kind that is
    stored in subdirectories structured after the PURL components:
    namespace/name/version/qualifiers/subpath:

  • Either at the level of a PURL name: namespace/name,
  • Or at the PURL version level: namespace/name/version,
  • Or at the PURL qualifiers+subpath level: namespace/name/version/qualifiers/subpath.

A Data File can be for instance a JSON scan results file, or a list of PURLs in
YAML.

For example, a list of PURLs as a Data Kind would be stored at the name
subdirectory level::

gem-0107/gem/random_password_generator/purls.yml

Or a ScanCode scan as a Data Kind at the version subdirectory level::

gem-0107/npm/file/3.24.3/scancode.yml
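The choice of storage level can be sketched as a small helper that joins the
relevant PURL components. The function name and signature below are
illustrative only, not an actual API of these utilities:

```python
def purl_subpath(name, namespace=None, version=None):
    """Build the datafile storage subpath from PURL components, depending on
    the level (name or version) at which the Data Kind is stored.
    Hypothetical helper for illustration; qualifiers and subpath are ignored."""
    return "/".join(part for part in (namespace, name, version) if part)


# A list of PURLs stored at the name subdirectory level:
print(purl_subpath("random_password_generator"))  # → random_password_generator
# A scan stored at the version subdirectory level:
print(purl_subpath("file", version="3.24.3"))     # → file/3.24.3
```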

Design

The core approach is to distribute the many datafiles for a package in multiple
directories stored in multiple Git repositories, so that each directory and repo
is not too big, with not too many files, and files are spread roughly evenly
across all the directories and repositories.

At the same time, the design makes it possible to directly access a single
datafile across all these directories and Git repositories knowing only its
package PURL, resolving it to a URL to fetch that single datafile directly
through the Git web interface (like on GitHub, GitLab or gitweb).

Why not use a single Git repo?

We need multiple Git repositories to avoid very big repositories that are
impractical to use. We want each repo to stay under the common limits of public
repository hosting services, like GitHub and its 5GB limit. Typically, a maximum
size of 5GB and a target size of about 1GB of compressed content makes the most
sense. We store text, and Git's combination of xdiff/xdelta and zlib compression
can typically reduce the stored size by a factor of about 5, meaning that a 1GB
repo may contain about 5GB of actual uncompressed text.

Why not use a single dir in a repo?

Multiple directories are needed to store many package datafiles and to avoid
putting too many files in the same directory, which hurts performance on every
filesystem. Typically, a maximum of about 10,000 files per directory is a
decent target.

Hash-based content distribution

To distribute files roughly evenly across repositories and directories while
still using the PURL as a key, we use a hashid derived from a hash computed on
the PURL string, and use that to generate repository and directory names.

It then becomes possible to distribute the data across many Git repositories and
directories evenly and compute a URL and path to access a datafile directly
from a PURL.

Object hierarchy

  • federation: defined by its name and a Git repo with a config file holding
    the clusters configuration for data kind and PURL type parameters, enabling
    pointing to multiple repositories.
  • cluster: identified by the data kind name, prefixing its data repos.
  • repo: data repo (Git) identified by data-kind+PURL-type+hashid.
  • directory: dir in a repo, identified by PURL-type+PURL-hashid.
  • PURL path: ns/name/version/extra_path derived from the PURL.
  • datafile: file storing the data as text JSON/YAML/XML.
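The schema of the federation config file is not specified in this document; a
hypothetical sketch of what aboutcode-federation-config.yml could contain,
where every key name is an assumption, might look like:

```yaml
# Hypothetical federation config sketch; the actual schema is not
# defined in this document and every key here is an assumption.
federation: aboutcode-data
clusters:
  purls:                    # cluster named after its data kind
    purl_types:
      npm:
        dirs_per_repo: 1    # ~5M packages: 1,024 repos
      gem:
        dirs_per_repo: 32   # ~50K packages: 32 repos
```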

Example

For instance, in the aboutcode data federation, for a cluster about purl
versions, we would have:

  • data federation definition git repo, with its config file.
    aboutcode-data/aboutcode-data
    aboutcode-federation-config.yml

  • data cluster repo name prefix is the data kind
    aboutcode-data/purls

  • data repository git repo, with a purl sub dir tree and datafile.
    The first repo name uses a hashid of 0000, which is the first PURL hashid of
    the range of PURL hashids stored in this repo's dirs.
    aboutcode-data/purls-gem-0000/

  • data directory, with a purl sub dir tree and datafile. The dir name is
    composed of type+hashid.
    aboutcode-data/purls-gem-0000/gem-0107/

  • PURL subdirectory, and datafile, here list of PURLs for the gem named rails:
    aboutcode-data/purls-gem-0000/gem-0107/rails/purls.yml

In this example, the base URL for this cluster is the aboutcode-data GitHub
organization, so the URL to the purls.yml datafile is inferred this way from
the cluster config:

https://github.com/
aboutcode-data/purls-gem-0000/
raw/refs/heads/main/
gem-0107/rails/purls.yml
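Assembling that URL from the name parts can be sketched as follows. The helper
name is hypothetical, and the raw-URL layout assumes the GitHub
/raw/refs/heads/ scheme shown above:

```python
def datafile_url(org, repo, directory, purl_path, datafile, branch="main"):
    """Assemble a raw-content URL, GitHub-style, from the naming-convention
    parts. Hypothetical helper for illustration only."""
    return (
        f"https://github.com/{org}/{repo}"
        f"/raw/refs/heads/{branch}/{directory}/{purl_path}/{datafile}"
    )


print(datafile_url("aboutcode-data", "purls-gem-0000", "gem-0107", "rails", "purls.yml"))
# → https://github.com/aboutcode-data/purls-gem-0000/raw/refs/heads/main/gem-0107/rails/purls.yml
```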

More Design details

The DataCluster and Data Kind design aligns with the needs of users: for
example, a user consuming only vulnerability data for Java and JavaScript may
not care about Haskell metadata, or may care only about another kind of data
like fingerprints.

  • DataCluster: A set of repos for only one data kind for many package types.

  • Data Kind: Identifier for the kind of data stored in the datafile of
    DataCluster, like PURL versions, or the original API metadata files, or high
    level scans, or scans with file details, reachability slices, fingerprints, or
    vulnerability advisories and so on.

  • Repository: A repo is a Git repo that stores a group of Directories of a
    DataCluster/data kind, like for all the npms with a PURL hash of 0000 to 1023,
    where we store npm metadata files for each PURL. All repo names in a cluster
    share the same data-kind prefix.

  • Directory: Named after a PURL type and PURL hashid, it stores the datafiles
    for the PURLs that hash to that hashid.

Naming conventions

  • Federation: like aboutcode-data. Also the name of the config repo.

  • DataCluster name prefix: data kind stored in that cluster, like "purls" or "scancode"

  • For data repos: data kind + PURL type + PURL hashid like
    purls-npm-0512 or purls-scancode-scans-0000
    The PURL hashid is the first hashid of the range of hashids stored in that repo.

  • For data dirs in a repo: PURL type + hashid, like npm-0513 or pypi-0000.
    The hashid is that of the PURLs whose datafiles are stored in that directory.

PURL Hashid

The PURL hashid is central to the design and is simply a number between 0 and
1023 (i.e., 1024 values, a power of two).

It could be increased up to 8192 in the future, but 1024 is good enough to
spread files across multiple dirs.

The Core PURL is a PURL without version, subpath and qualifiers. We hash this
Core PURL as UTF-8-encoded bytes using SHA256.

The first few bytes of the SHA256 binary digest are converted to an integer
using little-endian encoding, then reduced modulo a maximum value of 1024 to
yield an integer rendered as a 4-character, zero-padded string between 0000
and 1023.

Based on this hashid, the data kind, and the PURL type, directories are grouped
into one or more Git repositories of a cluster, based on a cluster-defined
number of directories of a type per Git repo.

Example of repo and dir names

With 4 dirs per repo, we get 256 repos, like these:

purls-npm-0000
npm-0000
npm-0001
npm-0002
npm-0003

purls-npm-0004
npm-0004
npm-0005
npm-0006
npm-0007

purls-npm-0008
npm-0008
... and so on

And with 512 dirs per repo, we get 2 repos:

purls-npm-0000
npm-0000
npm-0001
npm-0002
...
npm-0511

purls-npm-0512
npm-0512
npm-0513
...
npm-1023
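The mapping from a hashid to its repo and directory names can be sketched like
this; the helper name is hypothetical:

```python
def repo_and_dir_names(data_kind, purl_type, hashid, dirs_per_repo):
    """Return the (repo, directory) names that a PURL hashid maps to,
    following the naming conventions above. Hypothetical helper."""
    number = int(hashid)
    # The repo name uses the first hashid of the range it covers.
    first_in_range = (number // dirs_per_repo) * dirs_per_repo
    repo = f"{data_kind}-{purl_type}-{first_in_range:04d}"
    directory = f"{purl_type}-{number:04d}"
    return repo, directory


print(repo_and_dir_names("purls", "npm", "0513", 512))
# → ('purls-npm-0512', 'npm-0513')
```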

Git repo sizing assumptions for each ecosystem

For small ecosystems with few packages, like luarocks or swift, a single Git
repo or a few repos may be enough to store all the data of a kind. For example,
a luarocks cluster will have a single Git repo with 1,024 root directories.

At the other end of the spectrum, a package type with many packages like npm may
need 1024 Git repositories to store all the metadata. In this case, an npm
cluster will have 1,024 Git repos, each with a single root directory.

We can start with reasonable assumptions with respect to the size of each
cluster, expressed as a number of directories per Git repo and the volume of
data we would store in each, using these starting values:

  1. For super large ecosystems (with ~5M packages):
  • one dir per repo, yielding 1,024 repos
  • github, npm
  2. For large ecosystems (with ~500K packages):
  • eight dirs per repo, yielding 128 repos
  • golang, maven, nuget, perl, php, pypi, ruby, huggingface
  3. For medium ecosystems (with ~50K packages):
  • 32 dirs per repo, yielding 32 Git repositories
  • alpm, bitbucket, cocoapods, composer, deb, docker, gem, generic,
    mlflow, pub, rpm, cargo
  4. For small ecosystems (with ~2K packages):
  • 1,024 directories in one Git repository
  • all others
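The sizing tiers above can be sketched as a simple lookup; the table is
abridged and all names here are assumptions:

```python
# Abridged starting sizing table from the tiers above; names are assumptions.
DIRS_PER_REPO_BY_TYPE = {
    "github": 1, "npm": 1,               # ~5M packages: 1,024 repos
    "golang": 8, "maven": 8, "pypi": 8,  # ~500K packages: 128 repos
    "gem": 32, "deb": 32, "cargo": 32,   # ~50K packages: 32 repos
}
DEFAULT_DIRS_PER_REPO = 1024             # small ecosystems: a single repo


def repo_count(purl_type, total_hashids=1024):
    """Number of Git repos needed for a PURL type, given its dirs-per-repo."""
    dirs_per_repo = DIRS_PER_REPO_BY_TYPE.get(purl_type, DEFAULT_DIRS_PER_REPO)
    return total_hashids // dirs_per_repo


print(repo_count("npm"))       # → 1024
print(repo_count("gem"))       # → 32
print(repo_count("luarocks"))  # → 1
```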

For instance, say we want a cluster to store all the npm PURLs. As of 2025-10,
npm hosts about 4M unique package names (and roughly 20 versions per name on
average, with ~80M updates in total per https://replicate.npmjs.com/). Storing
4M names takes about 100MB uncompressed. Adding versions would take about 2GB
uncompressed. This means that we can comfortably store all npm PURLs in a
single repository size-wise, but we may want to use more repositories anyway,
as storing 4M directories and purls.yml files in a single repo will not be a
happy event; using 32 repos with 32 dirs or 64 repos with 16 dirs may be a
better approach.

See also original post on the approach:

Rebalancing and splitting a DataCluster repos

We can rebalance a cluster, for example when we first store the data in a
cluster with a single Git repository for a given PURL type and later split this
repo into more repos, without losing the ability to address datafiles directly
knowing only a PURL, and without having to rename all the files and
directories.

In this design, the directory names are stable and do not change as long as we
keep the default 1024 hash values for the PURL hashid. The only things that
change are the repo names, when more repos are created from a split because the
size of a Git repo has grown too large.

When a split occurs, we should perform these operations:

  • lock the cluster as "read-only" for the duration of the split operation. This is
    to signal to processes and tools that are updating the cluster that they cannot
    push new data there yet. This could be done by updating the cluster config
    or the federation config.

  • copy existing Git repos to be split to new repos based on the new number of
    directories per repo.

  • filter Git history in existing and new repos to keep only the history related
    to the directories stored in a given repo.

  • update the cluster config file in the cluster Git repo with the new number of
    directories.

  • push the new and existing Git repos.

  • unlock the cluster.

We may need to keep the old and new Clusters around too, add a simple
DataCluster version suffix in Cluster names, and provide a way to redirect from
an old, frozen, inactive DataCluster to a new, rebalanced one.

It may even be possible to continue writing to a cluster, as long as writing is
done in two places until the split is completed. In practice, splits should be
reasonably rare and reasonably fast, making this a lesser issue.

It is also possible to change the PURL hashid range for a DataCluster, say
going from 1024 to 2048, 4096 or 8192. This would imply moving all the files
around, as the directory structure would change with the new hashids. This is
likely to be an exceptional operation.

@pombredanne
Member Author

Note that I would prefer to pull this out as a new library and this is still a WIP
