swh-subgrapher
is a Rust script designed to assist in the generation of Software Heritage Subgraphs. It takes a list of input Origins, retrieves the corresponding SWHIDs (Software Heritage Identifiers), and traverses the Software Heritage graph to find all associated objects.
This tool leverages the official swh-graph
library to interact with the Software Heritage graph data.
The primary goal is to produce a list of SWHIDs that can then be used in the Software Heritage generate_subdataset
process to create a custom, smaller dataset from the vast Software Heritage archive.
The script performs the following main functions:
- Loads Origins: Reads a list of origin URLs from a specified input file.
- Loads Graph Data: Initializes and loads the Software Heritage graph dataset from a local path.
- Resolves Origins to SWHIDs: For each origin URL:
- Calculates its SHA1 hash to form a potential SWHID.
- Looks up the SWHID in the loaded graph.
- If an origin is not found, and the
--try-protocol-variations
flag is set, it will attempt to find the origin by switching betweengit://
andhttps://
protocols.
- Graph Traversal: For each successfully found origin node, it performs a Breadth-First Search (BFS) starting from that node to discover all reachable nodes (revisions, directories, contents, etc.) in the graph.
- Collects SWHIDs: All unique SWHIDs encountered during the traversal are collected.
- Outputs Results:
- Writes the collected SWHIDs to
output.txt
, with each SWHID on a new line. - If any origins could not be found in the graph, their URLs are written to
errors.txt
.
- Writes the collected SWHIDs to
- Rust programming language and Cargo (its package manager).
- A local copy of the Software Heritage graph dataset. You can find information on how to obtain this on the Software Heritage documentation.
- the smaller “History and hosting” Compressed graph has everything needed for this task
- The
swh-graph
library and its dependencies must be available.
-
Clone this repository or download the source code.
-
Navigate to the project directory.
-
Build the project using Cargo:
cargo build --release
The executable will be located in
target/release/swh-subgrapher
.
To run the script, you need to provide the path to the Software Heritage graph dataset and the path to a file containing the list of origin URLs.
swh-subgrapher --graph /path/to/your/dataset/graph --origins origins.txt