How dependent is a project on a handful of people? The concentration component
measures the distribution of authorship across a repo's contributors and
distills it into one concentration-risk score (score, 0–100, higher = more
concentrated = more at-risk) that feeds data/risk/risk.csv as the column
concentration. Two independent methods — a git-clone commit log and the GitHub
/contributors API — each produce a bus factor, an HHI, and contributor counts;
only the git _5y axis drives the score.
Scope: the A/B value-class repos in the risk pipeline (see
value.md). Build step: src/risk/build_concentration.py.
concentration.csv carries 25 columns, but the score uses exactly two —
both from the git-clone log over the _5y window (2021–2025), each turned into a
direction-aware 0–100 risk percentile and combined as their geometric mean:
- Bus factor (`bf_commits_git_5y`): sort the merged, non-bot contributors by commits descending and count how many it takes for their cumulative commit share to reach 50% (`bus_factor_threshold`). A low bus factor (often 1, a single person covering half the window's commits) means high risk.
- HHI (`hhi_commits_git_5y`): the Herfindahl–Hirschman index of commit shares, `10000 · Σ pᵢ²`, ranging from `10000 / n` (commits spread evenly) up to `10000` (one author writes everything). A high HHI means high risk.
- No active human in the window: any repo with no human commit in the `_5y` window (2021–2025) is imputed the worst valid pair, bus factor `1` and HHI `10000`. A project with no active human maintainer in five years is a maximal single point of failure, so it ranks as fully concentrated rather than dropping out of the score. This covers the dormant/archived case (`comment` = `no commits in 5y`), the bot-only-window case (`no human commits in 5y`), and the new-repo case (`no commits through last complete year`). It applies only to successfully fetched repos: a clone/log failure writes no data and is the only thing left blank, so a fetch failure is never mistaken for a real measurement. As a result the score is fully populated across the risk set.

`score = max(1, round(√(bf_commits_git_5y_p × hhi_commits_git_5y_p)))`
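As a quick check of the formula above, here is a minimal sketch in Python. The function name `concentration_score` is hypothetical and not taken from `build_concentration.py`; only the formula itself comes from the text.

```python
import math


def concentration_score(bf_p: float, hhi_p: float) -> int:
    """Geometric mean of the two _5y risk percentiles, floored at 1."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    return max(1, round(math.sqrt(bf_p * hhi_p)))


print(concentration_score(90, 40))    # sqrt(3600) = 60
print(concentration_score(100, 100))  # both axes maximal: 100
print(concentration_score(0, 80))     # product is 0, so the floor returns 1
```

Because of the `max(1, …)` floor, the smallest value this formula can emit is 1.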
Every other column — the GitHub /contributors _gh_alltime figures, the git
_full lifetime figures, and all contributor/commit counts — is emitted for
inspection and cross-checking only. None of them feed score.
Each leaf is one column with its source and the period it represents. _full =
all commits through the last complete year (max(settings.years) = 2025);
_5y = the last concentration.window_years (5) complete years, 2021–2025;
_gh_alltime = the GitHub API's lifetime count as of fetch, with no year cutoff (no
per-year breakdown, contributor list capped near 500). Raw long-format signals are fetched
per-source under data/sources/; all derived columns are computed by
build_concentration.py. Only the git _5y bus-factor + HHI percentiles feed
score.
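The period definitions above reduce to simple arithmetic on the configured years. The sketch below assumes `settings.years` is an iterable of integer years and that the window length is a plain integer; the helper name `window_bounds` is hypothetical.

```python
def window_bounds(years, window_years=5):
    """Return (first_year, last_year) of the scoring window, inclusive."""
    last = max(years)                      # last complete year
    return last - window_years + 1, last   # e.g. 5 years ending at `last`


# Assumed example: settings.years covers 2015 through 2025.
print(window_bounds(range(2015, 2026)))  # (2021, 2025)
```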
Concentration → data/risk/concentration.csv (one row per A/B risk repo)
│
├── git-clone method (data/sources/git/contributor-commits.csv)
│ ├── _full (all commits through 2025)
│ │ ├── total_commits_git_full ← Σ non-merge commits ≤ last complete year [→2025]
│ │ ├── contributors_git_full ← derived (merged non-bot identities) [→2025]
│ │ ├── bf_commits_git_full ← derived (bus factor) [→2025]
│ │ ├── bf_commits_git_full_p ← derived (risk percentile, low BF → high) [→2025]
│ │ ├── hhi_commits_git_full ← derived (HHI, 0–10000) [→2025]
│ │ └── hhi_commits_git_full_p ← derived (risk percentile, high HHI → high) [→2025]
│ └── _5y (2021–2025 window — the scoring axis)
│ ├── commits_git_5y ← Σ non-merge commits in window [2021–2025]
│ ├── active_contributors_git_5y ← derived (merged non-bot, active in win) [2021–2025]
│ ├── bf_commits_git_5y ← derived (bus factor) [2021–2025]
│ ├── bf_commits_git_5y_p ← derived (risk percentile) ★ scores [2021–2025]
│ ├── hhi_commits_git_5y ← derived (HHI, 0–10000) [2021–2025]
│ └── hhi_commits_git_5y_p ← derived (risk percentile) ★ scores [2021–2025]
│
├── GitHub /contributors method (data/sources/github/contributor-commits.csv)
│ └── _gh_alltime (lifetime, list capped ~500 — cross-check only)
│ ├── total_commits_gh_alltime ← Σ /contributors `contributions` [lifetime]
│ ├── total_contributors_gh_alltime ← all /contributors rows (incl. bots) [lifetime]
│ ├── active_contributors_gh_alltime ← derived (non-bot rows) [lifetime]
│ ├── bf_commits_gh_alltime ← derived (bus factor) [lifetime]
│ ├── bf_commits_gh_alltime_p ← derived (risk percentile) [lifetime]
│ ├── hhi_commits_gh_alltime ← derived (HHI, 0–10000) [lifetime]
│ └── hhi_commits_gh_alltime_p ← derived (risk percentile) [lifetime]
│
└── score (the component score) ← geometric mean of bf_commits_git_5y_p
│ and hhi_commits_git_5y_p (0–100)
└─ carried into risk.csv as the column `concentration` (this score only)
- Collect: two fetchers dump raw long-format per-contributor data into `data/sources/`: the git fetcher walks `git log` on a bare clone, the GitHub fetcher hits the `/contributors` API. Each writes a `.status.csv` sidecar carrying `fetched_at` per repo, so a missing metric is distinguishable from a failed fetch.
- Join: `build_concentration.py` joins both sources onto the risk repos by `repo` slug (and reads `data/value/value.csv` for the A/B scope).
- Derive: for each method, merge contributor identities, drop bots, then compute bus factor, HHI (0–10000), and contributor counts. The git method yields both a lifetime (`_full`) and a windowed (`_5y`) figure; the GitHub method yields lifetime only (`_gh_alltime`).
- Score: `add_percentiles` turns the `_5y` bus factor and HHI into risk percentiles (`bf_commits_git_5y_p`, `hhi_commits_git_5y_p`); `score` is their geometric mean, an integer 0–100 (higher = more concentrated = more risk).
- Aggregate: `aggregate_risk.py` carries only `score` into `risk.csv`, renamed to the column `concentration`.
Pipeline order. The GitHub /contributors fetcher runs inside the risk
pipeline; the git long-format fetcher is run separately (it clones repos, so it
is decoupled from the API-only pipeline run):
src.sources.git.contributors (standalone clone-based dump)
↘
src.risk.run_risk_pipeline: … → contributors (GitHub) → … → concentration (build) → … → aggregate
Both methods read a long-format raw file plus its status sidecar. Each row of
the raw file is one (repo, contributor[, year]) tuple; the builder aggregates
over them. Join key into the risk-repo set is repo for both.
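To illustrate how long-format rows collapse into per-contributor totals, here is a minimal sketch using only the standard library. The sample rows and the helper name `commits_per_contributor` are made up for illustration; the real builder may aggregate differently.

```python
from collections import defaultdict

# Hypothetical rows in the git long format:
# (repo, author_name, author_email, year, commits)
ROWS = [
    ("org/a", "Ann", "ann@example.com", 2019, 40),
    ("org/a", "Ann", "ann@example.com", 2022, 10),
    ("org/a", "Bob", "bob@example.com", 2023, 5),
    ("org/b", "Cy", "cy@example.com", 2024, 7),
]


def commits_per_contributor(rows, repo, first_year=None, last_year=2025):
    """Sum commits per (name, email) for one repo, optionally windowed."""
    totals = defaultdict(int)
    for row_repo, name, email, year, commits in rows:
        if row_repo != repo or year > last_year:
            continue  # other repo, or after the last complete year
        if first_year is not None and year < first_year:
            continue  # before the window
        totals[(name, email)] += commits
    return dict(totals)


full = commits_per_contributor(ROWS, "org/a")                  # _full view
win = commits_per_contributor(ROWS, "org/a", first_year=2021)  # _5y view
print(full)
print(win)
```

The same rows give different concentration pictures depending on the period: Ann holds 50 of 55 commits over the lifetime but only 10 of 15 inside the window.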
| Source file (`data/sources/`) | Fetcher | Collects | Key |
|---|---|---|---|
| `git/contributor-commits.csv` | `src/sources/git/contributors.py` | long raw: `repo, author_name, author_email, year, commits` from `git log --no-merges` on a bare treeless clone (mailmap-resolved `%aN`/`%aE`) | `repo` |
| `git/contributor-commits.status.csv` | `src/sources/git/contributors.py` | per-repo git-fetch status + `fetched_at` | `repo` |
| `github/contributor-commits.csv` | `src/sources/github/fetch_contributors_metrics.py` | long raw: `repo, login, contributions, account_type` from the `/repos/{repo}/contributors` endpoint | `repo` |
| `github/contributor-commits.status.csv` | `src/sources/github/fetch_contributors_metrics.py` | per-repo GitHub-fetch status + `fetched_at` | `repo` |
| `value/value.csv` | value pipeline | A/B scope (`load_top_repos`) | `repo` |
| | git-clone log | GitHub `/contributors` |
|---|---|---|
| Sees | every committer, with author dates | accounts only, lifetime cumulative |
| Periods | `_full` and `_5y` (windowed) | `_gh_alltime` (lifetime only) |
| Identity | raw name+email pairs → union-find merge | already keyed by GitHub login |
| Limit | times out on kernel-scale mirrors | list capped near 500 → under-counts big repos |
| Role | drives `score` (the `_5y` axis) | parallel cross-check, never scored |
The git method is authoritative because it carries author dates (so it can
honour a 2021–2025 window) and sees every contributor. The GitHub method has no
per-year breakdown and truncates the contributor list near 500 entries, so its
columns are deliberately labelled _gh_alltime: a lifetime figure with no year cutoff, taken as
of github_fetched_at (so it may include the partial current year), never _full/_5y.
The git method's raw rows are keyed by mailmap-resolved (author_name, author_email) pairs. The builder additionally union-finds identities that share
a normalised email or a full name (merge_identity_groups), so a person who
committed under several addresses counts once. Bot identities are then dropped
(_is_bot_identity for git, account_type == "Bot" / is_bot(login) for
GitHub) before any metric is computed. The GitHub method needs no merging —
/contributors is already keyed by account.
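The identity merge described above can be sketched with a small union-find. This is an illustration under stated assumptions, not the actual `merge_identity_groups`: normalisation here is just strip plus lowercase, and a "full name" is assumed to be any name containing a space. The function name `merge_identities` is hypothetical.

```python
def merge_identities(identities):
    """Group (name, email) pairs that share a normalised email or full name."""
    parent = list(range(len(identities)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    first_seen = {}  # merge key -> index of the first identity carrying it
    for idx, (name, email) in enumerate(identities):
        keys = []
        norm_email = email.strip().lower()
        if norm_email:
            keys.append(("email", norm_email))
        norm_name = name.strip().lower()
        if " " in norm_name:  # assumption: only multi-word names count as full
            keys.append(("name", norm_name))
        for key in keys:
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    groups = {}
    for idx, ident in enumerate(identities):
        groups.setdefault(find(idx), []).append(ident)
    return list(groups.values())


people = merge_identities([
    ("Ann Lee", "ann@a.example"),
    ("Ann Lee", "ann@b.example"),   # same full name as the first
    ("al", "ANN@b.example"),        # same email as the second, different case
    ("Bob", "bob@x.example"),
])
print(len(people))  # 2 merged contributors
```

The third identity shares nothing with the first directly; it joins the same group transitively through the second, which is the reason for using union-find rather than a single lookup.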
Over the merged, bot-free per-contributor commit counts:
| Metric | Definition |
|---|---|
| bus factor (`bf_commits_*`) | fewest contributors whose combined commits reach `bus_factor_threshold` (0.5 = the people covering 50% of commits). Low = concentrated. |
| HHI (`hhi_commits_*`) | Herfindahl–Hirschman index of commit shares, scaled to 0–10000. High = concentrated. |
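Both are undefined (blank, not 0) for a repo with no positive-commit
contributor. A real 0 would falsely rank as both maximum-concentration bus
factor and minimum-concentration HHI, so a blank keeps such repos out of both
percentile rankings. The scored _5y axis is the one exception: a successfully
fetched repo with no human commit in the window is imputed bus factor 1 and
HHI 10000 rather than left blank.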
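A minimal sketch of the two metrics, following the definitions in the table above and returning `None` (blank) when there is no positive-commit contributor. The function names are hypothetical; the input is assumed to be the merged, bot-free per-contributor commit counts.

```python
def bus_factor(commits, threshold=0.5):
    """Fewest contributors whose combined commits reach `threshold` of the total."""
    counts = sorted((c for c in commits if c > 0), reverse=True)
    total = sum(counts)
    if total == 0:
        return None  # undefined, not 0
    running = 0
    for n, c in enumerate(counts, start=1):
        running += c
        if running >= threshold * total:
            return n
    return len(counts)


def hhi(commits):
    """Herfindahl-Hirschman index of commit shares, scaled to 0-10000."""
    counts = [c for c in commits if c > 0]
    total = sum(counts)
    if total == 0:
        return None  # undefined, not 0
    return 10000 * sum((c / total) ** 2 for c in counts)


print(bus_factor([50, 30, 20]), hhi([50, 30, 20]))          # 1, about 3800
print(bus_factor([10, 10, 10, 10]), hhi([10, 10, 10, 10]))  # 2, 2500.0
print(bus_factor([]), hhi([]))                              # None None
```

With four equal contributors the HHI equals `10000 / n` = 2500, the even-spread lower bound for n = 4.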
add_percentiles ranks each metric into a 0–100 risk percentile, with the
direction chosen per metric so that more concentrated always ranks higher:
| Column | Basis | `higher_is_worse` | Direction |
|---|---|---|---|
| `bf_commits_git_5y_p` | `bf_commits_git_5y` | `False` | low bus factor → high percentile |
| `hhi_commits_git_5y_p` | `hhi_commits_git_5y` | `True` | high HHI → high percentile |
| `bf_commits_git_full_p` | `bf_commits_git_full` | `False` | (not scored) |
| `hhi_commits_git_full_p` | `hhi_commits_git_full` | `True` | (not scored) |
| `bf_commits_gh_alltime_p` | `bf_commits_gh_alltime` | `False` | (not scored) |
| `hhi_commits_gh_alltime_p` | `hhi_commits_gh_alltime` | `True` | (not scored) |
| `score` | geom mean of the two `_5y` `_p` | n/a | the concentration-risk score |
Only the two _5y percentiles compose the score (composite_cols = ["bf_commits_git_5y_p", "hhi_commits_git_5y_p"]). The _full and _gh_alltime
percentiles are emitted for inspection but do not feed score. Because the
geometric mean is pulled toward the smaller of its two inputs, a repo scores
as high-risk only when both axes agree it is concentrated; a low percentile
on either axis pulls the score down (percentiles of 100 and 25 give 50, where
an arithmetic mean would give 62.5).
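The direction-aware ranking can be sketched as follows. This is not the real `add_percentiles`: the exact percentile formula and tie handling are not stated in this document, so the sketch assumes "share of non-blank repos that are no more concentrated than this one", and the name `risk_percentiles` is hypothetical.

```python
def risk_percentiles(values, higher_is_worse):
    """Map each value to a 0-100 risk percentile; blanks (None) stay blank."""
    present = [v for v in values if v is not None]
    n = len(present)
    out = []
    for v in values:
        if v is None:
            out.append(None)  # blank metrics stay out of the ranking
            continue
        if higher_is_worse:
            no_worse = sum(1 for w in present if w <= v)
        else:
            no_worse = sum(1 for w in present if w >= v)
        out.append(round(100 * no_worse / n))
    return out


# Bus factor: low is risky, so higher_is_worse=False.
print(risk_percentiles([1, 2, 5, None, 1], higher_is_worse=False))
# [100, 50, 25, None, 100]

# HHI: high is risky, so higher_is_worse=True.
print(risk_percentiles([10000, 2500, 5000], higher_is_worse=True))
# [100, 33, 67]
```

Flipping the comparison per metric is what lets both `_p` columns read the same way: a higher percentile always means more concentrated.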
25 columns, one row per risk repo. No fetched_at value columns — per-method
timestamps live in github_fetched_at / git_fetched_at.
| Column | Description |
|---|---|
| `repo`, `repo_id` | identity |
| `total_commits_gh_alltime` | Σ of `/contributors` `contributions` (incl. bots) |
| `total_contributors_gh_alltime` | all `/contributors` rows (incl. bots) |
| `active_contributors_gh_alltime` | non-bot `/contributors` rows |
| `bf_commits_gh_alltime` | bus factor, GitHub method, lifetime |
| `bf_commits_gh_alltime_p` | risk percentile of `bf_commits_gh_alltime` |
| `hhi_commits_gh_alltime` | HHI (0–10000), GitHub method, lifetime |
| `hhi_commits_gh_alltime_p` | risk percentile of `hhi_commits_gh_alltime` |
| `total_commits_git_full` | Σ non-merge commits through 2025 |
| `contributors_git_full` | merged non-bot identities, `_full` |
| `bf_commits_git_full` | bus factor, git method, `_full` |
| `bf_commits_git_full_p` | risk percentile of `bf_commits_git_full` |
| `hhi_commits_git_full` | HHI (0–10000), git method, `_full` |
| `hhi_commits_git_full_p` | risk percentile of `hhi_commits_git_full` |
| `commits_git_5y` | Σ non-merge commits in 2021–2025 |
| `active_contributors_git_5y` | merged non-bot identities active in window |
| `bf_commits_git_5y` | bus factor, git method, `_5y` |
| `bf_commits_git_5y_p` | risk percentile of `bf_commits_git_5y` (scores) |
| `hhi_commits_git_5y` | HHI (0–10000), git method, `_5y` |
| `hhi_commits_git_5y_p` | risk percentile of `hhi_commits_git_5y` (scores) |
| `score` | concentration-risk score (geom-mean of the two `_5y` `_p`, 0–100) |
| `comment` | edge-case note on the `_5y` axis (auditability), else empty. All but the last are imputed bf=1/HHI=10000: `no commits in 5y` (dormant), `no human commits in 5y` (bot-only window), `no commits through last complete year` (only in-progress-year activity). `git fetch <status>` / `no git data` (fetch failed → blank, the only unscored case) |
| `github_fetched_at` | when the GitHub `/contributors` data was fetched |
| `git_fetched_at` | when the git-clone log was fetched |
aggregate_risk.py carries only this component's score into risk.csv,
renamed to the column concentration. Every other column above stays in the
per-dimension CSV. risk.csv today is just repo, repo_id, concentration, complexity, security, funding, workload, score — one score per component plus
the overall geometric-mean score.
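For orientation, the overall score can be sketched as a geometric mean over the five component columns. How `aggregate_risk.py` rounds the result or treats blank components is not described here, so the rounding below is an assumption and `overall_score` is a hypothetical name.

```python
import math


def overall_score(components):
    """Geometric mean of the component scores, rounded to an integer."""
    vals = list(components.values())
    return round(math.prod(vals) ** (1 / len(vals)))


# Made-up component scores for a single repo.
row = {
    "concentration": 80,
    "complexity": 50,
    "security": 20,
    "funding": 40,
    "workload": 10,
}
print(overall_score(row))  # 32
```

The arithmetic mean of these five values would be 40; the geometric mean lands lower because the small `workload` value weighs on the product.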
See docs/stats.md → Risk → Concentration for current per-signal coverage over the top repos and the score distribution.
- No human in the window is imputed; fetch failures are not. Any repo with no human commit in 2021–2025 (dormant, bot-only, or new) is scored as maximally concentrated (`bf = 1`, `HHI = 10000`); see Scored components. The only thing left blank (out of the ranking) is the genuinely unmeasured: a clone that failed or timed out (kernel-scale mirrors). The distinction is auditable: the imputation lives in `git_metrics`, which only runs after a successful fetch, and `scripts/pipeline_health.py` asserts the score is fully populated, so any blank surfaces as a fetch gap to fix rather than passing silently.
- Commits ≠ effort. Both methods count authored commits, not lines, reviews, triage, or maintenance burden. A reviewer or release manager who rarely commits is invisible, so a repo can read as more concentrated than it truly is.
- GitHub method under-counts big repos. `/contributors` caps the list near 500 entries and exposes no per-year breakdown, so its `_gh_alltime` columns are kept only as a cross-check and never feed `score`.
- git method can time out. A bare treeless clone of kernel-scale mirrors (e.g. `archlinux/linux`) can exceed the fetch budget; the status sidecar records the failure and the repo's git columns stay blank.
- Identity merge is heuristic. Union-find over shared email/name catches most aliases, but a contributor who never reused an email or canonical name across identities will still be split, inflating contributor count and deflating concentration slightly.
- `score` is a percentile, not a class. It is a 0–100 rank within this cohort, not an absolute rating, and it is one of five inputs to the overall `risk.csv` `score` (geometric mean of the component scores).
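The first limitation above (impute when no human is active, leave blank when the fetch failed) comes down to a three-way decision. The sketch below is illustrative only: `finalize_5y` is a hypothetical name, and the `"imputed"` note stands in for the specific `comment` strings the real builder writes.

```python
def finalize_5y(fetch_ok, bf, hhi):
    """Return (bus_factor, hhi, note) for the _5y axis of one repo."""
    if not fetch_ok:
        # Clone/log failed: genuinely unmeasured, so stay blank.
        return None, None, "no git data"
    if bf is None or hhi is None:
        # Fetched fine but no human commit in the window: worst valid pair.
        return 1, 10000, "imputed"
    return bf, hhi, ""


print(finalize_5y(False, None, None))  # (None, None, 'no git data')
print(finalize_5y(True, None, None))   # (1, 10000, 'imputed')
print(finalize_5y(True, 3, 2100.0))    # (3, 2100.0, '')
```

Checking the fetch status before the metrics is what keeps a failed clone from being scored as a dormant project.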