Concentration (risk component)

How dependent is a project on a handful of people? The concentration component measures the distribution of authorship across a repo's contributors and distills it into one concentration-risk score (score, 0–100, higher = more concentrated = more at-risk) that feeds data/risk/risk.csv as the column concentration. Two independent methods — a git-clone commit log and the GitHub /contributors API — each produce a bus factor, an HHI, and contributor counts; only the git _5y axis drives the score.

Scope: the class-A and class-B (A/B) value-class repos in the risk pipeline (see value.md). Build step: src/risk/build_concentration.py.

Scored components: Bus Factor + HHI

concentration.csv carries 25 columns, but the score uses exactly two — both from the git-clone log over the _5y window (2021–2025), each turned into a direction-aware 0–100 risk percentile and combined as their geometric mean:

  • Bus factor (bf_commits_git_5y) — sort the merged, non-bot contributors by commits descending and count how many it takes for their cumulative commit share to reach 50% (bus_factor_threshold). A low bus factor (often 1, a single person covering half the window's commits) means high risk.
  • HHI (hhi_commits_git_5y) — the Herfindahl–Hirschman index of commit shares, 10000 · Σ pᵢ², ranging from 10000 / n (commits spread evenly) up to 10000 (one author writes everything). A high HHI means high risk.
  • No active human in the window — any repo with no human commit in the _5y window (2021–2025) is imputed the worst valid pair, bus factor 1 and HHI 10000: a project with no active human maintainer in five years is a maximal single-point-of-failure, so it ranks as fully concentrated rather than dropping out of the score. This covers the dormant/archived case (comment = no commits in 5y), the bot-only-window case (no human commits in 5y), and the new-repo case (no commits through last complete year). It applies only to successfully-fetched repos — a clone/log failure writes no data and is the only thing left blank, so a fetch failure is never mistaken for a real measurement. As a result the score is fully populated across the risk set.
score = max(1, round( √( bf_commits_git_5y_p × hhi_commits_git_5y_p ) ))
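
The formula above can be exercised with a few hand-picked percentiles. This is a minimal sketch rather than the code in build_concentration.py: the function name `concentration_score` and the choice to return `None` for a blank percentile are assumptions made for illustration.

```python
import math
from typing import Optional


def concentration_score(bf_p: Optional[float], hhi_p: Optional[float]) -> Optional[int]:
    """Geometric mean of the two _5y risk percentiles, floored at 1.

    A blank (None) percentile yields a blank score; that handling is an
    assumption of this sketch, not something the source specifies.
    """
    if bf_p is None or hhi_p is None:
        return None
    return max(1, round(math.sqrt(bf_p * hhi_p)))


print(concentration_score(90, 40))    # 60
print(concentration_score(100, 100))  # 100
print(concentration_score(0, 80))     # 1 (the floor applies)
```

The `max(1, ...)` floor means the lowest score that can actually appear is 1, even though the scale is described as 0–100.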

Every other column — the GitHub /contributors _gh_alltime figures, the git _full lifetime figures, and all contributor/commit counts — is emitted for inspection and cross-checking only. None of them feed score.

Metrics Roadmap

Each leaf is one column with its source and the period it represents. _full = all commits through the last complete year (max(settings.years) = 2025); _5y = the last concentration.window_years (5) complete years, 2021–2025; _gh_alltime = the GitHub API's lifetime count as of fetch, not cut off at the last complete year (no per-year breakdown, and the contributor list is capped near 500). Raw long-format signals are fetched per-source under data/sources/; all derived columns are computed by build_concentration.py. Only the git _5y bus-factor + HHI percentiles feed score.
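
The two git periods follow from the settings by simple arithmetic. The sketch below assumes `settings.years` is an iterable of integer years and uses the hypothetical helpers `window_bounds` and `period_flags`; the year range passed in the example is illustrative only.

```python
from typing import Dict, Iterable, Tuple


def window_bounds(settings_years: Iterable[int], window_years: int = 5) -> Tuple[int, int]:
    """Return (first_year, last_year) of the _5y window, both inclusive."""
    last_complete = max(settings_years)
    return last_complete - window_years + 1, last_complete


def period_flags(year: int, first_year: int, last_year: int) -> Dict[str, bool]:
    """Say which git periods a commit year counts towards."""
    return {
        "_full": year <= last_year,
        "_5y": first_year <= year <= last_year,
    }


first, last = window_bounds(range(2016, 2026))  # hypothetical settings.years
print(first, last)                      # 2021 2025
print(period_flags(2019, first, last))  # {'_full': True, '_5y': False}
print(period_flags(2026, first, last))  # {'_full': False, '_5y': False}
```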

Concentration  → data/risk/concentration.csv  (one row per A/B risk repo)
│
├── git-clone method  (data/sources/git/contributor-commits.csv)
│   ├── _full  (all commits through 2025)
│   │   ├── total_commits_git_full   ← Σ non-merge commits ≤ last complete year   [→2025]
│   │   ├── contributors_git_full    ← derived (merged non-bot identities)        [→2025]
│   │   ├── bf_commits_git_full      ← derived (bus factor)                       [→2025]
│   │   ├── bf_commits_git_full_p    ← derived (risk percentile, low BF → high)   [→2025]
│   │   ├── hhi_commits_git_full     ← derived (HHI, 0–10000)                     [→2025]
│   │   └── hhi_commits_git_full_p   ← derived (risk percentile, high HHI → high) [→2025]
│   └── _5y  (2021–2025 window — the scoring axis)
│       ├── commits_git_5y               ← Σ non-merge commits in window          [2021–2025]
│       ├── active_contributors_git_5y   ← derived (merged non-bot, active in win) [2021–2025]
│       ├── bf_commits_git_5y            ← derived (bus factor)                    [2021–2025]
│       ├── bf_commits_git_5y_p          ← derived (risk percentile)  ★ scores     [2021–2025]
│       ├── hhi_commits_git_5y           ← derived (HHI, 0–10000)                  [2021–2025]
│       └── hhi_commits_git_5y_p         ← derived (risk percentile)  ★ scores     [2021–2025]
│
├── GitHub /contributors method  (data/sources/github/contributor-commits.csv)
│   └── _gh_alltime  (lifetime, list capped ~500 — cross-check only)
│       ├── total_commits_gh_alltime        ← Σ /contributors `contributions`     [lifetime]
│       ├── total_contributors_gh_alltime   ← all /contributors rows (incl. bots) [lifetime]
│       ├── active_contributors_gh_alltime  ← derived (non-bot rows)              [lifetime]
│       ├── bf_commits_gh_alltime           ← derived (bus factor)                [lifetime]
│       ├── bf_commits_gh_alltime_p         ← derived (risk percentile)           [lifetime]
│       ├── hhi_commits_gh_alltime          ← derived (HHI, 0–10000)              [lifetime]
│       └── hhi_commits_gh_alltime_p        ← derived (risk percentile)           [lifetime]
│
└── score  (the component score)  ← geometric mean of bf_commits_git_5y_p
    │                                and hhi_commits_git_5y_p (0–100)
    └─ carried into risk.csv as the column `concentration` (this score only)

How It Works

  1. Collect: two fetchers dump raw long-format per-contributor data into data/sources/. The git fetcher walks git log on a bare clone; the GitHub fetcher calls the /contributors API. Each writes a .status.csv sidecar carrying fetched_at per repo, so a missing metric is distinguishable from a failed fetch.
  2. Join: build_concentration.py joins both sources onto the risk repos by repo slug (and reads data/value/value.csv for the A/B scope).
  3. Derive: for each method, merge contributor identities (git only; /contributors is already keyed by account), drop bots, then compute bus factor, HHI (0–10000), and contributor counts. The git method yields both a lifetime (_full) and a windowed (_5y) figure; the GitHub method yields lifetime only (_gh_alltime).
  4. Score: add_percentiles turns the _5y bus factor and HHI into risk percentiles (bf_commits_git_5y_p, hhi_commits_git_5y_p); score is their geometric mean, an integer on the 0–100 scale, floored at 1 (higher = more concentrated = more risk).
  5. Aggregate: aggregate_risk.py carries only score into risk.csv, renamed to the column concentration.

Pipeline order. The GitHub /contributors fetcher runs inside the risk pipeline; the git long-format fetcher is run separately (it clones repos, so it is decoupled from the API-only pipeline run):

src.sources.git.contributors  (standalone clone-based dump)
                              ↘
src.risk.run_risk_pipeline:  … → contributors (GitHub) → … → concentration (build) → … → aggregate

Collection

Both methods read a long-format raw file plus its status sidecar. Each row of the raw file is one (repo, contributor[, year]) tuple; the builder aggregates over them. Join key into the risk-repo set is repo for both.
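
As an illustration of that aggregation, the sketch below collapses long-format git rows into per-author commit totals for one repo, optionally restricted to a year range. The column names follow the git raw file described in this section; the function name `commits_per_author` and the sample rows are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple


def commits_per_author(
    rows: Iterable[dict],
    repo: str,
    first_year: Optional[int] = None,
    last_year: Optional[int] = None,
) -> Dict[Tuple[str, str], int]:
    """Sum commits per (author_name, author_email) for one repo and year range."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in rows:
        if row["repo"] != repo:
            continue
        year = int(row["year"])
        if first_year is not None and year < first_year:
            continue
        if last_year is not None and year > last_year:
            continue
        totals[(row["author_name"], row["author_email"])] += int(row["commits"])
    return dict(totals)


# Illustrative rows only, not real data.
rows = [
    {"repo": "owner/project", "author_name": "Ada", "author_email": "ada@example.org", "year": "2019", "commits": "40"},
    {"repo": "owner/project", "author_name": "Ada", "author_email": "ada@example.org", "year": "2023", "commits": "10"},
    {"repo": "owner/project", "author_name": "Bob", "author_email": "bob@example.org", "year": "2024", "commits": "5"},
    {"repo": "owner/other", "author_name": "Eve", "author_email": "eve@example.org", "year": "2024", "commits": "9"},
]

full = commits_per_author(rows, "owner/project", last_year=2025)
window = commits_per_author(rows, "owner/project", first_year=2021, last_year=2025)
print(full)    # Ada: 50, Bob: 5
print(window)  # Ada: 10, Bob: 5
```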

| Source file (data/sources/) | Fetcher | Collects | Key |
| --- | --- | --- | --- |
| git/contributor-commits.csv | src/sources/git/contributors.py | long raw: repo, author_name, author_email, year, commits from git log --no-merges on a bare treeless clone (mailmap-resolved %aN/%aE) | repo |
| git/contributor-commits.status.csv | src/sources/git/contributors.py | per-repo git-fetch status + fetched_at | repo |
| github/contributor-commits.csv | src/sources/github/fetch_contributors_metrics.py | long raw: repo, login, contributions, account_type from the /repos/{repo}/contributors endpoint | repo |
| github/contributor-commits.status.csv | src/sources/github/fetch_contributors_metrics.py | per-repo GitHub-fetch status + fetched_at | repo |
| value/value.csv | value pipeline | A/B scope (load_top_repos) | repo |

Two methods, two different lenses

| | git-clone log | GitHub /contributors |
| --- | --- | --- |
| Sees | every committer, with author dates | accounts only, lifetime cumulative |
| Periods | _full and _5y (windowed) | _gh_alltime (lifetime only) |
| Identity | raw name+email pairs → union-find merge | already keyed by GitHub login |
| Limit | times out on kernel-scale mirrors | list capped near 500 → under-counts big repos |
| Role | drives score (the _5y axis) | parallel cross-check, never scored |

The git method is authoritative because it carries author dates (so it can honour a 2021–2025 window) and sees every contributor. The GitHub method has no per-year breakdown and truncates the contributor list near 500 entries, so its columns are deliberately labelled _gh_alltime, never _full/_5y: a lifetime figure as of github_fetched_at that is not cut off at the last complete year and may therefore include the partial current year.

Processing & scoring

Identity merge + bot drop

The git method's raw rows are keyed by mailmap-resolved (author_name, author_email) pairs. The builder additionally union-finds identities that share a normalised email or a full name (merge_identity_groups), so a person who committed under several addresses counts once. Bot identities are then dropped (_is_bot_identity for git, account_type == "Bot" / is_bot(login) for GitHub) before any metric is computed. The GitHub method needs no merging — /contributors is already keyed by account.
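
The merge step can be pictured as a small union-find over identities. The following is a sketch under stated assumptions, not the actual merge_identity_groups: the function name `merge_identities` is hypothetical, "normalised" is taken to mean lower-cased and whitespace-trimmed, and any non-empty name is treated as a full name.

```python
from typing import Dict, List, Tuple


def merge_identities(commits_by_identity: Dict[Tuple[str, str], int]) -> List[int]:
    """Merge (name, email) identities sharing an email or a name.

    Returns the merged per-person commit totals, largest first.
    """
    identities = list(commits_by_identity)
    parent = list(range(len(identities)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen: Dict[Tuple[str, str], int] = {}  # normalised key -> first index
    for idx, (name, email) in enumerate(identities):
        keys = (
            ("email", email.strip().lower()),
            ("name", " ".join(name.lower().split())),
        )
        for key in keys:
            if not key[1]:
                continue  # an empty name or email never links identities
            if key in first_seen:
                union(first_seen[key], idx)
            else:
                first_seen[key] = idx

    merged: Dict[int, int] = {}
    for idx, identity in enumerate(identities):
        root = find(idx)
        merged[root] = merged.get(root, 0) + commits_by_identity[identity]
    return sorted(merged.values(), reverse=True)


# Illustrative identities only.
example = {
    ("Ada Lovelace", "ada@example.org"): 10,
    ("Ada Lovelace", "ada@work.example"): 5,   # same name as the first
    ("A. Lovelace", "ada@work.example"): 2,    # same email as the second
    ("Bob Stone", "bob@example.org"): 4,
}
print(merge_identities(example))  # [17, 4]
```

The third identity shares nothing with the first directly, but is linked through the second; that transitive closure is the reason for using union-find rather than a pairwise comparison.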

Bus factor and HHI

Over the merged, bot-free per-contributor commit counts:

| Metric | Definition |
| --- | --- |
| bus factor (bf_commits_*) | fewest contributors whose combined commits reach bus_factor_threshold (0.5 = the people covering 50% of commits). Low = concentrated. |
| HHI (hhi_commits_*) | Herfindahl–Hirschman index of commit shares, scaled to 0–10000. High = concentrated. |

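Both are undefined (blank, not 0) for a repo with no positive-commit contributor. A literal 0 would falsely rank as both maximum-concentration bus factor and minimum-concentration HHI, so a blank keeps such repos out of both percentile rankings. The exception is the scored _5y axis, where a successfully fetched repo with no human commit in the window is imputed bf = 1 / HHI = 10000 (see Scored components).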
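
Both definitions are short enough to sketch directly. The functions below are illustrative and not taken from build_concentration.py; they take merged, bot-free commit counts and return `None` for the undefined case. Whether the real builder rounds the HHI is not stated, so this sketch returns a float.

```python
from typing import Iterable, Optional


def bus_factor(commits: Iterable[int], threshold: float = 0.5) -> Optional[int]:
    """Fewest contributors whose cumulative commit share reaches `threshold`."""
    counts = sorted((c for c in commits if c > 0), reverse=True)
    total = sum(counts)
    if total == 0:
        return None  # undefined: no positive-commit contributor
    running = 0
    for n, count in enumerate(counts, start=1):
        running += count
        if running >= threshold * total:
            return n
    return len(counts)


def hhi(commits: Iterable[int]) -> Optional[float]:
    """Herfindahl-Hirschman index of commit shares, scaled to 0-10000."""
    counts = [c for c in commits if c > 0]
    total = sum(counts)
    if total == 0:
        return None  # undefined: no positive-commit contributor
    return 10000 * sum((c / total) ** 2 for c in counts)


print(bus_factor([50, 30, 20]), round(hhi([50, 30, 20])))  # 1 3800
print(bus_factor([25, 25, 25, 25]), hhi([25, 25, 25, 25]))  # 2 2500.0
print(bus_factor([]), hhi([]))                              # None None
```

The evenly spread case returns 10000 / n (2500 for four contributors), and a single author returns 10000, matching the range given for the HHI.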

The percentiles (_p)

add_percentiles ranks each metric into a 0–100 risk percentile, with the direction chosen per metric so that more concentrated always ranks higher:

| Column | Basis | higher_is_worse | Direction |
| --- | --- | --- | --- |
| bf_commits_git_5y_p | bf_commits_git_5y | False | low bus factor → high percentile |
| hhi_commits_git_5y_p | hhi_commits_git_5y | True | high HHI → high percentile |
| bf_commits_git_full_p | bf_commits_git_full | False | (not scored) |
| hhi_commits_git_full_p | hhi_commits_git_full | True | (not scored) |
| bf_commits_gh_alltime_p | bf_commits_gh_alltime | False | (not scored) |
| hhi_commits_gh_alltime_p | hhi_commits_gh_alltime | True | (not scored) |
| score | geom mean of the two _5y _p | | the concentration-risk score |

Only the two _5y percentiles compose the score (composite_cols = ["bf_commits_git_5y_p", "hhi_commits_git_5y_p"]). The _full and _gh_alltime percentiles are emitted for inspection but do not feed score. Because the score is a geometric mean, a repo scores as high-risk only when both axes agree it is concentrated: a low percentile on either axis pulls the product down (for example, percentiles of 100 and 25 give 50, against an arithmetic mean of 62.5).
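
The direction flag can be illustrated with a simple rank-based percentile. The exact ranking and tie rule used by add_percentiles is not documented here, so the sketch assumes "share of measured repos that are at most as risky as this one"; the function name `risk_percentiles` is hypothetical.

```python
from typing import List, Optional, Sequence


def risk_percentiles(values: Sequence[Optional[float]], higher_is_worse: bool) -> List[Optional[float]]:
    """Rank values into 0-100 risk percentiles; blanks (None) stay blank."""
    present = [v for v in values if v is not None]
    out: List[Optional[float]] = []
    for v in values:
        if v is None:
            out.append(None)  # unmeasured repos stay out of the ranking
            continue
        if higher_is_worse:
            at_most_as_risky = sum(1 for x in present if x <= v)
        else:
            at_most_as_risky = sum(1 for x in present if x >= v)
        out.append(100 * at_most_as_risky / len(present))
    return out


bus_factors = [1, 3, 8, None]
hhis = [9000, 3000, 1200, None]
bf_p = risk_percentiles(bus_factors, higher_is_worse=False)
hhi_p = risk_percentiles(hhis, higher_is_worse=True)
print([None if p is None else round(p) for p in bf_p])   # [100, 67, 33, None]
print([None if p is None else round(p) for p in hhi_p])  # [100, 67, 33, None]
```

With the flag set per metric, the repo with bus factor 1 and the repo with HHI 9000 both land at the top of their rankings, so "more concentrated" always means "higher percentile".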

Output

data/risk/concentration.csv (per-dimension build)

25 columns, one row per risk repo. No fetched_at value columns — per-method timestamps live in github_fetched_at / git_fetched_at.

| Column | Description |
| --- | --- |
| repo, repo_id | identity |
| total_commits_gh_alltime | Σ of /contributors contributions (incl. bots) |
| total_contributors_gh_alltime | all /contributors rows (incl. bots) |
| active_contributors_gh_alltime | non-bot /contributors rows |
| bf_commits_gh_alltime | bus factor, GitHub method, lifetime |
| bf_commits_gh_alltime_p | risk percentile of bf_commits_gh_alltime |
| hhi_commits_gh_alltime | HHI (0–10000), GitHub method, lifetime |
| hhi_commits_gh_alltime_p | risk percentile of hhi_commits_gh_alltime |
| total_commits_git_full | Σ non-merge commits through 2025 |
| contributors_git_full | merged non-bot identities, _full |
| bf_commits_git_full | bus factor, git method, _full |
| bf_commits_git_full_p | risk percentile of bf_commits_git_full |
| hhi_commits_git_full | HHI (0–10000), git method, _full |
| hhi_commits_git_full_p | risk percentile of hhi_commits_git_full |
| commits_git_5y | Σ non-merge commits in 2021–2025 |
| active_contributors_git_5y | merged non-bot identities active in window |
| bf_commits_git_5y | bus factor, git method, _5y |
| bf_commits_git_5y_p | risk percentile of bf_commits_git_5y (scores) |
| hhi_commits_git_5y | HHI (0–10000), git method, _5y |
| hhi_commits_git_5y_p | risk percentile of hhi_commits_git_5y (scores) |
| score | concentration-risk score (geom-mean of the two _5y _p, 0–100) |
| comment | edge-case note on the _5y axis (auditability), else empty. Imputed as bf=1/HHI=10000: no commits in 5y (dormant), no human commits in 5y (bot-only window), no commits through last complete year (only in-progress-year activity). Left blank, the only unscored case: git fetch <status> / no git data (fetch failed) |
| github_fetched_at | when the GitHub /contributors data was fetched |
| git_fetched_at | when the git-clone log was fetched |

data/risk/risk.csv (aggregate)

aggregate_risk.py carries only this component's score into risk.csv, renamed to the column concentration. Every other column above stays in the per-dimension CSV. risk.csv today is just repo, repo_id, concentration, complexity, security, funding, workload, score — one score per component plus the overall geometric-mean score.

Coverage

See docs/stats.md → Risk → Concentration for current per-signal coverage over the top repos and the score distribution.

Limitations

  • No human in the window is imputed; fetch failures are not. Any repo with no human commit in 2021–2025 (dormant, bot-only, or new) is scored as maximally concentrated (bf = 1, HHI = 10000) — see Scored components. The only thing left blank (out of the ranking) is the genuinely unmeasured: a clone that failed/timed out (kernel-scale mirrors). The distinction is auditable — the imputation lives in git_metrics, which only runs after a successful fetch — and scripts/pipeline_health.py asserts the score is fully populated, so any blank surfaces as a fetch gap to fix rather than passing silently.
  • Commits ≠ effort. Both methods count authored commits, not lines, reviews, triage, or maintenance burden. A reviewer or release manager who rarely commits is invisible, so a repo can read as more concentrated than it truly is.
  • GitHub method under-counts big repos. /contributors caps the list near 500 entries and exposes no per-year breakdown, so its _gh_alltime columns are kept only as a cross-check and never feed score.
  • git method can time out. A bare treeless clone of kernel-scale mirrors (e.g. archlinux/linux) can exceed the fetch budget; the status sidecar records the failure and the repo's git columns stay blank.
  • Identity merge is heuristic. Union-find over shared email/name catches most aliases, but a contributor who never reused an email or canonical name across identities will still be split — inflating contributor count and deflating concentration slightly.
  • score is a percentile, not a class. It is a 0–100 rank within this cohort, not an absolute rating, and it is one of five inputs to the overall risk.csv score (geometric mean of the component scores).
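
The "fully populated" expectation from the first limitation can be spot-checked by hand. The sketch below is not the code in scripts/pipeline_health.py; it is a hypothetical helper, `repos_with_blank_score`, that lists rows of a concentration.csv-shaped file whose score is blank, together with their comment. The sample CSV is illustrative.

```python
import csv
import io
from typing import List, TextIO, Tuple


def repos_with_blank_score(csv_file: TextIO) -> List[Tuple[str, str]]:
    """Return (repo, comment) for every row whose score column is blank."""
    reader = csv.DictReader(csv_file)
    return [
        (row["repo"], row.get("comment") or "")
        for row in reader
        if not (row.get("score") or "").strip()
    ]


# Illustrative content only; a real run would open data/risk/concentration.csv.
sample = io.StringIO(
    "repo,score,comment\n"
    "owner/a,73,\n"
    "owner/b,100,no commits in 5y\n"
    "owner/c,,no git data\n"
)
print(repos_with_blank_score(sample))  # [('owner/c', 'no git data')]
```

Any repo this returns should carry a fetch-failure comment; an imputed repo (such as owner/b above) has a score and does not appear.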