Vectorize participant info computation (3-15x speedup) by jucor · Pull Request #2516 · compdemocracy/polis

jucor · 2026-03-30T22:25:15Z

Summary

Replaces the O(N×G×C) per-participant Python loop in _compute_participant_info_optimized with bulk NumPy operations: matrix-wide vote counting (np.sum over axis) and per-group Pearson correlation via P @ g matrix multiply
Adds 31 unit tests covering vote counts, group correlations, edge cases (small groups, zero-std, NaN handling, missing members), and golden snapshot regression
Correlations now return Python float instead of numpy.float64
Includes a benchmark script (scripts/benchmark_participant_info.py) that runs old vs new on the same data
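
As an illustration of the `P @ g` idea above, here is a minimal sketch of a row-wise Pearson correlation computed with one matrix-vector product. It is not the PR's actual code: the function name `rowwise_pearson`, the interpretation of `g` as one group's per-comment vector, the assumption that missing votes have already been filled with 0, and the choice to return 0.0 for zero-variance rows are all assumptions made for the example.

```python
import numpy as np


def rowwise_pearson(P: np.ndarray, g: np.ndarray) -> np.ndarray:
    """Pearson correlation of every row of P (N x C) with the vector g (C,).

    Hypothetical helper: assumes P and g contain no NaN values.
    """
    n_cols = P.shape[1]
    p_mean = P.mean(axis=1)
    g_mean = g.mean()
    # Covariance numerator for all rows at once:
    # sum_c P[i, c] * g[c] - C * mean(P_i) * mean(g)
    cov = P @ g - n_cols * p_mean * g_mean
    # Population standard deviations (ddof=0), scaled to match the numerator.
    denom = n_cols * P.std(axis=1) * g.std()
    out = np.zeros(P.shape[0])
    # Zero-variance rows have no defined correlation; this sketch leaves them at 0.0.
    np.divide(cov, denom, out=out, where=denom > 1e-12)
    return out


P = np.array([
    [1.0, -1.0, 0.0, 1.0],
    [-1.0, -1.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],  # constant row: zero standard deviation
])
g = np.array([0.5, -0.5, 0.0, 1.0])
r = rowwise_pearson(P, g)
```

The single `P @ g` call replaces one dot product per participant, which is where a loop-based version would spend most of its time in Python overhead.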

Benchmark results

Measured on real datasets (5 runs, median), old per-participant loop vs new vectorized:

Dataset	Size	Old	New	Speedup
vw	69p × 125c × 4g	0.011s	0.001s	14.6x
biodiversity	536p × 314c × 2g	0.047s	0.006s	8.1x
(larger private datasets)				3–6x

Speedup is higher on smaller datasets (loop overhead dominates) and lower on very large ones (matrix materialization dominates). Overall 3–15x depending on size.
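
The "5 runs, median" protocol can be sketched roughly as follows. This is not the contents of `scripts/benchmark_participant_info.py`; the helper names `median_runtime` and `speedup` are hypothetical, and the `runs` validation simply mirrors the kind of check the commit list mentions.

```python
import statistics
import time
from typing import Callable


def median_runtime(fn: Callable[[], object], runs: int = 5) -> float:
    """Call fn `runs` times and return the median wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be >= 1")
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup(old_fn: Callable[[], object], new_fn: Callable[[], object], runs: int = 5) -> float:
    """Ratio of median runtimes (old / new); values above 1 mean new_fn is faster."""
    return median_runtime(old_fn, runs) / median_runtime(new_fn, runs)
```

Using the median rather than the mean keeps a single slow run (for example, a cold cache) from skewing the reported ratio.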

Test plan

31 unit tests pass (pre-vectorization baseline established first, then re-run post)
Golden snapshot regression passes for biodiversity + vw
Full regression test suite passes (40/40)
Benchmark run on all datasets including private (results above)
Max correlation diff across all datasets: < 2e-15
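
A check like "max correlation diff < 2e-15" could be written along these lines. The nested `{participant_id: {group_id: correlation}}` layout and the name `max_abs_diff` are assumptions for this sketch, not the structure the PR's tests necessarily use. The exact type comparison is deliberate: `numpy.float64` is a subclass of Python `float`, so `isinstance` alone would not detect it.

```python
def max_abs_diff(old: dict, new: dict) -> float:
    """Largest absolute difference between two {participant: {group: corr}} mappings.

    Hypothetical helper: also asserts that every new value is a plain Python float.
    """
    assert old.keys() == new.keys()
    worst = 0.0
    for pid, old_groups in old.items():
        new_groups = new[pid]
        assert old_groups.keys() == new_groups.keys()
        for gid, old_val in old_groups.items():
            new_val = new_groups[gid]
            # Exact type check, because numpy.float64 would pass isinstance(x, float).
            assert type(new_val) is float
            worst = max(worst, abs(old_val - new_val))
    return worst


old_result = {1: {0: 0.5, 1: -0.25}, 2: {0: 0.0, 1: 1.0}}
new_result = {1: {0: 0.5 + 1e-16, 1: -0.25}, 2: {0: 0.0, 1: 1.0}}
diff = max_abs_diff(old_result, new_result)
```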

🤖 Generated with Claude Code

Squashed commits

Address Copilot review: clarify MAP estimate, centralize PSEUDO_COUNT imports
Update plan: add stack cross-reference and GitHub PR numbers
Vectorize _compute_participant_info_optimized for ~100x speedup
Add benchmark script for participant info vectorization
Address Copilot review: drop unused cache, validate --runs, tighten type assert
Remove old participant_stats() in favor of vectorized replacement

commit-id:ea747196

Stack:

⚠️ Part of a stack created by spr. Do not merge manually using the UI - doing so may have unexpected results.

Copilot

Pull request overview

Vectorizes _compute_participant_info_optimized in Conversation by replacing the per-participant Python loop with bulk NumPy operations: matrix-wide np.sum for vote counts and a per-group P @ g matmul for Pearson correlation, yielding 3–15x speedup. Removes the now-unused legacy participant_stats() function from repness.py and updates all callers/imports. Adds a comprehensive unit test file and an A/B benchmark script.

Changes:

Replace per-participant loop with vectorized vote counting and matmul-based Pearson correlation in _compute_participant_info_optimized
Delete dead participant_stats() from repness.py; update imports in conversation.py, pca_kmeans_rep/__init__.py, notebooks/run_analysis.py, and 4 test files to call the optimized method instead
Add tests/test_participant_info.py (31 tests covering vote counts, group correlations, edge cases, real-data equivalence, golden snapshot regression) and scripts/benchmark_participant_info.py
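
For the vectorized vote counting mentioned in the first change, a minimal sketch might look like the following. The vote encoding (1 = agree, -1 = disagree, 0 = pass, NaN = not seen) and the name `count_votes` are assumptions for illustration; the actual method in `conversation.py` may differ.

```python
import numpy as np


def count_votes(votes: np.ndarray) -> dict[str, np.ndarray]:
    """Per-participant vote counts for an N x C matrix (hypothetical helper).

    Assumed encoding: 1 = agree, -1 = disagree, 0 = pass, NaN = not seen.
    """
    # NaN compares unequal to every number, so unseen comments drop out of
    # the three equality counts without any special handling.
    agree = np.sum(votes == 1, axis=1)
    disagree = np.sum(votes == -1, axis=1)
    passed = np.sum(votes == 0, axis=1)
    total = np.sum(~np.isnan(votes), axis=1)
    return {"agree": agree, "disagree": disagree, "pass": passed, "total": total}


votes = np.array([
    [1.0, -1.0, np.nan, 0.0],
    [np.nan, np.nan, np.nan, np.nan],  # participant with no votes
    [1.0, 1.0, 1.0, -1.0],
])
counts = count_votes(votes)
```

Each `np.sum(..., axis=1)` handles all participants in one pass, so no per-participant Python loop is needed.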

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
delphi/polismath/conversation/conversation.py	Vectorize participant info using `np.sum` and `P @ g` Pearson formula
delphi/polismath/pca_kmeans_rep/repness.py	Delete dead `participant_stats()`; clarify PSEUDO_COUNT (MAP) comment
delphi/polismath/pca_kmeans_rep/__init__.py	Drop `participant_stats` from public exports
delphi/notebooks/run_analysis.py	Remove unused `participant_stats` import
delphi/tests/test_participant_info.py	New comprehensive unit + golden + real-data equivalence tests
delphi/tests/test_repness_unit.py	Switch `test_participant_stats` to call optimized method; centralize PSEUDO_COUNT import
delphi/tests/test_old_format_repness.py	Same switch and import centralization
delphi/tests/test_repness_smoke.py	Replace `participant_stats` call with `_compute_participant_info_optimized`
delphi/tests/test_legacy_repness_comparison.py	Drop unused `participant_stats` import
delphi/tests/simplified_repness_test.py	Import `PSEUDO_COUNT` from repness instead of hardcoding
delphi/scripts/benchmark_participant_info.py	New A/B benchmark of old loop vs vectorized impl
delphi/docs/PLAN_DISCREPANCY_FIXES.md	Add stack cross-reference table and GitHub PR numbers


github-actions · 2026-05-19T22:17:08Z

Delphi Coverage Report

File	Stmts	Miss	Cover
__init__.py	2	0	100%
benchmarks/bench_pca.py	76	76	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/__init__.py	1	0	100%
components/config.py	165	133	19%
conversation/__init__.py	2	0	100%
conversation/conversation.py	1107	320	71%
conversation/manager.py	131	42	68%
database/__init__.py	1	0	100%
database/dynamodb.py	387	234	40%
database/postgres.py	305	205	33%
pca_kmeans_rep/__init__.py	5	0	100%
pca_kmeans_rep/clusters.py	257	22	91%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	52	16	69%
pca_kmeans_rep/repness.py	297	43	86%
pca_kmeans_rep/stats.py	107	22	79%
regression/__init__.py	4	0	100%
regression/clojure_comparer.py	188	17	91%
regression/comparer.py	887	720	19%
regression/datasets.py	135	27	80%
regression/recorder.py	36	27	25%
regression/utils.py	138	94	32%
run_math_pipeline.py	260	114	56%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	54	52%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	785	785	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/__init__.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/__init__.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/__init__.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	108	108	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/__init__.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/__init__.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	584	477	18%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	62	41	34%
Total	10877	7607	30%

jucor changed the title ~~Vectorize participant info computation (3-15x speedup)~~ [Stack 9/17] Vectorize participant info computation (3-15x speedup) Mar 30, 2026

jucor force-pushed the spr/edge/ea747196 branch from 750e32f to 07c707d on March 30, 2026 22:39

jucor force-pushed the spr/edge/f39f3218 branch from 510205d to b1ebec4 on March 30, 2026 22:47

jucor force-pushed the spr/edge/ea747196 branch 2 times, most recently from e046e08 to b982198 on March 31, 2026 00:35

jucor force-pushed the spr/edge/f39f3218 branch from b1ebec4 to 8637399 on March 31, 2026 00:35

ballPointPenguin approved these changes Apr 26, 2026

jucor requested a review from Copilot May 19, 2026 21:43

Copilot started reviewing on behalf of jucor May 19, 2026 21:44

Copilot AI reviewed May 19, 2026

jucor changed the title ~~[Stack 9/17] Vectorize participant info computation (3-15x speedup)~~ Vectorize participant info computation (3-15x speedup) May 19, 2026

jucor force-pushed the spr/edge/f39f3218 branch from 8637399 to daad2ff on May 19, 2026 22:09

jucor force-pushed the spr/edge/ea747196 branch from b982198 to 280fa1c on May 19, 2026 22:09

jucor wants to merge 1 commit into spr/edge/f39f3218 from spr/edge/ea747196
3 participants