Fix group-cluster serialization: unfold base-cluster IDs to participant IDs (squashed into #2431) by jucor · Pull Request #2433 · compdemocracy/polis

jucor · 2026-03-10T15:47:07Z

Summary

Stacked on #2432 (Cold-start Clojure math blob generation and cluster visualization). Please review and merge #2432 first.
Next in stack: #2419 (Deep analysis of Python-Clojure discrepancies and fix plan)

Fixes a bug where two-level clustering output exposed internal base-cluster IDs instead of participant IDs to downstream consumers and serialization.

Changes

_unfolded_group_clusters() helper in conversation.py: Expands group-cluster members from base-cluster IDs to participant IDs (equivalent to Clojure's clusters/group-members).
Fix 5 downstream call sites feeding conv_repness, participant_stats, and group_votes (in _compute_representativeness, _compute_participant_info, to_dict, among others).
Fix 5 serialization paths: to_dict() (group-clusters, group_clusters, base-clusters), get_full_data() (group_clusters), to_dynamo_dict() (base_clusters/group_clusters).
Tighten Clojure comparison thresholds: 0.95→0.99 Jaccard similarity, 0.05→0.01 distribution tolerance, exact comment priority matching (1e-6 tolerance).
New test_serialization_unfolding.py (TDD: 6/8 fail on the bug, 8/8 pass after fix).
Re-record golden snapshots with correct participant IDs.
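
For readers unfamiliar with the two-level layout, the unfolding step can be sketched roughly as follows. This is not the code from conversation.py: the function name `unfold_group_clusters`, the `id` field, and the sample data are assumptions made for illustration. The only detail taken from the PR is that group clusters keep base-cluster IDs in `members`.

```python
from typing import Dict, List


def unfold_group_clusters(
    group_clusters: List[dict],
    base_clusters: List[dict],
) -> List[dict]:
    """Replace base-cluster IDs in each group's 'members' with participant IDs."""
    # Map each base-cluster ID to the participant IDs it contains.
    base_members: Dict[int, List[int]] = {
        bc["id"]: list(bc["members"]) for bc in base_clusters
    }
    unfolded = []
    for gc in group_clusters:
        participants: List[int] = []
        for base_id in gc["members"]:
            participants.extend(base_members.get(base_id, []))
        # Copy the cluster so the internal two-level structure stays untouched.
        unfolded.append({**gc, "members": participants})
    return unfolded


# Hypothetical data: three base clusters grouped into two group clusters.
base_clusters = [
    {"id": 0, "members": [101, 102]},
    {"id": 1, "members": [103]},
    {"id": 2, "members": [104, 105]},
]
group_clusters = [
    {"id": 0, "members": [0, 2]},  # base-cluster IDs, not participants
    {"id": 1, "members": [1]},
]

unfolded = unfold_group_clusters(group_clusters, base_clusters)
print(unfolded)  # members become [101, 102, 104, 105] and [103]
```

Returning copies instead of mutating in place keeps the internal representation aligned with the Clojure architecture while giving consumers participant IDs.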

Test plan

220 passed, 2 skipped, 4 xfailed, 0 failures
4 former xfails now pass (test_basic_outputs, test_repness_structure)
8 new serialization tests pass
🤖 Generated with Claude Code

Copilot

Pull request overview

This PR fixes a bug where two-level clustering output exposed internal base-cluster IDs instead of participant IDs in group-cluster members. It introduces a _unfolded_group_clusters() helper that expands base-cluster IDs to participant IDs and applies it across all downstream consumers and serialization paths.

Changes:

Added _unfolded_group_clusters() helper in conversation.py and updated the downstream consumers (conv_repness, participant_stats, group_votes) and serialization paths (to_dict, get_full_data, to_dynamo_dict) to use unfolded participant IDs.
Tightened Clojure comparison thresholds (Jaccard 0.95→0.99, distribution tolerance 0.05→0.01, exact comment priority matching) and updated documentation to reflect both Python and Clojure now use two-level clustering.
Added test_serialization_unfolding.py with 8 tests covering all serialization paths, and removed xfail markers from tests that now pass with the fix.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

File	Description
`delphi/polismath/conversation/conversation.py`	Core fix: added `_unfolded_group_clusters()` and updated all serialization/consumer call sites
`delphi/tests/test_serialization_unfolding.py`	New TDD tests verifying serialized cluster members are participant IDs
`delphi/tests/test_repness_smoke.py`	Updated to use `_unfolded_group_clusters()`, removed `xfail` on `test_repness_structure`
`delphi/tests/test_legacy_clojure_regression.py`	Updated clustering comparison to unfold both sides, tightened thresholds, removed `xfail`
`delphi/polismath/regression/clojure_comparer.py`	Updated docstring to reflect both implementations use two-level clustering

+        # Convert group clusters (unfolded: base-cluster IDs → participant IDs)
        base_clusters = []
-        for cluster in self.group_clusters:
+        for cluster in self._unfolded_group_clusters():


ballPointPenguin

good catch!

Creates generate_cold_start_clojure.py to generate fresh cold-start Clojure reference data for fair Python vs Clojure comparison.

The script:
- Stops if math worker is running (prevents conflicts)
- Backs up existing math_main row
- Deletes row to force cold-start (load-or-init creates fresh new-conv)
- Runs Clojure computation via Docker
- Extracts cold-start math blob
- Restores original row automatically

Key features:
- Support for --all flag to process all datasets
- Support for --include-local flag for local datasets
- Automatic zid lookup from report_id via reports table
- Loads configuration from polis-kmeans/.env (DATABASE_URL)

Test infrastructure updates:
- datasets.py now prefers cold-start blobs when available
- Added has_cold_start_blob field to DatasetInfo
- get_dataset_files() uses cold-start blob by default

Documentation updates:
- Comprehensive usage guide in SESSION_HANDOFF_KMEANS.md
- Commands to find and verify cold-start blobs
- Configuration requirements and setup instructions

Reference data:
- Generated cold-start blobs for biodiversity and vw datasets
- Tests will now use these for fair cold vs cold comparison

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

…add visualization with PCA sign flip detection

Cold-start generation (generate_cold_start_clojure.py):
- Rewrite using "conversation replay" approach that works with Clojure poller design
- Creates temporary conversation with fresh zid, copies votes with fresh timestamps
- Runs poller with MATH_ZID_ALLOWLIST to only process the replayed conversation
- Automatically cleans up all temporary data (math tables, votes, conversation)
- Add bash wrapper script (generate_cold_start.sh) that stops math containers first

Visualization (visualize_cluster_comparison.py):
- Add PCA sign flip detection by comparing component correlations
- Apply sign corrections to base cluster centers before visualization
- Fix convex hull rendering to show outlines for both datasets in overlay view
- Include sign_flips in metrics JSON output

Documentation:
- Update SESSION_HANDOFF_KMEANS.md with new approach and remove "BROKEN" warnings
- Document the conversation replay workflow and cleanup behavior

Regenerate cold-start blobs for biodiversity and vw datasets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
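
The sign flip detection mentioned above exists because a principal component is only defined up to sign, so two implementations can return mirrored axes. A minimal sketch of the idea, assuming each component is a plain list of loadings and using Pearson correlation as the comparison; the function names and sample numbers are hypothetical and not taken from visualize_cluster_comparison.py:

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def detect_sign_flips(
    reference: List[Sequence[float]], candidate: List[Sequence[float]]
) -> List[int]:
    """Return -1 for each component that is anti-correlated with the reference, else +1."""
    return [-1 if pearson(r, c) < 0 else 1 for r, c in zip(reference, candidate)]


def apply_sign_flips(
    points: List[Sequence[float]], signs: Sequence[int]
) -> List[List[float]]:
    """Multiply each coordinate by the sign found for its component."""
    return [[s * v for s, v in zip(signs, p)] for p in points]


# Hypothetical components: the first axis is mirrored in the candidate.
reference_components = [[0.5, -0.2, 0.8], [0.1, 0.9, -0.3]]
candidate_components = [[-0.5, 0.2, -0.8], [0.1, 0.9, -0.3]]
signs = detect_sign_flips(reference_components, candidate_components)

# Cluster centers in the candidate's coordinates, corrected before plotting.
centers = [[1.0, 2.0], [-3.0, 0.5]]
corrected = apply_sign_flips(centers, signs)
print(signs, corrected)
```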

Cold-start generator (generate_cold_start_clojure.py):
- Add --pause-math to pause/resume workers instead of stopping
- Add --verbose/-v for real-time Clojure poller output
- Support multiple datasets as arguments
- Use fast INSERT...SELECT for vote copying (was executemany)
- Handle duplicate votes with DISTINCT ON
- Remove shell wrapper (functionality now in Python script)

Cluster visualizer (visualize_cluster_comparison.py):
- Add --all option for processing all datasets
- Synchronize X/Y axis limits in side-by-side plots
- Print full absolute paths for generated PNGs
- Support multiple datasets as arguments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Deferred from commit 18ad361 — the test_datasets.py changes depend on the has_cold_start_blob field introduced in datasets.py by the cold-start tooling. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…am consumers

Group clusters store base-cluster IDs in 'members' (matching Clojure's two-level clustering architecture), but downstream functions (conv_repness, participant_stats, group_votes) need participant IDs to join against the vote matrix.

Add _unfolded_group_clusters() helper (equivalent to Clojure's clusters/group-members) and use it in all 5 call sites:
- _compute_repness
- _compute_participant_info
- _compute_group_votes
- to_dict group-votes
- to_dynamo_dict group_votes

Also re-record golden snapshots and remove xfail from test_repness_structure (now passes with correct unfolding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
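
To see why the join against the vote matrix needs participant IDs, consider this small sketch. The vote encoding (1 for agree, -1 for disagree), the `A`/`D`/`S` keys, and the function name are assumptions for illustration and are not copied from conversation.py.

```python
from typing import Dict, List

AGREE, DISAGREE = 1, -1  # vote encoding assumed for this sketch


def tally_group_votes(
    votes: Dict[int, Dict[int, int]], members: List[int], comment_id: int
) -> Dict[str, int]:
    """Count agrees (A), disagrees (D) and votes seen (S) for one comment in one group."""
    counts = {"A": 0, "D": 0, "S": 0}
    for participant_id in members:
        vote = votes.get(participant_id, {}).get(comment_id)
        if vote is None:
            continue  # this participant did not vote on the comment
        counts["S"] += 1
        if vote == AGREE:
            counts["A"] += 1
        elif vote == DISAGREE:
            counts["D"] += 1
    return counts


# Votes are keyed by participant ID, then by comment ID (hypothetical data).
votes = {101: {7: 1}, 102: {7: -1}, 103: {7: 1}, 104: {7: 1}, 105: {}}

with_participant_ids = tally_group_votes(votes, [101, 102, 104, 105], comment_id=7)
with_base_cluster_ids = tally_group_votes(votes, [0, 2], comment_id=7)
print(with_participant_ids, with_base_cluster_ids)
```

With base-cluster IDs the lookup finds no rows here. In real data, where participant IDs are also small integers, it could instead match the wrong participants without raising any error.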

- test_group_clustering: unfold Python clusters via _unfolded_group_clusters() (both sides now use two-level clustering), tighten thresholds to 0.99 Jaccard / 0.01 distribution tolerance, require overall_match
- test_comment_priorities: require exact match (1e-6 tolerance) for all comment IDs instead of 70% at 20% tolerance
- clojure_comparer: fix docstring to reflect Python also uses two-level clustering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
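
The tightened checks can be illustrated with a short sketch. It assumes clusters are compared as sets of participant IDs and that each Python cluster is paired with its best-matching Clojure cluster; that pairing strategy and the function names are assumptions, not the logic in test_legacy_clojure_regression.py. Only the 0.99 and 1e-6 values come from the commit message.

```python
from typing import Dict, Iterable, List

JACCARD_THRESHOLD = 0.99   # from the commit message
PRIORITY_TOLERANCE = 1e-6  # from the commit message


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two member lists."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def best_match_similarities(
    py_clusters: List[List[int]], clj_clusters: List[List[int]]
) -> List[float]:
    """For each Python cluster, the highest Jaccard score against any Clojure cluster."""
    return [
        max((jaccard(p, c) for c in clj_clusters), default=0.0)
        for p in py_clusters
    ]


def priorities_match(
    py: Dict[int, float], clj: Dict[int, float], tol: float = PRIORITY_TOLERANCE
) -> bool:
    """True when both sides cover the same comment IDs and every value agrees within tol."""
    return py.keys() == clj.keys() and all(abs(py[k] - clj[k]) <= tol for k in py)


same = best_match_similarities([[1, 2, 3], [4, 5]], [[4, 5], [1, 2, 3]])
shifted = best_match_similarities([[1, 2, 3], [4, 5]], [[1, 2], [3, 4, 5]])
print(all(s >= JACCARD_THRESHOLD for s in same))     # identical partitions pass
print(all(s >= JACCARD_THRESHOLD for s in shifted))  # one moved participant fails
```

At a 0.99 threshold a single misplaced participant fails any cluster with fewer than about a hundred members, which is why both sides must be unfolded to participant IDs before comparing.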

to_dict(), get_full_data(), and to_dynamo_dict() were writing self.group_clusters directly, whose 'members' contain base-cluster IDs (integers 0..N). Downstream consumers (group_data.py, Clojure compat, client apps) expect participant IDs.

The internal helper _unfolded_group_clusters() already existed and was used for repness/participant_info/group_votes computation, but the five serialization sites were missed.

Fix all five sites to unfold before serializing:
- to_dict(): group-clusters, group_clusters, base-clusters
- get_full_data(): group_clusters
- to_dynamo_dict(): base_clusters / group_clusters

Add test_serialization_unfolding.py (TDD: 6 tests fail on the bug, 8/8 pass after fix) using real recompute() pipeline output.

Re-record golden snapshots to reflect the corrected output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
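
The invariant such serialization tests check can be sketched as below. This is not the content of test_serialization_unfolding.py: the helper name, the `id` field, and the sample payloads are hypothetical, and the sketch assumes every listed participant belongs to some group cluster.

```python
from typing import Iterable, List


def find_unfolding_problems(
    serialized: dict, participant_ids: Iterable[int], key: str = "group_clusters"
) -> List[str]:
    """List the ways serialized cluster members fail to be participant IDs."""
    known = set(participant_ids)
    seen = set()
    problems = []
    for cluster in serialized.get(key, []):
        for member in cluster["members"]:
            if member not in known:
                problems.append(
                    f"cluster {cluster['id']}: {member} is not a participant ID"
                )
            seen.add(member)
    # Assumes every participant should appear in some group cluster.
    missing = known - seen
    if missing:
        problems.append(f"participants missing from every cluster: {sorted(missing)}")
    return problems


participants = [101, 102, 103, 104, 105]

# Hypothetical payloads before and after the fix.
buggy = {"group_clusters": [{"id": 0, "members": [0, 2]}, {"id": 1, "members": [1]}]}
fixed = {
    "group_clusters": [
        {"id": 0, "members": [101, 102, 104, 105]},
        {"id": 1, "members": [103]},
    ]
}

print(find_unfolding_problems(buggy, participants))
print(find_unfolding_problems(fixed, participants))
```

The coverage check matters because base-cluster IDs and participant IDs can both be small integers, so a membership test alone could pass by coincidence.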

github-actions · 2026-03-16T16:14:15Z

Delphi Coverage Report

File	Stmts	Miss	Cover
__init__.py	3	0	100%
main.py	55	55	0%
benchmarks/bench_pca.py	76	76	0%
benchmarks/bench_repness.py	81	81	0%
benchmarks/bench_update_votes.py	38	38	0%
benchmarks/benchmark_utils.py	34	34	0%
components/__init__.py	2	0	100%
components/config.py	165	133	19%
components/server.py	116	72	38%
conversation/__init__.py	2	0	100%
conversation/conversation.py	1119	336	70%
conversation/manager.py	131	42	68%
database/__init__.py	1	0	100%
database/dynamodb.py	387	233	40%
database/postgres.py	306	205	33%
pca_kmeans_rep/__init__.py	5	0	100%
pca_kmeans_rep/clusters.py	265	22	92%
pca_kmeans_rep/corr.py	98	17	83%
pca_kmeans_rep/pca.py	50	15	70%
pca_kmeans_rep/repness.py	361	48	87%
pca_kmeans_rep/stats.py	107	22	79%
poller.py	224	188	16%
regression/__init__.py	5	0	100%
regression/clojure_comparer.py	182	83	54%
regression/comparer.py	887	403	55%
regression/datasets.py	103	22	79%
regression/recorder.py	36	27	25%
regression/utils.py	137	38	72%
run_math_pipeline.py	260	114	56%
system.py	85	55	35%
umap_narrative/500_generate_embedding_umap_cluster.py	210	109	48%
umap_narrative/501_calculate_comment_extremity.py	112	54	52%
umap_narrative/502_calculate_priorities.py	135	135	0%
umap_narrative/700_datamapplot_for_layer.py	502	502	0%
umap_narrative/701_static_datamapplot_for_layer.py	310	310	0%
umap_narrative/702_consensus_divisive_datamapplot.py	432	432	0%
umap_narrative/801_narrative_report_batch.py	787	787	0%
umap_narrative/802_process_batch_results.py	265	265	0%
umap_narrative/803_check_batch_status.py	175	175	0%
umap_narrative/llm_factory_constructor/__init__.py	2	2	0%
umap_narrative/llm_factory_constructor/model_provider.py	157	157	0%
umap_narrative/polismath_commentgraph/__init__.py	1	0	100%
umap_narrative/polismath_commentgraph/cli.py	270	270	0%
umap_narrative/polismath_commentgraph/core/__init__.py	3	3	0%
umap_narrative/polismath_commentgraph/core/clustering.py	110	110	0%
umap_narrative/polismath_commentgraph/core/embedding.py	104	104	0%
umap_narrative/polismath_commentgraph/lambda_handler.py	219	219	0%
umap_narrative/polismath_commentgraph/schemas/__init__.py	2	0	100%
umap_narrative/polismath_commentgraph/schemas/dynamo_models.py	160	9	94%
umap_narrative/polismath_commentgraph/tests/conftest.py	17	17	0%
umap_narrative/polismath_commentgraph/tests/test_clustering.py	74	74	0%
umap_narrative/polismath_commentgraph/tests/test_embedding.py	55	55	0%
umap_narrative/polismath_commentgraph/tests/test_storage.py	87	87	0%
umap_narrative/polismath_commentgraph/utils/__init__.py	3	0	100%
umap_narrative/polismath_commentgraph/utils/converter.py	283	237	16%
umap_narrative/polismath_commentgraph/utils/group_data.py	354	336	5%
umap_narrative/polismath_commentgraph/utils/storage.py	585	477	18%
umap_narrative/reset_conversation.py	159	50	69%
umap_narrative/run_pipeline.py	453	312	31%
utils/general.py	63	41	35%
Total	11410	7688	33%

jucor · 2026-03-19T10:55:08Z

Squashed into #2431 (Stack 3, jc/two-level-clustering). The unfolding fix is now part of the two-level clustering PR that introduced the code it fixes.

jucor · 2026-03-19T10:56:06Z

Thanks @ballPointPenguin !

This was referenced Mar 10, 2026

[Stack 5/25] Cold-start Clojure math blob generation and cluster visualization #2432

Closed

[Stack 8/27] Deep analysis of Python-Clojure discrepancies and fix plan #2419

Closed

[Stack 9/27] Per-discrepancy test infrastructure #2420

Closed

jucor requested review from ballPointPenguin and whilo March 10, 2026 16:08

jucor changed the title ~~Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/8] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 10, 2026

jucor changed the title ~~[Stack 5/8] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/9] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026

jucor mentioned this pull request Mar 11, 2026

[Stack 11/27] Fix D4: pseudocount formula #2435

Closed

5 tasks

jucor changed the title ~~[Stack 5/9] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/10] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026

jucor changed the title ~~[Stack 5/10] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/11] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026

jucor requested a review from Copilot March 13, 2026 12:37

Copilot started reviewing on behalf of jucor March 13, 2026 12:37

Copilot AI reviewed Mar 13, 2026

jucor changed the title ~~[Stack 5/11] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/12] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 13, 2026

jucor changed the title ~~[Stack 5/12] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/13] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 13, 2026

ballPointPenguin approved these changes Mar 14, 2026

jucor changed the title ~~[Stack 5/13] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/15] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026

jucor and others added 9 commits March 16, 2026 16:04

Compute cold-start math blobs for kmeans

d855d68

Interrupt upon Clojure error

a012017

test_datasets: add cold_start blob fixture and has_cold_start arg

04398a1

jucor force-pushed the jc/cold-start-tooling branch from ab15ae0 to 04398a1 Compare March 16, 2026 16:04

jucor force-pushed the jc/participant-id-unfolding branch from b2f145d to 48afbe4 Compare March 16, 2026 16:04

jucor changed the title ~~[Stack 5/15] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/16] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026

jucor changed the title ~~[Stack 5/16] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/17] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026

jucor changed the title ~~[Stack 5/17] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/24] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 17, 2026

jucor changed the title ~~[Stack 5/24] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ [Stack 5/25] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 17, 2026

jucor force-pushed the jc/cold-start-tooling branch from 04398a1 to ad6cf37 Compare March 19, 2026 10:42

jucor closed this Mar 19, 2026

jucor changed the title ~~[Stack 5/25] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs~~ Fix group-cluster serialization: unfold base-cluster IDs to participant IDs (squashed into #2431) Mar 19, 2026

jucor wants to merge 9 commits into jc/cold-start-tooling from jc/participant-id-unfolding

3 participants
