Fix group-cluster serialization: unfold base-cluster IDs to participant IDs (squashed into #2431)#2433

Closed
jucor wants to merge 9 commits into jc/cold-start-tooling from jc/participant-id-unfolding

@jucor
Collaborator

@jucor jucor commented Mar 10, 2026

Summary

Stacked on #2432 (Cold-start Clojure math blob generation and cluster visualization). Please review and merge #2432 first.
Next in stack: #2419 (Deep analysis of Python-Clojure discrepancies and fix plan)

Fixes a bug where two-level clustering output exposed internal base-cluster IDs instead of participant IDs to downstream consumers and serialization.

Changes

  • _unfolded_group_clusters() helper in conversation.py: Expands group-cluster members from base-cluster IDs to participant IDs (equivalent to Clojure's clusters/group-members).
  • Fix 5 downstream call sites that need participant IDs: the conv_repness, participant_stats, and group_votes computations (in _compute_representativeness, _compute_participant_info, to_dict).
  • Fix 5 serialization paths: to_dict() (group-clusters, group_clusters, base-clusters), get_full_data() (group_clusters), to_dynamo_dict() (base_clusters/group_clusters).
  • Tighten Clojure comparison thresholds: 0.95→0.99 Jaccard similarity, 0.05→0.01 distribution tolerance, exact comment priority matching (1e-6 tolerance).
  • New test_serialization_unfolding.py (TDD: 6/8 fail on the bug, 8/8 pass after fix).
  • Re-record golden snapshots with correct participant IDs.
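
The unfolding step described above can be illustrated with a minimal sketch. The data layout (clusters as dicts with "id" and "members" keys) and the function name unfold_group_clusters are assumptions made for illustration; the actual _unfolded_group_clusters() in conversation.py may be structured differently.

```python
def unfold_group_clusters(group_clusters, base_clusters):
    """Expand each group cluster's members from base-cluster IDs to participant IDs.

    Hypothetical stand-in for the helper described in this PR.
    """
    # Map each base-cluster ID to the participant IDs it contains.
    members_by_base_id = {bc["id"]: bc["members"] for bc in base_clusters}

    unfolded = []
    for gc in group_clusters:
        participant_ids = []
        for base_id in gc["members"]:
            participant_ids.extend(members_by_base_id.get(base_id, []))
        # Build a new dict so the internal two-level structure stays untouched.
        unfolded.append({**gc, "members": participant_ids})
    return unfolded


# Hypothetical data: two base clusters folded into a single group cluster.
base_clusters = [
    {"id": 0, "members": [101, 102]},
    {"id": 1, "members": [205]},
]
group_clusters = [{"id": 0, "members": [0, 1]}]

unfolded = unfold_group_clusters(group_clusters, base_clusters)
print(unfolded)
```

Returning copies rather than mutating self.group_clusters keeps the internal representation (base-cluster IDs) intact, which is presumably why the fix is applied at the consumer and serialization call sites.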

Test plan

  • 220 passed, 2 skipped, 4 xfailed, 0 failures
  • 4 former xfails now pass (test_basic_outputs, test_repness_structure)
  • 8 new serialization tests pass

🤖 Generated with Claude Code

@jucor jucor requested review from ballPointPenguin and whilo March 10, 2026 16:08
@jucor jucor changed the title Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/8] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 10, 2026
@jucor jucor changed the title [Stack 5/8] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/9] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026
@jucor jucor changed the title [Stack 5/9] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/10] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026
@jucor jucor changed the title [Stack 5/10] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/11] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 11, 2026
@jucor jucor requested a review from Copilot March 13, 2026 12:37
Contributor

Copilot AI left a comment

Pull request overview

This PR fixes a bug where two-level clustering output exposed internal base-cluster IDs instead of participant IDs in group-cluster members. It introduces a _unfolded_group_clusters() helper that expands base-cluster IDs to participant IDs and applies it across all downstream consumers and serialization paths.

Changes:

  • Added _unfolded_group_clusters() helper in conversation.py and updated the downstream consumers (conv_repness, participant_stats, group_votes) and serialization paths (to_dict, get_full_data, to_dynamo_dict) to use unfolded participant IDs.
  • Tightened Clojure comparison thresholds (Jaccard 0.95→0.99, distribution tolerance 0.05→0.01, exact comment priority matching) and updated documentation to reflect both Python and Clojure now use two-level clustering.
  • Added test_serialization_unfolding.py with 8 tests covering all serialization paths, and removed xfail markers from tests that now pass with the fix.

Reviewed changes

Copilot reviewed 5 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
|---|---|
| delphi/polismath/conversation/conversation.py | Core fix: added _unfolded_group_clusters() and updated all serialization/consumer call sites |
| delphi/tests/test_serialization_unfolding.py | New TDD tests verifying serialized cluster members are participant IDs |
| delphi/tests/test_repness_smoke.py | Updated to use _unfolded_group_clusters(), removed xfail on test_repness_structure |
| delphi/tests/test_legacy_clojure_regression.py | Updated clustering comparison to unfold both sides, tightened thresholds, removed xfail |
| delphi/polismath/regression/clojure_comparer.py | Updated docstring to reflect both implementations use two-level clustering |

  # Convert group clusters (unfolded: base-cluster IDs → participant IDs)
  base_clusters = []
- for cluster in self.group_clusters:
+ for cluster in self._unfolded_group_clusters():
@jucor jucor changed the title [Stack 5/11] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/12] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 13, 2026
@jucor jucor changed the title [Stack 5/12] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/13] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 13, 2026
Member

@ballPointPenguin ballPointPenguin left a comment

good catch!

@jucor jucor changed the title [Stack 5/13] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/15] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026
jucor and others added 9 commits March 16, 2026 16:04
Creates generate_cold_start_clojure.py to generate fresh cold-start Clojure
reference data for fair Python vs Clojure comparison. The script:
- Stops if math worker is running (prevents conflicts)
- Backs up existing math_main row
- Deletes row to force cold-start (load-or-init creates fresh new-conv)
- Runs Clojure computation via Docker
- Extracts cold-start math blob
- Restores original row automatically

Key features:
- Support for --all flag to process all datasets
- Support for --include-local flag for local datasets
- Automatic zid lookup from report_id via reports table
- Loads configuration from polis-kmeans/.env (DATABASE_URL)

Test infrastructure updates:
- datasets.py now prefers cold-start blobs when available
- Added has_cold_start_blob field to DatasetInfo
- get_dataset_files() uses cold-start blob by default

Documentation updates:
- Comprehensive usage guide in SESSION_HANDOFF_KMEANS.md
- Commands to find and verify cold-start blobs
- Configuration requirements and setup instructions

Reference data:
- Generated cold-start blobs for biodiversity and vw datasets
- Tests will now use these for fair cold vs cold comparison

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…add visualization with PCA sign flip detection

Cold-start generation (generate_cold_start_clojure.py):
- Rewrite using "conversation replay" approach that works with Clojure poller design
- Creates temporary conversation with fresh zid, copies votes with fresh timestamps
- Runs poller with MATH_ZID_ALLOWLIST to only process the replayed conversation
- Automatically cleans up all temporary data (math tables, votes, conversation)
- Add bash wrapper script (generate_cold_start.sh) that stops math containers first

Visualization (visualize_cluster_comparison.py):
- Add PCA sign flip detection by comparing component correlations
- Apply sign corrections to base cluster centers before visualization
- Fix convex hull rendering to show outlines for both datasets in overlay view
- Include sign_flips in metrics JSON output

Documentation:
- Update SESSION_HANDOFF_KMEANS.md with new approach and remove "BROKEN" warnings
- Document the conversation replay workflow and cleanup behavior

Regenerate cold-start blobs for biodiversity and vw datasets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
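
As an illustration of the PCA sign flip detection mentioned in this commit, here is a minimal pure-Python sketch. It assumes components are plain lists of loadings and uses the sign of the covariance (which matches the sign of the Pearson correlation) to decide whether a component is flipped. The function names and sample values are hypothetical; visualize_cluster_comparison.py may do this differently.

```python
def is_negatively_correlated(u, v):
    """True when the covariance of u and v is negative (same sign as Pearson r)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    return cov < 0


def detect_sign_flips(reference, candidate):
    """One boolean per component: True where candidate points opposite to reference."""
    return [is_negatively_correlated(r, c) for r, c in zip(reference, candidate)]


def apply_sign_flips(points, flips):
    """Negate each point's coordinate along every flipped component."""
    return [[-x if flip else x for x, flip in zip(point, flips)] for point in points]


# Hypothetical loadings: the first component is mirrored, the second agrees.
clojure_components = [[1.0, 2.0, 3.0], [1.0, -1.0, 0.0]]
python_components = [[-1.0, -2.0, -3.0], [1.0, -1.0, 0.0]]
flips = detect_sign_flips(clojure_components, python_components)

# Hypothetical base-cluster centers in the Python projection.
python_centers = [[0.5, 2.0], [-1.5, 0.25]]
aligned = apply_sign_flips(python_centers, flips)
print(flips, aligned)
```

PCA components are only defined up to sign, so two correct implementations can produce mirrored projections; correcting the sign before plotting avoids reporting a mirror image as a clustering difference.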
Cold-start generator (generate_cold_start_clojure.py):
- Add --pause-math to pause/resume workers instead of stopping
- Add --verbose/-v for real-time Clojure poller output
- Support multiple datasets as arguments
- Use fast INSERT...SELECT for vote copying (was executemany)
- Handle duplicate votes with DISTINCT ON
- Remove shell wrapper (functionality now in Python script)

Cluster visualizer (visualize_cluster_comparison.py):
- Add --all option for processing all datasets
- Synchronize X/Y axis limits in side-by-side plots
- Print full absolute paths for generated PNGs
- Support multiple datasets as arguments

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Deferred from commit 18ad361 — the test_datasets.py changes depend on the
has_cold_start_blob field introduced in datasets.py by the cold-start tooling.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…am consumers

Group clusters store base-cluster IDs in 'members' (matching Clojure's
two-level clustering architecture), but downstream functions (conv_repness,
participant_stats, group_votes) need participant IDs to join against the
vote matrix.

Add _unfolded_group_clusters() helper (equivalent to Clojure's
clusters/group-members) and use it in all 5 call sites:
- _compute_repness
- _compute_participant_info
- _compute_group_votes
- to_dict group-votes
- to_dynamo_dict group_votes

Also re-record golden snapshots and remove xfail from test_repness_structure
(now passes with correct unfolding).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- test_group_clustering: unfold Python clusters via _unfolded_group_clusters()
  (both sides now use two-level clustering), tighten thresholds to 0.99
  Jaccard / 0.01 distribution tolerance, require overall_match
- test_comment_priorities: require exact match (1e-6 tolerance) for all
  comment IDs instead of 70% at 20% tolerance
- clojure_comparer: fix docstring to reflect Python also uses two-level
  clustering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
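
To make the tightened thresholds concrete, the following sketch shows a Jaccard similarity check and a per-comment tolerance check. The function names, the use of >= for the threshold, and the sample data are assumptions for illustration; the actual assertions in test_legacy_clojure_regression.py are not shown in this PR page.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two member collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Two empty clusters are treated as identical.
    return len(a & b) / len(a | b)


def priorities_match(python_priorities, clojure_priorities, tol=1e-6):
    """True when both sides cover the same comment IDs and agree within tol."""
    if python_priorities.keys() != clojure_priorities.keys():
        return False
    return all(
        abs(python_priorities[cid] - clojure_priorities[cid]) <= tol
        for cid in python_priorities
    )


# Hypothetical clusters: 100 participants on one side, 96 of them on the other.
python_group = set(range(100))
clojure_group = set(range(4, 100))
score = jaccard(python_group, clojure_group)

passes_old_threshold = score >= 0.95
passes_new_threshold = score >= 0.99
print(score, passes_old_threshold, passes_new_threshold)
```

In this example a cluster pair that differs by 4 participants out of 100 would have passed the old 0.95 threshold but fails at 0.99, which is the kind of small mismatch the tighter thresholds are meant to surface.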
to_dict(), get_full_data(), and to_dynamo_dict() were writing
self.group_clusters directly, whose 'members' contain base-cluster IDs
(integers 0..N).  Downstream consumers (group_data.py, Clojure compat,
client apps) expect participant IDs.

The internal helper _unfolded_group_clusters() already existed and was
used for repness/participant_info/group_votes computation, but the five
serialization sites were missed.

Fix all five sites to unfold before serializing:
- to_dict():        group-clusters, group_clusters, base-clusters
- get_full_data():  group_clusters
- to_dynamo_dict(): base_clusters / group_clusters

Add test_serialization_unfolding.py (TDD: 6 tests fail on the bug,
8/8 pass after fix) using real recompute() pipeline output.

Re-record golden snapshots to reflect the corrected output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
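
The kind of check such a serialization test performs can be sketched as follows. The helper name, the dict layout, and the sample IDs are hypothetical; the real tests in test_serialization_unfolding.py run against recompute() output and may assert differently.

```python
def non_participant_members(clusters, participant_ids):
    """Return the sorted member IDs that are not known participant IDs."""
    known = set(participant_ids)
    return sorted({m for c in clusters for m in c["members"] if m not in known})


# Hypothetical conversation with three participants.
participant_ids = [101, 102, 205]

# Before the fix: members are base-cluster IDs (0..N).
buggy_group_clusters = [{"id": 0, "members": [0, 1]}]
# After the fix: members are participant IDs.
fixed_group_clusters = [{"id": 0, "members": [101, 102, 205]}]

leaked_before = non_participant_members(buggy_group_clusters, participant_ids)
leaked_after = non_participant_members(fixed_group_clusters, participant_ids)
print(leaked_before, leaked_after)
```

A membership check like this only discriminates when base-cluster IDs and participant IDs do not overlap. If participant IDs also start at 0, a stricter check (for example, comparing the union of members against the full set of clustered participants) would be needed.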
@jucor jucor force-pushed the jc/cold-start-tooling branch from ab15ae0 to 04398a1 Compare March 16, 2026 16:04
@jucor jucor force-pushed the jc/participant-id-unfolding branch from b2f145d to 48afbe4 Compare March 16, 2026 16:04
@jucor jucor changed the title [Stack 5/15] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/16] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026
@github-actions

Delphi Coverage Report

| File | Stmts | Miss | Cover |
|---|---:|---:|---:|
| \_\_init\_\_.py | 3 | 0 | 100% |
| \_\_main\_\_.py | 55 | 55 | 0% |
| benchmarks/bench_pca.py | 76 | 76 | 0% |
| benchmarks/bench_repness.py | 81 | 81 | 0% |
| benchmarks/bench_update_votes.py | 38 | 38 | 0% |
| benchmarks/benchmark_utils.py | 34 | 34 | 0% |
| components/\_\_init\_\_.py | 2 | 0 | 100% |
| components/config.py | 165 | 133 | 19% |
| components/server.py | 116 | 72 | 38% |
| conversation/\_\_init\_\_.py | 2 | 0 | 100% |
| conversation/conversation.py | 1119 | 336 | 70% |
| conversation/manager.py | 131 | 42 | 68% |
| database/\_\_init\_\_.py | 1 | 0 | 100% |
| database/dynamodb.py | 387 | 233 | 40% |
| database/postgres.py | 306 | 205 | 33% |
| pca_kmeans_rep/\_\_init\_\_.py | 5 | 0 | 100% |
| pca_kmeans_rep/clusters.py | 265 | 22 | 92% |
| pca_kmeans_rep/corr.py | 98 | 17 | 83% |
| pca_kmeans_rep/pca.py | 50 | 15 | 70% |
| pca_kmeans_rep/repness.py | 361 | 48 | 87% |
| pca_kmeans_rep/stats.py | 107 | 22 | 79% |
| poller.py | 224 | 188 | 16% |
| regression/\_\_init\_\_.py | 5 | 0 | 100% |
| regression/clojure_comparer.py | 182 | 83 | 54% |
| regression/comparer.py | 887 | 403 | 55% |
| regression/datasets.py | 103 | 22 | 79% |
| regression/recorder.py | 36 | 27 | 25% |
| regression/utils.py | 137 | 38 | 72% |
| run_math_pipeline.py | 260 | 114 | 56% |
| system.py | 85 | 55 | 35% |
| umap_narrative/500_generate_embedding_umap_cluster.py | 210 | 109 | 48% |
| umap_narrative/501_calculate_comment_extremity.py | 112 | 54 | 52% |
| umap_narrative/502_calculate_priorities.py | 135 | 135 | 0% |
| umap_narrative/700_datamapplot_for_layer.py | 502 | 502 | 0% |
| umap_narrative/701_static_datamapplot_for_layer.py | 310 | 310 | 0% |
| umap_narrative/702_consensus_divisive_datamapplot.py | 432 | 432 | 0% |
| umap_narrative/801_narrative_report_batch.py | 787 | 787 | 0% |
| umap_narrative/802_process_batch_results.py | 265 | 265 | 0% |
| umap_narrative/803_check_batch_status.py | 175 | 175 | 0% |
| umap_narrative/llm_factory_constructor/\_\_init\_\_.py | 2 | 2 | 0% |
| umap_narrative/llm_factory_constructor/model_provider.py | 157 | 157 | 0% |
| umap_narrative/polismath_commentgraph/\_\_init\_\_.py | 1 | 0 | 100% |
| umap_narrative/polismath_commentgraph/cli.py | 270 | 270 | 0% |
| umap_narrative/polismath_commentgraph/core/\_\_init\_\_.py | 3 | 3 | 0% |
| umap_narrative/polismath_commentgraph/core/clustering.py | 110 | 110 | 0% |
| umap_narrative/polismath_commentgraph/core/embedding.py | 104 | 104 | 0% |
| umap_narrative/polismath_commentgraph/lambda_handler.py | 219 | 219 | 0% |
| umap_narrative/polismath_commentgraph/schemas/\_\_init\_\_.py | 2 | 0 | 100% |
| umap_narrative/polismath_commentgraph/schemas/dynamo_models.py | 160 | 9 | 94% |
| umap_narrative/polismath_commentgraph/tests/conftest.py | 17 | 17 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_clustering.py | 74 | 74 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_embedding.py | 55 | 55 | 0% |
| umap_narrative/polismath_commentgraph/tests/test_storage.py | 87 | 87 | 0% |
| umap_narrative/polismath_commentgraph/utils/\_\_init\_\_.py | 3 | 0 | 100% |
| umap_narrative/polismath_commentgraph/utils/converter.py | 283 | 237 | 16% |
| umap_narrative/polismath_commentgraph/utils/group_data.py | 354 | 336 | 5% |
| umap_narrative/polismath_commentgraph/utils/storage.py | 585 | 477 | 18% |
| umap_narrative/reset_conversation.py | 159 | 50 | 69% |
| umap_narrative/run_pipeline.py | 453 | 312 | 31% |
| utils/general.py | 63 | 41 | 35% |
| Total | 11410 | 7688 | 33% |

@jucor jucor changed the title [Stack 5/16] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/17] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 16, 2026
@jucor jucor changed the title [Stack 5/17] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/24] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 17, 2026
@jucor jucor changed the title [Stack 5/24] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs [Stack 5/25] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Mar 17, 2026
@jucor jucor force-pushed the jc/cold-start-tooling branch from 04398a1 to ad6cf37 Compare March 19, 2026 10:42
@jucor
Collaborator Author

jucor commented Mar 19, 2026

Squashed into #2431 (Stack 3, jc/two-level-clustering). The unfolding fix is now part of the two-level clustering PR that introduced the code it fixes.

@jucor jucor closed this Mar 19, 2026
@jucor jucor changed the title [Stack 5/25] Fix group-cluster serialization: unfold base-cluster IDs to participant IDs Fix group-cluster serialization: unfold base-cluster IDs to participant IDs (squashed into #2431) Mar 19, 2026
@jucor
Collaborator Author

jucor commented Mar 19, 2026

Thanks @ballPointPenguin !
