Fix D2: in-conv participant threshold + D2c vote count source#2513
Open
jucor wants to merge 1 commit into
Open
Fix D2: in-conv participant threshold + D2c vote count source#2513jucor wants to merge 1 commit into
jucor wants to merge 1 commit into
Conversation
This was referenced Mar 30, 2026
02284b0 to
7f20a34
Compare
ballPointPenguin
approved these changes
Apr 26, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
Aligns Delphi’s “in-conv” participant filtering with the legacy Clojure math pipeline by correcting the vote threshold formula, ensuring vote counts come from the raw (unfiltered) vote matrix, and preserving base-cluster ID/encounter order. Adds targeted discrepancy + monotonicity guard tests and updates supporting documentation/journal entries.
Changes:
- Update in-conv threshold to
min(7, n_comments)and compute bothn_cmts+ per-user vote counts fromraw_rating_mat(so moderated-out comment votes still count). - Preserve base-cluster ordering by k-means ID (avoid size-sorting + ID reassignment).
- Add D2c/D2d synthetic + monotonicity guard tests; update plan/journal docs and a related test comment.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
delphi/polismath/conversation/conversation.py |
Adjusts base-cluster ordering and switches in-conv vote counting / threshold inputs to raw_rating_mat. |
delphi/tests/test_discrepancy_fixes.py |
Removes prior D2 xfails for cold-start blobs, adds D2c tests for raw-vs-filtered vote counting, and adds D2d monotonicity guard tests. |
delphi/tests/test_conversation.py |
Updates an in-test comment to reflect the new threshold definition. |
delphi/docs/PLAN_DISCREPANCY_FIXES.md |
Marks D2/D2b/D2c/D2d as done and documents incremental-blob deferral rationale. |
delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md |
Adds detailed journal entries describing the D2-related investigations and fixes. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+303
to
+304
| 0: list(range(10)), # 10 raw votes → in-conv | ||
| 1: list(range(5, 11)), # 6 raw votes (only 1 moderated-out) → NOT in-conv |
Comment on lines
+592
to
+595
| # Keep base clusters in k-means ID order (matching Clojure's sort-by :id) | ||
| # Do NOT sort by size or reassign IDs — that would change the encounter | ||
| # order of centers used in group clustering's first-k-distinct initialization. | ||
| base_clusters.sort(key=lambda c: c['id']) |
## Summary
Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).
### D2: In-conv threshold
- **Before**: `threshold = 7 + sqrt(n_cmts) * 0.1` — increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)
- **After**: `threshold = min(7, n_cmts)` — matches Clojure exactly
### D2b: Base-cluster sort order (from Copilot review)
- **Before**: Base clusters sorted by size (descending) with IDs reassigned — changes encounter order of centers fed into group-level k-means
- **After**: Keep k-means ID order, matching Clojure's `(sort-by :id ...)`
### D2c: Vote count source (raw vs filtered matrix)
- **Before**: `_compute_user_vote_counts` and `n_cmts` used `self.rating_mat` (filtered — moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.
- **After**: Both use `self.raw_rating_mat` (includes all votes, even on moderated-out comments), matching Clojure's `user-vote-counts` (conversation.clj:217-225) which reads from `raw-rating-mat`.
### D2d: In-conv monotonicity (design decision)
Python does full recompute from `raw_rating_mat` every time, so monotonicity ("once in, always in") is guaranteed without persistence — votes are immutable in PostgreSQL, so a participant's count never decreases. This is **strictly better** than Clojure's approach (which persists in-conv to `math_main` because it uses delta vote processing).
5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).
### Impact
- biodiversity: 428 → 441 in-conv participants (now matches Clojure)
- Verified on 4 datasets with complete Clojure cold-start blobs
### Incremental vs cold-start blob testing
D2 tests run against both **cold-start** and **incremental** Clojure blobs (infrastructure from #2420):
- **Cold-start blobs** are computed in one pass on the full dataset. The in-conv threshold `min(7, n_cmts)` is evaluated once with the final `n_cmts`. Python matches these exactly.
- **Incremental blobs** were built progressively as votes trickled in over the conversation's lifetime. The threshold was evaluated at each iteration with a smaller `n_cmts`, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).
D2 tests on incremental blobs are currently **xfailed** with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold — tracked as future work under Replay Infrastructure.
### Test results
```
253 passed, 5 skipped, 36 xfailed (0 failures)
```
## Test plan
- [x] D2 tests pass on all datasets with complete Clojure cold-start blobs
- [x] D2c: 3 synthetic tests verify vote counts include moderated-out votes, n_cmts includes moderated-out comments, participants stay in-conv after moderation
- [x] D2d: 5 monotonicity tests (basic across updates, survives moderation, worker restart + moderation, restart without new votes, mixed participants)
- [x] D2 tests xfail on incremental blobs (with explanatory comments)
- [x] Full test suite: 253 passed, 0 failures
- [x] Golden snapshots re-recorded for affected datasets
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Squashed commits
- Fix D2: in-conv threshold min(7, n_cmts) to match Clojure
- Skip D2 tests on datasets with incomplete Clojure blobs
- Address Copilot review: fix base-cluster sort order (D2b) and stale comment
- Add PR 1 test results to journal
- Plan: add D2c (vote count source) and D2d (in-conv monotonicity) to fix plan
- Journal: add session 3 findings (D2c vote count source, D2d monotonicity)
- Re-record golden snapshots and remove passing xfail markers
- xfail D2 in-conv tests on incremental blobs
- Journal: add session 4, update plan with D2 incremental in Replay PR B
- Fix D2c: use raw_rating_mat for vote counts and n_cmts threshold
commit-id:c0a682ec
Delphi Coverage Report
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the in-conv participant threshold (D2), vote count source (D2c), and base-cluster sort order (D2b) to match Clojure. Adds monotonicity guard tests (D2d).
D2: In-conv threshold
threshold = 7 + sqrt(n_cmts) * 0.1— increasingly restrictive for larger conversations (e.g., 8.8 for biodiversity's 314 comments)threshold = min(7, n_cmts)— matches Clojure exactlyD2b: Base-cluster sort order (from Copilot review)
(sort-by :id ...)D2c: Vote count source (raw vs filtered matrix)
_compute_user_vote_countsandn_cmtsusedself.rating_mat(filtered — moderated-out comment columns removed). A participant who voted on 8 comments could drop to 5 visible votes after 3 comments were moderated-out, falling below threshold.self.raw_rating_mat(includes all votes, even on moderated-out comments), matching Clojure'suser-vote-counts(conversation.clj:217-225) which reads fromraw-rating-mat.D2d: In-conv monotonicity (design decision)
Python does full recompute from
raw_rating_matevery time, so monotonicity ("once in, always in") is guaranteed without persistence — votes are immutable in PostgreSQL, so a participant's count never decreases. This is strictly better than Clojure's approach (which persists in-conv tomath_mainbecause it uses delta vote processing).5 guard tests (T1-T5) document this invariant and warn that switching to delta processing would require persisting in-conv to DynamoDB (ref: #2358).
Impact
Incremental vs cold-start blob testing
D2 tests run against both cold-start and incremental Clojure blobs (infrastructure from #2420):
min(7, n_cmts)is evaluated once with the finaln_cmts. Python matches these exactly.n_cmts, admitting a few extra participants during earlier iterations. The difference is tiny (1–2 participants).D2 tests on incremental blobs are currently xfailed with an explanatory comment. Matching incremental behaviour exactly would require simulating the progressive threshold — tracked as future work under Replay Infrastructure.
Test results
Test plan
🤖 Generated with Claude Code
Squashed commits
commit-id:c0a682ec
Stack: