Script to Remediate Over Cap Visits#1152
Conversation
|
would appreciate your review on this one please. |
calellowitz
left a comment
There was a problem hiding this comment.
I think the logic is not quite right here, but I have some higher level concerns that I will also leave in the ticket. I suspect what is happening here is users overriding the overlimit status of visits that had been correctly flagged, rather than the system missing them. That has happened both intentionally and unintentionally in the past, and I would be concerned about overwriting those in an automated way, without talking to the individual PMs for each affected intervention, especially if some have already been paid.
| .order_by("date_created") | ||
| .values_list("id", flat=True) | ||
| ) | ||
| if len(active_visit_ids) > cl.max_visits: |
There was a problem hiding this comment.
I don't think this a correct comparison. ClaimLimits are per payment unit, and some payment units contain multiple deliver units, but this is counting UserVisits/deliver units. So if a payment unit requires a registration and service delivery form, which I think is a common setup even for single visit interventions, this will start to exclude visits only halfway to the limit. There will be 100 UserVisits when there are only 50 payment earned.
There was a problem hiding this comment.
Good catch, I've switched to counting CompletedWork rows instead (one per payment unit), and then flip each over-cap CW + all its visits as a unit.
Addressed in 4b5bbc3.
|
I have same feeling as Cal on this that it's probably best to do this on opportunities where this is explicitly asked for rather than on all Opportunities for the same reasons he hilights. |
The cap (claim_limit.max_visits) is on earned payment units, not raw UserVisits. A payment unit can have multiple deliver units (e.g. a registration + a service-delivery form), so counting visits across deliver units double-counts: 100 UserVisits = 50 earned payment units when there are 2 deliver units per payment unit, but the previous identification flagged that as already at the cap. Switch to counting CompletedWork rows: each CW is unique per (access, entity_id, payment_unit) and represents one earned payment unit regardless of how many deliver-unit forms it took to satisfy. For each over-cap CW we now flip the CW *and all its UserVisits* to over_limit as a unit, removing the previous "mixed completed-work" warning case (no longer possible by construction). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Before applying the fix in prod the operator wants to know how much worker and org accrual will drop. Aggregate saved_payment_accrued and saved_org_payment_accrued across the over-cap CompletedWork rows, then print per-access and total deltas in the DRY_RUN branch. Also include the projected reduction in the final post-apply summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recompute loop runs outside the atomic block (update_payment_accrued_for_user takes its own Redis lock per access), so a Ctrl-C or transient failure mid-loop leaves the status flips committed but payment_accrued stale for the unfinished accesses. Wrap the loop in try/finally and print a WARNING listing the access ids that still need a manual recompute. Also pre-fetch accesses with in_bulk instead of N .get() calls and derive affected_access_ids from the plan upfront. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@sravfeyn @calellowitz Thanks for the feedback on this. I agree with the points raised and also think it would be risky to apply this generally to all opps in prod. My plan is to just run this script for the affected opp in the linked ticket as they've explicitly requested it. Let me know if there are any concerns with this approach. I've made a few refactors to the code, please could I get another review pass? |
|
Since we know exactly which 4 visits should have been marked as overlimit, why not just do a simple update like below? (Not exactly, but on the lines of) |
calellowitz
left a comment
There was a problem hiding this comment.
thanks, agree seems fine if only for the affected xforms/opps
Product Description
No user-facing changes. This PR will not be merged — it exists solely to give reviewers a clean diff to comment on for the one-off remediation script that will be executed manually via
./manage.py shellin production. The script targetsUserVisitrows that were silently auto-approved past their per-worker cap by the bug fixed inze/fix-cap-bypass-when-duplicate-flag-off(CI-639), flipping them toover_limit, propagating the status to theirCompletedWork, and recomputingpayment_accruedfor the affected workers.Technical Summary
Link to ticket here — companion to the code fix on
ze/fix-cap-bypass-when-duplicate-flag-off.This is a release path 1 feature — Improvements to existing features & quick wins
The script reproduces the cap-check the buggy submission path failed to honour: for each
OpportunityClaimLimit, it loads active visits (status ∉ {over_limit,trial}) for that(opportunity_access, payment_unit)ordered bydate_created, treats the firstmax_visitsas legitimate, and treats the tail as the over-cap set that requires correction. The default scope is a single opportunity (set viaOPP_UUID); passNonefor a global scan once the targeted run is verified.DRY_RUN = Trueis the default so a reviewer can confirm the breakdown before any write happens._find_over_cap_visit_ids): ordering bydate_created(server-received timestamp) ensures that earlier in-cap submissions are kept and only the chronological tail is flipped. Excludes already-over_limitandtrialrows so re-runs are idempotent.CompletedWorksafety (_completed_works_safe_to_flip): aCompletedWorkis only flipped toover_limitif every linkedUserVisitis in the over-cap set. Mixed completed-works (linked to both in-cap and over-cap visits) are not touched and are reported as a warning for manual review — this matches the in-flight behaviour atprocessor.py:427-429, which also gates the completed-work flip per visit.v.status = over_limit) so theUserVisit.__setattr__override updatesstatus_modified_dateautomatically.bulk_updatethen writes both fields in batches ofBATCH_SIZE = 500inside a singletransaction.atomic()block.select_for_update()per batch protects against concurrent submissions touching the same rows mid-script.update_payment_accrued_for_user(access, incremental=False)is called for each affected access. Theincremental=Falserecompute is required because flipping a visit down fromapprovedtoover_limitis the opposite of what the incremental path is designed for (it skips already-approved completed-works).Safety Assurance
Safety story
./manage.py shell < scripts/remediate_over_cap_visits.py) on a pre-arranged production window. No CI deploy, no scheduled job.DRY_RUN = Trueat the top means the first invocation prints the plan without writing. A reviewer should require the actual production run to be witnessed (or executed) by a second operator after the dry-run output has been sanity-checked.transaction.atomic()block; if the run is killed mid-flight or any constraint is violated, nothing commits.UserVisitflipped fromapprovedtoover_limitcan be flipped back via the same import/admin paths that already exist (bulk_update_visit_statusinvisit_import.py). Thestatus_modified_dateis updated, so the audit trail records when the remediation ran.over_limit/trialfrom the "active" count, so a second run finds nothing to do.CompletedWork. Mixed completed-works are never touched automatically; the warning lists them by id for human review. This avoids the failure mode where a completed-work shared between an in-cap and out-of-cap visit would have its status incorrectly flipped.payment_accruedrecompute reduces the future accrual but does not undo prior payouts. Operator must confirm the agreed financial treatment with finance (absorb / deduct / adjust) before the non-dry-run execution.update_payment_accrued_for_userdoes not gate onaccess.suspended. If the production data has any suspended-but-affected workers, decide before running whether to skip them or include them.Automated test coverage
No automated tests are added in this PR. The script is a one-shot remediation — its correctness is validated by:
test_over_limit_status_preserved_when_duplicate_flag_disabled(inze/fix-cap-bypass-when-duplicate-flag-off), which guarantees the bug cannot reproduce after deploy.If reviewers want it as a defensible test, the suggested approach would be: build a fixture with one access at the cap and one access two visits over, run the script in-process with
DRY_RUN = False, and assert (a) the in-cap access is untouched, (b) the over-cap access has exactly two visits flipped toover_limit, (c) the completed-works are flipped where every linked visit is over-cap, (d)payment_accruedis recomputed downward. Happy to add this if desired.QA Plan
QA will not be performed for this change. Below is the testing plan for reference:
_find_over_cap_visit_idsmatches the intent (chronological tail beyondmax_visits)._completed_works_safe_to_flipcorrectly leaves mixed completed-works alone.OPP_UUID = "21bca9f7-00f9-4804-a7c0-e77c6139e579"andDRY_RUN = True. Confirm the output reports exactly two over-cap visits across two distinct accesses, matching the prod-data we already audited.DRY_RUN = Falseand re-run. Verify the twoUserVisitrows are nowover_limit, the twoCompletedWorkrows areover_limit, and the two affected workers'payment_accruedreflects the recompute.DRY_RUN = Trueagain — should report "Nothing to remediate" (idempotency check).OPP_UUIDset to the affected opportunity,DRY_RUN = Truefirst, thenDRY_RUN = Falseonly after the dry-run is reviewed. Capture the output.OPP_UUID = NoneandDRY_RUN = Trueto scan for any other opportunities silently affected by the same bug. Decide per-opportunity remediation with finance.Labels & Review