
Script to Remediate Over Cap Visits #1152

Draft
zandre-eng wants to merge 7 commits into main from ze/remediate-over-cap-visits

Conversation

@zandre-eng
Contributor

@zandre-eng zandre-eng commented Apr 28, 2026

Product Description

No user-facing changes. This PR will not be merged — it exists solely to give reviewers a clean diff to comment on for the one-off remediation script that will be executed manually via ./manage.py shell in production. The script targets UserVisit rows that were silently auto-approved past their per-worker cap by the bug fixed in ze/fix-cap-bypass-when-duplicate-flag-off (CI-639), flipping them to over_limit, propagating the status to their CompletedWork, and recomputing payment_accrued for the affected workers.

Technical Summary

Link to ticket here — companion to the code fix on ze/fix-cap-bypass-when-duplicate-flag-off.

This is a release path 1 feature — Improvements to existing features & quick wins

The script reproduces the cap-check the buggy submission path failed to honour: for each OpportunityClaimLimit, it loads active visits (status ∉ {over_limit, trial}) for that (opportunity_access, payment_unit) ordered by date_created, treats the first max_visits as legitimate, and treats the tail as the over-cap set that requires correction. The default scope is a single opportunity (set via OPP_UUID); pass None for a global scan once the targeted run is verified. DRY_RUN = True is the default so a reviewer can confirm the breakdown before any write happens.

  • Identification (_find_over_cap_visit_ids): ordering by date_created (server-received timestamp) ensures that earlier in-cap submissions are kept and only the chronological tail is flipped. Excludes already-over_limit and trial rows so re-runs are idempotent.
  • CompletedWork safety (_completed_works_safe_to_flip): a CompletedWork is only flipped to over_limit if every linked UserVisit is in the over-cap set. Mixed completed-works (linked to both in-cap and over-cap visits) are not touched and are reported as a warning for manual review — this matches the in-flight behaviour at processor.py:427-429, which also gates the completed-work flip per visit.
  • Status mutation: visits are flipped via Python attribute assignment (v.status = over_limit) so the UserVisit.__setattr__ override updates status_modified_date automatically. bulk_update then writes both fields in batches of BATCH_SIZE = 500 inside a single transaction.atomic() block.
  • Locking: select_for_update() per batch protects against concurrent submissions touching the same rows mid-script.
  • Payment recompute: after the status writes commit, update_payment_accrued_for_user(access, incremental=False) is called for each affected access. The incremental=False recompute is required because flipping a visit down from approved to over_limit is the opposite of what the incremental path is designed for (it skips already-approved completed-works).
  • Output: dry-run prints a per-access count of visits that would flip; non-dry-run prints final counts of visits/completed-works/accesses touched.
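
The identification step described above (keep the first max_visits active visits per (opportunity_access, payment_unit), flag the chronological tail) can be sketched in plain Python. This is a minimal illustration, not the script under review: the Visit dataclass, the string status values, and the max_visits_by_key mapping are assumptions standing in for the Django models and OpportunityClaimLimit rows.

```python
from dataclasses import dataclass
from datetime import datetime
from itertools import groupby

# Statuses that do not count toward the cap (assumed string values).
INACTIVE = {"over_limit", "trial"}


@dataclass
class Visit:
    """Hypothetical stand-in for a UserVisit row."""
    id: int
    access_id: int
    payment_unit_id: int
    status: str
    date_created: datetime


def _cap_key(visit):
    return (visit.access_id, visit.payment_unit_id)


def find_over_cap_visit_ids(visits, max_visits_by_key):
    """Return ids of active visits that fall beyond the cap for their key.

    max_visits_by_key maps (access_id, payment_unit_id) -> max_visits.
    """
    active = sorted(
        (v for v in visits if v.status not in INACTIVE),
        key=lambda v: (_cap_key(v), v.date_created),
    )
    over_cap = []
    for key, group in groupby(active, key=_cap_key):
        ordered = list(group)  # already oldest-first within the key
        cap = max_visits_by_key.get(key)
        if cap is not None and len(ordered) > cap:
            over_cap.extend(v.id for v in ordered[cap:])
    return over_cap


visits = [
    Visit(1, 10, 5, "approved", datetime(2026, 4, 1)),
    Visit(2, 10, 5, "approved", datetime(2026, 4, 2)),
    Visit(3, 10, 5, "approved", datetime(2026, 4, 3)),
    Visit(4, 10, 5, "over_limit", datetime(2026, 4, 4)),
]
caps = {(10, 5): 2}
print(find_over_cap_visit_ids(visits, caps))  # [3]
```

Because rows already at over_limit or trial are excluded from the active set, flipping visit 3 and calling the function again returns an empty list, which is the idempotency property the description relies on.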

Safety Assurance

Safety story

  • Will not be merged. This PR is a review surface only; the script is intended to be invoked manually (./manage.py shell < scripts/remediate_over_cap_visits.py) on a pre-arranged production window. No CI deploy, no scheduled job.
  • Default is dry-run. DRY_RUN = True at the top means the first invocation prints the plan without writing. A reviewer should require the actual production run to be witnessed (or executed) by a second operator after the dry-run output has been sanity-checked.
  • Atomic. All status writes happen inside a single transaction.atomic() block; if the run is killed mid-flight or any constraint is violated, nothing commits.
  • Reversible at the row level. A UserVisit flipped from approved to over_limit can be flipped back via the same import/admin paths that already exist (bulk_update_visit_status in visit_import.py). The status_modified_date is updated, so the audit trail records when the remediation ran.
  • Idempotent. Re-running excludes rows already at over_limit/trial from the "active" count, so a second run finds nothing to do.
  • Conservative on CompletedWork. Mixed completed-works are never touched automatically; the warning lists them by id for human review. This avoids the failure mode where a completed-work shared between an in-cap and out-of-cap visit would have its status incorrectly flipped.
  • Money already paid out is not reversed. The payment_accrued recompute reduces the future accrual but does not undo prior payouts. Operator must confirm the agreed financial treatment with finance (absorb / deduct / adjust) before the non-dry-run execution.
  • Suspended workers are not skipped. The per-access call to update_payment_accrued_for_user does not gate on access.suspended. If the production data has any suspended-but-affected workers, decide before running whether to skip them or include them.
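
The "Conservative on CompletedWork" rule above amounts to a subset check: a completed-work is flipped only when every linked visit is in the over-cap set, and anything partially covered is reported instead. A minimal sketch, assuming a plain dict from completed-work id to linked visit ids (a hypothetical shape, not the script's actual query result):

```python
def completed_works_safe_to_flip(visit_ids_by_cw, over_cap_visit_ids):
    """Split affected completed-work ids into (safe, mixed).

    safe:  every linked visit is over-cap, so the completed-work can flip.
    mixed: some linked visits are in-cap, so it is left for manual review.
    """
    over_cap = set(over_cap_visit_ids)
    safe, mixed = [], []
    for cw_id, visit_ids in visit_ids_by_cw.items():
        linked = set(visit_ids)
        if not linked & over_cap:
            continue  # not affected by the remediation at all
        if linked <= over_cap:
            safe.append(cw_id)
        else:
            mixed.append(cw_id)
    return safe, mixed


safe, mixed = completed_works_safe_to_flip(
    {100: [1, 2], 101: [3], 102: [4, 5]},
    over_cap_visit_ids=[3, 4],
)
print(safe)   # [101]
print(mixed)  # [102]
```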

Automated test coverage

No automated tests are added in this PR. The script is a one-shot remediation — its correctness is validated by:

  • The companion code fix's regression test test_over_limit_status_preserved_when_duplicate_flag_disabled (in ze/fix-cap-bypass-when-duplicate-flag-off), which guarantees the bug cannot reproduce after deploy.
  • The dry-run output, reviewed before each non-dry-run execution.

If reviewers want it as a defensible test, the suggested approach would be: build a fixture with one access at the cap and one access two visits over, run the script in-process with DRY_RUN = False, and assert (a) the in-cap access is untouched, (b) the over-cap access has exactly two visits flipped to over_limit, (c) the completed-works are flipped where every linked visit is over-cap, (d) payment_accrued is recomputed downward. Happy to add this if desired.
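
The shape of that test could look roughly like the sketch below. It uses a toy in-memory model rather than the project's models or factories, so Access, remediate, and the flat per-visit amounts are all hypothetical; it covers assertions (a), (b), and (d) plus the idempotent re-run, and leaves (c) out because completed-works are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Access:
    """Toy stand-in for an opportunity access with its visits."""
    id: int
    max_visits: int
    payment_accrued: int = 0
    # (status, amount) tuples, oldest first
    visits: list = field(default_factory=list)


def remediate(access):
    """Toy stand-in for the script: flip the tail, then fully recompute."""
    active = [i for i, (status, _) in enumerate(access.visits) if status == "approved"]
    flipped = active[access.max_visits:]
    for i in flipped:
        access.visits[i] = ("over_limit", access.visits[i][1])
    # Full (non-incremental) recompute from the remaining approved visits.
    access.payment_accrued = sum(a for s, a in access.visits if s == "approved")
    return len(flipped)


def test_remediation():
    at_cap = Access(1, max_visits=3, payment_accrued=30, visits=[("approved", 10)] * 3)
    over = Access(2, max_visits=3, payment_accrued=50, visits=[("approved", 10)] * 5)

    # (a) the in-cap access is untouched
    assert remediate(at_cap) == 0
    assert at_cap.payment_accrued == 30

    # (b) exactly two visits flip, and they are the chronological tail
    assert remediate(over) == 2
    assert [s for s, _ in over.visits] == ["approved"] * 3 + ["over_limit"] * 2

    # (d) payment_accrued is recomputed downward
    assert over.payment_accrued == 30

    # a second run finds nothing to do
    assert remediate(over) == 0


test_remediation()
```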

QA Plan

QA will not be performed for this change. Below is the testing plan for reference:

  • Reviewer reads the script end-to-end, confirms the identification logic in _find_over_cap_visit_ids matches the intent (chronological tail beyond max_visits).
  • Reviewer confirms _completed_works_safe_to_flip correctly leaves mixed completed-works alone.
  • On staging (or a prod replica), run with OPP_UUID = "21bca9f7-00f9-4804-a7c0-e77c6139e579" and DRY_RUN = True. Confirm the output reports exactly two over-cap visits across two distinct accesses, matching the prod-data we already audited.
  • On staging, set DRY_RUN = False and re-run. Verify the two UserVisit rows are now over_limit, the two CompletedWork rows are over_limit, and the two affected workers' payment_accrued reflects the recompute.
  • Re-run with DRY_RUN = True again — should report "Nothing to remediate" (idempotency check).
  • Production execution gate: confirm with finance how prior payouts to the over-cap workers will be reconciled before running the non-dry-run script in prod.
  • Production execution: run with OPP_UUID set to the affected opportunity, DRY_RUN = True first, then DRY_RUN = False only after the dry-run is reviewed. Capture the output.
  • Optional follow-up: set OPP_UUID = None and DRY_RUN = True to scan for any other opportunities silently affected by the same bug. Decide per-opportunity remediation with finance.

Labels & Review

  • The set of people pinged as reviewers is appropriate for the level of risk of the change

@mkangia
Contributor

mkangia commented Apr 28, 2026

@calellowitz @sravfeyn

would appreciate your review on this one please.

Collaborator

@calellowitz calellowitz left a comment

I think the logic is not quite right here, but I have some higher level concerns that I will also leave in the ticket. I suspect what is happening here is users overriding the overlimit status of visits that had been correctly flagged, rather than the system missing them. That has happened both intentionally and unintentionally in the past, and I would be concerned about overwriting those in an automated way, without talking to the individual PMs for each affected intervention, especially if some have already been paid.

Comment thread docs/plans/script.py Outdated
.order_by("date_created")
.values_list("id", flat=True)
)
if len(active_visit_ids) > cl.max_visits:
Collaborator

I don't think this is a correct comparison. ClaimLimits are per payment unit, and some payment units contain multiple deliver units, but this is counting UserVisits/deliver units. So if a payment unit requires a registration and a service delivery form, which I think is a common setup even for single-visit interventions, this will start to exclude visits only halfway to the limit. There will be 100 UserVisits when only 50 payments have been earned.

Contributor Author

Good catch, I've switched to counting CompletedWork rows instead (one per payment unit), and then flip each over-cap CW + all its visits as a unit.

Addressed in 4b5bbc3.
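
The change described in this reply (count one CompletedWork per earned payment unit, then flip each over-cap CompletedWork together with all its visits) can be sketched as follows. This is an illustration under stated assumptions, not the code in 4b5bbc3: the CompletedWork dataclass, its date_created ordering field, and the visit_statuses dict are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CompletedWork:
    """Hypothetical stand-in: one row per earned payment unit."""
    id: int
    date_created: datetime
    status: str = "approved"
    visit_statuses: dict = field(default_factory=dict)  # visit id -> status


def flip_over_cap(completed_works, max_visits):
    """Flip completed works beyond the cap, and all their visits, as a unit.

    completed_works: all rows for a single (access, payment unit) pair.
    Returns the ids of the completed works that were flipped.
    """
    active = sorted(
        (cw for cw in completed_works if cw.status != "over_limit"),
        key=lambda cw: cw.date_created,
    )
    flipped = []
    for cw in active[max_visits:]:
        cw.status = "over_limit"
        for visit_id in cw.visit_statuses:
            cw.visit_statuses[visit_id] = "over_limit"
        flipped.append(cw.id)
    return flipped


# Three payment units earned, each from two deliver-unit forms (six visits).
works = [
    CompletedWork(i, datetime(2026, 4, i),
                  visit_statuses={2 * i: "approved", 2 * i + 1: "approved"})
    for i in (1, 2, 3)
]
print(flip_over_cap(works, max_visits=2))  # [3]
```

With a cap of 2, counting raw visits would have flagged four of the six visits; counting completed works flags only the third payment unit and its two visits.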

@sravfeyn
Member

I have the same feeling as Cal on this: it's probably best to do this on opportunities where this is explicitly asked for rather than on all Opportunities, for the same reasons he highlights.

zandre-eng and others added 3 commits April 29, 2026 11:56
The cap (claim_limit.max_visits) is on earned payment units, not raw
UserVisits. A payment unit can have multiple deliver units (e.g. a
registration + a service-delivery form), so counting visits across
deliver units double-counts: 100 UserVisits = 50 earned payment units
when there are 2 deliver units per payment unit, but the previous
identification flagged that as already at the cap.

Switch to counting CompletedWork rows: each CW is unique per
(access, entity_id, payment_unit) and represents one earned payment
unit regardless of how many deliver-unit forms it took to satisfy.
For each over-cap CW we now flip the CW *and all its UserVisits* to
over_limit as a unit, removing the previous "mixed completed-work"
warning case (no longer possible by construction).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Before applying the fix in prod the operator wants to know how much
worker and org accrual will drop. Aggregate saved_payment_accrued and
saved_org_payment_accrued across the over-cap CompletedWork rows, then
print per-access and total deltas in the DRY_RUN branch. Also include
the projected reduction in the final post-apply summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The recompute loop runs outside the atomic block (update_payment_accrued_for_user
takes its own Redis lock per access), so a Ctrl-C or transient failure mid-loop
leaves the status flips committed but payment_accrued stale for the unfinished
accesses. Wrap the loop in try/finally and print a WARNING listing the access
ids that still need a manual recompute. Also pre-fetch accesses with in_bulk
instead of N .get() calls and derive affected_access_ids from the plan upfront.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
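
The try/finally pattern from the last commit message can be sketched like this. It is a minimal illustration: recompute_all and the recompute callable are hypothetical names, with the callable standing in for the per-access update_payment_accrued_for_user call, and the in_bulk pre-fetch is not modelled.

```python
def recompute_all(access_ids, recompute):
    """Call recompute(access_id) for each id, warning about any left undone.

    If the loop is interrupted (exception or Ctrl-C), the finally block
    still runs and lists the accesses that need a manual recompute.
    """
    pending = list(access_ids)
    try:
        while pending:
            recompute(pending[0])
            pending.pop(0)  # only drop an id once its recompute succeeded
    finally:
        if pending:
            print(f"WARNING: manual recompute still needed for accesses: {pending}")
    return pending


done = []


def flaky(access_id):
    # Simulated transient failure on the third access.
    if access_id == 3:
        raise RuntimeError("transient failure")
    done.append(access_id)


try:
    recompute_all([1, 2, 3, 4], flaky)
except RuntimeError:
    pass
print(done)  # [1, 2]
```

The access that failed stays in the pending list, so the warning covers both the interrupted access and everything after it.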
@zandre-eng
Contributor Author

@sravfeyn @calellowitz Thanks for the feedback on this. I agree with the points raised and also think it would be risky to apply this generally to all opps in prod. My plan is to just run this script for the affected opp in the linked ticket as they've explicitly requested it. Let me know if there are any concerns with this approach.

I've made a few refactors to the code, please could I get another review pass?

@sravfeyn
Member

Since we know exactly which 4 visits should have been marked as overlimit, why not just do a simple update like below?

(Not exactly, but along the lines of)

  XFORM_IDS = ["<id1>", "<id2>"]

  visits = UserVisit.objects.filter(xform_id__in=XFORM_IDS).select_related("completed_work")
  user_ids = list(visits.values_list("user_id", flat=True))
  cw_ids = [v.completed_work_id for v in visits if v.completed_work_id]

  UserVisit.objects.filter(xform_id__in=XFORM_IDS).update(status=VisitValidationStatus.over_limit, review_status="")
  CompletedWork.objects.filter(id__in=cw_ids).update(status=CompletedWorkStatus.over_limit)

  bulk_update_payment_accrued.delay(visits.first().opportunity_id, user_ids)

Comment thread docs/plans/script.py Outdated
Member

@sravfeyn sravfeyn left a comment

Thanks. LGTM.

Collaborator

@calellowitz calellowitz left a comment

thanks, agree seems fine if only for the affected xforms/opps
