5477 speed up distributions export#5605

Open
phoffer wants to merge 5 commits into
rubyforgood:mainfrom
phoffer:ph-5477-exporting-distributions-timeout
Conversation

@phoffer phoffer commented Jun 18, 2026

This mitigates the timeout issue in #5477. The two biggest factors in export duration are the number of distributions and the number of items in the organization (distributions x organization.items = total calculations).

Overall, I think this can speed up processing by roughly 6x, depending on data volume and composition (i.e., associated data).

My approach was to measure how data volume affects processing time: first across various volumes of distributions, then separately across various volumes of items. All performance numbers are relative, since my computer is not a typical server and is not memory constrained for this purpose. I tried to keep system load consistent across test runs, and ran enough of them to get a sense of the best realistic performance for each variation of the code. All testing was done on an M2 Pro MacBook Pro.

All of this depends heavily on what real-world data looks like. My test data had few items per distribution, and I primarily varied the distribution count and the organization item count. That could be wildly off from reality, but the performance gains should still be sizable.

I have added a skipped spec, which is what I primarily used to test various distribution counts. Simple tweaks to it allowed testing different item counts with the same number of distributions.

A couple general findings:

  • Processing time scales linearly as the distribution count grows. This may not hold on production systems that have other resource contention.
  • Processing time scales (mostly) linearly as the organization item count grows.
  • I have lots of data recorded; please let me know if it would be useful to share that as well.
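
For anyone who wants to reproduce the shape of these measurements outside the app, here is a minimal plain-Ruby sketch of the distributions x items relationship. `simulate_export` and `measure` are hypothetical helpers, not code from this PR, and the loop body is a stand-in for the real per-item work; only `Benchmark.realtime` is a real stdlib call.

```ruby
require "benchmark"

# Hypothetical stand-in for the export: one "calculation" per
# organization item, for every distribution (distributions x items).
def simulate_export(distribution_count, item_count)
  calculations = 0
  distribution_count.times do
    item_count.times { calculations += 1 }
  end
  calculations
end

# Time a single run and return [calculations, elapsed_seconds].
def measure(distribution_count, item_count)
  calculations = nil
  seconds = Benchmark.realtime do
    calculations = simulate_export(distribution_count, item_count)
  end
  [calculations, seconds]
end

# Double the distribution count each time while holding items fixed.
[100, 200, 400].each do |distribution_count|
  calculations, seconds = measure(distribution_count, 50)
  puts format("%4d distributions x 50 items = %6d calculations in %.5fs",
              distribution_count, calculations, seconds)
end
```

The calculation count doubles with each step; the timings on a real dataset will depend on the machine and the data, as noted above.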

Summary of changes, commit by commit:

adbfe3a
Add controller spec + perf test spec, and tweak method arg to match actual usage

b41deb0
Switch to #find_each. This had no speed impact on my machine, but it smooths memory usage, and it will likely help speed on production systems with other load and memory constraints. This will be better for the system overall.
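
To illustrate why batching smooths memory, here is a plain-Ruby analogy, not the code in this commit. `Record`, `fetch_batch`, and `each_record_in_batches` are hypothetical names, and the offset-based paging is a simplification: ActiveRecord's `find_each` batches by primary key and defaults to 1000 records per batch.

```ruby
# Hypothetical record type; the real export iterates over distributions.
Record = Struct.new(:id)

# Hypothetical data source that builds only the requested slice of rows,
# standing in for one batched database query.
def fetch_batch(total, offset, limit)
  last = [offset + limit, total].min
  (offset...last).map { |index| Record.new(index + 1) }
end

# Yield every record while holding at most one batch in memory at a time.
def each_record_in_batches(total, batch_size: 1000)
  offset = 0
  while offset < total
    fetch_batch(total, offset, batch_size).each { |record| yield record }
    offset += batch_size
  end
end

seen_ids = []
each_record_in_batches(2500, batch_size: 1000) { |record| seen_ids << record.id }
```

Every record is still visited exactly once; the difference is that only `batch_size` records are instantiated at any moment rather than the full result set.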

dcf3059
Use group_by to group organization.line_items (once per distribution), rather than calling distribution.line_items.select for each organization item in each distribution. This removes about 60% of the original processing time per distribution.
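
The general shape of this change can be sketched in plain Ruby. This is an illustration under assumptions, not the code in the commit: `LineItem` is a hypothetical struct with only the fields needed here, and both method names are made up for the comparison.

```ruby
# Hypothetical stand-in for a line item; the real model has more fields.
LineItem = Struct.new(:item_id, :quantity)

# Before: scan all line items again for every organization item.
def quantities_with_select(line_items, item_ids)
  item_ids.map do |item_id|
    line_items.select { |line_item| line_item.item_id == item_id }.sum(&:quantity)
  end
end

# After: group the line items once, then do a hash lookup per item.
def quantities_with_group_by(line_items, item_ids)
  grouped = line_items.group_by(&:item_id)
  item_ids.map do |item_id|
    grouped.fetch(item_id, []).sum(&:quantity)
  end
end

line_items = [LineItem.new(1, 5), LineItem.new(3, 2), LineItem.new(1, 4)]
item_ids = [1, 2, 3]

select_result = quantities_with_select(line_items, item_ids)
group_by_result = quantities_with_group_by(line_items, item_ids)
```

Both versions return the same quantities; the grouped version walks the line items once instead of once per organization item.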

439ff17
Use a memoized array for organization line items which are not in the distribution. This replaces the three-entry calculation in the innermost loop with a pre-computed array of zeroes. It also prompts a shift to flat_map instead of shoveling individual values. This change removed approximately 60% of the processing time remaining after the prior improvement.
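
A rough plain-Ruby sketch of the idea follows. The names `PricedLineItem`, `RowBuilder`, and `empty_slots` are hypothetical, and the three columns per item (quantity, unit value, total value) are an assumption for illustration; the real export columns may differ.

```ruby
# Hypothetical line item with a per-unit value in cents.
PricedLineItem = Struct.new(:item_id, :quantity, :value_cents)

class RowBuilder
  def initialize(item_ids)
    @item_ids = item_ids
  end

  # Build three columns per organization item with flat_map, instead of
  # shoveling each value onto a row array one at a time.
  def item_columns(line_items)
    grouped = line_items.group_by(&:item_id)
    @item_ids.flat_map do |item_id|
      items = grouped[item_id]
      next empty_slots if items.nil? # item not in this distribution

      quantity = items.sum(&:quantity)
      total = items.sum { |item| item.quantity * item.value_cents }
      [quantity, items.first.value_cents, total]
    end
  end

  private

  # Memoized, frozen zeroes reused for every item that is absent.
  def empty_slots
    @empty_slots ||= [0, 0, 0].freeze
  end
end

builder = RowBuilder.new([1, 2, 3])
columns = builder.item_columns(
  [PricedLineItem.new(1, 5, 100), PricedLineItem.new(3, 2, 250)]
)
```

Since most organization items are absent from any given distribution in the test data described above, the zero array covers the common case without any per-item arithmetic.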

664e292
Update the comment for future devs

Other thoughts

  • The idea of moving to a background job would still be good long term, but this should buy some more runway
  • I wanted to keep this maintainable and avoid turning it into a leetcode exercise. There are some further tweaks available to squeeze a hair more out of it, but I do not think they are meaningful enough to justify the maintainability cost (e.g., some clever data plucking could avoid instantiating so many line_item+item pairings, but that would get ugly quickly).
  • The distributions_controller#index action has a bunch of statements that are unnecessary but seem to have negligible impact. The DistributionTotalsService query is repeated (in the controller, then in the export service), but 1) it probably hits the ActiveRecord query cache the second time, and 2) at least with my test dataset composition, it had negligible cost as well. With different data in production, it could be relevant.

Checklist:

  • I have performed a self-review of my own code,
  • I have commented my code, particularly in hard-to-understand areas,
  • I have made corresponding changes to the documentation,
  • I have added tests that prove my fix is effective or that my feature works,
  • New and existing unit tests pass locally with my changes ("bundle exec rake"),
  • Title include "WIP" if work is in progress.
  • I acknowledge that I will not force push my branch once reviews have started.

Resolves #5477

Description

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Documentation update

How Has This Been Tested?

Screenshots

@phoffer phoffer changed the title 5477 exporting distributions timeout 5477 speed up distributions export Jun 18, 2026

dorner commented Jun 19, 2026

Collaborator

@phoffer please see my comment on the issue. It's incredibly unlikely that a single organization has thousands of distributions in a year, so this is almost certainly an issue with how we're calling the database.


Development

Successfully merging this pull request may close these issues.

[BUG]: Exporting a year of distributions times out

2 participants