5477 speed up distributions export#5605
Open
phoffer wants to merge 5 commits into
Open
Conversation
Collaborator
|
@phoffer please see my comment on the issue - it's incredibly unlikely that a single organization has thousands of distributions in a year. It's almost definitely an issue with how we're calling the database. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This improves the timeout issue in #5477. The two biggest factors for duration (slowness) are number of distributions and the number of items for an organization. (distributions x
organization.items= total calculations)Overall, I think this can speed up processing up by roughly 6x, dependent upon data volumes and composition (ie associated data)
My approach was to see how data volume impacts processing time. First, calculate processing time for various volumes of distributions, and then separately calculate across various volumes of items. All performance evaluation is relative, since my computer is not a typical server, and also not memory constrained for this purpose. I have attempted to have consistent system load across test runs, and run enough to get an idea for the potential speed processing, essentially the best realistic performance possible for each variation of code. All my testing is done on a MBP M2 Pro.
All of this is heavily dependent on what real world data looks like. My testing had few items per distribution, and primarily varied distribution count and organization item count. That could be wildly off from reality, but the performance gains should still be sizable.
I have added a skipped spec, which is what I primarily used for testing for various counts of distributions. Simple tweaks allowed for testing different item counts with the same distribution amount.
A couple general findings:
Summary of changes, commit by commit:
adbfe3a
Add controller spec + perf test spec, and tweak method arg to match actual usage
b41deb0
Switch to
#find_each. This did not have any speed impact on my machine, but memory usage is smoothed, and it will likely have some on production systems with other load and memory constraints. This will be better for the system overalldcf3059
Use
group_byto grouporganization.line_items(once per distribution), rather than doingdistribution.line_items.selectfor each organization item per distribution. This removes about 60% of original processing time per distribution.439ff17
Use a memoized array for organization line items which are not in the distribution. This replaces the 3 entry calculation in the inner most loop with a pre-computed set of zeroes. It also causes a shift to using flat_map instead of shoveling individual values. This entire change removed approximately 60% of processing time remaining after the prior improvement.
664e292
Update the comment for future devs
Other thoughts
distributions_controller#indexaction has a bunch of statements that are unnecessary, but seem to have negligible impact. TheDistributionTotalsServicequery is repeated (in controller, then export service), but 1) it probably hits ActiveRecord query cache the 2nd time and 2) at least with my test dataset composition, had negligible cost as well. But with different data in production, it could be relevantChecklist:
-->
Resolves #5477
Description
Type of change
How Has This Been Tested?
Screenshots