
avoid/reduce DISTINCT calls #101

Open
jerch opened this issue Feb 4, 2022 · 5 comments
Labels
enhancement New feature or request

Comments


jerch commented Feb 4, 2022

In bulk_updater we do a DISTINCT on every query in question. This is needed to avoid updating 1:n relations over and over.

Issues with this:

  • DISTINCT is placed there unconditionally, but is in fact only needed for back relations
  • DISTINCT is a known performance smell in queries
  • it does not work with already sliced/limited querysets on MySQL (better updatedata command #99 already applies a pk extraction hack for that)

Ideas for better handling:

  • To apply the DISTINCT reduction only on back relations, changes to the graph output and _querysets_for_update are needed (relation type backtracking); m2m-through handling is also affected by this.
  • Depending on the cardinality/selectivity of a back relation set, an explicit pk extraction in Python might perform better than query trickery. Problem here: we don't know the selectivity upfront, thus we have to query and evaluate the pk set, which is costly.
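The pk extraction idea can be illustrated on a toy schema. The foo/bar/baz tables below are hypothetical stand-ins for the models, and SQLite stands in for the real DBMS - this is only a sketch of the two equivalent strategies, not resolver code:

```python
import sqlite3

# Toy schema mirroring Foo <- Bar <- Baz (hypothetical stand-in tables).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INTEGER REFERENCES foo);
    CREATE TABLE baz (id INTEGER PRIMARY KEY, bar_id INTEGER REFERENCES bar);
""")
con.executemany("INSERT INTO foo VALUES (?)", [(i,) for i in range(1, 4)])
con.executemany("INSERT INTO bar VALUES (?, ?)",
                [(i, (i - 1) % 3 + 1) for i in range(1, 7)])
con.executemany("INSERT INTO baz VALUES (?, ?)",
                [(i, (i - 1) % 6 + 1) for i in range(1, 13)])

# Variant A: deduplicate in the database with DISTINCT (current behavior).
distinct_rows = con.execute("""
    SELECT DISTINCT foo.id FROM foo
    JOIN bar ON bar.foo_id = foo.id
    JOIN baz ON baz.bar_id = bar.id
""").fetchall()

# Variant B: pull the pks into Python, dedupe there, filter with IN.
pks = {foo_id for (foo_id,) in con.execute(
    "SELECT bar.foo_id FROM bar JOIN baz ON baz.bar_id = bar.id")}
placeholders = ",".join("?" * len(pks))
in_rows = con.execute(
    f"SELECT foo.id FROM foo WHERE foo.id IN ({placeholders})",
    sorted(pks)).fetchall()

# Both variants select the same parent records.
assert sorted(distinct_rows) == sorted(in_rows)
```

Which variant wins depends exactly on the selectivity question above: variant B pays an extra round trip and Python-side set handling, which only amortizes for small pk sets.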
@jerch jerch added the enhancement New feature or request label Feb 4, 2022

jerch commented Feb 15, 2022

There is another rather big optimization possible if the original queryset contains the full table (all() without any filters): here the WHERE IN filtering is very expensive for the DBMS for no good reason; instead we could simply do a JOIN with a NOT NULL filter.

Repro example with postgres:

  • 1000 Foo, linked to 100000 Bar (100 per Foo), linked to 1M Baz (10 per Bar)
  • task: update all involved Foo.bazzes for Baz.objects.all()
  • currently the resolver does this internally:
for foo in Foo.objects.filter(bar__baz__in=Baz.objects.all()).distinct():
    ... # bulk_update logic

which creates this SQL:

SELECT DISTINCT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
INNER JOIN "exampleapp_bar"
ON ("exampleapp_foo"."id" = "exampleapp_bar"."foo_id")
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
WHERE "exampleapp_baz"."id" IN
    (SELECT "exampleapp_baz"."id" FROM "exampleapp_baz")

with this query plan:

                                                                            QUERY PLAN                                                                             
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=97099.75..97109.75 rows=1000 width=39) (actual time=2470.348..2470.566 rows=1000 loops=1)
   Group Key: exampleapp_foo.id, exampleapp_foo.name, exampleapp_foo.bazzes
   Batches: 1  Memory Usage: 193kB
   ->  Hash Join  (cost=40976.50..89599.75 rows=1000000 width=39) (actual time=387.623..2027.315 rows=1000000 loops=1)
         Hash Cond: (exampleapp_bar.foo_id = exampleapp_foo.id)
         ->  Hash Join  (cost=40941.00..86928.12 rows=1000000 width=4) (actual time=386.481..1715.363 rows=1000000 loops=1)
               Hash Cond: (exampleapp_baz.id = exampleapp_baz_1.id)
               ->  Hash Join  (cost=3723.00..35364.11 rows=1000000 width=8) (actual time=45.022..735.624 rows=1000000 loops=1)
                     Hash Cond: (exampleapp_baz.bar_id = exampleapp_bar.id)
                     ->  Seq Scan on exampleapp_baz  (cost=0.00..20811.00 rows=1000000 width=8) (actual time=2.720..201.237 rows=1000000 loops=1)
                     ->  Hash  (cost=2082.00..2082.00 rows=100000 width=8) (actual time=42.131..42.132 rows=100000 loops=1)
                           Buckets: 131072  Batches: 2  Memory Usage: 2976kB
                           ->  Seq Scan on exampleapp_bar  (cost=0.00..2082.00 rows=100000 width=8) (actual time=0.237..18.653 rows=100000 loops=1)
               ->  Hash  (cost=20811.00..20811.00 rows=1000000 width=4) (actual time=340.597..340.597 rows=1000000 loops=1)
                     Buckets: 131072  Batches: 16  Memory Usage: 3227kB
                     ->  Seq Scan on exampleapp_baz exampleapp_baz_1  (cost=0.00..20811.00 rows=1000000 width=4) (actual time=0.013..134.125 rows=1000000 loops=1)
         ->  Hash  (cost=23.00..23.00 rows=1000 width=39) (actual time=1.131..1.131 rows=1000 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 78kB
               ->  Seq Scan on exampleapp_foo  (cost=0.00..23.00 rows=1000 width=39) (actual time=0.030..0.412 rows=1000 loops=1)
 Planning Time: 2.343 ms
 Execution Time: 2470.974 ms

Ouch! Compare it with this equivalent query:

SELECT DISTINCT ON("exampleapp_foo"."id")
    "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
INNER JOIN "exampleapp_bar"
ON ("exampleapp_foo"."id" = "exampleapp_bar"."foo_id")
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id");
                                                                                  QUERY PLAN                                                                                   
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.99..88728.43 rows=1000 width=39) (actual time=0.100..650.928 rows=1000 loops=1)
   ->  Nested Loop  (cost=0.99..86228.43 rows=1000000 width=39) (actual time=0.094..541.827 rows=1000000 loops=1)
         ->  Merge Join  (cost=0.57..6992.43 rows=100000 width=43) (actual time=0.059..68.672 rows=100000 loops=1)
               Merge Cond: (exampleapp_foo.id = exampleapp_bar.foo_id)
               ->  Index Scan using exampleapp_foo_pkey on exampleapp_foo  (cost=0.28..103.22 rows=1000 width=39) (actual time=0.029..0.555 rows=1000 loops=1)
               ->  Index Scan using exampleapp_bar_foo_id_a7b448ed on exampleapp_bar  (cost=0.29..5636.71 rows=100000 width=8) (actual time=0.016..37.969 rows=100000 loops=1)
         ->  Index Only Scan using exampleapp_baz_bar_id_cf1c3625 on exampleapp_baz  (cost=0.42..0.69 rows=10 width=4) (actual time=0.001..0.003 rows=10 loops=100000)
               Index Cond: (bar_id = exampleapp_bar.id)
               Heap Fetches: 0
 Planning Time: 1.862 ms
 Execution Time: 651.105 ms

which is 4 times faster. The gain comes from these changes:

  • DISTINCT ON usage - roughly doubles the speed; we should probably enable this for Postgres on every distinct invocation
  • WHERE IN clause removed - again doubles the speed; the big impact comes from avoiding a sequential scan and hashing of interim results, so the planner can use faster index scans instead
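That removing the WHERE IN clause changes only the plan and not the result set for a full-table child queryset can be sanity-checked on a toy schema (SQLite as a stand-in for Postgres, hypothetical foo/bar/baz tables):

```python
import sqlite3

# Toy schema mirroring Foo <- Bar <- Baz (hypothetical stand-in tables).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INTEGER REFERENCES foo);
    CREATE TABLE baz (id INTEGER PRIMARY KEY, bar_id INTEGER REFERENCES bar);
""")
con.executemany("INSERT INTO foo VALUES (?, ?)",
                [(i, f"foo{i}") for i in range(1, 5)])
con.executemany("INSERT INTO bar VALUES (?, ?)",
                [(i, (i % 4) + 1) for i in range(1, 9)])
con.executemany("INSERT INTO baz VALUES (?, ?)",
                [(i, (i % 8) + 1) for i in range(1, 17)])

# Original shape: DISTINCT plus a WHERE IN over the full baz table.
with_in = con.execute("""
    SELECT DISTINCT foo.id, foo.name FROM foo
    JOIN bar ON foo.id = bar.foo_id
    JOIN baz ON bar.id = baz.bar_id
    WHERE baz.id IN (SELECT baz.id FROM baz)
""").fetchall()

# Rewritten shape: IN clause dropped - the joins already restrict the rows.
without_in = con.execute("""
    SELECT DISTINCT foo.id, foo.name FROM foo
    JOIN bar ON foo.id = bar.foo_id
    JOIN baz ON bar.id = baz.bar_id
""").fetchall()

assert sorted(with_in) == sorted(without_in)
```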

If the foo selection has strong selectivity, we can perform even better by transforming the query further:

SELECT *
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN  (
    SELECT "exampleapp_bar"."foo_id"
    FROM "exampleapp_bar"
    INNER JOIN "exampleapp_baz"
    ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
);
                                                                             QUERY PLAN                                                                             
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Semi Join  (cost=0.72..1016.71 rows=1000 width=39) (actual time=0.034..9.329 rows=1000 loops=1)
   ->  Seq Scan on exampleapp_foo  (cost=0.00..23.00 rows=1000 width=39) (actual time=0.011..0.246 rows=1000 loops=1)
   ->  Nested Loop  (cost=0.72..86.24 rows=1000 width=4) (actual time=0.008..0.008 rows=1 loops=1000)
         ->  Index Scan using exampleapp_bar_foo_id_a7b448ed on exampleapp_bar  (cost=0.29..7.01 rows=100 width=8) (actual time=0.003..0.003 rows=1 loops=1000)
               Index Cond: (foo_id = exampleapp_foo.id)
         ->  Index Only Scan using exampleapp_baz_bar_id_cf1c3625 on exampleapp_baz  (cost=0.42..0.69 rows=10 width=4) (actual time=0.003..0.003 rows=1 loops=1000)
               Index Cond: (bar_id = exampleapp_bar.id)
               Heap Fetches: 0
 Planning Time: 0.511 ms
 Execution Time: 9.486 ms

Well, that's now roughly 260 times faster, mainly due to avoiding the need for any DISTINCT rule. Remaining questions to get an auto-optimization like that rolling:

  • Can we anticipate the selectivity of back relations somehow? Also this won't perform as well as shown here if Foo itself is rather big with many "holes". Additional knowledge of the fk fields (whether they can be NULL) might help here.
  • Is there an easy way to spot full table queries from Django's queryset? Note that this optimization does not apply to single object updates, which are still the normal "everyday" use case.


jerch commented Feb 15, 2022

Follow-up for above:

The big speedup is still preserved, even if we keep the inner WHERE IN clause:

SELECT *
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN  (
    SELECT "exampleapp_bar"."foo_id"
    FROM "exampleapp_bar"
    INNER JOIN "exampleapp_baz"
    ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
    WHERE "exampleapp_baz"."id" IN (
        SELECT "exampleapp_baz"."id" FROM "exampleapp_baz"
    )
);

which still runs in

 Planning Time: 2.557 ms
 Execution Time: 12.963 ms

This means that we can ignore the second question above and just go with a subquery in a WHERE IN clause on the outermost SELECT. This still needs some testing regarding the first question and whether bad selectivity will screw things up.
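That the inner WHERE IN clause is indeed semantically superfluous for an unfiltered child queryset can be checked on a toy schema (again SQLite as a stand-in, hypothetical foo/bar/baz tables):

```python
import sqlite3

# Toy schema mirroring Foo <- Bar <- Baz (hypothetical stand-in tables).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INTEGER REFERENCES foo);
    CREATE TABLE baz (id INTEGER PRIMARY KEY, bar_id INTEGER REFERENCES bar);
""")
con.executemany("INSERT INTO foo VALUES (?)", [(i,) for i in range(1, 5)])
con.executemany("INSERT INTO bar VALUES (?, ?)",
                [(i, (i % 4) + 1) for i in range(1, 9)])
con.executemany("INSERT INTO baz VALUES (?, ?)",
                [(i, (i % 8) + 1) for i in range(1, 17)])

# Subquery form with the redundant inner IN clause kept.
nested = con.execute("""
    SELECT foo.id FROM foo
    WHERE foo.id IN (
        SELECT bar.foo_id FROM bar
        JOIN baz ON bar.id = baz.bar_id
        WHERE baz.id IN (SELECT baz.id FROM baz)
    )
""").fetchall()

# Same subquery form without the inner IN clause.
plain = con.execute("""
    SELECT foo.id FROM foo
    WHERE foo.id IN (
        SELECT bar.foo_id FROM bar
        JOIN baz ON bar.id = baz.bar_id
    )
""").fetchall()

assert sorted(nested) == sorted(plain)
```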


jerch commented Feb 15, 2022

Indeed, the transformed query performs slightly worse than the original with DISTINCT ON applied, if we add 100k foos linking to no bars/bazes:

  • original (current resolver)
     Planning Time: 3.539 ms
     Execution Time: 2603.841 ms
    
  • original with DISTINCT ON
     Planning Time: 3.088 ms
     Execution Time: 1243.476 ms
    
  • transformed
     Planning Time: 2.710 ms
     Execution Time: 1927.248 ms
    

But the differences are too small to justify a different code path based on a vague selectivity estimate, which we don't even have without further record stats tracking or additional queries, while we still gain a lot of performance for typical 1:n relations with big n.


jerch commented Feb 15, 2022

Shaping the transformed query with ORM syntax is not quite easy; this is the closest I got for now:

Foo.objects.filter(pk__in=Baz.objects.all().select_related('bar').values('bar__foo_id'))

which creates:

SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
    SELECT U1."foo_id"
    FROM "exampleapp_baz" U0
    INNER JOIN "exampleapp_bar" U1 ON (U0."bar_id" = U1."id")
)

and it even runs faster than the DISTINCT ON variant above (finishes in 840 ms). It is also nicer in that Baz is not bound to all(); the subquery naturally extends with other clauses. But correctly resolving the relation from Baz --> ??? --> Foo is quite cumbersome: we have to join with Bar in an awkward way (surprising select_related) just to get the field Foo links to. The awkward join is determined solely by the fk direction (the ORM does not support this backwards), which makes it almost impossible to get done right in an automated fashion (depending on the relation types in a deps chain, the subqueries would have to be changed over and over).

A more direct and easier-to-code approach would be this:

Foo.objects.filter(pk__in=Bar.objects.filter(baz__in=Baz.objects.all()).values('foo'))

creating the WHERE IN cascade from the fastest solution above:

SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
    SELECT V0."foo_id"
    FROM "exampleapp_bar" V0
    INNER JOIN "exampleapp_baz" V1 ON (V0."id" = V1."bar_id")
    WHERE V1."id" IN (
        SELECT "exampleapp_baz"."id"
        FROM "exampleapp_baz"
    )
)

Again the inner WHERE IN clause is an SQL perf smell and totally superfluous (the rows would get filtered anyway by the INNER JOIN), but I have yet to find a more direct way in ORM syntax to construct the back relation join without misusing the __in operator. Due to the additional nonsense work, this again finishes in ~1.8 s.

What we have so far:

  • DISTINCT should be avoided; a WHERE IN at the outermost SELECT performs much better under most circumstances
  • the (probably) fastest solution would use INNER JOINs to narrow down candidates across multiple relation types (still needs investigation of how to construct these straightforwardly)
  • an easier-to-code solution applies WHERE IN at every back relation resolve step (maybe as an interim improvement)

Things not yet tested/reflected:

  • WHERE IN starts to show bad runtime for big pk sets again (but most likely not worse than DISTINCT itself); this might also lead to perf hell for an interim solution with multiple __in descents
  • cfs with deps to other cfs stack/extend querysets, which might behave worse than with an explicit DISTINCT reduction in between (this is also true for distinct and not really tested yet; for heavily stacking cfs like cycling deps in trees this might lead to re-evaluation of the full tree over and over) - this can probably only be broken up with explicit pk pulling into Python at some threshold
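A possible shape for that "pk pulling at some threshold" escape hatch, as a rough sketch only - the function name, the callable interface, and the cut-off value are all assumptions, nothing of this exists in the resolver:

```python
# Hypothetical threshold switch between SQL subquery descent and explicit
# pk pulling into Python. PK_PULL_THRESHOLD is an assumed cut-off that
# would need real benchmarking.
PK_PULL_THRESHOLD = 10_000

def choose_parent_filter(child_pks, fetch_parent_pks):
    """Decide how to narrow parent records for a set of changed children.

    child_pks        -- pks of the changed child records
    fetch_parent_pks -- callable mapping child pks to their parent pks
    Returns ('subquery', None) to stay in SQL, or ('pks', sorted_pks)
    to filter with an explicit deduplicated pk list, which breaks the
    queryset stacking for cycling deps.
    """
    if len(child_pks) > PK_PULL_THRESHOLD:
        # big change set: let the DBMS handle it with a subquery descent
        return ("subquery", None)
    # small change set: pull parent pks into Python and dedupe there
    return ("pks", sorted(set(fetch_parent_pks(child_pks))))
```

With a small change set, e.g. choose_parent_filter([1, 2], lambda pks: [5, 5, 7]) yields ('pks', [5, 7]); large change sets fall back to the subquery path.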

Edit: Fixed wrong field ref in 2nd variant


jerch commented Feb 15, 2022

Oh well, the joins can be constructed with extra:

>>> s=Bar.objects.all().extra(tables=['exampleapp_baz'], where=['U0.id=exampleapp_baz.bar_id'])
>>> _=list(Foo.objects.filter(pk__in=s.values('foo')))

leading to this sql:

SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
    SELECT U0."foo_id"
    FROM "exampleapp_bar" U0 , "exampleapp_baz"
    WHERE (U0.id=exampleapp_baz.bar_id)
)

But seriously, how ugly is that? That old join syntax is IMHO a clear no-go, as it mixes the join task with the filtering in the WHERE clause. The Postgres planner is clever enough to resolve to the same strategy as with INNER JOIN for this particular test, but I am not sure whether that still holds for more complicated filtering and deeper nested joins.
Furthermore, extra is slated for deprecation, so we should not rely on it. The lack of an explicit join in the ORM syntax is really a downer here. So raw SQL to the rescue? I really don't like that idea; on the other hand we already partially paved that way with fast_update (which I am not happy about either).

Edit:
Investigate whether https://github.com/dimagi/django-cte can help here and whether it supports LATERAL JOIN on Postgres.
