avoid/reduce DISTINCT calls #101
There is another rather big optimization possible, if the original queryset contains the full table ( Repro example with postgres:
for foo in Foo.objects.filter(bar__baz__in=Baz.objects.all()).distinct():
    ...  # bulk_update logic
which creates this SQL:
SELECT DISTINCT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
INNER JOIN "exampleapp_bar"
ON ("exampleapp_foo"."id" = "exampleapp_bar"."foo_id")
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
WHERE "exampleapp_baz"."id" IN
(SELECT "exampleapp_baz"."id" FROM "exampleapp_baz")
with this query plan:
Ouch! Compare it with this equivalent query:
SELECT DISTINCT ON("exampleapp_foo"."id")
"exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
INNER JOIN "exampleapp_bar"
ON ("exampleapp_foo"."id" = "exampleapp_bar"."foo_id")
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id");
which is 4 times faster. The gain comes from these changes:
If the SELECT *
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
SELECT "exampleapp_bar"."foo_id"
FROM "exampleapp_bar"
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
);
Well, that's now roughly 260 times faster, mainly because no DISTINCT is needed at all. Remaining questions to get an auto-optimization like that rolling:
Follow-up for above: The big speedup is still preserved, even if we keep the inner WHERE IN clause:
SELECT *
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
SELECT "exampleapp_bar"."foo_id"
FROM "exampleapp_bar"
INNER JOIN "exampleapp_baz"
ON ("exampleapp_bar"."id" = "exampleapp_baz"."bar_id")
WHERE "exampleapp_baz"."id" IN (
SELECT "exampleapp_baz"."id" FROM "exampleapp_baz"
)
);
which still runs in
This means that we can ignore the second question above and just go with a subquery in a WHERE IN clause on the outermost SELECT. This still needs some testing regarding the first question and whether bad selectivity will screw things up.
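As a sanity check on the rewrite discussed above, here is a minimal sketch that compares the DISTINCT join against the WHERE IN subquery on a toy dataset. It uses Python's built-in sqlite3 instead of postgres, drops the "bazzes" column, and invents its own sample rows, so it only illustrates that both forms return the same set of foos. It says nothing about the timings reported in this thread.

```python
import sqlite3

# Toy copy of the foo -> bar -> baz schema (sample data is made up).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE exampleapp_foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exampleapp_bar (id INTEGER PRIMARY KEY, foo_id INTEGER);
    CREATE TABLE exampleapp_baz (id INTEGER PRIMARY KEY, bar_id INTEGER);
""")
# Three foos; foo 3 links to no bars. Every bar gets four bazzes.
con.executemany("INSERT INTO exampleapp_foo VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "unlinked")])
con.executemany("INSERT INTO exampleapp_bar VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 2)])
con.executemany("INSERT INTO exampleapp_baz (bar_id) VALUES (?)",
                [(bar_id,) for bar_id in (1, 2, 3) for _ in range(4)])

# Original shape: joins multiply rows, DISTINCT collapses them again.
JOIN_DISTINCT = """
    SELECT DISTINCT f.id, f.name
    FROM exampleapp_foo f
    INNER JOIN exampleapp_bar b ON (f.id = b.foo_id)
    INNER JOIN exampleapp_baz z ON (b.id = z.bar_id)
"""
# Transformed shape: the joins move into a subquery, no DISTINCT needed.
WHERE_IN = """
    SELECT f.id, f.name
    FROM exampleapp_foo f
    WHERE f.id IN (
        SELECT b.foo_id
        FROM exampleapp_bar b
        INNER JOIN exampleapp_baz z ON (b.id = z.bar_id)
    )
"""
join_rows = sorted(con.execute(JOIN_DISTINCT).fetchall())
where_in_rows = sorted(con.execute(WHERE_IN).fetchall())
print(join_rows == where_in_rows, where_in_rows)  # True [(1, 'a'), (2, 'b')]
```

The unlinked foo is excluded by both variants, which is the case where selectivity matters for performance but not for correctness.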
Indeed, the transformed query performs slightly worse than the original with DISTINCT ON applied, if we add 100k foos linking to no bars/bazs:
But the differences are too small to justify a different code path based on a vague selectivity estimate, which we don't even have without further record stats tracking or additional queries, while we still gain much performance for typical 1:n relations with big n.
Shaping the transformed query with ORM syntax is not easy; this is the closest I got for now:
Foo.objects.filter(pk__in=Baz.objects.all().select_related('bar').values('bar__foo_id'))
which creates:
SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
SELECT U1."foo_id"
FROM "exampleapp_baz" U0
INNER JOIN "exampleapp_bar" U1 ON (U0."bar_id" = U1."id")
)
and even runs faster than the DISTINCT ON variant above (finishes in 840ms). It is also nicer for Baz not being bound to
A more direct and easier to code approach would be this:
Foo.objects.filter(pk__in=Bar.objects.filter(baz__in=Baz.objects.all()).values('foo'))
creating the WHERE IN cascade from the fastest solution above:
SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
SELECT V0."foo_id"
FROM "exampleapp_bar" V0
INNER JOIN "exampleapp_baz" V1 ON (V0."id" = V1."bar_id")
WHERE V1."id" IN (
SELECT "exampleapp_baz"."id"
FROM "exampleapp_baz"
)
)
Again, the inner WHERE IN clause is an SQL perf smell and totally superfluous (it would get filtered anyway by the INNER JOIN), but I have yet to find a more direct way in the ORM syntax to construct the back-relation join without misusing the __in operator. Due to the additional nonsense work, this again finishes in ~1.8s.
What we have so far:
things not yet tested/reflected:
Edit: Fixed wrong field ref in 2nd variant
Oh well, can construct the joins with
>>> s = Bar.objects.all().extra(tables=['exampleapp_baz'], where=['U0.id=exampleapp_baz.bar_id'])
>>> _ = list(Foo.objects.filter(pk__in=s.values('foo')))
leading to this SQL:
SELECT "exampleapp_foo"."id", "exampleapp_foo"."name", "exampleapp_foo"."bazzes"
FROM "exampleapp_foo"
WHERE "exampleapp_foo"."id" IN (
SELECT U0."foo_id"
FROM "exampleapp_bar" U0 , "exampleapp_baz"
WHERE (U0.id=exampleapp_baz.bar_id)
)
But seriously, how ugly is that? That old join syntax is IMHO a clear no-go, as it mixes the join task with filtering in the WHERE clause. The postgres planner is clever enough to resolve to the same strategy as with INNER JOIN for this particular test, but I am not sure whether that will still be the case for more complicated filtering and deeper nested joins.
Edit:
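For what it's worth, the two join spellings can at least be checked for identical results. The sketch below does that with Python's built-in sqlite3 on made-up rows (the table names follow the thread; the data and the helper `foos_matching` are hypothetical). It does not address the planner concern: on postgres that would need comparing the plans, for example with EXPLAIN ANALYZE, on realistic data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE exampleapp_foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exampleapp_bar (id INTEGER PRIMARY KEY, foo_id INTEGER);
    CREATE TABLE exampleapp_baz (id INTEGER PRIMARY KEY, bar_id INTEGER);
""")
# One bar per foo; only bars 1 and 3 have bazzes, so foo 2 must drop out.
con.executemany("INSERT INTO exampleapp_foo VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])
con.executemany("INSERT INTO exampleapp_bar VALUES (?, ?)",
                [(1, 1), (2, 2), (3, 3)])
con.executemany("INSERT INTO exampleapp_baz (bar_id) VALUES (?)",
                [(1,), (1,), (3,)])

# Old-style join: tables listed with a comma, join condition in WHERE.
COMMA_JOIN = """
    SELECT b.foo_id
    FROM exampleapp_bar b, exampleapp_baz z
    WHERE (b.id = z.bar_id)
"""
# Explicit join: the condition lives in the ON clause.
INNER_JOIN = """
    SELECT b.foo_id
    FROM exampleapp_bar b
    INNER JOIN exampleapp_baz z ON (b.id = z.bar_id)
"""

def foos_matching(subquery):
    """Return the ids of all foos selected via WHERE id IN (subquery)."""
    sql = f"SELECT id FROM exampleapp_foo WHERE id IN ({subquery}) ORDER BY id"
    return [row[0] for row in con.execute(sql)]

print(foos_matching(COMMA_JOIN), foos_matching(INNER_JOIN))  # [1, 3] [1, 3]
```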
In bulk_updater we do a DISTINCT on every query in question. This is needed to avoid updating 1:n relations over and over.
Issues with this:
Ideas for better handling:
_querysets_for_update are needed with relation type backtracking, also m2m-through handling is affected by this.
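To illustrate why some form of deduplication is needed in the first place, here is a small sketch of the row multiplication a join across 1:n relations causes. It uses Python's built-in sqlite3 with made-up rows; the schema follows the foo/bar/baz example from the comments and is only an assumption about what a typical relation chain looks like.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE exampleapp_foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE exampleapp_bar (id INTEGER PRIMARY KEY, foo_id INTEGER);
    CREATE TABLE exampleapp_baz (id INTEGER PRIMARY KEY, bar_id INTEGER);
""")
# A single foo with three bars, each bar with two bazzes.
con.execute("INSERT INTO exampleapp_foo VALUES (1, 'only foo')")
con.executemany("INSERT INTO exampleapp_bar VALUES (?, 1)", [(1,), (2,), (3,)])
con.executemany("INSERT INTO exampleapp_baz (bar_id) VALUES (?)",
                [(bar_id,) for bar_id in (1, 2, 3) for _ in range(2)])

JOINED = """
    SELECT {distinct} f.id
    FROM exampleapp_foo f
    INNER JOIN exampleapp_bar b ON (f.id = b.foo_id)
    INNER JOIN exampleapp_baz z ON (b.id = z.bar_id)
"""
# Without DISTINCT the same foo comes back once per (bar, baz) pair.
plain_rows = con.execute(JOINED.format(distinct="")).fetchall()
distinct_rows = con.execute(JOINED.format(distinct="DISTINCT")).fetchall()
print(len(plain_rows), len(distinct_rows))  # 6 1
```

Without deduplication the single foo would be processed six times here, and the factor grows with every additional 1:n hop.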