[CALCITE-6640] RelMdUniqueKeys grows exponentially when key columns are repeated in projections #4013

Open
wants to merge 2 commits into
base: main

Contributor

@zabetak zabetak commented Oct 23, 2024

Avoid generating non-minimal/redundant key sets when computing the unique keys for columns that are repeated in the output.

Member

@caicancai caicancai left a comment

Overall LGTM


resultBuilder.addAll(Util.transform(product, ImmutableBitSet::union));
// select key1, key1, val1, val2, key2 from ...
// the resulting unique keys would be {{0},{4}}, {{1},{4}}
Member

Maybe this part of the javadoc can be improved a little bit. When I first read this part of the javadoc, I didn’t understand what {0},{4} meant.

Member

for example,

// Select fields key1, key1, val1, val2, and key2
// The query results will return records with unique key combinations
// Example of unique key combinations:
// {{0}, {4}} indicates that the key1 value of the first record is 0, and the key2 value is 4
// {{1}, {4}} indicates that the key1 value of the second record is 1, and the key2 value is 4

Contributor Author

I am afraid that the above suggestion is incorrect. The UniqueKeys metadata does not return record/row values but column ordinals.

To understand these comments, it is important to have first read the Javadoc of the UniqueKeys interface.
{0} refers to the column at position zero, i.e., the first occurrence of column key1 in the query.
{1} refers to the column at position one, i.e., the second occurrence of column key1 in the query.
...
{4} refers to the column at position four, i.e., column key2 in the query.

If this is not clear from the Javadoc of the UniqueKeys interface, then we should improve that part of the documentation and not these internal low-level comments. The comment here assumes that the developer understands what the result of this method is.
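To make the ordinal reading concrete, here is a minimal sketch, not Calcite's code: plain `java.util.BitSet` stands in for `ImmutableBitSet`, and the class `ProjectedKeySketch`, the method `projectKey`, and the input layout (key1=0, val1=1, val2=2, key2=3 with a composite unique key on key1 and key2) are all assumptions made for illustration. It picks exactly one output position per key column, so it never emits the redundant key {0, 1, 4}.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Illustration only: java.util.BitSet stands in for Calcite's ImmutableBitSet. */
final class ProjectedKeySketch {

  /**
   * Maps one input unique key through a projection.
   *
   * @param inputKey   input column ordinals that together form a unique key
   * @param projection projection[i] is the input ordinal emitted at output position i
   * @return output keys that use exactly one output position per key column
   */
  static List<BitSet> projectKey(int[] inputKey, int[] projection) {
    List<BitSet> result = new ArrayList<>();
    result.add(new BitSet()); // start from the empty partial key
    for (int keyColumn : inputKey) {
      List<BitSet> next = new ArrayList<>();
      for (BitSet partial : result) {
        for (int out = 0; out < projection.length; out++) {
          if (projection[out] == keyColumn) {
            BitSet extended = (BitSet) partial.clone();
            extended.set(out); // choose this occurrence of the key column
            next.add(extended);
          }
        }
      }
      result = next; // becomes empty if a key column is not projected at all
    }
    return result;
  }

  public static void main(String[] args) {
    // Assumed input layout: key1=0, val1=1, val2=2, key2=3; unique key (key1, key2).
    int[] inputKey = {0, 3};
    // select key1, key1, val1, val2, key2
    int[] projection = {0, 0, 1, 2, 3};
    System.out.println(projectKey(inputKey, projection)); // prints [{0, 4}, {1, 4}]
  }
}
```

Every number in the printed sets is an output column position, never a row value: 0 and 1 are the two occurrences of key1, and 4 is key2.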

Member

Thank you very much for your answer, I agree with you

@soumyakanti3578
Contributor

+1 on the request to improve documentation. Otherwise, this LGTM!

@zabetak
Contributor Author

zabetak commented Oct 29, 2024

@soumyakanti3578 Can you please elaborate a bit more about the doc improvements that you would like to see? Thanks.

@soumyakanti3578
Contributor

@zabetak I found the documentation explaining the unique key combinations a bit difficult to understand. But I agree with your comment above that this is not the right place to add more detailed documentation, so please ignore my comments about the docs above. Thanks!

Contributor

@julianhyde julianhyde left a comment

You've added a test for minimality but I would go further - make sure every value returned by any RelMdUniqueKeys provider is minimal. I think the execution cost will be (pardon the pun) minimal.

Consider adding a method static ImmutableBitSet areMinimal(Iterable<ImmutableBitSet>) (or List<ImmutableBitSet> if it's more efficient). I have a feeling that the current implementation makes N^2 tests but an implementation could make N * (N - 1) / 2 tests (less than half as many).
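A rough sketch of such a pairwise check is below. It is only an illustration under stated assumptions: it uses `java.util.BitSet` rather than `ImmutableBitSet`, it returns `boolean` (the return type that the name `areMinimal` suggests), and the class `MinimalityCheckSketch` and its helpers `contains` and `bits` are hypothetical. Each unordered pair is visited once, which gives the N * (N - 1) / 2 pair visits mentioned above; equal sets are treated as non-minimal because each contains the other.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Illustration only: java.util.BitSet stands in for Calcite's ImmutableBitSet. */
final class MinimalityCheckSketch {

  /** Returns whether every bit set in {@code sub} is also set in {@code sup}. */
  static boolean contains(BitSet sup, BitSet sub) {
    BitSet rest = (BitSet) sub.clone();
    rest.andNot(sup); // bits of sub that are missing from sup
    return rest.isEmpty();
  }

  /** Returns whether no set in the list contains another set of the list. */
  static boolean areMinimal(List<BitSet> keys) {
    for (int i = 0; i < keys.size(); i++) {
      for (int j = i + 1; j < keys.size(); j++) { // each unordered pair once
        BitSet a = keys.get(i);
        BitSet b = keys.get(j);
        if (contains(a, b) || contains(b, a)) {
          return false;
        }
      }
    }
    return true;
  }

  /** Convenience factory for the examples. */
  static BitSet bits(int... ordinals) {
    BitSet set = new BitSet();
    for (int ordinal : ordinals) {
      set.set(ordinal);
    }
    return set;
  }

  public static void main(String[] args) {
    System.out.println(areMinimal(Arrays.asList(bits(0, 4), bits(1, 4)))); // true
    System.out.println(areMinimal(Arrays.asList(bits(0, 4), bits(0, 1, 4)))); // false
  }
}
```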

@@ -978,6 +978,18 @@ public static boolean allContain(Collection<ImmutableBitSet> bitSets, int bit) {
return true;
}

/**
* Returns whether this is a minimal set with respect to the specified collection of bitSets.
*/
Contributor

Can you give an example in the javadoc?

@zabetak
Contributor Author

zabetak commented Oct 30, 2024

@julianhyde Putting minimality logic on every return of the RelMdUniqueKeys handler doesn't feel right to me. Even if the overhead is minimal, why add seemingly redundant code?

Moreover, if the handler goes rogue and starts to generate non-minimal keys at some place, then chances are that we are going to fail before even arriving at the minimality check/filter.

I don't mind adding the checks/filters if you feel strongly about it, but I see more cons than pros in this approach.

For the record, we already have RelMdUniqueKeys#filterSupersets, which is currently used by the Aggregate handler to ensure that keys are minimal. If we decide to apply the filter in every other handler, then I guess we don't need another method in ImmutableBitSet and probably don't need the minimality check in the tests either.
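For readers unfamiliar with the idea, a superset filter of this kind could look roughly like the sketch below. This is not the actual `RelMdUniqueKeys#filterSupersets` implementation: the class `SupersetFilterSketch`, the helper `bits`, and the use of `java.util.BitSet` are assumptions for illustration. Sorting by cardinality first means any subset is seen before its supersets, so one pass over the sorted list is enough.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;

/** Illustration only: not Calcite's RelMdUniqueKeys#filterSupersets. */
final class SupersetFilterSketch {

  /** Keeps only the sets that contain no other kept set; duplicates collapse to one. */
  static List<BitSet> filterSupersets(Collection<BitSet> keys) {
    List<BitSet> sorted = new ArrayList<>(keys);
    // Smaller sets first, so a subset is always examined before its supersets.
    sorted.sort(Comparator.comparingInt(BitSet::cardinality));
    List<BitSet> minimal = new ArrayList<>();
    for (BitSet candidate : sorted) {
      boolean redundant = false;
      for (BitSet kept : minimal) {
        BitSet missing = (BitSet) kept.clone();
        missing.andNot(candidate); // bits of kept that candidate lacks
        if (missing.isEmpty()) { // candidate contains an already kept key
          redundant = true;
          break;
        }
      }
      if (!redundant) {
        minimal.add(candidate);
      }
    }
    return minimal;
  }

  /** Convenience factory for the example. */
  static BitSet bits(int... ordinals) {
    BitSet set = new BitSet();
    for (int ordinal : ordinals) {
      set.set(ordinal);
    }
    return set;
  }

  public static void main(String[] args) {
    List<BitSet> keys = Arrays.asList(bits(0, 1, 4), bits(0, 4), bits(1, 4));
    System.out.println(filterSupersets(keys)); // prints [{0, 4}, {1, 4}]
  }
}
```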
