Skip to content

Conversation

kmozaid
Copy link
Contributor

@kmozaid kmozaid commented Sep 17, 2025

This PR adds support for multi-value columns in startree index aggregates.

select AVGMV(DivLongestGTimes) from airlineStats limit 10

{
        "dimensionsSplitOrder": [
          "AirlineID"
        ],
        "functionColumnPairs": [
          "AVGMV__DivLongestGTimes",
          "COUNTMV__DivLongestGTimes",
          "SUMMV__DivLongestGTimes"
        ],
        "maxLeafRecords": 10
      }

@kmozaid kmozaid force-pushed the startree_aggregate_on_mv_columns branch 2 times, most recently from a9edceb to 236ba6a Compare September 17, 2025 14:00
@codecov-commenter
Copy link

codecov-commenter commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 47.14286% with 74 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.53%. Comparing base (f4bc04d) to head (9b12962).

Files with missing lines Patch % Lines
...aggregation/function/AvgMVAggregationFunction.java 30.76% 16 Missing and 2 partials ⚠️
...gregation/function/CountMVAggregationFunction.java 0.00% 11 Missing and 2 partials ⚠️
...aggregation/function/SumMVAggregationFunction.java 0.00% 11 Missing and 2 partials ⚠️
...gment/local/aggregator/CountMVValueAggregator.java 58.33% 6 Missing and 4 partials ⚠️
...segment/local/aggregator/SumMVValueAggregator.java 58.33% 6 Missing and 4 partials ⚠️
...segment/local/aggregator/AvgMVValueAggregator.java 70.96% 4 Missing and 5 partials ⚠️
...he/pinot/segment/local/utils/TableConfigUtils.java 66.66% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #16836      +/-   ##
============================================
- Coverage     63.54%   63.53%   -0.01%     
  Complexity     1419     1419              
============================================
  Files          3080     3083       +3     
  Lines        181656   181788     +132     
  Branches      27861    27880      +19     
============================================
+ Hits         115428   115504      +76     
- Misses        57359    57401      +42     
- Partials       8869     8883      +14     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.47% <47.14%> (-0.04%) ⬇️
java-21 63.50% <47.14%> (+0.01%) ⬆️
temurin 63.53% <47.14%> (-0.01%) ⬇️
unittests 63.53% <47.14%> (-0.01%) ⬇️
unittests1 56.34% <45.00%> (-0.02%) ⬇️
unittests2 33.71% <10.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kmozaid kmozaid marked this pull request as draft September 17, 2025 15:48
@kmozaid kmozaid force-pushed the startree_aggregate_on_mv_columns branch from 236ba6a to 85c1e22 Compare September 18, 2025 09:55
@kmozaid kmozaid force-pushed the startree_aggregate_on_mv_columns branch from 85c1e22 to fdbc976 Compare September 18, 2025 10:14
@kmozaid kmozaid marked this pull request as ready for review September 18, 2025 10:15
@kmozaid kmozaid force-pushed the startree_aggregate_on_mv_columns branch from bc3d919 to 7b8d65d Compare October 6, 2025 05:29
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

Since we need to handle aggregated values the same way as single-valued input, another approach would be to support MV in regular Count, Sum, Avg. Essentially add a SV/MV check in these functions so that they can handle MV columns. This way user can directly use the regular function on MV columns.
IMO this way is more user friendly. Do you want to also try this approach out? You can open a separate PR and we can compare the 2 approaches

@Jackie-Jiang
Copy link
Contributor

You can take a look at #17007 to get an idea for the alternative approach.
cc @yashmayya to also take a look at this PR

@kmozaid
Copy link
Contributor Author

kmozaid commented Oct 15, 2025

Very nice!

Since we need to handle aggregated values the same way as single-valued input, another approach would be to support MV in regular Count, Sum, Avg. Essentially add a SV/MV check in these functions so that they can handle MV columns. This way user can directly use the regular function on MV columns. IMO this way is more user friendly. Do you want to also try this approach out? You can open a separate PR and we can compare the 2 approaches

Thanks for reviewing.
Yes, we can do that but anyway, we need to have support for mv functions since they exists.
Can I create another PR (once current PR is merged) for adding support in regular functions since I need to anyway add support more functions like distinctcounbitmap, distinctcounthll etc, would that work? Currently, I would prefer to get review comments resolved and get this PR merged.

@kmozaid kmozaid requested a review from Jackie-Jiang October 15, 2025 03:54
@kmozaid
Copy link
Contributor Author

kmozaid commented Oct 15, 2025

You can take a look at #17007 to get an idea for the alternative approach. cc @yashmayya to also take a look at this PR

thanks, I have seen this kind of support in DistinctCountSmartHLLAggregationFunction but for StarTree aggregation supports on mv, its not enough, the ValueAggregator needs to change as well. I will raise another PR for supporting Startree aggregates on MV columns in regular functions.

@Jackie-Jiang
Copy link
Contributor

If we end up taking another approach, we will integrate AvgMVValueAggregator in this PR into the existing AvgValueAggregator, and not having AvgMV function for Star-Tree Index. Basically the changes in this PR is no longer needed, and Avg can be used to handle both SV and MV. I feel that is a cleaner solution (no need to handle SV in MV function), and more user friendly.

@kmozaid
Copy link
Contributor Author

kmozaid commented Oct 16, 2025

If we end up taking another approach, we will integrate AvgMVValueAggregator in this PR into the existing AvgValueAggregator, and not having AvgMV function for Star-Tree Index. Basically the changes in this PR is no longer needed, and Avg can be used to handle both SV and MV. I feel that is a cleaner solution (no need to handle SV in MV function), and more user friendly.

Ok I think, I understood what you are suggesting -

  1. We don't need to support StarTree index with MV functions, like MultiValueFieldName__AVGMV (instead one must use regular function MultiValueFieldName__AVG).
  2. Instead of implementing AvgMVValueAggregator, AvgValueAggregator` should be updated for supporting StarTree creation for MultiValueField.
  3. The current implementation of single value functions, for e.g. AvgAggregationFunction function does not support multi value field aggregation, the support needs to be added. The regular function implicitly support StarTree single value requirement. For AVG, it has check blockValSet.getValueType() != DataType.BYTES which is for StarTree and for SUM, it has a case statement handling Double data type which is also the type of StarTree aggregated data.

@kmozaid
Copy link
Contributor Author

kmozaid commented Oct 16, 2025

Avg can be used to handle both SV and MV

AvgMVAggregationFunction code will need to be moved into AvgAggregationFunction, including aggregateGroupBySV and aggregateGroupByMV functions code for handling normal query execution (without StarTree index), and that will make AvgMVAggregationFunction obsolete. Same will need to be done for other functions like SUM, COUNT etc, also StartTree index creation has special handling for count(*) on single value field, which will need to be changed.
I strongly feel that, this kind of change warrant a separate PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants