-
Notifications
You must be signed in to change notification settings - Fork 176
Description
Query Information
PPL Command/Query:
source=test_bool_join
| stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag
| fields category, has_flag, avg_val, cnt
| join left=L right=R ON L.has_flag = R.has_flag
[source=test_bool_join | stats AVG(value) as overall_avg by has_flag]
Expected Result:
The query should return 6 rows (3 with has_flag=true
and 3 with has_flag=false
), with each row from the left side matched to the corresponding aggregated value from the right side based on the boolean field.
Expected output:
{
"datarows": [
["A", true, 10.0, 1, 20.0, true],
["B", true, 20.0, 1, 20.0, true],
["C", true, 30.0, 1, 20.0, true],
["D", false, 40.0, 1, 50.0, false],
["E", false, 50.0, 1, 50.0, false],
["F", false, 60.0, 1, 50.0, false]
],
"total": 6
}
Actual Result:
The query returns only 3 rows (all with has_flag=true
). The rows with has_flag=false
are completely missing from the result set.
{
"datarows": [
["A", true, 10.0, 1, 20.0, true],
["B", true, 20.0, 1, 20.0, true],
["C", true, 30.0, 1, 20.0, true]
],
"total": 3
}
Dataset Information
Dataset/Schema Type
- Custom (details below)
Index Mapping
{
"mappings": {
"properties": {
"category": { "type": "keyword" },
"has_flag": { "type": "boolean" },
"value": { "type": "integer" }
}
}
}
Sample Data
{"category":"A","has_flag":true,"value":10}
{"category":"B","has_flag":true,"value":20}
{"category":"C","has_flag":true,"value":30}
{"category":"D","has_flag":false,"value":40}
{"category":"E","has_flag":false,"value":50}
{"category":"F","has_flag":false,"value":60}
Bug Description
Issue Summary:
PPL join operations on boolean fields return incomplete results when a fields
command is used before the join. Specifically, when the left side of the join has multiple rows with the same boolean value, only rows with true
values are matched and returned. Rows with false
values are completely omitted from the result set.
Steps to Reproduce:
- Create an index with a boolean field:
curl -X PUT "http://localhost:9200/test_bool_join" -H 'Content-Type: application/json' -d'
{
"mappings": {
"properties": {
"category": { "type": "keyword" },
"has_flag": { "type": "boolean" },
"value": { "type": "integer" }
}
}
}'
- Insert test data (6 documents: 3 with
true
, 3 withfalse
):
curl -X POST "http://localhost:9200/test_bool_join/_bulk" -H 'Content-Type: application/json' -d'
{"index":{}}
{"category":"A","has_flag":true,"value":10}
{"index":{}}
{"category":"B","has_flag":true,"value":20}
{"index":{}}
{"category":"C","has_flag":true,"value":30}
{"index":{}}
{"category":"D","has_flag":false,"value":40}
{"index":{}}
{"category":"E","has_flag":false,"value":50}
{"index":{}}
{"category":"F","has_flag":false,"value":60}
'
- Execute the failing query:
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d'
{
"query": "source=test_bool_join | stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag | fields category, has_flag, avg_val, cnt | join left=L right=R ON L.has_flag = R.has_flag [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]"
}'
- Observe that only 3 rows are returned (all with
has_flag=true
)
Key Finding:
The bug is specifically triggered by the combination of:
- A
fields
command (LogicalProject) before the join - A join condition on a boolean field (or potentially other primitive types)
- Multiple rows on the left side with the same boolean value
Proof: Without the fields
command, the same query returns all 6 rows correctly:
source=test_bool_join
| stats AVG(value) as avg_val by category, has_flag
| join left=L right=R ON L.has_flag = R.has_flag
[source=test_bool_join | stats AVG(value) as overall_avg by has_flag]
This returns all 6 rows as expected.
Impact:
- Queries using joins on boolean fields produce incorrect and incomplete results
- Data analysis involving boolean field joins is unreliable
- The issue is difficult to diagnose because it only manifests when
fields
command is used before the join - Users must avoid using
fields
before joins on boolean fields, or cast boolean to integer as a workaround
Tentative Root Cause Analysis
Based on code analysis and testing, the issue appears to be in the sort pushdown implementation for composite aggregations:
Location: ./opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/PushDownContext.java
, method pushDownSortIntoAggBucket()
, lines 276-282
Analysis:
When a fields
command is used before a join on a boolean field, the physical plan pushes the sort operation down to OpenSearch's composite aggregation (indicated by SORT->[1]
in the PushDownContext). The method pushDownSortIntoAggBucket()
attempts to reorder the composite aggregation sources to match the required sort order for the merge join.
However, the code contains a potential issue:
IntStream.range(0, buckets.size())
.mapToObj(fieldNames::get)
.filter(name -> !selected.contains(name))
.forEach(name -> {
newBuckets.add(buckets.get(bucketNames.indexOf(name)));
newBucketNames.add(name);
});
This code assumes that fieldNames[0..buckets.size()-1]
contains the bucket field names in the same order as the buckets
list. However, after a LogicalProject
(fields command), the output schema may be reordered, breaking this assumption.
Observed Behavior:
- Without
fields
: Calcite performs the sort (EnumerableSort), and the merge join works correctly - With
fields
: Sort is pushed down to OpenSearch, but the composite aggregation sources are not reordered correctly, causing the merge join to fail
Tentative Proposed Fix
The fix may involve changing the iteration logic to use bucketNames
directly instead of indexing into fieldNames
:
// Instead of:
IntStream.range(0, buckets.size()).mapToObj(fieldNames::get)
// Use:
bucketNames.stream()
This ensures that the bucket lookup is based on the actual bucket names rather than assumptions about field ordering.
Note: This fix requires thorough testing and validation to ensure it doesn't break existing functionality.
Workaround
Avoid using the fields
command before joins on boolean fields, or cast boolean fields to integers:
source=test_bool_join
| eval flag_int=cast(has_flag as int)
| stats AVG(value) as avg_val, COUNT() as cnt by category, flag_int
| fields category, flag_int, avg_val, cnt
| join left=L right=R ON L.flag_int = R.flag_int
[source=test_bool_join
| eval flag_int=cast(has_flag as int)
| stats AVG(value) as overall_avg by flag_int]
Environment Information
OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (build hash: e0d622ddc73bdab552d94427dfeb3e101b026ce2, build date: 2025-10-13T02:01:20.669798Z)
Additional Details:
- SQL plugin version: bundled with OpenSearch 3.4.0-SNAPSHOT
- The issue affects the Calcite-based PPL engine
- The bug is reproducible with minimal test data
- Related to merge join implementation and sort pushdown optimization
Metadata
Metadata
Assignees
Labels
Type
Projects
Status