Skip to content

[BUG] PPL join on boolean fields returns incomplete results when used with fields command #4529

@alexey-temnikov

Description

@alexey-temnikov

Query Information

PPL Command/Query:

source=test_bool_join 
| stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag 
| fields category, has_flag, avg_val, cnt 
| join left=L right=R ON L.has_flag = R.has_flag 
  [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]

Expected Result:
The query should return 6 rows (3 with has_flag=true and 3 with has_flag=false), with each row from the left side matched to the corresponding aggregated value from the right side based on the boolean field.

Expected output:

{
  "datarows": [
    ["A", true, 10.0, 1, 20.0, true],
    ["B", true, 20.0, 1, 20.0, true],
    ["C", true, 30.0, 1, 20.0, true],
    ["D", false, 40.0, 1, 50.0, false],
    ["E", false, 50.0, 1, 50.0, false],
    ["F", false, 60.0, 1, 50.0, false]
  ],
  "total": 6
}

Actual Result:
The query returns only 3 rows (all with has_flag=true). The rows with has_flag=false are completely missing from the result set.

{
  "datarows": [
    ["A", true, 10.0, 1, 20.0, true],
    ["B", true, 20.0, 1, 20.0, true],
    ["C", true, 30.0, 1, 20.0, true]
  ],
  "total": 3
}

Dataset Information

Dataset/Schema Type

  • Custom (details below)

Index Mapping

{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "has_flag": { "type": "boolean" },
      "value": { "type": "integer" }
    }
  }
}

Sample Data

{"category":"A","has_flag":true,"value":10}
{"category":"B","has_flag":true,"value":20}
{"category":"C","has_flag":true,"value":30}
{"category":"D","has_flag":false,"value":40}
{"category":"E","has_flag":false,"value":50}
{"category":"F","has_flag":false,"value":60}

Bug Description

Issue Summary:
PPL join operations on boolean fields return incomplete results when a fields command is used before the join. Specifically, when the left side of the join has multiple rows with the same boolean value, only rows with true values are matched and returned. Rows with false values are completely omitted from the result set.

Steps to Reproduce:

  1. Create an index with a boolean field:
curl -X PUT "http://localhost:9200/test_bool_join" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "has_flag": { "type": "boolean" },
      "value": { "type": "integer" }
    }
  }
}'
  1. Insert test data (6 documents: 3 with true, 3 with false):
curl -X POST "http://localhost:9200/test_bool_join/_bulk" -H 'Content-Type: application/json' -d'
{"index":{}}
{"category":"A","has_flag":true,"value":10}
{"index":{}}
{"category":"B","has_flag":true,"value":20}
{"index":{}}
{"category":"C","has_flag":true,"value":30}
{"index":{}}
{"category":"D","has_flag":false,"value":40}
{"index":{}}
{"category":"E","has_flag":false,"value":50}
{"index":{}}
{"category":"F","has_flag":false,"value":60}
'
  1. Execute the failing query:
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d'
{
  "query": "source=test_bool_join | stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag | fields category, has_flag, avg_val, cnt | join left=L right=R ON L.has_flag = R.has_flag [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]"
}'
  1. Observe that only 3 rows are returned (all with has_flag=true)

Key Finding:
The bug is specifically triggered by the combination of:

  1. A fields command (LogicalProject) before the join
  2. A join condition on a boolean field (or potentially other primitive types)
  3. Multiple rows on the left side with the same boolean value

Proof: Without the fields command, the same query returns all 6 rows correctly:

source=test_bool_join 
| stats AVG(value) as avg_val by category, has_flag 
| join left=L right=R ON L.has_flag = R.has_flag 
  [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]

This returns all 6 rows as expected.

Impact:

  • Queries using joins on boolean fields produce incorrect and incomplete results
  • Data analysis involving boolean field joins is unreliable
  • The issue is difficult to diagnose because it only manifests when fields command is used before the join
  • Users must avoid using fields before joins on boolean fields, or cast boolean to integer as a workaround

Tentative Root Cause Analysis

⚠️ This is a preliminary analysis and requires further investigation.

Based on code analysis and testing, the issue appears to be in the sort pushdown implementation for composite aggregations:

Location: ./opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/PushDownContext.java, method pushDownSortIntoAggBucket(), lines 276-282

Analysis:
When a fields command is used before a join on a boolean field, the physical plan pushes the sort operation down to OpenSearch's composite aggregation (indicated by SORT->[1] in the PushDownContext). The method pushDownSortIntoAggBucket() attempts to reorder the composite aggregation sources to match the required sort order for the merge join.

However, the code contains a potential issue:

IntStream.range(0, buckets.size())
    .mapToObj(fieldNames::get)
    .filter(name -> !selected.contains(name))
    .forEach(name -> {
        newBuckets.add(buckets.get(bucketNames.indexOf(name)));
        newBucketNames.add(name);
    });

This code assumes that fieldNames[0..buckets.size()-1] contains the bucket field names in the same order as the buckets list. However, after a LogicalProject (fields command), the output schema may be reordered, breaking this assumption.

Observed Behavior:

  • Without fields: Calcite performs the sort (EnumerableSort), and the merge join works correctly
  • With fields: Sort is pushed down to OpenSearch, but the composite aggregation sources are not reordered correctly, causing the merge join to fail

Tentative Proposed Fix

⚠️ This is a preliminary proposal and requires investigation and validation.

The fix may involve changing the iteration logic to use bucketNames directly instead of indexing into fieldNames:

// Instead of:
IntStream.range(0, buckets.size()).mapToObj(fieldNames::get)

// Use:
bucketNames.stream()

This ensures that the bucket lookup is based on the actual bucket names rather than assumptions about field ordering.

Note: This fix requires thorough testing and validation to ensure it doesn't break existing functionality.

Workaround

Avoid using the fields command before joins on boolean fields, or cast boolean fields to integers:

source=test_bool_join 
| eval flag_int=cast(has_flag as int) 
| stats AVG(value) as avg_val, COUNT() as cnt by category, flag_int 
| fields category, flag_int, avg_val, cnt 
| join left=L right=R ON L.flag_int = R.flag_int 
  [source=test_bool_join 
   | eval flag_int=cast(has_flag as int) 
   | stats AVG(value) as overall_avg by flag_int]

Environment Information

OpenSearch Version:
OpenSearch 3.4.0-SNAPSHOT (build hash: e0d622ddc73bdab552d94427dfeb3e101b026ce2, build date: 2025-10-13T02:01:20.669798Z)

Additional Details:

  • SQL plugin version: bundled with OpenSearch 3.4.0-SNAPSHOT
  • The issue affects the Calcite-based PPL engine
  • The bug is reproducible with minimal test data
  • Related to merge join implementation and sort pushdown optimization

Metadata

Metadata

Assignees

Labels

PPLPiped processing languagebugSomething isn't working

Type

No type

Projects

Status

Done

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions