[BUG] PPL `join` on boolean fields returns incomplete results when used with `fields` command

## Query Information
**PPL Command/Query:**
```ppl
source=test_bool_join 
| stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag 
| fields category, has_flag, avg_val, cnt 
| join left=L right=R ON L.has_flag = R.has_flag 
  [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]
```

**Expected Result:**
The query should return 6 rows (3 with `has_flag=true` and 3 with `has_flag=false`), with each row from the left side matched to the corresponding aggregated value from the right side based on the boolean field.

Expected output:
```json
{
  "datarows": [
    ["A", true, 10.0, 1, 20.0, true],
    ["B", true, 20.0, 1, 20.0, true],
    ["C", true, 30.0, 1, 20.0, true],
    ["D", false, 40.0, 1, 50.0, false],
    ["E", false, 50.0, 1, 50.0, false],
    ["F", false, 60.0, 1, 50.0, false]
  ],
  "total": 6
}
```

**Actual Result:**
The query returns only 3 rows (all with `has_flag=true`). The rows with `has_flag=false` are completely missing from the result set.

```json
{
  "datarows": [
    ["A", true, 10.0, 1, 20.0, true],
    ["B", true, 20.0, 1, 20.0, true],
    ["C", true, 30.0, 1, 20.0, true]
  ],
  "total": 3
}
```

## Dataset Information
**Dataset/Schema Type**
- [x] Custom (details below)

**Index Mapping**
```json
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "has_flag": { "type": "boolean" },
      "value": { "type": "integer" }
    }
  }
}
```

**Sample Data**
```json
{"category":"A","has_flag":true,"value":10}
{"category":"B","has_flag":true,"value":20}
{"category":"C","has_flag":true,"value":30}
{"category":"D","has_flag":false,"value":40}
{"category":"E","has_flag":false,"value":50}
{"category":"F","has_flag":false,"value":60}
```

## Bug Description
**Issue Summary:**
PPL join operations on boolean fields return incomplete results when a `fields` command is used before the join. Specifically, when the left side of the join has multiple rows with the same boolean value, only rows with `true` values are matched and returned. Rows with `false` values are completely omitted from the result set.

**Steps to Reproduce:**
1. Create an index with a boolean field:
```bash
curl -X PUT "http://localhost:9200/test_bool_join" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "category": { "type": "keyword" },
      "has_flag": { "type": "boolean" },
      "value": { "type": "integer" }
    }
  }
}'
```

2. Insert test data (6 documents: 3 with `true`, 3 with `false`):
```bash
curl -X POST "http://localhost:9200/test_bool_join/_bulk" -H 'Content-Type: application/json' -d'
{"index":{}}
{"category":"A","has_flag":true,"value":10}
{"index":{}}
{"category":"B","has_flag":true,"value":20}
{"index":{}}
{"category":"C","has_flag":true,"value":30}
{"index":{}}
{"category":"D","has_flag":false,"value":40}
{"index":{}}
{"category":"E","has_flag":false,"value":50}
{"index":{}}
{"category":"F","has_flag":false,"value":60}
'
```

3. Execute the failing query:
```bash
curl -X POST "http://localhost:9200/_plugins/_ppl" -H 'Content-Type: application/json' -d'
{
  "query": "source=test_bool_join | stats AVG(value) as avg_val, COUNT() as cnt by category, has_flag | fields category, has_flag, avg_val, cnt | join left=L right=R ON L.has_flag = R.has_flag [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]"
}'
```

4. Observe that only 3 rows are returned (all with `has_flag=true`)

**Key Finding:**
The bug is specifically triggered by the combination of:
1. A `fields` command (LogicalProject) before the join
2. A join condition on a boolean field (or potentially other primitive types)
3. Multiple rows on the left side with the same boolean value

**Proof**: Without the `fields` command, the same query returns all 6 rows correctly:
```ppl
source=test_bool_join 
| stats AVG(value) as avg_val by category, has_flag 
| join left=L right=R ON L.has_flag = R.has_flag 
  [source=test_bool_join | stats AVG(value) as overall_avg by has_flag]
```
This returns all 6 rows as expected.

**Impact:**
- Queries using joins on boolean fields produce incorrect and incomplete results
- Data analysis involving boolean field joins is unreliable
- The issue is difficult to diagnose because it only manifests when `fields` command is used before the join
- Users must avoid using `fields` before joins on boolean fields, or cast boolean to integer as a workaround

## Tentative Root Cause Analysis
**⚠️ This is a preliminary analysis and requires further investigation.**

Based on code analysis and testing, the issue appears to be in the sort pushdown implementation for composite aggregations:

**Location**: `./opensearch/src/main/java/org/opensearch/sql/opensearch/storage/scan/PushDownContext.java`, method `pushDownSortIntoAggBucket()`, lines 276-282

**Analysis**:
When a `fields` command is used before a join on a boolean field, the physical plan pushes the sort operation down to OpenSearch's composite aggregation (indicated by `SORT->[1]` in the PushDownContext). The method `pushDownSortIntoAggBucket()` attempts to reorder the composite aggregation sources to match the required sort order for the merge join.

However, the code contains a potential issue:
```java
IntStream.range(0, buckets.size())
    .mapToObj(fieldNames::get)
    .filter(name -> !selected.contains(name))
    .forEach(name -> {
        newBuckets.add(buckets.get(bucketNames.indexOf(name)));
        newBucketNames.add(name);
    });
```

This code assumes that `fieldNames[0..buckets.size()-1]` contains the bucket field names in the same order as the `buckets` list. However, after a `LogicalProject` (fields command), the output schema may be reordered, breaking this assumption.

**Observed Behavior**:
- Without `fields`: Calcite performs the sort (EnumerableSort), and the merge join works correctly
- With `fields`: Sort is pushed down to OpenSearch, but the composite aggregation sources are not reordered correctly, causing the merge join to fail

## Tentative Proposed Fix
**⚠️ This is a preliminary proposal and requires investigation and validation.**

The fix may involve changing the iteration logic to use `bucketNames` directly instead of indexing into `fieldNames`:

```java
// Instead of:
IntStream.range(0, buckets.size()).mapToObj(fieldNames::get)

// Use:
bucketNames.stream()
```

This ensures that the bucket lookup is based on the actual bucket names rather than assumptions about field ordering.

**Note**: This fix requires thorough testing and validation to ensure it doesn't break existing functionality.

## Workaround
Avoid using the `fields` command before joins on boolean fields, or cast boolean fields to integers:

```ppl
source=test_bool_join 
| eval flag_int=cast(has_flag as int) 
| stats AVG(value) as avg_val, COUNT() as cnt by category, flag_int 
| fields category, flag_int, avg_val, cnt 
| join left=L right=R ON L.flag_int = R.flag_int 
  [source=test_bool_join 
   | eval flag_int=cast(has_flag as int) 
   | stats AVG(value) as overall_avg by flag_int]
```

## Environment Information
**OpenSearch Version:**
OpenSearch 3.4.0-SNAPSHOT (build hash: e0d622ddc73bdab552d94427dfeb3e101b026ce2, build date: 2025-10-13T02:01:20.669798Z)

**Additional Details:**
- SQL plugin version: bundled with OpenSearch 3.4.0-SNAPSHOT
- The issue affects the Calcite-based PPL engine
- The bug is reproducible with minimal test data
- Related to merge join implementation and sort pushdown optimization

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] PPL `join` on boolean fields returns incomplete results when used with `fields` command #4529

Query Information

Dataset Information

Bug Description

Tentative Root Cause Analysis

Tentative Proposed Fix

Workaround

Environment Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] PPL join on boolean fields returns incomplete results when used with fields command #4529

Description

Query Information

Dataset Information

Bug Description

Tentative Root Cause Analysis

Tentative Proposed Fix

Workaround

Environment Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[BUG] PPL `join` on boolean fields returns incomplete results when used with `fields` command #4529