
fix: skip empty metadata in intersect_metadata_for_union to prevent s…#21127

Open
RafaelHerrero wants to merge 4 commits into apache:main from RafaelHerrero:fix/union-metadata-intersection-19049


@RafaelHerrero

Which issue does this PR close?

Closes #19049.

Rationale for this change

We're building a SQL engine on top of DataFusion and hit this while running benchmarks. The trigger is a UNION ALL query against Parquet files that carry field metadata (like PARQUET:field_id or InfluxDB's iox::column::type). When one branch of the union has a NULL literal, intersect_metadata_for_union intersects the metadata from the data source with the empty metadata from the NULL. Since intersecting anything with an empty set gives an empty set, all metadata gets wiped out.
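To make the failure mode concrete, here is a minimal sketch of a plain key/value intersection over metadata maps. The name `naive_intersect` and the `Metadata` alias are hypothetical stand-ins for illustration; this is not the DataFusion source, only an approximation of the behavior described above.

```rust
use std::collections::HashMap;

// Hypothetical alias for field metadata; assumed to be a string-to-string map.
type Metadata = HashMap<String, String>;

/// Plain intersection: a key survives only if every branch
/// carries it with the same value.
fn naive_intersect<'a>(metadatas: impl IntoIterator<Item = &'a Metadata>) -> Metadata {
    let mut iter = metadatas.into_iter();

    // No inputs at all: nothing to intersect.
    let Some(first) = iter.next() else {
        return Metadata::new();
    };

    let mut result = first.clone();
    for branch in iter {
        // An empty `branch` matches no key, so this clears `result`.
        result.retain(|k, v| branch.get(k) == Some(&*v));
    }
    result
}
```

With one branch holding `{"iox::column::type": "..."}` and another holding `{}` (the NULL literal), the `retain` call removes every key, which is how the Union field ends up with empty metadata.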

Later, when optimize_projections prunes columns and recompute_schema rebuilds the Union schema, the logical schema has {} while the physical schema still has the original metadata from Parquet. This causes:

Internal error: Physical input schema should be the same as the one
converted from logical input schema.
Differences:
  - field metadata at index 0 [usage_idle]: (physical) {"iox::column::type": "..."} vs (logical) {}

As @erratic-pattern and @alamb discussed in the issue, empty metadata from NULL literals isn't saying "this field has no metadata" — it's saying "I don't know." It shouldn't erase metadata from branches that actually have it.

I fixed this in intersect_metadata_for_union directly rather than patching optimize_projections or recompute_schema, since that's where the bad intersection happens and it covers all code paths that derive Union schemas.

What changes are included in this PR?

One change to intersect_metadata_for_union in datafusion/expr/src/expr.rs: branches with empty metadata are skipped during intersection instead of participating. Non-empty branches still intersect normally (conflicting values still get dropped). If every branch is empty, the result is empty — same as before.
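The rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the patched code itself: the function name `intersect_skipping_empty` and the `Metadata` alias are hypothetical, and the real function's signature may differ.

```rust
use std::collections::HashMap;

// Hypothetical alias for field metadata; assumed to be a string-to-string map.
type Metadata = HashMap<String, String>;

/// Intersect metadata across union branches, treating empty maps as
/// "unknown" so they do not take part in the intersection.
fn intersect_skipping_empty<'a>(
    metadatas: impl IntoIterator<Item = &'a Metadata>,
) -> Metadata {
    // Skip branches with no metadata (e.g. NULL literals).
    let mut non_empty = metadatas.into_iter().filter(|m| !m.is_empty());

    // Every branch empty, or no inputs at all: the result is empty.
    let Some(first) = non_empty.next() else {
        return Metadata::new();
    };

    let mut result = first.clone();
    for branch in non_empty {
        // Non-empty branches still intersect normally: a key is kept
        // only if this branch has the same value for it.
        result.retain(|k, v| branch.get(k) == Some(&*v));
    }
    result
}
```

In this sketch the order of branches does not matter: an empty branch in first position is filtered out just like one in any other position, so the first non-empty branch seeds the result.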

Are these changes tested?

Added 7 unit tests for intersect_metadata_for_union:

  • Same metadata across branches — preserved
  • Conflicting non-empty values — dropped (existing behavior, unchanged)
  • One branch has metadata, other is empty — metadata preserved (the fix)
  • Empty branch comes first — still works
  • All branches empty — empty result
  • Mix of empty and conflicting non-empty — intersects only the non-empty ones
  • No inputs — empty result

The full end-to-end reproduction needs Parquet files with field metadata as described in the issue. The unit tests cover the intersection logic directly.

Are there any user-facing changes?

No API changes. UNION ALL queries combining metadata-carrying sources with NULL literals will stop failing with schema mismatch errors.

RafaelHerrero and others added 2 commits March 23, 2026 00:12
…chema mismatch

When a UNION ALL combines columns from sources with field metadata
(e.g. Parquet) and NULL literals (which have no metadata), the
intersect_metadata_for_union function was dropping all metadata
because intersecting anything with an empty set yields an empty set.

After optimize_projections prunes unused columns and recompute_schema
rebuilds the Union via Union::try_new, the logical schema ends up
with empty metadata while the physical schema retains the original
field metadata from Parquet, causing a physical/logical schema
mismatch error.

The fix treats empty metadata as a non-vote in the intersection:
branches with no metadata (NULL literals, computed expressions) are
skipped, so only branches with actual metadata participate. When
non-empty branches conflict, their metadata is still correctly
intersected as before.

Closes apache#19049
@adriangb
Contributor

Could we add an SLT reproducer?

Add a regression test to metadata.slt that exercises the
optimize_projections column pruning path on a UNION ALL with
NULL literals and a table with field metadata.
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Mar 23, 2026
@RafaelHerrero
Author

Added an SLT reproducer in metadata.slt. The test uses table_with_metadata (which has field-level metadata) in a UNION ALL with NULL literals, and includes an unused column (id) so that optimize_projections prunes it, triggering the recompute_schema → intersect_metadata_for_union path that was dropping metadata.

@alamb
Contributor

alamb commented Mar 24, 2026

Thank you @RafaelHerrero

@erratic-pattern any chance you are able to help review this PR?


Labels

logical-expr: Logical plan and expressions
sqllogictest: SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Error with union and optimize_projections: Physical input schema should be the same as the one converted from logical input schema
