Skip to content

[AMORO-4131] Add non-partitioned table tests for RewriteFiles and OverwriteFiles#4137

Open
Takch02 wants to merge 5 commits intoapache:masterfrom
Takch02:add-non-partitioned-tests
Open

[AMORO-4131] Add non-partitioned table tests for RewriteFiles and OverwriteFiles#4137
Takch02 wants to merge 5 commits intoapache:masterfrom
Takch02:add-non-partitioned-tests

Conversation

@Takch02
Copy link

@Takch02 Takch02 commented Mar 20, 2026

Motivation

Currently, TestRewriteFiles and TestOverwriteFiles only verify the behavior of partitioned tables, expecting a CannotAlterHiveLocationException when partial rewrites or overwrites attempt to reference multiple directories. Non-partitioned tables, however, are not bound by Hive's strict 1-partition-1-folder constraint and should successfully commit these operations. This PR adds the missing coverage for non-partitioned tables.

Key Changes

Added NonPartitioned methods in TestRewriteFiles and TestOverwriteFiles using Assume.assumeFalse(isPartitionedTable()).

Verified successful commits instead of expecting exceptions, as non-partitioned tables gracefully handle files across multiple directories.

Introduced Dynamic Assertions: Replaced hardcoded integer values with dynamic size assertions to prevent fragile tests. This ensures the tests remain robust regardless of table properties (e.g., Primary Key bucketing or target file size limits).

Comprehensive State Verification: Implemented a two-step verification process for each test:

Action Verification: Checked the exact number of files added in the latest transaction using lastedAddedFiles().

State Verification: Checked the total number of live files in the current snapshot using baseStore.newScan().planFiles().

Example of Verification Logic:

Rewrite: Verified afterFiles.size() and totalLiveFiles against addFiles.size().

Overwrite: Verified afterFiles.size() against secondDataFiles.size(), and totalLiveFiles against firstDataFiles.size() + secondDataFiles.size().

@github-actions github-actions bot added the module:mixed-hive Hive moduel for Mixed Format label Mar 20, 2026
@Takch02
Copy link
Author

Takch02 commented Mar 20, 2026

I re-uploaded it after checking the CI and adhering to code conventions.

@codecov-commenter
Copy link

codecov-commenter commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.90%. Comparing base (cc03949) to head (042e849).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master    #4137   +/-   ##
=========================================
  Coverage     29.90%   29.90%           
  Complexity     4160     4160           
=========================================
  Files           670      670           
  Lines         53533    53533           
  Branches       6797     6797           
=========================================
+ Hits          16008    16010    +2     
+ Misses        36343    36341    -2     
  Partials       1182     1182           
Flag Coverage Δ
core 29.90% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Takch02 Takch02 changed the title Add non-partitioned table tests for RewriteFiles and OverwriteFiles Add non-partitioned table tests for RewriteFiles and OverwriteFiles #4131 Mar 24, 2026
@Takch02 Takch02 changed the title Add non-partitioned table tests for RewriteFiles and OverwriteFiles #4131 [AMORO-4131] Add non-partitioned table tests for RewriteFiles and OverwriteFiles Mar 24, 2026
@Takch02
Copy link
Author

Takch02 commented Mar 24, 2026

@xxubai
Please review the PR! Thank you.

Copy link
Contributor

@xxubai xxubai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the non-partitioned coverage. I think the new assertions are still a bit weak for the expected-success cases.

For unpartitioned tables, the current implementation still updates the Hive table location to only the directory of the first added file. So if the rewritten/overwritten files are actually under multiple directories, the commit may succeed, but Hive may still see only part of the data.

Because of that, checking only lastedAddedFiles() and planFiles() does not seem sufficient here. These assertions show that the Iceberg commit succeeded, but they do not verify that the Hive-side result is actually correct.

@Takch02
Copy link
Author

Takch02 commented Mar 25, 2026

Thanks for adding the non-partitioned coverage. I think the new assertions are still a bit weak for the expected-success cases.

For unpartitioned tables, the current implementation still updates the Hive table location to only the directory of the first added file. So if the rewritten/overwritten files are actually under multiple directories, the commit may succeed, but Hive may still see only part of the data.

Because of that, checking only lastedAddedFiles() and planFiles() does not seem sufficient here. These assertions show that the Iceberg commit succeeded, but they do not verify that the Hive-side result is actually correct.

Thank you for the review!

You are right. Currently, I was only checking the Iceberg assertion and not looking at the Hive side.

I fully understand that in non-partitioned tables, if files exist in multiple directories, the Hive location might only point to the first folder, meaning only a portion of the data can be visible.

I will add the actual data validation code for the Hive side and push it soon.

@Takch02
Copy link
Author

Takch02 commented Mar 25, 2026

Hi @xxubai, I added Hive-side validation, and as you mentioned, Hive can only see the files located in the directory of the first added file. The files in other directories become invisible to Hive, which causes the validation with all afterFiles to fail.

    int totalLiveFiles = Lists.newArrayList(baseStore.newScan().planFiles()).size();
    Assert.assertEquals(addFiles.size(), totalLiveFiles);

      Table hiveTable = TEST_HMS.getHiveClient()
              .getTable(getMixedTable().id().getDatabase(), getMixedTable().id().getTableName());
      String hiveLocation = hiveTable.getSd().getLocation();

      List<DataFile> filesVisibleToHive = afterFiles.stream()
              .filter(f -> f.path().toString().startsWith(hiveLocation))
              .collect(Collectors.toList());

      UpdateHiveFilesTestHelpers.validateHiveTableValues(
              TEST_HMS.getHiveClient(), getMixedTable(), filesVisibleToHive); // OK

      UpdateHiveFilesTestHelpers.validateHiveTableValues(
              TEST_HMS.getHiveClient(), getMixedTable(), afterFiles);  // Failed

Given this behavior, how should we handle this in the test? Should we expect a CannotAlterHiveLocationException here as well (similar to how partitioned tables behave when spanning multiple directories), or should we just assert that only a subset of files is visible to Hive to document the current limitation?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:mixed-hive Hive moduel for Mixed Format

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants