[AMORO-4131] Add non-partitioned table tests for RewriteFiles and OverwriteFiles #4137
Takch02 wants to merge 5 commits into apache:master
Conversation
Test: add non-partitioned test cases for RewriteFiles
I re-uploaded it after checking the CI and adhering to the code conventions.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff            @@
##             master    #4137   +/-  ##
=========================================
  Coverage     29.90%   29.90%
  Complexity     4160     4160
=========================================
  Files           670      670
  Lines         53533    53533
  Branches       6797     6797
=========================================
+ Hits          16008    16010    +2
+ Misses        36343    36341    -2
  Partials       1182     1182
@xxubai
xxubai left a comment
Thanks for adding the non-partitioned coverage. I think the new assertions are still a bit weak for the expected-success cases.
For unpartitioned tables, the current implementation still updates the Hive table location to only the directory of the first added file. So if the rewritten/overwritten files are actually under multiple directories, the commit may succeed, but Hive may still see only part of the data.
Because of that, checking only lastedAddedFiles() and planFiles() does not seem sufficient here. These assertions show that the Iceberg commit succeeded, but they do not verify that the Hive-side result is actually correct.
Thank you for the review! You are right: I was only checking the Iceberg assertions and not looking at the Hive side. I fully understand that for non-partitioned tables, if files exist in multiple directories, the Hive location may point only to the first folder, so only a portion of the data is visible. I will add Hive-side data validation and push it soon.
Hi @xxubai, I added Hive-side validation, and as you mentioned, Hive can only see the files located in the directory of the first added file. The files in other directories become invisible to Hive, which causes the validation with all afterFiles to fail. Given this behavior, how should we handle this in the test? Should we expect a CannotAlterHiveLocationException here as well (similar to how partitioned tables behave when spanning multiple directories), or should we just assert that only a subset of files is visible to Hive to document the current limitation?
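To make the limitation under discussion concrete, here is a minimal, self-contained sketch (not Amoro code; the class and method names are hypothetical) of the behavior the reviewer describes: if the Hive table location is taken from the parent directory of the first added file, any file that lives under a different directory falls outside that location and is invisible to Hive.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: models taking the Hive table location from the
// first added file's directory, as described in the review comment above.
public class HiveLocationSketch {

    // Hypothetical helper: returns the files Hive would see if the table
    // location is set to the parent directory of the first added file.
    public static List<Path> visibleToHive(List<Path> addedFiles) {
        Path hiveLocation = addedFiles.get(0).getParent();
        return addedFiles.stream()
                .filter(f -> hiveLocation.equals(f.getParent()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Path> addedFiles = List.of(
                Paths.get("/warehouse/tbl/dir-1/data-0.parquet"),
                Paths.get("/warehouse/tbl/dir-1/data-1.parquet"),
                Paths.get("/warehouse/tbl/dir-2/data-2.parquet"));

        // The commit succeeds, but only dir-1 files are visible to Hive.
        System.out.println("visible to Hive: "
                + visibleToHive(addedFiles).size() + " of " + addedFiles.size());
    }
}
```

Under this model, a commit spanning `dir-1` and `dir-2` leaves the `dir-2` file outside the Hive location, which is exactly why asserting on all afterFiles fails on the Hive side.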
Motivation
Currently, TestRewriteFiles and TestOverwriteFiles only verify the behavior of partitioned tables, expecting a CannotAlterHiveLocationException when partial rewrites or overwrites attempt to reference multiple directories. Non-partitioned tables, however, are not bound by Hive's strict 1-partition-1-folder constraint and should successfully commit these operations. This PR adds the missing coverage for non-partitioned tables.
Key Changes
Added non-partitioned test methods to TestRewriteFiles and TestOverwriteFiles, guarded with Assume.assumeFalse(isPartitionedTable()).
Verified successful commits instead of expecting exceptions, as non-partitioned tables gracefully handle files across multiple directories.
Introduced Dynamic Assertions: Replaced hardcoded integer values with dynamic size assertions to prevent fragile tests. This ensures the tests remain robust regardless of table properties (e.g., Primary Key bucketing or target file size limits).
Comprehensive State Verification: Implemented a two-step verification process for each test:
Action Verification: Checked the exact number of files added in the latest transaction using lastedAddedFiles().
State Verification: Checked the total number of live files in the current snapshot using baseStore.newScan().planFiles().
Example of Verification Logic:
Rewrite: Verified afterFiles.size() and totalLiveFiles against addFiles.size().
Overwrite: Verified afterFiles.size() against secondDataFiles.size(), and totalLiveFiles against firstDataFiles.size() + secondDataFiles.size().
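The two-step verification above can be sketched with a minimal in-memory model (illustrative only; this is not the actual test code, and the class and method names here are hypothetical stand-ins for the table and scan APIs used in the PR):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model of the two-step verification: an "action" check on the
// files added by the latest commit, and a "state" check on all live files.
public class VerificationSketch {
    private final List<String> liveFiles = new ArrayList<>();
    private List<String> lastAdded = List.of();

    // Hypothetical commit: appends a batch of data files to the table.
    public void commit(List<String> added) {
        liveFiles.addAll(added);
        lastAdded = added;
    }

    // Action verification: files added by the latest transaction.
    public List<String> lastAddedFiles() { return lastAdded; }

    // State verification: total live files in the current snapshot.
    public int totalLiveFiles() { return liveFiles.size(); }

    public static void main(String[] args) {
        VerificationSketch table = new VerificationSketch();
        List<String> firstDataFiles = List.of("f1.parquet", "f2.parquet");
        List<String> secondDataFiles = List.of("f3.parquet");

        table.commit(firstDataFiles);
        table.commit(secondDataFiles);

        // Mirrors the overwrite checks described above: the latest batch
        // matches secondDataFiles, and the snapshot holds both batches.
        System.out.println("lastAdded=" + table.lastAddedFiles().size()
                + " totalLive=" + table.totalLiveFiles());
    }
}
```

Using dynamic sizes (secondDataFiles.size(), firstDataFiles.size() + secondDataFiles.size()) instead of hardcoded counts is what keeps the assertions stable when table properties such as primary-key bucketing change the number of files produced.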