[SUPPORT] MOR table behavior for Spark Bulk insert to COW #12133
Comments
Currently, for this case (lines 65 to 68 in commit 5ccb19b), the first insert uses SingleFileHandleCreateFactory, but the second insert uses AppendHandleFactory and creates a log file.
I don't understand how bulk insert into a COW table with a simple bucket index should work by design. When we insert data that should update previous data, should we create a new parquet file with the new data and call inline compaction (due to the COW table type), or merge and write the data to a new parquet file, in which case it is not really a bulk insert anymore?
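For context, here is a rough paraphrase of the handle selection described above, written as a small Python sketch rather than the actual Hudi source; the factory names come from this thread, and the selection condition is an assumption:
# Pseudocode paraphrasing the behavior described in this thread; not actual
# Hudi source. For a bucket-index bulk insert, the chosen write handle seems
# to depend on whether the bucket already contains a file slice.
def choose_handle_factory(bucket_has_existing_file_slice: bool) -> str:
    if not bucket_has_existing_file_slice:
        # First insert into the bucket: creates a new base parquet file.
        return "SingleFileHandleCreateFactory"
    # Subsequent inserts: append a log file, which is MOR-style behavior.
    return "AppendHandleFactory"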
Bulk_insert should only be executed once, IMO; for the second update, you should use the upsert operation instead.
@geserdugarov That's surprising. We should ideally never have log files, regardless of which operation you use.
@geserdugarov I tried exactly your code with the current master, and it works as expected. I don't see any log files. Can you please check once? Thanks.
@ad1happy2go, sorry, I wrote the wrong version of Spark at first.
As you can see, there is a call to the code mentioned above. I'm using Ubuntu 22.04 and run the modified test in IntelliJ IDEA. To prevent removal of the data files, I placed a breakpoint right after the second insert and checked the filesystem in debug mode. I rechecked the mentioned test again on commit 66597e5 and still see:
/tmp/spark-bee1984c-74df-4067-8e14-09a61e02e0c8$ tree -a
.
├── dt=2021-01-05
│ ├── 00000001-9e90-410e-bdf3-4ea189ba93ac-0_1-14-12_20241023121839934.parquet
│ ├── .00000001-9e90-410e-bdf3-4ea189ba93ac-0_1-14-12_20241023121839934.parquet.crc
│ ├── .00000001-9e90-410e-bdf3-4ea189ba93ac-0_20241023121845855.log.1_0-30-31
│ ├── ..00000001-9e90-410e-bdf3-4ea189ba93ac-0_20241023121845855.log.1_0-30-31.crc
│ ├── .hoodie_partition_metadata
│   └── ..hoodie_partition_metadata.crc
I've simplified the provided test to:
test("Test MOR as COW") {
withSQLConf("hoodie.datasource.write.operation" -> "bulk_insert") {
spark.sql(
s"""
|create table mor_as_cow (
| id int,
| dt int
|) using hudi
| tblproperties (
| primaryKey = 'id',
| type = 'cow',
| preCombineField = 'dt',
| hoodie.index.type = 'BUCKET',
| hoodie.index.bucket.engine = 'SIMPLE',
| hoodie.bucket.index.num.buckets = '2',
| hoodie.datasource.write.row.writer.enable = 'false')
| location '/tmp/mor_as_cow'
""".stripMargin)
spark.sql(s"insert into mor_as_cow values (5, 10)")
spark.sql(s"insert into mor_as_cow values (9, 30)")
}
}
As a result:
tree -a /tmp/mor_as_cow/
./mor_as_cow/
├── 00000000-5b7e-4294-a60d-686ea422d0cc-0_0-14-12_20241023123912660.parquet
├── .00000000-5b7e-4294-a60d-686ea422d0cc-0_0-14-12_20241023123912660.parquet.crc
├── .00000000-5b7e-4294-a60d-686ea422d0cc-0_20241023123917968.log.1_0-30-31
├── ..00000000-5b7e-4294-a60d-686ea422d0cc-0_20241023123917968.log.1_0-30-31.crc
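As a side note, the tree check above can be automated. Here is a minimal Python sketch (assuming the table path /tmp/mor_as_cow used above) that flags MOR-style log files, which should never appear under a COW table:
# Walk the table directory and collect Hudi log files; their names contain
# ".log." (see the tree output above). The .hoodie folder is skipped because
# the metadata table under it legitimately uses log files.
import os

def find_log_files(table_path):
    hits = []
    for root, _dirs, files in os.walk(table_path):
        if ".hoodie" in root:
            continue
        for name in files:
            if ".log." in name:
                hits.append(os.path.join(root, name))
    return hits

print(find_log_files("/tmp/mor_as_cow"))  # expected for COW: []; here it is non-empty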
@ad1happy2go I've prepared a local Spark 3.5.3 cluster and reproduced this bug using PySpark. The script is available here:
After
INSERT INTO cow_or_mor VALUES (5, 10);
INSERT INTO cow_or_mor VALUES (9, 30);
for
SELECT * FROM cow_or_mor;
I got:
We see only one row; the second one is missing.
tree -a /tmp/write-COW-get-MOR
# .
# ├── 00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet
# ├── .00000000-dad4-4358-aaad-767a76e43e70-0_0-14-12_20241025153558259.parquet.crc
# ├── .00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28
# ├── ..00000000-dad4-4358-aaad-767a76e43e70-0_20241025153606721.log.1_0-30-28.crc
@geserdugarov I tried your code with the current master but can't see any issue. Can you please try once and confirm? Thanks a lot.
@geserdugarov I also tried building the code using the commit you mentioned, 66597e5, but I still see the expected correct behavior.
@ad1happy2go, I will recheck on the current master.
The issue is reproduced on 3d81ea0:
tree -a
.
├── 00000000-6344-45e3-abbb-b78fd08cfc12-0_0-13-11_20241106134327326.parquet
├── .00000000-6344-45e3-abbb-b78fd08cfc12-0_0-13-11_20241106134327326.parquet.crc
├── .00000000-6344-45e3-abbb-b78fd08cfc12-0_20241106134336247.log.1_0-29-27
├── ..00000000-6344-45e3-abbb-b78fd08cfc12-0_20241106134336247.log.1_0-29-27.crc
├── .hoodie
Hoodie table properties:
cat ./.hoodie/hoodie.properties
#Updated at 2024-11-06T06:43:32.140Z
#Wed Nov 06 13:43:32 NOVT 2024
hoodie.table.precombine.field=dt
hoodie.table.version=8
hoodie.database.name=default
hoodie.table.initial.version=8
hoodie.datasource.write.hive_style_partitioning=true
hoodie.table.metadata.partitions.inflight=
hoodie.table.checksum=2880991016
hoodie.table.keygenerator.type=NON_PARTITION
hoodie.table.create.schema={"type"\:"record","name"\:"cow_or_mor_record","namespace"\:"hoodie.cow_or_mor","fields"\:[{"name"\:"id","type"\:["int","null"]},{"name"\:"dt","type"\:["int","null"]}]}
hoodie.archivelog.folder=archived
hoodie.table.name=cow_or_mor
hoodie.record.merge.strategy.id=eeb8d96f-b1e4-49fd-bbf8-28ac514178e5
hoodie.compaction.payload.class=org.apache.hudi.common.model.DefaultHoodieRecordPayload
hoodie.table.type=COPY_ON_WRITE <-- COW
hoodie.datasource.write.partitionpath.urlencode=false
hoodie.datasource.write.drop.partition.columns=false
hoodie.table.metadata.partitions=files
hoodie.timeline.layout.version=1
hoodie.record.merge.mode=EVENT_TIME_ORDERING
hoodie.table.recordkey.fields=id
Code:
spark.sql("CREATE TABLE cow_or_mor ("
" id int,"
" dt int"
") USING HUDI "
"TBLPROPERTIES ("
" 'primaryKey' = 'id',"
" 'type' = 'cow',"
" 'preCombineField' = 'dt',"
" 'hoodie.index.type' = 'BUCKET',"
" 'hoodie.index.bucket.engine' = 'SIMPLE',"
" 'hoodie.bucket.index.num.buckets' = '2',"
" 'hoodie.datasource.write.row.writer.enable' = 'false'"
") LOCATION '" + tmp_dir_path + "';")
spark.sql("SET hoodie.datasource.write.operation=bulk_insert")
spark.sql("INSERT INTO cow_or_mor VALUES (5, 10);")
spark.sql("INSERT INTO cow_or_mor VALUES (9, 30);") |
@geserdugarov Thanks a lot. That's a good catch. I am able to reproduce this issue with the above code. We will work on this.
@geserdugarov I have also confirmed the same behaviour with Hudi 0.15.X and Spark master. It only happens for the bucket index, so I am wondering whether this is expected.
bulk_insert is designed to be executed for bootstrap purposes, because the whole pipeline simply ignores any updates and does a one-shot write of parquet files. If you do another bulk_insert on top of the first one, it will be seen as an update and trigger a log file write.
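Given that design, here is a hedged sketch of the intended usage pattern for the table above: bulk_insert only for the initial one-shot load, then switch to upsert for subsequent writes. The SET syntax follows the reproduction script; using upsert as the follow-up operation is the suggestion made earlier in this thread:
# Usage sketch: bulk_insert once for the initial load, then upsert for any
# later writes that may touch existing keys.
spark.sql("SET hoodie.datasource.write.operation=bulk_insert")
spark.sql("INSERT INTO cow_or_mor VALUES (5, 10);")

spark.sql("SET hoodie.datasource.write.operation=upsert")
spark.sql("INSERT INTO cow_or_mor VALUES (9, 30);")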
Following this logic, I propose a fix in #12245. We need to wait for the CI results to be sure that it doesn't break other specific cases.
I've already created issue HUDI-8394, but I want to highlight and discuss this problem here.
I suppose this is a critical issue with the current master when:
hoodie.datasource.write.row.writer.enable = false
Describe the problem you faced
When I try to bulk insert into a COW table, I see parquet and log files in the file system, which is MOR table behavior.
I've checked that the table is of COW type.
But the files are not those of a COW table:
To Reproduce
To reproduce, the existing test
Test Bulk Insert Into Bucket Index Table
could be modified and used.
Expected behavior
For a COW table, only parquet files should be created.
Environment Description
Hudi version : current master
Spark version : 3.4