What's Changed
New Features 🎉
- feat: support DROP INDEX DDL by @LuciferYang in #371
- feat: add Map type support by @summaryzb in #379
- feat: support float16 vector type by @wombatu-kun in #378
- feat: support alter table set/unset properties by @wombatu-kun in #358
- feat: support alter table set unenforced primary key by @wombatu-kun in #359
- feat: support rename table by @wombatu-kun in #352
- feat: propagate enable_stable_row_ids through spark write path by @ivscheianu in #351
- feat(benchmark): add --file-format-version option to TPC-DS data generator by @xuzha in #411
- feat: add zonemap-based fragment pruning and storage-partitioned join (SPJ) support by @beinan in #396
- feat: add use_large_var_types option to avoid 2GB Arrow vector overflow by @beinan in #413
- feat(benchmark): add TPC-DS btree index creation for fragment pruning by @summaryzb in #433
- feat: support compression config via Spark TBLPROPERTIES by @ivscheianu in #428
- feat: add byte-based batch flushing to prevent OOM on large rows by @beinan in #420
- feat: expose blob_pack_file_size_threshold write option by @hamersaw in #447
- feat: support reading non-microsecond Arrow timestamp columns by @summaryzb in #444
- feat: require clustered distribution on write for SPJ by @beinan in #445
- feat: support Lance index metadata in Spark indexing by @jackye1995 in #481
- feat: add param rows_per_range for range-based btree index built by @fangbo in #439
- feat: add custom Lance metrics to trace read-path scan performance by @summaryzb in #460
- feat: preserve Arrow Date(MILLISECOND) columns through Spark roundtrip by @summaryzb in #464
- fix: widen pruned nested struct schemas to preserve Arrow child ordinals by @butnaruandrei in #442
Bug Fixes 🐛
- fix: strip quotes from visitStringLiteral and fail explicitly on unrecognized build_mode by @puchengy in #375
- fix: escape single quotes in filter pushdown SQL compilation by @LuciferYang in #377
- fix: prevent resource leaks in read path close/error handling by @LuciferYang in #376
- fix: decouple benchmark module from lance-spark version dependency by @summaryzb in #370
- fix: rename NamedArgument to LanceNamedArgument to avoid Iceberg classpath collision by @LuciferYang in #383
- fix: add --fail flag to curl downloads in docker/Dockerfile by @wombatu-kun in #405
- fix: update columns concurrent write conflict issue by @jerryjch in #345
- fix: pass namespace and storage parameters for add/update column by @bryanck in #422
- fix: add clean-bundle target for reliable source change detection by @ivscheianu in #426
- fix(benchmark): accurately materialize tpcds query by @summaryzb in #415
- fix: implement equals/hashCode on LanceScan to enable ReusedExchange by @LuciferYang in #427
- fix: intercept Spark 4.0+ native CREATE INDEX to prevent NPE by @beinan in #412
- fix: report post-pruning statistics to enable BroadcastHashJoin with SPJ by @beinan in #425
- fix: race condition in QueuedArrowBatchWriteBuffer losing final batch by @hamersaw in #431
- fix: preserve use_large_var_types on staged commit path by @beinan in #443
- fix: remove Array filter pushdown workaround (upstream lance#… by @summaryzb in #441
- fix: exclude netty from bundle jars to prevent split-package conflicts by @hamersaw in #458
- fix: roll fragments on partition-value transitions by @hamersaw in #463
- fix: no such method error in lance arrow util due to transitive json4s usage by @ivscheianu in #465
- fix: propagate index_details from distributed index creation by @LuciferYang in #475
- fix(spark): gss initiate failed on hms executors; spark.sql.catalog read options not applied by @xiaguanglei in #476
- fix: reject DECIMAL256 columns at schema resolution time with actiona… by @summaryzb in #492
Documentation 📚
- docs: add Spark 4.1 to supported versions by @hamersaw in #391
- docs: add use_large_var_types write option documentation by @beinan in #424
- docs: fixed repo name from lancedb/lance-spark to lance-format/lance-spark by @wombatu-kun in #438
- docs: add Lance Spark Glue/S3 agent skill by @jackye1995 in #489
Performance Improvements 🚀
- perf: optimize LIMIT pushdown by pruning splits using fragment row counts by @beinan in #395
- refactor: remove Java-side dataset cache, rely on Rust-side Session by @LuciferYang in #353
- perf: report projection-aware stats so BroadcastHashJoin fires on pruned scans by @LuciferYang in #435
Other Changes
- refactor: move integration tests to top-level directory by @hamersaw in #393
- refactor: remove dead SchemaConverter JsonArrow code by @LuciferYang in #409
- refactor: consolidate dataset-open logic into Utils.OpenDatasetBuilder by @LuciferYang in #384
- refactor: refactor the vector data expose to rust side to improve performance and prevent OOM by @fangbo in #467
New Contributors
- @puchengy made their first contribution in #375
- @summaryzb made their first contribution in #370
- @wombatu-kun made their first contribution in #378
- @xuzha made their first contribution in #411
- @jerryjch made their first contribution in #345
- @xiaguanglei made their first contribution in #476
- @butnaruandrei made their first contribution in #442
Full Changelog: v0.3.0...v0.4.0