Release v0.4.0 · lance-format/lance-spark

What's Changed

New Features 🎉

feat: support DROP INDEX DDL by @LuciferYang in #371
feat: add Map type support by @summaryzb in #379
feat: support float16 vector type by @wombatu-kun in #378
feat: support alter table set/unset properties by @wombatu-kun in #358
feat: support alter table set unenforced primary key by @wombatu-kun in #359
feat: support rename table by @wombatu-kun in #352
feat: propagate enable_stable_row_ids through spark write path by @ivscheianu in #351
feat(benchmark): add --file-format-version option to TPC-DS data generator by @xuzha in #411
feat: add zonemap-based fragment pruning and storage-partitioned join (SPJ) support by @beinan in #396
feat: add use_large_var_types option to avoid 2GB Arrow vector overflow by @beinan in #413
feat(benchmark): add TPC-DS btree index creation for fragment pruning by @summaryzb in #433
feat: support compression config via Spark TBLPROPERTIES by @ivscheianu in #428
feat: add byte-based batch flushing to prevent OOM on large rows by @beinan in #420
feat: expose blob_pack_file_size_threshold write option by @hamersaw in #447
feat: support reading non-microsecond Arrow timestamp columns by @summaryzb in #444
feat: require clustered distribution on write for SPJ by @beinan in #445
feat: support Lance index metadata in Spark indexing by @jackye1995 in #481
feat: add param rows_per_range for range-based btree index build by @fangbo in #439
feat: add custom Lance metrics to trace read-path scan performance by @summaryzb in #460
feat: preserve Arrow Date(MILLISECOND) columns through Spark roundtrip by @summaryzb in #464
fix: widen pruned nested struct schemas to preserve Arrow child ordinals by @butnaruandrei in #442
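
The byte-based batch flushing entry (#420) describes flushing a write buffer based on accumulated size, so that a few very large rows cannot exhaust memory before a row-count limit is reached. The following is a minimal sketch of that general idea, not the lance-spark implementation; `ByteBoundedBuffer`, `max_rows`, `max_bytes`, and `sink` are hypothetical names chosen for illustration.

```python
from typing import Callable, List


class ByteBoundedBuffer:
    """Hypothetical buffer that flushes on a row-count OR a byte-size limit."""

    def __init__(self, max_rows: int, max_bytes: int,
                 sink: Callable[[List[bytes]], None]) -> None:
        self.max_rows = max_rows      # assumed row-count threshold
        self.max_bytes = max_bytes    # assumed byte-size threshold
        self.sink = sink              # receives each flushed batch
        self.rows: List[bytes] = []
        self.buffered_bytes = 0

    def append(self, row: bytes) -> None:
        self.rows.append(row)
        self.buffered_bytes += len(row)
        # Flush as soon as either limit is reached, whichever comes first.
        if len(self.rows) >= self.max_rows or self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self) -> None:
        if self.rows:
            self.sink(self.rows)
            self.rows = []            # start a fresh list; the flushed batch is untouched
            self.buffered_bytes = 0


batches: List[List[bytes]] = []
buf = ByteBoundedBuffer(max_rows=1000, max_bytes=10, sink=batches.append)
for row in [b"aaaa", b"bbbb", b"cccc", b"dd"]:
    buf.append(row)
buf.flush()  # emit whatever is left at the end of the write
```

With a 10-byte limit, the third 4-byte row triggers a flush long before the 1000-row limit would, and the trailing row is emitted by the final explicit `flush()`.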

Bug Fixes 🐛

fix: strip quotes from visitStringLiteral and fail explicitly on unrecognized build_mode by @puchengy in #375
fix: escape single quotes in filter pushdown SQL compilation by @LuciferYang in #377
fix: prevent resource leaks in read path close/error handling by @LuciferYang in #376
fix: decouple benchmark module from lance-spark version dependency by @summaryzb in #370
fix: rename NamedArgument to LanceNamedArgument to avoid Iceberg classpath collision by @LuciferYang in #383
fix: add --fail flag to curl downloads in docker/Dockerfile by @wombatu-kun in #405
fix: update columns concurrent write conflict issue by @jerryjch in #345
fix: pass namespace and storage parameters for add/update column by @bryanck in #422
fix: add clean-bundle target for reliable source change detection by @ivscheianu in #426
fix(benchmark): accurately materialize tpcds query by @summaryzb in #415
fix: implement equals/hashCode on LanceScan to enable ReusedExchange by @LuciferYang in #427
fix: intercept Spark 4.0+ native CREATE INDEX to prevent NPE by @beinan in #412
fix: report post-pruning statistics to enable BroadcastHashJoin with SPJ by @beinan in #425
fix: race condition in QueuedArrowBatchWriteBuffer losing final batch by @hamersaw in #431
fix: preserve use_large_var_types on staged commit path by @beinan in #443
fix: remove Array filter pushdown workaround (upstream lance#… by @summaryzb in #441
fix: exclude netty from bundle jars to prevent split-package conflicts by @hamersaw in #458
fix: roll fragments on partition-value transitions by @hamersaw in #463
fix: no such method error in lance arrow util due to transitive json4s usage by @ivscheianu in #465
fix: propagate index_details from distributed index creation by @LuciferYang in #475
fix(spark): gss initiate failed on hms executors; spark.sql.catalog read options not applied by @xiaguanglei in #476
fix: reject DECIMAL256 columns at schema resolution time with actiona… by @summaryzb in #492
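
The single-quote escaping fix (#377) concerns string literals embedded in a pushed-down filter expression. As a rough illustration of the class of bug, the sketch below doubles embedded single quotes, which is the standard SQL convention for string literals; that lance-spark uses exactly this escape is an assumption, and `quote_sql_string` and `compile_equal_to` are hypothetical helpers, not project APIs.

```python
def quote_sql_string(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded single quote."""
    return "'" + value.replace("'", "''") + "'"


def compile_equal_to(column: str, value: str) -> str:
    """Hypothetical compiler for an equality filter on a string column."""
    return f"{column} = {quote_sql_string(value)}"


# Without escaping, the apostrophe would terminate the literal early.
predicate = compile_equal_to("name", "O'Brien")
print(predicate)
```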

Documentation 📚

docs: add Spark 4.1 to supported versions by @hamersaw in #391
docs: add use_large_var_types write option documentation by @beinan in #424
docs: fixed repo name from lancedb/lance-spark to lance-format/lance-spark by @wombatu-kun in #438
docs: add Lance Spark Glue/S3 agent skill by @jackye1995 in #489

Performance Improvements 🚀

perf: optimize LIMIT pushdown by pruning splits using fragment row counts by @beinan in #395
refactor: remove Java-side dataset cache, rely on Rust-side Session by @LuciferYang in #353
perf: report projection-aware stats so BroadcastHashJoin fires on pruned scans by @LuciferYang in #435
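
The LIMIT pushdown entry (#395) mentions pruning splits using fragment row counts. One way to picture this, under the assumption that no filter is applied and that row counts are exact, is to keep fragments only until their cumulative row count covers the limit. This is a sketch of the general technique rather than the project's code; `prune_fragments_for_limit` and the `(fragment_id, row_count)` tuples are hypothetical.

```python
from typing import List, Tuple


def prune_fragments_for_limit(fragments: List[Tuple[int, int]],
                              limit: int) -> List[int]:
    """Return the ids of the leading fragments needed to satisfy `limit` rows.

    Assumes each row count is exact and that no filter removes rows,
    otherwise the counts could not be trusted for pruning.
    """
    selected: List[int] = []
    remaining = limit
    for fragment_id, row_count in fragments:
        if remaining <= 0:
            break  # earlier fragments already cover the limit
        selected.append(fragment_id)
        remaining -= row_count
    return selected


# Fragments 0 and 1 hold 350 rows together, so fragment 2 is never scanned.
kept = prune_fragments_for_limit([(0, 100), (1, 250), (2, 400)], limit=300)
```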

Other Changes

refactor: move integration tests to top-level directory by @hamersaw in #393
refactor: remove dead SchemaConverter JsonArrow code by @LuciferYang in #409
refactor: consolidate dataset-open logic into Utils.OpenDatasetBuilder by @LuciferYang in #384
refactor: move vector data exposure to the Rust side to improve performance and prevent OOM by @fangbo in #467

New Contributors

@puchengy made their first contribution in #375
@summaryzb made their first contribution in #370
@wombatu-kun made their first contribution in #378
@xuzha made their first contribution in #411
@jerryjch made their first contribution in #345
@xiaguanglei made their first contribution in #476
@butnaruandrei made their first contribution in #442

Full Changelog: v0.3.0...v0.4.0
