Thanks for writing this up. I think the overall direction makes sense, but there are two important clarifications on the Lance side.
That said, I agree with the product need here. Iceberg needs the final file size for manifest reporting, and doing an extra storage round-trip after close is not ideal.
One more nit: I don't think …

So my take is: if you'd like, a focused PR for the first item could still be a good starting point, but it should be framed as "add the final file size to the Rust file writer API and expose it through JNI" rather than "JNI currently discards bytes written".
We have a working Iceberg file format integration using the Java SDK (`LanceFileReader`/`LanceFileWriter`). Draft PR on Iceberg: apache/iceberg#15751.

Two small JNI changes would close the biggest gaps:
1. `getBytesWritten()` on `LanceFileWriter`

   Rust `FileWriter::finish()` returns `Result<u64>` — the total bytes written. The JNI layer currently discards it.

   Fix: return `jlong` from `closeNative` and add `public long getBytesWritten()` to the Java class. Non-breaking — `close()` stays void.

   Iceberg needs this for file size reporting in manifests and for split planning. We currently work around it with an extra storage round-trip after close.
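To make the proposal concrete, here is a minimal sketch of the Java surface, with the native layer stubbed out. The class name matches the SDK, but the fields and the `closeNative` body here are hypothetical stand-ins for the real JNI binding:

```java
// Sketch of the proposed API change: close() stays void (non-breaking),
// and the byte count the native finish() returns is cached so that
// getBytesWritten() can report it afterwards.
class LanceFileWriter implements AutoCloseable {
    private long pendingBytes = 0;   // stand-in for state held by the native writer
    private long bytesWritten = -1;  // -1 until close() completes

    public void write(byte[] batch) {
        pendingBytes += batch.length; // stand-in for writing a batch via JNI
    }

    // In the real binding this would be a native method returning jlong,
    // carrying the u64 from Rust FileWriter::finish() instead of discarding it.
    private long closeNative() {
        return pendingBytes;
    }

    @Override
    public void close() {
        bytesWritten = closeNative(); // capture instead of discard
    }

    public long getBytesWritten() {
        if (bytesWritten < 0) {
            throw new IllegalStateException("writer not closed yet");
        }
        return bytesWritten;
    }
}
```

With this shape, the Iceberg manifest writer could read `getBytesWritten()` right after `close()` instead of issuing an extra stat call to storage.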
2. Column statistics via JNI (after PR #5639)

   Once #5639 merges, the Java SDK needs a way to read per-column min/max/null_count from `LanceFileReader` — something like `getColumnStatistics()` returning the stats from the global buffer.

   Iceberg uses these for file-level pruning: skipping files whose min/max don't overlap the query predicate. Without them, every Lance file gets scanned.
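The pruning logic these stats would enable is simple. The `ColumnStatistics` shape and `mayMatchRange` helper below are hypothetical illustrations, not the actual SDK API (long-typed columns for brevity):

```java
// Hypothetical per-column stats, roughly what getColumnStatistics()
// could surface from the global buffer.
final class ColumnStatistics {
    final long min;
    final long max;
    final long nullCount;

    ColumnStatistics(long min, long max, long nullCount) {
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
    }

    // File-level pruning check: a file may satisfy a predicate
    // "value BETWEEN lo AND hi" only if [min, max] overlaps [lo, hi].
    boolean mayMatchRange(long lo, long hi) {
        return max >= lo && min <= hi;
    }
}
```

A planner would then scan only the files where `mayMatchRange` returns true — for example, a file with min=100 and max=200 is skipped for a predicate over [10, 50].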
Future work on iceberg-lance: drop the extra storage round-trip once `getBytesWritten()` exists.

Happy to contribute PRs if the approach looks right.