Thanks for writing this up. I think the overall direction makes sense, but there are two important clarifications on the Lance side.
That said, I agree with the product need here. Iceberg needs the final file size for manifest reporting, and doing an extra storage round-trip after close is not ideal.
One more nit: I don't think …

So my take is: if you'd like, a focused PR for the first item could still be a good starting point, but it should be framed as "add the final file size to the Rust file writer API and expose it through JNI" rather than "JNI currently discards bytes written".
We have a working Iceberg file format integration using the Java SDK (`LanceFileReader`/`LanceFileWriter`). Draft PR on Iceberg: apache/iceberg#15751.

Two small JNI changes would close the biggest gaps:
1. `getBytesWritten()` on `LanceFileWriter`

   Rust `FileWriter::finish()` returns `Result<u64>` — the total bytes written. The JNI layer currently discards it.

   Fix: return `jlong` from `closeNative` and add `public long getBytesWritten()` to the Java class. Non-breaking — `close()` stays void.

   Iceberg needs this for file size reporting in manifests and for split planning. We currently work around it with an extra storage round-trip after close.
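To make the proposal concrete, here is a minimal sketch of the Java surface, with the native layer stubbed out. The class name matches the SDK, but the fields and the `closeNative` body here are hypothetical stand-ins for the real JNI binding:

```java
// Sketch of the proposed API change: close() stays void (non-breaking),
// and the byte count the native finish() returns is cached so that
// getBytesWritten() can report it afterwards.
class LanceFileWriter implements AutoCloseable {
    private long pendingBytes = 0;   // stand-in for state held by the native writer
    private long bytesWritten = -1;  // -1 until close() completes

    public void write(byte[] batch) {
        pendingBytes += batch.length; // stand-in for writing a batch via JNI
    }

    // In the real binding this would be a native method returning jlong,
    // carrying the u64 from Rust FileWriter::finish() instead of discarding it.
    private long closeNative() {
        return pendingBytes;
    }

    @Override
    public void close() {
        bytesWritten = closeNative(); // capture instead of discard
    }

    public long getBytesWritten() {
        if (bytesWritten < 0) {
            throw new IllegalStateException("writer not closed yet");
        }
        return bytesWritten;
    }
}
```

With this shape, the Iceberg manifest writer could read `getBytesWritten()` right after `close()` instead of issuing an extra stat call to storage.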
2. Column statistics via JNI (after PR #5639)

   Once #5639 merges, the Java SDK needs a way to read per-column min/max/null_count from `LanceFileReader` — something like `getColumnStatistics()` returning the stats from the global buffer.

   Iceberg uses these for file-level pruning: skipping files whose min/max don't overlap the query predicate. Without them, every Lance file gets scanned.
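The pruning logic these stats would enable is simple. The `ColumnStatistics` shape and `mayMatchRange` helper below are hypothetical illustrations, not the actual SDK API (long-typed columns for brevity):

```java
// Hypothetical per-column stats, roughly what getColumnStatistics()
// could surface from the global buffer.
final class ColumnStatistics {
    final long min;
    final long max;
    final long nullCount;

    ColumnStatistics(long min, long max, long nullCount) {
        this.min = min;
        this.max = max;
        this.nullCount = nullCount;
    }

    // File-level pruning check: a file may satisfy a predicate
    // "value BETWEEN lo AND hi" only if [min, max] overlaps [lo, hi].
    boolean mayMatchRange(long lo, long hi) {
        return max >= lo && min <= hi;
    }
}
```

A planner would then scan only the files where `mayMatchRange` returns true — for example, a file with min=100 and max=200 is skipped for a predicate over [10, 50].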
Future work on iceberg-lance: drop the extra storage round-trip once `getBytesWritten()` exists.

Happy to contribute PRs if the approach looks right.