Skip to content

feat(lsm): expose lsm api to python#6259

Open
zhangyue19921010 wants to merge 5 commits intolance-format:mainfrom
zhangyue19921010:lsm-python
Open

feat(lsm): expose lsm api to python#6259
zhangyue19921010 wants to merge 5 commits intolance-format:mainfrom
zhangyue19921010:lsm-python

Conversation

@zhangyue19921010
Copy link
Copy Markdown
Contributor

Expose LSM related API to Python

Writer:

  • initialize_mem_wal
  • mem_wal_writer

Reader:

  • LsmScanner
  • LsmPointLookupPlanner
  • LsmVectorSearchPlanner

Optimize:

  • mark_generations_as_merged

@github-actions github-actions bot added enhancement New feature or request python labels Mar 23, 2026
@github-actions
Copy link
Copy Markdown
Contributor

PR Review: feat(lsm): expose lsm api to python

Overall this is a well-structured PR with good test coverage and clean Python/Rust layering. A few issues to flag:

P0: GIL not released in ExecutionPlanReader::next()

In python/src/mem_wal.rs, the ExecutionPlanReader::next() implementation calls rt().spawn(None, ...), passing None instead of Some(py). This means the GIL is not released while the tokio future executes. Since next() is called from Python (which holds the GIL), this blocks the GIL for the entire duration of each batch fetch. If any downstream async work (or another Python thread) needs the GIL, this will deadlock.

Every other async call site in this PR correctly uses rt().block_on(Some(py), ...) to release the GIL. However, next() does not have access to py since it implements Iterator. Consider restructuring, e.g. collecting batches eagerly in to_reader while holding py, or using pyo3-async-runtimes to properly yield the GIL during iteration.

P1: put() collects all batches into memory

The put() method in PyRegionWriter materializes the entire input stream into a Vec before writing. Per project guidelines, this can OOM on large writes. Consider streaming batches to the writer incrementally instead of collecting them all at once.

Minor

  • Dead code: _unwrap_region_id() in python/python/lance/mem_wal.py is defined but never called. The UUID validation is already done on the Rust side. Remove it or wire it up.
  • Docstring typo: dataset.py line ~158 in mem_wal_writer docstring: initialize_mem_wal is missing its opening backtick.

@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

PR Review: feat(lsm): expose lsm api to python

All fixed.

@zhangyue19921010
Copy link
Copy Markdown
Contributor Author

Hi @jackye1995 Sorry to bother you. Would u mind to take a look for this PR at your convenience? Really Appreciate!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant