
Commit c38bc76

sumedhsakdeo and claude committed
docs: Update API documentation for ScanOrder refactoring
Replace ScanOrder enum examples with new class-based API:

- TaskOrder() for default behavior
- ArrivalOrder(concurrent_streams=N) for streaming
- ArrivalOrder(concurrent_streams=N, max_buffered_batches=M) for memory control

Add configuration guidance table and update ordering semantics.
Rename concurrent_files → concurrent_streams throughout examples.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
1 parent e06c01a commit c38bc76

File tree

1 file changed (+12, -11 lines)


mkdocs/docs/api.md

Lines changed: 12 additions & 11 deletions
````diff
@@ -362,30 +362,30 @@ for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
-By default, each file's batches are materialized in memory before being yielded (`order=ScanOrder.TASK`). For large files that may exceed available memory, use `order=ScanOrder.ARRIVAL` to yield batches as they are produced without materializing entire files:
+By default, each file's batches are materialized in memory before being yielded (`TaskOrder()`). For large files that may exceed available memory, use `ArrivalOrder()` to yield batches as they are produced without materializing entire files:
 
 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder
 
-for buf in tbl.scan().to_arrow_batch_reader(order=ScanOrder.ARRIVAL, batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(), batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
-For maximum throughput, use `concurrent_files` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
+For maximum throughput, use `concurrent_streams` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
 
 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder
 
-for buf in tbl.scan().to_arrow_batch_reader(order=ScanOrder.ARRIVAL, concurrent_files=4, batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=4), batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```
 
 **Ordering semantics:**
 
 | Configuration | File ordering | Within-file ordering |
 |---|---|---|
-| `ScanOrder.TASK` (default) | Batches grouped by file, in task submission order | Row order |
-| `ScanOrder.ARRIVAL` | Interleaved across files (no grouping guarantee) | Row order within each file |
+| `TaskOrder()` (default) | Batches grouped by file, in task submission order | Row order |
+| `ArrivalOrder()` | Interleaved across files (no grouping guarantee) | Row order within each file |
 
 Within each file, batch ordering always follows row order. The `limit` parameter is enforced correctly regardless of configuration.
@@ -394,11 +394,12 @@ Within each file, batch ordering always follows row order. The `limit` parameter
 | Use case | Recommended config |
 |---|---|
 | Small tables, simple queries | Default — no extra args needed |
-| Large tables, memory-constrained | `order=ScanOrder.ARRIVAL` — one file at a time, minimal memory |
-| Maximum throughput with bounded memory | `order=ScanOrder.ARRIVAL, concurrent_files=N` — tune N to balance throughput vs memory |
+| Large tables, memory-constrained | `order=ArrivalOrder()` — one file at a time, minimal memory |
+| Maximum throughput with bounded memory | `order=ArrivalOrder(concurrent_streams=N)` — tune N to balance throughput vs memory |
+| Fine-grained memory control | `order=ArrivalOrder(concurrent_streams=N, max_buffered_batches=M)` — tune both parameters |
 | Fine-grained batch control | Add `batch_size=N` to any of the above |
 
-**Note:** `ScanOrder.ARRIVAL` yields batches in arrival order (interleaved across files when `concurrent_files > 1`). For deterministic file ordering, use the default `ScanOrder.TASK` mode. `batch_size` is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
+**Note:** `ArrivalOrder()` yields batches in arrival order (interleaved across files when `concurrent_streams > 1`). For deterministic file ordering, use the default `TaskOrder()` mode. `batch_size` is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
 
 To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
 
````

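The class-based configuration documented in the diff can be pictured with a minimal sketch. This is an illustration only: the `TaskOrder` and `ArrivalOrder` classes below are stand-in dataclasses mirroring the parameter names from the documentation (`concurrent_streams`, `max_buffered_batches`), not the actual pyiceberg implementation, whose defaults and validation may differ.

```python
from dataclasses import dataclass
from typing import Optional


# Stand-in sketches of the configuration classes named in the diff above.
# Parameter names come from the documented API; everything else is assumed.
@dataclass(frozen=True)
class TaskOrder:
    """Default: batches grouped by file, in task submission order."""


@dataclass(frozen=True)
class ArrivalOrder:
    """Yield batches as they arrive; interleaved when concurrent_streams > 1."""

    concurrent_streams: int = 1
    max_buffered_batches: Optional[int] = None  # None = no explicit cap

    def __post_init__(self) -> None:
        if self.concurrent_streams < 1:
            raise ValueError("concurrent_streams must be >= 1")


# Usage mirroring the guidance table:
default = TaskOrder()                            # deterministic file ordering
streaming = ArrivalOrder()                       # one file at a time, minimal memory
throughput = ArrivalOrder(concurrent_streams=4)  # parallel reads, interleaved batches
bounded = ArrivalOrder(concurrent_streams=4, max_buffered_batches=8)
```

Folding the parameters into the order object (rather than passing `concurrent_files` as a separate keyword, as before) keeps options that only make sense for arrival order attached to `ArrivalOrder` itself.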
0 commit comments
