docs: Update API documentation for ScanOrder refactoring
Replace ScanOrder enum examples with new class-based API:
- TaskOrder() for default behavior
- ArrivalOrder(concurrent_streams=N) for streaming
- ArrivalOrder(concurrent_streams=N, max_buffered_batches=M) for memory control
Add configuration guidance table and update ordering semantics.
Rename concurrent_files → concurrent_streams throughout examples.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
mkdocs/docs/api.md (+12 −11)
@@ -362,30 +362,30 @@ for buf in tbl.scan().to_arrow_batch_reader(batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```

-By default, each file's batches are materialized in memory before being yielded (`order=ScanOrder.TASK`). For large files that may exceed available memory, use `order=ScanOrder.ARRIVAL` to yield batches as they are produced without materializing entire files:
+By default, each file's batches are materialized in memory before being yielded (`TaskOrder()`). For large files that may exceed available memory, use `ArrivalOrder()` to yield batches as they are produced without materializing entire files:

 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder

-for buf in tbl.scan().to_arrow_batch_reader(order=ScanOrder.ARRIVAL, batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(), batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```

-For maximum throughput, use `concurrent_files` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:
+For maximum throughput, use `concurrent_streams` to read multiple files in parallel with arrival order. Batches are yielded as they arrive from any file — ordering across files is not guaranteed:

 ```python
-from pyiceberg.table import ScanOrder
+from pyiceberg.table import ArrivalOrder

-for buf in tbl.scan().to_arrow_batch_reader(order=ScanOrder.ARRIVAL, concurrent_files=4, batch_size=1000):
+for buf in tbl.scan().to_arrow_batch_reader(order=ArrivalOrder(concurrent_streams=4), batch_size=1000):
     print(f"Buffer contains {len(buf)} rows")
 ```

-| `ScanOrder.TASK` (default) | Batches grouped by file, in task submission order | Row order |
-| `ScanOrder.ARRIVAL` | Interleaved across files (no grouping guarantee) | Row order within each file |
+| `TaskOrder()` (default) | Batches grouped by file, in task submission order | Row order |
+| `ArrivalOrder()` | Interleaved across files (no grouping guarantee) | Row order within each file |

 Within each file, batch ordering always follows row order. The `limit` parameter is enforced correctly regardless of configuration.
@@ -394,11 +394,12 @@ Within each file, batch ordering always follows row order. The `limit` parameter
 | Use case | Recommended config |
 |---|---|
 | Small tables, simple queries | Default — no extra args needed |
-| Large tables, memory-constrained | `order=ScanOrder.ARRIVAL` — one file at a time, minimal memory |
-| Maximum throughput with bounded memory | `order=ScanOrder.ARRIVAL, concurrent_files=N` — tune N to balance throughput vs memory |
+| Large tables, memory-constrained | `order=ArrivalOrder()` — one file at a time, minimal memory |
+| Maximum throughput with bounded memory | `order=ArrivalOrder(concurrent_streams=N)` — tune N to balance throughput vs memory |
+| Fine-grained memory control | `order=ArrivalOrder(concurrent_streams=N, max_buffered_batches=M)` — tune both parameters |
 | Fine-grained batch control | Add `batch_size=N` to any of the above |

-**Note:** `ScanOrder.ARRIVAL` yields batches in arrival order (interleaved across files when `concurrent_files > 1`). For deterministic file ordering, use the default `ScanOrder.TASK` mode. `batch_size` is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
+**Note:** `ArrivalOrder()` yields batches in arrival order (interleaved across files when `concurrent_streams > 1`). For deterministic file ordering, use the default `TaskOrder()` mode. `batch_size` is usually an advanced tuning knob — the PyArrow default of 131,072 rows works well for most workloads.
To avoid any type inconsistencies during writing, you can convert the Iceberg table schema to Arrow:
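The arrival-order semantics this diff documents (batches interleaved across concurrently read files, per-file order preserved, memory bounded by `max_buffered_batches`) can be sketched with a bounded queue and a thread pool. This is an illustrative model only, not pyiceberg's implementation; `arrival_order_batches`, `read_batches`, and the file names are hypothetical.

```python
# Illustrative sketch, not pyiceberg internals: arrival-order streaming
# with bounded buffering. `concurrent_streams` caps how many files are
# read in parallel; `max_buffered_batches` caps the shared buffer,
# which bounds memory because producers block when it is full.
import queue
from concurrent.futures import ThreadPoolExecutor

_DONE = object()  # sentinel marking one stream's completion


def arrival_order_batches(files, read_batches, concurrent_streams=2, max_buffered_batches=4):
    """Yield batches from `files` as each arrives, interleaved across streams.

    `read_batches(f)` is any callable returning an iterator of batches
    for one file; within a file, batch order is preserved because each
    producer enqueues its batches in order.
    """
    buf = queue.Queue(maxsize=max_buffered_batches)  # the memory bound

    def produce(f):
        for batch in read_batches(f):
            buf.put(batch)  # blocks while the buffer is full
        buf.put(_DONE)

    with ThreadPoolExecutor(max_workers=concurrent_streams) as pool:
        for f in files:
            pool.submit(produce, f)
        finished = 0
        while finished < len(files):
            item = buf.get()
            if item is _DONE:
                finished += 1
            else:
                yield item


# Usage: two hypothetical "files", each yielding labeled batches.
data = {"a.parquet": ["a0", "a1"], "b.parquet": ["b0", "b1"]}
out = list(arrival_order_batches(data, lambda f: iter(data[f])))
assert sorted(out) == ["a0", "a1", "b0", "b1"]
# Per-file order survives even when streams interleave:
assert [b for b in out if b.startswith("a")] == ["a0", "a1"]
```

Note the trade-off the config table describes: a larger buffer smooths throughput spikes across streams, a smaller one keeps peak memory low. The sketch omits error handling (a failing producer would never post its sentinel), which a real reader must handle.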