`OutOfBoundsError` when streaming parquet files with `low_memory=True` #553

## 🐛 Bug



### To Reproduce

I'm streaming my parquet dataset with the `low_memory=True` option and hit `polars.exceptions.OutOfBoundsError`. With `low_memory=False` the same dataset streams without any problem. The parquet files are compressed with `zstd`.

<details>
  <summary>Error</summary>

```python
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__                                                                           
[rank6]:     return self.cache[index]                                                                                                                                                                                  
[rank6]:            ~~~~~~~~~~^^^^^^^                                                                                                                                                                                  
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__                                                                             
[rank6]:     return self._reader.read(index)                                                                                                                                                                           
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read                                                                                   
[rank6]:     item = self._item_loader.load_item_from_chunk(                                                                                                                                                            
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                            
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk                                                              
[rank6]:     return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)                                                                                                                        
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                        
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory                                                         
[rank6]:     return row_group_df.row(row_index_within_group, named=True)  # type: ignore                                                                                                                               
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                               
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row                                                                                    
[rank6]:     row = self._df.row_tuple(index)                                                                                                                                                                           
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                           
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791   
```
</details>
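The requested index equals the row group length (`index 791` for a `sequence of length 791`), which looks like a boundary off-by-one when a row index is mapped to a row group. The following is only a minimal sketch of how such a mapping could treat the boundary, not litdata's actual code. The helper `locate_row` and the second row group size (`500`) are hypothetical; only `791` comes from the traceback.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(row_group_sizes, index):
    """Map a file-level row index to (row_group_index, index_within_group).

    Hypothetical helper for illustration; not part of litdata or polars.
    """
    total = sum(row_group_sizes)
    if not 0 <= index < total:
        raise IndexError(f"index {index} is out of bounds for {total} rows")
    # Cumulative end offsets, e.g. sizes [791, 500] -> ends [791, 1291].
    ends = list(accumulate(row_group_sizes))
    # bisect_right sends an index equal to a boundary to the NEXT group,
    # so row 791 becomes row 0 of group 1 instead of row 791 of group 0.
    group = bisect_right(ends, index)
    start = ends[group - 1] if group > 0 else 0
    return group, index - start


# 791 is the length from the traceback; 500 is a made-up second group size.
sizes = [791, 500]
last_in_first_group = locate_row(sizes, 790)
first_in_second_group = locate_row(sizes, 791)
```

If the lookup in `_get_item_with_low_memory` instead assigns the boundary index to the previous group, `row(791)` on a 791-row frame would fail exactly as shown above.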

<details>
  <summary>Code sample</summary>

```python
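from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# Excerpt: the functions below are methods of my data module class.
# TripleStreamingDataset is a project-specific dataset class (definition not shown).


def _load_dataset(self):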
    # Indexed folder
    data_path = Path(self.data_path)
    # Low memory
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }

    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()

    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )
```

</details>

### Expected behavior

Streaming with `low_memory=True` should work without errors, just as it does with `low_memory=False`.

### Additional context



<details>
  <summary>Environment detail</summary>

- PyTorch Version: `2.6.0`
- OS: Ubuntu 20.04
- How you installed PyTorch: `uv pip`
- Build command you used: `...`
- Python version: `3.11`
- CUDA/cuDNN version: `12.4`
- GPU models and configuration: `A100 * 8`
- Any other relevant information: Using `lightning` + DDP for training

</details>
