OutOfBoundsError when streaming parquet files with low_memory=True #553

@kyoungrok0517

🐛 Bug

To Reproduce

I'm trying to stream my parquet dataset with the low_memory=True option, but I get an OutOfBoundsError. There is no problem when I set low_memory=False. The parquet files are compressed with zstd.

Error
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__
[rank6]:     return self.cache[index]
[rank6]:            ~~~~~~~~~~^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__
[rank6]:     return self._reader.read(index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read
[rank6]:     item = self._item_loader.load_item_from_chunk(
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk
[rank6]:     return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory
[rank6]:     return row_group_df.row(row_index_within_group, named=True)  # type: ignore
[rank6]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]:   File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row
[rank6]:     row = self._df.row_tuple(index)
[rank6]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791
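
The failing index (791) equals the row-group length (791), which looks like an off-by-one at a row-group boundary. The sketch below is not litdata's code; it only illustrates, under that assumption, how a row index within a parquet file can be mapped to a row group and an in-group index so that index 791 lands on row 0 of the next group. The function name `locate_row` and the example row-group sizes are hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(relative_index, row_group_sizes):
    """Map a row index within a file to (row_group, row_index_within_group).

    Hypothetical helper for illustration only; not part of litdata or polars.
    """
    # Exclusive end offset of each row group, e.g. [791, 1582, 2082].
    ends = list(accumulate(row_group_sizes))
    total = ends[-1] if ends else 0
    if not 0 <= relative_index < total:
        raise IndexError(f"index {relative_index} is out of bounds for {total} rows")
    # bisect_right sends an index equal to a group's end offset to the NEXT group.
    group = bisect_right(ends, relative_index)
    start = ends[group - 1] if group > 0 else 0
    return group, relative_index - start


sizes = [791, 791, 500]  # made-up row-group sizes
print(locate_row(790, sizes))  # last row of the first group
print(locate_row(791, sizes))  # first row of the second group
```

If the lookup instead kept index 791 in the first group, reading it from a 791-row DataFrame would fail in the same way as the traceback above.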
Code sample
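from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader

# TripleStreamingDataset is a user-defined dataset class (definition not shown).
# The functions below are methods of the data module class (class definition not shown).

def _load_dataset(self):
    # Folder that has been indexed for streaming
    data_path = Path(self.data_path)
    # Low-memory mode: the setting that triggers the error
    item_loader = ParquetLoader(low_memory=True)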
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }

    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()

    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

Expected behavior

Streaming with low_memory=True should work without errors, just as it does with low_memory=False.

Additional context

Environment detail
  • PyTorch Version: 2.6.0
  • OS: Ubuntu 20.04
  • How you installed PyTorch: uv pip
  • Build command you used: ...
  • Python version: 3.11
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: A100 * 8
  • Any other relevant information: Using Lightning + DDP for training

Labels: bug, help wanted, waiting on author
