Open
Labels: bug, help wanted, waiting on author
Description
🐛 Bug
To Reproduce
I'm trying to stream my Parquet dataset with the low_memory=True option, but I get an OutOfBoundsError. There is no problem when I set low_memory=False. The Parquet files are compressed with zstd.
Error
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/dataset.py", line 382, in __getitem__
[rank6]: return self.cache[index]
[rank6]: ~~~~~~~~~~^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/cache.py", line 145, in __getitem__
[rank6]: return self._reader.read(index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/reader.py", line 388, in read
[rank6]: item = self._item_loader.load_item_from_chunk(
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 619, in load_item_from_chunk
[rank6]: return self._get_item_with_low_memory(chunk_index, chunk_filepath, relative_index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/litdata/streaming/item_loader.py", line 678, in _get_item_with_low_memory
[rank6]: return row_group_df.row(row_index_within_group, named=True) # type: ignore
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: File "/root/csquare-site-encoder/model/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py", line 10853, in row
[rank6]: row = self._df.row_tuple(index)
[rank6]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank6]: polars.exceptions.OutOfBoundsError: index 791 is out of bounds for sequence of length 791
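The message "index 791 is out of bounds for sequence of length 791" means the row index passed to row() was exactly one past the last valid row (790) of the loaded row group. The traceback does not show how row_index_within_group is computed, so the following is only a hypothetical sketch, not litdata's code: one common way to map a chunk-relative index to a (row group, row within group) pair so that an index equal to a group's length rolls over to the next group. The function name locate_row and the row group sizes are assumptions made for illustration.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_row(index, row_group_sizes):
    """Map a chunk-relative row index to (row_group_index, row_index_within_group).

    Hypothetical helper for illustration only; it is not part of litdata or polars.
    """
    total = sum(row_group_sizes)
    if not 0 <= index < total:
        raise IndexError(f"index {index} is out of bounds for {total} rows")

    # Exclusive end offset of each row group, e.g. [791, 1291] for sizes [791, 500].
    ends = list(accumulate(row_group_sizes))

    # bisect_right sends an index equal to a group's end offset to the NEXT group,
    # so the within-group index can never equal the group's length.
    group = bisect_right(ends, index)
    start = ends[group - 1] if group > 0 else 0
    return group, index - start


# Assumed sizes: a first row group of 791 rows followed by one of 500 rows.
print(locate_row(790, [791, 500]))  # last row of the first group
print(locate_row(791, [791, 500]))  # first row of the second group
```

If the lookup instead assumed every row group had the same size, or treated the end offset as inclusive, index 791 would be sent to a 791-row group and fail in the way shown in the traceback. Whether that is what happens here would need to be confirmed against the actual row group sizes of the affected file.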
Code sample
# Methods of the reporter's data module; the enclosing class and the custom
# TripleStreamingDataset class are not shown in the report.
from pathlib import Path

from litdata import StreamingDataLoader
from litdata.streaming.item_loader import ParquetLoader


def _load_dataset(self):
    # Indexed folder
    data_path = Path(self.data_path)
    # Low memory
    item_loader = ParquetLoader(low_memory=True)
    dset = {
        "train": TripleStreamingDataset(
            str(data_path / "split=train"),
            item_loader=item_loader,
        ),
        "val": TripleStreamingDataset(
            str(data_path / "split=val"),
            item_loader=item_loader,
        ),
        "test": TripleStreamingDataset(
            str(data_path / "split=test"),
            item_loader=item_loader,
        ),
    }
    return dset

def setup(self, stage=None):
    # load
    dset = self._load_dataset()
    # train, val, test split
    self.train_dataset = dset["train"]
    self.val_dataset = dset["val"]
    self.test_dataset = dset["test"]

def train_dataloader(self):
    return StreamingDataLoader(
        self.train_dataset,
        batch_size=self.batch_size,
        **self.common_opts,
    )

def val_dataloader(self):
    return StreamingDataLoader(
        self.val_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )

def test_dataloader(self):
    return StreamingDataLoader(
        self.test_dataset,
        batch_size=self.batch_size * 2,
        **self.common_opts,
    )
Expected behavior
Streaming with low_memory=True should work without errors, just as it does with low_memory=False.
Additional context
Environment detail
- PyTorch Version: 2.6.0
- OS: Ubuntu 20.04
- How you installed PyTorch: uv pip
- Build command you used: ...
- Python version: 3.11
- CUDA/cuDNN version: 12.4
- GPU models and configuration: A100 * 8
- Any other relevant information: Using Lightning + DDP for training
bhimrazy