ddp_spawn causing "Cannot allocate memory" #20796
Unanswered
jonathanrenusch asked this question in DDP / multi-GPU / multi-node
Multi-GPU Training on In-Memory Graphs
Problem:
I'm aiming to perform multi-GPU training on an in-memory graph dataset (approximately 100 GB). The standard Distributed Data Parallel (DDP) implementation in PyTorch appears to create a full copy of the dataset for each distributed process: with n GPUs, the entire 100 GB dataset is loaded into CPU RAM n times, and only then does the `DistributedSampler` partition the data for each GPU. This leads to significant and unnecessary CPU RAM consumption.

Desired Solution:
My goal is to load the 100 GB dataset into CPU RAM only once and then use the `DistributedSampler` (or another suitable shuffled sampler) to hand each of the n distributed GPU processes a distinct partition of this single dataset instance. This would avoid redundant data loading and substantially reduce CPU RAM usage.
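To make the goal concrete, the pattern I'm after looks roughly like this (a minimal sketch with placeholder names and paths, not working code): the tensors are loaded and moved into shared memory exactly once, and each rank only ever receives a disjoint set of indices.

```python
import torch
from torch.utils.data import Dataset, DataLoader, DistributedSampler


class SharedGraphDataset(Dataset):
    """Placeholder wrapper around a list of dicts of tensors stored in a .pt file."""

    def __init__(self, path):
        # Load the full ~100 GB list of dicts exactly once.
        self.samples = torch.load(path)
        # Move every tensor into shared memory so that processes created
        # afterwards can reference the same physical storage instead of
        # each receiving their own copy.
        for sample in self.samples:
            for tensor in sample.values():
                tensor.share_memory_()

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# Each distributed rank should then only need its own shard of indices into
# the single shared copy. The rank/world size below are placeholders; inside
# a Trainer they would come from the initialized process group.
dataset = SharedGraphDataset("graphs.pt")  # placeholder path
sampler = DistributedSampler(dataset, num_replicas=4, rank=0, shuffle=True)
loader = DataLoader(dataset, batch_size=1, sampler=sampler, num_workers=1)
```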
Attempts and Challenges:
The Lightning documentation on sharing datasets across process boundaries seems to address this exact scenario. However, I've encountered persistent errors when attempting to implement the suggested solutions. Specifically, I have tried:

- The `ddp` strategy.
- The `ddp_spawn` strategy, which results in the attached error message (see below).
- Setting `num_workers` for the `DataLoader` to 1.
- Using a `batch_size` of 1 to minimize per-process memory.
Dataset Structure:
My dataset is a list of dictionaries stored in native PyTorch `.pt` format. Each dictionary contains the following keys, with corresponding PyTorch tensors as values:

- `node_features`: tensor of node features.
- `edge_indices`: tensor representing the graph's edge connections.
- `labels`: tensor of target labels.
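Concretely, a single element of the list has roughly this shape of structure (the sizes are placeholders, and the `(2, num_edges)` layout for `edge_indices` is an assumption for illustration):

```python
import torch

num_nodes, num_node_features, num_edges = 100, 16, 400  # placeholder sizes

# One element of the list stored in the .pt file:
sample = {
    "node_features": torch.randn(num_nodes, num_node_features),
    "edge_indices": torch.randint(0, num_nodes, (2, num_edges)),  # assumed (2, num_edges) layout
    "labels": torch.tensor(0),
}
```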
Constraints:
Due to downstream deployment requirements, using PyTorch Geometric `Data` objects is not a viable option. Therefore, I require a solution that works with a custom `Dataset` and a `custom_fn`.
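For illustration, a simplified version of the kind of collate-style function I mean looks like this (it assumes the `(2, num_edges)` layout for `edge_indices` from above and one label tensor per graph):

```python
import torch


def custom_fn(batch):
    """Simplified collate function for a list of graph dicts (no PyG involved)."""
    node_features, edge_indices, labels = [], [], []
    node_offset = 0
    for sample in batch:
        node_features.append(sample["node_features"])
        # Shift edge indices so they keep pointing at the right rows of the
        # concatenated node-feature tensor.
        edge_indices.append(sample["edge_indices"] + node_offset)
        labels.append(sample["labels"])
        node_offset += sample["node_features"].shape[0]

    return {
        "node_features": torch.cat(node_features, dim=0),
        "edge_indices": torch.cat(edge_indices, dim=1),  # assumes (2, num_edges)
        "labels": torch.stack(labels, dim=0),
    }
```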
Error Message (when using `ddp_spawn`):

Next Steps (if current approach fails):
If a solution for shared in-memory datasets with `ddp_spawn` cannot be found, I will be forced to explore alternative strategies such as:
Call for Help:
Any insights, suggestions, or code examples demonstrating how to correctly implement shared in-memory datasets with `ddp_spawn` (or an alternative multi-GPU strategy that avoids redundant loading) would be greatly appreciated! I'm particularly interested in understanding how to properly configure the `Dataset`, `DataLoader`, and potentially a custom `collate_fn` or `worker_init_fn` within the Lightning framework to achieve this.
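The overall wiring I have in mind looks roughly like the following (heavily simplified; it reuses the `SharedGraphDataset` and `custom_fn` sketches from above, and the device count and path are placeholders):

```python
import lightning.pytorch as pl
from torch.utils.data import DataLoader, DistributedSampler


class GraphDataModule(pl.LightningDataModule):
    """Simplified DataModule that loads the .pt file once in the main process."""

    def __init__(self, path, batch_size=1, num_workers=1):
        super().__init__()
        self.batch_size = batch_size
        self.num_workers = num_workers
        # Loaded once here, before the worker processes are spawned; my reading
        # of the docs on sharing datasets across process boundaries is that the
        # shared-memory tensors should then be reused rather than copied.
        self.dataset = SharedGraphDataset(path)

    def train_dataloader(self):
        # Lightning normally injects a DistributedSampler on its own; it is
        # written out here only to make the intended partitioning explicit.
        sampler = DistributedSampler(self.dataset, shuffle=True)
        return DataLoader(
            self.dataset,
            batch_size=self.batch_size,
            sampler=sampler,
            num_workers=self.num_workers,
            collate_fn=custom_fn,
        )


# Intended usage (model construction omitted, device count is a placeholder):
# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp_spawn")
# trainer.fit(model, datamodule=GraphDataModule("graphs.pt"))
```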
Custom_fn:

Lightning Data Module: