When are buffers moved to gpu? #12207
-
I have an issue with a weighted MSE loss that I build in `setup`, with a buffer as one of its arguments. Something like this:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


@torch.jit.script
def weighted_mse_func(weights, y, y_hat):
    # weighted regression loss
    reg_loss = torch.dot(weights, torch.mean(F.mse_loss(y_hat, y, reduction='none'), dim=0))
    return reg_loss


def weighted_mse(weights):
    def func(y, y_hat):
        return weighted_mse_func(weights, y, y_hat)
    return func


class model(pl.LightningModule):
    def __init__(self, weights):
        super().__init__()
        weights = torch.tensor(weights.copy(), dtype=self.dtype, device=self.device)
        self.register_buffer("weights", weights)

    def setup(self, stage):
        super().setup(stage)
        self.loss = weighted_mse(self.weights)
```

When initializing training on the GPU I get an error because …
Replies: 2 comments 5 replies
-
try:

```python
class model(pl.LightningModule):
    def __init__(self, weights):
        super().__init__()
        self.register_buffer("weights", torch.tensor(weights.copy(), dtype=self.dtype))

    def on_fit_start(self):
        self.loss = weighted_mse(self.weights)
```
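A note on why this helps (my reading of the hook order, not stated explicitly in the reply): `on_fit_start` runs after the Trainer has moved the model to its target device, so `self.weights` is already the on-device buffer when the closure captures it, whereas in `setup` the module is still on the CPU. Under the same assumption, an alternative sketch is to skip the closure entirely and look the buffer up at call time, for example in `training_step`:

```python
def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    # self.weights is looked up on every call, so it is always the buffer
    # that currently lives on the module's device (hypothetical usage sketch).
    return weighted_mse_func(self.weights, y, y_hat)
```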
-
It seems that PyTorch doesn't move buffers in place (as it does for parameters), so references to a buffer become stale when the module is moved from one device to another. This issue is discussed in pytorch/pytorch#43815.
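To make the rebinding behaviour concrete, here is a minimal standalone sketch (not from the thread; it assumes a CUDA device is available):

```python
import torch
import torch.nn as nn

m = nn.Module()
m.register_buffer("weights", torch.ones(3))

ref = m.weights          # reference captured while the module is still on the CPU

m.to("cuda")             # buffers are rebound to new tensors, not moved in place

print(m.weights.device)  # cuda:0 -- the buffer attribute now points to the GPU tensor
print(ref.device)        # cpu    -- the earlier reference still points to the old CPU tensor
```

This is why a loss closure built in `setup` from `self.weights` keeps using the CPU tensor after the model has been moved to the GPU.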