
When I was testing the pipedream code with an updated torch version (1.1.0 -> 1.11.0), I encountered the following error: #78

Open
lengien opened this issue Apr 11, 2023 · 9 comments

Comments

@lengien

lengien commented Apr 11, 2023

  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 356, in train
    r.run_forward()
  File "../runtime_3.py", line 511, in run_forward
    self._run_forward(tensors)
  File "../runtime_3.py", line 559, in _run_forward
    for input_name in input_names])
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/workspace/pipeline/runtime/image_classification/models/alexnet/gpus=4_straight/stage2.py", line 25, in forward
    out5 = self.layer5(out4)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
    self.padding, self.dilation, self.groups)
 (Triggered internally at  ../torch/csrc/autograd/python_anomaly_mode.cpp:104.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "main_with_runtime_1.py", line 580, in <module>
    main()
  File "main_with_runtime_1.py", line 307, in main
    train(train_loader, r, optimizer, epoch)
  File "main_with_runtime_1.py", line 407, in train
    r.run_backward()
  File "../runtime_3.py", line 648, in run_backward
    for output_name in outputs]))
  File "/opt/conda/envs/torch/lib/python3.7/site-packages/torch/autograd/__init__.py", line 175, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [256, 256, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

PipeDream's scheduling logic means that some stages perform several forward passes before running the corresponding backward pass. This seems to conflict with checks in the newer version of Torch. I would like to ask how to avoid this problem.
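For reference, the failure can be illustrated outside of the PipeDream runtime. Below is a minimal sketch (the module, shapes, and plain SGD optimizer are made up for illustration and are not taken from this repository) of what the schedule effectively does: the backward pass for an earlier micro-batch runs only after the optimizer has already updated the layer's weight in place, and recent PyTorch versions reject this via the version counter of tensors saved for backward.

```python
import torch
import torch.nn as nn

# Minimal sketch, not PipeDream code: one conv "stage" and a plain SGD step.
conv = nn.Conv2d(4, 4, kernel_size=3, padding=1)
opt = torch.optim.SGD(conv.parameters(), lr=0.1)

x1 = torch.randn(2, 4, 8, 8, requires_grad=True)  # micro-batch 1
x2 = torch.randn(2, 4, 8, 8, requires_grad=True)  # micro-batch 2

# Forward for micro-batch 1: autograd saves conv.weight for its backward pass.
loss1 = conv(x1).sum()

# Forward + backward + optimizer step for micro-batch 2.
opt.zero_grad()
conv(x2).sum().backward()
opt.step()  # in-place parameter update bumps conv.weight's version counter

# Delayed backward for micro-batch 1 now raises on recent torch (e.g. 1.11):
# "RuntimeError: one of the variables needed for gradient computation has been
#  modified by an inplace operation ..."
loss1.backward()
```

One likely reason this only shows up after the upgrade is that older optimizers updated `p.data`, which bypassed the version counter, whereas current optimizers update the parameter directly under `torch.no_grad()`, so the delayed backward sees the bumped version; this is my reading of the behavior, not something stated in the PipeDream code.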

Versions
Collecting environment information...
PyTorch version: 1.11.0+cu115
Is debug build: False
CUDA used to build PyTorch: 11.5
ROCM used to build PyTorch: N/A

OS: Ubuntu 16.04.6 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Clang version: Could not collect
CMake version: version 3.5.1
Libc version: glibc-2.17

Python version: 3.7.13 (default, Mar 29 2022, 02:18:16) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-4.15.0-204-generic-x86_64-with-debian-stretch-sid
Is CUDA available: True
CUDA runtime version: 10.1.163
CUDA_MODULE_LOADING set to:
GPU models and configuration:
GPU 0: Tesla P100-PCIE-12GB
GPU 1: Tesla P100-PCIE-12GB
GPU 2: Tesla P100-PCIE-12GB
GPU 3: Tesla P100-PCIE-12GB
GPU 4: Tesla P100-PCIE-12GB
GPU 5: Tesla P100-PCIE-12GB
GPU 6: Tesla P100-PCIE-12GB
GPU 7: Tesla P100-PCIE-12GB

Nvidia driver version: 515.65.01
cuDNN version: /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 20
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Stepping: 1
CPU MHz: 2200.102
BogoMIPS: 4404.71
Hypervisor vendor: vertical
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
L3 cache: 16384K
NUMA node0 CPU(s): 0-19
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt arat

Versions of relevant libraries:
[pip3] msgpack-numpy==0.4.3.2
[pip3] numpy==1.21.5
[pip3] torch==1.11.0+cu115
[pip3] torchvision==0.12.0+cu115
[conda] cudatoolkit 10.2.89 hfd86e86_1
[conda] magma-cuda100 2.1.0 5 local
[conda] mkl 2019.1 144
[conda] mkl-include 2019.1 144
[conda] msgpack-numpy 0.4.3.2 py37_0
[conda] numpy 1.21.5 py37h7a5d4dd_2
[conda] numpy-base 1.21.5 py37hb8be1f0_2
[conda] torch 1.11.0+cu115 pypi_0 pypi
[conda] torchvision 0.12.0+cu115 pypi_0 pypi

@xglds99

xglds99 commented Jun 6, 2023

I ran into the same problem. Have you solved it? If so, could you share your solution? Thank you.

@lengien
Author

lengien commented Jun 6, 2023

I ran into the same problem. Have you solved it? If so, could you share your solution? Thank you.
I found a new framework, PiPPy (https://github.com/pytorch/PiPPy), which can be used with torch 2.0.

@hnust-xxq

Can you teach me how to successfully run this code? I always encounter errors when applying the patch file. My graphics card is an RTX 4090D, with Driver Version 550.78, PyTorch 2.1.2, Python 3.10 (Ubuntu 22.04), and CUDA 11.8. I don't know how to adjust my setup to meet the requirements of this code. Can you help me, or can you provide the configuration you used to run this code?

@lengien
Author

lengien commented Oct 26, 2024 via email

@hnust-xxq

The PyTorch version may be too high; I ran the code successfully with Docker: nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3


I'm using a cloud server, and I've been allocated a Docker container. Can I pull another Docker image and use Docker commands inside it? I don't think that's possible, right? Do you know how to work with cloud servers?

@lengien
Author

lengien commented Oct 26, 2024 via email

@hnust-xxq

Could you please tell me the configuration of your GPU server? I only have a Docker container on the server and cannot use Docker to create environments, so I need to install all the dependencies myself. Thank you! My English is not very good, so I apologize if I have offended you in any way.

Try torch==1.1.0, or try another pipeline library, PiPPy; PiPPy can run with a new version of torch.


@hnust-xxq

Is PyTorch v1.1.0 feasible? The command nvidia-docker pull nvcr.io/nvidia/pytorch:19.05-py3 corresponds to PyTorch v1.0.0. Did you encounter any issues when applying the patch?

@lengien
Author

lengien commented Oct 26, 2024 via email
