
segmentation fault while using onnxruntime==1.21.0 #24144

Open
vmnit opened this issue Mar 24, 2025 · 11 comments
Labels
core runtime issues related to core runtime

Comments

@vmnit

vmnit commented Mar 24, 2025

onnxruntime crashes with a segmentation fault when using version 1.21.0. It does not crash with the 1.20.1 release.

Steps to reproduce:

import onnxruntime as ort
sess_options = ort.SessionOptions()
sess = ort.InferenceSession('hf_Qwen2-7B-Instruct_model.onnx', sess_options)

hf_Qwen2-7B-Instruct_model.onnx.gz

@yuslepukhin yuslepukhin added the core runtime issues related to core runtime label Mar 24, 2025
@yuslepukhin
Member

The model is referring to an external weights file. Would you like to supply it?

[image attachment]

@yuslepukhin
Member

Here is the exception message that is issued. I am not seeing a segmentation fault on the main build:

unknown file: error: C++ exception with description "Load model from D:/dev/data/SegmentationFault_gh_24144/hf_Qwen2-7B-Instruct_model.onnx failed:Load model D:/dev/data/SegmentationFault_gh_24144/hf_Qwen2-7B-Instruct_model.onnx failed" thrown in the test body.

@vmnit
Author

vmnit commented Mar 25, 2025

Hi @yuslepukhin,

Thanks for looking into it.
The data file is huge, around 29 GB. Can you please suggest a way to share it?

@vmnit
Author

vmnit commented Mar 25, 2025

Hi @yuslepukhin ,

I'm using the model from the following location: https://huggingface.co/Qwen/Qwen2-7B-Instruct/tree/main
Can you please try generating the ONNX model from there? I'm unable to find a way to upload the large data file.

@yuslepukhin
Member

Please share exactly what you did.

Also, please share any console messages; enable logging and share the output, and explain specifically what makes you think there is a segmentation fault. Please also fill out the issue template, including the version of your Linux OS, etc.
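One stdlib-only way to confirm a native crash (as opposed to a Python exception) is to enable faulthandler before creating the session; a segfault then dumps a Python traceback pointing at the offending call. The ONNX Runtime lines are commented out below since they need the exported model file, and the model path is only an assumption:

```python
import faulthandler

# Dump a Python traceback if the process receives a fatal signal
# such as SIGSEGV, instead of dying silently.
faulthandler.enable()
print("faulthandler enabled:", faulthandler.is_enabled())

# Hypothetical usage (requires onnxruntime and the exported model.onnx):
# import onnxruntime as ort
# so = ort.SessionOptions()
# so.log_severity_level = 0   # 0 = VERBOSE: surfaces load-time details
# sess = ort.InferenceSession("model.onnx", so)
```

If the traceback ends inside `InferenceSession`, that is strong evidence the crash is in native model-loading code rather than in Python.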

@amd-vivekag

@yuslepukhin I'm working on a script that can reproduce the issue at your end. I'll share it with you soon.

@vmnit
Author

vmnit commented Mar 27, 2025

Steps to reproduce:

  1. Create virtual environment: python -m venv myenv.env
  2. Activate it: source myenv.env/bin/activate
  3. Pip upgrade: pip install --upgrade pip
  4. Install some libraries: pip install onnx optimum[exporters] onnxruntime
  5. Set CACHE_DIR: export CACHE_DIR=<SOME_PATH>
  6. run script: python test_seg_fault.py
# test_seg_fault.py
import os
from optimum.exporters.onnx import main_export

cache_dir = os.environ["CACHE_DIR"]
os.environ["HF_HOME"] = cache_dir
os.environ["HUGGINGFACE_HUB_CACHE"] = cache_dir

main_export(
        "Qwen/Qwen2-7B-Instruct",
        os.getcwd(),
        task='text-generation',
        cache_dir=cache_dir,
        local_files_only=False,
        monolith=True,
        framework="pt",
        optimize=None,
        )

import onnxruntime as ort
sess_options = ort.SessionOptions()

print("before ort.InferenceSession")
sess = ort.InferenceSession('model.onnx', sess_options)

print(sess)

Please let me know if you need any information from my side in this regard.

Thanks

@yuslepukhin
Member

yuslepukhin commented Mar 27, 2025

I have followed the procedure and got the model, then produced a debug build from the tip of main.
I tried both a C++ test and your Python script, simply loading the model.
I did not get a repro. At one point physical memory usage clocked in at 31 GB and total commit was 44 GB, so it had its share of page faults, but the process completed normally. The next release is about a month away.

[image attachment]

@vmnit
Author

vmnit commented Mar 28, 2025

Hi @yuslepukhin,

Were you able to run the complete script without any segmentation fault? If yes, can you please check the onnxruntime version?
I'm able to reproduce it with the following library versions:

Successfully installed MarkupSafe-3.0.2 certifi-2025.1.31 charset-normalizer-3.4.1 coloredlogs-15.0.1 filelock-3.18.0 flatbuffers-25.2.10 fsspec-2025.3.0 huggingface-hub-0.29.3 humanfriendly-10.0 idna-3.10 jinja2-3.1.6 mpmath-1.3.0 networkx-3.4.2 numpy-2.2.4 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-cusparselt-cu12-0.6.2 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 onnx-1.17.0 onnxruntime-1.21.0 optimum-1.24.0 packaging-24.2 pillow-11.1.0 protobuf-6.30.2 pyyaml-6.0.2 regex-2024.11.6 requests-2.32.3 safetensors-0.5.3 sympy-1.13.1 timm-1.0.15 tokenizers-0.21.1 torch-2.6.0 torchvision-0.21.0 tqdm-4.67.1 transformers-4.48.3 triton-3.2.0 typing-extensions-4.13.0 urllib3-2.3.0

The segmentation fault occurs at the inference-session step: sess = ort.InferenceSession('model.onnx', sess_options)
The print statement after that line never executes. But if you are getting a valid sess object, then it seems to be working for you.

@yuslepukhin
Member

yuslepukhin commented Mar 28, 2025

The bug reproduces with 1.21.0, but is not there with the latest code.

D:\dev\data\SegmentationFault_gh_24144$ pip list
Package Version


certifi 2025.1.31
charset-normalizer 3.4.1
colorama 0.4.6
coloredlogs 15.0.1
filelock 3.18.0
flatbuffers 25.2.10
fsspec 2025.3.0
huggingface-hub 0.29.3
humanfriendly 10.0
idna 3.10
Jinja2 3.1.6
MarkupSafe 3.0.2
mpmath 1.3.0
networkx 3.4.2
numpy 1.24.3
onnx 1.17.0
onnxruntime 1.22.0
optimum 1.24.0
packaging 24.2
pip 25.0.1
protobuf 6.30.2
pyreadline3 3.5.4
PyYAML 6.0.2
regex 2024.11.6
requests 2.32.3
safetensors 0.5.3
setuptools 65.5.0
sympy 1.13.1
tokenizers 0.21.1
torch 2.6.0
tqdm 4.67.1
transformers 4.50.2
typing_extensions 4.13.0
urllib3 2.3.0

@vmnit
Author

vmnit commented Mar 29, 2025

(quoting yuslepukhin's comment above) The bug reproduces with 1.21.0, but is not there with the latest code.

@yuslepukhin It's great that you are able to reproduce the issue. I think we should add this as a test case to avoid such a regression in the future. What do you think? Let me know if you want me to add it; if so, can you please share some documentation or guide me on how to add and verify it?
