Commit ddff4ef

ezyang authored and facebook-github-bot committed
Don't use RTLD_GLOBAL to load _C. (pytorch#31162)
Summary:
Pull Request resolved: pytorch#31162

This should help us resolve a multitude of weird segfaults and crashes when PyTorch is imported along with other packages. Those would often happen because libtorch symbols were exposed globally and could be used as a source of relocations in shared libraries loaded after libtorch.

Fixes pytorch#3059.

Some of the subtleties in preparing this patch:

* Getting ASAN to play ball was a pain in the ass. The basic problem is that when we load with `RTLD_LOCAL`, we now may load a library multiple times into the address space; this happens when we have custom C++ extensions. Since the libraries are usually identical, this is usually benign, but it is technically undefined behavior and UBSAN hates it. I sprayed a few ways of getting things to "work" correctly: I preload libstdc++ (so that it is seen consistently over all library loads) and turned off vptr checks entirely. Another possibility is we should have a mode where we use `RTLD_GLOBAL` to load _C, which would be acceptable in environments where you're sure C++ lines up correctly. There's a long comment in the test script going into more detail about this.

* Making some of our shared library dependencies load with `RTLD_LOCAL` breaks them. OpenMPI and MKL don't work; they play linker shenanigans to look up their symbols, which doesn't work when loaded locally, and if we load a library with `RTLD_LOCAL` we aren't able to subsequently see it with `ctypes`. To solve this problem, we employ a clever device invented by apaszke: we create a dummy library `torch_global_deps` with dependencies on all of the libraries which need to be loaded globally, and then load that with `RTLD_GLOBAL`. As long as none of these libraries have C++ symbols, we can avoid confusion about the C++ standard library.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D19262579

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 06a48a5d2c9036aacd535f7e8a4de0e8fe1639f2
1 parent 8614860 commit ddff4ef

6 files changed

Lines changed: 115 additions & 46 deletions

.jenkins/pytorch/test.sh

Lines changed: 32 additions & 10 deletions
@@ -75,22 +75,44 @@ fi
 # if you're not careful. Check this if you made some changes and the
 # ASAN test is not working
 if [[ "$BUILD_ENVIRONMENT" == *asan* ]]; then
+    # Suppress vptr violations arising from multiple copies of pybind11
     export ASAN_OPTIONS=detect_leaks=0:symbolize=1:strict_init_order=true
-    # We suppress the vptr volation, since we have separate copies of
-    # libprotobuf in both libtorch.so and libcaffe2.so, and it causes
-    # the following problem:
-    # test_cse (__main__.TestJit) ... torch/csrc/jit/export.cpp:622:38:
-    # runtime error: member call on address ... which does not point
-    # to an object of type 'google::protobuf::MessageLite'
-    # ...: note: object is of type 'onnx_torch::ModelProto'
-    #
-    # This problem should be solved when libtorch.so and libcaffe2.so are
-    # merged.
     export UBSAN_OPTIONS=print_stacktrace=1:suppressions=$PWD/ubsan.supp
     export PYTORCH_TEST_WITH_ASAN=1
     export PYTORCH_TEST_WITH_UBSAN=1
     # TODO: Figure out how to avoid hard-coding these paths
     export ASAN_SYMBOLIZER_PATH=/usr/lib/llvm-5.0/bin/llvm-symbolizer
+    export TORCH_USE_RTLD_GLOBAL=1
+    # NB: We load libtorch.so with RTLD_GLOBAL for UBSAN, unlike our
+    # default behavior.
+    #
+    # The reason for this is that without RTLD_GLOBAL, if we load multiple
+    # libraries that depend on libtorch (as is the case with C++ extensions), we
+    # will get multiple copies of libtorch in our address space. When UBSAN is
+    # turned on, it will do a bunch of virtual pointer consistency checks which
+    # won't work correctly. When this happens, you get a violation like:
+    #
+    # member call on address XXXXXX which does not point to an object of
+    # type 'std::_Sp_counted_base<__gnu_cxx::_Lock_policy::_S_atomic>'
+    # XXXXXX note: object is of type
+    # 'std::_Sp_counted_ptr<torch::nn::LinearImpl*, (__gnu_cxx::_Lock_policy)2>'
+    #
+    # (NB: the textual types of the objects here are misleading, because
+    # they actually line up; it just so happens that there's two copies
+    # of the type info floating around in the address space, so they
+    # don't pointer compare equal. See also
+    # https://github.com/google/sanitizers/issues/1175
+    #
+    # UBSAN is kind of right here: if we relied on RTTI across C++ extension
+    # modules they would indeed do the wrong thing; but in our codebase, we
+    # don't use RTTI (because it doesn't work in mobile). To appease
+    # UBSAN, however, it's better if we ensure all the copies agree!
+    #
+    # By the way, an earlier version of this code attempted to load
+    # libtorch_python.so with LD_PRELOAD, which has a similar effect of causing
+    # it to be loaded globally. This isn't really a good idea though, because
+    # it depends on a ton of dynamic libraries that most programs aren't gonna
+    # have, and it applies to child processes.
     export LD_PRELOAD=/usr/lib/llvm-5.0/lib/clang/5.0.0/lib/linux/libclang_rt.asan-x86_64.so
     # Increase stack size, because ASAN red zones use more stack
     ulimit -s 81920

caffe2/CMakeLists.txt

Lines changed: 26 additions & 0 deletions
@@ -1085,6 +1085,32 @@ if(USE_CUDA)
 endif()
 
 
+# Note [Global dependencies]
+# Some libraries (e.g. OpenMPI) like to dlopen plugins after they're initialized,
+# and they assume that all of their symbols will be available in the global namespace.
+# On the other hand we try to be good citizens and avoid polluting the symbol
+# namespaces, so libtorch is loaded with all its dependencies in a local scope.
+# That usually leads to missing symbol errors at run-time, so to avoid a situation like
+# this we have to preload those libs in a global namespace.
+add_library(torch_global_deps SHARED ${TORCH_SRC_DIR}/csrc/empty.c)
+set_target_properties(torch_global_deps PROPERTIES LINKER_LANGUAGE C)
+if (USE_MPI)
+  target_link_libraries(torch_global_deps ${MPI_CXX_LIBRARIES})
+endif()
+target_link_libraries(torch_global_deps ${MKL_LIBRARIES})
+# The CUDA libraries are linked here for a different reason: in some
+# cases we load these libraries with ctypes, and if they weren't opened
+# with RTLD_GLOBAL, we'll do the "normal" search process again (and
+# not find them, because they're usually in non-standard locations)
+if (USE_CUDA)
+  target_link_libraries(torch_global_deps ${TORCH_CUDA_LIBRARIES})
+  target_link_libraries(torch_global_deps ${Caffe2_PUBLIC_CUDA_DEPENDENCY_LIBS})
+  target_link_libraries(torch_global_deps torch::cudart)
+endif()
+
+install(TARGETS torch_global_deps DESTINATION "${TORCH_INSTALL_LIB_DIR}")
+
+
 # ---[ Caffe2 HIP sources.
 if(USE_ROCM)
 # Call again since Caffe2_HIP_INCLUDE is extended with ATen include dirs.
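
The `torch_global_deps` target above is what `torch/__init__.py` in this commit looks for under the package's `lib` directory. A minimal sketch of that lookup, written as a pure function so it can be exercised without a real install; the function name `global_deps_path` and its parameters are hypothetical and not part of the patch:

```python
import os


def global_deps_path(package_dir, system):
    """Return the expected path of the dummy dependency library, or None.

    `package_dir` stands in for the directory containing torch/__init__.py
    and `system` for the value of platform.system(); both are assumptions
    made so the logic can be tested in isolation.
    """
    if system == 'Windows':
        # The RTLD_GLOBAL preload only applies to dlopen-based platforms.
        return None
    suffix = '.dylib' if system == 'Darwin' else '.so'
    return os.path.join(package_dir, 'lib', 'libtorch_global_deps' + suffix)
```

The returned path would then be handed to `ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)`, which is the call the patch uses to pull the listed dependencies into the global namespace.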

torch/__init__.py

Lines changed: 56 additions & 34 deletions
@@ -13,8 +13,10 @@
 import os
 import sys
 import platform
+import ctypes
 from ._utils import _import_dotted_name
-from ._utils_internal import get_file_path, prepare_multiprocessing_environment
+from ._utils_internal import get_file_path, prepare_multiprocessing_environment, \
+    USE_RTLD_GLOBAL_WITH_LIBTORCH
 from .version import __version__
 from ._six import string_classes as _string_classes
 
@@ -33,61 +35,81 @@
 # Load the extension module
 ################################################################################
 
-# Loading the extension with RTLD_GLOBAL option allows to not link extension
-# modules against the _C shared object. Their missing THP symbols will be
-# automatically filled by the dynamic loader.
-import os as _dl_flags
-
-# if we have numpy, it *must* be imported before the call to setdlopenflags()
-# or there is risk that later c modules will segfault when importing numpy
-try:
-    import numpy as _np
-except ImportError:
-    pass
-
 if platform.system() == 'Windows':
-    # first get nvToolsExt PATH
-    def get_nvToolsExt_path():
-        NVTOOLEXT_HOME = _dl_flags.getenv('NVTOOLSEXT_PATH', 'C:\\Program Files\\NVIDIA Corporation\\NvToolsExt')
+    NVTOOLSEXT_PATH = os.getenv('NVTOOLSEXT_PATH', 'C:\\Program Files\\NVIDIA Corporation\\NvToolsExt')
 
-        if _dl_flags.path.exists(NVTOOLEXT_HOME):
-            return _dl_flags.path.join(NVTOOLEXT_HOME, 'bin', 'x64')
-        else:
-            return ''
+    if os.path.exists(NVTOOLSEXT_PATH):
+        nvtoolsext_lib_path = os.path.join(NVTOOLSEXT_PATH, 'bin', 'x64')
+    else:
+        nvtoolsext_lib_path = ''
 
-    py_dll_path = _dl_flags.path.join(sys.exec_prefix, 'Library', 'bin')
-    th_dll_path = _dl_flags.path.join(_dl_flags.path.dirname(__file__), 'lib')
+    py_dll_path = os.path.join(sys.exec_prefix, 'Library', 'bin')
+    th_dll_path = os.path.join(os.path.dirname(__file__), 'lib')
 
-    dll_paths = [th_dll_path, py_dll_path, get_nvToolsExt_path(), _dl_flags.environ['PATH']]
+    dll_paths = [th_dll_path, py_dll_path, nvtoolsext_lib_path, os.environ['PATH']]
 
     # then add the path to env
-    _dl_flags.environ['PATH'] = ';'.join(dll_paths)
+    os.environ['PATH'] = ';'.join(dll_paths)
 
-else:
-    # first check if the os package has the required flags
+
+# See Note [Global dependencies]
+def _load_global_deps():
+    if platform.system() == 'Windows':
+        return
+
+    lib_name = 'libtorch_global_deps' + ('.dylib' if platform.system() == 'Darwin' else '.so')
+    here = os.path.abspath(__file__)
+    lib_path = os.path.join(os.path.dirname(here), 'lib', lib_name)
+
+    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
+
+
+if (USE_RTLD_GLOBAL_WITH_LIBTORCH or os.getenv('TORCH_USE_RTLD_GLOBAL')) and \
+        platform.system() != 'Windows':
+    # Do it the hard way. You might want to load libtorch with RTLD_GLOBAL in a
+    # few circumstances:
+    #
+    # 1. You're in a build environment (e.g., fbcode) where
+    # libtorch_global_deps is not available, but you still need
+    # to get mkl to link in with RTLD_GLOBAL or it will just
+    # not work.
+    #
+    # 2. You're trying to run PyTorch under UBSAN and you need
+    # to ensure that only one copy of libtorch is loaded, so
+    # vptr checks work properly
+    #
+    # If you're using this setting, you must verify that all the libraries
+    # you load consistently use the same libstdc++, or you may have
+    # mysterious segfaults.
+    #
+    import os as _dl_flags
     if not hasattr(_dl_flags, 'RTLD_GLOBAL') or not hasattr(_dl_flags, 'RTLD_LAZY'):
         try:
             # next try if DLFCN exists
             import DLFCN as _dl_flags
         except ImportError:
             # as a last attempt, use compile-time constants
             import torch._dl as _dl_flags
-
     old_flags = sys.getdlopenflags()
     sys.setdlopenflags(_dl_flags.RTLD_GLOBAL | _dl_flags.RTLD_LAZY)
+    from torch._C import *
+    sys.setdlopenflags(old_flags)
+    del old_flags
+    del _dl_flags
 
-del _dl_flags
-
-from torch._C import *
+else:
+    # Easy way. You want this most of the time, because it will prevent
+    # C++ symbols from libtorch clobbering C++ symbols from other
+    # libraries, leading to mysterious segfaults.
+    #
+    # See Note [Global dependencies]
+    _load_global_deps()
+    from torch._C import *
 
 __all__ += [name for name in dir(_C)
             if name[0] != '_' and
             not name.endswith('Base')]
 
-if platform.system() != 'Windows':
-    sys.setdlopenflags(old_flags)
-    del old_flags
-
 ################################################################################
 # Define basic utilities
 ################################################################################
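
The "hard way" branch above saves the interpreter's dlopen flags, sets `RTLD_GLOBAL | RTLD_LAZY`, imports `_C`, and then restores the old flags. The same save/set/restore pattern can be sketched as a context manager, which also restores the flags if the import raises. This is only an illustration, assuming a Unix platform where `sys.setdlopenflags` and `os.RTLD_GLOBAL` exist; the helper name `dlopen_flags` is hypothetical and not part of the patch:

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def dlopen_flags(flags):
    """Temporarily change the flags CPython passes to dlopen() on import."""
    old_flags = sys.getdlopenflags()
    sys.setdlopenflags(flags)
    try:
        yield old_flags
    finally:
        # Restore the previous flags even if the body raises.
        sys.setdlopenflags(old_flags)


# Usage (hypothetical module name): symbols of extension modules imported
# inside the block are exported to the global namespace.
# with dlopen_flags(os.RTLD_GLOBAL | os.RTLD_LAZY):
#     import some_extension
```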

torch/_utils_internal.py

Lines changed: 1 addition & 0 deletions
@@ -54,3 +54,4 @@ def get_source_lines_and_file(obj):
 
 TEST_MASTER_ADDR = '127.0.0.1'
 TEST_MASTER_PORT = 29500
+USE_RTLD_GLOBAL_WITH_LIBTORCH = False
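
With this default set to `False`, the global-loading path in `torch/__init__.py` is taken only when the `TORCH_USE_RTLD_GLOBAL` environment variable is set. Because the patch tests the raw string returned by `os.getenv`, any non-empty value enables it, including `0`. A small sketch of that condition, with the platform check omitted; the function name `use_rtld_global` and its parameters are hypothetical:

```python
def use_rtld_global(build_default, environ):
    """Mirror the `USE_RTLD_GLOBAL_WITH_LIBTORCH or os.getenv(...)` test.

    `build_default` stands in for USE_RTLD_GLOBAL_WITH_LIBTORCH and
    `environ` for os.environ; the Windows exclusion is left out here.
    """
    # A non-empty string is truthy in Python, so '0' still enables the mode.
    return bool(build_default or environ.get('TORCH_USE_RTLD_GLOBAL'))
```

To turn the mode off from the environment, the variable has to be unset or empty rather than set to `0`.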

torch/csrc/empty.c

Whitespace-only changes.

ubsan.supp

Lines changed: 0 additions & 2 deletions
@@ -1,3 +1 @@
-vptr:libtorch.so
 vptr:libtorch_python.so
-vptr:libcaffe2.so
