System information
Have I specified the code to reproduce the issue (Yes/No): yes
Environment in which the code is executed (e.g., Local (Linux/MacOS/Windows), Interactive Notebook, Google Cloud, etc):
- TensorFlow version (you are using): 2.3.2
- TFX Version: 0.26.1
- Python version: 3.6.7
Describe the current behavior
In the tfma module, the get_model_type function (model_util.py:109) loads the model onto the GPU in order to infer its type, then returns that type.
  if model_path:
    try:
      keras_model = tf.keras.models.load_model(model_path)
      # In some cases, tf.keras.models.load_model can successfully load a
      # saved_model but it won't actually be a keras model.
      if isinstance(keras_model, tf.keras.models.Model):
        return constants.TF_KERAS
    except Exception:  # pylint: disable=broad-except
      pass
  if tags:
    if tags and eval_constants.EVAL_TAG in tags:
      return constants.TF_ESTIMATOR
    else:
      return constants.TF_GENERIC
  signature_name = None
  if model_spec:
    if model_spec.signature_name:
      signature_name = model_spec.signature_name
    else:
      signature_name = tf.saved_model.DEFAULT_SERVING_SIGNATURE_DEF_KEY
  if signature_name == eval_constants.EVAL_TAG:
    return constants.TF_ESTIMATOR
  else:
    return constants.TF_GENERIC
Because of this function, the loaded model stays in GPU memory and takes up a large amount of it. When the evaluator then proceeds to the actual evaluation, a second load starts and causes an OOM error.
Describe the expected behavior
When get_model_type finishes, the GPU memory should be released, or the loaded model should be passed on to the following steps instead of being loaded again.
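One way to read the second option is a load-once cache: the first caller pays for the load and later callers receive the same object. The sketch below only illustrates that idea and is not TFMA code. `load_model_once`, `_fake_load`, and the model path are hypothetical stand-ins; a real loader such as tf.keras.models.load_model would take the place of `_fake_load`.

```python
import functools

# Counts how many times the (stand-in) expensive load actually runs.
load_calls = {"count": 0}


def _fake_load(model_path):
    """Hypothetical stand-in for an expensive model loader."""
    load_calls["count"] += 1
    return {"path": model_path, "weights": [0.0] * 4}


@functools.lru_cache(maxsize=None)
def load_model_once(model_path):
    """Load the model for a path once and hand the same object to every caller."""
    return _fake_load(model_path)


# The type check and the evaluation step both ask for the model...
model_for_type_check = load_model_once("/tmp/hypothetical_model")
model_for_evaluation = load_model_once("/tmp/hypothetical_model")

# ...but only one load happens, and both names refer to the same object.
print(load_calls["count"])
print(model_for_type_check is model_for_evaluation)
```

A cache like this only helps within a single Python process; it does nothing for loads that happen in separate worker processes.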
Standalone code to reproduce the issue Providing a bare minimum test case or
step(s) to reproduce the problem will greatly help us to debug the issue. If
possible, please share a link to Colab/Jupyter/any notebook.
Just try a large pretrained model such as multilingual BERT and put it into a pipeline with an Evaluator.
Name of your Organization (Optional)
Other info / logs Include any logs or source code that would be helpful to
diagnose the problem. If including tracebacks, please include the full
traceback. Large logs and files should be attached.
""
RuntimeError: Traceback (most recent call last):
File "apache_beam/runners/common.py", line 1280, in apache_beam.runners.common.DoFnRunner._invoke_lifecycle_method
File "apache_beam/runners/common.py", line 516, in apache_beam.runners.common.DoFnInvoker.invoke_setup
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow_model_analysis/model_util.py", line 758, in setup
super(ModelSignaturesDoFn, self).setup()
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow_model_analysis/model_util.py", line 574, in setup
model_load_time_callback=self._set_model_load_seconds)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow_model_analysis/types.py", line 160, in load
return self._shared_handle.acquire(construct_fn)
File "/home/luban/.local/lib/python3.6/site-packages/tfx_bsl/beam/shared.py", line 238, in acquire
return _shared_map.acquire(self._key, constructor_fn, tag)
File "/home/luban/.local/lib/python3.6/site-packages/tfx_bsl/beam/shared.py", line 194, in acquire
result = control_block.acquire(constructor_fn, tag)
File "/home/luban/.local/lib/python3.6/site-packages/tfx_bsl/beam/shared.py", line 89, in acquire
result = constructor_fn()
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow_model_analysis/types.py", line 169, in with_load_times
model = self.construct_fn()
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow_model_analysis/model_util.py", line 550, in construct_fn
model = tf.compat.v1.saved_model.load_v2(eval_saved_model_path, tags=tags)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 603, in load
return load_internal(export_dir, tags, options)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 633, in load_internal
ckpt_options)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 130, in __init__
self._load_all()
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 141, in _load_all
self._load_nodes()
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 296, in _load_nodes
slot_name=slot_variable_proto.slot_name)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py", line 764, in add_slot
initial_value=initial_value)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 262, in __call__
return cls._variable_v2_call(*args, **kwargs)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 256, in _variable_v2_call
shape=shape)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 237, in <lambda>
previous_getter = lambda **kws: default_variable_creator_v2(None, **kws)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/variable_scope.py", line 2646, in default_variable_creator_v2
shape=shape)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/variables.py", line 264, in __call__
return super(VariableMetaclass, cls).__call__(*args, **kwargs)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1518, in __init__
distribute_strategy=distribute_strategy)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/resource_variable_ops.py", line 1651, in _init_from_args
initial_value() if init_from_fn else initial_value,
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/keras/initializers/initializers_v2.py", line 137, in __call__
return super(Zeros, self).__call__(shape, dtype=_get_dtype(dtype))
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/init_ops_v2.py", line 132, in __call__
return array_ops.zeros(shape, dtype)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2747, in wrapped
tensor = fun(*args, **kwargs)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 2806, in zeros
output = fill(shape, constant(zero, dtype=dtype), name=name)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/util/dispatch.py", line 201, in wrapper
return target(*args, **kwargs)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/array_ops.py", line 239, in fill
result = gen_array_ops.fill(dims, value, name=name)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3402, in fill
_ops.raise_from_not_ok_status(e, name)
File "/home/luban/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 6843, in raise_from_not_ok_status
six.raise_from(core._status_to_exception(e.code, message), None)
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[119547,768] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:Fill]
""
I suggest fixing this once and for all in every function that loads the model this way, not only the one I mentioned above, including tfma/api/model_eval_lib.py:380.
I'm not sure how to control the memory in this case. The model is loaded and then thrown away which should release the memory. This is likely related to [1]. For now you can set the model_type explicitly in the ModelSpec to avoid doing any work in the get_model_type call.
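To make the workaround concrete, here is a minimal sketch of why an explicit model_type avoids the extra load. It assumes, as the comment above implies, that get_model_type returns early when model_type is set; the excerpt quoted in the issue does not show that check. `FakeModelSpec`, `get_model_type_sketch`, `record_load`, the path, and the string constants are hypothetical stand-ins, not the TFMA implementation.

```python
from dataclasses import dataclass

# Stand-in type names for illustration only.
TF_KERAS = "tf_keras"
TF_GENERIC = "tf_generic"


@dataclass
class FakeModelSpec:
    """Hypothetical stand-in for a model spec with an optional explicit type."""
    model_type: str = ""


def get_model_type_sketch(model_spec, model_path, load_fn):
    """Simplified type lookup: an explicit type wins, otherwise the model is loaded."""
    # An explicit type short-circuits everything else, so nothing is loaded.
    if model_spec is not None and model_spec.model_type:
        return model_spec.model_type
    # Otherwise fall back to loading the model just to inspect it.
    load_fn(model_path)
    return TF_GENERIC


loads = []


def record_load(path):
    """Record each load so the example can show how often loading happens."""
    loads.append(path)


explicit = get_model_type_sketch(
    FakeModelSpec(model_type=TF_KERAS), "/tmp/hypothetical_model", record_load)
inferred = get_model_type_sketch(
    FakeModelSpec(), "/tmp/hypothetical_model", record_load)

# Only the call without an explicit type triggered a load.
print(explicit, inferred, len(loads))
```

In a real pipeline the equivalent step would be setting model_type on the ModelSpec passed to the evaluator, as suggested above.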