tl;dr: the code works fine without ngraph; with ngraph enabled, it dies with the errors shown below.
Details:
I've been trying to get ngraph working with Google's DeepLab v3+, without any luck. The code runs inside a Docker container (the nvcr.io/nvidia/tensorflow:18.12-py3 image) on an NVIDIA DGX-2 (16 GPUs).
Versions:
Python 3.5.2 (default, Nov 12 2018, 13:43:14)
[GCC 5.4.0 20160609] on linux
TensorFlow version installed: 1.12.0 (unknown)
nGraph bridge built with: 1.12.0 (v1.12.0-0-ga6d8ffa)
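For reference, the version strings above were collected roughly like this (a hypothetical diagnostic sketch, not the exact command I ran; it just imports each module and prints its `__version__`, degrading gracefully if a module is missing):

```python
# Hypothetical version-check helper: imports each module by name and reports
# its __version__ (or "not installed" / "unknown" when unavailable).
import importlib


def report_versions(mod_names=("tensorflow", "ngraph_bridge")):
    """Return one 'name: version' line per requested module."""
    lines = []
    for name in mod_names:
        try:
            mod = importlib.import_module(name)
            lines.append("%s: %s" % (name, getattr(mod, "__version__", "unknown")))
        except ImportError:
            lines.append("%s: not installed" % name)
    return lines


if __name__ == "__main__":
    for line in report_versions():
        print(line)
```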
The docker container was started with the following command line:
nvidia-docker run -it \
  --rm \
  --shm-size=1g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  --privileged=true \
  -v /raid/wingated:/raid/wingated \
  -v /home/wingated:/home/wingated \
  -v /mnt/pccfs:/mnt/pccfs \
  nvcr.io/nvidia/tensorflow:18.12-py3
Here are the errors. I have no idea how to diagnose this. :)
[snip]
INFO:tensorflow:Restoring parameters from /raid/wingated/cancer/deeplab_data/init_models/xception/model.ckpt
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /raid/wingated/cancer/deeplab_data/logs/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:Error reported to Coordinator: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 297, in stop_on_exception
yield
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/coordinator.py", line 495, in run
self.run_loop()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/supervisor.py", line 1034, in run_loop
self._sv.global_step])
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Node ConstantFolding/clone_1/scaled_clone_loss_recip in cluster 1064 has assigned device /job:localhost/replica:0/task:0/device:GPU:1 but another node with assigned device /job:localhost/replica:0/task:0/device:CPU:0 has already been seen in the same cluster