[ Setup/examples ] Initial Installation Issues - docker compose errors #25
Comments
Hi @gaspard-met, I'll try to recreate your environment and help out. When you say you have a clearml server running on Kubernetes, how is that running locally? Microk8s? If not, when you use a web browser, can you actually go to the server via localhost?
Hello @thepycoder To reproduce:
$ minikube start --driver=docker \
--container-runtime=containerd \
--nodes=1
$ kubectl get po
NAME READY STATUS RESTARTS AGE
alertmanager-84b874c6f8-nxnqm 1/1 Running 0 18h
clearml-apiserver-7b46876f44-gpm4v 1/1 Running 3 (8d ago) 8d
clearml-elastic-master-0 1/1 Running 0 8d
clearml-fileserver-5c968587b4-2zmqx 1/1 Running 0 8d
clearml-k8sagent-5d468b6d47-269qp 1/1 Running 0 5d19h
clearml-mongodb-6b94888687-r4x7d 1/1 Running 0 8d
clearml-redis-master-0 1/1 Running 1 (7d1h ago) 8d
clearml-serving-inference-85bcf97f69-w5b2b 1/1 Running 2 (171m ago) 18h
clearml-serving-statistics-6ffb8459bc-vhktv 1/1 Running 2 (171m ago) 18h
clearml-serving-triton-666f97b8d6-k8lsd 1/1 Running 2 (171m ago) 18h
clearml-webserver-7d86c649dd-txczl 1/1 Running 0 8d
grafana-84b7f5c559-wnfdx 1/1 Running 0 18h
kafka-cb849765-7kng5 1/1 Running 0 18h
prometheus-6f5868884b-9h5h8 1/1 Running 0 18h
zookeeper-6795454fbf-gqfjh 1/1 Running 0 18h
# ClearML SDK configuration file
api {
# Notice: 'host' is the api server (default port 8008), not the web server.
api_server: http://127.0.0.1:46555
web_server: http://127.0.0.1:38063
files_server: http://127.0.0.1:42347
# Credentials are generated using the webapp, http://127.0.0.1:45595/settings
# Override with os environment: CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY
credentials {"access_key":
...... which I am sure is correct cuz
$ clearml-serving model list
clearml-serving - CLI for launching ClearML serving engine
Notice! serving service ID not provided, selecting the first active service
List model serving and endpoints, control task id=1f38787f5f7a4ab6b860532369f0aa57
Info: syncing model endpoint configuration, state hash=253e8350252883f7e599572903a5cf63
Endpoints:
{
"test_pytorch_mnist/1": {
"engine_type": "triton",
"serving_url": "test_pytorch_mnist",
"model_id": "3ed0f8563b56482eb9726230f1171ef1",
"version": "1",
"preprocess_artifact": "py_code_test_pytorch_mnist_1",
"input_size": [
1,
28,
28
],
"input_type": "float32",
"input_name": "INPUT__0",
"output_size": [
-1,
10
],
"output_type": "float32",
"output_name": "OUTPUT__0",
"auxiliary_cfg": null
}
}
Model Monitoring:
{}
Canary:
{}
CLEARML_SERVING_TASK_ID=ClearML Serving Task ID
CLEARML_SERVING_PORT=8080
CLEARML_USE_GUNICORN=true
CLEARML_EXTRA_PYTHON_PACKAGES=
CLEARML_SERVING_NUM_PROCESS=2
CLEARML_SERVING_POLL_FREQ=1.0
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
WEB_CONCURRENCY=
SERVING_PORT=8080
GUNICORN_NUM_PROCESS=2
GUNICORN_SERVING_TIMEOUT=
GUNICORN_MAX_REQUESTS=0
GUNICORN_EXTRA_ARGS=
UVICORN_SERVE_LOOP=asyncio
UVICORN_EXTRA_ARGS=
UVICORN_LOG_LEVEL=warning
CLEARML_DEFAULT_BASE_SERVE_URL=http://127.0.0.1:8080/serve
CLEARML_DEFAULT_TRITON_GRPC_ADDR=clearml-serving-triton:8001
Starting Gunicorn server
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9b57d87610>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9aefd01760>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
Retrying (Retry(total=237, connect=237, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f9aefc207f0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /auth.login
......
......
else
echo "Starting Gunicorn server"
# start service
PYTHONPATH=$(pwd) python3 -m gunicorn \
--preload clearml_serving.serving.main:app \
--workers $GUNICORN_NUM_PROCESS \
--worker-class uvicorn.workers.UvicornWorker \
--max-requests $GUNICORN_MAX_REQUESTS \
--timeout $GUNICORN_SERVING_TIMEOUT \
--bind 0.0.0.0:$SERVING_PORT \
$GUNICORN_EXTRA_ARGS
fi
Seems this gunicorn app failed to communicate with something. Thanks for any help! :)
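The retry lines above report a name-resolution failure rather than a refused connection, so one quick check is whether the host part of the configured API URL resolves at all from inside the failing pod. The following is a minimal sketch using only the Python standard library; the helper name `resolve_host` is hypothetical, and the URL is simply the `api_server` value from the config quoted above.

```python
import socket
from urllib.parse import urlparse


def resolve_host(url):
    """Return the sorted IP addresses the URL's host resolves to, or None on failure."""
    parsed = urlparse(url)
    host = parsed.hostname
    if host is None:
        return None
    port = parsed.port or (443 if parsed.scheme == "https" else 80)
    try:
        # getaddrinfo is the call that raises when a name cannot be resolved.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return None
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Run inside the clearml-serving-inference pod with the API URL it was given.
    print(resolve_host("http://127.0.0.1:46555"))
```

A `None` result for the API URL the pod actually uses would point at the hostname itself (or the cluster DNS), not at gunicorn.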
Thanks for the detailed writeup @Muscle-Oliver! So I've taken a look and it seems like a specific parameter is missing from the helm chart. In order to set this IP address, you'll have to edit the following parameter in the serving docker-compose yaml file:
But it seems that the particular env var
Thanks for the quick reply @thepycoder! May I ask what problem it suggests by
And, I'm just wondering whether this
Thanks for any further update :) ☕
Hi, I just opened a PR that references this issue. Could you please check it and let me know if this is what you are expecting?
Since that change is not breaking, I just merged the PR and released clearml-serving-0.4.0.
Thanks for the update @valeriano-manassero! So, I also tried
The log of pod
Retrying (Retry(total=234, connect=234, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ff3de505880>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login
which means gunicorn failed to start? (Or it did, but is not working as expected.) Then I
root@clearml-serving-inference-85bcf97f69-9jsdh:~/clearml# sh clearml_serving/serving/entrypoint.sh
CLEARML_SERVING_TASK_ID=ClearML Serving Task ID
CLEARML_SERVING_PORT=8080
CLEARML_USE_GUNICORN=true
EXTRA_PYTHON_PACKAGES=
CLEARML_SERVING_NUM_PROCESS=2
CLEARML_SERVING_POLL_FREQ=1.0
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
CLEARML_DEFAULT_KAFKA_SERVE_URL=clearml-serving-kafka:9092
WEB_CONCURRENCY=
SERVING_PORT=8080
GUNICORN_NUM_PROCESS=2
GUNICORN_SERVING_TIMEOUT=
GUNICORN_EXTRA_ARGS=
UVICORN_SERVE_LOOP=asyncio
UVICORN_EXTRA_ARGS=
CLEARML_DEFAULT_BASE_SERVE_URL=http://127.0.0.1:8080/serve
CLEARML_DEFAULT_TRITON_GRPC_ADDR=clearml-serving-triton:8001
Starting Gunicorn server
Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5bfe7e4610>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login
Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5b984f2520>: Failed to establish a new connection: [Errno -2] Name or service not known')': /auth.login
^CTraceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
conn = connection.create_connection(
File "/usr/local/lib/python3.9/site-packages/urllib3/util/connection.py", line 72, in create_connection
for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
File "/usr/local/lib/python3.9/socket.py", line 954, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 703, in urlopen
httplib_response = self._make_request(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 398, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
super(HTTPConnection, self).request(method, url, body=body, headers=headers)
File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
self.send(msg)
File "/usr/local/lib/python3.9/http/client.py", line 980, in send
self.connect()
File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
conn = self._new_conn()
File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f5b984f27f0>: Failed to establish a new connection: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.9/site-packages/gunicorn/__main__.py", line 7, in <module>
run()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 67, in run
WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 231, in run
super().run()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 72, in run
Arbiter(self).run()
File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 58, in __init__
self.setup(app)
File "/usr/local/lib/python3.9/site-packages/gunicorn/arbiter.py", line 118, in setup
self.app.wsgi()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/base.py", line 67, in wsgi
self.callable = self.load()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 58, in load
return self.load_wsgiapp()
File "/usr/local/lib/python3.9/site-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
return util.import_app(self.app_uri)
File "/usr/local/lib/python3.9/site-packages/gunicorn/util.py", line 359, in import_app
mod = importlib.import_module(module)
File "/usr/local/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "<frozen importlib._bootstrap_external>", line 850, in exec_module
File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
File "/root/clearml/clearml_serving/serving/main.py", line 49, in <module>
serving_task = ModelRequestProcessor._get_control_plane_task(task_id=serving_service_task_id)
File "/root/clearml/clearml_serving/serving/model_request_processor.py", line 1094, in _get_control_plane_task
task = Task.get_task(task_id=task_id)
File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 796, in get_task
return cls.__get_task(
File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 3523, in __get_task
return cls(private=cls.__create_protection, task_id=task_id, log_to_backend=False)
File "/usr/local/lib/python3.9/site-packages/clearml/task.py", line 169, in __init__
super(Task, self).__init__(**kwargs)
File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/task/task.py", line 152, in __init__
super(Task, self).__init__(id=task_id, session=session, log=log)
File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 145, in __init__
super(IdObjectBase, self).__init__(session, log, **kwargs)
File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 39, in __init__
self._session = session or self._get_default_session()
File "/usr/local/lib/python3.9/site-packages/clearml/backend_interface/base.py", line 115, in _get_default_session
InterfaceBase._default_session = Session(
File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 207, in __init__
self.refresh_token()
File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/token_manager.py", line 112, in refresh_token
self._set_token(self._do_refresh_token(self.__token, exp=self.req_token_expiration_sec))
File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 736, in _do_refresh_token
res = self._send_request(
File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/session/session.py", line 358, in _send_request
res = self.__http_session.request(
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python3.9/site-packages/clearml/backend_api/utils.py", line 85, in send
return super(SessionWithTimeout, self).send(request, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
resp = conn.urlopen(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 813, in urlopen
return self.urlopen(
File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 788, in urlopen
retries.sleep()
File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 432, in sleep
self._sleep_backoff()
File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 416, in _sleep_backoff
time.sleep(backoff)
KeyboardInterrupt
Will this help to explain the connection error? Or does
Thank you for any reply!
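Both error variants in these logs ([Errno -3] and [Errno -2]) are raised from `socket.getaddrinfo`, as the traceback shows, and both occur on the `/auth.login` request that the ClearML session makes while `main.py` is being imported. To compare many retry lines at a glance, a small parser can pull out the fields that matter. This is a sketch under the assumption that the lines keep the urllib3 format shown above; `parse_retry_line` and the regular expression are illustrative, not part of clearml or urllib3.

```python
import re

# Matches the informative parts of a urllib3 retry line like the ones logged above.
RETRY_LINE = re.compile(
    r"Retry\(total=(?P<remaining>\d+).*?"
    r"\[Errno (?P<errno>-?\d+)\] (?P<reason>[^')]+)'\)': (?P<path>\S+)"
)


def parse_retry_line(line):
    """Extract remaining retries, errno, reason and request path, or None if no match."""
    match = RETRY_LINE.search(line)
    if match is None:
        return None
    return {
        "remaining": int(match.group("remaining")),
        "errno": int(match.group("errno")),
        "reason": match.group("reason"),
        "path": match.group("path"),
    }


if __name__ == "__main__":
    sample = (
        "Retrying (Retry(total=238, connect=238, read=240, redirect=240, status=240)) "
        "after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection "
        "object at 0x7f5b984f2520>: Failed to establish a new connection: "
        "[Errno -2] Name or service not known')': /auth.login"
    )
    print(parse_retry_line(sample))
```

If every parsed line carries a resolver errno and the same path, the server process never got past authenticating against the API server, which is consistent with the hostname in the configuration being the thing to check.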
Just a question: I see
Thanks for the helm update. How can I get this
I can infer from the logs that the startup problem may result from some sort of connection error, but I have no idea where exactly gunicorn was connecting to. As the
#!/bin/bash
# print configuration
echo CLEARML_SERVING_TASK_ID="$CLEARML_SERVING_TASK_ID"
echo CLEARML_SERVING_PORT="$CLEARML_SERVING_PORT"
echo CLEARML_USE_GUNICORN="$CLEARML_USE_GUNICORN"
echo EXTRA_PYTHON_PACKAGES="$EXTRA_PYTHON_PACKAGES"
echo CLEARML_SERVING_NUM_PROCESS="$CLEARML_SERVING_NUM_PROCESS"
echo CLEARML_SERVING_POLL_FREQ="$CLEARML_SERVING_POLL_FREQ"
echo CLEARML_DEFAULT_KAFKA_SERVE_URL="$CLEARML_DEFAULT_KAFKA_SERVE_URL"
echo CLEARML_DEFAULT_KAFKA_SERVE_URL="$CLEARML_DEFAULT_KAFKA_SERVE_URL"
SERVING_PORT="${CLEARML_SERVING_PORT:-8080}"
GUNICORN_NUM_PROCESS="${CLEARML_SERVING_NUM_PROCESS:-4}"
GUNICORN_SERVING_TIMEOUT="${GUNICORN_SERVING_TIMEOUT:-600}"
UVICORN_SERVE_LOOP="${UVICORN_SERVE_LOOP:-asyncio}"
# set default internal serve endpoint (for request pipelining)
CLEARML_DEFAULT_BASE_SERVE_URL="${CLEARML_DEFAULT_BASE_SERVE_URL:-http://127.0.0.1:$SERVING_PORT/serve}"
CLEARML_DEFAULT_TRITON_GRPC_ADDR="${CLEARML_DEFAULT_TRITON_GRPC_ADDR:-127.0.0.1:8001}"
# print configuration
echo WEB_CONCURRENCY="$WEB_CONCURRENCY"
echo SERVING_PORT="$SERVING_PORT"
echo GUNICORN_NUM_PROCESS="$GUNICORN_NUM_PROCESS"
echo GUNICORN_SERVING_TIMEOUT="$GUNICORN_SERVING_PORT"
echo GUNICORN_EXTRA_ARGS="$GUNICORN_EXTRA_ARGS"
echo UVICORN_SERVE_LOOP="$UVICORN_SERVE_LOOP"
echo UVICORN_EXTRA_ARGS="$UVICORN_EXTRA_ARGS"
echo CLEARML_DEFAULT_BASE_SERVE_URL="$CLEARML_DEFAULT_BASE_SERVE_URL"
echo CLEARML_DEFAULT_TRITON_GRPC_ADDR="$CLEARML_DEFAULT_TRITON_GRPC_ADDR"
# runtime add extra python packages
if [ ! -z "$EXTRA_PYTHON_PACKAGES" ]
then
python3 -m pip install $EXTRA_PYTHON_PACKAGES
fi
if [ -z "$CLEARML_USE_GUNICORN" ]
then
echo "Starting Uvicorn server"
PYTHONPATH=$(pwd) python3 -m uvicorn \
clearml_serving.serving.main:app --host 0.0.0.0 --port $SERVING_PORT --loop $UVICORN_SERVE_LOOP \
$UVICORN_EXTRA_ARGS
else
echo "Starting Gunicorn server"
# start service
PYTHONPATH=$(pwd) python3 -m gunicorn \
--preload clearml_serving.serving.main:app \
--workers $GUNICORN_NUM_PROCESS \
--worker-class uvicorn.workers.UvicornWorker \
--timeout $GUNICORN_SERVING_TIMEOUT \
--bind 0.0.0.0:$SERVING_PORT \
$GUNICORN_EXTRA_ARGS
fi
Maybe we can clear this up by reproducing the startup process of gunicorn in the
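To see which values the script above ends up with, the `${NAME:-default}` assignments can be mirrored in a few lines of Python. This is only a sketch of the defaulting logic in the quoted script; the function names `shell_default` and `resolve_entrypoint_settings` are hypothetical.

```python
def shell_default(env, name, default):
    """Mimic the shell's ${NAME:-default}: unset or empty falls back to the default."""
    value = env.get(name)
    return value if value else default


def resolve_entrypoint_settings(env):
    """Reproduce the variable defaults assigned near the top of the entrypoint script."""
    serving_port = shell_default(env, "CLEARML_SERVING_PORT", "8080")
    return {
        "SERVING_PORT": serving_port,
        "GUNICORN_NUM_PROCESS": shell_default(env, "CLEARML_SERVING_NUM_PROCESS", "4"),
        "GUNICORN_SERVING_TIMEOUT": shell_default(env, "GUNICORN_SERVING_TIMEOUT", "600"),
        "UVICORN_SERVE_LOOP": shell_default(env, "UVICORN_SERVE_LOOP", "asyncio"),
        "CLEARML_DEFAULT_BASE_SERVE_URL": shell_default(
            env, "CLEARML_DEFAULT_BASE_SERVE_URL", f"http://127.0.0.1:{serving_port}/serve"
        ),
        "CLEARML_DEFAULT_TRITON_GRPC_ADDR": shell_default(
            env, "CLEARML_DEFAULT_TRITON_GRPC_ADDR", "127.0.0.1:8001"
        ),
    }


if __name__ == "__main__":
    # Values taken from the pod output printed earlier in this thread.
    pod_env = {
        "CLEARML_SERVING_PORT": "8080",
        "CLEARML_SERVING_NUM_PROCESS": "2",
        "CLEARML_DEFAULT_TRITON_GRPC_ADDR": "clearml-serving-triton:8001",
    }
    for key, value in resolve_entrypoint_settings(pod_env).items():
        print(f"{key}={value}")
```

Two things stand out when reading the script this way. None of these variables holds the ClearML API server address, so the script itself does not show where the `/auth.login` request goes. And the empty `GUNICORN_SERVING_TIMEOUT=` in the printed output appears to come from the echo line, which references `$GUNICORN_SERVING_PORT`; the variable itself still receives the default of 600.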
I probably found the issue:
Once you no longer get connection errors, you can connect to the inference service simply by doing a
Let me know if this helps.
@valeriano-manassero Thanks! That's it! Thanks to your reminder, I finally noticed that the configs of I
clearml:
  apiAccessKey: "ClearML API Access Key"
  apiSecretKey: "ClearML API Secret Key"
  apiHost: http://clearml-server-apiserver:8008
  filesHost: http://clearml-server-fileserver:8081
  webHost: http://clearml-server-webserver:80
  servingTaskId: "ClearML Serving Task ID"
......
where all the Host addresses don't match my current clearml services (I installed clearml via
The correct services should be (version 1.4.0):
No more connection errors!
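The mismatch described here can be checked mechanically: take the hostname of every `*Host` value in the chart configuration and compare it with the service names that actually exist in the cluster. A minimal sketch follows; `find_unknown_hosts` is a hypothetical helper, and the service names in the example are placeholders (the real ones are not shown in this thread and should come from `kubectl get svc`).

```python
from urllib.parse import urlparse


def find_unknown_hosts(values, service_names):
    """Return {key: hostname} for every *Host entry whose hostname is not a known service."""
    unknown = {}
    for key, url in values.items():
        if not key.endswith("Host"):
            continue
        hostname = urlparse(url).hostname
        if hostname not in service_names:
            unknown[key] = hostname
    return unknown


if __name__ == "__main__":
    # Host entries as they appear in the values quoted above.
    chart_values = {
        "apiHost": "http://clearml-server-apiserver:8008",
        "filesHost": "http://clearml-server-fileserver:8081",
        "webHost": "http://clearml-server-webserver:80",
    }
    # Hypothetical service names; replace with the names listed by `kubectl get svc`.
    services = {"clearml-apiserver", "clearml-fileserver", "clearml-webserver"}
    print(find_unknown_hosts(chart_values, services))
```

The comparison uses bare service names only, so a host written with a namespace suffix would be reported even if it resolves; treat the output as a list of entries to double-check rather than a verdict.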
Hello, I have the same issue with connection errors. On my Ubuntu machine:
On the same Ubuntu machine I tried:
My example.env file:
My conf file:
Hi @Mithmi, I host my main clearml server and my clearml serving server using different docker-compose files. I resolved this issue by hosting the main and serving composes under the same network. I hope this stackoverflow will help you to figure it out for your case.
Hello clearml team,
Congrats on the release of clearml-serving V2 🎉
I really wanted to check it out, and I'm having difficulties running the basic setup and scikit-learn example commands on my side.
I want to run the Installation and the Toy model (scikit learn) deployment example.
I have a self-hosted clearml Server built with the helm chart on Kubernetes.
The environment variables of clearml-serving/docker/docker-compose.yml were defined in the myexemple.env file, which starts like this:
Upon running docker-compose, both clearml-serving-inference and clearml-serving-statistics return errors:
I think the issue comes from the communication with the Kafka service, but I do not know how to solve this.
Has anyone encountered this issue and solved it before, since it's the default installation in the docs?
I haven't found any related issues on any of the GitHub repos.
Thanks for the help 🤖