[Feedback] Feedback for ray + uv #50961

Open
cszhu opened this issue Feb 27, 2025 · 20 comments
@cszhu

cszhu commented Feb 27, 2025

Hello everyone! As of Ray 2.43.0, we have launched a new integration with uv run that we are super excited to share with you all. This will serve as the main GitHub issue for tracking any issues or feedback you might have while using it.

Please share any success stories, configs, or just cool discoveries that you might have while running uv + Ray! We are excited to hear from you.

To read more about uv + Ray, check out our new blog post here.
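If you want a quick taste of the integration before reading the post, here is a minimal sketch of the per-task flavor, using the uv field of runtime_env (the emoji package is just an illustrative dependency):

import ray

ray.init()

# Declare per-task dependencies that uv resolves into the worker's environment.
# The package list here is only illustrative.
@ray.remote(runtime_env={"uv": ["emoji"]})
def hello():
    import emoji
    return emoji.emojize("uv + Ray :thumbs_up:")

print(ray.get(hello.remote()))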

pcmoritz pinned this issue Feb 27, 2025
@cabreraalex

Hey y'all! It would be great to have more formal docs or a guide for getting this working, beyond the blog post. I don't know how to use our current JobConfig + anyscale.job.submit workflow with this new method.

@pcmoritz
Contributor

pcmoritz commented Mar 5, 2025

@cabreraalex Thanks for your feedback! I'm currently working on the anyscale.job.submit workflow and will update here once that's deployed. And yes, you're right, we also need to work on more formal docs 👍

@krzysztof-gre

Hi. The Ray Docker image does not include the uv binary, which blocks using the new feature in a containerized setup.

docker run --rm -it rayproject/ray:2.43.0 sh
$ uv
sh: 1: uv: not found

@pcmoritz
Contributor

pcmoritz commented Mar 7, 2025

@cabreraalex In the latest release 0.26.4 of the anyscale CLI (https://pypi.org/project/anyscale/), py_executable support is now implemented for JobConfig and the job submit workflow. You need a cluster image that has uv installed and also unsets RAY_RUNTIME_ENV_HOOK (that's a wrinkle we'd like to remove going forward), for example:

FROM anyscale/ray:2.43.0-slim-py312-cu125

RUN curl -LsSf https://astral.sh/uv/install.sh | sh
RUN echo "unset RAY_RUNTIME_ENV_HOOK" >> /home/ray/.bashrc

You can then use it, for example, as follows: create a working_dir with the following files:

main.py

import ray

@ray.remote
def f():
    import emoji
    return emoji.emojize("Python rocks :thumbs_up:")

print(ray.get(f.remote()))

pyproject.toml

[project]
name = "test"
version = "0.1"
dependencies = ["emoji", "ray"]

job.yaml

name: test-uv
image_uri: <your image here>
working_dir: .
py_executable: "uv run"
entrypoint: uv run main.py
# If there is an error, do not retry.
max_retries: 0

Then submit your job with anyscale job submit -f job.yaml. Instead of using a YAML file, you can also submit it via the SDK:

import anyscale
from anyscale.job.models import JobConfig

config = JobConfig(
    name="my-job",
    entrypoint="uv run main.py",
    working_dir=".",
    max_retries=0,
    image_uri="<your image here>",
    py_executable="uv run",
)

anyscale.job.submit(config)

@cabreraalex

Fantastic, will test it out, thanks!

@hongbo-miao

Just found this ticket. I opened two tickets related to uv.

@schmidt-ai

I'm also getting a uv: not found in the raylet logs of a ray cluster I started on EC2. Do I need to follow the same steps (unset RAY_RUNTIME_ENV_HOOK + install uv)? Do I need to do this on a) the head node, b) the worker nodes, or both? I'm using the rayproject/ray:2.43.0-py312-cpu image.

@sveint

sveint commented Mar 14, 2025

I have some issues using uv on a remote cluster:

#51368

@JettScythe

JettScythe commented Mar 14, 2025

Currently in the process of trying to use this.
I have an ingress that handles multiple downstream models, with quite a few deps.
When bringing up the cluster I use:

setup_commands:
  - sudo apt-get update -y && sudo apt install -y espeak-ng espeak-ng-data libespeak-ng1 libpcaudio0 libportaudio2 libpq-dev
  - curl -LsSf https://astral.sh/uv/install.sh | sh  # Install uv
  - echo 'export RAY_RUNTIME_ENV_HOOK=ray._private.runtime_env.uv_runtime_env_hook.hook' >> ~/.bashrc
  - pip install ray[all]==2.43.0

# Command to start ray on the head node.
head_start_ray_commands:
  - ray stop
  - >-
    RAY_health_check_initial_delay_ms=999999999999999999
    ray start
    --head
    --port=6379
    --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes.
worker_start_ray_commands:
  - ray stop
  - >-
    RAY_health_check_initial_delay_ms=999999999999999999
    ray start
    --address=$RAY_HEAD_IP:6379
    --object-manager-port=8076

and I start Serve with uv run --verbose serve run deployments.ingress:router

Unfortunately, it seems to spend too much time redownloading/building deps, and Ray eventually either decides to restart the raylet (causing an endless loop) or force-kills the worker, causing it to crash.
The RAY_health_check_initial_delay_ms=999999999999999999 setting was an attempt to work around that, but no luck so far.

@terrykong

terrykong commented Mar 17, 2025

Thanks for writing such an awesome feature!

I'm giving it a try in an application with CUDA/torch dependencies. With 2 workers the application starts in around 1-2 minutes, but when I scale to 8 workers it takes much longer, and I see a lot of these messages:

(raylet, ip=W.X.Y.Z) [2025-03-16 19:32:22,168 E 4106941 4106941] (raylet) worker_pool.cc:581: Some workers of the worker process(4114829) have not registered within the timeout. The process is still alive, probably it's hanging during start.

Is there any way to debug what's going on or why it's taking so long for the other processes to start? Is it possible that the cache isn't working as expected? (edit: turns out I just needed to propagate UV_CACHE_DIR to all the workers)

Also, as general feedback: could the Ray logger report back on the driver the dependencies that each worker ended up using, given that the workers can each run with a different py_executable?
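For reference, a minimal sketch of the UV_CACHE_DIR fix mentioned in the edit above, assuming a cache directory that exists and is fast on every node (the path below is hypothetical):

import ray

# Point every worker's uv at the same cache directory so wheels only get
# downloaded/built once per node. The path is a hypothetical node-local dir.
ray.init(
    runtime_env={
        "env_vars": {"UV_CACHE_DIR": "/mnt/local_nvme/uv-cache"},
    }
)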

@FredrikNoren

FredrikNoren commented Mar 17, 2025

Would it be possible to show the output of uv when submitting a job? Currently I'm just seeing:

uv run cli.py cluster show_deps
2025-03-17 11:28:33,592 INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_48f447cb6ecbfe4b.zip.
2025-03-17 11:28:33,594 INFO packaging.py:575 -- Creating a file package for local module '.'.
Job submitted with ID show_deps--2025-03-17_11-28-33
Job submission server address: http://localhost:8265

and it's been stuck like that for a while; not sure if it's doing anything or if it's working on installing dependencies.

EDIT: it was taking a long time because I tried to be fancy and set UV_CACHE_DIR to an EFS dir, which was really slow. I removed that and it was much faster.

@terrykong

terrykong commented Mar 18, 2025

I've also noticed that when working_dir is large due to random untracked files in my Python project repo (close to the 100 MiB limit), the uv pip install on each worker can be very slow (presumably it's copying the files for the install?). I can always add these files to .gitignore to bring the size down, but that seems cumbersome. Could there be a feature to ensure that only tracked files get uploaded? Or perhaps make the maximum working_dir size configurable to something much smaller, so library maintainers can clamp down on these large uploads and give users helpful error messages?
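In the meantime, one workaround sketch: the excludes field of runtime_env takes .gitignore-style patterns relative to working_dir, so the large untracked files can be filtered out of the upload without touching .gitignore (the patterns below are just examples):

import ray

# Keep large untracked artifacts out of the working_dir upload.
# excludes uses .gitignore-style patterns; the ones here are illustrative.
ray.init(
    runtime_env={
        "working_dir": ".",
        "excludes": ["data/", "checkpoints/", "wandb/", "*.ckpt"],
    }
)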

@FredrikNoren

FredrikNoren commented Mar 19, 2025

@pcmoritz Hm, it seems like it's re-installing all dependencies each time a worker is invoked. Is there some way to avoid that? I'm also getting errors that it's failing, like:

(raylet, ip=172.31.81.185) error: failed to remove directory `/tmp/ray/session_2025-03-18_01-47-01_879621_311/runtime_resources/working_dir_files/_ray_pkg_686cd24b492b1031/.venv/lib/python3.12/site-packages/transformers/utils`: No such file or directory (os error 2)
(raylet, ip=172.31.81.185) Uninstalled 1 package in 137ms [repeated 3x across cluster]
(raylet, ip=172.31.81.185) Installed 30 packages in 3.41s [repeated 4x across cluster]

EDIT: I added "env_vars": { "UV_PROJECT_ENVIRONMENT": "/home/ray/uv_shared" } to my runtime_env, which should make all of the actors use the same venv. It seems to be working better, but I'm not sure it's the right solution.
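Concretely, the change was roughly this (a sketch; /home/ray/uv_shared is just the path I picked):

import ray

# Make uv install into one fixed project environment shared by the actors
# on a node, instead of each worker resolving into its own .venv.
ray.init(
    runtime_env={
        "env_vars": {"UV_PROJECT_ENVIRONMENT": "/home/ray/uv_shared"},
    }
)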

@b-phi

b-phi commented Mar 19, 2025

When using the following

runtime_env = {
    'uv': 'requirements.txt',
}
ray.init('address', runtime_env=runtime_env)

I get this error, caused by an overly long file name:

error: failed to read from file [/tmp/ray/session_2025-03-19_12-12-38_586279_1/runtime_resources/uv/b4087477499d4d98f60ddd904f5146a19992f52e/exec_cwd/eyJ2ZXIiOjEsImlzdSI6MTc0MjQxMTg0NywiZW5jIjoiQTEyOEdDTSIsInRhZyI6ImJlWnZVWHI2RGFISXRhV3J2b2JlQmciLCJleHAiOjE3NDI0NTUwNDcsImFsZyI6IkExMjhHQ01LVyIsIml2IjoiSlVNTmxaemw1bkN0UklsZyJ9.vZ8egZQlA_pVoXh_Sjh4eQ.ZQdf-SopXIJ4q6Fe.Mu6xPuPr-PG21tdRV8aLNcspyWv0gtiH-dGhmUiKqaGtY5WzXr9Qs4bB5wyDK9b_bkT7LlZopfs8eeli15VXF_vgJ_WVbIfwdpKW-lJ4YUB2yrYCG9TTVXi5aAZgCSQe5H7tt1AZX1la-DWhNc3XEpSO-QSwwkZNnl70oVCex8W3nkWgXBkQkcQq4lhxvDJFBFupjZ9gLOr-Q4aY915RSTQGZzAtvkMIk77S7v1s3Omepb6N2wKZ2w95JsBG3wniHNrLp9zadWLxclWAQXlTAkLFMtmIEtLbqdKODL4X7Df5FJRIJo2Q8E5grrtgGSbF-awXkEzjD_7YB6-AYZ4s6zWcYQr5ckJXWufmsP1zURF8LLPZksN0THUiGZq2SXO_qXN1ebg_7o_IkOEmh2msxmvOE90XPJWcqvlWEBiwmeElEgwsZ_Qj34p8Onqo-Y_vWN4ZXmyzzmFX-lcYHxaYL1YRat4xexKPXUduG106cnEvxuH-FEwD9vHTnW4-F0_lX-45KOenCbCL9x8NCdrSTpssmfi7SmUs-wO6MHPqyE4CvVwXUtufcepP01CiHv6vfetC0EsOmMUsV77KOWbYL5qu56mrAoPMecN27VlQtLT31FZKyKLHiGM5ng1wtk6vNKKGtQ6azDEy1eQBaPKwMjf2xkwxyRCZhZ8CMNnyj94V-Q94pwdAjX5RYyVpV4nRVttrT492E3c_nyfabMOXDWuIWZlh5TS1fQWDAdQGFa6JHhgYv8WDdMpstorxOfj9VMtuWT-tOLlMa2Ai1Mmvv3UaPL3bG9W9m7nxU8a6HBR9Olv2IVQsiMt1RJ-JvTIHWFMkdpWMU2L0XMN4Pln-Bs6qxlpcbjge0jl6-VgwNFYIil4mpEGizNN6Y_a_u6vz-FWbQbhBKa7rWY_6ZE7IS2zpt1nKjMmnKC_Dt1x1w9Bu1Y_KF6g32KhS-a1a-FfnZncL8UFOlUQ3ONF8IPy_pcme_gP5MsjHWL2xUIrT0Hoc9Fw0NOEePVg052uvnpYCxQ8mmYig12mNRED7B6-CMCtFcBOIGgrwLc64LsUm3AeDh4VCJmNPbaags_xyEfGFObOn93a66FX7A4LFi59E0O6uzvydcWuiWdxU7V_EyvCDHaYZqvNqi88T03iPkVmHo-G4Up-4Zg8SkVvoUnCqWlyqCphTVP3QMeQfkPe2PYfPJzRHziNmlA7Fo-ztlCilhJ0d-LxK1i8xNFjf4jr7iSyqUyteFByGcWvBWu4pGE9pLWdiibeN97-STtL709ew9xH8k3j0AtbdWOfs5AYAn-Kz0kW4_0zXHt9GksyBRPVMaMF_I02BKA4.VsHV98LbeBm6HkzYQSC3uQ/apprise/index.html](http://127.0.0.1:3999/tmp/ray/session_2025-03-19_12-12-38_586279_1/runtime_resources/uv/b4087477499d4d98f60ddd904f5146a19992f52e/exec_cwd/eyJ2ZXIiOjEsImlzdSI6MTc0MjQxMTg0NywiZW5jIjoiQTEyOEdDTSIsInRhZyI6ImJlWnZVWHI2RGFISXRhV3J2b2JlQmciLCJleHAiOjE3NDI0NTUwNDcsImFsZyI6IkExMjhHQ01LVyIsIml2IjoiSlVNTmxaemw1bkN0UklsZyJ9.vZ8egZQlA_pVoXh_Sjh4eQ.ZQdf-SopXIJ4q6Fe.Mu6xPuPr-PG21tdRV8aLNcspyWv0gtiH-dGhmUiKqaGtY5WzXr9Qs4bB5wyDK9b_bkT7LlZopfs8eeli15VXF_vgJ_WVbIfwdpKW-lJ4YUB2yrYCG9TTVXi5aAZgCSQe5H7tt1AZX1la-DWhNc3XEpSO-QSwwkZNnl70oVCex8W3nkWgXBkQkcQq4lhxvDJFBFupjZ9gLOr-Q4aY915RSTQGZzAtvkMIk77S7v1s3Omepb6N2wKZ2w95JsBG3wniHNrLp9zadWLxclWAQXlTAkLFMtmIEtLbqdKODL4X7Df5FJRIJo2Q8E5grrtgGSbF-awXkEzjD_7YB6-AYZ4s6zWcYQr5ckJXWufmsP1zURF8LLPZksN0THUiGZq2SXO_qXN1ebg_7o_IkOEmh2msxmvOE90XPJWcqvlWEBiwmeElEgwsZ_Qj34p8Onqo-Y_vWN4ZXmyzzmFX-lcYHxaYL1YRat4xexKPXUduG106cnEvxuH-FEwD9vHTnW4-F0_lX-45KOenCbCL9x8NCdrSTpssmfi7SmUs-wO6MHPqyE4CvVwXUtufcepP01CiHv6vfetC0EsOmMUsV77KOWbYL5qu56mrAoPMecN27VlQtLT31FZKyKLHiGM5ng1wtk6vNKKGtQ6azDEy1eQBaPKwMjf2xkwxyRCZhZ8CMNnyj94V-Q94pwdAjX5RYyVpV4nRVttrT492E3c_nyfabMOXDWuIWZlh5TS1fQWDAdQGFa6JHhgYv8WDdMpstorxOfj9VMtuWT-tOLlMa2Ai1Mmvv3UaPL3bG9W9m7nxU8a6HBR9Olv2IVQsiMt1RJ-JvTIHWFMkdpWMU2L0XMN4Pln-Bs6qxlpcbjge0jl6-VgwNFYIil4mpEGizNN6Y_a_u6vz-FWbQbhBKa7rWY_6ZE7IS2zpt1nKjMmnKC_Dt1x1w9Bu1Y_KF6g32KhS-a1a-FfnZncL8UFOlUQ3ONF8IPy_pcme_gP5MsjHWL2xUIrT0Hoc9Fw0NOEePVg052uvnpYCxQ8mmYig12mNRED7B6-CMCtFcBOIGgrwLc64LsUm3AeDh4VCJmNPbaags_xyEfGFObOn93a66FX7A4LFi59E0O6uzvydcWuiWdxU7V_EyvCDHaYZqvNqi88T03iPkVmHo-G4Up-4Zg8SkVvoUnCqWlyqCphTVP3QMeQfkPe2PYfPJzRHziNmlA7Fo-ztlCilhJ0d-LxK1i8xNFjf4jr7iSyqUyteFByGcWvBWu4pGE9pLWdiibeN97-STtL709ew9xH8k3j0AtbdWOfs5AYAn-Kz0kW4_0zXHt9GksyBRPVMaMF_I02BKA4.VsHV98LbeBm6HkzYQSC3uQ/apprise/index.html): File name too long (os error 36)

@cszhu
Author

cszhu commented Mar 20, 2025

Hi everyone! Thanks for your feedback. We're going to start going through the issues listed.
In the meantime, I've created a uv GitHub issue tag to help triage uv-specific issues. If you decide to open a GitHub issue, I can help tag it correctly with uv. Thank you!

@d-miketa
Contributor

d-miketa commented Mar 21, 2025

Opened an issue asking for uv in anyscale/ray Docker images.
#51592

@pcmoritz
Contributor

We just released https://github.com/ray-project/ray/releases/tag/ray-2.44.0, which makes it possible to use the uv hook with job submission (#51150), so you don't need to fall back to py_executable for job submission anymore :)
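For example, something along these lines should now work (a sketch, assuming the uv hook is enabled on the cluster and the dashboard is reachable at http://127.0.0.1:8265):

from ray.job_submission import JobSubmissionClient

# With the uv hook active, the entrypoint itself can be "uv run ..." and no
# py_executable needs to be set. The dashboard address is an assumption.
client = JobSubmissionClient("http://127.0.0.1:8265")
job_id = client.submit_job(
    entrypoint="uv run main.py",
    runtime_env={"working_dir": "."},  # ships pyproject.toml / uv.lock with the job
)
print("Submitted job:", job_id)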

@pcmoritz
Contributor

In the latest master, uv run can now be used with the Ray Client too: #51683 :)
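Roughly, that means a driver launched with uv run can now connect to a remote cluster like this (a sketch; the address is a placeholder):

# driver.py -- launched with: uv run driver.py
import ray

# Connect through the Ray Client; host and port are placeholders.
ray.init("ray://head-node-address:10001")

@ray.remote
def remote_ray_version():
    import ray
    return ray.__version__

print(ray.get(remote_ray_version.remote()))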

@HAKSOAT

HAKSOAT commented Mar 28, 2025

Hello... I opened an issue reporting an error when running ray job submit: #51777

@cszhu
Author

cszhu commented Mar 31, 2025

ray-project/kuberay#3247

A community member opened this ticket for uv + KubeRay + Ray, and I'm cross-posting it here for visibility since it's in a different repo.
