
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-06
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits October 19, 2025 09:41
`result_of_t` is deprecated

Signed-off-by: Lonnie Liu <[email protected]>
- disables java tests; ray java is not supported on apple silicon yet.
- skips cpp tests that are not passing yet

we already stopped releasing macos wheels for intel machines, and the tests
that are disabled or skipped were never passing on apple silicon, so
nothing is regressed.

Signed-off-by: Lonnie Liu <[email protected]>
…ay-project#57876)

## Description

## Related issues
Closes ray-project#57847

## Additional information

Signed-off-by: daiping8 <[email protected]>
…ystem cgroup (ray-project#57864)

For more details about the resource isolation project see
ray-project#54703.

When starting the head node, move the dashboard api server's
subprocesses into the system cgroup. I updated the integration test and
added a helpful error message because the test will break in the future
when a new dashboard module is added.
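
For reference, a minimal sketch (not Ray's actual implementation) of what "moving a process into a cgroup" means on cgroup v2: the PID is written into the destination cgroup's `cgroup.procs` file. The path and helper name below are illustrative.

```python
# Minimal sketch, not Ray's code: on cgroup v2, a process is moved by writing
# its PID into the destination cgroup's cgroup.procs file.
def move_pid_to_cgroup(pid: int, cgroup_dir: str = "/sys/fs/cgroup/ray-system") -> None:
    # cgroup_dir is an illustrative path, not the one Ray uses.
    with open(f"{cgroup_dir}/cgroup.procs", "w") as f:
        f.write(str(pid))
```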

I ran the integration tests 25 times locally. 

```
(ray2) ubuntu@devbox:~/code/ray2$ python -m pytest -s \
    python/ray/tests/resource_isolation/test_resource_isolation_integration.py --count 25 -x
...
collecting ...
python/ray/tests/resource_isolation/test_resource_isolation_integration.py ✓✓✓ [...] 100% ██████████
Results (366.41s):
     100 passed
```

---------

Signed-off-by: irabbani <[email protected]>
…roject#57037)

During the execution of tail_job_logs() after job submission, if the
connection to the Ray head breaks, tail_job_logs() does not raise any
error. It should.

This change queries the job status when the log stream closes, and raises an
error if the connection closed while the job is not yet in a terminal state
(sketched below).
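
A minimal sketch of the intended behavior (not the SDK's actual code), assuming the standard `JobSubmissionClient` API; the helper name is hypothetical.

```python
from ray.job_submission import JobStatus, JobSubmissionClient

TERMINAL_STATES = {JobStatus.SUCCEEDED, JobStatus.FAILED, JobStatus.STOPPED}

def raise_if_stream_closed_early(client: JobSubmissionClient, job_id: str) -> None:
    # Called when the tailing connection closes: check whether the job had
    # actually finished; if not, surface the broken connection to the caller.
    status = client.get_job_status(job_id)
    if status not in TERMINAL_STATES:
        raise RuntimeError(
            f"Log stream for job {job_id} closed while the job is still {status}; "
            "the connection to the Ray head may have been lost."
        )
```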

## Related issue number

Closes: ray-project#57002

---------

Signed-off-by: machichima <[email protected]>
…ay-project#57802)

## Description

1. This PR adds `jax.distributed.shutdown()` to the JaxBackend in
order to free up any leaked resources on TPU RayTrainWorkers (see the sketch below).
2. If `jax.distributed` is not initialized, the call is a no-op:
https://docs.jax.dev/en/latest/_autosummary/jax.distributed.shutdown.html
3. Tested on an Anyscale workspace.
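A minimal sketch of the idea (not the actual JaxBackend code): call `jax.distributed.shutdown()` when a worker is torn down; per the linked docs it is a no-op if `jax.distributed.initialize()` was never called. The helper name is hypothetical.

```python
import jax

def shutdown_jax_worker() -> None:
    # Frees distributed-runtime resources on the TPU worker.
    # Safe to call even if jax.distributed was never initialized.
    jax.distributed.shutdown()
```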
we are not releasing `x86_64` wheels anymore

Signed-off-by: Lonnie Liu <[email protected]>
…igurable (ray-project#57705)

Recently, we ran performance tests with task event generation turned on and
saw some performance regression when the workloads ran on very small CPU
machines. Further investigation showed that the overhead mainly comes from
the field-name conversion applied when converting the proto message to the
JSON payload in the aggregator agent.

This PR adds an env var to control the name-conversion behavior (sketched
below) and updates the corresponding tests.

Also note that we eventually plan to remove this config and turn off the
field name conversion by default, after migrating all current event usage.
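
A minimal sketch of how such a toggle can work (not the aggregator agent's actual code); the env var name below is hypothetical, and `preserving_proto_field_name` is the standard protobuf JSON option that skips the snake_case-to-camelCase conversion.

```python
import os

from google.protobuf.json_format import MessageToDict

def event_proto_to_payload(event_proto) -> dict:
    # Hypothetical env var name: "1" keeps the (more expensive) camelCase conversion.
    convert_names = os.environ.get("RAY_EVENT_CONVERT_FIELD_NAMES", "1") == "1"
    return MessageToDict(event_proto, preserving_proto_field_name=not convert_names)
```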

---------

Signed-off-by: Mengjin Yan <[email protected]>
It used to be in 3 different groups; now they are unified into 1.

Signed-off-by: kevin <[email protected]>
…nter (ray-project#56848)

* Updated preprocessors to use a callback-based approach for stat
computation. This improves code organization and reduces duplication.
* Added ValueCounter aggregator and value_counts method to
BlockColumnAccessor. Includes implementations for both Arrow and Pandas
backends.
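
For context, a minimal sketch (not the PR's implementation) of the per-backend value-counting primitives such an accessor method can build on:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

values = ["a", "b", "a", "a", "c"]

# Pandas backend: Series.value_counts() yields counts per distinct value.
pandas_counts = pd.Series(values).value_counts().to_dict()

# Arrow backend: pc.value_counts() returns a struct array of {values, counts}.
arrow_counts = {
    item["values"]: item["counts"]
    for item in pc.value_counts(pa.array(values)).to_pylist()
}

assert pandas_counts == arrow_counts == {"a": 3, "b": 1, "c": 1}
```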


---------

Signed-off-by: cem <[email protected]>
Signed-off-by: cem-anyscale <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… only once." (ray-project#57917)

This PR fixes the Ray check failure `RayEventRecorder::StartExportingEvents()
should be called only once`.
The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the
dashboard agent. In this case, RayEventRecorder::StartExportingEvents is
correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number
of connection retries. In this case,
RayEventRecorder::StartExportingEvents is invoked again incorrectly,
causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now
synchronized (previously they ran asynchronously, allowing both paths to
trigger).
- Do not call StartExportingEvents if the connection cannot be
established.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
## Description

Ray Data can't serialize zero-length (zero-byte) numpy arrays:

```python3
import numpy as np
import ray.data

array = np.empty((2, 0), dtype=np.int8)

ds = ray.data.from_items([{"array": array}])

for batch in ds.iter_batches(batch_size=1):
     print(batch)
```

What I expect to see:

```
{'array': array([], shape=(1, 2, 0), dtype=int8)}
```

What I see:

```
/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py:736: RuntimeWarning: invalid value encountered in scalar divide
  offsets = np.arange(
2025-10-17 17:18:09,499 WARNING arrow.py:189 -- Failed to convert column 'array' into pyarrow array due to: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []; falling back to serialize as pickled python objects
Traceback (most recent call last):
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 672, in from_numpy
    return cls._from_numpy(arr)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 736, in _from_numpy
    offsets = np.arange(
              ^^^^^^^^^^
ValueError: arange: cannot compute length

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 141, in convert_to_pyarrow_array
    return ArrowTensorArray.from_numpy(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 678, in from_numpy
    raise ArrowConversionError(data_str) from e
ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []
2025-10-17 17:18:09,789 INFO logging.py:293 -- Registered dataset logger for dataset dataset_0_0
2025-10-17 17:18:09,815 WARNING resource_manager.py:134 -- ⚠️  Ray's object store is configured to use only 33.5% of available memory (2.0GiB out of 6.0GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
{'array': array([array([], shape=(2, 0), dtype=int8)], dtype=object)}
```

This PR fixes the issue so that zero-length arrays are serialized
correctly, and the shape and dtype are preserved.

## Additional information

This is `ray==2.50.0`.

---------

Signed-off-by: Chris O'Hara <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
use awscli directly; stop installing extra dependencies

Signed-off-by: Lonnie Liu <[email protected]>
## Description
Found this while reading the docs. Not sure what this "Note that" is
referring to or why it is there.

Signed-off-by: Max van Dijck <[email protected]>
it should not run on intel macos anymore

Signed-off-by: Lonnie Liu <[email protected]>
…ect#57877)

so that we are not tied to using public s3 buckets

Signed-off-by: Lonnie Liu <[email protected]>
…ject#57925)

This PR moves the error handling of the metric+event exporter agent one
level up, into the `metrics_agent_client` callback. Previously, the errors
were handled by either the metric or the event recorder, which led to
confusing and buggy code.

Test:
- CI

---------

Signed-off-by: Cuong Nguyen <[email protected]>
## Description
Bumping from small to medium because it's timing out for Python 3.12.

Signed-off-by: Matthew Deng <[email protected]>
…project#57932)

## Description
This PR adds Prometheus metrics to selected RLlib components.

---------

Signed-off-by: joshlee <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
Signed-off-by: kevin <[email protected]>
Signed-off-by: cem <[email protected]>
Signed-off-by: cem-anyscale <[email protected]>
Signed-off-by: Cuong Nguyen <[email protected]>
Signed-off-by: Chris O'Hara <[email protected]>
Signed-off-by: Lonnie Liu <[email protected]>
Signed-off-by: Max van Dijck <[email protected]>
Signed-off-by: Seiji Eicher <[email protected]>
Co-authored-by: Joshua Lee <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: cem-anyscale <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cuong Nguyen <[email protected]>
Co-authored-by: Chris O'Hara <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
Co-authored-by: Max van Dijck <[email protected]>
Co-authored-by: Seiji Eicher <[email protected]>
This PR makes the `ray.get` public API thread-safe. It also cleans up a
lot of tech debt with respect to:
* Workers yielding CPU to the raylet when blocked.
* Cleaning up finished/in-flight Get requests.

Previously, the raylet coalesced all get requests from the same worker
into one Get (and Pull) request. However, Get request cleanup could
happen on multiple threads meaning **one thread could cancel inflight
get requests for all threads in a worker**. This issue was reported in
ray-project#54007.

### Changes in this PR:

Raylet (server-side)
1. AsyncGetObjects will return a request_id.
2. LeaseDependencyManager no longer coalesces AsyncGetObjects requests
from the same worker.
3. LeaseDependencyManager has two methods for cleanup (delete all
requests for worker during worker disconnect/lease cleanup) and delete a
specific request (called through CancelGetRequest)
4. Wait no longer cancels all Get requests for the worker (this was
probably a bug)
5. NotifyWorkerUnblock does not cancel get requests anymore. 

CoreWorker (client-side)
1. PlasmaStoreProvider::Get will make 1 call to AsyncGetObjects per
batch.
2. PlasmaStoreProvider::Get will store scoped cleanup handlers that will
call CancelGetRequest for each call to AsyncGetObjects to guarantee
RAII-style cleanup

Closes ray-project#54007.
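
For illustration, a minimal sketch (not from the PR) of the kind of workload this change makes safe: several application threads calling `ray.get` concurrently from the same worker process.

```python
import threading

import ray

ray.init()
refs = [ray.put(i) for i in range(8)]

def fetch(ref):
    # Each thread blocks in ray.get(); cleanup of one thread's Get request
    # no longer cancels the in-flight requests of the other threads.
    print(ray.get(ref))

threads = [threading.Thread(target=fetch, args=(r,)) for r in refs]
for t in threads:
    t.start()
for t in threads:
    t.join()
```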

---------

Signed-off-by: irabbani <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…acing work (ray-project#57908)

Tracing code hasn't been maintained, and it can't be run by relying on
the docs alone.

1.
[https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#installation](https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#installation)

`opentelemetry-api==1.1.0`
Version 1.1.0 is too old;

https://github.com/ray-project/ray/blob/b988ce4e9b0fb618b40865600c0d98f1714c3bcf/ci/docker/serve.build.Dockerfile#L47
we're already using 1.3.0+, which is incompatible with 1.1.0.

2.
A legacy issue? This prevents the help information from being displayed.

---------

Signed-off-by: justwph <[email protected]>
Signed-off-by: JustWPH <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
crypdick and others added 19 commits November 2, 2025 20:24
… and GRPO. (ray-project#57961)

## Description
Example for first blog in the RDT series using NIXL for GPU-GPU tensor
transfers.

---------

Signed-off-by: Ricardo Decal <[email protected]>
Signed-off-by: Stephanie Wang <[email protected]>
Co-authored-by: Ricardo Decal <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
Co-authored-by: Qiaolin Yu <[email protected]>
python 3.9 is now out of the support window

all using python 3.12 wheel names for unit testing

Signed-off-by: Lonnie Liu <[email protected]>
we will stop releasing them

Signed-off-by: Lonnie Liu <[email protected]>
and move them into bazel dir.
getting ready for python version upgrade

Signed-off-by: Lonnie Liu <[email protected]>
python 3.9 is out of support window

Signed-off-by: Lonnie Liu <[email protected]>
…ect#58375)

Starting with KubeRay 1.5.0, KubeRay supports gang scheduling for RayJob
custom resources.
This just adds a mention of the YuniKorn scheduler.

Related to ray-project/kuberay#3948.

Signed-off-by: win5923 <[email protected]>
This PR adds support for token-based authentication in the Ray
bi-directional syncer, for both client and server sides. It also
includes tests to verify the functionality.

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Support token-based authentication in the runtime env agent (client and server).
Refactor the existing dashboard head code so that the utils and middleware
can be reused by the runtime env agent as well.

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…un. (ray-project#58335)

## Description
> Add Spark master validation to let Ray run in Spark-on-YARN mode.

## Why need this?
> Running Ray directly on a YARN cluster would require more testing and
integration work, plus setting up related tools and environments. If
Ray-on-Spark-on-YARN is supported and a Spark environment is already set up,
nothing else is needed: users can rely on Spark and run PySpark.

Signed-off-by: Cai Zhanqi <[email protected]>
Co-authored-by: Cai Zhanqi <[email protected]>
upgrading reef tests to run on 3.10

Signed-off-by: elliot-barn <[email protected]>
The issue with the current implementation of core worker HandleKillActor
is that it won't send a reply when the RPC completes, because the worker
is dead. The application code in the GCS doesn't really care, since it
just logs the response if one is received; a response is only sent if
the actor ID of the actor on the worker and in the RPC don't match, and
the GCS will just log it and move on with its life.

Hence we can't differentiate in the case of a transient network failure
whether there was a network issue, or the actor was successfully killed.
What I think is the most straightforward approach is instead of the GCS
directly calling core worker KillActor, we have the GCS talk to the
raylet instead and call a new RPC KillLocalActor that in turn calls
KillActor. Since the raylet that receives KillLocalActor is local to the
worker that the actor is on, we're guaranteed to kill it at that point
(either through using KillActor, or if it hangs falling back to
SIGKILL).

Thus the main intuition is that the GCS now talks to the raylet, and
this layer implements retries. Once the raylet receives the
KillLocalActor request, it routes this to KillActor. This layer between
the raylet and core worker does not have retries enabled because we can
assume that RPCs between the local raylet and worker won't fail (same
machine). We then check on the status of the worker after a while (5
seconds via kill_worker_timeout_milliseconds) and if it still hasn't
been killed then we call DestroyWorker that in turn sends the SIGKILL.

---------

Signed-off-by: joshlee <[email protected]>
upgrading data ci tests to py3.10

postmerge build:
https://buildkite.com/ray-project/postmerge/builds/14192

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
upgrading serve tests to run on python 3.10

Post merge run: https://buildkite.com/ray-project/postmerge/builds/14190

---------

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…roject#58307)

There was a video object detection Ray Data workload hang reported. 

An initial investigation by @jjyao and @dayshah observed that it was due
to an actor restart and the actor creation task was being spilled to a
raylet that had an outdated resource view. This was found by looking at
the raylet state dump. This actor creation task required 1 GPU and 1
CPU, and the raylet where this actor creation task was being spilled to
had a cluster view that reported no available GPUs. However there were
many available GPUs, and all the other raylet state dumps correctly
reported this. Furthermore, in the raylet logs for the outdated raylet
there was a "Failed to send a message to node: " originating from the
ray syncer. Hence an initial hypothesis was formed that the ray syncer
retry policy was not working as intended.

A follow-up investigation by @edoakes and me revealed an incorrect usage
of the grpc streaming callback API.
Currently, retries in the ray syncer on a failed send/write work as follows:
- OnWriteDone/OnReadDone(ok = false) is called after a failed read/write
- Disconnect() (the one in *_bidi_reactor.h!) is called which flips
_disconnected to true and calls DoDisconnect()
- DoDisconnect() notifies grpc we will no longer write to the channel
via StartWritesDone() and removes the hold via RemoveHold()
- GRPC will see that the channel is idle and has no hold so will call
OnDone()
- we've overridden OnDone() to hold a cleanup_cb that contains the retry
policy that reinitializes the bidi reactor and connects to the same
server at a repeated interval of 2 seconds until it succeeds
- fault tolerance accomplished! :) 

However from logs that we added we weren't seeing OnDone() being called
after DoDisconnect() happens. From reading the grpc streaming callback
best practices here:

https://grpc.io/docs/languages/cpp/best_practices/#callback-streaming-api
it states that "The best practice is always to read until ok=false on
the client side"
From the OnDone grpc documentation:
https://grpc.github.io/grpc/cpp/classgrpc_1_1_client_bidi_reactor.html#a51529f76deeda6416ce346291577ffa9:
it states that "Notifies the application that all operations associated
with this RPC have completed and all Holds have been removed"

Since we call StartWritesDone() and removed the hold, this should notify
grpc that all operations associated with this bidi reactor are
completed. HOWEVER reads may not be finished, i.e. we have not read all
incoming data.
Consider the following scenario:
1.) We receive a bunch of resource view messages from the GCS and have
not processed all of them
2.) OnWriteDone(ok = false) is called => Disconnect() => disconnected_
= true
3.) OnReadDone(ok = true) is called however because disconnected_ = true
we early return and STOP processing any more reads as shown below:

https://github.com/ray-project/ray/blob/275a585203bef4e48c04b46b2b7778bd8265cf46/src/ray/ray_syncer/ray_syncer_bidi_reactor_base.h#L178-L180
4.) Pending reads are left in the queue, preventing grpc from calling OnDone
since not all operations are done
5.) Hang, we're left in a zombie state and drop all incoming resource
view messages and don't send any resource view updates due to the
disconnected check

Hence the solution is to remove the disconnected check in OnReadDone and
simply allow all incoming data to be read.

There are a couple of interesting observations/questions remaining:
1.) The raylet with the outdated view is local to the gcs, yet
we're seeing read/write errors despite being on the same node
2.) From the logs I see that the gcs syncer thinks that the channel to
the raylet syncer is still available. There are no error logs on the gcs
side; it's still sending messages to the raylet. Hence even though the
raylet gets the "Failed to write error: " we don't see a corresponding
error log on the GCS side.

---------

Signed-off-by: joshlee <[email protected]>
…project#58161)

## Description
kai-scheduler supports gang scheduling at
[v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)).

But gang scheduling doesn't work at v0.9.4. However, it works again at
v0.10.0-rc1.

## Related issues

## Additional information
The reason might be as follows.

The `numOfHosts` is taken into consideration at v0.9.3.

https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

The snippet of code is missing at v0.9.4.

https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140

Then, it shows up at v0.10.0-rc1.

https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

---------

Signed-off-by: fscnick <[email protected]>
It is sometimes intuitive for users to provide their extensions with a '.'
at the start. This PR takes care of that and removes the '.' when it is
provided.

For example, when using `ray.data.read_parquet`, the parameter
`file_extensions` needs to be something like `['parquet']`. However,
intuitively some users may interpret this parameter as being able to use
`['.parquet']`.

This commit allows users to switch from:

```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['parquet'],
)
```

to

```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['.parquet'],  # Now will read files, instead of silently not reading anything
)
```
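
A minimal sketch (not the PR's exact code) of the normalization this implies, using a hypothetical helper:

```python
def normalize_file_extensions(file_extensions):
    # Hypothetical helper: treat "parquet" and ".parquet" the same by
    # stripping a single leading dot from each extension.
    if file_extensions is None:
        return None
    return [ext[1:] if ext.startswith(".") else ext for ext in file_extensions]

assert normalize_file_extensions([".parquet", "csv"]) == ["parquet", "csv"]
```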
…roject#58372)

When starting a Ray cluster in a KubeRay environment, the startup
process may sometimes be slow. In such cases, it is necessary to
increase the timeout duration for a proper startup; otherwise, the error
"ray client connection timeout" will occur. Therefore, we need to make
the timeout and retry policies for the Ray worker configurable.

---------

Signed-off-by: OneSizeFitsQuorum <[email protected]>
…#58277)

## Description
Rich progress currently doesn't support reporting progress from workers.
As supporting this is expected to require a lot of design work, default
to using tqdm progress (which supports progress reporting from workers).

Furthermore, we don't have auto-detection to set `use_ray_tqdm`, so the
requirement is for that to be disabled as well.

In summary, the requirements for rich progress as of now:
- rich progress bars enabled
- `use_ray_tqdm` disabled (see the sketch below)
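
A minimal sketch of what enabling that combination might look like via `DataContext`; only `use_ray_tqdm` is referenced above, and the rich-progress attribute name below is a guess.

```python
import ray

ctx = ray.data.DataContext.get_current()
ctx.use_ray_tqdm = False               # referenced in the summary above
ctx.enable_rich_progress_bars = True   # hypothetical attribute name
```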

## Related issues
Fixes ray-project#58250

## Additional information
N/A

---------

Signed-off-by: kyuds <[email protected]>
Signed-off-by: Daniel Shin <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #671 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5203.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, bringing a comprehensive set of updates across the project's build, CI, and development infrastructure. The primary goal is to enhance the robustness, efficiency, and maintainability of the development workflow, from automated testing and dependency management to code quality enforcement and documentation generation.

Highlights

  • Bazel Configuration Updates: The .bazelrc file has been updated to enable --incompatible_strict_action_env by default, add a workspace status command for Linux builds, include /utf-8 CXX option for Windows, and ignore warnings for third-party files. It also adds --test_env=PATH for CI Python tests and allows custom user options via .user.bazelrc.
  • Buildkite CI Pipeline Refactoring: Extensive refactoring of Buildkite CI configurations has occurred. A new _images.rayci.yml centralizes Docker image definitions, simplifying other .rayci.yml files. Python versions are updated across various jobs, and new build steps for ray-core, ray-dashboard, and ray-java are introduced. Several test jobs have updated dependencies and build names.
  • Code Ownership and Linting Enhancements: The .github/CODEOWNERS file has been refactored for improved granularity. Pre-commit hooks in .pre-commit-config.yaml are updated, including a split for pydoclint and the addition of semgrep, vale, cython-lint, and eslint hooks for broader code quality checks.
  • Dependency Management Modernization: The CI system now leverages ci/raydepsets for managing Python dependencies, with new configuration files (.depsets.yaml) and pre-hooks for building placeholder wheels and handling compiled headers. This includes a switch from miniconda to miniforge3 in various build scripts and Dockerfiles.
  • C++ API and Runtime Adjustments: Changes in the C++ codebase include updates to actor and task caller argument wrapping, metric recording, and runtime initialization. Notably, GetNodeIpAddress is replaced with ray::GetNodeIpAddressFromPerspective, and GenerateUUIDV4 is updated to UniqueID::FromRandom().Hex() for namespace generation.
  • Documentation Infrastructure Improvements: The documentation build system has been updated to use python/deplocks/docs/docbuild_depset_py3.12.lock for requirements. New CSS and JavaScript files are added for dismissable banners and improved analytics tracking. The documentation style guide (.cursor/rules/ray-docs-style.mdc) is introduced to enforce consistent writing standards.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an automated daily merge from the master branch into the main branch. It incorporates a wide array of updates across the repository, focusing on refining the Buildkite CI/CD pipelines, enhancing Bazel build configurations, and modernizing Python dependency management. The changes aim to improve build reliability, streamline testing processes, and update various Ray components to support newer Python and CUDA versions, alongside general code maintenance and style guide adherence.

Highlights

  • Bazel Configuration: Updated .bazelrc to enable strict action environment by default, ignore warnings for third-party files, and add a workspace status command for Linux builds. Also added a /utf-8 option for Windows C++ compilation and allowed custom user Bazel configurations.
  • Buildkite Pipeline Refactoring: Significant refactoring of Buildkite pipelines, including the creation of a new _images.rayci.yml for Docker image builds, and the removal of image build steps from _forge.rayci.yml. New dedicated build steps for ray-core, ray-dashboard, and ray-java have been introduced.
  • Python Dependency Management: Introduced ci/raydepsets for managing Python dependencies, replacing manual pip-compile steps and enabling more granular control over dependency sets for various Ray components and environments. This includes new dependencies.rayci.yml and raydepsets.yaml configurations.
  • CI Test Updates: Numerous updates to CI test configurations across core, data, ML, LLM, RLlib, and Serve components. This includes adding Python 3.10 support to many test matrices, refining except-tags for test execution, and introducing new test steps for CUDA 12.8, Dask, tracing, and authenticated data tests.
  • MacOS Build and Test Enhancements: Updated macOS Buildkite configurations to use macos-arm64 instance types and streamlined the macOS wheel build script. C++ tests on macOS are now run via a dedicated script.
  • Pre-commit and Linting Improvements: Enhanced pre-commit hooks to include semgrep, vale, cython-lint, and eslint. The pydoclint hook was split into local and CI-specific versions, and Bazel buildifier now covers more file types.
  • C++ API and Runtime Changes: Refactored C++ API and runtime code, including changes to RemoteFunctionHolder members, metric recording, and ConfigInternal bootstrap address handling. Object store PutRaw no longer throws an exception on error, and ProcessHelper uses GetNodeIpAddressFromPerspective.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a large-scale refactoring of the CI/build system. It introduces many improvements, such as modularizing build steps, centralizing dependency management with raydepsets, and optimizing CI runs by using pre-built components. The changes are generally well-executed and improve the maintainability and efficiency of the build process. I found one minor issue in a test file.

```diff
 with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
     LinuxTesterContainer("team", build_type="debug")
-    docker_image = f"{_DOCKER_ECR_REPO}:{_RAYCI_BUILD_ID}-team"
+    docker_image = f"{_DOCKER_ECR_REPO}:team"
```


high

The expected docker image name seems incorrect. The _get_docker_image method now uses get_docker_image, which prefixes the tag with RAYCI_BUILD_ID if it's available. In this test setup, RAYCI_BUILD_ID is set to a1b2c3d4, so the expected image name should be f"{_DOCKER_ECR_REPO}:a1b2c3d4-team". The previous implementation was correct.

Suggested change:
```diff
-docker_image = f"{_DOCKER_ECR_REPO}:team"
+docker_image = f"{_DOCKER_ECR_REPO}:a1b2c3d4-team"
```

@github-actions

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 21, 2025
