🔄 daily merge: master → main 2025-11-18 #680

antfin-oss · 2025-11-18T02:56:07Z

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-18
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

updating tune release tests to run on python 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65655 (failing tests are already disabled) --------- Signed-off-by: elliot-barn <[email protected]>

Remove the actor handle from the object that gets passed around in long poll communication. Returning an actor handle in nested objects from a task makes the caller of that task a borrower from the reference-counting point of view. This pattern, although allowed, is not very well tested, so this change avoids it by passing `actor_name` from `listen_for_change` instead. --------- Signed-off-by: abrar <[email protected]>

## Description The full name was probably hallucinated by an LLM. ## Related issues ## Additional information Signed-off-by: Rui Qiao <[email protected]>

…ross-node parallelism (ray-project#57261) Signed-off-by: jeffreyjeffreywang <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: jeffreyjeffreywang <[email protected]> Co-authored-by: Richard Liaw <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nikhil G <[email protected]>

…imit` (ray-project#58303) ## Description ## Related issues Fix comment ray-project#58264 (comment) ## Additional information Signed-off-by: You-Cheng Lin <[email protected]>

…egy (ray-project#58306) Signed-off-by: wei-chenglai <[email protected]>

…ist with nixl (ray-project#58263) ## Description For nixl, reuse previous metadata if transferring the same tensor list. This is to avoid repeated `register_memory` before `deregister_memory` --------- Signed-off-by: Dhyey Shah <[email protected]> Co-authored-by: Dhyey Shah <[email protected]> Co-authored-by: Stephanie Wang <[email protected]>

…tuned_examples/`` in ``rllib`` (ray-project#56746)   ## Why are these changes needed?  Seventh split of ray-project#56416 ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: Gagandeep Singh <[email protected]> Signed-off-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Kamil Kaczmarek <[email protected]> Co-authored-by: Mark Towers <[email protected]>

…ct#57835) ## Description Building on ray-project#58047, this PR ensures the following when `auth_mode` is `token`: - Calling `ray.init()` (without passing an existing cluster address): check whether a token is present; if not, generate one and store it in the default path. - Calling `ray.init(address="xyz")` (connecting to an existing cluster): check whether a token is present; raise an exception if one is not present. --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>
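
The two `ray.init()` cases above can be sketched as a small decision function. This is only an illustration: the helper name `ensure_auth_token`, the `token_path` argument, and the use of `secrets.token_hex` are assumptions made for the example, not Ray's actual implementation.

```python
import secrets
from pathlib import Path
from typing import Optional


def ensure_auth_token(address: Optional[str], token_path: Path) -> str:
    """Hypothetical helper mirroring the behavior described above."""
    if token_path.exists():
        # A token is already present: use it in both cases.
        return token_path.read_text().strip()
    if address is not None:
        # Connecting to an existing cluster: a token must already exist.
        raise RuntimeError(
            f"auth_mode is 'token' but no token was found at {token_path}"
        )
    # Starting a new local cluster: generate a token and store it.
    token = secrets.token_hex(32)
    token_path.parent.mkdir(parents=True, exist_ok=True)
    token_path.write_text(token)
    return token
```

Keeping the "generate" branch reachable only when no address is given means a client can never silently invent a token that the remote cluster does not know about.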

…indefinite waiting (ray-project#58238) ## Description Add guidance for RayService initialization timeout to prevent indefinite waiting with `ray.io/initializing-timeout` annotation on RayService. ## Related issues Closes ray-project/kuberay#4138 ## Additional information None --------- Signed-off-by: wei-chenglai <[email protected]> Signed-off-by: Wei-Cheng Lai <[email protected]> Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>

…check in accelerator_context.py (ray-project#58269) ## Description - Change the caught exception type from IndexError to TypeError - This modification ensures that the correct exception is raised when the expected accelerator ID is not included in the accelerator_visible_list `list.index` will raise a [ValueError](https://docs.python.org/3/library/exceptions.html#ValueError) if there is no such item. https://docs.python.org/3/tutorial/datastructures.html <img width="797" height="79" alt="image" src="https://github.com/user-attachments/assets/830cf4aa-d9cb-44d3-9363-8e5cd576bae9" /> Now, the error logs will be correctly captured and printed. <img width="1454" height="143" alt="image" src="https://github.com/user-attachments/assets/2b0ed0aa-60c7-49c3-84b0-7f0d4f1ebe48" /> Signed-off-by: daiping8 <[email protected]>
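
The `list.index` behavior that the description relies on is easy to confirm in isolation. The sketch below uses a hypothetical `find_accelerator_index` helper and made-up ID strings; it is not the code from `accelerator_context.py`.

```python
from typing import List, Optional


def find_accelerator_index(visible: List[str], accelerator_id: str) -> Optional[int]:
    """Hypothetical helper: return the position of accelerator_id, or None."""
    try:
        return visible.index(accelerator_id)
    except ValueError:
        # list.index raises ValueError (not IndexError) when the item is missing.
        return None
```

Catching `IndexError` here would let the `ValueError` propagate, so the handler's logging would never run.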

Significant component, I keep forgetting it's buried inside of `common/`. Also cleaned up the mock & proto build targets that were in the top-level `BUILD.bazel` file. We should also at some point clean it up to follow the common pattern of separate targets for an interface, client, & server. --------- Signed-off-by: Edward Oakes <[email protected]>

…t#58070) RayEvent provides a special API, merge, which allows multiple events to be combined into a single event. This reduces gRPC message size and network bandwidth usage, and is essential for scaling task event exports. This PR leverages that feature. Specifically, it clusters events into groups based on (i) entity ID and (ii) event type. Each group is merged into a single event, which is then added to the gRPC message body. The EntityId is a user-defined function, implemented by the event class creator, that determines which events can be safely merged.

```
Note: this is a redo of ray-project#56558, which was reverted because it randomized the order of the exported events, leading to flaky tests. This attempt maintains the order even after merging.
```

Test: - CI Signed-off-by: Cuong Nguyen <[email protected]>
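
The grouping-and-merging step described above can be sketched in a few lines. The `Event` class, its `payloads` field, and `merge_events` are hypothetical stand-ins for illustration; the real RayEvent interface is defined in the Ray C++ code and may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Event:
    """Hypothetical stand-in for a RayEvent."""

    entity_id: str
    event_type: str
    payloads: List[str] = field(default_factory=list)

    def merge(self, other: "Event") -> None:
        # Fold the other event's data into this one.
        self.payloads.extend(other.payloads)


def merge_events(events: List[Event]) -> List[Event]:
    """Merge events sharing (entity_id, event_type), keeping first-seen order."""
    groups: Dict[Tuple[str, str], Event] = {}
    for ev in events:
        key = (ev.entity_id, ev.event_type)
        if key in groups:
            groups[key].merge(ev)
        else:
            # Copy so the caller's events are not mutated.
            groups[key] = Event(ev.entity_id, ev.event_type, list(ev.payloads))
    # Python dicts preserve insertion order, so the output order is deterministic.
    return list(groups.values())
```

Relying on an insertion-ordered map is one way to avoid the ordering problem mentioned in the note: each group appears at the position of its first event.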

… in core worker (ray-project#58315) The `metrics_agent_client_` depends on `client_call_manager_`, but previously it was pulling out a reference to it from the core worker, which is not guaranteed to outlive the agent client. Modifying it to keep the `client_call_manager_` as a field of the `core_worker_process` instead. I think we may also need to drain any ongoing RPCs from the `metrics_agent_client_` on shutdown. Leaving that for a future PR. --------- Signed-off-by: Edward Oakes <[email protected]>

## Description Historically, the intention was to avoid failures upon attempts to modify the provided batch in place when, for example, using Torch tensors. However, that unjustifiably penalizes 99.9% of use cases for the sake of 0.1% of scenarios. As such, we're flipping this setting to `zero_copy_batch=True` by default. --------- Signed-off-by: Alexey Kudinkin <[email protected]>

A few release tests currently use the `raw_metric` API to retrieve Prometheus metrics. This approach is unreliable because it directly polls the metric export endpoint, which can race with the Prometheus server that also polls the same endpoint. To address this, add a method to query metrics directly from the Prometheus server instead. This ensures that in end-to-end and production environments, Prometheus remains the sole poller of exported metrics and the single source of truth for metric values. Test: - CI - The memory metric is collected and the number makes sense: https://buildkite.com/ray-project/release/builds/66031/steps/canvas?sid=019a3215-e3be-4e9b-86a3-1f0a8253fea7#019a3215-e3f2-4d0e-877f-5f43d97d6e8e/557-654 Signed-off-by: Cuong Nguyen <[email protected]>
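
A minimal sketch of querying the Prometheus server rather than the export endpoint, using the instant-query endpoint (`/api/v1/query`) of the Prometheus HTTP API. The function names and the default timeout are assumptions for illustration and are not the helper added by this change.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{prometheus_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload: dict) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "string value"]}.
    return [(s["metric"], float(s["value"][1])) for s in payload["data"]["result"]]


def query_prometheus(prometheus_url: str, promql: str, timeout_s: float = 10.0):
    """Ask the Prometheus server for a metric instead of scraping the exporter."""
    with urlopen(build_query_url(prometheus_url, promql), timeout=timeout_s) as resp:
        return parse_instant_vector(json.load(resp))
```

Because the test only reads what Prometheus has already scraped, it no longer competes with the server for the export endpoint.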

Extend token auth support to dashboard head (all API's) --------- Signed-off-by: sampan <[email protected]> Signed-off-by: Sampan S Nayak <[email protected]> Signed-off-by: Edward Oakes <[email protected]> Co-authored-by: sampan <[email protected]> Co-authored-by: Edward Oakes <[email protected]>

This PR also updates the cluster resource scheduler logic to account for the list of `LabelSelector`s specified by the `fallback_strategy`, falling back to each fallback strategy `LabelSelector` in-order until one is satisfied when selecting the best node. We're able to support fallback selectors by considering them in the cluster resource scheduler in-order using the existing label selector logic in `IsFeasible` and `IsAvailable`, returning the first valid node returned by `GetBestSchedulableNode`. ray-project#51564 --------- Signed-off-by: Ryan O'Leary <[email protected]> Signed-off-by: Ryan O'Leary <[email protected]> Co-authored-by: Mengjin Yan <[email protected]>
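
The in-order fallback described above can be illustrated with a simplified sketch. Everything here is hypothetical: `matches` only supports equality on labels, and `pick_node` stands in for the scheduler logic built on `IsFeasible`, `IsAvailable`, and `GetBestSchedulableNode`.

```python
from typing import Dict, List, Optional

LabelSelector = Dict[str, str]


def matches(node_labels: Dict[str, str], selector: LabelSelector) -> bool:
    """Simplified equality-only label matching (an assumption of this sketch)."""
    return all(node_labels.get(k) == v for k, v in selector.items())


def pick_node(
    nodes: Dict[str, Dict[str, str]],
    primary: LabelSelector,
    fallback_strategy: List[LabelSelector],
) -> Optional[str]:
    """Try the primary selector, then each fallback in order; first match wins."""
    for selector in [primary, *fallback_strategy]:
        for node_id, labels in nodes.items():
            if matches(labels, selector):
                return node_id
    return None
```

The outer loop is over selectors, not nodes, so an earlier selector always takes priority over a later one regardless of node order.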

…t Raylet logs (ray-project#58244) `test_gcs_fault_tolerance.py:: test_worker_raylet_resubscription` is still flaky in CI despite bumping up the timeout. Making a few improvements here: - Increasing the timeout to `20s` just in case it's a timeout issue (unlikely). - Changing to scheduling an actor instead of using `internal_kv` for our signal that the GCS is back up. This should better indicate that the Raylet is resubscribed. - Cleaning up some system logs. - Modifying the `ObjectLostError` logs to avoid logging likely-irrelevant plasma usage on owner death. It's likely that the underlying issue here is that we don't actually reliably resubscribe to all worker death notifications, as indicated in the TODO in the PR. --------- Signed-off-by: Edward Oakes <[email protected]>

removing core_ prefix used on release tests for testing purposes Original change: https://github.com/ray-project/ray/pull/57049/files#diff-5879986113a0287dff865f81faf24a2294660b6c4767d5a71fc6281e78101ad6R1380 Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

…ay-project#58255) ## Description Currently, the `ray.util.state.list_actors(limit=N)` API returns details for at most N actors. However, when N exceeds `RAY_MAX_LIMIT_FROM_API_SERVER` (default `10_000`), the request fails with a misleading message (claiming that the dashboard or API server is unavailable, even when it is available). This happens because we don't handle `ValueError`s and by default return a 500 error. My solution is to handle that; I'm open to suggestions and alternatives. NOTE: I noticed it still fails with an internal server error. I think this should be a 4XX error, but it looks like that will require more code changes since it uses a very ubiquitous function: `do_reply`. Gemini suggests returning `rest_response` directly; happy to follow that suggestion too.

### Before

```python
>>> import ray
>>> import ray.util.state
>>> ray.init()
>>> ray.util.state.list_actors(limit=100000)
ray.util.state.exception.RayStateApiException: Failed to make request to http://127.0.0.1:8265/api/v0/actors. Failed to connect to API server. Please check the API server log for details. Make sure dependencies are installed with `pip install ray[default]`. Please also check dashboard is available, and included when starting ray cluster, i.e. `ray start --include-dashboard=True --head`. Response(url=http://127.0.0.1:8265/api/v0/actors?limit=100000&timeout=24&detail=False&exclude_driver=True&server_timeout_multiplier=0.8,status=500)
```

### After

```python
>>> import ray
>>> import ray.util.state
>>> ray.init()
>>> ray.util.state.list_actors(limit=100000)
ray.util.state.exception.RayStateApiException: API server internal error. See dashboard.log file for more details. Error: Given limit 100000 exceeds the supported limit 10000. Use a lower limit, or set the RAY_MAX_LIMIT_FROM_API_SERVER=limit
```

## Related issues None ## Additional information None --------- Signed-off-by: iamjustinhsu <[email protected]>

upgrading tune fault release test to 3.10 Successful release test run: https://buildkite.com/ray-project/release/builds/65673# --------- Signed-off-by: elliot-barn <[email protected]>

upgrading jobs tests to run on python 3.10 Successful release tests: https://buildkite.com/ray-project/release/builds/65845 --------- Signed-off-by: elliot-barn <[email protected]>

removing byod compile jobs for release test images Now using raydepsets to generate lock files Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

## Description This PR introduces more information into the `explain` API. Before, `explain` showed Unoptimized Logical Plan, and Optimized Physical Plan. To make the `explain` API clearer, I introduce 4 types of plans - Logical Plan - Logical Plan (Optimized) - Physical Plan - Physical Plan (Optimized) Example Output

```python
>>> import ray
>>> ray.data.range(1000).select_columns("id").explain()
-------- Logical Plan --------
Project[Project]
+- Read[ReadRange]
-------- Logical Plan (Optimized) --------
Project[Project]
+- Read[ReadRange]
-------- Physical Plan --------
TaskPoolMapOperator[Project]
+- TaskPoolMapOperator[ReadRange]
   +- InputDataBuffer[Input]
-------- Physical Plan (Optimized) --------
TaskPoolMapOperator[ReadRange->Project]
+- InputDataBuffer[Input]
```

## Related issues None ## Additional information None --------- Signed-off-by: EkinKarabulut <[email protected]> Signed-off-by: EkinKarabulut <[email protected]> Signed-off-by: Rueian <[email protected]> Signed-off-by: iamjustinhsu <[email protected]> Co-authored-by: EkinKarabulut <[email protected]> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: angelinalg <[email protected]> Co-authored-by: fscnick <[email protected]> Co-authored-by: Jiajun Yao <[email protected]> Co-authored-by: Rueian <[email protected]>

…ses (ray-project#58265) ## Description Using the iptables script created in ray-project#58241, we found a bug in RequestWorkerLease where a RAY_CHECK was being triggered here: https://github.com/ray-project/ray/blob/66c08b47a195bcfac6878a234dc804142e488fc2/src/ray/raylet/lease_dependency_manager.cc#L222-L223 The issue is that transient network errors can happen at any time, including while the server logic is executing and has not yet replied to the client. Our original testing framework uses an env variable to drop the request or reply when it's being sent, hence this was missed. Specifically, RequestWorkerLease could be in the process of pulling the lease dependencies to its local plasma store when the retry arrives, triggering this check. Created a C++ unit test that triggers this RAY_CHECK without this change and passes with it. I decided to store the callbacks instead of replacing the older one with the new one because of possible message reordering, where the new one could arrive before the old one. --------- Signed-off-by: joshlee <[email protected]>
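
The "store every callback" choice can be sketched as follows. This is an illustrative Python model only; `LeaseRequestTable` and its methods are hypothetical names, and the actual fix lives in the raylet's C++ code.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class LeaseRequestTable:
    """Hypothetical table of in-flight lease requests keyed by lease ID."""

    def __init__(self) -> None:
        self._callbacks: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def request(self, lease_id: str, reply: Callable[[str], None]) -> bool:
        """Register a reply callback; return True only if this starts new work."""
        is_new = lease_id not in self._callbacks
        # A retry for an in-flight lease is appended rather than treated as an error.
        self._callbacks[lease_id].append(reply)
        return is_new

    def complete(self, lease_id: str, result: str) -> int:
        """Reply to every stored callback and return how many were invoked."""
        callbacks = self._callbacks.pop(lease_id, [])
        for cb in callbacks:
            cb(result)
        return len(callbacks)
```

Replying to all stored callbacks means it does not matter whether the original request or the retry reaches the server first.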

…yments (ray-project#58073) Signed-off-by: Nikhil Ghosh <[email protected]>

…irements (ray-project#58323) Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

…gression guard (ray-project#58289) Signed-off-by: Nikhil Ghosh <[email protected]> Signed-off-by: Nikhil G <[email protected]>

## Why are these changes needed? Adding a version arg to read_delta_lake to support reading from a specific version  > ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [x] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: soffer-anyscale <[email protected]> Signed-off-by: Richard Liaw <[email protected]> Co-authored-by: Richard Liaw <[email protected]>

it uses `enum.Enum` that are not deepcopy-able Signed-off-by: Lonnie Liu <[email protected]>

…project#58670) use double brackets as much as possible Signed-off-by: Lonnie Liu <[email protected]>

…n pushdown rule (ray-project#58683) ## Description The projection pushdown rule was directly accessing `_cached_output_metadata.schema`, which breaks abstraction barriers by reaching into private implementation details. This violates encapsulation and makes the code fragile to internal changes. This PR fixes the issue by using the proper `infer_schema()` method instead, which provides a clean public interface for accessing schema information. This respects the operator's abstraction and ensures we get the correct schema through the intended API. ## Additional information The change is in `projection_pushdown.py:342` where we now call `input_op.infer_schema()` instead of directly accessing `input_op._cached_output_metadata.schema`. Signed-off-by: Balaji Veeramani <[email protected]>

…eterministic task ordering (ray-project#58655) ## Description Add a parameter that determines whether the order should be checked, relaxing the restriction for flaky tests. ## Related issues Closes ray-project#58561 ## Additional information I ran the test 20 times using a simple script that runs `python -m pytest python/ray/data/tests/test_execution_optimizer_limit_pushdown.py::test_limit_pushdown_conservative`: - master: `Passed: 18, Failed: 2` - feature/flaky-test_limit_pushdown_conservative: `Passed: 20, Failed: 0` --------- Signed-off-by: ryankert01 <[email protected]> Signed-off-by: Ryan Huang <[email protected]> Co-authored-by: You-Cheng Lin <[email protected]>

…ect#58626) ## Description The `test_simple_imputer` test was flaky when using the most_frequent strategy because the output row ordering is nondeterministic after repartitioning. This PR makes the DataFrame comparison order-independent by using the built-in `rows_same` utility. ## Related issues Fixes ray-project#58563 ## Additional information Signed-off-by: justinyeh1995 <[email protected]>
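
For readers unfamiliar with order-independent comparison, the idea can be sketched with plain dictionaries. The function `rows_match_ignoring_order` is a hypothetical illustration and is not Ray's `rows_same` utility; it assumes string column names and hashable cell values.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Tuple


def _row_key(row: Dict[str, Any]) -> Tuple[Tuple[str, Any], ...]:
    # Sort by column name so column order does not affect the comparison.
    return tuple(sorted(row.items()))


def rows_match_ignoring_order(
    actual: Iterable[Dict[str, Any]], expected: Iterable[Dict[str, Any]]
) -> bool:
    """Compare two collections of rows as multisets, ignoring row order."""
    return Counter(map(_row_key, actual)) == Counter(map(_row_key, expected))
```

Comparing multisets rather than sets keeps duplicate rows significant, which matters for imputed columns where many rows share the same value.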

…st-2 (ray-project#58437) ## Description The `image_embedding_from_jsonl` release tests are typically run in us-west-2, but they read data from a bucket in a different region. This is problematic because it's expensive and unrealistic (users often read data in the same region), and has a noticeable impact on read speeds. To address this issue, this PR updates the release test to read from a bucket in us-west-2. ## Related issues Signed-off-by: Balaji Veeramani <[email protected]>

…58680) for implementing kafka datasource. Signed-off-by: You-Cheng Lin (Owen) <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

…#58512) Signed-off-by: dayshah <[email protected]>


…dule` instances (issue ray-project#56304) (ray-project#58243) ## Description At present, specifying a `model_config` for a MultiAgentRLModule does not actually pass that value to the instantiated module. This is due to an oversight in `get_multi_rl_module_spec`, which fails to add `model_config` as an argument when passing the rest of the attributes forward for instantiation. With this change, the issue is resolved. ## Related issues Fixes/Closes ray-project#56304. ## Additional information Credit to @arazthexd. --------- Signed-off-by: Matthew <[email protected]> Signed-off-by: MatthewCWeston <[email protected]>

…ray-project#58397) ## Description While improving `SingleEnvRunner.make_env`, I found that some of the tests could be flaky. This PR improves the testing, in particular for `sample`, so that the tests don't fail occasionally, and updates the documentation to reflect this. The primary source of flakiness I found is that `sample(num_timesteps=X)` will not always return exactly `X` timesteps; rather, it returns at least `X` timesteps, and up to the number of environments more. I've updated the documentation to clarify this for users. In addition, I've added tests for the case where neither the number of timesteps nor the number of episodes is given, and for the `force_reset` argument. --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: simonsays1980 <[email protected]>
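
One way to see why the returned count can overshoot is a small arithmetic model. It assumes that all sub-environments are stepped in lockstep and that sampling stops as soon as the total reaches the request; `lockstep_timesteps` is a hypothetical function for illustration, not RLlib code.

```python
def lockstep_timesteps(num_timesteps: int, num_envs: int) -> int:
    """Total timesteps collected when num_envs environments step together."""
    total = 0
    while total < num_timesteps:
        # One vectorized step yields one timestep per sub-environment.
        total += num_envs
    return total
```

Under this model the result always lies between `num_timesteps` and `num_timesteps + num_envs - 1`, which is why a test asserting an exact count can fail intermittently.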

so that we are not tied to the python version coming with buildkite images Signed-off-by: Lonnie Liu <[email protected]>

not used anywhere any more Signed-off-by: Lonnie Liu <[email protected]>

they are failing right now Signed-off-by: Lonnie Liu <[email protected]>

## Description This PR adds caching for PyArrow schema operations to improve performance during batching operations, especially for tables with a large number of columns. ### Main Changes - **Caching for Tensor Type Serialization/Deserialization**: Added cache for tensor type serialization and deserialization operations. This significantly reduces overhead for frequently accessed tensor types during schema operations. ### Performance Impact This optimization is particularly beneficial during batching operations for tables with a large number of columns. In one of our tests with 200 columns, the batching time per batch decreased from **0.30s to 0.11s** (~63% improvement). #### Without cache: <img width="1719" height="464" alt="Screenshot 2025-11-13 at 9 49 33 PM" src="https://github.com/user-attachments/assets/46122634-dd09-40ed-a2a8-725d14f85728" /> We can see `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` in different places. Each time `__arrow_ext_deserialize__` will create a new object and `__arrow_ext_serialize__` includes expensive pickle. #### With cache <img width="1717" height="476" alt="Screenshot 2025-11-13 at 9 41 15 PM" src="https://github.com/user-attachments/assets/50e77253-d69d-40d9-9e1f-56e9341bc131" /> The time on `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` is not a bottleneck anymore. --------- Signed-off-by: xgui <[email protected]> Signed-off-by: Xinyuan <[email protected]>
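
The caching idea can be sketched with `functools.lru_cache`. The function names, the `(shape, dtype)` key, and the cache size are assumptions chosen for the example; the actual change caches Ray's Arrow tensor extension types and may be structured differently.

```python
import pickle
from functools import lru_cache
from typing import Tuple


@lru_cache(maxsize=1024)
def serialize_tensor_type(shape: Tuple[int, ...], dtype: str) -> bytes:
    """Pickle the type metadata once per distinct (shape, dtype)."""
    return pickle.dumps({"shape": shape, "dtype": dtype})


@lru_cache(maxsize=1024)
def deserialize_tensor_type(blob: bytes) -> Tuple[Tuple[int, ...], str]:
    """Rebuild the type metadata once per distinct serialized payload."""
    meta = pickle.loads(blob)
    return meta["shape"], meta["dtype"]
```

With many columns sharing a handful of tensor types, repeated schema operations hit the cache instead of pickling or constructing new objects each time.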


) ## Description Using composite observation and action spaces for the RLlib Offline implementation. Below is an example script for testing:

```python
import shutil
from pathlib import Path
from typing import Any

import gymnasium as gym
import msgpack
import msgpack_numpy as mnp
import numpy as np

import ray
from ray.rllib.algorithms import BCConfig
from ray.rllib.connectors.common.flatten_observations import \
    FlattenObservations
from ray.rllib.env.single_agent_episode import SingleAgentEpisode


class SampleEnv(gym.Env):
    metadata = {'render.modes': ['human', 'logger']}

    def __init__(self, config: dict[str, Any] = None):
        super(SampleEnv, self).__init__()
        # Nested (Dict) observation space.
        self.observation_space = gym.spaces.Dict({
            "deployable_area": gym.spaces.Box(
                low=0, high=1, shape=(25, 25), dtype=np.uint8
            ),
            "available_units": gym.spaces.Box(
                low=0.0, high=1.0, shape=(35,), dtype=np.float32
            ),
            "available_abilities": gym.spaces.MultiBinary(n=6),
        })
        # Nested (Dict of Dict) action space.
        self.action_space = gym.spaces.Dict({
            "deploy": gym.spaces.Dict({
                "unit_index": gym.spaces.Discrete(35),
                "yx": gym.spaces.Discrete(625),
            }),
            "ability": gym.spaces.Discrete(6),
        })
        self.current_step = 0

    def reset(self, seed=None, options=None):
        # Forward the seed so the environment's RNG is seeded.
        super().reset(seed=seed)
        self.current_step = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self.current_step += 1
        obs = self.observation_space.sample()
        reward = np.random.rand(1)[0]
        terminate = self.current_step == 2
        return obs, reward, terminate, False, {}


class ExamplePreLearner:
    @staticmethod
    def _gen_episode() -> SingleAgentEpisode:
        env = SampleEnv({})
        ep = SingleAgentEpisode(
            observation_space=env.observation_space,
            action_space=env.action_space,
        )
        obs, info = env.reset()
        ep.add_env_reset(obs, info)
        term = False
        while not term:
            action = env.action_space.sample()
            obs, reward, term, trunc, info = env.step(action)
            ep.add_env_step(obs, action, reward, info, terminated=term, truncated=trunc)
        env.close()
        ep.validate()
        return ep


if __name__ == "__main__":
    pl = ExamplePreLearner()
    episodes = [pl._gen_episode() for _ in range(10)]
    packed_episodes = [
        msgpack.packb(eps.get_state(), default=mnp.encode) for eps in episodes
    ]
    samples_ds = ray.data.from_items(packed_episodes)

    path = "tmp/supercell_offline_learning"
    # Ignore a missing directory so the first run does not fail.
    shutil.rmtree(path, ignore_errors=True)
    samples_ds.write_parquet(path)

    # Read the episodes and decode them.
    read_sample_ds = ray.data.read_parquet(path)
    batch = read_sample_ds.take_batch(10)
    read_episodes = [
        SingleAgentEpisode.from_state(msgpack.unpackb(state, object_hook=mnp.decode))
        for state in batch["item"]
    ]
    assert len(episodes) == len(read_episodes)
    for eps, read_eps in zip(episodes, read_episodes):
        assert len(eps) == len(read_eps)
        assert eps.observations == read_eps.observations

    env = SampleEnv({})
    config = (
        BCConfig()
        .environment(
            observation_space=env.observation_space,
            action_space=env.action_space,
        )
        .offline_data(
            input_=[Path(path).as_posix()],
            dataset_num_iters_per_learner=1,
            input_read_episodes=True,
            input_read_batch_size=1
        )
        .learners(num_learners=0)
        .training(learner_connector=FlattenObservations)
    )
    algo = config.build()
    print("training")
    algo.train()
```

## Related issues ray-project#57794, ray-project#50340 and internal report by Supercell --------- Signed-off-by: Mark Towers <[email protected]> Signed-off-by: simonsays1980 <[email protected]> Co-authored-by: Mark Towers <[email protected]> Co-authored-by: simonsays1980 <[email protected]>

…_env` (ray-project#58410) ## Description Allow users to use environments that are already vectorized with `SingleAgentEnvRunner`. With `gymnasium.make_vec`, users have the option to either use the `SyncVectorEnv` to vectorize a base environment or to directly create a vector environment using the `vectorize_mode: gymnasium.VectorizeMode`. This PR utilises the `env_runners(gym_env_vectorize_mode=...)` argument to support `VectorizeMode.VECTOR_ENTRY_POINT`:

```
import gymnasium as gym

config = ...
config.env_runners(
    gym_env_vectorize_mode=gym.VectorizeMode.VECTOR_ENTRY_POINT,
)
```

An important change related to this PR is that the accepted values for the vectorize mode are now either the enum (`VectorizeMode.ASYNC`, etc.) or the enum values (`"async"`, etc.). Previously, the accepted string was the enum name (`"ASYNC"`) rather than the enum value. ## Related issues Completion of ray-project#57643 ## Additional information ray-project#58397 must be merged first. We should apply a similar change to `MultiAgentEnvRunner.make_env`. --------- Signed-off-by: Mark Towers <[email protected]> Co-authored-by: Mark Towers <[email protected]>
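
The difference between an enum name and an enum value can be shown with a stand-in enum. The `VectorizeMode` class below is defined locally for illustration (the real one lives in gymnasium; the `"sync"` and `"vector_entry_point"` values are assumed here), and `normalize_vectorize_mode` is a hypothetical helper, not RLlib's code.

```python
from enum import Enum
from typing import Union


class VectorizeMode(Enum):
    """Local stand-in for gymnasium's enum, used only for this example."""

    ASYNC = "async"
    SYNC = "sync"
    VECTOR_ENTRY_POINT = "vector_entry_point"


def normalize_vectorize_mode(mode: Union[VectorizeMode, str]) -> VectorizeMode:
    """Accept the enum itself or its value string; reject the enum name."""
    if isinstance(mode, VectorizeMode):
        return mode
    # Enum(value) looks up by value, so "async" works and "ASYNC" raises ValueError.
    return VectorizeMode(mode)
```

Configs that previously passed the name string (for example `"ASYNC"`) would need to switch to the value string or the enum member.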

using certificate saved in aws secret store now Signed-off-by: Lonnie Liu <[email protected]>

## Description ray-project#58234 increased the scale of the `map_groups` release test from SF 10 to SF 100. Since then, the release test has been consistently failing (see ray-project#58312). To avoid a perpetually broken release test, this PR reverts the scale to SF 10 while we investigate and fix the scalability issue. ## Related issues ray-project#58312 Signed-off-by: Balaji Veeramani <[email protected]>

…ect#58699) ## Description Previously, we would try to slice the block when `self._total_pending_rows >= self._target_num_rows` or when `flush_remaining` was True. However, `flush_remaining` being True doesn't imply `self._total_pending_rows >= self._target_num_rows`, so the slicing could fail, because the slicing logic assumes there is at least one full block. This PR fixes the logic and adds a test for this case. --------- Signed-off-by: You-Cheng Lin <[email protected]>
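
The corrected control flow can be modeled with a toy batcher. `RowBatcher` and its list-of-ints rows are hypothetical simplifications; only the separation between "slice full blocks" and "flush the remainder" reflects the fix described above.

```python
from typing import List


class RowBatcher:
    """Toy model of a batcher that emits blocks of target_num_rows rows."""

    def __init__(self, target_num_rows: int) -> None:
        self._target_num_rows = target_num_rows
        self._pending: List[int] = []

    def add(self, rows: List[int]) -> None:
        self._pending.extend(rows)

    def next_blocks(self, flush_remaining: bool = False) -> List[List[int]]:
        blocks: List[List[int]] = []
        # Slice only while at least one full block is pending.
        while len(self._pending) >= self._target_num_rows:
            blocks.append(self._pending[: self._target_num_rows])
            self._pending = self._pending[self._target_num_rows:]
        # Flushing emits whatever is left without assuming a full block exists.
        if flush_remaining and self._pending:
            blocks.append(self._pending)
            self._pending = []
        return blocks
```

Calling `next_blocks(flush_remaining=True)` with fewer than `target_num_rows` pending rows returns a single short block instead of entering the slicing path.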

…bles (ray-project#58241) Signed-off-by: joshlee <[email protected]> Co-authored-by: Dhyey Shah <[email protected]>

Getting rid of the excessive `while True` loops & timeouts in the tests (we already wait for the dashboard to be up). Also just cleaned up some comments and naming while I was poking around. --------- Signed-off-by: Edward Oakes <[email protected]>

sourcery-ai

The pull request #680 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5400.

gemini-code-assist · 2025-11-18T03:14:48Z

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive set of changes across the Ray repository, focusing on improving build configurations, dependency management, CI infrastructure, and code quality. The changes aim to enhance the stability, reproducibility, and maintainability of the Ray project.

Highlights

Bazel Configuration: Updates to .bazelrc to enable strict action environment, set compiler options for Windows, ignore warnings for third-party code, and adjust settings for CI and macOS.
Buildkite Configuration: Significant modifications to .buildkite/ configurations, including restructuring of _forge.rayci.yml, adding new _images.rayci.yml and dependencies.rayci.yml, and updating various build and test pipelines.
Dependency Management: Changes to dependency management, including updates to requirements.txt files, compilation of pip dependencies, and modifications to dependency-related scripts.
CI Infrastructure: Updates to CI infrastructure, including changes to Dockerfiles, build scripts, and test configurations.
Code Formatting and Linting: Updates to code formatting and linting, including changes to .pre-commit-config.yaml and related scripts.
Documentation Build: Updates to documentation build process, including changes to .readthedocs.yaml and related scripts.

Ignored Files

Ignored by pattern: .gemini/** (1)
- .gemini/config.yaml
Ignored by pattern: .github/workflows/** (1)
- .github/workflows/stale_pull_request.yaml


gemini-code-assist

Code Review

This pull request is a large automated merge that introduces a significant refactoring of the CI/CD and build systems. Key improvements include modularizing Buildkite pipelines and Bazel builds, standardizing artifact packaging with rules_pkg, and switching from miniconda to miniforge. The changes also drop support for x86_64 macOS and Python 3.9 in many areas, while enhancing linting with more pre-commit hooks and modernizing the C++ codebase. Overall, these changes significantly improve the maintainability, standardization, and efficiency of the build process. I've found one potential issue related to ensuring a clean build environment on macOS.

elliot-barn and others added 30 commits October 29, 2025 16:13

[LLM] Correct NIXL in docs (ray-project#58258)

3cd8202

The full name was probably hallucinated by an LLM. Signed-off-by: Rui Qiao <[email protected]>

[Data] Add logging to limit pushdown when map `min_rows > limit_op._limit` (ray-project#58303)

fa69de8

Related issues: Fix comment ray-project#58264 (comment). Signed-off-by: You-Cheng Lin <[email protected]>

[Docs] Update RayJob documentation to introduce the New DeletionStrategy (ray-project#58306)

3fcb2a2

Signed-off-by: wei-chenglai <[email protected]>

[core]move python_callbacks to common (ray-project#57909)

d7b6f1e

[release][ci] upgrading tune fault python 3.10 (ray-project#58224)

c07510a

Upgrading tune fault release test to 3.10. Successful release test run: https://buildkite.com/ray-project/release/builds/65673# Signed-off-by: elliot-barn <[email protected]>

[release][ci] updating jobs tests to run on 3.10 (ray-project#58248)

d1bcf97

Upgrading jobs tests to run on Python 3.10. Successful release tests: https://buildkite.com/ray-project/release/builds/65845 Signed-off-by: elliot-barn <[email protected]>

[ci][release] removing byod compile jobs (ray-project#58318)

dc5bcd4

Removing byod compile jobs for release test images. Now using raydepsets to generate lock files. Signed-off-by: elliot-barn <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

[serve][llm] Add TP*PP spacing to port offset for multi-replica deployments (ray-project#58073)

71a2f40

Signed-off-by: Nikhil Ghosh <[email protected]>

[serve][llm] Fix: Allow unregistered KV connectors without setup requirements (ray-project#58323)

b705c21

Signed-off-by: Kourosh Hakhamaneshi <[email protected]>

[data][llm] Add single-node Ray Data LLM perf baseline benchmark + regression guard (ray-project#58289)

267242d

Signed-off-by: Nikhil Ghosh <[email protected]> Signed-off-by: Nikhil G <[email protected]>

soffer-anyscale and others added 22 commits November 16, 2025 21:36

[scripts] ban click 8.3.* (ray-project#58677)

c446399

click 8.3.* uses `enum.Enum` values that are not deepcopy-able. Signed-off-by: Lonnie Liu <[email protected]>
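
As a rough illustration of what banning the 8.3 series means, the sketch below checks whether a version string falls in that range. The helper name `is_banned_click_version` is hypothetical and is not taken from the Ray repository; the actual change may express the ban differently, for example as a requirements specifier such as `click!=8.3.*`. The sketch assumes plain dotted version strings and does not handle pre-release forms like `8.3rc1`.

```python
def is_banned_click_version(version: str) -> bool:
    """Return True if `version` belongs to the 8.3 release series.

    Hypothetical helper for illustration only; it handles plain dotted
    versions ("8.3", "8.3.0", "8.3.1") and nothing more elaborate.
    """
    # Compare only the major and minor components.
    major_minor = version.strip().split(".")[:2]
    return major_minor == ["8", "3"]


if __name__ == "__main__":
    for candidate in ("8.2.1", "8.3.0", "8.3.1", "8.30.0"):
        status = "banned" if is_banned_click_version(candidate) else "allowed"
        print(f"click {candidate}: {status}")
```

Comparing components as strings rather than by prefix keeps a version such as `8.30.0` from being mistaken for part of the 8.3 series.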

[release test] edit shell script styling in run_release_test.sh (ray-project#58670)

fa8bc74

use double brackets as much as possible Signed-off-by: Lonnie Liu <[email protected]>

[CI] Add kafka-python&testcontainers[kafka] to deps (ray-project#58680)

79d2a69

for implementing kafka datasource. Signed-off-by: You-Cheng Lin (Owen) <[email protected]> Co-authored-by: Lonnie Liu <[email protected]>

[core] Adding option for in flight rpc failure injection (ray-project#58512)

fbf3c32

Signed-off-by: dayshah <[email protected]>

[release test] use uv python to run launcher (ray-project#58664)

c73fc72

so that we are not tied to the python version coming with buildkite images Signed-off-by: Lonnie Liu <[email protected]>

[release test] remove remote_task.py (ray-project#58691)

c751925

not used anywhere any more Signed-off-by: Lonnie Liu <[email protected]>

[data] disable tests against pyarrow nightly (ray-project#58704)

55eed4f

they are failing right now Signed-off-by: Lonnie Liu <[email protected]>

[release test] fix azure credential loading (ray-project#58684)

78082d6

using certificate saved in aws secret store now Signed-off-by: Lonnie Liu <[email protected]>

[core] Add release test to simulate network transient error via ip tables (ray-project#58241)

3ff6ed3

Signed-off-by: joshlee <[email protected]> Co-authored-by: Dhyey Shah <[email protected]>

antfin-oss requested review from SongGuyang and kfstorm as code owners November 18, 2025 02:56

antfin-oss added auto-generated daily-merge labels Nov 18, 2025

antfin-oss assigned ffbin Nov 18, 2025

sourcery-ai bot reviewed Nov 18, 2025

gemini-code-assist bot reviewed Nov 18, 2025

82 participants
