
Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-11-25
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

Daraan and others added 30 commits November 7, 2025 20:10
…y-project#58217)

Change the unit of `scheduler_placement_time` from seconds to
milliseconds. The current buckets span 0.1 s to 2.5 hours, which doesn't
make sense for this metric. According to a sample of data, the range we
are interested in runs from microseconds to seconds. Thanks @ZacAttack
for pointing this out.

```
Note: This is an internal (non–public-facing) metric, so we only need to update its usage within Ray (e.g., the dashboard). A simple code change should suffice.
```
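The unit change amounts to rescaling the histogram's bucket boundaries. A minimal sketch, with illustrative bucket values rather than Ray's actual configuration:

```python
# Hypothetical helper illustrating the seconds -> milliseconds change;
# the bucket boundaries below are examples, not Ray's actual buckets.
SECONDS_TO_MS = 1000

def to_millisecond_buckets(bucket_boundaries_s):
    """Convert second-based histogram bucket boundaries to milliseconds."""
    return [b * SECONDS_TO_MS for b in bucket_boundaries_s]

# A microseconds-to-seconds range, expressed in milliseconds:
print(to_millisecond_buckets([0.001, 0.01, 0.1, 1.0]))
# [1.0, 10.0, 100.0, 1000.0]
```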

<img width="1609" height="421"
alt="505491038-c5d81017-b86c-406f-acf4-614560752062"
src="https://github.com/user-attachments/assets/cc647b97-42ec-42eb-bf01-4d1867940207"
/>

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
…s in the Raylet (ray-project#58342)

Found it very hard to parse what was happening here, so helping future
me (or you!).

Also:

- Deleted vestigial `next_resource_seq_no_`.
- Converted the version number for commands from a non-monotonic clock to
a monotonically incremented `uint64_t`.
- Added logs when we drop messages with stale versions.
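The clock-to-counter switch can be sketched as follows; this is a hypothetical Python model of the C++ change, with illustrative names:

```python
import itertools

class CommandVersioner:
    """Sketch: a monotonically incremented counter replaces a wall-clock
    timestamp as the command version, so stale commands are detected
    reliably even if the system clock steps backwards."""

    def __init__(self):
        self._next_version = itertools.count(1)
        self._last_applied = 0

    def stamp(self):
        """Assign the next version number to an outgoing command."""
        return next(self._next_version)

    def apply(self, version, command):
        """Apply a command, dropping (and logging) stale versions."""
        if version <= self._last_applied:
            print(f"dropping stale command (version={version}, "
                  f"last_applied={self._last_applied})")
            return False
        self._last_applied = version
        command()
        return True
```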

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description
There was a typo

## Related issues
N/A

## Additional information
N/A

Signed-off-by: Daniel Shin <[email protected]>
be consistent with the CI base env specified in `--build-name`

Signed-off-by: Lonnie Liu <[email protected]>
getting ready to run things on python 3.10

Signed-off-by: Lonnie Liu <[email protected]>
…tion on a single node (ray-project#58456)


## Description

Currently, finalization is scheduled in batches sequentially, i.e., a
batch of N adjacent partitions is finalized at once (in a sliding
window).

This creates a lensing effect since:

1. Adjacent partitions i and i+1 get scheduled onto adjacent aggregators
j and j+1 (since membership is determined as j = i % num_aggregators)
2. Adjacent aggregators have a high likelihood of being scheduled on the
same node (due to similarly being scheduled at about the same time in
sequence)

To address this, the change applies random sampling when choosing the
next partitions to finalize, so that partitions are chosen uniformly,
reducing concurrent finalization of adjacent partitions.
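A minimal sketch of the sampling idea, assuming a hypothetical scheduler loop that draws the next partitions from a pending set:

```python
import random

def choose_partitions_to_finalize(pending, max_concurrent, rng=random):
    """Hypothetical sketch: sample the next partitions to finalize
    uniformly at random instead of taking a sliding window of adjacent
    ids, so adjacent partitions (which map to adjacent aggregators via
    j = i % num_aggregators) rarely finalize concurrently."""
    k = min(max_concurrent, len(pending))
    # sorted() gives the sampler a deterministic population to draw from.
    return rng.sample(sorted(pending), k)
```

With, say, 100 pending partitions and a budget of 8, the chosen ids are spread across the id space instead of clustering into one window.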

## Related issues

## Additional information

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
## Description

Makes the NotifyGCSRestart RPC fault tolerant and idempotent. There were
multiple places in the gcs_subscriber that always returned Status::OK(),
making idempotency harder to reason about, and there was dead code in one
of the resubscribe paths, so this also includes a minor cleanup. Added a
Python integration test to verify retry behavior; a C++ test was left
out since on the raylet side there is nothing to test beyond making a
gcs_client RPC call.

---------

Signed-off-by: joshlee <[email protected]>
…ct#58445)

## Summary
Creates a dedicated `tests/unit/` directory for unit tests that don't
require Ray runtime or external dependencies.

## Changes
- Created `tests/unit/` directory structure
- Moved 13 pure unit tests to `tests/unit/`
- Added `conftest.py` with fixtures to prevent `ray.init()` and
`time.sleep()`
- Added `README.md` documenting unit test requirements
- Updated `BUILD.bazel` to run unit tests with "small" size tag

## Test Files Moved
1. test_arrow_type_conversion.py
2. test_block.py
3. test_block_boundaries.py
4. test_data_batch_conversion.py
5. test_datatype.py
6. test_deduping_schema.py
7. test_expression_evaluator.py
8. test_expressions.py
9. test_filename_provider.py
10. test_logical_plan.py
11. test_object_extension.py
12. test_path_util.py
13. test_ruleset.py

These tests are fast (<1s each), isolated (no Ray runtime), and
deterministic (no time.sleep or randomness).

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Description


### [Data] Concurrency Cap Backpressure tuning
- Maintain an asymmetric EWMA of total queued bytes (this op +
downstream) as the typical level: `level`.
- Maintain an asymmetric EWMA of the absolute residual vs. the previous
level as a scale proxy: `dev = EWMA(|q - level_prev|)`.
- Define the deadband `[lower, upper] = [level - K_DEV * dev, level +
K_DEV * dev]`:
  - If `q > upper` -> target cap = running - BACKOFF_FACTOR (back off)
  - If `q < lower` -> target cap = running + RAMPUP_FACTOR (ramp up)
  - Else -> target cap = running (hold)
- Clamp to `[1, configured_cap]`; admit iff `running < target cap`.
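The steps above can be sketched as a small controller; the constants and smoothing factors below are illustrative, not Ray Data's actual values:

```python
class ConcurrencyCapController:
    """Sketch of the deadband controller described above."""

    K_DEV = 2.0          # deadband width in units of dev
    BACKOFF_FACTOR = 1   # how much to shrink the cap on a spike
    RAMPUP_FACTOR = 1    # how much to grow the cap when queues drain

    def __init__(self, configured_cap, alpha_up=0.3, alpha_down=0.05):
        self.configured_cap = configured_cap
        self.alpha_up = alpha_up      # fast to track increases
        self.alpha_down = alpha_down  # slow to forget (asymmetric EWMA)
        self.level = 0.0
        self.dev = 0.0

    def _ewma(self, prev, sample):
        # Asymmetric: a different smoothing factor up vs. down.
        alpha = self.alpha_up if sample > prev else self.alpha_down
        return prev + alpha * (sample - prev)

    def target_cap(self, queued_bytes, running):
        residual = abs(queued_bytes - self.level)  # |q - level_prev|
        self.level = self._ewma(self.level, queued_bytes)
        self.dev = self._ewma(self.dev, residual)
        lower = self.level - self.K_DEV * self.dev
        upper = self.level + self.K_DEV * self.dev
        if queued_bytes > upper:
            target = running - self.BACKOFF_FACTOR  # back off
        elif queued_bytes < lower:
            target = running + self.RAMPUP_FACTOR   # ramp up
        else:
            target = running                        # hold
        return max(1, min(self.configured_cap, target))
```

An operator would then admit a new task iff `running < target_cap(...)`.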

## Related issues

## Additional information

---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… in read-only mode (ray-project#58460)

This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).

## Description

Autoscaler v2 fails to report prometheus metrics when operating in
read-only mode on KubeRay with the following KeyError error:

```
2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types. That is
correct for local Ray (where `RAY_NODE_TYPE_NAME` is not set), but
incorrect for KubeRay, where `RAY_NODE_TYPE_NAME` is set and
`ray_node_type_name` is expected.

As a result, in read-only mode the scheduler sees a node type name (e.g.
small-group) that never exists in the populated configs.

This PR fixes the issue by using `ray_node_type_name` when it exists,
and only falling back to node ID when it does not.
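The fix reduces to a fallback when building node type configs. A hypothetical sketch, where `node` is an object carrying the two relevant fields:

```python
def node_type_name(node):
    """Sketch of the fix: prefer the explicit ray_node_type_name (set on
    KubeRay via RAY_NODE_TYPE_NAME) and fall back to the node ID only
    when it is absent, as on local Ray. `node` and its attributes are
    illustrative, not the actual reconciler types."""
    return getattr(node, "ray_node_type_name", None) or node.node_id
```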
## Related issues
Fixes ray-project#58227

Signed-off-by: Rueian <[email protected]>
…cess: bool (ray-project#58384)

## Description
Pass in `status_code` directly into `do_reply`. This is a follow up to
ray-project#58255
## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>
ray-project#58395)

## Description

There are places in the Python code where we use the raw grpc library to
make gRPC calls (e.g., pub-sub, some calls to the GCS). In the long term
we want to fully deprecate raw grpc library usage in our Python code
base, but as that will take more effort and testing, in this PR I am
introducing an interceptor that adds auth headers (this will take effect
for all gRPC calls made from Python).
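Conceptually, the interceptor injects an auth header into every call's metadata. A minimal sketch of just the header-injection step; the header name and bearer format here are assumptions, not Ray's exact wire format:

```python
import os

def with_auth_metadata(metadata, token_env="RAY_AUTH_TOKEN"):
    """Sketch of what an auth interceptor would add to each call: the
    existing call metadata plus an authorization header built from the
    token in the environment. Header name/format are illustrative."""
    token = os.environ.get(token_env)
    metadata = list(metadata or [])
    if token:
        metadata.append(("authorization", f"Bearer {token}"))
    return metadata
```

In a real gRPC client interceptor this function would be applied to the call details of every outgoing RPC before invoking the continuation.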

## Testing

### case 1: submitting a job using CLI
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```

output
``` 
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_1EV8q86uKM24nHmH
  Query the status of the job:
    ray job status raysubmit_1EV8q86uKM24nHmH
  Request the job to be stopped:
    ray job stop raysubmit_1EV8q86uKM24nHmH

Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```


### case 2: submitting a job with actors and tasks and verifying on
dashboard
test.py
```python
import time
import ray
from ray._raylet import Config

ray.init()

@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)

@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())

for i in range(100):
    ray.get(print_hi.remote())
    time.sleep(20)

ray.shutdown()
```

``` 
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```
### dashboard screenshots:

#### prompts the user to input the token
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
/>

### on passing the right token:
overview page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
/>

job page: tasks are listed
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
/>

task page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
/>

actors page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
/>

specific actor page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
/>

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
Currently, deprecation warnings are sometimes not informative enough.
When a warning is triggered it does not tell us *where* the deprecated
feature is used. For example, Ray internally raises a deprecation
warning when an `RLModuleConfig` is initialized.

```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

This is confusing: where did *I* use a config, and what am I doing wrong?
This raises issues like:
https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

Tracing where the warning actually originates is tedious: is it my code
or internal? The output just shows `deprecation.py:50`. Not helpful.

This PR adds a stacklevel option, with stacklevel=2 as the default, to
all `deprecation_warning`s, so devs and users can better see where the
deprecated option is actually used.
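A minimal model of the stacklevel change; this is a hypothetical helper mirroring the idea, not RLlib's exact `deprecation_warning` signature:

```python
import warnings

def deprecation_warning(old, new=None, stacklevel=2):
    """Sketch: with stacklevel=2 the warning is attributed to the
    caller's file:line rather than to this helper's own line."""
    msg = f"`{old}` has been deprecated."
    if new:
        msg += f" Use `{new}` instead."
    warnings.warn(msg, DeprecationWarning, stacklevel=stacklevel)

def user_code():
    # The reported location is this call site, not deprecation_warning().
    deprecation_warning("RLModuleConfig", "RLModule(...)")
```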

---

EDIT:

**Before**

```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` 
```

**After**: the module.py:line where the deprecated artifact is used is
shown in the log output:

When building an Algorithm:
```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```

Signed-off-by: Daraan <[email protected]>
Fix some Vale errors and suggestions in the kai-scheduler document.

See ray-project#58161 (comment)

Signed-off-by: fscnick <[email protected]>

Our Serve ingress keeps running into the error below, related to
`uvloop`, under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has a
[PR](MagicStack/uvloop#646) to fix it, but it seems
no one is working on it.

One workaround mentioned in the
([PR](MagicStack/uvloop#646 (comment)))
is to just turn off uvloop.
We tried it in our environment and didn't see any major performance
difference. Hence, as part of this PR, we define a new environment
variable for controlling uvloop.

Signed-off-by: jugalshah291 <[email protected]>
- The `test_target_capacity` Windows test is failing, possibly because of
the short 10-second timeout we have put up; increasing it to verify
whether the timeout is the issue.

Signed-off-by: harshit <[email protected]>
so that they are not called lints any more

Signed-off-by: Lonnie Liu <[email protected]>
…ject#57233)

Update remaining multimodal release tests to use new depsets.
…y-project#58441)

## Description
Currently, we clear _external_ queues when an operator is manually
marked as finished, but we don't clear their _internal_ queues. This PR
fixes that.
## Related issues
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>
be consistent with doc build environment

Signed-off-by: Lonnie Liu <[email protected]>
migrating all doc related things to run on python 3.12

Signed-off-by: Lonnie Liu <[email protected]>
excluding `*_tests` directories for now to reduce the impact

Signed-off-by: Lonnie Liu <[email protected]>
using `bazelisk run //java:gen_ray_java_pkg` everywhere

Signed-off-by: Lonnie Liu <[email protected]>
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker as well as keeping track of their states. States are
defined as:

- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)

The approach used by these new metrics is to examine the state 'before
and after' any mutation of the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
some values and incrementing others). Admittedly, there is potential for
the metrics to be read in between the decrements and increments,
depending on when the RecordMetrics loop runs. This unfortunate side
effect, however, seems preferable to doing mutual exclusion with metric
collection, as this is potentially a high-throughput code path.

In addition, performing live counts seemed preferable to doing a full
accounting of the object store and across all references at the time of
metric collection. The reason being that the reference counter is
potentially tracking millions of objects, and each metric scan could be
very expensive. So running live accounting (despite it being potentially
inaccurate for short periods) seemed the right call.

This PR also allows for object size to change due to potentially
non-deterministic instantiation (say an object is initially created, but
its primary copy dies, and then the recreation fails). This is an edge
case, but seems important for completeness' sake.

---------

Signed-off-by: zac <[email protected]>
to 0.21.0; supports wanda priority now.

Signed-off-by: Lonnie Liu <[email protected]>
elliot-barn and others added 22 commits November 21, 2025 16:55
ray-project#58911)

This reverts commit 3663299.

````

[2025-11-22T01:29:32Z]   File "/rayci/python/ray/dashboard/modules/metrics/dashboards/data_dashboard_panels.py", line 1451, in <module>
--
[2025-11-22T01:29:32Z]     assert len(all_panel_ids) == len(
[2025-11-22T01:29:32Z] AssertionError: Duplicated id found. Use unique id for each panel. [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 43, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 70, 71, 72, 73, 74, 75, 76, 77, 78, 78, 79, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108]


````

Co-authored-by: Lonnie Liu <[email protected]>
…#58872)

- Removing a flag from raydepsets: `--index-url https://pypi.org/simple`
(included by default:
https://docs.astral.sh/uv/reference/cli/#uv-pip-compile--default-index)
- Adding a flag to raydepsets: `--no-header`
- Updating unit tests

This will prevent all lock files from updating when default or
config-level flags are updated.

---------

Signed-off-by: elliot-barn <[email protected]>
…er (ray-project#58739)

## Description
The algorithm config isn't updating `rl_module_spec.model_config` when a
custom one is specified, which means the learner and env-runner configs
diverge. As a result, the env-runner model wasn't being updated.
The reason this problem wasn't detected previously is that when updating
the model state-dict we used `strict=False`.
Therefore, I've added an error check that the missing keys should always
be empty, which will detect when the env-runner is missing components
from the learner's updated model.

```python
from ray.rllib.algorithms import PPOConfig
from ray.rllib.core.rl_module import RLModuleSpec
from ray.rllib.policy.sample_batch import DEFAULT_POLICY_ID


config = (
    PPOConfig()
    .environment('CartPole-v1')
    .env_runners(
        num_env_runners=0,
        num_envs_per_env_runner=1,
    )
    .rl_module(
        rl_module_spec=RLModuleSpec(
            model_config={
                "head_fcnet_hiddens": (32,), # This used to cause encoder.config.shared mismatch
            }
        )
    )
)

algo = config.build_algo()

learner_module = algo.learner_group._learner._module[DEFAULT_POLICY_ID]
env_runner_modules = algo.env_runner_group.foreach_env_runner(lambda runner: runner.module)

print(f'{learner_module.encoder.config.shared=}')
print(f'{[mod.encoder.config.shared for mod in env_runner_modules]=}')

algo.train()
```

## Related issues
Closes ray-project#58715

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: Hassam Ullah Sheikh <[email protected]>
resolves logical merge conflicts and fix ci test

Signed-off-by: Lonnie Liu <[email protected]>
This test was flaky before; making it more deterministic now by
cancelling only once the outer and inner tasks are both running, and
making sure they're running before launching the many_resources task.

The flaky failure was the ray.get on many_resources returning before the
cancel.

https://buildkite.com/ray-project/postmerge/builds/14463#019a99a6-acf2-4e28-89d3-2abc99eb93ac/609-913

---------

Signed-off-by: dayshah <[email protected]>
…ect#58848)

## Description
Before running `ray get-auth-token`, users first need to set
`AUTH_MODE=token`. This is unnecessary; this PR introduces an
`ignore_auth_mode` flag in the token_loader class to remove the check.
This PR also moves all auth-related tests from `src/ray/rpc/tests` to
`src/ray/rpc/authentication/tests`.

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
## Description
- Add kafka user guide in loading data

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Signed-off-by: You-Cheng Lin <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Adding an external-scaler-enabled flag to the application config, which
dictates whether external scalers are allowed to update the number of
replicas for an application.

this is being done as part of the [custom autoscaling
story](https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46)

---------

Signed-off-by: harshit <[email protected]>
- adding docs for POST API, here are more details:
https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?tab=t.0#heading=h.2vf4s2d7ca46

- also, making changes for the external scaler enabled in the existing
serve application docs

to be merged after ray-project#57554

---------

Signed-off-by: harshit <[email protected]>
Co-authored-by: Cursor Agent <[email protected]>
### [Data] Simplify ArrowBlock and PandasBlock

Simplify inheritance hierarchy for `ArrowBlock` and `PandasBlock` by
removing `TableRow` to improve code maintainability.

Signed-off-by: Srinath Krishnamachari <[email protected]>
…ix ordering assumption (ray-project#58746)

## Description
**Split multi-case test function**
- `test_limit_pushdown_conservative` β†’ 10 separate tests (basic fusion,
limit fusion reversed, multiple limit fusion, maprows, mapbatches,
filter, project, sort, complex interweaved operations, and between two
map operators)

**Fixed ordering assumptions**
- Added `check_ordering=False` to union tests (blocks may interleave)
- Added `check_ordering=False` to project test with
`override_num_blocks` (parallel execution)

## Related issues
Related to ray-project#58655 

## Additional information

---------

Signed-off-by: ryankert01 <[email protected]>
1. This PR adds multihost GPU support for the Ray Train JaxTrainer.
2. Following the Jax [GPU distributed
doc](https://docs.jax.dev/en/latest/multi_process.html#gpu-example): if
`ScalingConfig.use_gpu == True`, we add "cuda" as JAX_PLATFORMS.
3. If cuda is the jax platform, add CUDA_VISIBLE_DEVICES and initialize
jax distributed with
https://docs.jax.dev/en/latest/_autosummary/jax.distributed.initialize.html#jax.distributed.initialize
---------

Signed-off-by: Lehui Liu <[email protected]>
## Description
Ensure the predicate expr appears correctly for the `Filter` logical op.

## Related issues
Closes ray-project#58620 
## Additional information

---------

Signed-off-by: Goutam <[email protected]>
…checkpoint (ray-project#58863)

Fail fast if the user forgets to return a checkpoint in their
`checkpoint_upload_fn`. Forgetting also causes unexpected issues like
`get_all_reported_checkpoints` stalling indefinitely because the counter
is misaligned, which I can also fix in a separate PR.

---------

Signed-off-by: Timothy Seah <[email protected]>
… dashboard attempt 2 (ray-project#58912)

Changed panel id's to avoid merge conflicts

---------

Signed-off-by: Timothy Seah <[email protected]>
…ect#58771)

Previously, `RAY_num_server_call_thread` controlled the gRPC reply
thread pool size for all processes (including CoreWorkers), and its
default value was tied to the number of CPUs, which could oversubscribe
threads in CoreWorkers on large instances. In this PR, we introduce
`RAY_core_worker_num_server_call_thread` to separately control
CoreWorkers, defaulting to `min(2, max(1, num cpu/4))`, and scope
`RAY_num_server_call_thread` to system components (raylet, GCS, etc.)
only.

This keeps per-worker reply pools tiny so we can run many workers on the
same node without oversubscribing threads; the choice of β€œ2” is based on
the microbenchmarks in ray-project#58351.
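Assuming integer division, the stated default can be sketched as:

```python
def core_worker_num_server_call_threads(num_cpus):
    """Sketch of the stated default, min(2, max(1, num_cpus / 4)):
    at least 1 thread, capped at 2 regardless of machine size.
    Integer division is an assumption; the helper name is illustrative."""
    return min(2, max(1, num_cpus // 4))
```

This is why per-worker reply pools stay tiny even on large instances: the formula saturates at 2 once a node has 8 or more CPUs.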

## Related issues
Closes ray-project#58351


## Test

```python
#!/usr/bin/env python3
import os
import subprocess
import sys


def get_thread_count(config_value):
    subprocess.run(["ray", "stop", "-f"], capture_output=True)
    
    env = os.environ.copy()
    if config_value is not None:
        env["RAY_core_worker_num_server_call_thread"] = str(config_value)
    
    test_code = """
import ray
import psutil
import os
import time

@ray.remote
def count_threads():
    return len(psutil.Process(os.getpid()).threads())

ray.init()

# Warm up once to make sure thread pools are instantiated.
ray.get(count_threads.remote())
time.sleep(1)

print(ray.get(count_threads.remote()))
"""
    
    result = subprocess.run(
        [sys.executable, "-c", test_code],
        env=env,
        capture_output=True,
        text=True,
        check=True
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    default_threads = get_thread_count(None)
    with_config_10 = get_thread_count(10)
    subprocess.run(["ray", "stop", "-f"], capture_output=True)
    
    print(f"Default (RAY_core_worker_num_server_call_thread=2): {default_threads} threads")
    print(f"With RAY_core_worker_num_server_call_thread=10: {with_config_10} threads")

```
```shell
#Default (RAY_core_worker_num_server_call_thread=2): 52 threads
#With RAY_core_worker_num_server_call_thread=10: 60 threads
```
By default, this setting creates two threads. After changing it to ten,
we typically observe eight additional threads.
(Because of ray-project#55215, the exact count may differ, but in most cases the
delta is three.)

---------

Signed-off-by: yicheng <[email protected]>
Co-authored-by: yicheng <[email protected]>

Creating core chaos network release tests by adding ip table variations
to the current chaos release tests. Also added a basic chaos release
test for streaming generators and object ref borrowing.

Did a minor refactor by moving each chaos test workload
(tasks/actors/streaming gen/borrowing) into its own Python file so it's
easier to add additional tests in the future rather than maintaining a
huge mono file. Added metrics for total runtime + peak head node memory
usage. Furthermore, removed the baseline run, as it was repeated among
all chaos failure types and should only be run once in total. Hence, for
each workload we now have 4 tests (baseline, EC2 instance killer,
raylet killer, ip table network failure).

Note that for the ip table tests you'll need to add these 4 config
variables:
        - RAY_health_check_period_ms=10000
        - RAY_health_check_timeout_ms=100000
        - RAY_health_check_failure_threshold=10
        - RAY_gcs_rpc_server_connect_timeout_s=60
The top 3 prevent the raylet from failing the GCS health check during
the transient network error, and the last prevents us from being killed
by the GCS client check, whereby if we can't initially connect to the
GCS for 5 seconds upon connection, we die.
 
Also deleted test_chaos.py that's located in python/ray/tests as the
release chaos tests cover similar functionality.

---------

Signed-off-by: joshlee <[email protected]>
## Description
Completing the datetime namespace operations

## Related issues
Related to ray-project#58674 

## Additional information

---------

Signed-off-by: 400Ping <[email protected]>
This PR adds support for per-dataset execution options in `DataConfig`,
allowing users to specify different `ExecutionOptions` for different
datasets. This enables fine-grained control over how each dataset is
processed by Ray Data.

---------

Signed-off-by: xgui <[email protected]>
Signed-off-by: Xinyuan <[email protected]>
Co-authored-by: Justin Yu <[email protected]>
…n find them in the code with search. (ray-project#58943)

Metric names and config variable names are one of the places where being
verbose and duplicating prefixes is fine. You should be able to search
through the code and find them easily.

Signed-off-by: irabbani <[email protected]>

@sourcery-ai sourcery-ai bot left a comment


The pull request #688 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5473.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a broad set of changes from the master branch into main, primarily focusing on modernizing and streamlining the project's build, CI/CD, and dependency management systems. The updates aim to enhance the reliability, efficiency, and maintainability of the development workflow, particularly for Python and C++ components, while also improving code quality checks and documentation practices.

Highlights

  • CI/CD Pipeline Refactoring: The Buildkite CI/CD pipelines have undergone a significant refactoring. This includes moving Docker image definitions to dedicated files like .buildkite/_images.rayci.yml, consolidating documentation-related CI steps into .buildkite/doc.rayci.yml, and introducing a new .buildkite/dependencies.rayci.yml for managing Python dependency compilation. This aims to improve modularity and maintainability of the CI infrastructure.
  • Build System Enhancements (Bazel & uv): The Bazel build system has been extensively updated. This involves refactoring C++ RPC and GCS dependencies into more granular targets, introducing pkg_zip and pkg_files for artifact packaging, and updating Python toolchain management. A new raydepsets CLI tool has been added, leveraging uv for more robust and deterministic Python dependency compilation across various environments and configurations.
  • Python 3.10 as Default/Supported Version: Python 3.10 is now widely adopted as a default or newly supported version across many CI configurations, Docker image builds, and dependency compilation steps. This includes updates in .bazelrc, various .rayci.yml files, and Dockerfiles to reflect the shift.
  • Code Ownership and Linting Updates: The .github/CODEOWNERS file has been simplified for better maintainability, with broader ownership definitions for core components. Pre-commit hooks have been updated to include new tools like semgrep, vale, cython-lint, and eslint, alongside a refactoring of pydoclint for local and CI stages, enhancing code quality and consistency.
  • C++ API and Runtime Adjustments: Several internal C++ API and runtime components have been refined. This includes changes to RemoteFunctionHolder for clarity, adjustments to metric recording, modifications to object store behavior in local mode, and updates to network utility functions for IP address resolution and bootstrap address parsing.
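The uv-based dependency compilation described in the highlights can be sketched as a loop over the supported interpreter versions. This is an illustrative sketch only: the `requirements.in` input and the `locks/` output layout are assumptions, not the actual raydepsets configuration.

```shell
# Sketch only: emit one uv lockfile-compilation command per supported Python.
# Input/output paths here are hypothetical placeholders.
PYTHON_VERSIONS=("3.10" "3.11" "3.12")
for v in "${PYTHON_VERSIONS[@]}"; do
  echo "uv pip compile requirements.in --python-version ${v} -o locks/requirements_py${v}.txt"
done
```

Pinning `--python-version` per target environment is what makes the compiled lockfiles deterministic across interpreters, which matches the "robust and deterministic Python dependency compilation" goal stated above.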
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with πŸ‘ and πŸ‘Ž on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.

@gemini-code-assist bot left a comment


Code Review

This pull request is a large-scale refactoring of the build and continuous integration system. The changes are extensive, touching many configuration files for Bazel and Buildkite, as well as some source code. Key improvements include modularizing CI pipelines, upgrading base images and tools, refactoring Bazel build files for better structure and use of standard rules, and enhancing the linting and dependency management processes. Support for Python 3.9 and macOS x86_64 appears to be reduced or removed in several places. Overall, these changes represent a significant and positive overhaul of the project's development infrastructure, aimed at improving maintainability, reliability, and developer experience. I have one suggestion to improve the clarity of a script.

```shell
echo "Use: x86_64 or arm64" >/dev/stderr
exit 1
fi
PYTHON_VERSIONS=("3.10" "3.11" "3.12")
```


Severity: medium

Python 3.9 is also removed from this list. It would be helpful to add a comment explaining why, similar to the one for Python 3.13. This helps future maintainers understand if it was intentional (e.g., end-of-life) or if it should be re-added later.
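A minimal sketch of the change the reviewer is asking for, with the explanatory comment added. The EOL date is a known fact about CPython 3.9; the 3.13 remark mirrors the existing comment the review alludes to, and is otherwise an assumption about that script:

```shell
# Python 3.9 is intentionally omitted: it reached end-of-life
# (end of security fixes) in October 2025.
# Python 3.13 is not yet included (see the existing comment in this script).
PYTHON_VERSIONS=("3.10" "3.11" "3.12")
```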

