
@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

πŸ“… Created: 2025-11-11
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

bveeramani and others added 30 commits October 22, 2025 00:28
…API (ray-project#57977)

## Summary

This PR updates the document embedding benchmark to use the canonical
Ray Data implementation pattern, following best practices for the
framework.

## Key Changes

### Use `download()` expression instead of separate materialization
**Before:**
```python
file_paths = (
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .take_all()
)
file_paths = [row["uploaded_pdf_path"] for row in file_paths]
ds = ray.data.read_binary_files(file_paths, include_paths=True)
```

**After:**
```python
(
    ray.data.read_parquet(INPUT_PATH)
    .filter(lambda row: row["file_name"].endswith(".pdf"))
    .with_column("bytes", download("uploaded_pdf_path"))
)
```

This change:
- Eliminates the intermediate materialization with `take_all()`, which
loads all data into memory
- Uses the `download()` expression to lazily fetch file contents as part
of the pipeline
- Removes the need for a separate `read_binary_files()` call

### Method chaining for cleaner code
All operations are now chained in a single pipeline, making the data
flow more clear and idiomatic.

### Consistent column naming
Updated references from `path` to `uploaded_pdf_path` throughout the
code for consistency with the source data schema.

Signed-off-by: Balaji Veeramani <[email protected]>
This PR addresses several failing release tests likely due to the recent
Ray Train V2 default enablement.

The following failing release tests are addressed: 
- huggingface_transformers
- distributed_training.regular
- distributed_training.chaos

distributed_training fix:
`distributed_training.regular` and `distributed_training.chaos` were
failing because they relied on the deprecated free-floating metrics
reporting functionality. The tests attempted to access a key in
`result.metrics` that was never reported. The fix uploads a checkpoint to
ensure this key exists.

huggingface_transformers:
The `huggingface_transformers` test was failing due to outdated
accelerate and peft versions. The fix leverages a post-build file to
ensure the proper accelerate and peft versions.

Tests:

Test Name | Before | After
-- | -- | --
huggingface_transformers | https://buildkite.com/ray-project/release/builds/64733#019a0559-25da-401f-8d7e-3128b8f7d287 | https://buildkite.com/ray-project/release/builds/64888#019a090d-5f53-4f7d-b0ac-ac8cf7c529b6
distributed_training.regular | https://buildkite.com/ray-project/release/builds/64733#019a0572-1095-4b6f-b3bc-b496227c9280 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b5-41b6-aaf6-2bbd443a0a1e
distributed_training.chaos | https://buildkite.com/ray-project/release/builds/64733#019a0574-3862-4da2-a264-a9e11333bd72 | https://buildkite.com/ray-project/release/builds/64855#019a08c4-76b6-4344-90f2-cbcd637aae3d

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
`num_waiter == 0` does not necessarily mean that the request has been
completed.

---------

Signed-off-by: abrar <[email protected]>
…es (ray-project#57883)

This PR adds a test to verify that DataOpTask handles node failures
correctly during execution. To enable this testing, callback seams are
added to DataOpTask that allow tests to simulate preemption scenarios by
killing and restarting nodes at specific points during task execution.

## Summary
- Add callback seams (`block_ready_callback` and
`metadata_ready_callback`) to `DataOpTask` for testing purposes
- Add `has_finished` property to track task completion state
- Create `create_stub_streaming_gen` helper function to simplify test
setup
- Refactor existing `DataOpTask` tests to use the new helper function
- Add new parametrized test `test_on_data_ready_with_preemption` to
verify behavior when nodes fail during execution

## Test plan
- Existing tests pass with refactored code
- New preemption test validates that `on_data_ready` handles node
failures correctly by testing both block and metadata callback scenarios
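The callback-seam idea can be illustrated with a stripped-down sketch (the class and callback names here are simplified stand-ins for the actual `DataOpTask` API):

```python
class Task:
    """Minimal stand-in for a task that streams blocks from a generator.

    The optional callback is a test seam: production code never sets it,
    while tests inject one to observe progress or simulate a node
    failure (e.g., by raising) at a precise point during execution.
    """

    def __init__(self, streaming_gen, block_ready_callback=None):
        self._gen = streaming_gen
        self._block_ready_callback = block_ready_callback
        self.has_finished = False

    def on_data_ready(self):
        for block in self._gen:
            if self._block_ready_callback is not None:
                self._block_ready_callback(block)
        self.has_finished = True
```

A test can pass `events.append` as the callback and assert on the recorded blocks, or raise from the callback to emulate preemption mid-stream.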

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Description
1. `mlflow.start_run()` does not have a `tracking_uri` argument:
https://mlflow.org/docs/latest/api_reference/python_api/mlflow.html#mlflow.start_run
2. Rewrite the MLflow setup as follows:
```python
mlflow.set_tracking_uri(uri="file://some_shared_storage_path/mlruns")
mlflow.set_experiment("my_experiment")
mlflow.start_run()
```

## Related issues
N/A

---------

Signed-off-by: Lehui Liu <[email protected]>
…t#57855)


## Description
1. Add visitors for collecting column names from all expressions and for
renaming columns across the expression tree.
2. Use expressions for `rename_columns`, `with_column`, and
`select_columns`, and remove `cols` and `cols_rename` from `Project`.
3. Modify projection pushdown to work correctly with combinations of the
above operators.

## Related issues
Closes ray-project#56878,
ray-project#57700


Signed-off-by: Goutam <[email protected]>
…ject#55291)

Resolves: ray-project#55288 (wrong `np.array` in `TensorType`)

Further changes:
- Changed comments to (semi-)docstrings, which IDEs (e.g., VSCode +
Pylance) will display as tooltips, making that information available to
the user.
- `AgentID: Any -> Hashable`, as it is used for dict keys.
- Changed `DeviceType` to no longer be a `TypeVar` (that makes no sense
in the way it is currently used); it also now includes `DeviceLikeType`
(`int | str | device`) from `torch`. IMO it could fully replace the
current type, but being defensive I only added it as an extra possible
type.
- Used the updated `DeviceType` to improve the type of `Runner._device`
and make it more correct.
- Used torch's own type in `data`; the current code supports more than
just `str`. I refrained from adding a reference to `rllib`, despite it
being nice if they were in sync.
- Some extra formatting that is forced by pre-commit.

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Revamps `rllib.utils.typing` (NDArray-based `TensorType`, broader
`DeviceType`, `AgentID` as `Hashable`, docstring cleanups) and updates
call sites to use optional device typing and improved hints.
> 
> - **Types**:
>   - Overhaul `rllib/utils/typing.py`:
> - `TensorType` now uses `numpy.typing.NDArray`; heavy use of
`TYPE_CHECKING` to avoid runtime deps on torch/tf/jax.
> - `DeviceType` widened to `Union[str, torch.device, int]` (was
`TypeVar`).
> - `AgentID` tightened to `Hashable`; `NetworkType` uses `keras.Model`.
> - Refined aliases (e.g., `FromConfigSpec`, `SpaceStruct`) and added
concise docstrings.
> - **Runners**:
> - `Runner._device` now `Optional` (`Union[DeviceType, None]`) with
updated docstring; same change in offline runners’ `_device` properties.
> - **Connectors**:
> - `NumpyToTensor`: `device` param typed as `Optional[DeviceType]` (via
`TYPE_CHECKING`).
> - **Utils**:
> - `from_config`: typed `config: Optional[FromConfigSpec]` with
`TYPE_CHECKING` import.
> - **Misc**:
>   - Minor formatting/import ordering and comment typo fixes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
ae2e422. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Daniel Sperber <[email protected]>
Signed-off-by: Daraan <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
…57993)

Although Spark-on-Ray depends on the Java bindings, the `java` tests are
triggered by all C++ changes, and we don't want to run Spark-on-Ray tests
every time we change C++ code.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ect#57974)

This PR replaces STATS with Metric as the way to define metrics inside
Ray (as a unification effort) in all object-manager components. Normally,
metrics are defined at the top-level component and passed down to
sub-components. However, in this case, because the object manager is used
as an API across components, doing so would be unnecessarily cumbersome.
I decided to define the metrics inline within each client and server
class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply
wrappers around static OpenCensus/OpenTelemetry entities.


**Details**
Full context of this refactoring work.
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a
metrics.h file located in its top-level directory. This file defines all
metrics for that component.
- In most cases, metrics are defined once in the main entry point of
each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for
Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.).
These metrics are then passed down to subcomponents via the
ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric
infrastructure changes. Previously, a change would trigger a full Ray
rebuild; now, only the top-level entry points of each component need
rebuilding.
- There are a few exceptions where metrics are tracked inside object
libraries (e.g., task_specification). In these cases, metrics are
defined within the library itself, since there is no corresponding
top-level entry point.
- Finally, the obsolete metric_defs.h and metric_defs.cc files can now
be completely removed. This paves the way for further dead code cleanup
in a future PR.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
## Description
Use `tune.report` instead of `train.report`.

Signed-off-by: Matthew Deng <[email protected]>
…t#57620)


## Why are these changes needed?

This will be used to help control the targets that are returned.
<!-- Please give a short summary of the change and the problem this
solves. -->


Signed-off-by: akyang-anyscale <[email protected]>
## Description

This PR adds a new check to make sure proxies are ready to serve traffic
before `serve.run` finishes. For now, the check completes immediately.


---------

Signed-off-by: akyang-anyscale <[email protected]>
…roject#57793)

When deploying Ray on YARN using Skein, it's useful to expose Ray's
dashboard via Skein's web UI. This PR shows how to do that and updates
the related documentation.

Signed-off-by: Zakelly <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…cgroup even if they are drivers (ray-project#57955)

For more details about the resource isolation project see
ray-project#54703.

Driver processes that are registered in ray's internal namespace (such
as ray dashboard's job and serve modules) are considered system
processes. Therefore, they will not be moved into the workers cgroup
when they register with the raylet.

---------

Signed-off-by: irabbani <[email protected]>
…ray-project#57938)

This PR adds persistent epoch data to the checkpointing logic in the
[FSDP2
Template](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html).

This PR includes:
- New logic for saving the epoch into a distributed checkpoint
- New logic for resuming training from the saved epoch in a loaded
checkpoint
- Updates the [OSS FSDP2
example](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html)
to include the new logic

Passing release test:
https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f

---------

Signed-off-by: JasonLi1909 <[email protected]>
Signed-off-by: Jason Li <[email protected]>
Co-authored-by: matthewdeng <[email protected]>

Makes the Cancel Remote Task RPC idempotent and fault tolerant. Added a
Python test to verify retry behavior; there is no C++ test since the code
path just calls the CancelTask RPC, so there is nothing to test there.
Also renamed uses of RemoteCancelTask to CancelRemoteTask for
consistency.

---------

Signed-off-by: joshlee <[email protected]>
…utside of a Ray Train worker (ray-project#57863)

Introduce a decorator to mark functions that require running inside a
worker process spawned by Ray Train.

---------

Signed-off-by: Justin Yu <[email protected]>
## Description
Fix the typing for UDFs. This should not accept an instance as it is
currently defined.

Signed-off-by: Matthew Owen <[email protected]>
…ock sizing (ray-project#58013)

## Summary

Add a `repartition` call with `target_num_rows_per_block=BATCH_SIZE` to
the audio transcription benchmark. This ensures blocks are appropriately
sized to:
- Prevent out-of-memory (OOM) errors
- Ensure individual tasks don't take too long to complete

## Changes

- Added `ds = ds.repartition(target_num_rows_per_block=BATCH_SIZE)`
after reading the parquet file in `ray_data_main.py:98`

Signed-off-by: Balaji Veeramani <[email protected]>
…57044)

Running core scalability tests on Python 3.10 and updating the unit
test.
Successful release test:
https://buildkite.com/ray-project/release/builds/60890#01999c8a-6fdc-446a-a9da-2b9b006692d3

---------

Signed-off-by: elliot-barn <[email protected]>
## Description
We are using `read_parquet` in two of our tests in
`test_operator_fusion.py`; this switches those to use `range` to make
the tests less brittle.

Signed-off-by: Matthew Owen <[email protected]>
with comments to github issues

Signed-off-by: Lonnie Liu <[email protected]>
Otherwise, the ordering of messages looks strange on Windows.

Signed-off-by: Lonnie Liu <[email protected]>
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <[email protected]>
…8020)

## Description

Currently, streaming repartition doesn't combine blocks up to
`target_num_rows_per_block`, which is problematic in the sense that it
can only split blocks, not recombine them.

This PR addresses that by allowing it to recombine smaller blocks into
bigger ones. One caveat, however, is that the remainder block could
still be under `target_num_rows_per_block`.
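The recombination behavior can be sketched with a toy model over plain lists (the real implementation operates on Arrow blocks, so this is illustrative only):

```python
def recombine(blocks, target_num_rows_per_block):
    """Buffer small blocks and emit merged blocks of the target row count.

    The final remainder may stay under the target, matching the caveat
    noted above.
    """
    buf, rows = [], 0
    for block in blocks:
        buf.append(block)
        rows += len(block)
        while rows >= target_num_rows_per_block:
            # Merge buffered rows and slice off one full output block.
            merged = [row for b in buf for row in b]
            out = merged[:target_num_rows_per_block]
            rest = merged[target_num_rows_per_block:]
            yield out
            buf = [rest] if rest else []
            rows = len(rest)
    if rows:
        # Remainder block: may be smaller than the target.
        yield [row for b in buf for row in b]
```

For example, five single-row blocks with a target of 2 rows per block come out as two full blocks plus a one-row remainder.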


---------

Signed-off-by: Alexey Kudinkin <[email protected]>
…e buildup (ray-project#57996)


## Description

### [Data] ConcurrencyCapBackpressurePolicy - Handle internal output
queue buildup

**Issue**

- When there is internal output queue buildup (specifically when
`preserve_order` is set), we don't limit task concurrency in the
streaming executor and just honor the static concurrency cap.
- When the concurrency cap is unlimited, we keep queuing more blocks into
the internal output queue, leading to spill and a steep spill curve.


**Solution**

In ConcurrencyCapBackpressurePolicy, detect internal output queue
buildup and then limit the concurrency of the tasks.

- Keep a history of the internal output queue and detect trends in
percentage and size in GBs. Based on the trends, increase or decrease
the concurrency cap.
- Given that queue-based buffering is needed for `preserve_order`, allow
an adaptive queuing threshold. This still results in spill, but flattens
the spill curve and avoids runaway growth of the buffering queue.
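As a rough illustration of detecting a growth trend over the queue history (window size and threshold here are made-up values, not Ray's):

```python
from collections import deque


class QueueTrend:
    """Track recent queue sizes and report whether the queue is growing."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)

    def record(self, queued_bytes):
        self.history.append(queued_bytes)

    def is_growing(self, pct_threshold=0.2):
        """True if the queue grew by more than pct_threshold over the
        observed window."""
        if len(self.history) < 2:
            return False
        first, last = self.history[0], self.history[-1]
        return first > 0 and (last - first) / first > pct_threshold
```

A policy could lower the concurrency cap while `is_growing()` holds and restore it once the queue drains.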



---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
…#57999)

We have a feature flag to control the rolling out of ray export event,
but the feature flag is missing the controlling of
`StartExportingEvents`. This PR fixes the issue.

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
bveeramani and others added 20 commits November 7, 2025 23:01
…ct#58445)

## Summary
Creates a dedicated `tests/unit/` directory for unit tests that don't
require Ray runtime or external dependencies.

## Changes
- Created `tests/unit/` directory structure
- Moved 13 pure unit tests to `tests/unit/`
- Added `conftest.py` with fixtures to prevent `ray.init()` and
`time.sleep()`
- Added `README.md` documenting unit test requirements
- Updated `BUILD.bazel` to run unit tests with "small" size tag

## Test Files Moved
1. test_arrow_type_conversion.py
2. test_block.py
3. test_block_boundaries.py
4. test_data_batch_conversion.py
5. test_datatype.py
6. test_deduping_schema.py
7. test_expression_evaluator.py
8. test_expressions.py
9. test_filename_provider.py
10. test_logical_plan.py
11. test_object_extension.py
12. test_path_util.py
13. test_ruleset.py

These tests are fast (<1s each), isolated (no Ray runtime), and
deterministic (no time.sleep or randomness).
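The `conftest.py` guard can be approximated with a small context manager (the real fixture presumably uses pytest's `monkeypatch`; this stand-alone version just shows the idea):

```python
import time


class ForbidSleep:
    """Make time.sleep raise while active, so an accidental sleep in a
    unit test fails loudly instead of silently slowing the suite."""

    def __enter__(self):
        self._original = time.sleep

        def _forbidden(*args, **kwargs):
            raise RuntimeError("time.sleep is not allowed in unit tests")

        time.sleep = _forbidden
        return self

    def __exit__(self, exc_type, exc, tb):
        # Restore the real time.sleep on exit.
        time.sleep = self._original
```

An autouse fixture wrapping a test in this context would catch stray `time.sleep` calls immediately.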

---------

Signed-off-by: Balaji Veeramani <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Description


### [Data] Concurrency Cap Backpressure tuning
- Maintain an asymmetric EWMA of total queued bytes (this op +
downstream) as the typical level: `level`.
- Maintain an asymmetric EWMA of the absolute residual vs. the previous
level as a scale proxy: `dev = EWMA(|q - level_prev|)`.
- Define a deadband: `[lower, upper] = [level - K_DEV*dev, level + K_DEV*dev]`.
  - If `q > upper` -> target cap = `running - BACKOFF_FACTOR` (back off)
  - If `q < lower` -> target cap = `running + RAMPUP_FACTOR` (ramp up)
  - Else -> target cap = `running` (hold)
- Clamp to `[1, configured_cap]`; admit iff `running < target cap`.
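A minimal sketch of the controller described above (the constants, smoothing factors, and exact update order are illustrative, not Ray's actual implementation):

```python
K_DEV = 2.0          # deadband width in units of dev
BACKOFF_FACTOR = 1   # cap decrement when backing off
RAMPUP_FACTOR = 1    # cap increment when ramping up


class DeadbandCapController:
    def __init__(self, alpha_up=0.5, alpha_down=0.05):
        self.level = None   # EWMA of total queued bytes
        self.dev = 0.0      # EWMA of |q - level_prev|
        self.alpha_up = alpha_up      # react quickly when the value rises
        self.alpha_down = alpha_down  # decay slowly when it falls

    def _ewma(self, prev, x):
        # Asymmetric smoothing: different alpha for increases vs decreases.
        alpha = self.alpha_up if x > prev else self.alpha_down
        return prev + alpha * (x - prev)

    def target_cap(self, queued_bytes, running, configured_cap):
        if self.level is None:
            self.level = float(queued_bytes)
        # Decide against the current deadband, then update the EWMAs.
        lower = self.level - K_DEV * self.dev
        upper = self.level + K_DEV * self.dev
        if queued_bytes > upper:
            cap = running - BACKOFF_FACTOR   # back off
        elif queued_bytes < lower:
            cap = running + RAMPUP_FACTOR    # ramp up
        else:
            cap = running                    # hold
        self.dev = self._ewma(self.dev, abs(queued_bytes - self.level))
        self.level = self._ewma(self.level, queued_bytes)
        return max(1, min(cap, configured_cap))
```

With a steady queue the cap holds at the running count; a sudden spike above the deadband backs the cap off by one.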


---------

Signed-off-by: Srinath Krishnamachari <[email protected]>
Signed-off-by: Srinath Krishnamachari <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
… in read-only mode (ray-project#58460)

This ensures node type names are correctly reported even when the
autoscaler is disabled (read-only mode).

## Description

Autoscaler v2 fails to report prometheus metrics when operating in
read-only mode on KubeRay with the following KeyError error:

```
2025-11-08 12:06:57,402	ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates
`ctx.get_node_type_configs()` using node IDs as node types. That is
correct for local Ray (where `RAY_NODE_TYPE_NAME` is not set), but
incorrect for KubeRay, where `ray_node_type_name` is present and
`RAY_NODE_TYPE_NAME` is expected to be set.

As a result, in read-only mode the scheduler sees a node type name
(e.g., `small-group`) that never exists in the populated configs.

This PR fixes the issue by using `ray_node_type_name` when it exists,
and only falling back to node ID when it does not.
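The essence of the fix is a simple fallback; sketched here over a plain dict (field names follow the description above, not the exact code):

```python
def node_type_for(node_state):
    """Prefer the reported node type name (set on KubeRay via
    RAY_NODE_TYPE_NAME); fall back to the node ID only when the
    name is absent, as on local Ray."""
    return node_state.get("ray_node_type_name") or node_state["node_id"]
```

With this, a KubeRay node reports `small-group` while a local Ray node still falls back to its node ID.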
## Related issues
Fixes ray-project#58227

Signed-off-by: Rueian <[email protected]>
…cess: bool (ray-project#58384)

## Description
Pass in `status_code` directly into `do_reply`. This is a follow up to
ray-project#58255
## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>
ray-project#58395)

## Description

There are places in the Python code where we use the raw grpc library to
make gRPC calls (e.g., pub-sub, some calls to the GCS, etc.). In the
long term we want to fully deprecate raw grpc library usage in our
Python code base, but as that will take more effort and testing, in this
PR I am introducing an interceptor to add auth headers (this takes
effect for all gRPC calls made from Python).
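Conceptually, the interceptor attaches the token to every call's metadata along these lines (the header name is illustrative; the real change uses grpc's client interceptor API so that individual call sites need no edits):

```python
import os


def with_auth_metadata(metadata=None):
    """Return call metadata with the auth token appended, if configured.

    A gRPC client interceptor would apply this to every outgoing call.
    """
    metadata = list(metadata or [])
    token = os.environ.get("RAY_AUTH_TOKEN")
    if token:
        metadata.append(("authorization", f"Bearer {token}"))
    return metadata
```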

## Testing

### case 1: submitting a job using CLI
```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```

output
``` 
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_1EV8q86uKM24nHmH
  Query the status of the job:
    ray job status raysubmit_1EV8q86uKM24nHmH
  Request the job to be stopped:
    ray job stop raysubmit_1EV8q86uKM24nHmH

Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```


### case 2: submitting a job with actors and tasks and verifying on
dashboard
test.py
```python
import time
import ray
from ray._raylet import Config

ray.init()

@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)

@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())

for i in range(100):
    ray.get(print_hi.remote())
    time.sleep(20)

ray.shutdown()
```

``` 
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```
### dashboard screenshots:

#### promts user to input the token
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292"
/>

### on passing the right token:
overview page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762"
/>

job page: tasks are listed
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a"
/>

task page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136"
/>

actors page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459"
/>

specific actor page
<img width="1720" height="1073" alt="image"
src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0"
/>

---------

Signed-off-by: sampan <[email protected]>
Co-authored-by: sampan <[email protected]>
Currently, deprecation warnings are sometimes not informative enough.
When a warning is triggered, it does not tell us *where* the deprecated
feature is used. For example, Ray internally raises a deprecation
warning when an `RLModuleConfig` is initialized.

```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

This is confusing: where did *I* use a config, and what am I doing
wrong? This raises issues like:
https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

Tracing where the warning actually originates is tedious: is it my code
or Ray internals? The output just shows `deprecation.py:50`, which is
not helpful.

This PR adds a `stacklevel` option (with `stacklevel=2` as the default)
to all `deprecation_warning`s, so devs and users can better see where
the deprecated option is actually used.
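The mechanism is the standard `stacklevel` argument of Python's `warnings.warn`; a simplified version of such a helper might look like this (names are illustrative):

```python
import warnings


def deprecation_warning(old, new=None, *, stacklevel=2):
    """Warn about a deprecated API, attributing the warning to the caller.

    stacklevel=1 would point the warning at this helper itself;
    stacklevel=2 points one frame up, at the code that actually used
    the deprecated feature.
    """
    message = f"`{old}` has been deprecated."
    if new is not None:
        message += f" Use `{new}` instead."
    warnings.warn(message, DeprecationWarning, stacklevel=stacklevel)
```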

---

EDIT:

**Before**

```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` 
```

**After**: the `module.py:line` where the deprecated artifact is used is
shown in the log output:

When building an Algorithm:
```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```

Signed-off-by: Daraan <[email protected]>
Fixes some Vale errors and suggestions in the kai-scheduler document.

See ray-project#58161 (comment)

Signed-off-by: fscnick <[email protected]>

Our Serve ingress keeps running into the error below, related to
`uvloop`, under heavy load:
```
File descriptor 97 is used by transport
```
The uvloop team has a
[PR](MagicStack/uvloop#646) to fix it, but it seems
like no one is working on it.

One of the workarounds mentioned in the
([PR](MagicStack/uvloop#646 (comment)))
is to just turn off uvloop.
We tried it in our environment and didn't see any major performance
difference. Hence, as part of this PR, we are defining a new env
variable for controlling uvloop.
Signed-off-by: jugalshah291 <[email protected]>
- The `test_target_capacity` Windows test is failing, possibly because
we set a short timeout of 10 seconds; increasing it to verify whether
the timeout is the issue.

Signed-off-by: harshit <[email protected]>
so that they are not called lints any more

Signed-off-by: Lonnie Liu <[email protected]>
…ject#57233)

Update remaining multimodal release tests to use new depsets.
…y-project#58441)

## Description
Currently, we clear _external_ queues when an operator is manually
marked as finished, but we don't clear their _internal_ queues. This PR
fixes that.
## Related issues
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736
## Additional information

---------

Signed-off-by: iamjustinhsu <[email protected]>
be consistent with doc build environment

Signed-off-by: Lonnie Liu <[email protected]>
migrating all doc related things to run on python 3.12

Signed-off-by: Lonnie Liu <[email protected]>
excluding `*_tests` directories for now to reduce the impact

Signed-off-by: Lonnie Liu <[email protected]>
using `bazelisk run //java:gen_ray_java_pkg` everywhere

Signed-off-by: Lonnie Liu <[email protected]>
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics track the count and size of objects owned
by the worker, as well as their states. The states are defined as:

- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)

The approach used by these new metrics is to examine the state 'before
and after' any mutations on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
values and incrementing others). Admittedly, there is potential for
counting on the in between decrements/increments depending on when the
RecordMetrics loop is run. This unfortunate side effect however seems
preferable to doing mutual exclusion with metric collection as this is
potentially a high throughput code path.

In addition, performing live counts seemed preferable then doing full
accounting of the object store and across all references at time of
metric collection. Reason being, that potentially the reference counter
is tracking millions of objects, and each metric scan could potentially
be very expensive. So running the accounting (despite being potentially
innaccurate for short periods) seemed the right call.

This PR also allows for object size to potentially change due to
potential non deterministic instantation (say an object is initially
created, but it's primary copy dies, and then the recreation fails).
This is an edge case, but seems important for completeness sake.

---------

Signed-off-by: zac <[email protected]>
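The 'before and after' bookkeeping described above can be sketched as follows. This is a hypothetical illustration in Python (the actual implementation lives in the C++ reference counter); the state names mirror the PR, but the data structures and function names are invented for clarity.

```python
from collections import Counter


class ObjectMetrics:
    """Illustrative per-state count/size bookkeeping for owned objects."""

    def __init__(self):
        self.count_by_state = Counter()
        self.bytes_by_state = Counter()

    def _state_of(self, ref):
        # Mirrors the four states defined in the PR description.
        if ref["pending_creation"]:
            return "PendingCreation"
        if ref["node_address"] is not None:
            return "Spilled" if ref["spilled"] else "InPlasma"
        return "InMemory"

    def on_create(self, ref):
        state = self._state_of(ref)
        self.count_by_state[state] += 1
        self.bytes_by_state[state] += ref["size"]

    def on_mutation(self, ref, mutate):
        # Decrement counters for the old state, apply the mutation,
        # then increment counters for the new state. A metrics scan
        # between the decrement and increment may briefly undercount.
        before = self._state_of(ref)
        self.count_by_state[before] -= 1
        self.bytes_by_state[before] -= ref["size"]
        mutate(ref)
        after = self._state_of(ref)
        self.count_by_state[after] += 1
        self.bytes_by_state[after] += ref["size"]
```

Because only the touched reference's counters are adjusted at mutation time, a metrics scan never has to walk every tracked reference, at the cost of brief inaccuracy mid-update.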
@sourcery-ai sourcery-ai bot left a comment

The pull request #674 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5336.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, focusing on comprehensive updates to the project's build and continuous integration infrastructure. It modernizes dependency management, refines Docker image creation, and enhances the testing framework to ensure more robust and efficient development workflows. The changes span across various configuration files, reflecting a concerted effort to improve the overall maintainability and reliability of the project's development environment.

Highlights

  • Bazel Configuration Updates: The .bazelrc file has been updated to enable --incompatible_strict_action_env by default, add a workspace_status_command for Linux builds, and include /utf-8 CXX options for Windows. It also now ignores warnings for src/ray/thirdparty/ files and includes a try-import .user.bazelrc for custom user options. Specific GPR_MANYLINUX1 definitions for gRPC on Linux and incompatible_strict_action_env for Windows builds have been removed. Additionally, -Wno-error=deprecated-declarations has been added for macOS builds, and test_env=PATH is now included for CI Python tests.
  • CI/CD Pipeline Refactoring: Significant changes have been made across .buildkite YAML files. Several base image build steps were removed from _forge.rayci.yml and consolidated into a new _images.rayci.yml file, which defines a comprehensive set of base images including extra variants. Python 3.10 support has been expanded across various build and test matrices in base.rayci.yml, and new cibase tags have been introduced for many steps. Dedicated CI pipelines for dependencies (dependencies.rayci.yml) and documentation (doc.rayci.yml) have been added. The kuberay.rayci.yml now uses macos-arm64 for bisecting and has removed kuberay doc tests.
  • Docker Image Build Enhancements: New Dockerfiles and Wanda configurations have been introduced for ray-core, ray-dashboard, and ray-java builds, streamlining the creation of these core components. The build.rayci.yml now includes ray-extra image builds and updates platform versions for ray-llm. The manylinux.Dockerfile has been upgraded to use miniforge3 instead of miniconda3 and includes uv for dependency management, along with updated Docker group configurations.
  • Test Configuration Adjustments: Test configurations have been refined across core.rayci.yml, data.rayci.yml, llm.rayci.yml, ml.rayci.yml, rllib.rayci.yml, and serve.rayci.yml. This includes adding new build dependencies, updating Python versions in test matrices, and modifying except-tags for various test suites to improve granularity and relevance. Notably, dask and modin tests have been separated, and new train v2 and tracing test steps have been introduced. Many instance_type settings have been adjusted, and RAYCI_DISABLE_TEST_DB=1 has been added to several steps to prevent test database interactions.
  • Linting and Formatting Updates: The .pre-commit-config.yaml has been updated to include src/ray/thirdparty/ in exclusions, refine pydoclint hooks for local and CI use, and expand cpplint and buildifier file matching. New semgrep, vale, cython-lint, check-train-circular-imports, and eslint hooks have been added to enhance code quality checks. The .vale.ini and related vocabulary files have been updated with new ignored blocks and accepted terms.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with πŸ‘ and πŸ‘Ž on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request represents a massive and impressive refactoring of the CI/CD and build systems. The changes significantly improve modularity, maintainability, and adopt modern best practices across the board. Key improvements include the modularization of the build pipeline, the introduction of a new dependency management system (raydepsets), the switch to uv and pre-commit for better linting and dependency handling, and the simplification of the Bazel build configuration. The refactoring is consistent and well-executed across all the changes. I have reviewed the changes in detail and found no issues. This is an excellent contribution to the project's health.

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 26, 2025