@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-18
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

elliot-barn and others added 30 commits October 29, 2025 16:13
updating tune release tests to run on python 3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65655
(failing tests are already disabled)

---------

Signed-off-by: elliot-barn <[email protected]>
Remove the actor handle from the object that gets passed around in long-poll
communication.

Returning an actor handle inside a nested object from a task makes the caller
of that task a borrower from the reference-counting point of view. This
pattern, although allowed, is not well tested, so this change avoids it by
passing `actor_name` from `listen_for_change` instead.

---------

Signed-off-by: abrar <[email protected]>
## Description

The full name was probably hallucinated by an LLM.

## Related issues

## Additional information

Signed-off-by: Rui Qiao <[email protected]>
…ross-node parallelism (ray-project#57261)

Signed-off-by: jeffreyjeffreywang <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: jeffreyjeffreywang <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nikhil G <[email protected]>
…imit` (ray-project#58303)

## Description

## Related issues

Fix comment
ray-project#58264 (comment)
## Additional information

Signed-off-by: You-Cheng Lin <[email protected]>
…ist with nixl (ray-project#58263)

## Description
For nixl, reuse previous metadata if transferring the same tensor list.
This is to avoid repeated `register_memory` before `deregister_memory`

---------

Signed-off-by: Dhyey Shah <[email protected]>
Co-authored-by: Dhyey Shah <[email protected]>
Co-authored-by: Stephanie Wang <[email protected]>
…tuned_examples/`` in ``rllib`` (ray-project#56746)

## Why are these changes needed?

Seventh split of ray-project#56416

## Related issue number

## Checks

- [ ] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Gagandeep Singh <[email protected]>
Signed-off-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Kamil Kaczmarek <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
…ct#57835)

## Description
Builds on top of ray-project#58047. This PR ensures the following when
`auth_mode` is `token`:
- Calling `ray.init()` (without passing an existing cluster address): check
whether a token is present; if not, generate one and store it in the default
path.
- Calling `ray.init(address="xyz")` (connecting to an existing cluster): check
whether a token is present; raise an exception if it is not.
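
The two cases above can be sketched roughly as follows. This is a minimal illustration, not Ray's implementation: the helper name `ensure_auth_token`, the `token_path` argument, and the use of `secrets.token_hex` are all assumptions made for the example.

```python
import secrets
from pathlib import Path
from typing import Optional


def ensure_auth_token(token_path: Path, address: Optional[str] = None) -> str:
    """Hypothetical helper mirroring the behavior described above."""
    if token_path.exists():
        return token_path.read_text().strip()
    if address is not None:
        # Connecting to an existing cluster: a token must already exist.
        raise RuntimeError(
            f"auth_mode is 'token' but no token was found at {token_path}"
        )
    # Starting a new local cluster: generate a token and store it.
    token_path.parent.mkdir(parents=True, exist_ok=True)
    token = secrets.token_hex(32)
    token_path.write_text(token)
    return token
```

The design point is that only the process that starts a cluster may create a token; a client connecting to an existing cluster fails fast instead of silently generating a token that cannot match.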

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
…indefinite waiting (ray-project#58238)

## Description
Add guidance for RayService initialization timeout to prevent indefinite
waiting with `ray.io/initializing-timeout` annotation on RayService.

## Related issues
Closes ray-project/kuberay#4138

## Additional information
None

---------

Signed-off-by: wei-chenglai <[email protected]>
Signed-off-by: Wei-Cheng Lai <[email protected]>
Co-authored-by: Han-Ju Chen (Future-Outlier) <[email protected]>
…check in accelerator_context.py (ray-project#58269)

## Description

- Change the caught exception type from IndexError to ValueError
- This modification ensures that the error is actually caught and reported
when the expected accelerator ID is not included in the
accelerator_visible_list

`list.index` will raise a
[ValueError](https://docs.python.org/3/library/exceptions.html#ValueError)
if there is no such item.

https://docs.python.org/3/tutorial/datastructures.html
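
Since `list.index` raises `ValueError` for a missing item, a lookup of this kind needs to catch that exception type. A minimal sketch, where `visible_index` and its error message are hypothetical and not the code in `accelerator_context.py`:

```python
def visible_index(accelerator_id: str, visible_list: list[str]) -> int:
    """Hypothetical helper: position of an accelerator ID in the visible list."""
    try:
        return visible_list.index(accelerator_id)
    except ValueError as exc:
        # list.index raises ValueError (not IndexError) for a missing item.
        raise RuntimeError(
            f"Accelerator {accelerator_id!r} is not in {visible_list}"
        ) from exc
```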

<img width="797" height="79" alt="image"
src="https://github.com/user-attachments/assets/830cf4aa-d9cb-44d3-9363-8e5cd576bae9"
/>

Now, the error logs will be correctly captured and printed.
<img width="1454" height="143" alt="image"
src="https://github.com/user-attachments/assets/2b0ed0aa-60c7-49c3-84b0-7f0d4f1ebe48"
/>

Signed-off-by: daiping8 <[email protected]>
Significant component, I keep forgetting it's buried inside of
`common/`.

Also cleaned up the mock & proto build targets that were in the
top-level `BUILD.bazel` file.

We should also at some point clean it up to follow the common pattern of
separate targets for an interface, client, & server.

---------

Signed-off-by: Edward Oakes <[email protected]>
…t#58070)

RayEvent provides a special API, merge, which allows multiple events to
be combined into a single event. This reduces gRPC message size, network
bandwidth usage, and is essential for scaling task event exports. This
PR leverages that feature.

Specifically, it clusters events into groups based on (i) entity ID and
(ii) event type. Each group is merged into a single event, which is then
added to the gRPC message body. The EntityId is a user-defined function,
implemented by the event class creator, that determines which events can
be safely merged.

```
Note: this is a redo of ray-project#56558, which was reverted because it randomized the order of the exported events, leading to flaky tests, etc. This attempt maintains the order even after merging.
```
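
The grouping and order-preserving merge described above can be illustrated with a small Python sketch. The `Event` class, its `merge` semantics (concatenating payloads), and `merge_events` are stand-ins invented for this example; the real RayEvent code is C++ and its merge logic is defined per event class.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a RayEvent; not the real class."""
    entity_id: str
    event_type: str
    payload: list = field(default_factory=list)

    def merge(self, other: "Event") -> None:
        # Assumed merge semantics for this sketch: concatenate payloads.
        self.payload.extend(other.payload)


def merge_events(events: list[Event]) -> list[Event]:
    """Merge events sharing (entity_id, event_type), keeping first-seen order."""
    groups: dict[tuple[str, str], Event] = {}
    for event in events:
        key = (event.entity_id, event.event_type)
        if key in groups:
            groups[key].merge(event)
        else:
            # Copy so the caller's event is not mutated by later merges.
            groups[key] = Event(event.entity_id, event.event_type, list(event.payload))
    # dicts preserve insertion order, so the output order is deterministic.
    return list(groups.values())
```

Relying on an insertion-ordered map, rather than an unordered one, is what keeps the exported order stable across runs.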

Test:
- CI

Signed-off-by: Cuong Nguyen <[email protected]>
… in core worker (ray-project#58315)

The `metrics_agent_client_` depends on `client_call_manager_`, but
previously it was pulling out a reference to it from the core worker,
which is not guaranteed to outlive the agent client.

Modifying it to keep the `client_call_manager_` as a field of the
`core_worker_process` instead.

I think we may also need to drain any ongoing RPCs from the
`metrics_agent_client_` on shutdown. Leaving that for a future PR.

---------

Signed-off-by: Edward Oakes <[email protected]>
## Description

Historically, the intention was to avoid failures upon attempts to
modify the provided batch in place when, for example, using Torch tensors.

However, that is unjustifiably penalizing 99.9% of use-cases for 0.1% of
scenarios. As such, we're flipping this setting to be
`zero_copy_batch=True` by default.

---------

Signed-off-by: Alexey Kudinkin <[email protected]>
A few release tests currently use the `raw_metric` API to retrieve
Prometheus metrics. This approach is unreliable because it directly
polls the metric export endpoint, which can race with the Prometheus
server that also polls the same endpoint.

To address this, add a method to query metrics directly from the
Prometheus server instead. This ensures that in end-to-end and
production environments, Prometheus remains the sole poller of exported
metrics and the single source of truth for metric values.
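
As a rough sketch of what querying the Prometheus server looks like, the snippet below uses the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The function names are hypothetical and this is not the helper added in the PR; the server URL and the PromQL expression are whatever the caller supplies.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + params


def parse_instant_query(body: str) -> dict[str, float]:
    """Map each series' labels (as a JSON string) to its sampled value."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    results = {}
    for series in payload["data"]["result"]:
        # Instant vectors carry the sample as [unix_timestamp, "string_value"].
        _, value = series["value"]
        results[json.dumps(series["metric"], sort_keys=True)] = float(value)
    return results


def query_metric(prometheus_url: str, promql: str) -> dict[str, float]:
    """Ask the Prometheus server, not the exporter endpoint, for a metric."""
    with urlopen(build_query_url(prometheus_url, promql), timeout=10) as resp:
        return parse_instant_query(resp.read().decode())
```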

Test:
- CI
- The memory metric is collected and the number makes sense:
https://buildkite.com/ray-project/release/builds/66031/steps/canvas?sid=019a3215-e3be-4e9b-86a3-1f0a8253fea7#019a3215-e3f2-4d0e-877f-5f43d97d6e8e/557-654

Signed-off-by: Cuong Nguyen <[email protected]>
Extend token auth support to the dashboard head (all APIs).

---------

Signed-off-by: sampan <[email protected]>
Signed-off-by: Sampan S Nayak <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
Co-authored-by: sampan <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
This PR also updates the cluster resource scheduler logic to account for
the list of `LabelSelector`s specified by the `fallback_strategy`,
falling back to each fallback strategy `LabelSelector` in-order until
one is satisfied when selecting the best node. We're able to support
fallback selectors by considering them in the cluster resource scheduler
in-order using the existing label selector logic in `IsFeasible` and
`IsAvailable`, returning the first valid node returned by
`GetBestSchedulableNode`.
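
The in-order fallback can be pictured with the simplified Python sketch below. It is only an illustration under stated assumptions: label selectors are reduced to exact-match dictionaries, `matches` and `pick_node` are hypothetical names, and the real logic lives in the C++ cluster resource scheduler (`IsFeasible`, `IsAvailable`, `GetBestSchedulableNode`).

```python
from typing import Optional

LabelSelector = dict[str, str]  # simplified: exact-match labels only


def matches(node_labels: dict[str, str], selector: LabelSelector) -> bool:
    """True if the node carries every label the selector requires."""
    return all(node_labels.get(k) == v for k, v in selector.items())


def pick_node(
    nodes: dict[str, dict[str, str]],
    label_selector: LabelSelector,
    fallback_strategy: list[LabelSelector],
) -> Optional[str]:
    """Try the primary selector, then each fallback selector in order."""
    for selector in [label_selector, *fallback_strategy]:
        for node_id, labels in nodes.items():
            if matches(labels, selector):
                return node_id
    return None
```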

ray-project#51564

---------

Signed-off-by: Ryan O'Leary <[email protected]>
Signed-off-by: Ryan O'Leary <[email protected]>
Co-authored-by: Mengjin Yan <[email protected]>
…t Raylet logs (ray-project#58244)

`test_gcs_fault_tolerance.py::test_worker_raylet_resubscription` is
still flaky in CI despite bumping up the timeout. Making a few
improvements here:

- Increasing the timeout to `20s` just in case it's a timeout issue
(unlikely).
- Changing to scheduling an actor instead of using `internal_kv` for our
signal that the GCS is back up. This should better indicate that the
Raylet is resubscribed.
- Cleaning up some system logs.
- Modifying the `ObjectLostError` logs to avoid logging
likely-irrelevant plasma usage on owner death.

It's likely that the underlying issue here is that we don't actually
reliably resubscribe to all worker death notifications, as indicated in
the TODO in the PR.

---------

Signed-off-by: Edward Oakes <[email protected]>
…ay-project#58255)

## Description
Currently, the `ray.util.state.list_actors(limit=N)` API returns
details for at most N actors. However, when N exceeds the default value
of `RAY_MAX_LIMIT_FROM_API_SERVER=10_000`, the call fails with a
misleading message (it says the dashboard or API server
is unavailable, even when it is available). This happens
because we don't handle `ValueError`s and by default throw a 500 error. My
solution is to handle that error; I'm open to suggestions and alternatives.

NOTE: The call still fails with an internal server error. I think this
should be a 4XX error, but that would require more code
changes because it goes through a widely used function, `do_reply`. Gemini
suggests returning `rest_response` directly; I'm happy to do that
instead.

### Before
```python
>>> import ray
>>> import ray.util.state
>>> ray.init()
>>> ray.util.state.list_actors(limit=100000)
ray.util.state.exception.RayStateApiException: Failed to make request to http://127.0.0.1:8265/api/v0/actors. Failed to connect to API server. Please check the API server log for details. Make sure dependencies are installed with `pip install ray[default]`. Please also check dashboard is available, and included when starting ray cluster, i.e. `ray start --include-dashboard=True --head`. Response(url=http://127.0.0.1:8265/api/v0/actors?limit=100000&timeout=24&detail=False&exclude_driver=True&server_timeout_multiplier=0.8,status=500)
```

### After
```python
>>> import ray
>>> import ray.util.state
>>> ray.init()
>>> ray.util.state.list_actors(limit=100000)
ray.util.state.exception.RayStateApiException: API server internal error. See dashboard.log file for more details. Error: Given limit 100000 exceeds the supported limit 10000. Use a lower limit, or set the RAY_MAX_LIMIT_FROM_API_SERVER=limit
```
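
The general shape of the fix, catching the `ValueError` and passing its message back to the client, might look like the sketch below. `validate_limit`, `handle_list_request`, and the reply dictionary are hypothetical and do not reflect the dashboard's actual handler or `do_reply` signature.

```python
MAX_LIMIT = 10_000  # mirrors the default of RAY_MAX_LIMIT_FROM_API_SERVER


def validate_limit(limit: int) -> int:
    """Reject limits above the server-side maximum."""
    if limit > MAX_LIMIT:
        raise ValueError(
            f"Given limit {limit} exceeds the supported limit {MAX_LIMIT}."
        )
    return limit


def handle_list_request(limit: int) -> dict:
    """Hypothetical handler: turn a ValueError into an explicit error reply."""
    try:
        validate_limit(limit)
    except ValueError as exc:
        # Surface the real cause instead of a generic connection failure.
        return {"result": False, "msg": str(exc)}
    return {"result": True, "msg": "", "limit": limit}
```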

## Related issues
None
## Additional information
None

---------

Signed-off-by: iamjustinhsu <[email protected]>
upgrading tune fault release test to 3.10

Successful release test run:
https://buildkite.com/ray-project/release/builds/65673#

---------

Signed-off-by: elliot-barn <[email protected]>
upgrading jobs tests to run on python 3.10

Successful release tests:
https://buildkite.com/ray-project/release/builds/65845

---------

Signed-off-by: elliot-barn <[email protected]>
removing byod compile jobs for release test images
Now using raydepsets to generate lock files

Signed-off-by: elliot-barn <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
## Description
This PR introduces more information into the `explain` API. Before,
`explain` showed only the unoptimized logical plan and the optimized physical
plan. To make the `explain` API clearer, I introduce 4 types of plans:
- Logical Plan
- Logical Plan (Optimized)
- Physical Plan
- Physical Plan (Optimized)

Example Output
```python
>>> import ray
>>> ray.data.range(1000).select_columns("id").explain()
-------- Logical Plan --------
Project[Project]
+- Read[ReadRange]

-------- Logical Plan (Optimized) --------
Project[Project]
+- Read[ReadRange]

-------- Physical Plan --------
TaskPoolMapOperator[Project]
+- TaskPoolMapOperator[ReadRange]
   +- InputDataBuffer[Input]

-------- Physical Plan (Optimized) --------
TaskPoolMapOperator[ReadRange->Project]
+- InputDataBuffer[Input]
```

## Related issues
None

## Additional information
None

---------

Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: EkinKarabulut <[email protected]>
Signed-off-by: Rueian <[email protected]>
Signed-off-by: iamjustinhsu <[email protected]>
Co-authored-by: EkinKarabulut <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <[email protected]>
Co-authored-by: fscnick <[email protected]>
Co-authored-by: Jiajun Yao <[email protected]>
Co-authored-by: Rueian <[email protected]>
…ses (ray-project#58265)

## Description
Using the iptables script created in ray-project#58241, we found a bug in
RequestWorkerLease where a RAY_CHECK was being triggered here:
https://github.com/ray-project/ray/blob/66c08b47a195bcfac6878a234dc804142e488fc2/src/ray/raylet/lease_dependency_manager.cc#L222-L223
Transient network errors can happen at any time, including while the server
logic is executing and has not yet replied to the client. Our original
testing framework used an env variable to drop the request or reply when it
is being sent, so this case was missed.

Specifically, RequestWorkerLease can be in the process of pulling the lease
dependencies to its local plasma store when a retry arrives, which triggers
this check. I created a C++ unit test that triggers this RAY_CHECK without
this change and passes with it. I decided to store all the callbacks instead
of replacing the older one with the new one because of possible message
reordering, where the new one could arrive before the old one.
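
The idea of keeping every callback rather than overwriting the previous one can be sketched in Python as follows. `LeaseRequestTracker` and its methods are hypothetical and only model the pattern; the actual fix is in the raylet's C++ code.

```python
from collections import defaultdict
from typing import Callable

ReplyCallback = Callable[[str], None]


class LeaseRequestTracker:
    """Hypothetical tracker; not the raylet's actual data structure."""

    def __init__(self) -> None:
        self._pending: dict[str, list[ReplyCallback]] = defaultdict(list)

    def on_request(self, lease_id: str, reply: ReplyCallback) -> bool:
        """Record a reply callback.

        Returns True only for the first request for this lease, i.e. when
        the caller should start pulling dependencies.
        """
        is_first = lease_id not in self._pending
        self._pending[lease_id].append(reply)
        return is_first

    def on_dependencies_ready(self, lease_id: str, worker: str) -> None:
        # Reply to the original request and to every retry.
        for reply in self._pending.pop(lease_id, []):
            reply(worker)
```

Because all callbacks are answered, it does not matter whether the original request or its retry reaches the server first.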

---------

Signed-off-by: joshlee <[email protected]>
soffer-anyscale and others added 22 commits November 16, 2025 21:36
## Why are these changes needed?

Adding a version arg to read_delta_lake to support reading from a
specific version

## Checks

- [x] I've signed off every commit(by using the -s flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
           corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: soffer-anyscale <[email protected]>
Signed-off-by: Richard Liaw <[email protected]>
Co-authored-by: Richard Liaw <[email protected]>
it uses `enum.Enum` that are not deepcopy-able

Signed-off-by: Lonnie Liu <[email protected]>
…n pushdown rule (ray-project#58683)

## Description

The projection pushdown rule was directly accessing
`_cached_output_metadata.schema`, which breaks abstraction barriers by
reaching into private implementation details. This violates
encapsulation and makes the code fragile to internal changes.

This PR fixes the issue by using the proper `infer_schema()` method
instead, which provides a clean public interface for accessing schema
information. This respects the operator's abstraction and ensures we get
the correct schema through the intended API.


## Additional information

The change is in `projection_pushdown.py:342` where we now call
`input_op.infer_schema()` instead of directly accessing
`input_op._cached_output_metadata.schema`.

Signed-off-by: Balaji Veeramani <[email protected]>
…eterministic task ordering (ray-project#58655)

## Description
I added a parameter that determines whether the test checks row order,
relaxing the restriction for flaky tests.

## Related issues
Closes ray-project#58561

## Additional information

I ran the test 20 times using a simple script that runs `python -m pytest
python/ray/data/tests/test_execution_optimizer_limit_pushdown.py::test_limit_pushdown_conservative`:

- master: `Passed: 18, Failed: 2`
- feature/flaky-test_limit_pushdown_conservative: `Passed: 20, Failed:
0`

---------

Signed-off-by: ryankert01 <[email protected]>
Signed-off-by: Ryan Huang <[email protected]>
Co-authored-by: You-Cheng Lin <[email protected]>
…ect#58626)

## Description

The `test_simple_imputer` test was flaky when using the most_frequent
strategy because the output row ordering is nondeterministic after
repartitioning.
This PR makes the DataFrame comparison order-independent by using the
built-in `rows_same` utility.
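
For intuition, an order-independent row comparison can be written with a multiset, as in the sketch below. `rows_equal_unordered` is a hypothetical helper for illustration and is not the `rows_same` utility itself; it assumes each row is a dictionary with hashable, sortable items.

```python
from collections import Counter


def rows_equal_unordered(actual: list[dict], expected: list[dict]) -> bool:
    """Compare two lists of rows as multisets, ignoring row order."""

    def freeze(row: dict) -> tuple:
        # Sort items so column order within a row does not matter either.
        return tuple(sorted(row.items()))

    return Counter(map(freeze, actual)) == Counter(map(freeze, expected))
```

Using a `Counter` instead of a `set` keeps duplicate rows significant, so two outputs with different row multiplicities still compare unequal.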

## Related issues

Fixes ray-project#58563

## Additional information

Signed-off-by: justinyeh1995 <[email protected]>
…st-2 (ray-project#58437)

## Description

The `image_embedding_from_jsonl` release tests are typically run in
us-west-2, but they read data from a bucket in a different region. This
is problematic because it's expensive and unrealistic (users often read
data in the same region), and has a noticeable impact on read speeds.

To address this issue, this PR updates the release test to read from a
bucket in us-west-2.
## Related issues

Signed-off-by: Balaji Veeramani <[email protected]>
…58680)

for implementing kafka datasource.

Signed-off-by: You-Cheng Lin (Owen) <[email protected]>
Co-authored-by: Lonnie Liu <[email protected]>
…dule` instances (issue ray-project#56304) (ray-project#58243)

## Description
At present, specifying a `model_config` for a MultiAgentRLModule does
not actually pass that value to the instantiated module. This is due to
an oversight in `get_multi_rl_module_spec`, which fails to add
`model_config` as an argument when passing the rest of the attributes
forward for instantiation. With this change, the issue is resolved.

## Related issues
Fixes/Closes ray-project#56304.

## Additional information
Credit to @arazthexd.

---------

Signed-off-by: Matthew <[email protected]>
Signed-off-by: MatthewCWeston <[email protected]>
…ray-project#58397)

## Description
In improving the `SingleEnvRunner.make_env`, I found that some of the
tests could be flaky.
This PR improves the tests, in particular those for `sample`, so that
they don't fail occasionally, and updates the documentation to reflect this.

The primary source of flakiness I found is that `sample(num_timesteps=X)` will
not always return exactly `X` timesteps; it returns at least `X` timesteps
and at most the number of environments more.
I've updated the documentation to clarify this for users.

In addition, I've added tests for the case where neither the number of
timesteps nor the number of episodes is given, and for the `force_reset`
argument.
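
The overshoot on `num_timesteps` follows from stepping all sub-environments together. The toy model below assumes lockstep stepping in which every vector step adds one timestep per sub-environment; `timesteps_returned` is a hypothetical function for illustration and not part of the env runner.

```python
def timesteps_returned(num_timesteps: int, num_envs: int) -> int:
    """Timesteps collected when all sub-environments step in lockstep."""
    collected = 0
    while collected < num_timesteps:
        # One vector step advances every sub-environment by one timestep.
        collected += num_envs
    return collected
```

Under this model, the result is always at least `num_timesteps` and overshoots by fewer than `num_envs` timesteps, which is why tests should not assert an exact count.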

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: simonsays1980 <[email protected]>
so that we are not tied to the python version coming with buildkite
images

Signed-off-by: Lonnie Liu <[email protected]>
not used anywhere any more

Signed-off-by: Lonnie Liu <[email protected]>
they are failing right now

Signed-off-by: Lonnie Liu <[email protected]>
## Description

This PR adds caching for PyArrow schema operations to improve
performance during batching operations, especially for tables with a
large number of columns.

### Main Changes

- **Caching for Tensor Type Serialization/Deserialization**: Added cache
for tensor type serialization and deserialization operations. This
significantly reduces overhead for frequently accessed tensor types
during schema operations.

### Performance Impact

This optimization is particularly beneficial during batching operations
for tables with a large number of columns. In one of our tests with 200
columns, the batching time per batch decreased from **0.30s to 0.11s**
(~63% improvement).
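
The caching idea can be sketched with `functools.lru_cache`. The functions below are simplified stand-ins that use JSON; they are assumptions for illustration and do not reproduce the PR's actual tensor extension type code.

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def serialize_tensor_type(shape: tuple, dtype: str) -> bytes:
    """Stand-in for an expensive serialization step."""
    return json.dumps({"shape": list(shape), "dtype": dtype}).encode()


@lru_cache(maxsize=None)
def deserialize_tensor_type(serialized: bytes) -> tuple:
    """Return the same cached object for identical serialized bytes."""
    spec = json.loads(serialized)
    return tuple(spec["shape"]), spec["dtype"]
```

Caching works here because the inputs are hashable and the outputs are immutable, so repeated schema operations on the same tensor type reuse one result instead of rebuilding it per column and per batch.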

#### Without cache:
<img width="1719" height="464" alt="Screenshot 2025-11-13 at 9 49 33 PM"
src="https://github.com/user-attachments/assets/46122634-dd09-40ed-a2a8-725d14f85728"
/>
We can see `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` in
different places. Each time `__arrow_ext_deserialize__` will create a
new object and `__arrow_ext_serialize__` includes expensive pickle.

#### With cache:
<img width="1717" height="476" alt="Screenshot 2025-11-13 at 9 41 15 PM"
src="https://github.com/user-attachments/assets/50e77253-d69d-40d9-9e1f-56e9341bc131"
/>
The time on `__arrow_ext_deserialize__` and `__arrow_ext_serialize__` is
not a bottleneck anymore.

---------

Signed-off-by: xgui <[email protected]>
Signed-off-by: Xinyuan <[email protected]>
)

## Description
Using composite observation and action spaces for the RLlib Offline
implementation.

Below is an example script for testing

```python
import shutil
from pathlib import Path
from typing import Any

import gymnasium as gym
import msgpack
import msgpack_numpy as mnp
import numpy as np

import ray
from ray.rllib.algorithms import BCConfig
from ray.rllib.connectors.common.flatten_observations import \
    FlattenObservations
from ray.rllib.env.single_agent_episode import SingleAgentEpisode


class SampleEnv(gym.Env):
    metadata = {'render.modes': ['human', 'logger']}

    def __init__(self, config: dict[str, Any] = None):
        super(SampleEnv, self).__init__()

        self.observation_space = gym.spaces.Dict({
            "deployable_area": gym.spaces.Box(
                low=0,
                high=1,
                shape=(25, 25),
                dtype=np.uint8
            ),
            "available_units": gym.spaces.Box(
                low=0.0,
                high=1.0,
                shape=(35,),
                dtype=np.float32
            ),
            "available_abilities": gym.spaces.MultiBinary(n=6),
        })
        self.action_space = gym.spaces.Dict({
            "deploy": gym.spaces.Dict({
                "unit_index": gym.spaces.Discrete(35),
                "yx": gym.spaces.Discrete(625),
            }),
            "ability": gym.spaces.Discrete(6),
        })
        self.current_step = 0

    def reset(self, seed=None, options=None):
        # Forward the seed so gymnasium can seed the environment's RNG.
        super().reset(seed=seed)
        self.current_step = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self.current_step += 1
        obs = self.observation_space.sample()
        reward = np.random.rand(1)[0]
        terminate = self.current_step == 2
        return obs, reward, terminate, False, {}


class ExamplePreLearner:

    @staticmethod
    def _gen_episode() -> SingleAgentEpisode:
        env = SampleEnv({})

        ep = SingleAgentEpisode(
            observation_space=env.observation_space,
            action_space=env.action_space,
        )
        obs, info = env.reset()
        ep.add_env_reset(obs, info)

        term = False
        while not term:
            action = env.action_space.sample()
            obs, reward, term, trunc, info = env.step(action)
            ep.add_env_step(obs, action, reward, info, terminated=term, truncated=trunc)

        env.close()
        ep.validate()
        return ep


if __name__ == "__main__":
    pl = ExamplePreLearner()

    episodes = [pl._gen_episode() for _ in range(10)]
    packed_episodes = [msgpack.packb(eps.get_state(), default=mnp.encode) for eps in episodes]
    samples_ds = ray.data.from_items(packed_episodes)

    path = "tmp/supercell_offline_learning"
    # The directory may not exist on a first run.
    shutil.rmtree(path, ignore_errors=True)
    samples_ds.write_parquet(path)

    # Read the episodes and decode them.
    read_sample_ds = ray.data.read_parquet(path)
    batch = read_sample_ds.take_batch(10)
    read_episodes = [
        SingleAgentEpisode.from_state(msgpack.unpackb(state, object_hook=mnp.decode))
        for state in batch["item"]
    ]
    assert len(episodes) == len(read_episodes)
    for eps, read_eps in zip(episodes, read_episodes):
        assert len(eps) == len(read_eps)
        assert eps.observations == read_eps.observations

    env = SampleEnv({})
    config = (
        BCConfig()
        .environment(
            observation_space=env.observation_space,
            action_space=env.action_space,
        )
        .offline_data(
            input_=[Path(path).as_posix()],
            dataset_num_iters_per_learner=1,
            input_read_episodes=True,
            input_read_batch_size=1
        )
        .learners(num_learners=0)
        .training(learner_connector=FlattenObservations)
    )
    algo = config.build()
    print("training")
    algo.train()
```

## Related issues
ray-project#57794, ray-project#50340 and internal
report by Supercell

---------

Signed-off-by: Mark Towers <[email protected]>
Signed-off-by: simonsays1980 <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
Co-authored-by: simonsays1980 <[email protected]>
…_env` (ray-project#58410)

## Description
Allow users to use environments that are already vectorized with
`SingleAgentEnvRunner`.
With `gymnasium.make_vec`, users have the option to either use the
`SyncVectorEnv` to vectorize a base environment or to directly create a
vector environment using the `vectorize_mode: gymnasium.VectorizeMode`.
This PR utilises the `env_runners(gym_env_vectorize_mode=...)` argument
to support `VectorizeMode.VECTOR_ENTRY_POINT`

```
import gymnasium as gym

config = ...
config.env_runners(
  gym_env_vectorize_mode=gym.VectorizeMode.VECTOR_ENTRY_POINT,
)
```

An important related change is that the vectorize mode now accepts
either the enum member (`VectorizeMode.ASYNC`, etc.) or
the enum value (`"async"`, etc.). Previously, the string form was
the enum name (`"ASYNC"`) rather than the enum value.
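
The difference between an enum's value and its name can be shown with a small stand-in enum. The class below is defined locally for illustration and only mirrors `gymnasium.VectorizeMode`; `parse_vectorize_mode` is a hypothetical helper, not RLlib's code.

```python
from enum import Enum
from typing import Union


class VectorizeMode(Enum):
    """Local stand-in mirroring gymnasium.VectorizeMode for illustration."""
    ASYNC = "async"
    SYNC = "sync"
    VECTOR_ENTRY_POINT = "vector_entry_point"


def parse_vectorize_mode(mode: Union[VectorizeMode, str]) -> VectorizeMode:
    """Accept the enum member or its value; reject the member name."""
    # Enum(value) looks up by value and returns members unchanged.
    return VectorizeMode(mode)
```

With this lookup, `"async"` and `VectorizeMode.ASYNC` resolve to the same member, while the old name-style string `"ASYNC"` raises `ValueError`.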

## Related issues
Completion of ray-project#57643.

## Additional information
ray-project#58397 must be merged first

We should apply a similar change to the `MultiAgentEnvRunner.make_env`

---------

Signed-off-by: Mark Towers <[email protected]>
Co-authored-by: Mark Towers <[email protected]>
using certificate saved in aws secret store now

Signed-off-by: Lonnie Liu <[email protected]>
## Description

ray-project#58234 increased the scale of the
`map_groups` release test from SF 10 to SF 100. Since then, the release
test has been consistently failing (see
ray-project#58312).

To avoid a perpetually broken release test, this PR reverts the scale to
SF 10 while we investigate and fix the scalability issue.

## Related issues

ray-project#58312

Signed-off-by: Balaji Veeramani <[email protected]>
…ect#58699)

## Description
Previously, we tried to slice the block when `self._total_pending_rows >=
self._target_num_rows` or when `flush_remaining` is True. However,
`flush_remaining` does not imply `self._total_pending_rows >=
self._target_num_rows`, so slicing could fail because the slicing logic
assumes there is at least one full block.

This PR fixes the logic and adds a test for this case.
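
A minimal sketch of the corrected control flow is shown below, with rows modeled as a plain list of integers. `take_blocks` is a hypothetical function and not Ray Data's actual block builder; it only illustrates separating the "full block" path from the "flush the remainder" path.

```python
def take_blocks(
    pending: list[int], target_num_rows: int, flush_remaining: bool
) -> list[list[int]]:
    """Slice `pending` rows into output blocks, mutating `pending` in place."""
    out = []
    # Slice only while at least one full block is available.
    while len(pending) >= target_num_rows:
        out.append(pending[:target_num_rows])
        del pending[:target_num_rows]
    # On flush, emit the leftover rows as a final, smaller block.
    if flush_remaining and pending:
        out.append(pending[:])
        pending.clear()
    return out
```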

---------

Signed-off-by: You-Cheng Lin <[email protected]>
Getting rid of the excessive `while True` loops & timeouts in the tests
(we already wait for the dashboard to be up).

Also just cleaned up some comments and naming while I was poking around.

---------

Signed-off-by: Edward Oakes <[email protected]>
@sourcery-ai sourcery-ai bot left a comment

The pull request #680 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5400.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive set of changes across the Ray repository, focusing on improving build configurations, dependency management, CI infrastructure, and code quality. The changes aim to enhance the stability, reproducibility, and maintainability of the Ray project.

Highlights

  • Bazel Configuration: Updates to .bazelrc to enable strict action environment, set compiler options for Windows, ignore warnings for third-party code, and adjust settings for CI and macOS.
  • Buildkite Configuration: Significant modifications to .buildkite/ configurations, including restructuring of _forge.rayci.yml, adding new _images.rayci.yml and dependencies.rayci.yml, and updating various build and test pipelines.
  • Dependency Management: Changes to dependency management, including updates to requirements.txt files, compilation of pip dependencies, and modifications to dependency-related scripts.
  • CI Infrastructure: Updates to CI infrastructure, including changes to Dockerfiles, build scripts, and test configurations.
  • Code Formatting and Linting: Updates to code formatting and linting, including changes to .pre-commit-config.yaml and related scripts.
  • Documentation Build: Updates to documentation build process, including changes to .readthedocs.yaml and related scripts.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a large automated merge that introduces a significant refactoring of the CI/CD and build systems. Key improvements include modularizing Buildkite pipelines and Bazel builds, standardizing artifact packaging with rules_pkg, and switching from miniconda to miniforge. The changes also drop support for x86_64 macOS and Python 3.9 in many areas, while enhancing linting with more pre-commit hooks and modernizing the C++ codebase. Overall, these changes significantly improve the maintainability, standardization, and efficiency of the build process. I've found one potential issue related to ensuring a clean build environment on macOS.
