Skip to content

Fix expiration of action goals when EventsExecutors are used#3018

Open
skyegalaxy wants to merge 10 commits intoros2:rollingfrom
irobot-ros:smedeiros/fix-action-expire-leak
Open

Fix expiration of action goals when EventsExecutors are used#3018
skyegalaxy wants to merge 10 commits intoros2:rollingfrom
irobot-ros:smedeiros/fix-action-expire-leak

Conversation

@skyegalaxy
Copy link
Member

@skyegalaxy skyegalaxy commented Jan 16, 2026

Description

In the long actions ram usage section of the benchmarks iRobot shared back in September 2025, there was a memory leak identified in rolling for both the EventsExecutor implementations. I did some deeper profiling and found that it was the result of expired goal results never being cleared out.

This PR introduces changes to register an on_ready callback with the action server at the rcl layer. Events based executors will now be able to call set_on_ready_callback for EntityType::Expired events, and the callback will be called in the rcl timer layer to enqueue an expired event. This allows for goal expiration to happen through the expected code path of take_data_by_entity_id -> execute.

Followup to #2699.
Alternative to #3012
Requires ros2/rcl#1295

Is this user-facing behavior change?

Properly fixes a memory leak in actions where goal handles are stored indefinitely and never expired if using the EventsExecutor.

Did you use Generative AI?

No

Additional Information

EventsExecutor heaptrack allocation profile of irobot benchmark using 1MB payload, before:
image

after:
image

Copy link
Member

@mjcarroll mjcarroll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Overall, looks pretty good and solves the memory leak.

I have a few thoughts/concerns that we should take a look at before merging.

Copy link
Collaborator

@fujitatomoya fujitatomoya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the other major downside for Non EventExecutors is that it now calls execute_check_expired_goals twice on each expiration? once from the timer callback (which is added by this PR) and the another from the waitable's execute() call?

@jmachowinski
Copy link
Collaborator

Hm, we could add 'rcl_action_server_set_expired_callback' and handle this in the rcl layer to only call the callback if the timer expires etc.
Also the timer API feels off to me in regard to the events system. The timers handle different from the other entities, and the special case handling in the executors is odd. I wonder if we could modify the timer API in general to improve the situation...

@skyegalaxy
Copy link
Member Author

Something else to consider: If this leak's still happening for C++ based EventsExecutors, the coupling of goal expiration to the waitset implies that this could also be happening with the rclpy EventsExecutor. I'm not finding any analogous changes on the rclpy side to expire goals from a timer that isn't still using the waitset to check if it's expired.

Also, I've verified that this leak is also present in Kilted and whatever solution we arrive at will ultimately need to be backported.

@skyegalaxy
Copy link
Member Author

I've updated this PR with a new approach from the feedback from prior discussions, and added some rcl-side changes here: ros2/rcl#1295

// This timer callback will be exchanged at the RCL layer
// with a _timer_ callback that will call the _event_ callback
// passed in by set_on_ready_callback.
std::function<void()> timer_callback = [&] () {};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm curious, why do we need the & capture?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it was a typo that I forgot to remove! 861fde0

@ahcorde
Copy link
Contributor

ahcorde commented Mar 16, 2026

Pulls: #3018
Gist: https://gist.githubusercontent.com/ahcorde/0bc5606fd72612c49a013adf697f5819/raw/e02bd55b5e9e35474145b7fc6c491dba86282b6a/ros2.repos
BUILD args: --packages-above-and-dependencies rclcpp
TEST args: --packages-above rclcpp
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18492

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@jmachowinski
Copy link
Collaborator

@ahcorde this CI run will fail, it depends on ros2/rcl#1295

@skyegalaxy
Copy link
Member Author

Pulls: #3018, ros2/rcl#1295
Gist: https://gist.githubusercontent.com/skyegalaxy/c0ce90b3acad7ed083e70aa4363229f0/raw/6c48dda4690e6217f135ba3280fff8872580b690/gistfile1.txt
BUILD args: --packages-above-and-dependencies rcl
TEST args: --packages-above rcl
ROS Distro: rolling
Job: ci_launcher
ci_launcher ran: https://ci.ros2.org/job/ci_launcher/18500/

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

@skyegalaxy
Copy link
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Copy link
Member

@mjcarroll mjcarroll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any other feedback here. Thanks for iterating and thanks to everyone for the reviews. It looks like we are in good shape (with green CI with the rcl patch as well).

@jmachowinski
Copy link
Collaborator

@skyegalaxy Is the python events executor failure related ?

@skyegalaxy
Copy link
Member Author

@skyegalaxy Is the python events executor failure related ?

I'm not able to reproduce the failure locally on my machine after multiple consecutive attempts. I think it's not related. Regardless I'm going to run CI one more time to fix the windows complaints about atomic_uintptr_t to uintptr_t conversion.

@skyegalaxy
Copy link
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
Signed-off-by: Skyler Medeiros <skye.galaxy@proton.me>
@skyegalaxy
Copy link
Member Author

@skyegalaxy Is the python events executor failure related ?

actually nevermind, I did some more digging in gdb and the failures are related to this change. I'll try to get that fixed soon

src/ros2/rclpy/rclpy/test/test_events_executor.py [New Thread 0x7fffe7fff6c0 (LWP 151447)]                                                                                                                                                                   
[New Thread 0x7fffe77fe6c0 (LWP 151448)]                                                                                                                                                                                                                     
[New Thread 0x7fffe6ffd6c0 (LWP 151449)]                                                                                                                                                                                                                     
[New Thread 0x7fffe67fc6c0 (LWP 151450)]                                                                                                                                                                                                                     
[New Thread 0x7fffe5ffb6c0 (LWP 151451)]                                                                                                                                                                                                                     
[New Thread 0x7fffe57fa6c0 (LWP 151452)]                                                                                                                                                                                                                     
[New Thread 0x7fffe4ff96c0 (LWP 151453)]                                                                                                                                                                                                                     
[New Thread 0x7fffe47f86c0 (LWP 151454)]                                                                                                                                                                                                                     
[New Thread 0x7fffe3ff76c0 (LWP 151455)]                                                                                                                                                                                                                     
[New Thread 0x7fffe37f66c0 (LWP 151456)]                                                                                                                                                                                                                     
[New Thread 0x7fffe2ff56c0 (LWP 151457)]                                                                                                                                                                                                                     
[New Thread 0x7fffe27f46c0 (LWP 151458)]                                                                                                                                                                                                                     
[New Thread 0x7fffe1ff36c0 (LWP 151459)]                                                                                                                                                                                                                     
[New Thread 0x7fffe17f26c0 (LWP 151460)]                                                                                                                                                                                                                     
[New Thread 0x7fffd3fff6c0 (LWP 151461)]                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                             
Thread 1 "python3" received signal SIGSEGV, Segmentation fault.                                                                                                                                                                                              
0x00007ffff72e20f7 in _enqueue_check_expired_goals (timer=0x1a61948, last_call=10001000000, type_erased_event_callback=140737353939145) at /ws/src/ros2/rcl/rcl_action/src/rcl_action/action_server.c:118                                                    
118         typed_cb_with_data->callback(typed_cb_with_data->user_data, 1);                                                                                                                                                                                  
(gdb) bt                                                                                                                                                                                                                                                     
#0  0x00007ffff72e20f7 in _enqueue_check_expired_goals (timer=0x1a61948, last_call=10001000000, type_erased_event_callback=140737353939145) at /ws/src/ros2/rcl/rcl_action/src/rcl_action/action_server.c:118                                                
#1  0x00007ffff5d31b8a in rcl_timer_call_with_info (timer=0x1a61948, call_info=0x7fffffff81f0) at /ws/src/ros2/rcl/rcl/src/rcl/timer.c:322                                                                                                                   
#2  0x00007ffff59f2e93 in rclpy::events_executor::RclTimersManager::ClockManager::DispatchTimer (this=0x1dc9060, rcl_timer=0x1a61948) at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/timers_manager.cpp:256                                           
#3  0x00007ffff59f463b in rclpy::events_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}::operator()() const (                
    __closure=0x200a5b0) at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/timers_manager.cpp:188                                                                                                                                                        
#4  0x00007ffff59fd26c in std::__invoke_impl<void, rclpy::events_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}&>(std::__inv
oke_other, rclpy::events_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}&) (__f=...) at /usr/include/c++/13/bits/invoke.h:61 
#5  0x00007ffff59fabec in std::__invoke_r<void, rclpy::events_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}&>(rclpy::events
_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}&) (__fn=...) at /usr/include/c++/13/bits/invoke.h:111                       
#6  0x00007ffff59f779b in std::_Function_handler<void (), rclpy::events_executor::RclTimersManager::ClockManager::CallIfAlive<rcl_timer_s*>(void (rclpy::events_executor::RclTimersManager::ClockManager::*)(rcl_timer_s*), rcl_timer_s*)::{lambda()#1}>::_M_
invoke(std::_Any_data const&) (__functor=...) at /usr/include/c++/13/bits/std_function.h:290                                                                                                                                                                 
#7  0x00007ffff59e8c4c in std::function<void ()>::operator()() const (this=0x7fffffff8790) at /usr/include/c++/13/bits/std_function.h:591                                                                                                                    
#8  0x00007ffff59e889d in rclpy::events_executor::EventsQueue::Run (this=0x18d6030, deadline=std::optional = {...}) at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/events_queue.cpp:51                                                                
#9  0x00007ffff59b512e in rclpy::events_executor::EventsExecutor::spin (this=0x18d6000, timeout_sec=std::optional = {...}, stop_after_user_callback=false) at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/events_executor.cpp:179                     
#10 0x00007ffff59b5475 in rclpy::events_executor::EventsExecutor::spin_until_future_complete (this=0x18d6000, future=..., timeout_sec=std::optional = {...}, stop_after_user_callback=false)                                                                 
    at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/events_executor.cpp:197                                                                                                                                                                            
#11 0x00007ffff59bdaa8 in operator() (__closure=0x13f5498, exec=..., future=..., timeout_sec=std::optional = {...}) at /ws/src/ros2/rclpy/rclpy/src/rclpy/events_executor/events_executor.cpp:917                                                            
#12 0x00007ffff59c1869 in pybind11::detail::argument_loader<rclpy::events_executor::EventsExecutor&, pybind11::handle, std::optional<double> >::call_impl<void, rclpy::events_executor::define_events_executor(pybind11::object)::<lambda(rclpy::events_execu
tor::EventsExecutor&, pybind11::handle, std::optional<double>)>&, 0, 1, 2, pybind11::detail::void_type>(struct {...} &, std::index_sequence, pybind11::detail::void_type &&) (this=0x7fffffff8a30, f=...) at /usr/include/pybind11/cast.h:1480               
#13 0x00007ffff59c1018 in pybind11::detail::argument_loader<rclpy::events_executor::EventsExecutor&, pybind11::handle, std::optional<double> >::call<void, pybind11::detail::void_type, rclpy::events_executor::define_events_executor(pybind11::object)::<la
mbda(rclpy::events_executor::EventsExecutor&, pybind11::handle, std::optional<double>)>&>(struct {...} &) (this=0x7fffffff8a30, f=...) at /usr/include/pybind11/cast.h:1454                                                                                  
#14 0x00007ffff59c03f6 in operator() (__closure=0x0, call=...) at /usr/include/pybind11/pybind11.h:254                                                                                                                                                       
#15 0x00007ffff59c045e in _FUN () at /usr/include/pybind11/pybind11.h:224                                                                                                                                                                                    
#16 0x00007ffff58eba48 in pybind11::cpp_function::dispatcher (self=<PyCapsule at remote 0x7ffff5da1260>,                                                                                                                                                     
    args_in=(<test_rclpy._rclpy_pybind11.EventsExecutor at remote 0x7fffeea0aeb0>, <Future(_state=<FutureState(_value_='PENDING', _name_='PENDING', __objclass__=<EnumType(_generate_next_value_=<staticmethod at remote 0x7ffff5e18eb0>, __module__='rclpy.t
ask', __doc__='States defining the lifecycle of a future.', _new_member_=<built-in method __new__ of type object at remote 0xa43820>, _use_args_=False, _member_names_=['PENDING', 'CANCELLED', 'FINISHED'], _member_map_={'PENDING': <...>, 'CANCELLED': <Fu
tureState(_value_='CANCELLED', _name_='CANCELLED', __objclass__=<...>, _sort_order_=1) at remote 0x7ffff701b560>, 'FINISHED': <FutureState(_value_='FINISHED', _name_='FINISHED', __objclass__=<...>, _sort_order_=2) at remote 0x7ffff5e1b3e0>}, _value2memb
er_map_={'PENDING': <...>, 'CANCELLED': <...>, 'FINISHED': <...>}, _unhashable_values_=[], _member_type_=<type at remote 0xa43820>, _value_repr_=None, PENDING=<...>, CANCELLED=<...>, FINISHED=<...>, __new__=<function at remote 0x7ffff78c8ea0>) at remote
 0x1426980>, _sort_order...(truncated), kwargs_in=0x0) at /usr/include/pybind11/pybind11.h:946                                
#17 0x0000000000581e4f in cfunction_call (func=<built-in method spin_until_future_complete of PyCapsule object at remote 0x7ffff5da1260>, args=<optimized out>, kwargs=<optimized out>) at ../Objects/methodobject.c:537
--Type <RET> for more, q to quit, c to continue without paging--
#18 0x0000000000548e25 in _PyObject_MakeTpCall (tstate=0xba5748 <_PyRuntime+459656>, callable=<built-in method spin_until_future_complete of PyCapsule object at remote 0x7ffff5da1260>, args=<optimized out>, nargs=3, keywords=0x0)
    at ../Objects/call.c:240
#19 0x000000000054985d in _PyObject_VectorcallTstate (kwnames=<optimized out>, nargsf=<optimized out>, args=<optimized out>, callable=<optimized out>, tstate=<optimized out>) at ../Include/internal/pycore_call.h:90
#20 0x00000000005d71d9 in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/bytecodes.c:2706
#21 0x000000000054ca34 in _PyObject_VectorcallTstate (kwnames=('result',), nargsf=<optimized out>, args=0x7fffeeabc600, callable=<function at remote 0x7ffff7232200>, tstate=0xba5748 <_PyRuntime+459656>) at ../Include/internal/pycore_call.h:92
#22 method_vectorcall (method=<optimized out>, args=0x7fffeeabc608, nargsf=<optimized out>, kwnames=('result',)) at ../Objects/classobject.c:61
#23 0x000000000054b055 in PyObject_Call () at ../Objects/call.c:376
#24 0x00000000005db2ca in _PyEval_EvalFrameDefault (tstate=<optimized out>, frame=<optimized out>, throwflag=<optimized out>) at Python/bytecodes.c:3254
#25 0x000000000054a73a in _PyFunction_Vectorcall (kwnames=('result',), nargsf=9223372036854775809, stack=0x7ffff6e3a8d8, func=<function at remote 0x7ffff72323e0>) at ../Objects/call.c:419
#26 _PyObject_FastCallDictTstate (kwargs=<optimized out>, nargsf=<optimized out>, args=0x7fffffff9300, callable=<function at remote 0x7ffff72323e0>, tstate=0xba5748 <_PyRuntime+459656>) at ../Objects/call.c:144

@skyegalaxy skyegalaxy force-pushed the smedeiros/fix-action-expire-leak branch from 861fde0 to 2d4a356 Compare March 17, 2026 22:29
@skyegalaxy
Copy link
Member Author

  • Linux Build Status
  • Linux-aarch64 Build Status
  • Linux-rhel Build Status
  • Windows Build Status

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants