
Hardware interface lifecycle behavior does not follow ros2 design document #2079

Martin-Oehler opened this issue Feb 28, 2025 · 18 comments

@Martin-Oehler

Describe the bug
The hardware interface class SystemInterface implements the LifecycleNodeInterface, but its behavior does not follow the ros2 design document for managed nodes.

As described in the document, raising an error in the primary states should lead to a transition to the ErrorProcessing state. Instead, the controller manager crashes with an exception. This prevents any error handling in these situations and, in my opinion, somewhat defeats the purpose of the lifecycle logic. It even takes down any other hardware interfaces running in the same manager, which is fatal.

Furthermore, custom logic is implemented for state transitions via two services:

/controller_manager/set_hardware_component_state   # initiate a state transition
/controller_manager/list_hardware_components       # reports the lifecycle state as a field

These replace the interfaces usually offered by lifecycle nodes, which prevents reusing lifecycle management logic and code. There is also no latched/transient-local topic for transitions, which makes external management of the lifecycle challenging.
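
For illustration, initiating a transition through this custom interface looks roughly like the following (the component name my_hardware is hypothetical; the request fields follow controller_manager_msgs/srv/SetHardwareComponentState):

ros2 service call /controller_manager/set_hardware_component_state \
  controller_manager_msgs/srv/SetHardwareComponentState \
  "{name: my_hardware, target_state: {id: 3, label: active}}"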

To Reproduce
Return CallbackReturn::ERROR in either on_configure() or on_activate() (others not tested)
The controller manager node will crash with:

terminate called after throwing an instance of 'std::runtime_error'
what():  Failed to set the initial state of the component : hardware_interface to active
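
For reference, a minimal sketch of a component that triggers this (the class name is hypothetical and the remaining required SystemInterface overrides are omitted for brevity):

#include "hardware_interface/system_interface.hpp"

// Hypothetical component whose configure step fails, e.g. because the
// hardware is unreachable. Per the design document this should lead to
// ErrorProcessing; instead the controller manager throws and terminates.
class FailingSystem : public hardware_interface::SystemInterface
{
public:
  hardware_interface::CallbackReturn on_configure(
    const rclcpp_lifecycle::State & /*previous_state*/) override
  {
    return hardware_interface::CallbackReturn::ERROR;
  }
  // ... remaining required overrides omitted ...
};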

Expected behavior
The hardware interface should obey the design document and transition to the ErrorProcessing state.

Environment (please complete the following information):

  • OS: Ubuntu 24.04
  • ROS 2 version: Jazzy
@saikishor
Member

Hello @Martin-Oehler!

Regarding the issue: at the startup of ros2_control, this is indeed the desired behaviour. Once the startup is properly done, if any of the methods returns ERROR, the error transition callback is executed.

When I use this code in production and the robot is not able to start properly and set up all the controllers, I want it to crash and fail instead of continuing. For your case, you can control the initial state by setting the parameter hardware_components_initial_state (you can find more information here: https://control.ros.org/rolling/doc/ros2_control/controller_manager/doc/userdoc.html#parameters).

I hope this answers your questions :)

@bmagyar
Member

bmagyar commented Mar 2, 2025

One thing is pretty certain: you should never expect to have lifecycle node services for anything in ros2_control. That would massively violate the framework design. We use the API and try to align with the state machine as closely as possible so that there is something familiar for people.

@christophfroehlich
Contributor

I agree in terms of services and CLI, but should the CM really throw an exception if something goes wrong in on_configure() or on_activate()?

@bmagyar
Member

bmagyar commented Mar 2, 2025 via email

@saikishor
Member

I believe this is intentional only at startup; it was introduced in #1729:

for (const auto & [component, state] : components_to_activate)
{
  rclcpp_lifecycle::State active_state(
    State::PRIMARY_STATE_ACTIVE, hardware_interface::lifecycle_state_names::ACTIVE);
  if (
    resource_manager_->set_component_state(component, active_state) ==
    hardware_interface::return_type::ERROR)
  {
    throw std::runtime_error(
      "Failed to set the initial state of the component : " + component + " to " +
      active_state.label());
  }
}

If the user's intent is to set an initial state for the hardware component and that state cannot be set, it is better to fail at startup than later.

@saikishor
Member

Post startup, any change of component state between the lifecycle states is already bulletproofed.

@Martin-Oehler
Author

Martin-Oehler commented Mar 3, 2025

Thank you very much for your answers.

I cannot follow the reasoning why it should be better to crash the whole process instead of going back to Unconfigured as described in the design document.

Consider the following two use-cases:

  • The hardware is not ready for connection yet (e.g. still booting up). The driver should repeatedly try to connect until it succeeds.
  • An optional hardware component fails to connect due to hardware failure.

Neither of these cases can be covered by the current implementation, because the whole process just crashes and takes all interfaces with it. From my understanding, the whole point of the lifecycle system is to increase robustness and prevent exactly this case.

@christophfroehlich
Contributor

This is something worth discussing.

The point of view of @saikishor is that if someone specifies the initial state of the components (explicitly or implicitly) in the configuration files and the launch fails, then this is unintended behavior and should blow up.
If you know at startup that something might fail or not be available, then it is better to configure the components to stay unconfigured, or not to load them at all and use the hardware_spawner and/or service calls later.

@saikishor
Member

Yes, as @christophfroehlich explained, if the user's intention is to bring certain hardware components into a specific initial state and that is not respected, I think it's better to crash, so that you can encode other recovery behaviours, because all your startup controller spawners will no longer work, as they cannot get their interfaces while the hardware is not in the expected state.

For instance, for EtherCAT-based systems, sometimes you have to restart the EtherCAT master if something is not working. If my application crashes, I can decide based on the exit code whether to restart my EtherCAT master and restart my ros2_control_node app. If it simply stays active, there is currently no way to stop the spawners from spawning.

Moreover, if your hardware needs time to boot up and you need to wait until it succeeds, why not do this in your hardware component? You can take as long as you want, and in theory it shouldn't affect the RT loop, as this happens in the non-RT loop. If that's not what you want, then better to set the hardware initial state parameter to put your specific hardware into the unconfigured state and activate it with the hardware spawner.
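
A minimal sketch of such a wait inside the component's configure step (class name, helper, and retry bounds are illustrative assumptions, not part of any ros2_control API):

#include <chrono>
#include <thread>

#include "hardware_interface/system_interface.hpp"

// Hypothetical component that waits for the hardware in the non-RT
// configure step instead of failing immediately.
class WaitingSystem : public hardware_interface::SystemInterface
{
public:
  hardware_interface::CallbackReturn on_configure(
    const rclcpp_lifecycle::State & /*previous_state*/) override
  {
    // Assumption: the hardware needs at most ~30 s to boot.
    for (int attempt = 0; attempt < 30; ++attempt)
    {
      if (try_connect())  // hypothetical helper that contacts the hardware
      {
        return hardware_interface::CallbackReturn::SUCCESS;
      }
      std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    return hardware_interface::CallbackReturn::ERROR;
  }

private:
  bool try_connect();  // hypothetical; implementation omitted
  // ... remaining required overrides omitted ...
};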

@Martin-Oehler
Author

> If you know at startup that something might fail or not be available, then it is better to configure the components to stay unconfigured, or not to load them at all and use the hardware_spawner and/or service calls later.

I do not think this would be a solution for these two examples. Even if you delay the startup process by initiating the state transition at a later point with a service call, the setup might still fail and crash unless you know with absolute certainty that the hardware is already available, leading to duplicate logic.

> because all your startup controller spawners will no longer work, as they cannot get their interfaces while the hardware is not in the expected state.

Sorry, I do not have deep knowledge of the controller startup process, but shouldn't controllers just wait until the respective hardware interfaces become active? I assume there is already some logic implemented for that, as the configuration time of a hardware interface is unknown.

> For instance, for EtherCAT-based systems, sometimes you have to restart the EtherCAT master if something is not working. If my application crashes, I can decide based on the exit code whether to restart my EtherCAT master and restart my ros2_control_node app.

In some situations, the correct error-handling behavior might be crashing. But in others, it may not. My point is that the current implementation decides for you that the error will be handled by crashing.

If the ErrorProcessing state were entered instead, developers could implement their own fault recovery. That could be throwing an exception, but it could also be something else.
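
For instance, a sketch of what such recovery could look like if the on_error() callback were invoked as the design document describes (names are illustrative):

#include "hardware_interface/system_interface.hpp"

// Hypothetical recovery: on_error() cleans up and returns SUCCESS, which
// per the lifecycle design brings the component back to Unconfigured,
// ready for another configure attempt.
class RecoveringSystem : public hardware_interface::SystemInterface
{
public:
  hardware_interface::CallbackReturn on_error(
    const rclcpp_lifecycle::State & /*previous_state*/) override
  {
    disconnect();  // hypothetical cleanup of hardware resources
    return hardware_interface::CallbackReturn::SUCCESS;
  }

private:
  void disconnect() {}  // hypothetical
  // ... remaining required overrides omitted ...
};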

@saikishor
Member

> I do not think this would be a solution for these two examples. Even if you delay the startup process by initiating the state transition at a later point with a service call, the setup might still fail and crash unless you know with absolute certainty that the hardware is already available, leading to duplicate logic.

If you do it at a later point via a service call, it won't crash anymore. The crash happens only when the initial_state is not respected. From then on, via service calls, it respects the lifecycle design.

> Sorry, I do not have deep knowledge of the controller startup process, but shouldn't controllers just wait until the respective hardware interfaces become active? I assume there is already some logic implemented for that, as the configuration time of a hardware interface is unknown.

Not really, they do not wait for the hardware interface to become available. If the controller requests an interface that doesn't exist yet, it doesn't activate and aborts. There might be a case where the controller requests an invalid interface, and there is no way we can keep it waiting expecting someone to expose that particular interface. Do you get my point?

> In some situations, the correct error-handling behavior might be crashing. But in others, it may not. My point is that the current implementation decides for you that the error will be handled by crashing.

As long as you don't set the hardware_components_initial_state parameter (https://control.ros.org/rolling/doc/ros2_control/controller_manager/doc/userdoc.html#parameters), all components are activated by default. If you have hardware that is not deterministic in its state, I recommend setting that particular component to unconfigured and activating it later via services.

@Martin-Oehler
Author

I see, thanks. I will try to implement my desired behavior with your suggestions and update this ticket with my results.

@Martin-Oehler
Author

Martin-Oehler commented Mar 5, 2025

So I found a pretty simple solution to my problem.

As suggested, I do not activate the hardware interface at startup: I list it under the parameter hardware_components_initial_state.unconfigured of the controller manager. This ensures that the controller manager does not crash when on_configure() fails.

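For reference, the relevant controller manager parameter looks roughly like this (assuming the interface name flipper_interface used by the spawner below):

controller_manager:
  ros__parameters:
    hardware_components_initial_state:
      unconfigured:
        - "flipper_interface"
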
Next, I start spawners for the hardware interface as well as for all mandatory controllers, with respawn enabled:

launch:
- node:
    pkg: "controller_manager"
    exec: "spawner"
    name: "joint_state_broadcaster_spawner"
    output: "screen"
    args: "joint_state_broadcaster"
    respawn: "true"
    respawn_delay: 5.0

- node:
    pkg: "controller_manager"
    exec: "hardware_spawner"
    name: "hardware_interface_spawner"
    output: "screen"
    args: "flipper_interface --activate"
    respawn: "true"
    respawn_delay: 5.0

This will ensure that hardware and controllers are always transitioned to the active state when possible.

However, I still find it very unintuitive that the lifecycle transitions behave differently depending on whether the state is progressed by the controller manager or by the services interface. But as the latter works as expected, this solution works fine for me now.

@christophfroehlich
Contributor

@Martin-Oehler
Author

Martin-Oehler commented Mar 24, 2025

I think this page seems the most related: https://control.ros.org/rolling/doc/ros2_control/hardware_interface/doc/hardware_components_userdoc.html

It also does not contain information regarding the different handling of errors (transition by the controller manager vs. the service interface). I suggest reworking and extending the last paragraph, "Handling of errors that happen during read() and write() calls", into a more general "Error handling" paragraph.

Regarding my solution, I noticed a minor nit-pick. Respawning the hardware spawner repeatedly to transition the hardware interface to active still works when the interface is already active. However, the controller spawner complains in these cases because the controller is already active, and it exits with an error. This error spam can make it hard to spot actual errors.

@saikishor
Member

> It also does not contain information regarding the different handling of errors (transition by the controller manager vs. the service interface). I suggest reworking and extending the last paragraph, "Handling of errors that happen during read() and write() calls", into a more general "Error handling" paragraph.

If you can open a PR, we will be happy to review and merge it.

> Regarding my solution, I noticed a minor nit-pick. Respawning the hardware spawner repeatedly to transition the hardware interface to active still works when the interface is already active. However, the controller spawner complains in these cases because the controller is already active, and it exits with an error. This error spam can make it hard to spot actual errors.

I'm not sure I understand your case. Do you mind explaining it again? What do you mean by "the interface is already active"? I don't understand how the controller can activate if the HW is not yet available.

@Martin-Oehler
Author

> If you can open a PR, we will be happy to review and merge it.

Unfortunately, I will change employment soon and will lose access to my ROS 2 system. I will see what I can do.

> I'm not sure I understand your case. Do you mind explaining it again? What do you mean by "the interface is already active"? I don't understand how the controller can activate if the HW is not yet available.

For my solution, I want to ensure that A) all hardware interfaces are always active and B) the necessary controllers (e.g. the joint state broadcaster) are always active. B) is necessary because controllers are disabled when the hardware interface registers an error. This ensures robust behavior in case of any errors that can be recovered from.

I achieve this by starting two spawners with "respawn" and "respawn_delay" set. These periodically attempt to switch the hardware interface as well as the controllers to the active state. Naturally, because this happens periodically and unconditionally, the hardware interface and controllers will already be in the active state unless an error occurred in the meantime.

If this is the case, the hardware spawner prints a success message on the console because the interface is already in the desired active state. The controller spawner, on the other hand, prints an error because it was not able to switch the controller from "active" to "active". As this happens periodically, the console gets cluttered with (expected) error messages.

I hope this explanation makes it clear.

@saikishor
Member

> These periodically attempt to switch the hardware interface as well as the controllers to the active state. Naturally, because this happens periodically and unconditionally, the hardware interface and controllers will already be in the active state unless an error occurred in the meantime.

I don't understand why you have to periodically keep commanding the hardware interface and controllers to be active. Is it just so you recover from the error state and go back to the active state?

> If this is the case, the hardware spawner prints a success message on the console because the interface is already in the desired active state. The controller spawner, on the other hand, prints an error because it was not able to switch the controller from "active" to "active". As this happens periodically, the console gets cluttered with (expected) error messages.

This behavior might have been fixed by #2088. Can you try it out and see if it works for you? :)
