
[Iron][nav2] error: Failed to find a free participant index for domain 0 #458

Open

doisyg opened this issue Jun 1, 2023 · 25 comments

@doisyg commented Jun 1, 2023

Bug report

Required Info:

  • Operating System:
    • Ubuntu 22.04
  • Installation type:
    • latest Iron binaries
  • DDS implementation:
    • rmw_cyclonedds_cpp

Steps to reproduce issue

With rmw_cyclonedds_cpp selected as the RMW implementation, start the standard navigation stack:

export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp
ros2 launch nav2_bringup navigation_launch.py

Expected behavior

Same as on Humble: all nodes start without errors.

Actual behavior

Multiple errors:

[INFO] [launch]: Default logging verbosity is set to INFO
[INFO] [controller_server-1]: process started with pid [2561141]
[INFO] [smoother_server-2]: process started with pid [2561143]
[INFO] [planner_server-3]: process started with pid [2561145]
[INFO] [behavior_server-4]: process started with pid [2561147]
[INFO] [bt_navigator-5]: process started with pid [2561149]
[INFO] [waypoint_follower-6]: process started with pid [2561151]
[INFO] [velocity_smoother-7]: process started with pid [2561153]
[INFO] [lifecycle_manager-8]: process started with pid [2561155]
[lifecycle_manager-8] 1685658158.898666 [0] lifecycle_: Failed to find a free participant index for domain 0
[lifecycle_manager-8] [ERROR] [1685658158.898760137] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[lifecycle_manager-8]
[lifecycle_manager-8] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[lifecycle_manager-8] This error state is being overwritten:
[lifecycle_manager-8]
[lifecycle_manager-8] 'error not set, at ./src/rcl/node.c:263'
[lifecycle_manager-8]
[lifecycle_manager-8] with this new error message:
[lifecycle_manager-8]
[lifecycle_manager-8] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[lifecycle_manager-8]
[lifecycle_manager-8] rcutils_reset_error() should be called after error handling to avoid this.
[lifecycle_manager-8] <<<
[lifecycle_manager-8] [ERROR] [1685658158.898851751] [rcl]: Failed to fini publisher for node: 1
[lifecycle_manager-8] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[lifecycle_manager-8] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[velocity_smoother-7] [INFO] [1685658158.900652561] [velocity_smoother]:
[velocity_smoother-7] velocity_smoother lifecycle node launched.
[velocity_smoother-7] Waiting on external lifecycle transitions to activate
[velocity_smoother-7] See https://design.ros2.org/articles/node_lifecycle.html for more information.
[planner_server-3] 1685658158.901798 [0] planner_se: Failed to find a free participant index for domain 0
[planner_server-3] [ERROR] [1685658158.901891797] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[planner_server-3]
[planner_server-3] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[planner_server-3] This error state is being overwritten:
[planner_server-3]
[planner_server-3] 'error not set, at ./src/rcl/node.c:263'
[planner_server-3]
[planner_server-3] with this new error message:
[planner_server-3]
[planner_server-3] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[planner_server-3]
[planner_server-3] rcutils_reset_error() should be called after error handling to avoid this.
[planner_server-3] <<<
[planner_server-3] [ERROR] [1685658158.901982797] [rcl]: Failed to fini publisher for node: 1
[planner_server-3] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[planner_server-3] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[controller_server-1] 1685658158.902240 [0] controller: Failed to find a free participant index for domain 0
[controller_server-1] [ERROR] [1685658158.902346717] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[controller_server-1]
[controller_server-1] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[controller_server-1] This error state is being overwritten:
[controller_server-1]
[controller_server-1] 'error not set, at ./src/rcl/node.c:263'
[controller_server-1]
[controller_server-1] with this new error message:
[controller_server-1]
[controller_server-1] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[controller_server-1]
[controller_server-1] rcutils_reset_error() should be called after error handling to avoid this.
[controller_server-1] <<<
[controller_server-1] [ERROR] [1685658158.902473321] [rcl]: Failed to fini publisher for node: 1
[controller_server-1] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[controller_server-1] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[behavior_server-4] 1685658158.903089 [0] behavior_s: Failed to find a free participant index for domain 0
[behavior_server-4] [ERROR] [1685658158.903185626] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[behavior_server-4]
[behavior_server-4] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[behavior_server-4] This error state is being overwritten:
[behavior_server-4]
[behavior_server-4] 'error not set, at ./src/rcl/node.c:263'
[behavior_server-4]
[behavior_server-4] with this new error message:
[behavior_server-4]
[behavior_server-4] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[behavior_server-4]
[behavior_server-4] rcutils_reset_error() should be called after error handling to avoid this.
[behavior_server-4] <<<
[behavior_server-4] [ERROR] [1685658158.903255114] [rcl]: Failed to fini publisher for node: 1
[behavior_server-4] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[behavior_server-4] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[waypoint_follower-6] [INFO] [1685658158.904875824] [waypoint_follower]:
[waypoint_follower-6] waypoint_follower lifecycle node launched.
[waypoint_follower-6] Waiting on external lifecycle transitions to activate
[waypoint_follower-6] See https://design.ros2.org/articles/node_lifecycle.html for more information.
[smoother_server-2] 1685658158.906080 [0] smoother_s: Failed to find a free participant index for domain 0
[waypoint_follower-6] [INFO] [1685658158.906139702] [waypoint_follower]: Creating
[smoother_server-2] [ERROR] [1685658158.906186368] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[smoother_server-2]
[smoother_server-2] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[smoother_server-2] This error state is being overwritten:
[smoother_server-2]
[smoother_server-2] 'error not set, at ./src/rcl/node.c:263'
[smoother_server-2]
[smoother_server-2] with this new error message:
[smoother_server-2]
[smoother_server-2] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[smoother_server-2]
[smoother_server-2] rcutils_reset_error() should be called after error handling to avoid this.
[smoother_server-2] <<<
[smoother_server-2] [ERROR] [1685658158.906262923] [rcl]: Failed to fini publisher for node: 1
[smoother_server-2] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[smoother_server-2] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421
[bt_navigator-5] 1685658158.906819 [0] bt_navigat: Failed to find a free participant index for domain 0
[bt_navigator-5] [ERROR] [1685658158.906923740] [rmw_cyclonedds_cpp]: rmw_create_node: failed to create domain, error Error
[bt_navigator-5]
[bt_navigator-5] >>> [rcutils|error_handling.c:108] rcutils_set_error_state()
[bt_navigator-5] This error state is being overwritten:
[bt_navigator-5]
[bt_navigator-5] 'error not set, at ./src/rcl/node.c:263'
[bt_navigator-5]
[bt_navigator-5] with this new error message:
[bt_navigator-5]
[bt_navigator-5] 'rcl node's rmw handle is invalid, at ./src/rcl/node.c:421'
[bt_navigator-5]
[bt_navigator-5] rcutils_reset_error() should be called after error handling to avoid this.
[bt_navigator-5] <<<
[bt_navigator-5] [ERROR] [1685658158.906995792] [rcl]: Failed to fini publisher for node: 1
[bt_navigator-5] terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
[bt_navigator-5] what(): failed to initialize rcl node: rcl node's rmw handle is invalid, at ./src/rcl/node.c:421

Additional information

Seems related to https://stackoverflow.com/questions/74232474/ros2-galacitc-failed-to-find-a-free-participant-index-for-domain
The workaround suggested there appears to solve the issue:
export CYCLONEDDS_URI=${CYCLONEDDS_URI:-"<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>"}
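(A note on the shell form: the ${VAR:-default} expansion only applies the workaround when CYCLONEDDS_URI is unset. Since CycloneDDS accepts a comma-separated list of configuration fragments, a variant that appends the fragment to any existing setting could look like the sketch below; the XML fragment is the one from the Stack Overflow workaround above.)

export CYCLONEDDS_URI="${CYCLONEDDS_URI:+${CYCLONEDDS_URI},}<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>"
echo "$CYCLONEDDS_URI"   # verify what Cyclone will actually see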

@SteveMacenski

Historically, Cyclone's worked great for us out of the box, so it would be great to get that back!

@eboasson (Collaborator) commented Jun 2, 2023

The Cyclone version didn't change from Humble to Iron, but the RMW layer in Iron initializes the configuration a little differently because of the work that went into getting sensible default behaviours. That might affect things; in particular, the code at

const bool discovery_off =

obviously plays with these settings.

Whatever is in the environment variable gets appended to the bit that the RMW layer computes, and therefore overrides the RMW defaults. Setting ParticipantIndex and MaxAutoParticipantIndex like that via CYCLONEDDS_URI should take care of it; note, though, that the ${CYCLONEDDS_URI:-...} form only sets the variable if it wasn't set before.

What I do find suspicious in the output is that it looks more like hitting Cyclone's default of 9 than like hitting the RMW's (raised) default of 32.

Doing

CYCLONEDDS_URI="$CYCLONEDDS_URI,<Tracing><Verbosity>fine</Verbosity><OutputFile>cdds.log.\${CYCLONEDDS_PID}</OutputFile></Tracing>"

should produce a bunch of log files, one per process. That would give us the details on the actual settings used, as well as on the number of participants. That might be worth checking.
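(Concretely, that can be run as below; Cyclone expands ${CYCLONEDDS_PID} at startup, so each process writes its own trace file. A minimal sketch based on the command above.)

export CYCLONEDDS_URI="$CYCLONEDDS_URI,<Tracing><Verbosity>fine</Verbosity><OutputFile>cdds.log.\${CYCLONEDDS_PID}</OutputFile></Tracing>"
ros2 launch nav2_bringup navigation_launch.py
ls cdds.log.*   # one trace file per process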

@doisyg (Author) commented Jun 2, 2023

Lots of log files (which I won't be able to interpret myself). I enabled that variable only for our navigation stack launch; see attached:
cdds.log.zip
This was run with, in addition:

export ROS_AUTOMATIC_DISCOVERY_RANGE=LOCALHOST
export RMW_IMPLEMENTATION=rmw_cyclonedds_cpp

@eboasson (Collaborator) commented Jun 2, 2023

that I won't be able to interpret myself

I know ...

Anyway, there are two logs from processes that do start (the two big ones), and in them we can see 31 processes and 31 port numbers (and it's the port numbers that matter here):

# sed -Ene 's;.*SPDP.*NEW.*udp/([^@]+).*__ProcessName...([^"]*).*__Pid...([^"]*).*;\1 \2 \3;p' cdds.log.2762943|sort  
192.168.229.75:7410 robot_state_publisher 2759180
192.168.229.75:7412 max_height_tf_node 2759184
192.168.229.75:7414 teleop_node 2759188
192.168.229.75:7416 teleop_node 2759190
192.168.229.75:7418 virtual_safety_system_node 2759194
192.168.229.75:7420 twist_mux 2759192
192.168.229.75:7422 velocity_smoother 2759200
192.168.229.75:7424 joy_node 2759186
192.168.229.75:7426 horizontal_3d_flattening_node 2759218
192.168.229.75:7428 lifecycle_manager 2759202
192.168.229.75:7430 lifecycle_manager 2759208
192.168.229.75:7432 horizontal_3d_flattening_node 2759238
192.168.229.75:7434 tower_dynamic_filtering_angle_node 2759214
192.168.229.75:7436 lifecycle_manager 2759198
192.168.229.75:7438 simple_move_node 2759204
192.168.229.75:7440 collision_monitor 2759196
192.168.229.75:7442 tower_dynamic_filtering_angle_node 2759216
192.168.229.75:7444 arri_undock_node 2759206
192.168.229.75:7446 pointclouds_to_merged_scan_node 2759270
192.168.229.75:7448 horizontal_3d_flattening_node 2759244
192.168.229.75:7450 horizontal_3d_flattening_node 2759236
192.168.229.75:7452 parameter_bridge 2759326
192.168.229.75:7454 wiferion_battery_state_from_sim_node 2759337
192.168.229.75:7456 dexory_sim_manager_node 2759329
192.168.229.75:7458 python3 2759212
192.168.229.75:7460 python3 2759182
192.168.229.75:7462 lifecycle_manager 2761014
192.168.229.75:7464 lifecycle_manager 2761018
192.168.229.75:7466 amcl 2761016
192.168.229.75:7468 map_server 2761012
192.168.229.75:7472 velocity_smoother 2762941
# sed -Ene 's;.*SPDP.*NEW.*udp/([^@]+).*__ProcessName...([^"]*).*__Pid...([^"]*).*;\1 \2 \3;p' cdds.log.2762943|sort|wc
      31      93    1489

Given the environment variable you quoted, one would expect 100.

# grep MaxAutoParticipant cdds.log.2762943
1685714825.615466 [0] lifecycle_: config: Domain/Discovery/MaxAutoParticipantIndex/#text: 32 {0}

So quite simply, it is using the value of 32 set at

config += "<MaxAutoParticipantIndex>32</MaxAutoParticipantIndex>";
and this is insufficient. (The {0} at the end means it found the value only in the first configuration fragment.)

At

config += std::string(config_from_env);
it appends the contents of the environment variable, so I suspect that for some reason or other the environment variable hasn't actually been set to raise it to 100. That's something I'd double check.
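(A quick way to double-check both halves of that, assuming the tracing configuration from the earlier comment is enabled: the {N} suffix in the config dump names the comma-separated fragment a value came from, with {0} being the RMW-computed fragment, so a value supplied via CYCLONEDDS_URI should show up with a higher fragment index.)

printenv CYCLONEDDS_URI                  # is the variable actually set in the nodes' environment?
grep MaxAutoParticipantIndex cdds.log.*  # which fragment supplied the value that was used?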

@doisyg (Author) commented Jun 4, 2023

Thanks for the detailed answer and investigation!

it appends the contents of the environment variable, so I suspect that for some reason or other the environment variable hasn't actually been set to raise it to 100. That's something I'd double check.

Is it safe to assume that in the meantime, using
export CYCLONEDDS_URI=${CYCLONEDDS_URI:-"<CycloneDDS><Discovery><ParticipantIndex>auto</ParticipantIndex><MaxAutoParticipantIndex>100</MaxAutoParticipantIndex></Discovery></CycloneDDS>"}
will ensure the exact same default behavior as in Humble (which I trust, having extensively tested it in production)?

@eboasson (Collaborator) commented Jun 5, 2023

Yes, that's a safe assumption.

@clalancette (Contributor)

The Cyclone version didn't change from Humble to Iron

FYI, this is not true. In Humble, we have CycloneDDS version 0.9.x, while in Iron we have 0.10.x. I don't know if that is the cause of this error, or if it comes from the rmw_cyclonedds layer, but it is different.

@eboasson (Collaborator) commented Jun 5, 2023

Oops! Thanks @clalancette for correcting that. I now remember having known it ...

I don't think Cyclone changed in this particular detail from 0.9 to 0.10.

@SteveMacenski

Are there any updates here?

@eboasson (Collaborator)

Not really; I assumed the immediate problem turned out to be easily dealt with using some configuration, and I kinda forgot about it.

I did realize yesterday that I might be able to do better in deciding whether to use multicast: there's something fishy in Linux with multicast over loopback, where it seems to work fine but the interface doesn't state it supports it. What I could do is add a trial multicast to see if it works despite not being advertised. That should allow eliminating this problem.

That said, one has to be careful in using multicast. Quite some effort was spent on figuring out how to prevent unwanted discovery over the network in Iron, and I think everybody is happy with the results. I don't want to break that. I do have an inkling of how to make it all play nicely, by getting some more information from the kernel for received packets, but that is something I definitely should try out first. I do think this is worth the effort.

@doisyg (Author) commented Jun 23, 2023

The multicast issue on loopback is a separate one. I believe most of us now know how to work around it (ip link set lo multicast on). It would be nice not to have it, sure, but for now I would prefer to focus on having the exact same behavior as on Humble, i.e. a working setup without configuration (no CYCLONEDDS_URI export).
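(For reference, the loopback workaround mentioned above, plus a quick check that it took effect: the MULTICAST flag should appear in the interface flags afterwards.)

sudo ip link set lo multicast on
ip link show lo   # flags should now include MULTICAST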

@eboasson (Collaborator)

Makes sense. I didn't realise the behaviour change (likely) occurred with that ip link set lo multicast on workaround in place, but with that bit of information the puzzle pieces start to fit. With https://docs.ros.org/en/foxy/Releases/Release-Iron-Irwini.html#improved-discovery-options and the automatic discovery range set to "localhost", the Cyclone RMW layer in Humble uses multicast on the loopback interface (and so puts no limit on the number of participants), but in Iron maps it to unicast discovery at

disable_multicast = true;
while limiting the number of participants to 32.

IIRC, this change was done after quite some investigation that eventually led to the conclusion that this was the only way within the existing configuration options to work around packets somehow making it from a network interface X into a socket only subscribed to multicast packets on lo. (I find that weird behaviour of the kernel ...)

Bumping the number will work around the problem but, as noted in the comment, it is not ideal.

I intend to do a Cyclone release soonish anyway, and to update Rolling to that new release. So let me try to figure out whether I can change Cyclone in some way that solves this problem; then we can worry about backporting it to 0.10.x for Iron, or consider moving Iron to the new release.

@doisyg (Author) commented Jun 26, 2023

Great to know that it is being worked on! Thanks

@afrixs commented Jul 10, 2023

For those who come here looking for a solution to the new participant-index issue: to preserve the previous behavior (the one not depending on participant indices), you need to manually set ParticipantIndex to none, e.g.

export CYCLONEDDS_URI='<CycloneDDS><Domain><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

Note: for some reason the multicast-disabling tag was not automatically added in my case, but you may also need to re-enable multicast, i.e. the whole config could look like:

export CYCLONEDDS_URI='<CycloneDDS><Domain><General><Interfaces><NetworkInterface name="lo"/></Interfaces><AllowMulticast>true</AllowMulticast></General><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

or in a config file (formatted for readability):

<CycloneDDS>
  <Domain>
    <General>
      <Interfaces>
        <NetworkInterface name="lo"/>
      </Interfaces>
      <AllowMulticast>true</AllowMulticast>
    </General>
    <Discovery>
      <ParticipantIndex>none</ParticipantIndex>
    </Discovery>
  </Domain>
</CycloneDDS>
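(If the file form is preferred, CYCLONEDDS_URI can point at it with a file URI; the path below is a placeholder.)

export CYCLONEDDS_URI=file:///absolute/path/to/cyclonedds.xml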

@SteveMacenski

@eboasson is there a chance to change this? It seems like a bug in the default settings for Cyclone users.

@eboasson (Collaborator)

Hi @SteveMacenski, it was changed in rmw_cyclonedds_cpp to support Iron's new discovery modes (specifically, to prevent a random multicast discovery packet from Ethernet wiggling its way into a loopback interface). Simply changing the default in Cyclone will break Iron's new behaviour, but I'm working on a change to Cyclone that makes it possible to support Iron's behaviour while using multicast on loopback. I have the code to extract the data I need from the kernel on all three platforms; now I "only" need to use it.

The plan then is to do a backport to 0.10 so Iron (and Humble) can get easy access to it.

@doisyg (Author) commented Oct 19, 2023

Friendly ping ;)

@doisyg (Author) commented Oct 20, 2023

Ping @imstevenpmwork

@eboasson (Collaborator)

@doisyg pong!

Have mercy or faith or, ideally, both ...

I just have too many things to juggle, but I haven't forgotten it and I will get there ...

@doisyg (Author) commented Oct 20, 2023

Have mercy or faith or, ideally, both ...

Thanks! I have both ;) Just wanted to be sure it is not forgotten.

@SteveMacenski

@eboasson just wanted to also follow up. This is very important for having the same great out-of-the-box behavior Cyclone has had up to this point! It may seem small from a tech perspective, but it's a big deal for users to buy into Cyclone over other options in initial testing!

@ros-discourse

This issue has been mentioned on ROS Discourse. There might be relevant details there:

https://discourse.ros.org/t/manipulation-wg-moveit-maintainer-meeting-november-30-rescheduled/34686/2

@firesurfer commented Jan 29, 2024

I ran into the same issue. The solution presented here (#458 (comment)) worked for me, except that I had to manually enable multicast on the loopback interface.

So, for the sake of completeness, if someone is looking for a solution:

export CYCLONEDDS_URI='<CycloneDDS><Domain><General><Interfaces><NetworkInterface name="lo"/></Interfaces><AllowMulticast>true</AllowMulticast></General><Discovery><ParticipantIndex>none</ParticipantIndex></Discovery></Domain></CycloneDDS>'

And manually enable multicast for the loopback interface:

sudo ifconfig lo multicast
sudo route add -net 224.0.0.0 netmask 240.0.0.0 dev lo
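(The same thing with the iproute2 tools, for systems without the legacy net-tools ifconfig/route; 224.0.0.0/4 covers the same range as netmask 240.0.0.0.)

sudo ip link set lo multicast on
sudo ip route add 224.0.0.0/4 dev lo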

@tompe17 commented Oct 28, 2024

Is this resolved in Humble? I just got this issue running Humble in an Apptainer container on Ubuntu 22.04; even "ros2 topic list" did not work. Setting CYCLONEDDS_URI according to the above solved the issue. The strange thing is that the Apptainer container running on Ubuntu 24.04 did work without setting CYCLONEDDS_URI.

@Danilrivero

Hello, I ran into the same issue with Jazzy on Ubuntu 24.04. This seems like an important default-configuration bug for many users of this RMW. Any updates on this issue? Is it worth opening a separate issue for Jazzy?

Thanks in advance.
