Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault when adding a short-lived node to an executor #777

Open
adamlm opened this issue Sep 8, 2024 · 6 comments
Open

Segmentation fault when adding a short-lived node to an executor #777

adamlm opened this issue Sep 8, 2024 · 6 comments
Assignees

Comments

@adamlm
Copy link

adamlm commented Sep 8, 2024

Bug report

Description

I am trying to set up an executor "spinner thread" that will spin a specified rclcpp::Executor until commanded to stop. My goal is to construct the executor and spinner thread in one scope then add nodes to that executor in another scope. However, the rclcpp::Executor classes cause a segmentation fault under certain conditions. There appears to be a race condition in the RMW layer that manifests with short-lived nodes.

  • Operating System: Ubuntu 22.04 (Docker container); Ubuntu 22.04 host

  • Installation type: Binaries

  • Version or commit hash:

    • ros-humble-rclcpp: 16.0.10

    • ros-humble-rmw-fastrtps-cpp: 6.2.7

    • ros-humble-fastrtps: 2.6.8

    • ros-humble-rmw-cyclonedds-cpp: 0.10.4

    • ros-humble-cyclonedds: 1.3.4

  • DDS implementation:

    • Fast-DDS (Fast-RTPS)
    • Cyclone DDS
  • Client library (if applicable): rclcpp

  • Compiler:

    • Clang 14
    • GCC 11.4.0

Steps to reproduce issue

Minimal reproducible example

#include <atomic>
#include <memory>
#include <thread>

#include <rclcpp/executors/multi_threaded_executor.hpp>
#include <rclcpp/node.hpp>

auto function() -> void
{
    rclcpp::init(0, nullptr);

    rclcpp::executors::MultiThreadedExecutor executor;

    std::atomic_bool stop_spinning{false};
    auto spinner_thread{std::thread([&] {
        while (rclcpp::ok() && !stop_spinning.load()) {
            executor.spin_some();
        }
    })};

    {
        auto node{std::make_shared<rclcpp::Node>("node")};
        executor.add_node(node);
    }

    stop_spinning.store(true);
    spinner_thread.join();

    rclcpp::shutdown();
}

auto main() -> int
{
    while (true) {
        function();
    }

    return 0;
}

Steps

  • Compile and run the above example.
  • Wait for a segmentation fault error. It usually happens within a minute.

Expected behavior

The program runs indefinitely.

Actual behavior

The program exits due to a segmentation fault.

Program output when compiled with Clang

Segmentation fault (core dumped)

Program output when compiled with GCC

Segmentation fault (core dumped)

In (only) one of the many runs I did, I got the following output (when compiled with GCC):

terminate called after throwing an instance of 'rclcpp::exceptions::RCLError'
  what():  failed to add guard condition to wait set: guard condition implementation is invalid, at ./src/rcl/guard_condition.c:172, at ./src/rcl/wait.c:460
Aborted (core dumped)

Additional information

Relevant core dump

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
   from /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
[Current thread is 1 (Thread 0x7f41531e8640 (LWP 1202848))]
(gdb) bt
#0  0x00007f4154232e3f in rmw_fastrtps_shared_cpp::__rmw_wait(char const*, rmw_subscriptions_s*, rmw_guard_conditions_s*, rmw_services_s*, rmw_clients_s*, rmw_events_s*, rmw_wait_set_s*, rmw_time_s const*) ()
    at /opt/ros/humble/lib/librmw_fastrtps_shared_cpp.so
#1  0x00007f415428a887 in rmw_wait(rmw_subscriptions_t*, rmw_guard_conditions_t*, rmw_services_t*, rmw_clients_t*, rmw_events_t*, rmw_wait_set_t*, rmw_time_t const*)
    (subscriptions=<optimized out>, guard_conditions=<optimized out>, services=<optimized out>, clients=<optimized out>, events=<optimized out>, wait_set=<optimized out>, wait_timeout=0x7f41531d5510) at ./src/rmw_wait.cpp:33
#2  0x00007f42759b1848 in rcl_wait () at /opt/ros/humble/lib/librcl.so
#3  0x00007f4274f2470c in rclcpp::Executor::wait_for_work(std::chrono::duration<long, std::ratio<1l, 1000000000l> >) (this=0x7ffdbe823c50, timeout=...) at /usr/include/c++/11/chrono:521
#4  0x00007f4274f28853 in rclcpp::Executor::spin_some_impl(std::chrono::duration<long, std::ratio<1l, 1000000000l> >, bool) (this=0x7ffdbe823c50, max_duration=..., exhaustive=false) at /usr/include/c++/11/chrono:537
#5  0x000063ca4387dbde in function()::$_0::operator()() const (this=0x63ca44fc84e8) at <redacted_executable_path>:17
#6  0x000063ca4387db1d in std::__invoke_impl<void, function()::$_0>(std::__invoke_other, function()::$_0&&) (__f=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
#7  0x000063ca4387daad in std::__invoke<function()::$_0>(function()::$_0&&) (__fn=...) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
#8  0x000063ca4387da85 in std::thread::_Invoker<std::tuple<function()::$_0> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:259
#9  0x000063ca4387da55 in std::thread::_Invoker<std::tuple<function()::$_0> >::operator()() (this=0x63ca44fc84e8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:266
#10 0x000063ca4387d979 in std::thread::_State_impl<std::thread::_Invoker<std::tuple<function()::$_0> > >::_M_run() (this=0x63ca44fc84e0) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
#11 0x00007f4273705253 in  () at /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007f427338dac3 in  () at /usr/lib/x86_64-linux-gnu/libc.so.6
#13 0x00007f427341f850 in  () at /usr/lib/x86_64-linux-gnu/libc.so.6

Other notes

  • The issue occurs when using either the SingleThreadedExecutor or MultiThreadedExecutor.

  • No segmentation fault occurs* when relocating node to the outer scope.

    auto node{std::make_shared<rclcpp::Node>("node")};
    
    {
        executor.add_node(node);
    }
  • No segmentation fault occurs* when using Cyclone DDS

  • No segmentation fault occurs* when sleeping the main thread for a short duration before adding node to the executor.

    {
        auto node{std::make_shared<rclcpp::Node>("node")};
        std::this_thread::sleep_for(std::chrono::seconds(1));
        executor.add_node(node);
    }

* I concluded a segmentation fault was unlikely after running the program for several minutes.

@fujitatomoya
Copy link
Collaborator

I am not sure if this is fixed with specific PR, but this problem cannot be observed with rolling.

fujitatomoya added a commit to fujitatomoya/ros2_test_prover that referenced this issue Sep 9, 2024
@christophebedard
Copy link
Member

Can you try removing the node from the executor before it goes out of scope?

@adamlm
Copy link
Author

adamlm commented Sep 10, 2024

I am not sure if this is fixed with specific PR, but this problem cannot be observed with rolling.

@fujitatomoya I ran the example code in the latest Rolling Docker image (rolling-ros-base, image ID/digest f875aa41a6f0) and got a segfault there too.

Container start command:

docker run --rm -it ros:rolling-ros-base

The only changes I made inside the container were:

  • updating apt's package list (not upgrading the packages themselves)
  • installing Vim
  • creating a workspace at /workspace/src
  • creating an C++ package with ros2 pkg create
  • adding the example code from above

Is there anything else I can provide to help pinpoint where the issue might be?

Can you try removing the node from the executor before it goes out of scope?

@christophebedard I made the following change to the example.

{
    auto node{std::make_shared<rclcpp::Node>("node")};
    executor.add_node(node);
    executor.remove_node(node);
}

The program ran fine for over 10 minutes running on Humble with FastDDS.

@christophebedard
Copy link
Member

christophebedard commented Sep 10, 2024

Can you try removing the node from the executor before it goes out of scope?

@christophebedard I made the following change to the example.

{
    auto node{std::make_shared<rclcpp::Node>("node")};
    executor.add_node(node);
    executor.remove_node(node);
}

The program ran fine for over 10 minutes running on Humble with FastDDS.

I think it's reasonable to expect the user to remove the node from the executor before it goes out of scope/gets destroyed. See this test which does something similar to your reproducer: https://github.com/ros2/rclcpp/blob/2f71d6e249f626772da3f8a1bb7c8d141d9d0d52/rclcpp/test/rclcpp/test_executor.cpp#L75-L102

I'm not sure if this would cover your full use-case, so please let us know.

@fujitatomoya
Copy link
Collaborator

@adamlm thanks for checking this! actually what i tried is to source build with rolling. so maybe latest patches in rolling source (not available as package) fix this issue.

@wjwwood wjwwood self-assigned this Oct 3, 2024
@TSNoble
Copy link

TSNoble commented Oct 25, 2024

We're having a similar issue when creating a short-lived subscriber. I think it may be the same since:

  • We're creating and attaching the subscriber within the scope of a method (specifically an action callback)
  • The gdb backtrace is very similar (crashes within the same place in rclcpp / rmw)
  • The method call is very quick (and adding a sleep within it seems to prevent the crash)
  • We occasionally get the same rclcpp::exceptions::RCLError

We were originally upgrading from noetic to humble. I noticed #728, which seemed to be resolved by ros2/rclcpp#2142. This PR appears to be in jazzy, however, upgrading to jazzy has not fixed the issue for us.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants