Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Crash while thread running after FileWatcherInotfiy destructed. #157

Open
ashione opened this issue Mar 9, 2023 · 16 comments
Open
Assignees
Labels
bug Something isn't working

Comments

@ashione
Copy link

ashione commented Mar 9, 2023

My main instruction likes

int main () {
FileWatcherInotify A = new FileWatcherInotify(...);
A.watch('...');
delete A;
return 0;
}

There are two threads cross running in process,

  • main thread
  • watcher thread

We found watcher thread crashed and this backtrace remained:

#4  0x00007f78d88248f3 in uw_update_context () from /usr/lib64/libgcc_s.so.1
#5  0x00007f78d88254ae in _Unwind_Backtrace () from /usr/lib64/libgcc_s.so.1
#6  0x00007f78d8759946 in backtrace () from /usr/lib64/libc.so.6
#7  0x00007f78da46f8cf in google::GetStackTrace(void**, int, int) () from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#8  0x00007f78da4707f9 in google::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) ()
   from /home/admin/.pyenv/versions/3.6.13/lib/python3.6/site-packages/ray/cpp/lib/libray_api.so
#9  <signal handler called>
#10 0x00007f7876d34f2d in efsw::Private::ThreadMemberFunc<efsw::FileWatcherInotify>::run (this=<error reading variable: Cannot access memory at address 0x27f7773ffe798>)
    at bazel-out/k8-dbg/bin/external/com_github_efsw/_virtual_includes/efsw/efsw/Thread.hpp:78

main thread stack :

#0  0x00007f78d862cac7 in __pthread_timedjoin_ex () from /usr/lib64/libpthread.so.0
#1  0x00007f7876d43541 in efsw::Platform::ThreadImpl::wait (this=0x7f777cfa36a0) at external/com_github_efsw/src/efsw/platform/posix/ThreadImpl.cpp:25
#2  0x00007f7876d3de3b in efsw::Thread::wait (this=0x7f777c14c2a0) at external/com_github_efsw/src/efsw/Thread.cpp:22
#3  0x00007f7876d3dd48 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:9
#4  0x00007f7876d3dd98 in efsw::Thread::~Thread (this=0x7f777c14c2a0, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/Thread.cpp:12
#5  0x00007f7876d2cf0b in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:50
#6  0x00007f7876d2d0d6 in efsw::FileWatcherInotify::~FileWatcherInotify (this=0x7f7768002fe0, __in_chrg=<optimized out>)
    at external/com_github_efsw/src/efsw/FileWatcherInotify.cpp:68
#7  0x00007f7876d2b351 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:67
#8  0x00007f7876d2b378 in efsw::FileWatcher::~FileWatcher (this=0x7f777c847160, __in_chrg=<optimized out>) at external/com_github_efsw/src/efsw/FileWatcher.cpp:68

According to above stackes, it's watcher deletion happened before watcher entry_point running, which means m_objest had been deleted and memory undetected in that time

// Specialization using a member function
template <typename C> struct ThreadMemberFunc : ThreadFunc {
	ThreadMemberFunc( void ( C::*function )(), C* object ) :
		m_function( function ), m_object( object ) {}
	virtual void run() { ( m_object->*m_function )(); }
	void ( C::*m_function )();
	C* m_object;
};
@SpartanJ SpartanJ self-assigned this Mar 9, 2023
@SpartanJ SpartanJ added the bug Something isn't working label Mar 9, 2023
@SpartanJ
Copy link
Owner

SpartanJ commented Mar 9, 2023

Hi!
Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly.
I'm unable to reproduce the bug with a minimal example:

#include <efsw/efsw.hpp>

int main( int argc, char** argv ) {
	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
	fileWatcher->watch();
	delete fileWatcher;
	return 0;
}

Neither with a watched dir:

#include <efsw/efsw.hpp>
#include <iostream>

class UpdateListener : public efsw::FileWatchListener {
  public:
	UpdateListener() {}

	void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,
						   std::string ) {}
};

int main( int argc, char** argv ) {
	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
	UpdateListener* ul = new UpdateListener();
	fileWatcher->addWatch( ".", ul );
	fileWatcher->watch();
	delete fileWatcher;
	delete ul;
	return 0;
}

I'm missing something?

@ashione
Copy link
Author

ashione commented Mar 9, 2023

Hi!

Why are you using FileWatcherInotify directly? That class is private. But as far as I can remember it should work if called directly.

I'm unable to reproduce the bug with a minimal example:

#include <efsw/efsw.hpp>



int main( int argc, char** argv ) {

	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

	fileWatcher->watch();

	delete fileWatcher;

	return 0;

}

Neither with a watched dir:

#include <efsw/efsw.hpp>

#include <iostream>



class UpdateListener : public efsw::FileWatchListener {

  public:

	UpdateListener() {}



	void handleFileAction( efsw::WatchID, const std::string&, const std::string&, efsw::Action,

						   std::string ) {}

};



int main( int argc, char** argv ) {

	efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();

	UpdateListener* ul = new UpdateListener();

	fileWatcher->addWatch( ".", ul );

	fileWatcher->watch();

	delete fileWatcher;

	delete ul;

	return 0;

}

I'm missing something?

I just want to make someone getting main code path and why it crashed,which does not means i use filewatcherinotfiy directly.
You might catch this from stack filewatcher was used.

@ashione
Copy link
Author

ashione commented Mar 9, 2023

In my usage scenario, it will appear once about an hour running. Especially in the case of high load machine, pthread_create returns does not mean that the thread has been created and run entry point function successfully. Actually you can deduce it from the way the code runs, which is indeed possible.
@SpartanJ

@SpartanJ
Copy link
Owner

SpartanJ commented Mar 9, 2023

Oh, I understand. I thought that your code snipped was a literal example.
If I understand well mIsActive should be only true when entryPoint is called / running and it should be atomic. Do you agree?

SpartanJ added a commit that referenced this issue Mar 9, 2023
@SpartanJ
Copy link
Owner

SpartanJ commented Mar 9, 2023

Let me know if the last commit helps with your issue.

@ashione
Copy link
Author

ashione commented Mar 10, 2023

Let me know if the last commit helps with your issue.

I pack this commit to my program but still crashed. So perhas it not works yet.

@SpartanJ
Copy link
Owner

Sorry, the fix I sent was for another kind of error and also was worst than before. I misinterpreted the crash completely. I think that now I have a clear picture of the issue but I have no idea how I can avoid it in 100% of the cases.
Reading the Linux documentation says basically that the thread can start at any time, I'm assuming that in your case is taking enough time to be called so late that the owner (the Thread instance and the ThreadImpl instance) does not exist anymore. I assumed that the call to pthread_join should prevent that (the Thread destructor calls to pthread_join before destroying ThreadImpl and its own instance), but it seems that pthread_join does nothing if the thread did not start. I sent another change that will check if the join worked, and if that didn't work will call to a pthread_cancel. If that doesn't work, I actually have no idea how this could be fixed.
In your use case, you're spawning many efsw::FileWatcher and destroying them very quickly or something alike? Otherwise, I don't see how this could happen at all.

@SpartanJ
Copy link
Owner

Also: I tried to replicate it by doing something stupid like:

int main( int argc, char** argv ) {
	int c = 0;
	UpdateListener* ul = new UpdateListener();
	while ( c++ < 10000 ) {
		efsw::FileWatcher* fileWatcher = new efsw::FileWatcher();
		fileWatcher->addWatch( ".", ul );
		fileWatcher->watch();
		delete fileWatcher;
	}

	delete ul;
	return 0;
}

And pthread_join is doing its job, it always waits for the entryPoint to run at completion.
That confuses me more.

@ashione
Copy link
Author

ashione commented Mar 11, 2023

I'm going to figure out the root cause of this issue and upload lastest clues ASAP.

@maddanio
Copy link

Maybe using c++11 threads helps?

@maddanio
Copy link

maybe try my PR that uses c++11 primitives for threading and locking:
#170

@SpartanJ
Copy link
Owner

Thanks for the collaboration. It seems that still needs some work to compile properly. You had this same issue? Because I wasn't able to reproduce it yet.

@maddanio
Copy link

what needs work? I have no problem comliling?

@maddanio
Copy link

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

@ashione
Copy link
Author

ashione commented Dec 30, 2023

no i did not, but the report somewhat lowered my trust in the hand rolled thread management, thus I pulled in the c++11 stuff, which I think is the best approach

Perphas it's better way to use standard thread library. Actually it's weird and no more detail information or core dump can be digged out.
I notice that filewatherinotify and some objects might not be threadsafe in high frequency creation or destruction.

@SpartanJ
Copy link
Owner

SpartanJ commented Jan 2, 2024

I agree, efsw was using custom thread implementation because it was created actually before C++11 and I kept compatibility for older compilers until just recently. But know it's possible to migrate everything to C++11 standard headers. The PR looks OK but I'll not merge it until I have enough time to test it on all platforms.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants