Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HPCC-33015 Improve system resilience when thor crashes #19309

Merged
merged 1 commit into from
Nov 22, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions common/workunit/workunit.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -14500,6 +14500,11 @@ void executeThorGraph(const char * graphName, IConstWorkUnit &workunit, const IP
// NB: check for expected success state (WUStateWait). If any other state, abort.
{
Owned<IWorkUnit> w = &workunit.lock();
//If the thor instance crashed, make sure that the workunit is no longer associated with it - otherwise a
//failure clause that causes a graph to run can abort because the instances has stopped.
if (w->getEngineSession() > 0)
w->setEngineSession(-1);

WUState state = w->getState();
if (WUStateWait != state) // expected state from successful Thor run from above
{
Expand Down
39 changes: 28 additions & 11 deletions thorlcr/master/thgraphmanager.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1473,8 +1473,13 @@ void thorMain(ILogMsgHandler *logHandler, const char *wuid, const char *graphNam
Owned<IConstWorkUnit> workunit;
factory.setown(getWorkUnitFactory());
workunit.setown(factory->openWorkUnit(currentWuid));
SessionId agentSessionID = workunit->getAgentSession();
if (agentSessionID <= 0)
SessionId agentSessionID = workunit ? workunit->getAgentSession() : 0;
if (!workunit)
{
WARNLOG("Discarding job with missing workunit wuid=%s, graph=%s", currentWuid.str(), currentGraphName.str());
currentWuid.clear();
}
else if (agentSessionID <= 0)
{
WARNLOG("Discarding job with invalid sessionID: wuid=%s, graph=%s (sessionID=%" I64F "d)", currentWuid.str(), currentGraphName.str(), agentSessionID);
currentWuid.clear();
Expand Down Expand Up @@ -1524,8 +1529,8 @@ void thorMain(ILogMsgHandler *logHandler, const char *wuid, const char *graphNam
}
}
}
currentGraphName.clear();

currentGraphName.clear();
if (lingerPeriod)
{
PROGLOG("Lingering time left: %.2f", ((float)lingerPeriod)/1000);
Expand All @@ -1545,15 +1550,27 @@ void thorMain(ILogMsgHandler *logHandler, const char *wuid, const char *graphNam
break; // timeout/abort
// else - reject/ignore duff message.
}
if (0 == currentGraphName.length()) // only ever true if !multiJobLinger

// The following is true if no workunit/graph have been received
// MORE: I think it should also be executed if lingerPeriod is 0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have created a separate jira to fix this (when Jake gets back)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ghalliday - lingerPeriod cannot be 1 (as defined by the values schema)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok.. agree, lingerPeriod is optional (min 1 if set), if it is not set (0).. it does look like it is not exiting correctly, checking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, lingerPeriod cannot be 0. It defaults to 60 seconds, and the schema will not allow it to be set 0.
So the if (lingerPeriod) condition is misleading, it will always enter this block of code in practice.

if (0 == currentGraphName.length())
{
// De-register the idle lingering entry.
Owned<IWorkUnitFactory> factory;
Owned<IConstWorkUnit> workunit;
factory.setown(getWorkUnitFactory());
workunit.setown(factory->openWorkUnit(currentWuid));
Owned<IWorkUnit> w = &workunit->lock();
w->setDebugValue(instance, "0", true);
if (!multiJobLinger)
{
// De-register the idle lingering entry.
Owned<IWorkUnitFactory> factory;
Owned<IConstWorkUnit> workunit;
factory.setown(getWorkUnitFactory());
workunit.setown(factory->openWorkUnit(currentWuid));
//Unlikely, but the workunit could have been deleted while we were lingering
//currentWuid can also be blank if the workunit this started for died before thor started
//processing the graph. This test covers both (unlikely) situations.
if (workunit)
{
Owned<IWorkUnit> w = &workunit->lock();
w->setDebugValue(instance, "0", true);
}
}
break;
}
}
Expand Down
Loading