Skip to content

Commit

Permalink
service: only allow takeover if all mounts are initialized
Browse files Browse the repository at this point in the history
Summary:
The most common crash on macOS is when a graceful restart is performed while a
mount is ongoing. In this case, an EDEN_BUG is raised in EdenFS leading to
EdenFS to crash and restart.

As a very easy solution, let's check prior to starting graceful restart if
there are any ongoing mounts, and if so, just abort the graceful restart. Note
that there is still a potential race if a mount occurs right after this check,
but that should be significantly more rare and unlikely to happen in practice.

Reviewed By: kmancini

Differential Revision: D41640271

fbshipit-source-id: c0fd0cc0305cbcdd941aa0e2e422706af799d5b3
  • Loading branch information
xavierd authored and facebook-github-bot committed Dec 8, 2022
1 parent 81685da commit 1c86ea5
Show file tree
Hide file tree
Showing 3 changed files with 55 additions and 3 deletions.
12 changes: 12 additions & 0 deletions eden/fs/service/EdenServer.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2027,6 +2027,18 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
enumValue(state->state))));
}

if (getMountPoints() != getAllMountPoints()) {
return makeFuture<TakeoverData>(std::runtime_error(
"can only perform graceful restart when all mount points are initialized"));
// TODO(xavierd): There is still a potential race after this check if a
// mount is initiated at this point. Injecting a block below and starting
// a mount would manifest it. In practice, this should be fairly rare.
// Moving this further (in stopMountsForTakeover for instance) to avoid
// this race requires EdenFS to being able to gracefully handle failures
// and recover in these cases by restarting several components after they
// have been already shutdown.
}

// Make a copy of the thrift server socket so we can transfer it to the
// new edenfs process. Our local thrift will close its own socket when
// we stop the server. The easiest way to avoid completely closing the
Expand Down
8 changes: 7 additions & 1 deletion eden/integration/lib/edenclient.py
Original file line number Diff line number Diff line change
Expand Up @@ -450,7 +450,13 @@ def graceful_restart(self, timeout: float = EDENFS_START_TIMEOUT) -> None:
old_pid = self.get_pid_via_thrift()

self._process = None
self.start(timeout=timeout, takeover_from=old_pid)
try:
self.start(timeout=timeout, takeover_from=old_pid)
except Exception:
# TODO: There might be classes of errors where the old_process is
# gone and we do need to track the new process here.
self._process = old_process
raise

# Check the return code from the old edenfs process
return_code = old_process.wait()
Expand Down
38 changes: 36 additions & 2 deletions eden/integration/takeover_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -13,9 +13,9 @@
from typing import Dict, List, Optional

import pexpect
from eden.fs.cli.util import get_pid_using_lockfile, poll_until
from eden.fs.cli.util import EdenStartError, get_pid_using_lockfile, poll_until
from eden.thrift.legacy import EdenClient
from facebook.eden.ttypes import FaultDefinition, UnblockFaultArg
from facebook.eden.ttypes import FaultDefinition, MountState, UnblockFaultArg
from fb303_core.ttypes import fb303_status

from .lib import testcase
Expand Down Expand Up @@ -434,6 +434,40 @@ def state_shutting_down() -> Optional[bool]:

p.join()

def test_takeover_during_mount(self) -> None:
self.eden.unmount(self.mount_path)

with self.eden.get_thrift_client_legacy() as client:
client.injectFault(
FaultDefinition(keyClass="mount", keyValueRegex=".*", block=True)
)

try:
mountProcess = Process(target=self.eden.mount, args=(self.mount_path,))
mountProcess.start()

def mount_initializing() -> Optional[bool]:
with self.eden.get_thrift_client_legacy() as client:
for mount_info in client.listMounts():
if mount_info.mountPoint == self.mount_path_bytes:
if mount_info.state == MountState.INITIALIZING:
return True
return False
return None

poll_until(mount_initializing, timeout=60)

with self.assertRaisesRegex(
EdenStartError, "edenfs exited before becoming healthy"
):
self.eden.graceful_restart()
finally:
with self.eden.get_thrift_client_legacy() as client:
client.unblockFault(
UnblockFaultArg(keyClass="mount", keyValueRegex=".*")
)
mountProcess.join()


@testcase.eden_repo_test(run_on_nfs=False)
class TakeoverTestNoNFSServer(TakeoverTestBase):
Expand Down

0 comments on commit 1c86ea5

Please sign in to comment.