service: only allow takeover if all mounts are initialized

Summary: The most common crash on macOS is when a graceful restart is performed while a mount is ongoing. In this case, an EDEN_BUG is raised in EdenFS leading to EdenFS to crash and restart. As a very easy solution, let's check prior to starting graceful restart if there are any ongoing mounts, and if so, just abort the graceful restart. Note that there is still a potential race if a mount occurs right after this check, but that should be significantly more rare and unlikely to happen in practice. Reviewed By: kmancini Differential Revision: D41640271 fbshipit-source-id: c0fd0cc0305cbcdd941aa0e2e422706af799d5b3
facebook · Dec 8, 2022 · 1c86ea5 · 1c86ea5
1 parent 81685da
commit 1c86ea5
Show file tree

Hide file tree

Showing 3 changed files with 55 additions and 3 deletions.
diff --git a/eden/fs/service/EdenServer.cpp b/eden/fs/service/EdenServer.cpp
@@ -2027,6 +2027,18 @@ folly::Future<TakeoverData> EdenServer::startTakeoverShutdown() {
           enumValue(state->state))));
     }
 
+    if (getMountPoints() != getAllMountPoints()) {
+      return makeFuture<TakeoverData>(std::runtime_error(
+          "can only perform graceful restart when all mount points are initialized"));
+      // TODO(xavierd): There is still a potential race after this check if a
+      // mount is initiated at this point. Injecting a block below and starting
+      // a mount would manifest it. In practice, this should be fairly rare.
+      // Moving this further (in stopMountsForTakeover for instance) to avoid
+      // this race requires EdenFS to being able to gracefully handle failures
+      // and recover in these cases by restarting several components after they
+      // have been already shutdown.
+    }
+
     // Make a copy of the thrift server socket so we can transfer it to the
     // new edenfs process.  Our local thrift will close its own socket when
     // we stop the server.  The easiest way to avoid completely closing the

diff --git a/eden/integration/lib/edenclient.py b/eden/integration/lib/edenclient.py
@@ -450,7 +450,13 @@ def graceful_restart(self, timeout: float = EDENFS_START_TIMEOUT) -> None:
         old_pid = self.get_pid_via_thrift()
 
         self._process = None
-        self.start(timeout=timeout, takeover_from=old_pid)
+        try:
+            self.start(timeout=timeout, takeover_from=old_pid)
+        except Exception:
+            # TODO: There might be classes of errors where the old_process is
+            # gone and we do need to track the new process here.
+            self._process = old_process
+            raise
 
         # Check the return code from the old edenfs process
         return_code = old_process.wait()

diff --git a/eden/integration/takeover_test.py b/eden/integration/takeover_test.py
@@ -13,9 +13,9 @@
 from typing import Dict, List, Optional
 
 import pexpect
-from eden.fs.cli.util import get_pid_using_lockfile, poll_until
+from eden.fs.cli.util import EdenStartError, get_pid_using_lockfile, poll_until
 from eden.thrift.legacy import EdenClient
-from facebook.eden.ttypes import FaultDefinition, UnblockFaultArg
+from facebook.eden.ttypes import FaultDefinition, MountState, UnblockFaultArg
 from fb303_core.ttypes import fb303_status
 
 from .lib import testcase
@@ -434,6 +434,40 @@ def state_shutting_down() -> Optional[bool]:
 
             p.join()
 
+    def test_takeover_during_mount(self) -> None:
+        self.eden.unmount(self.mount_path)
+
+        with self.eden.get_thrift_client_legacy() as client:
+            client.injectFault(
+                FaultDefinition(keyClass="mount", keyValueRegex=".*", block=True)
+            )
+
+        try:
+            mountProcess = Process(target=self.eden.mount, args=(self.mount_path,))
+            mountProcess.start()
+
+            def mount_initializing() -> Optional[bool]:
+                with self.eden.get_thrift_client_legacy() as client:
+                    for mount_info in client.listMounts():
+                        if mount_info.mountPoint == self.mount_path_bytes:
+                            if mount_info.state == MountState.INITIALIZING:
+                                return True
+                            return False
+                return None
+
+            poll_until(mount_initializing, timeout=60)
+
+            with self.assertRaisesRegex(
+                EdenStartError, "edenfs exited before becoming healthy"
+            ):
+                self.eden.graceful_restart()
+        finally:
+            with self.eden.get_thrift_client_legacy() as client:
+                client.unblockFault(
+                    UnblockFaultArg(keyClass="mount", keyValueRegex=".*")
+                )
+            mountProcess.join()
+
 
 @testcase.eden_repo_test(run_on_nfs=False)
 class TakeoverTestNoNFSServer(TakeoverTestBase):