Fluid runtime crashes on network error #22214

Open
d2e opened this issue Aug 15, 2024 · 2 comments
Labels: bug (Something isn't working), community-contribution

Comments


d2e commented Aug 15, 2024

Describe the bug

Under certain conditions, when there is a network issue between the Fluid client and Routerlicious,
the result is an unhandled promise rejection, which causes the Node process to exit.

The exception that occurs is GenericNetworkError(errorMessage, errorMessage.startsWith("NetworkError")):
2024-07-31T13:19:35.839Z 377aa069-abf3-4366-8300-be03a126d813 ERROR Unhandled Promise Rejection {
    "errorType": "Runtime.UnhandledPromiseRejection",
    "errorMessage": "Error: R11s fetch error: \r\n<title>502 Bad Gateway</title>\r\n\r\n502 Bad Gateway\r\n\r\n\r\n",
    "reason": {
        "errorType": "genericNetworkError",
        "errorMessage": "R11s fetch error: \r\n<title>502 Bad Gateway</title>\r\n\r\n502 Bad Gateway\r\n\r\n\r\n",
        "omitPropsFromLogging": {},
        "_errorInstanceId": "29506e27-40f8-485b-ad56-23b989687a1c",
        "fluidErrorCode": "-",
        "statusCode": 502,
        "driverVersion": "2.0.0-internal.7.4.3",
        "canRetry": true,
        "endpointReached": true,
        "stack": [
            "Error: R11s fetch error: \r",
            "<title>502 Bad Gateway</title>\r",
            "\r",
            "502 Bad Gateway\r",
            "\r",
            "\r",
            "",
            " at createR11sNetworkError (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/errorUtils.cjs:61:21)",
            " at throwR11sNetworkError (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/errorUtils.cjs:73:26)",
            " at RouterliciousStorageRestWrapper.request (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/restWrapper.cjs:162:48)",
            " at runMicrotasks ()",
            " at processTicksAndRejections (node:internal/process/task_queues:96:5)",
            " at async /var/task/node_modules/@fluidframework/routerlicious-driver/dist/shreddedSummaryDocumentStorageService.cjs:100:31",
            " at async Function.timedExecAsync (/var/task/node_modules/@fluidframework/telemetry-utils/dist/logger.js:496:25)",
            " at async ShreddedSummaryDocumentStorageService.readBlob (/var/task/node_modules/@fluidframework/routerlicious-driver/dist/shreddedSummaryDocumentStorageService.cjs:95:23)"
        ]
    },
    "promise": {},
    "stack": [
        "Runtime.UnhandledPromiseRejection: Error: R11s fetch error: \r",
        "<title>502 Bad Gateway</title>\r",
        "\r",
        "502 Bad Gateway\r",
        "\r",
        "\r",
        "",
        " at process. (file:///var/runtime/index.mjs:1276:17)",
        " at process.emit (node:events:525:35)",
        " at emit (node:internal/process/promises:140:20)",
        " at processPromiseRejections (node:internal/process/promises:274:27)",
        " at processTicksAndRejections (node:internal/process/task_queues:97:32)"
    ]
}

To Reproduce

I am able to reproduce the error while using Wi-Fi as well: I stop the Wi-Fi during the collaboration, and the error occurs while the Fluid client is trying to create.
Steps to reproduce the behavior:
I am able to reproduce it, although it requires quite a few attempts; when I turn off the Wi-Fi in the middle of a collaboration session, I sometimes see this error.
It is very random; I have observed 1 or 2 such occurrences on a daily basis.
Since one of the users is running the Fluid client as a bot on top of AWS Lambda, the unhandled rejection causes it to crash (a stop-gap process-level guard is sketched after the stack trace below).
restWrapper.mjs:101 Uncaught (in promise) Error: NetworkError: Network request failed
at restWrapper.mjs:101:1
at async VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:207786:28
at async RouterliciousStorageRestWrapper.request (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:207784:21)
at async VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:208204:31
at async PerformanceEvent.timedExecAsync (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:48395:25)
at async ShreddedSummaryDocumentStorageService.readBlob (VM41070 main.03c9eba.4219a8761b37ae28a6d6.js:208199:23)
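
A minimal stop-gap sketch for the bot process (an illustration only, not part of the Fluid API, and it assumes a plain Node process where the application controls rejection handling; the Lambda runtime registers its own handler, so this alone may not be sufficient there):

// Stop-gap only: observe unhandled rejections and log them instead of letting
// a bare Node process exit. This masks the crash; it does not address the
// underlying driver behavior, which still needs a proper fix.
process.on("unhandledRejection", (reason: unknown) => {
	const message = reason instanceof Error ? reason.message : String(reason);
	console.error(`Unhandled rejection (e.g. R11s fetch error): ${message}`);
});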

Expected behavior

When I dug in a little, I noticed that a catch is missing in the ShreddedSummaryDocumentStorageService.readBlob function:

public async readBlob(blobId: string): Promise<ArrayBufferLike> {
	const cachedBlob = await this.blobCache?.get(this.getCacheKey(blobId));
	if (cachedBlob) {
		return cachedBlob;
	}

	const value = await PerformanceEvent.timedExecAsync(
		this.logger,
		{
			eventName: "readBlob",
			blobId,
		},
		async (event) => {
			const manager = await this.getStorageManager();
			const response = (await manager.getBlob(blobId)).content;
			event.end({
				size: response.size,
			});
			return response;
		},
		undefined, // workers
		this.mc.config.getNumber("Fluid.Driver.ReadBlobTelemetrySampling"),
	);
	this.blobsShaCache.set(value.sha, "");
	const bufferContent = stringToBuffer(value.content, value.encoding);
	await this.blobCache?.put(this.getCacheKey(value.sha), bufferContent);
	return bufferContent;
}

Logs

d2e added the bug label on Aug 15, 2024
Contributor

znewton commented Aug 22, 2024

Since the implementation of readBlob there appropriately uses async/await, I actually doubt that the issue is within readBlob itself. We would expect readBlob to throw its errors; catching them without rethrowing could cause problems in areas that should be retrying at a higher level.

Instead, I suspect the problem is within PrefetchDocumentStorageService, which attempts to pre-load the snapshot as soon as the version is requested, potentially saving a couple seconds on load. Within PrefetchDocumentStorageService, there are calls to readBlob that use the void prefix, like so:

private prefetchTree(tree: ISnapshotTree) {
	const secondary: string[] = [];
	this.prefetchTreeCore(tree, secondary);

	for (const blob of secondary) {
		// We don't care if the prefetch succeeds
		void this.cachedRead(blob);
	}
}

While the comment is correct that we don't care whether the prefetch succeeds, an error there surfaces as an unhandled rejection, which can crash the process.
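
As a sketch of the fix direction only (not the actual change in the driver), the prefetch loop could observe the rejection explicitly instead of discarding the promise with void:

for (const blob of secondary) {
	// We don't care if the prefetch succeeds, so swallow any rejection explicitly
	// rather than using `void`, which leaves a rejected promise unobserved.
	this.cachedRead(blob).catch(() => {
		// Intentionally ignored: prefetch failures are non-fatal.
	});
}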

Proof Script:

/**
 * Creates a rejected promise.
 */
async function throwErrorInPromise(id: string): Promise<never> {
  throw new Error(`Error - ${id}`);
}

(async function main(): Promise<void> {
  process.on("uncaughtException", (error: Error) => {
    // Log, instead of die, on uncaught exceptions.
    console.error(`Uncaught exception: ${error.message}`);
  });

  process.on("unhandledRejection", (error: Error) => {
    // Log, instead of die, on uncaught rejections.
    console.error(`Unhandled rejection: ${error.message}`);
  });

  // This should result in an Unhandled Rejection
  // which will be caught by the process.on("unhandledRejection") handler
  // and logged as "Unhandled rejection: Error - void call"
  void throwErrorInPromise("void call");
  // This should just log the error as "Caught error: Error - catch call"
  throwErrorInPromise("catch call").catch((error: Error) => {
    console.error(`Caught error: ${error.message}`);
  });
  // This should just log the error as "Try..catch error: Error - await call"
  try {
    await throwErrorInPromise("await call");
  } catch (error: unknown) {
    console.error(
      `Try..catch error: ${error instanceof Error ? error.message : error}`,
    );
  }
})();

Output

Caught error: Error - catch call
Try..catch error: Error - await call
Unhandled rejection: Error - void call

Contributor

znewton commented Aug 22, 2024

@d2e to mitigate the issue on your end until a fix is released, you can try setting the enablePrefetch driver policy to false when instantiating the RouterliciousDocumentServiceFactory in your app, like

new RouterliciousDocumentServiceFactory(myTokenProvider, { enablePrefetch: false });
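
For context, a slightly fuller sketch of where that option goes (myTokenProvider and the surrounding setup are placeholders from your own app, not something prescribed by the driver):

import { RouterliciousDocumentServiceFactory } from "@fluidframework/routerlicious-driver";

// myTokenProvider is whatever ITokenProvider implementation the app already uses.
const documentServiceFactory = new RouterliciousDocumentServiceFactory(myTokenProvider, {
	// Disable snapshot prefetching until a fix for the unobserved rejection is released.
	enablePrefetch: false,
});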

Let me know if you can/can't repro after making that change. Thanks!
