MQTT5 in Lambda, multi-initialization issue #360
Replies: 15 comments 7 replies
-
On possibility would be to try and webpack the lambda which should switch you to the browser implementation which does not use a NAPI native add on. |
Beta Was this translation helpful? Give feedback.
-
On second look, it seems like the distribution switch gets done in the typescript compilation step: https://github.com/awslabs/aws-crt-nodejs/blob/main/samples/browser/http/tsconfig.json#L40 If you're using typescript then a copy-paste of that module mapping could be a possibility. If you're not, there may be some pre-processing tool like babel that could give you a similar transformation on pure javascript. I don't think it would help, but does anything change if the client is internal to the function? Once you close the client, it's not usable anymore and with it being a global that feels a little weird. If you didn't call close at the end of the function, in theory the cached client may still be usable on a subsequent lambda invocation by a new call to start. When run in node, the SDK uses a NAPI addon to pull in native implementations of a lot of functionality (mqtt, http, auth/crypto, etc...). That native addon does not support being initialized more than once, which is the core problem you're running into. In the past, we only saw that behavior when used from a framework (Electron and similar could have it happen) that potentially ran more than one node instance in a single process. I don't know what Lambda does to try and avoid paying the cold start cost of launching node, but it seems unlikely that they'd spawn a second instance in the same sandboxed process. I'm sick right now, but when I'm coherent and able, I can also open an internal ticket to Lambda to find out more about how they cache across invocations within the same sandbox process. That might give us some clues as to an alternate workaround. The long-term fix of allowing multi-init of the NAPI module is potentially a substantial project and I would be concerned of use cases that triggered it arbitrarily like this because each module init of the CRT spawns multiple long-lived threads. Another possibility is that the multi-init problem you're facing isn't actually concurrent, instead when the module is finalized we forget to reset it to allow another init to happen (Lambda keeps node running). Checking that... Yeah, there is no reset at the moment. Comparing the multi-init check (https://github.com/awslabs/aws-crt-nodejs/blob/main/source/module.c#L1139-L1159) with the finalizer (https://github.com/awslabs/aws-crt-nodejs/blob/main/source/module.c#L1077-L1102), it's clear we don't do any kind of reset on the possibility of a re-initialization. That might actually work for the Lambda case, and would be a lot less work than the concurrent multi-init story. I can look into it. |
Beta Was this translation helpful? Give feedback.
-
How are you bundling the SDK into your lambda? I'm trying to repro, but running into the issue of getting the napi module to be found (AWS CRT binary not present in /bin/linux-x64/aws-crt-nodejs). In the interests of time-saving, following what you're doing may make repro quicker. On a more general note, we need better support in https://github.com/awslabs/aws-crt-nodejs/blob/main/lib/native/binding.js for alternative load paths for the NAPI modules. Just reading the code you can see how it will fail based on the directory assumptions once something has been packed into a unity javascript file. Unsure what the right way forward there is. |
Beta Was this translation helpful? Give feedback.
-
Thank your for the extremely easy repro setup. I have it on my local system and can now experiment with potential solutions |
Beta Was this translation helpful? Give feedback.
-
So good news and bad news: The bad - this is a genuine case of multi-init (not a reload of an unloaded module like I had guessed above). This won't be an easy fix. I know in theory how to do it, but the worry is that it will introduce some instability once we no longer have singletons for key resources. It won't be a huge delta, but it will be a delta in very delicate parts of the codebase. The good - this is such an easy repro case and this is such an unpleasant surprise, that I'm going to try and scope out the work to do the full multi-init fix. That being said, even if we address multi-init successfully, I have some significant concerns about the overall approach given that each invocation appears to be doing a full module load of the CRT. Underneath the hood, that means at least two threads spun up (one in the event loop group, one (or more) for DNS resolution) and those do not go away until node finalizes the module which we have no control over and may just not happen. Depending on the concurrency of lambda invocation, this could be a little ugly when scaled up. |
Beta Was this translation helpful? Give feedback.
-
An update: I have a proposed fix for multi-init here: awslabs/aws-crt-nodejs#451 The fix attempts to stay minimal and simple. In particular, no additional threads are going to spin up (other than the node worker thread) when multi-init occurs. All node worker threads (+ the main thread) will shared a single set of CRT resources and only when they all drop their references will a complete unload/cleanup occur. So some of my concerns in the reply above shouldn't be an issue. However, I've spent most of the last few days tracking down a related issue that I'm guessing you haven't encountered yet. Your repro, on a glibc-based Linux host (musl is unaffected), reliably crashes when the last worker thread that has a reference to the NodeJS CRT terminates. The reason for this is that some dependencies of the CRT have a bug (s2n) or flat out don't support (aws-lc) proper cleanup of the thread locals that they use. For s2n, you can see the related issues here: For aws-lc, without a public API for dealing with init/cleanup of thread-local data, we may end up having to move off of it as a dependency and instead use openssl's libcrypto (which has proper init/cleanup and doesn't crash). The upshot is that even with the multi-init fix, if you're running serverless and the JS CRT on a glibc-based Linux distribution, you will crash. My guess is you're probably developing on Mac or Windows and that's why you haven't complained about this. First steps first though, I'll try and get the multi-init PR pushed through to the SDK. |
Beta Was this translation helpful? Give feedback.
-
Related/Same Issue (worker shut down crash in linux): awslabs/aws-crt-nodejs#452 |
Beta Was this translation helpful? Give feedback.
-
As an update, the latest SDK release should fix the multi-init issue. We are still working with our dependencies to address the linux/glibc crash. |
Beta Was this translation helpful? Give feedback.
-
I've tested the proposed fixes from aws-lc and s2n and have verified that they resolve the crash. There's a separate resource leak (thread local key) issue that is also under consideration but it's slow and less pressing than resolving the crash. |
Beta Was this translation helpful? Give feedback.
-
The crash on unload should be fixed in https://github.com/aws/aws-iot-device-sdk-js-v2/releases/tag/v1.13.4 |
Beta Was this translation helpful? Give feedback.
-
Hi @bretambrose , @jmklix I had the time to test out the fixes, here is what I found: Local testing consisting of:
const connectionSuccess = once(client, 'connectionSuccess')
client.start()
await connectionSuccess
console.log('🟢Connected to AWS IoT')
await client.publish({...})
console.log('🟢Published message👌')
const stopped = once(client, 'stopped')
client.stop()
await stopped
console.log('Stopped client')
client.close()
console.log('Closed client')
const slsw = require('serverless-webpack')
const path = require('path')
const dev = process.env.NODE_ENV !== 'production'
module.exports = {
node: {
__dirname: true
},
target: 'node',
mode: dev ? 'none' : 'production',
entry: slsw.lib.entries,
devtool: 'source-map',
output: {
libraryTarget: 'commonjs',
path: path.resolve(__dirname, '.webpack'),
filename: '[name].js'
},
resolve: {
extensions: ['.ts', '.js']
},
module: {
rules: [{ test: /\.(ts|js)x?$/, exclude: /node_modules/, use: ['babel-loader'] }]
},
stats: {
preset: 'minimal'
},
externals: ['aws-sdk', '@aws-sdk/client-dynamodb', '@aws-sdk/lib-dynamodb']
} And webpack versions:
That is not good, since it may potentially kill lambdas while they are doing work.
even with a manual installation of Cloud testing consisting of:
Cloud results:
Maybe webpack is stripping the |
Beta Was this translation helpful? Give feedback.
-
I'll be starting oncall tomorrow and take a look at these as soon as I can, pending other priorities. |
Beta Was this translation helpful? Give feedback.
-
(@SlavenIvanov) Can you update your repro repo (https://github.com/SlavenIvanov/lambda-iot-device-sdk-repro) with the webpack changes that reproduce the first crash? I will attempt to apply the deltas you've listed above, but in case something is missing, an authoritative snapshot would be fantastic. I don't have any insight for the sporadic fail-to-load in serverless. For webpack + Lambda deploy, it's a known issue that the CRT attempts to load the native module at a hard-coded offset from the loading js file (https://github.com/awslabs/aws-crt-nodejs/blob/main/lib/native/binding.js#L22-L43) Bundling tools like webpack and others break that expectation. I'm not sure what the best possible solution is, but here are a couple of workarounds:
So as an example, if you bundle into a single index.js file and included the right binary at the same level: lambda.zip Then setting whatever environment variable we settle on to |
Beta Was this translation helpful? Give feedback.
-
Hi @bretambrose Thank you for taking the time to take a look at this. First off it was my mistake when I said that the missing binary error is sporadic. The I had falsely assumed that since I had been trying the reproduction in several repositories, one of which was not using webpack to build the project. When I use webpack, the issue is persistant. Second: I updated the repro: https://github.com/SlavenIvanov/lambda-iot-device-sdk-repro . If you have any issues with getting it running, please let me know. I have tried to switching the library implementation to the web versions.
And I get this error:
I've also tried defining Another solution that I tried was to use https://www.npmjs.com/package/node-loader with the following configuration in
And when I build and run the lambdas I still get the missing binaries error. When I sun So I tried to then import the binaries in the lambda as a file, so that It can get recognized by webpack and included into the build, like so:
After building I can see that the binaries then get included into the build
Again, thank you for your time. Best Regards, |
Beta Was this translation helpful? Give feedback.
-
Thanks for the update. Current state of your repo successfully repros the module loading issue but priority-wise I'm much more interested in the napi_release_threadsafe_function crash. What does it take to make that happen? |
Beta Was this translation helpful? Give feedback.
-
Hello,
We are trying to use
aws-iot-device-sdk-js-v2
library in AWS Lambda, because of missing features in the v3 js sdk (mainly subscribing to an mqtt topic).We have created a file that makes a singleton out of an mqtt5 client:
And then we import it into a Lambda handler:
The first time we call the Lambda everything goes as expected. The client connects, then disconnects
The second time we call it we get this:
Does anyone have any ideas on a workaround?
Beta Was this translation helpful? Give feedback.
All reactions