Implementing BIP39 recovery, Unattended Bootstrap and Agent Start and Status #283

tegefaulkes · 2021-11-01T03:43:31Z

Description

This PR adds in the BIP39 recovery seed functionality

Issues Fixed

checklist

Final checklist

Review checklist

- Check that the implementation matches the architectural specification (Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD #202 and Host and port binding environment variables for pk agent start #286)
- Is there adequate testing? Unit tests for new function/s, integration tests for combined functionalities, etc
- Are there any edge-cases that haven't been considered? Use your own understanding of the architectural spec for this
- Exceptions: are they named appropriately and according to standard? Are they defined in the correct place (i.e. correct domain)?
- Any residual TODOs/FIXMEs? If so, comment to contributor to move them to an issue

tegefaulkes · 2021-11-01T03:56:12Z

The new keygeneration functions like so.

new keynode. generate mnemonic and key and returns the mnemonic to the user somehow.
new keynode started with mnemonic generates they keypair from that mnemonic
An existing keynode just loads everything like normal.
recovering a keynode needs to regenerate the keypair and then check that the keypair matches the old one.

We don't want to overwrite the keypair with a different one since that will corrupt the keynode. to check if the new keypair matches we can check that the NodeIds match. more importantly we need to check that the keypair matches. Roger mentioned we can do this with a signed message. we can store a message that contains the NodeId and sign that. and verify it with the new recovered keypair. though we need to validate the private key not the public key so we likely need to decrypt the message and check that the NodeID matches.

tegefaulkes · 2021-11-01T05:27:32Z

Small problem with the generateDeterministicKeyPair() function. It's very slow. It takes 15 seconds to create a keyPair from the mnemonic.

CMCDragonkai · 2021-11-01T06:42:02Z

That makes a worthy candidate for WorkerManager. The KeyManager is supposed to use it too.

CMCDragonkai · 2021-11-01T06:42:47Z

@tegefaulkes please update the task list for this PR, copy initial task list from #202 and then expand it as you go along.

This is something @joshuakarp can help with tomorrow?

tegefaulkes · 2021-11-01T06:55:31Z

The keyManager does use the workerManager but that is currently set after creation so I'll need to make some changes to that. Also, will the worker manager actually speed this up at all?

I made a checklist in the Issue. I can copy that over here.

CMCDragonkai · 2021-11-01T07:27:17Z

The worker manager will enable other tasks to be done at the same time. But 15s it will still take unless the actual algorithm can be parallelized. Still good idea to use the worker manager. If you make sure to use ArrayBuffer.

On 1 November 2021 5:55:41 pm AEDT, Brian Botha ***@***.***> wrote: The keyManager does use the workerManager but that is currently set after creation so I'll need to make some changes to that. Also, will the worker manager actually speed this up at all? I made a checklist in the Issue. I can copy that over here. -- You are receiving this because you commented. Reply to this email directly or view it on GitHub: #283 (comment)

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

tegefaulkes · 2021-11-02T04:05:12Z

I am now checking the root cert to see if the keyPair generated from a mnemonic matches the node.

joshuakarp · 2021-11-02T04:18:27Z

I am now checking the root cert to see if the keyPair generated from a mnemonic matches the node.

Is this the process you're using?:

Use generateDeterministicKeyPair with provided mnemonic to create key
Create node ID from this keypair
Retrieve node ID from cert, and compare

I'm assuming then that the certificate is in plaintext, and we can just query the OID fields?

tegefaulkes · 2021-11-02T04:20:15Z

That pretty much sums it up.

tegefaulkes · 2021-11-02T05:07:30Z

Looks like pki.rsa.generateKeyPair function used in generateDeterministicKeyPair can take a workers parameter to speed up generation but it is still very slow.

It's not so much a problem when using polykey. generating the keypair only happens when creating the keynode or resetting/renewing the keypair. but it's blowing out the KeyManager test times. I need to add an optional keyPair parameter to the createKeyManager so we can override the generation just for testing.

CMCDragonkai · 2021-11-02T06:37:23Z

Will have some changes regarding async-init coming from https://gitlab.com/MatrixAI/Engineering/Polykey/js-polykey/-/merge_requests/213. Will be merging MR 213 prior to this and #266.

The CLI is situation is quite messy and has to go through a proper review. So I'm bundling my CLI review with the sessions MR merge. So some of the points discussed with @joshuakarp I'll include in that MR.

tegefaulkes · 2021-11-02T07:23:27Z

I added the ability to override the keypair generation when starting the KeyManager. this is mainly for testing so we can provide a keyPair generated without the mnemonic to speed up the testing. most tests should run just as fast except for the keyManager.test.ts since it needs to do renew/reset keypair tests.

We do need to see if we can speed up deterministic key generation at all. 15secs is not the end of the world when it only happens when creating a new node but it's still pretty bad.

tegefaulkes · 2021-11-03T07:23:05Z

While updating the keyPair files doesn't have any locking applied, the function I'm using to do it writes to temp files and then does an atomic rename. this is what's being used already when creating the files and when doing a keyPair renew/reset. However that is expected to be protected by the PolykeyAgent lockfile. so the recovery function still needs to check for that and do it safely.

src/PolykeyAgent.ts

src/config.ts

src/bin/utils.ts

src/bin/bootstrap/bootstrap.ts

src/bin/agent/start.ts

src/bin/options.ts

CMCDragonkai · 2021-11-04T01:05:39Z

Does this PR also apply the env variables to pk bootstrap command as well? If not, it should be an empty checklist item.

Any other things here to be done?

joshuakarp · 2021-11-04T01:27:19Z

Does this PR also apply the env variables to pk bootstrap command as well? If not, it should be an empty checklist item.

Any other things here to be done?

Yeah - there's already support for acquiring both password and recovery code on pk bootstrap. As far as I'm aware, the networking environment variables are not necessary for pk bootstrap.

However, I can't seem to see whether the usage of the recovery code has been integrated into pk bootstrap. My expectation is the following:

pk bootstrap --fresh # bootstrap a new node state

# this we can do later
pk recover --recovery-code-file=./rcf --password-file=./pf # just recovers the key (but doesn't start the agent)

# FOR: empty node state

# deterministic node state generation
pk agent start --recovery-code-file=./rcf
# should also be done with bootstrapping
pk bootstrap --recovery-code-file=./rcf

# generate key pair, and then set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf
# should also be done with bootstrapping
pk bootstrap --recovery-code-file=./rcf --password-file=./pf

# FOR: non-empty node state

# generate a keypair deterministically, check if the public key is correct, and if so, prompt for password (this is the new password)
pk agent start --recovery-code-file=./rcf
# unnecessary for pk bootstrap - only occurs on empty node state

# recover the key pair, if the public key is correct, set the password
pk agent start --recovery-code-file=./rcf --password-file=./pf 
# unnecessary for pk bootstrap - only occurs on empty node state

# the --fresh option ignores existing state (but not any existing lock)
pk agent start --recovery-code-file=./rcf --password-file=./pf --fresh
# should also be done with bootstrapping - force renewal (fails if agent running)
pk bootstrap --recovery-code-file=./rcf --password-file=./pf --fresh

Originally from #202 (comment)

@tegefaulkes am I missing this somewhere?

Still in process of reviewing.

tegefaulkes · 2021-11-04T01:31:14Z

I'm looking at it and fixing it now.

tegefaulkes · 2021-11-04T01:34:12Z

I had to do a small fix to it, but it's getting the env variables at line 54

  const password = await binUtils.getPassword(options.passwordFile);
  const recoveryCode = await binUtils.getRecoveryCode(options.recoveryFile);
  const fresh = !!options.fresh;

CMCDragonkai · 2021-11-04T01:36:04Z

The options.fresh should already be a boolean, no need to cast. You just make sure it's an option without a value.

CMCDragonkai · 2021-12-06T02:00:19Z

I've added --root-key-pair-bits option, so that tests can override it to 1024. This reduces it to about 9 seconds for the tests. However I remember seeing that it can sometimes double up. So the default time out is still 20s, and this is * 2 to 40s.

CMCDragonkai · 2021-12-06T03:17:08Z

I came across a situation where if we had structured error logging we would better see what the problem is...

Right now starting in the background with pkSpawn works except this error:

{"name":"ErrorRootKeysParse","description":"Polykey error","message":"Only 8, 16, 24, or 32 bits supported: 888","exitCode":1,"data":{},"stack":"ErrorRootKeysParse: Only 8, 16, 24, or 32 bits supported: 888\n    at Object.readRootKeyPair (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:665:13)\n    at async Object.setupRootKeyPair (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:597:23)\n    at async Object.start (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:164:35)\n    at async Object.start (/home/cmcdragonkai/Projects/js-polykey/node_modules/@matrixai/async-init/src/CreateDestroyStartStop.ts:75:20)\n    at async Function.createKeyManager (/home/cmcdragonkai/Projects/js-polykey/src/keys/KeyManager.ts:90:5)\n    at async Function.createPolykeyAgent (/home/cmcdragonkai/Projects/js-polykey/src/PolykeyAgent.ts:169:8)\n    at async main (/home/cmcdragonkai/Projects/js-polykey/src/bin/polykey-agent.ts:49:15)\n    at async /home/cmcdragonkai/Projects/js-polykey/src/bin/polykey-agent.ts:128:5"}

CMCDragonkai · 2021-12-06T04:09:16Z

The reason is because this was missing:

const nodePath = new commander.Option(
  '-np, --node-path <path>',
  'Path to Node State',
).env('PK_NODE_PATH')
 .default(config.defaults.nodePath);

So it was attempting to use the ~/.local/share/polykey as the node path.

CMCDragonkai · 2021-12-06T04:09:53Z

This means this Exception: ErrorRootKeysParse may be thrown with a variety of messages if the password is incorrect. We should look into improving this messaging to be better in the future for UX @joshuakarp.

joshuakarp · 2021-12-06T04:15:11Z

It'd be worthwhile to do a whole review of the error message displays as a whole when we get to that stage. There's similar ambiguities with thrown exceptions that wouldn't necessarily correlate to the actions of a user (e.g. ErrorNodeGraphEmptyDatabase when no nodes have been added to the database).

CMCDragonkai · 2021-12-06T04:54:56Z

Accidentally wrote the decription instead of description in Status errors. Fixing that up.

…nd and concurrent coalescing, propagate log level to background agent

CMCDragonkai · 2021-12-06T05:07:28Z

Pushed up my latest test fixes for agent/start.test.ts. This commit also has the new pkAgent utility as well in tests/bin/utils.test.ts that can be used to create our common agent. See if that works @joshuakarp.

CMCDragonkai · 2021-12-06T05:28:44Z

Current state now:


[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/[email protected] test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/agent/start.test.ts (49.938 s)
  start
    ✓ start in foreground with parameters (9793 ms)
    ✓ start in foreground with environment variables (9145 ms)
    ✓ start in background with environment variables (14456 ms)
    ✓ concurrent starts are coalesced (6561 ms)
    ✓ concurrent bootstrap is coalesced (6730 ms)

Test Suites: 1 passed, 1 total
Tests:       5 passed, 5 total
Snapshots:   0 total
Time:        49.974 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN

CMCDragonkai · 2021-12-06T05:49:13Z

The pk agent tests are getting pretty comprehensive now. I think a few other ones:

Test that SIGINT, SIGTERM, SIGQUIT SIGHUP all do graceful termination.
Test what happens when wrong password is used
Test the usage of recovery code
Test that if you cancel in the middle of starting (as in bootstrapping), that you can bootstrap again perfectly as long as you use --fresh.
Test the usage of --fresh as well

I'll fix up the bootstrap tests too, and then this will be merged. All other test issues are to be fixed in downstream PRs.

CMCDragonkai · 2021-12-06T08:34:42Z

So I'm removing:
    // This checks the if number of directory entries is greater than 1 due to status.json
    if (!fresh && (await fs.promises.readdir(nodePath)).length > 1) {
      throw new bootstrapErrors.ErrorBootstrapExistingState();
    }
From bootstrapState.

The main reason is that agent start cannot function in the same way because all of the state initialisation logic across the entire codebase is "partial" and "forgiving" to some extent.

So if there are files there, but nothing conflicting the pk states, they are just ignored.

If there are stuff that does conflict, the bootstrap and agent start will fail in a variety of exceptions. But that should signal to the user to have an empty directory or to use the --fresh.

I actually reversed this decision. This does make sense for the pk bootstrap command. However pk agent start it doesn't quite make sense. If existing state exists for pk agent start, it will just try to start it, and potentially create state when it does start. The only factor is existing conflicting state. But this is an ok trade off.

I've also added the fresh option with only --fresh, since -f is occupied by --format which is a more common option.

CMCDragonkai · 2021-12-06T09:26:07Z

Instead of having a separate test for each signal, I've just embedded them into each individual test to save some time. So some tests I'm using SIGTERM, another SIGINT... etc.

…upted start and bootstrap tests

CMCDragonkai · 2021-12-06T11:15:26Z

While testing the recovery code system, I came across a problem.

If the root key already exists, and the recovery code is passed in.

The way we check if our recovery code is correct is by doing:

        // Recover the key pair with the recovery code
        // Check if the generated key pair matches
        const rootKeyPairCheck = await this.generateKeyPair(bits, recoveryCode);
        if (!(await this.matchRootKeyPair(rootKeyPairCheck))) {
          throw new keysErrors.ErrorKeysRecoveryCodeIncorrect();
        }

But this relies on the bits. Which by default is 4096.

If the user selected a different bit size, they would have to remember the bit size and the recovery code together.

Instead of doing this, we need to instead save the bit size, or derive it from the saved public key.

Another problem with this mechanism is that if the root key files is in fact lost, and not just the password was forgotten, then this actually goes into generating a new root key pair, which would then be attempted to be used to decrypt the other key files like the db key and the vault key. In this scenario, if the recovery code is in fact wrong, then it would generate the wrong root key pair, and there would likely to be a decryption failure that occurs with the db key or vault key subsequently. This of course requires that both db key and vault key still exist on the filesystem.

In #164 an old issue, we explored the idea of hierarchical keys. This basically means the seed (recovery code) is used to generate all the keys in a hierarchical way which means we can recover all keys. However right now db keys and vault keys are just random separated keys, so the recovery code has limited utility. We'll have to address this later.

So basically the issue is that:

If you pass in the wrong bit size, or don't pass it in, but a non-default bit size is used, you'll get a ErrorKeysRecoveryCodeIncorrect.
If you don't have the root keypair anymore, then you won't be able to know what bit size was really used. So you'll just have to be lucky with the default bit size.
requires revisiting Hierarchical Key Generation #164. But 1. we can solve this by persisting or deriving the bit size somehow. Let's see.

CMCDragonkai · 2021-12-06T11:20:09Z

Turns out by reading the public key pem, we can acquire the public key bit size using our utility function that we already have in keys/utils:

function publicKeyBitSize(publicKey: PublicKey): number {
  return publicKey.n.bitLength();
}

CMCDragonkai · 2021-12-06T11:33:38Z

This is solved by using the above function and having a protected method called recoverRootKeyPair in KeyManager.

CMCDragonkai · 2021-12-06T11:54:46Z

[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/[email protected] test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/agent/start.test.ts (122.537 s)
  start
    ✓ start in foreground (9612 ms)
    ✓ start in background (14891 ms)
    ✓ concurrent starts are coalesced (9850 ms)
    ✓ concurrent bootstrap is coalesced (6947 ms)
    ✓ start with existing state (17818 ms)
    ✓ start when interrupted, requires fresh on next start (21816 ms)
    ✓ start from recovery code (38039 ms)

Test Suites: 1 passed, 1 total
Tests:       7 passed, 7 total
Snapshots:   0 total
Time:        122.576 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN

…it size from existing key pair, and using testBinUtils.processExit

… bootstrapping

CMCDragonkai · 2021-12-06T12:59:19Z

Bootstrap tests all done:


[nix-shell:~/Projects/js-polykey]$ npm test -- tests/bin/bootstrap.test.ts 

> @matrixai/[email protected] test /home/cmcdragonkai/Projects/js-polykey
> jest "tests/bin/bootstrap.test.ts"

Determining test suites to run...
GLOBAL SETUP
 PASS  tests/bin/bootstrap.test.ts (37.507 s)
  bootstrap
    ✓ bootstraps node state (5154 ms)
    ✓ bootstrapping occupied node state (9685 ms)
    ✓ concurrent bootstrapping are coalesced (7267 ms)
    ✓ bootstrap when interrupted, requires fresh on next bootstrap (11185 ms)

Test Suites: 1 passed, 1 total
Tests:       4 passed, 4 total
Snapshots:   0 total
Time:        37.548 s, estimated 38 s
Ran all test suites matching /tests\/bin\/bootstrap.test.ts/i.
GLOBAL TEARDOWN

Important that when testing concurrent work, do not use pkStdio as the mocks are global and conflict with each other. Could have used pkExec in my concurrent tests, but copied from my agent/start.test.ts to use pkSpawn.

CMCDragonkai · 2021-12-06T13:01:08Z

Both agent/start tests and bootstrap tests are passing. The only thing left is the additional agent subcommands to be refactored in tests/bin/. Will do this in a later PR. Which includes #286 (comment).

This is being merged now! @joshuakarp @emmacasolin rebase on top of master!

joshuakarp · 2021-12-07T22:21:20Z

FYI @CMCDragonkai running tests/bin/agent/start.test.ts on the new laptops:

josh@matrix-vostro-5410-1> npm test -- tests/bin/agent/start                                                                  ~/Documents/Projects/js-polykey

> @matrixai/[email protected] test /home/josh/Documents/Projects/js-polykey
> jest "tests/bin/agent/start"

Determining test suites to run...
GLOBAL SETUP
Creating global.binAgentDir: /tmp/polykey-test-bin
 PASS  tests/bin/agent/start.test.ts (140.284 s)
  start
    ✓ start in foreground (10829 ms)
    ✓ start in background (15737 ms)
    ✓ concurrent starts are coalesced (13062 ms)
    ✓ concurrent bootstrap is coalesced (11122 ms)
    ✓ start with existing state (18255 ms)
    ✓ start when interrupted, requires fresh on next start (20339 ms)
    ✓ start from recovery code (38224 ms)
    ✓ start with network configuration (10550 ms)

Test Suites: 1 passed, 1 total
Tests:       8 passed, 8 total
Snapshots:   0 total
Time:        140.31 s
Ran all test suites matching /tests\/bin\/agent\/start/i.
GLOBAL TEARDOWN
Destroying global.binAgentDir: /tmp/polykey-test-bin

joshuakarp · 2021-12-07T22:22:06Z

And the same on my own laptop:

[nix-shell:~/Documents/MatrixAI/js-polykey]$ npm test -- tests/bin/agent/start.test.ts 

> @matrixai/[email protected] test /home/josh/Documents/MatrixAI/js-polykey
> jest "tests/bin/agent/start.test.ts"

Determining test suites to run...
GLOBAL SETUP
Creating global.binAgentDir: /run/user/1000/polykey-test-bin
 PASS  tests/bin/agent/start.test.ts (151.772 s)
  start
    ✓ start in foreground (11874 ms)
    ✓ start in background (16135 ms)
    ✓ concurrent starts are coalesced (10801 ms)
    ✓ concurrent bootstrap is coalesced (10876 ms)
    ✓ start with existing state (20352 ms)
    ✓ start when interrupted, requires fresh on next start (23724 ms)
    ✓ start from recovery code (42431 ms)
    ✓ start with network configuration (11073 ms)

Test Suites: 1 passed, 1 total
Tests:       8 passed, 8 total
Snapshots:   0 total
Time:        151.827 s
Ran all test suites matching /tests\/bin\/agent\/start.test.ts/i.
GLOBAL TEARDOWN
Destroying global.binAgentDir: /run/user/1000/polykey-test-bin

CMCDragonkai · 2021-12-08T00:11:21Z

Ah great that timing is the same as mine.

tegefaulkes self-assigned this Nov 1, 2021

tegefaulkes changed the title ~~Generating Mnemonic and keypair.~~ Implementing BIP39 recovery Nov 1, 2021

tegefaulkes mentioned this pull request Nov 1, 2021

Enable un-attended bootstrapping for pk agent start and pk bootstrap with PK_RECOVERY_CODE and PK_PASSWORD #202

Closed

21 tasks

tegefaulkes mentioned this pull request Nov 3, 2021

Host and port binding environment variables for pk agent start #286

Closed

3 tasks