WIP: Upgrading @matrixai/async-init, @matrixai/async-locks, @matrixai/db, @matrixai/errors, @matrixai/workers, Node.js and integrating @matrixai/resources #366
Conversation
RWLock is now in https://github.com/MatrixAI/js-async-locks, and the locking tests have moved there too. This can now be used together here. Note that we pretty much only use standard mutex usage, since it fits our use cases. Now anything that needs to acquire a single lock is just a matter of a single acquisition call. So in PK, we would bring in the new lock classes. |
MatrixAI/js-db#13 now allows us to use levels arbitrarily; there is no need to worry about reserved characters anymore, and no more base encoding of arbitrary level parts. Key parts always had the ability to contain any bytes.
Example of the fix:

```ts
@ready(new aclErrors.ErrorACLNotRunning())
public async sameNodePerm(
  nodeId1: NodeId,
  nodeId2: NodeId,
  tran?: DBTransaction,
): Promise<boolean> {
  const nodeId1Path = [...this.aclNodesDbPath, nodeId1.toBuffer()] as unknown as KeyPath;
  const nodeId2Path = [...this.aclNodesDbPath, nodeId2.toBuffer()] as unknown as KeyPath;
  if (tran == null) {
    return withF(
      [
        this.db.transaction(),
        this.locks.lockRead(
          dbUtils.keyPathToKey(nodeId1Path).toString('binary'),
          dbUtils.keyPathToKey(nodeId2Path).toString('binary'),
        ),
      ],
      async ([tran]) => this.sameNodePerm(nodeId1, nodeId2, tran),
    );
  }
  const permId1 = await tran.get(nodeId1Path, true);
  const permId2 = await tran.get(nodeId2Path, true);
  if (permId1 != null && permId2 != null) {
    return IdInternal.fromBuffer(permId1).equals(IdInternal.fromBuffer(permId2));
  }
  return false;
}
```

Furthermore, integration of timeouts into
#369 discovered that
3.6.3 of js-db builds in the canary check. We should catch the
The TS demo lib has now been upgraded to v16. You should be able to bring it in here in this PR. Make sure it's a separate commit.
@emmacasolin after the v16 upgrades, please refer to MatrixAI/js-encryptedfs#63 to know how to start upgrading all of our dependencies. Start with the easiest ones, like js-errors, then move to the harder ones, like js-resources. We should meet and discuss further once you get there. I may be upgrading our transaction system further based on lessons from EFS.
- Updated pkgs.nix to a5774e76bb8c3145eac524be62375c937143b80c
- Updated node2nix to 1.11.0 and loading directly from GitHub
- Updated pkg to 5.6.0 and using pkg-fetch 3.3 base binaries
- Updated to TypeScript 4.5+
- Updated @types/node to 16.11.7
- Updated node-gyp-build to 4.4.0
- Updated typedoc to 0.22.15; the .nojekyll generation is automatic
- Changed to target ES2021 as Node 16 supports it
- Bin executable duct-tape in default.nix because node2nix no longer generates bin executables
- Updated pkg build hashes to v16.14.2 and specified target node range
I've brought in the Node 16 upgrade here. The Nix shell is working fine along with npm commands (e.g. lint); however, anything involving a build is failing due to the existing WIP changes here, since there are errors. We will need to check the builds at some point before this PR is merged.
This path should now be
As explained in our chat, there's no problem with mixing PCC and OCC together. During any serialised counter update, add a PCC lock for that specific key. This will serialise updates to the counter. If other conflicts occur due to an overlapping write-set, an `ErrorDBTransactionConflict` will be thrown.
So is the nodeId of the
@emmacasolin to make it a bit more generic, instead of passing just the node ID, pass in a generic metadata object:

```ts
function promisifyUnaryCall<T>(
  client: Client,
  f: (...args: any[]) => ClientUnaryCall,
  clientMetadata?: POJO,
): (...args: any[]) => PromiseUnaryCall<T> {
```

This would allow you to pass any arbitrary data into the call metadata. It's a generic POJO, and optional too. Alternatively, don't make it optional, so that it's always enforced. If we're changing the API and you make it a mandatory parameter, then change it to be:

```ts
function promisifyUnaryCall<T>(
  client: Client,
  clientMetadata: POJO,
  f: (...args: any[]) => ClientUnaryCall,
): (...args: any[]) => PromiseUnaryCall<T> {
```
The place I'm referring to is here, for example:

```ts
function nodesCrossSignClaim({
  db,
  keyManager,
  nodeManager,
  sigchain,
  logger,
}: {
  db: DB;
  keyManager: KeyManager;
  nodeManager: NodeManager;
  sigchain: Sigchain;
  logger: Logger;
}) {
  return async (
    call: grpc.ServerDuplexStream<nodesPB.CrossSign, nodesPB.CrossSign>,
  ) => {
    const genClaims = grpcUtils.generatorDuplex(call, true);
    try {
      ...
```
Well because all of the Also I realised that
Yes, that's what I'm saying. But from my previous comment:
Ignoring the host/port, since that can be retrieved from
Well, now that we allow generic POJO data, you can just write mock data. It doesn't have to be a real node ID. This is for a testing situation.
What's the reason to encode the NodeId? Why not leave it as a normal NodeId when setting it in the data?
I'm adding it into the metadata of the
Yes, pass it as a second argument. Don't mutate the types; ensure separation of responsibility here.
The usual suspects of things breaking due to node v16 are:
Because vaults rely on nodes, and nodes rely on the network. Focus on fixing the network domain first, then investigate the problem with nodes (at the same time migrating to using the
Finally the
…ors properly: making use of the updated API to use the error provided to the resourceRelease.
We also need to reset the links in ZenHub.
New PR at #374. Work from there from now on. |
Description
Several core libraries have been updated and so they need to be integrated into PK. Refer to MatrixAI/js-encryptedfs#63 to see how this was handled for EFS and repeat here.
- `@matrixai/async-init` - no major changes
- `@matrixai/async-locks` - all instances of `Mutex` should change to use `Lock` or `RWLockWriter`
- `@matrixai/db` - the DB now supports proper DB transactions and has introduced `iterator`; however, no more sublevels (use `KeyPath`/`LevelPath` instead)
- `@matrixai/errors` - we make all of our errors extend `AbstractError<T>`, provide static descriptions to all of them, and use the `cause` chain
- `@matrixai/workers` - no major changes here
- `node.js` - we should upgrade to Node 16 in order to integrate promise cancellation, as well as using `Promise.any` for the ICE protocol
- `@matrixai/resources` - since `@matrixai/db` no longer does any locking, the acquisition of the `DBTransaction` and `Lock` has to be done together with `withF` or `withG`

Along the way, we can explore how to deal with indexing #188 and #257, which should be easier now that the DB has root levels of `data` and `transactions`.

Locking changes
There were three important types of locks brought in by `js-async-locks`: `Lock` (replacing the `Mutex` we were using previously), `RWLockReader`, and `RWLockWriter`.

In most cases, we are removing locks in favour of optimistic concurrency control (OCC). This means that most domain locks should be removed, with a few exceptions: `Discovery.ts`, the `NotificationId`s and message count used by `NotificationsManager.ts`, the `ClaimId`s and sequence numbers used by `Sigchain.ts`, and, if the set/ping node or refresh buckets queues introduced in Testnet Deployment #326 become persistent, then we would need locking there as well. This could be achieved for these cases by introducing a `LockBox` to the affected domains and locking the relevant keys when we access the DB, e.g. `withF([this.db.transaction(), this.locks.lock(...lockRequests)], async ([tran]) => {})`, where `this.locks` is a `LockBox` and `...lockRequests` is an array of `[KeyPath, Lock]` (see EFS `INodeManager.ts`).

In `NodeConnectionManager` and `VaultManager` we need to lock groups of objects; this can be done using a `LockBox` where we reference the same ID each time we want to lock a specific object.

Everywhere else we expect that conflicts will be rare, so we don't use locks, in order to simplify our codebase. In the case of a conflict, we can either retry (if safe) or bubble the error up to the user.
Errors changes
The new `js-errors` allows us to bring in error chaining, along with more standardised JSON serialisation/deserialisation (for sending errors across gRPC). With this error chaining ability, there are now three ways that we can handle/propagate errors:

1. Re-raise/re-throw the original error
2. Error override - throw a new error in place of the original
3. Error chain - throw a new error with the original attached as its `cause`

In all places where we are catching one error and throwing a different error in its place, we should be using approach 3 (error chain). If we just want to bubble the original exception upwards, then use approach 1 (re-raise/re-throw). Finally, if we want to hide the original error from the user (perhaps it contains irrelevant implementation details, or could be confusing and thus requires additional context), we can use approach 2 (error override). There is a fourth approach that exists in Python for errors that occur as a direct result of handling another error; however, this does not exist in TypeScript (in such a case we would use approach 3). When using approach 2 (and in some cases approach 3) you may want to log out the original error in addition to throwing the new error.
JSON serialisation/deserialisation
When sending errors between agents/from client to agent we need to serialise errors (including the error chain if this exists). Then, on the receiving side, we need to be able to deserialise back into the original error types.
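A simplified sketch of this round trip (illustrative only; the real `toJSON()`/`fromJSON()` in `js-errors` carry more state, and `ErrorPolykey` adds the `exitCode` shown here):

```ts
// Minimal error class with toJSON/fromJSON, standing in for ErrorPolykey
class ErrorDemo extends Error {
  public exitCode: number;
  constructor(message: string, exitCode: number = 1) {
    super(message);
    this.name = 'ErrorDemo';
    this.exitCode = exitCode;
  }
  // JSON.stringify() calls toJSON() automatically when it exists
  public toJSON(): { type: string; message: string; exitCode: number } {
    return { type: this.name, message: this.message, exitCode: this.exitCode };
  }
  public static fromJSON(json: { message: string; exitCode: number }): ErrorDemo {
    return new ErrorDemo(json.message, json.exitCode);
  }
}

// Reviver restores the original class during JSON.parse()
function reviver(key: string, value: unknown): unknown {
  if (
    typeof value === 'object' && value !== null &&
    (value as { type?: string }).type === 'ErrorDemo'
  ) {
    return ErrorDemo.fromJSON(value as { message: string; exitCode: number });
  }
  return value;
}
```

The same reviver pattern extends naturally to a `cause` chain, since `JSON.parse()` revives nested objects before their parents.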
We are able to do this using `JSON.stringify()` (serialisation) and `JSON.parse()` (deserialisation). These methods allow us to pass in a replacer/reviver to aid with converting our error data structure, combined with `toJSON()` and `fromJSON()` utility methods on the error class itself. These are implemented on `AbstractError` from `js-errors`; however, we need to extend them to work with `ErrorPolykey` in order to handle the additional `exitCode` property. While `toJSON()` can simply call `super.toJSON()` and add in the extra field, `fromJSON()` needs to be completely reimplemented (although this can be copied from `AbstractError` for the most part). Similarly, the replacer and reviver can be based on the replacer and reviver used in the `js-errors` tests.

ErrorPolykeyRemote
Errors are propagated between agents and clients as follows: `ErrorPolykeyRemote` is constructed on the client side (not the server side), inside our gRPC `toError()` utility. After the received error is deserialised, it is wrapped as the `cause` property of a new `ErrorPolykeyRemote`, which should also contain the `nodeId`, `host`, and `port` of the agent that originally sent the error in its `data` property. In order to access this information, it needs to be passed through from wherever the client/agent method is called (this would be bin commands for the client service and domain methods for the agent service). The data can then be passed through to the `promisify...()` methods, which in turn call `toError()`.

Testing
Now that we have an error chain, we need to adjust our tests to be able to perform checks on it. In many cases where we were originally expecting some specific `ErrorPolykey`, we will now be receiving an `ErrorPolykeyRemote` with the original error in its `cause` property. For simple cases like this it is enough to perform the existing checks on the `cause` property of the received error rather than on the top-level error; however, this approach becomes more complicated for longer error chains. Additionally, we may want to perform checks on the top-level `ErrorPolykeyRemote` (such as checking the metadata for the sending agent). In this case, it would be useful to create an expectation utility that allows one to perform checks on the entire error chain, from the top-level error to the final error in the chain. This could look something like this:
We could also think about using the `toJSON()` method on each error to allow us to use jest's `expect().toMatchObject()` matcher, rather than having to check every error property individually, and potentially include parameters to specify which properties of the error you do and don't want to check against.

Additional context
js-errors
Database changes
Changes include, but are not limited to:

- `withF`, `withG` locking directly
- The `ErrorDBTransactionConflict` error should never be seen by the user; we should catch and override it with a more descriptive error for the context
- Use `LevelPath` and `KeyPath`s instead
- `db.put`, `db.get` and `db.del` should be using transactions via `tran.put/get/del`
This applies to all domains that make use of the DB, or domains that depend on others that make use of the DB. The goal here is to make any operation, even starting from the handlers, atomic.

There are limitations to this, however: a transaction can fail if there are overlapping edits between transactions, so we can't really include changes to the DB that will commonly (or are guaranteed to) conflict. Examples of this are counters or commonly updated fields. So far this has been seen in:

- `NotificationsManager` - makes use of a counter, so any transactions that include adding or removing a notification WILL conflict. Reads also update metadata, so concurrently reading the same message WILL conflict.

In some cases we will need to make use of locking along with a transaction. A good example of this is in the `NotificationsManager`, where we are locking the counter update. When this is the case, we need to take extra care with the locking: unless the lock wraps the whole transaction, it is still possible to conflict on the transaction, and we can't compose operations that rely on this locking with larger transactions. An example of this problem is.
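As a sketch of the serialisation the counter needs (a minimal promise-queue mutex stands in for a `Lock` from `js-async-locks`; all names are hypothetical): the read and the write must both sit inside the critical section, which is exactly why the lock has to wrap the whole transaction rather than a fragment of it.

```ts
// Minimal promise-queue mutex, standing in for js-async-locks' Lock
class SimpleLock {
  private tail: Promise<void> = Promise.resolve();
  public async withLock<R>(f: () => Promise<R>): Promise<R> {
    const run = this.tail.then(f);
    // Keep the queue alive whether f resolves or rejects
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }
}

const counterLock = new SimpleLock();
let counter = 0;

// Serialised counter update: read and write inside one critical section
async function addNotification(): Promise<number> {
  return counterLock.withLock(async () => {
    const read = counter; // read the current count
    await new Promise((resolve) => setTimeout(resolve, 1)); // simulate async DB work
    counter = read + 1; // write back without losing concurrent updates
    return counter;
  });
}
```

If a larger transaction performed the read, then acquired this lock only for the write, two such transactions could still interleave and conflict, which is the composition problem described above.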
This means that some operations or domains can't be composed with larger transactions. It has yet to be seen whether this will cause an issue, since more testing is required to confirm any problem. I suppose this means we can't mix pessimistic and optimistic transactions. So far it seems it will be a problem with the following domains.
Node.js changes
After upgrading to Node v16 we will be able to bring in some new features.
- `Promise.any` - we are currently using `Promise.all` in our `pingNode()` method in `NodeConnectionManager` (this is being done in Testnet Deployment #326), but this should change to using `Promise.any`. This is because `Promise.all` waits for every promise to resolve/reject; however, we only care about whichever finishes first, and we want to cancel the rest. This change can be made in Testnet Deployment #326 after this PR is merged.
- `AggregateError` - this error is emitted by `Promise.any` if all of the given promises are rejected. In our case this would mean that we were unable to ping the desired node via direct connection or signalling (and eventually relaying, once this is implemented), so we may want to catch this and re-throw some other error to represent it. We will also need to add this into our error serialisation/deserialisation.
- `AbortController`
- there are a number of places identified in Asynchronous Promise Cancellation with Cancellable Promises, AbortController and Generic Timer #297 where we could use the new `AbortController`/`AbortSignal`. This can be done in a separate PR when rebasing Testnet Deployment #326.

Issues Fixed
- `VaultInternal` `with` context API functions with `NodeConnectionManager` #356
- `NodeGraph` bucket operations #244
- `vaults` domain #257
- `@matrixai/js-file-locks` to introduce RWLocks in IPC #290 - this will only be important for file-locking related operations for inter-process communication

Tasks
- Errors extend `AbstractError` and use the `cause` chain and static descriptions
- `db.get`, `db.put` and `db.del` to use `KeyPath`
- Remove `batch` and domain locking, and instead migrate to using `DBTransaction` and the `withF` and `withG` from `@matrixai/resources`
- Replace `createReadStream`, `createKeyStream` and `createValueStream` with `db.iterator`
- `tran?: DBTransaction` optional parameter in the last parameter position to allow one to compose a transaction context
- Have `DB` take a "prefix", thus recreating sublevels, but only for this use case. This would mean both EFS and PK use the same DB, with EFS completely controlling a lower level.
- `RWLock` can be used instead when doing concurrency control on the `DBTransaction` to raise the isolation level, to avoid non-repeatable reads, phantom reads, or lost updates.
- `@matrixai/async-locks` to use the `Lock` class instead of `Mutex` from `async-mutex`
- `@matrixai/resources` `withF` and `withG` to replace transactions
- `ResourceAcquire` making use of error handling

Tasks are divided by domain for conversion. Assignments are as follows. Domains labelled 'last' depend on a lot of other domains and will be divided later.
Final checklist
TBD spec:
Error changes:
Testing for an error with a cause.
DB changes:
domains - Each domain used to have a `transact` or `withTransaction` wrapper. If this wrapper doesn't actually need any additional context, and it is just calling `withF` and `withG` directly, avoid it and replace it with direct usage of the `withF` and `withG` utilities. We should identify domains that require additional context; in those cases we continue to use a with-transaction wrapper.

transaction, and its utilisation of SI transactions
In all cases, you just throw this exception up and propagate it all the way to the client.
The pk client, however, will change the message to be more specific to the user's action.
Retrying can be identified and sprinkled into the codebase afterwards.
Work out a way to write a user-friendly message when an `ErrorDBTransactionConflict` occurs in the PolykeyClient (at the CLI).
Identify where automatic retries can occur, and make those catch the `ErrorDBTransactionConflict` and resubmit the transaction.
Identify write-skews; this is where we must use solutions for dealing with write skews => locking and materialising the write conflict.
Identify serialisation requirements - counter updates, and where PCC is demanded as part of the API; in those situations use the LockBox or lock.
Handlers now need to start the transaction, for both agent and client.
We have `tran?: DBTransaction`, which allows public methods to create their own transaction context.
But this is only for debugging and convenience.
During production, the transaction context is meant to be set up at the handler level.
This means that each handler is its own atomic operation.
By default they start out at the handler level; this will make it easier for us to identify our atomic operations and to compare them.
Tests - introduce concurrency tests; consider some new concurrency expectation combinators that allow us to more easily say that one of these results has to be X, and the rest have to be Y.
Concurrency tests can be introduced domain by domain. Use `Promise.allSettled`.
We can however focus our concurrency testing at the handler level, because we expect that to be the unit of atomic operations.
All ops removed; changed to using `KeyPath` and `LevelPath`.
In the iterator:

- With `tran.iterator` you will always have the `key` no matter what, because it is being used for matching between the overlay and the underlying DB data.
- `keyAsBuffer` and `valueAsBuffer` are `true` by default; if you change them to `false`, it will decode the two and return them as decoded values. If you need both, use `dbUtils.deserialize()`.

Just think of the iterator as having the same snapshot as the original transaction.
Don't start writing concurrency tests until we have SI transactions. Serial tests are still fine.
Node v16 Changes:
Locking Changes:
All async-locks and async-mutex usage is to be replaced with `js-async-locks`.

In `js-async-locks`:

- Convert the object-map pattern to using a `LockBox` by referencing the same ID.
- Use `INodeManager` in EFS as a guide.