DBTransaction with Pessimistic Locking & Optimistic Locking and Deadlock Detection #17
Reading http://ithare.com/databases-101-acid-mvcc-vs-locks-transaction-isolation-levels-and-concurrency/ has clarified a few ideas that had been confusing.
Re-reading:
This seems to mean that snapshot isolation should be easy to do. However a newer development is serializable snapshot isolation (SSI). This prevents "write skew" anomalies that exist in snapshot isolation, which currently can only be addressed by introducing locking. I don't fully understand how to implement SSI, so I'll leave that for later. It seems SI is sufficient for most cases, and is what most DBs still default to.
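To make "write skew" concrete, here is a toy TypeScript sketch (hypothetical, not our actual `DBTransaction`): two SI transactions each read both keys, write disjoint keys, and both commit, jointly breaking an invariant that each one individually preserved.

```ts
type Store = Map<string, boolean>;

class SITransaction {
  protected snapshot: Store;
  protected writes: Map<string, boolean> = new Map();
  constructor(protected db: Store) {
    this.snapshot = new Map(db); // point-in-time copy taken at transaction start
  }
  get(key: string): boolean | undefined {
    return this.writes.has(key) ? this.writes.get(key) : this.snapshot.get(key);
  }
  set(key: string, value: boolean): void {
    this.writes.set(key, value);
  }
  commit(): void {
    // SI only validates write-write conflicts; the two write sets below
    // are disjoint ({a} and {b}), so both commits succeed
    for (const [k, v] of this.writes) this.db.set(k, v);
  }
}

// Invariant: at least one of `a`, `b` must stay true
const db: Store = new Map([
  ['a', true],
  ['b', true],
]);
const t1 = new SITransaction(db);
const t2 = new SITransaction(db);
if (t1.get('a') && t1.get('b')) t1.set('a', false);
if (t2.get('a') && t2.get('b')) t2.set('b', false);
t1.commit();
t2.commit();
console.log(db.get('a'), db.get('b')); // false false -- invariant broken
```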
Note that if we had the ability to just use rocksdb, this stuff would mostly all be available for free. While leveldb does support rocksdb as the backing database (https://github.com/Level/level-rocksdb), it doesn't expose any of the rocksdb transaction features discussed in https://www.mianshigee.com/tutorial/rocksdb-en/b603e47dd8805bbf.md, which is unfortunate! I'm thinking that it will take too much time to work out the C++ code and integration into nodejs right now (although we will have to do this soon), so I'll stick to attempting to implement this in JS/TS.
Note that we currently maintain written data in our transaction data path. We used to call this the "snapshot", but this is technically an incorrect term. The snapshot is supposed to be the point-in-time state of the DB when the transaction starts. Additionally, our implementation then implements COW on top of this. This means that our transaction data path is where we are "buffering" our writes, and they are overlaid on top of the underlying DB. In snapshot isolation, rather than overlaying on top of the underlying DB, they overlay on top of a "snapshot" of the underlying DB that was created when the transaction started. This snapshot is only accessible via the iterator.
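As a sketch of the distinction (hypothetical shapes, not our real API), the overlay logic is identical; what differs is where a missed read falls through to:

```ts
// Hypothetical shape for the write buffer ("transaction data path"),
// the live DB, and a point-in-time snapshot.
interface KV {
  get(key: string): Promise<string | undefined>;
}

// Our current behaviour: the overlay falls through to the *live* DB,
// which is why it is only read-committed.
async function getReadCommitted(dataPath: KV, db: KV, key: string) {
  const buffered = await dataPath.get(key);
  return buffered !== undefined ? buffered : await db.get(key);
}

// Snapshot isolation: same overlay, but it falls through to the snapshot
// taken when the transaction started.
async function getSnapshotIsolated(dataPath: KV, snapshot: KV, key: string) {
  const buffered = await dataPath.get(key);
  return buffered !== undefined ? buffered : await snapshot.get(key);
}
```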
For detecting a write conflict at the end, I was originally under the impression that we only need to check whether the key-set being written to has a different set of values compared to the transaction's snapshot. However most of the literature on snapshot isolation instead talks about timestamps, meaning that even if another transaction updated the value to the same value, this would still cause a write conflict. I'm not sure why it doesn't just check if the values are different, but maybe there's some edge case where checking for value difference would not be enough to ensure consistency. I've asked a question about this here: https://stackoverflow.com/questions/72084071/can-the-validation-phase-of-transactions-in-software-transactional-memory-stm. My guess at the moment is that it's faster to compare timestamps than to compare for value equality: the value might be quite large, but timestamp comparison is much cheaper. Also I believe these timestamps will be logical timestamps, not real-time timestamps. These logical timestamps would have to be strictly monotonic! We can do this with just a strictly increasing sequence counter.
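A minimal sketch of what timestamp-based validation could look like with such a counter (all names here are hypothetical):

```ts
let sequenceCounter = 0; // strictly monotonic logical clock

interface Version {
  value: string;
  committedAt: number; // logical timestamp, not a real-time timestamp
}

const committed = new Map<string, Version>();

class OccTransaction {
  readonly startedAt = sequenceCounter; // logical timestamp at txn start
  readonly writes = new Map<string, string>();

  commit(): void {
    // Validation compares timestamps only -- never values. Any key in our
    // write set committed after we started means another transaction won.
    for (const key of this.writes.keys()) {
      const version = committed.get(key);
      if (version !== undefined && version.committedAt > this.startedAt) {
        throw new Error(`write conflict on ${key}`);
      }
    }
    const commitTimestamp = ++sequenceCounter;
    for (const [key, value] of this.writes) {
      committed.set(key, { value, committedAt: commitTimestamp });
    }
  }
}
```

Note how validation never inspects values, only `committedAt` versus `startedAt`; this is what makes the same-value update still count as a conflict.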
I want to note that our write buffer is placed in the DB (disk) itself and is therefore unbounded by memory. However our write operations are buffered in-memory and not in the DB. You might wonder why not push all write operations into the DB transaction data path as well? Because at the end you still end up reading them all into memory to perform the write batch, so there's no advantage to putting the operations onto disk. Therefore, no matter what, our transaction sizes are still bounded by memory. This does impact the scalability of large atomic operations such as those in EFS: MatrixAI/js-encryptedfs#53.
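A sketch of why, assuming a leveldb-style `batch` (the shapes here are hypothetical): the ops array is necessarily in memory at commit time anyway.

```ts
type Op =
  | { type: 'put'; key: Buffer; value: Buffer }
  | { type: 'del'; key: Buffer };

interface BatchDB {
  batch(ops: Array<Op>): Promise<void>;
}

async function commit(db: BatchDB, ops: Array<Op>): Promise<void> {
  // Even if `ops` had been spooled to disk, we would read them all back
  // into this in-memory array before handing them to the atomic batch,
  // gaining nothing from the detour.
  await db.batch(ops);
}
```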
https://www.sobyte.net/post/2022-01/rocksdb-tx/ states that RocksDB does not support SSI (serializable snapshot isolation). It only supports SI, which is what we are heading towards. It also describes what happens during the validation phase.
So RocksDB doesn't just read the timestamp for every key that has been written; it has some optimisation on top of that. If we keep it simple and just read the timestamp for each key, then for a large transaction that would be a lot of reads. And because we can only access this via the iterator, this means seeking our transaction iterator and performing reads off it to get the timestamp metadata. However our iterator has a "read ahead cache" controlled by one of its options.
The implementation of this can be separated into 2 phases.
Based on the discussion here 6107d9c#commitcomment-73269633, the lmdb-js project exposes transactions that are a combination of snapshot isolation for reads and PCC serialisable locking for writes. It does not however have OCC snapshot isolation at the moment, because it doesn't do write-set conflict detection.
I've started using rocksdb directly now to implement SI. Some notes about it.
The only optimisation from creating the snapshot upon registering the first lazy operation is reducing the number of needless transaction conflicts. So if both T1 and T2 have started, and you know they will conflict if they ran at the same time, but T1 was sufficiently delayed in real time in starting its read/write operations, then it's better to delay snapshot creation to when the first operation is done. If we can expose snapshots as first-class (though opaque) types in JS, then this should be wieldable in JS.
Turns out RocksDB even has an operation to do this exact lazy setting of the snapshot: `SetSnapshotOnNextOperation`. I believe we should have this set by default. However I realised that the docs don't say whether the snapshot is also set before read operations. And if I want consistent reads, I'd need to get access to that snapshot and then set that as the option when performing the read. So although it seems useful, I'm not sure if it's actually useful, since I'm not sure if it would be called upon the next read. I'd like to also point out that there's a `GetSnapshot` operation on the transaction as well.
In my thinking now, transactions should by default have consistent reads and consistent writes. Optimistic transactions are always consistent writes. I can't think of situations where we don't want consistent reads; if you did, you'd just use the existing read-committed behaviour. So the optimisation of setting the snapshot upon the first operation has to be done in JS-land under the hood.
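A sketch of how that JS-land optimisation could look (assuming a hypothetical `snapshot()` API on the underlying DB):

```ts
interface Snapshot {
  get(key: Buffer): Promise<Buffer | undefined>;
}

interface DBLike {
  snapshot(): Snapshot;
}

class LazySnapshotTransaction {
  protected _snapshot?: Snapshot;

  constructor(protected db: DBLike) {}

  protected setupSnapshot(): Snapshot {
    // Created on first use: a transaction that is constructed early but
    // delays its first operation gets a later, less conflict-prone snapshot
    if (this._snapshot == null) {
      this._snapshot = this.db.snapshot();
    }
    return this._snapshot;
  }

  async get(key: Buffer): Promise<Buffer | undefined> {
    return this.setupSnapshot().get(key);
  }
}
```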
Specification
The `DBTransaction` currently does not have any locking integrated. It is only a read-committed isolation-level transaction.

On top of this, users have to know to use iterators (potentially multiple iterators) to access the snapshot guarantee of leveldb; this is essential when iterating over one sublevel while needing to access properties of another sublevel in a consistent way.

Right now users of this transaction are expected to use their own locks in order to prevent additional concurrency phenomena.
Most of this comes down to locking a particular key that is being used, thus blocking other "threads" from starting transactions on those keys.
Key locking is required in a number of circumstances.
Users are therefore doing something like this:
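Something along these lines, with hypothetical `DB`, `DBTransaction` and `LockBox` shapes standing in for the real ones:

```ts
interface DBTransaction {
  // get/put/del elided
}

interface DB {
  withTransaction<T>(f: (tran: DBTransaction) => Promise<T>): Promise<T>;
}

interface LockBox {
  acquire(...keys: Array<string>): Promise<() => Promise<void>>;
}

class SomeDomain {
  constructor(
    protected db: DB,
    protected locks: LockBox,
  ) {}

  public async someMethod(tran?: DBTransaction): Promise<void> {
    if (tran != null) {
      // A transaction was passed in: don't create our own, and assume the
      // caller already holds the locks for key1 and key2
      return this.someMethodInner(tran);
    }
    // Otherwise create our own transaction and take the locks ourselves
    const release = await this.locks.acquire('key1', 'key2');
    try {
      await this.db.withTransaction((t) => this.someMethodInner(t));
    } finally {
      await release();
    }
  }

  protected async someMethodInner(tran: DBTransaction): Promise<void> {
    // ... reads and writes against key1 and key2 via tran
  }
}
```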
Notice how if the transaction is passed in to `SomeDomain.someMethod`, then it doesn't bother creating its own transaction, but it also doesn't bother locking `key1` and `key2`.

The problem with this pattern is that within a complex call graph, each higher-level call has to remember, or know, what locks need to be held before calling a transactional operation of `SomeDomain.someMethod`. As the hierarchy of the call graph expands, this requirement to remember the locking context grows exponentially, and will make our programs too difficult and complex to debug.

There are 2 solutions to this: pessimistic concurrency control (PCC), where keys are locked as the transaction uses them, and optimistic concurrency control (OCC), where conflicts are detected at commit time.
The tradeoffs between the 2 approaches are summarised here: https://agirlamonggeeks.com/2017/02/23/optimistic-concurrency-vs-pessimistic-concurrency-short-comparison/
Big database software often combines these ideas into its transaction system, and allows the user to configure their transactions for their application needs.
A quick and dirty solution for ourselves will follow along with how RocksDB implemented their transactions: https://www.sobyte.net/post/2022-01/rocksdb-tx/. And details here: MatrixAI/Polykey#294 (comment).
Pessimistic Concurrency Control
I'm most familiar with pessimistic concurrency control, and we've currently designed many of our systems in PK to follow it. I'm curious whether OCC might be easier to apply to our PK programs, but we would need to have both transaction systems to test.
In terms of implementing PCC, we would need these things:

- Integrating `LockBox` into `DBTransaction`.
- The `LockBox` would need to be augmented to detect deadlocks and manage re-entrant locks.
- Re-entrancy means repeated locks of `key1` within the same transaction will all succeed. This doesn't mean that `key1` is a semaphore, just that if it's already locked in the transaction, then it is fine to proceed.
- Lock upgrades: after read-locking `key1`, a subsequent call can write-lock `key1` (but must take precedence over other blocked readers & writers), and subsequent calls to write-lock `key1` will also succeed. Lock downgrades will not be allowed. (See the sketch below.)

As for optimistic transactions, we would do something possibly a lot simpler: MatrixAI/Polykey#294 (comment)
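To make the re-entrancy and upgrade rules above concrete, here is a minimal sketch of per-transaction lock bookkeeping (hypothetical, not the real `LockBox` API):

```ts
type LockType = 'read' | 'write';

class TransactionLocks {
  protected held = new Map<string, LockType>();

  async lock(key: string, type: LockType): Promise<void> {
    const current = this.held.get(key);
    if (current === 'write') return; // re-entrant: write already subsumes both
    if (current === 'read' && type === 'read') return; // re-entrant read
    if (current === 'read' && type === 'write') {
      // Upgrade: must take precedence over other blocked readers & writers
      await this.acquireUpgrade(key);
      this.held.set(key, 'write');
      return;
    }
    // Fresh acquisition; a deadlock detector would hook in here
    await this.acquire(key, type);
    this.held.set(key, type);
  }

  // Downgrades (write -> read) are deliberately not provided.

  protected async acquire(key: string, type: LockType): Promise<void> {
    // elided: delegate to the underlying lock collection
  }

  protected async acquireUpgrade(key: string): Promise<void> {
    // elided: queue-jumping upgrade on the underlying read-write lock
  }
}
```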
Now there is already existing code that relies on how the DB transactions work, namely the `EncryptedFS`. Any updates to the `DBTransaction` should be backwards compatible, so that `EncryptedFS` can continue functioning as normal using its own locking system. Therefore pessimistic and optimistic modes must be opt-in.
For pessimistic, this may just mean adding some additional methods to the `DBTransaction` that end up locking certain keys.

Optimistic Concurrency Control
For optimistic, this can just be an additional option parameter, `db.transaction({ optimistic: true })`, that makes it an optimistic transaction.

Because OCC transactions are meant to rely on the snapshot, this means every `get` call must read from the iterator. Because this can range over the entire DB, the `get` call must be done on the root of the DB.

But right now `iterator` also creates its own snapshot. It will be necessary that every iterator call iterates from the same snapshot that was created at the beginning. Right now this means users must start their iterators at the beginning of their transaction if they were to do that.

This might mean we need to change our "virtual iterator" in `DBTransaction` to seek the snapshot iterator and acquire the relevant value there, as sketched below. We would need to maintain separate cursors for each iterator, and ensure mutual exclusion on the snapshot iterator.

When using optimistic transactions, this means every transaction creates a snapshot. During low-concurrency states, this is not that bad, and I believe leveldb does some sort of COW, so it's not a full copy. During high-concurrency states, this means increased storage/memory usage for all the concurrent snapshots. It is very likely that transactional contexts are only created at the GRPC handler level, and quite likely we would have a low-concurrency state for the majority of the time for each Polykey node.
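Going back to the virtual iterator idea, here is a sketch of how multiple virtual iterators could multiplex one shared snapshot iterator (hypothetical API; using `async-mutex` for the mutual exclusion):

```ts
import { Mutex } from 'async-mutex';

interface SnapshotIterator {
  seek(key: Buffer): void; // position at first key >= the given key
  next(): Promise<[Buffer, Buffer] | undefined>;
}

class VirtualIterator {
  protected cursor?: Buffer; // this virtual iterator's own position

  constructor(
    protected snapshotIterator: SnapshotIterator, // shared, created at txn start
    protected lock: Mutex, // shared by all virtual iterators of this txn
  ) {}

  async next(): Promise<[Buffer, Buffer] | undefined> {
    // Mutual exclusion: only one virtual iterator repositions and reads
    // the shared snapshot iterator at a time
    return this.lock.runExclusive(async () => {
      if (this.cursor == null) {
        this.snapshotIterator.seek(Buffer.alloc(0)); // start at the lowest key
      } else {
        this.snapshotIterator.seek(this.cursor);
        await this.snapshotIterator.next(); // skip the entry we already returned
      }
      const entry = await this.snapshotIterator.next();
      if (entry !== undefined) this.cursor = entry[0];
      return entry;
    });
  }
}
```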
Based on these ideas, it seems OCC should be less work to do than PCC.
Additional context
- Even with locking integrated into `DBTransaction`, the usage of locks and `LockBox` will still apply in other areas that may only be interacting with in-memory state
- Integrating the new `DBTransaction` into EFS, and its usage of `LockBox`

Tasks

- Integrate locking into `DBTransaction` - this should be enough to enable PK integration
- Expose it through a `DBTransaction.lock()` call
- Ensure that `DBTransaction.lock` calls take care of sorting, timeouts and re-entrancy