Integrate Automatic Indexing into DB #1
Was thinking that this secondary index is by default unique. However, non-unique keys can be done by making use of the sequence range, as sketched below.
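A minimal sketch of that idea, assuming a lexicographic integer encoding for the sequence number (e.g. the `lexicographic-integer` package) and an illustrative `!` separator; the key layout here is an assumption, not this library's actual format:

```ts
import * as lexi from 'lexicographic-integer';

// Build an index key that stays unique even when the indexed value is
// not: duplicate values are disambiguated by a sequence number.
// lexi.pack(n, 'hex') encodes numbers so that string order matches
// numeric order, keeping all duplicates adjacent and correctly ordered.
function nonUniqueIndexKey(value: string, sequence: number): string {
  return `${value}!${lexi.pack(sequence, 'hex')}`;
}

// A range scan over the `${value}!` prefix then returns every record
// sharing that indexed value, in insertion (sequence) order:
//   gte: 'sam!'
//   lt:  'sam!\xff'
```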
So for a structure like this, if we want to apply this to each sublevel we already have, we could do this by making use of our existing sublevel machinery.
Every sublevel created from `db.level` could have an associated indexes sublevel: a sublevel containing all the index sublevels for the respective primary db. Not every sublevel requires an index though, so this can be lazily initialised. On the other hand, creating a sublevel doesn't actually do anything to the database unless the sublevel is actually used, so it can simplify things right now if it is just constructed eagerly. Having the associated indexes sublevel is one thing; it is another to keep the index itself around. The indexes sublevel is in fact an implementation detail, and it would be ideal for it to not need to be maintained by the user; the user wants to use the index sublevel instead. So here's a possible API:

```ts
const mgrDomain: DBDomain = ['Manager'];
const dirsDomain: DBDomain = [mgrDomain[0], 'dirs'];
const mgrDb = await db.level(mgrDomain[0]);
const dirsDb = await db.level(dirsDomain[1], mgrDb);
// this is still a Db, but now used as an index
const mgrDbIndex = await db.index(mgrDb, 'indexname', ['name', 'title']);
// remember we can use the dbs themselves, but then we don't benefit from
// automatic serialisation/deserialisation and encryption/decryption
// so if we want to "get" according to an index, we now have to refer
// to the index domain
const indexDomain = ['Manager', 'indexname'];
// note that when you are creating a compound index, you now have to
// use a separator to indicate this
await db.get(indexDomain, 'somename!sometitle');
```

Remember you don't actually need to create a sublevel ahead of time in order to use it. Another issue is the fact that the DB is now always using buffer encoding. So that means the index dbs should also always be using buffer encoding for values, but the key encoding depends. A further problem is whether the indexes are going through the encryption/decryption; if they are not, the index contents would sit outside the encryption boundary.
Ok so there's a problem: because the values are encrypted before being stored into the DB, hooks which rely on the event triggers only have access to the encrypted ciphertext. Thus the usage of hook-based indexing libraries won't work as-is. If we want to do this, we would need to synthesise and extract the implementation into our own DB.
One issue with encrypting indexes is that the index sublevel key is based on the primary sublevel value. So if the primary sublevel stores the value "sam", would this mean that the index sublevel key is "sam"? If so, this would then be leaking the value itself. And if we were to encrypt the index keys, this ends up with some problems...
Here's an alternative: an order-preserving hash. If the index keys go through an order-preserving hash, then lookups should still work nicely since order is preserved, and you can deterministically hash any value you want to look up, then find the "encrypted" ID and decrypt it. It may even be possible to just hash the ID instead of encrypting it. The values of the index sublevels are just keys of the primary sublevels, and keys of primary sublevels are not encrypted anyway. So it makes sense to use an order-preserving hash on index keys.
Does it make sense to encrypt/decrypt the index values? No, keep them plaintext.
What kind of indexes do we want to support?
This gets pretty complicated. The most likely usage right now is just unique value keys, and possibly non-unique ones. If we end up with too many complicated indexes, we might as well change to using Sqlite instead. Going to prototype this with level-auto-index to see how it deals with non-unique value indexes; a rough sketch of the wiring is below.
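A rough prototype sketch, assuming level-auto-index's documented `AutoIndex(db, indexDb, reducer)` shape; all names here are illustrative, and promise support on the index getter is not assumed:

```ts
import level from 'level';
import sub from 'subleveldown';
import AutoIndex from 'level-auto-index';

async function main(): Promise<void> {
  const db = level('./tmp/db');
  // primary sublevel holding JSON records
  const people = sub(db, 'people', { valueEncoding: 'json' });
  // sublevel where the secondary index entries get written
  const nameIdx = sub(db, 'people-name');

  // the reducer extracts the index key from each stored value; note
  // that with duplicate names, later puts would clobber earlier index
  // entries, which is exactly the non-unique behaviour to investigate
  const byName = AutoIndex(people, nameIdx, (person: any) => person.name);

  await people.put('id1', { name: 'sam', title: 'dev' });

  // resolve the primary record through the secondary index
  byName.get('sam', (err: Error | null, person: any) => {
    if (err) throw err;
    console.log(person); // { name: 'sam', title: 'dev' }
  });
}

main().catch(console.error);
```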
Order-preserving hashing is not useful because you can recover the original value: https://crypto.stackexchange.com/questions/8160/secure-order-preserving-hash-function. I'm not sure if this applies to minimal perfect hashes, but it seems there is some leak. There are new algorithms for order-preserving encryption, but there are no libraries in JS, and it is too costly to implement for this feature. However, if we were to not use indexing, we would probably have to create bidirectional maps and end up not encrypting the key anyway; therefore having an unencrypted index isn't any worse. The main thing then is that anything we index is going to be leaked, and thus indexing should be limited to values we are willing to leak, like ID values. Things that are not meant to be leaked should not be indexed. So for now we can go with unencrypted & unhashed index keys, but index values are going to be encrypted as normal. A big fat warning is required to ensure devs know that any indexed values are going to be plaintext. The best solution would be where the leveldb block system itself had encryption; this would mean all keys and values are encrypted, with encryption happening at the lower level. However this is not feasible to change to at the moment.
It appears there's a relationship between proper transaction support, block level encryption and secure indexing. In EFS we have an issue for integrating native snapshots into the transaction system. This requires going into the C++ binding of leveldb and making it work; then our transactions can use leveldb snapshots natively, and in such a case our transactions could scale too. Block level encryption means encryption at the blocks of the DB rather than at the record level. RocksDB has this capability (facebook/rocksdb#2424), but it would require changing to use rocksdb, which is possible because it fits the levelup API; however any encryption mechanism would be done at the C++ level, and would not involve webcrypto APIs or primitives. Indexing as in this issue is currently not entirely secure, or is very limited in order to not leak values. It would work better if block encryption was available; then we would not need to work out complicated OPE. All 3 problems can be resolved by using sqlite3 and the sqlcipher extension, assuming that it does block level encryption; however cross platform requirements need analysis. This also impacts the common crypto mechanisms used by PK. Whether it is leveldb, rocksdb or sqlite, any robust solution will require calling into C++ and manipulating the C++ or C code. Although I wonder if the sqlitejs ports can use wasm and whether sqlcipher can be put through wasm.
For now we will stick with simple indexing, and the current transaction support is sufficient. When the combination of the 3 problems becomes worthwhile, we should investigate solutions for this DB to have encryption at the block level, and then have proper transactions and proper indexing support. The first path would likely be investigating the usage of SQLCipher. It's not an extension, it's a fork. It would need to be usable on PK CLI and PK GUI, and for Android & iOS it looks like it would need to be compiled for those platforms. This would be a large epic, as it would involve multiple subissues to resolve. The encryption work would also need to involve consideration of WASM MatrixAI/Polykey#155 and QUIC MatrixAI/Polykey#234. Basically for our next phase of work, there's going to be a lot of C/C++ involved.
Since MatrixAI/js-id#1 is done, this means we now have the ability to use those IDs here.
There are 2 ways we can implement this:

1. Embedding into LevelDB extension points (hooks, events, abstract-leveldown wrappers): by trying to bring encryption into leveldb via hooking into its event system, and thus getting a good idea of how we can apply indexing just before the encryption.
2. Building on top of DB here: by doing it above the existing encryption in our DB abstraction.

The former requires diving into the leveldb internals, while the latter might result in some code duplication and a recreation of an event system. The former doubles down on the leveldb investment; the latter may be lighter weight and enable us to move to sqlite3 or other systems in the future. Additionally, in order to represent an index that can point to multiple keys (a multi-value index, that is), we can make use of sublevels again. Basically every time we index a value, that value doesn't directly point to the primary key, but instead to another sublevel of primary keys. The values of these can be arbitrary or contain metadata relating to each record. But one can now use the order of the primary key to get the "latest" record for a given index; see the sketch below.
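A minimal sketch of that multi-value layout, assuming a levelup-style db and a `!`-separated key scheme; the helper names and key shape are illustrative, not this library's actual format:

```ts
import type { LevelUp } from 'levelup';

const sep = '!';

// The indexed value does not point at a single primary key; it acts as
// a sublevel prefix under which every matching primary key is stored.
function multiIndexKey(index: string, value: string, primaryKey: string): string {
  return `${index}${sep}${value}${sep}${primaryKey}`;
}

// Insert an index entry; the value can be arbitrary per-record metadata.
async function indexPut(
  db: LevelUp,
  index: string,
  value: string,
  primaryKey: string,
  meta: Buffer,
): Promise<void> {
  await db.put(multiIndexKey(index, value, primaryKey), meta);
}

// Because primary keys are ordered, a reverse scan over the value's
// prefix yields the "latest" record for that indexed value first.
async function latestForValue(
  db: LevelUp,
  index: string,
  value: string,
): Promise<string | undefined> {
  const prefix = `${index}${sep}${value}${sep}`;
  const stream = db.createReadStream({
    gte: prefix,
    lt: prefix + '\xff',
    reverse: true,
    limit: 1,
  }) as any;
  for await (const { key } of stream) {
    return key.toString().slice(prefix.length);
  }
  return undefined;
}
```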
Compound indexing is useful for our Discovery queue to represent priorities. MatrixAI/Polykey#329 (comment)
Way 1 cannot be done unless #4 and #5 are solved first. As these are long and complicated problems, we'll park them until later; it will ultimately involve looking into the C++ codebase of leveldb and making changes there, or at the level nodejs wrapper code. Way 2 will be what we have to do to proceed for now. To be as foolproof as possible, we should wrap our sublevels accordingly.
In MatrixAI/Polykey#244 we are storing each node entry in buckets. The buckets are ordered due to the lexi-int encoding of the bucket index; however the node ids themselves also need to be indexed.
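As an illustration, a sketch of lexi-int ordered bucket keys using the `lexicographic-integer` package; the last-updated key shape below is an assumed example of what such an index could look like, not the actual Polykey layout:

```ts
import * as lexi from 'lexicographic-integer';

// Plain decimal strings sort incorrectly: '10' < '2' lexicographically.
// lexi.pack gives an encoding whose string order matches numeric order.
function bucketKey(bucketIndex: number): string {
  return lexi.pack(bucketIndex, 'hex');
}
// bucketKey(2) < bucketKey(10) holds, unlike '2' < '10'

// Hypothetical secondary index of node ids by last-updated time within
// a bucket; scanning this range in reverse gives most-recently-updated.
function lastUpdatedKey(
  bucketIndex: number,
  lastUpdated: number,
  nodeId: string,
): string {
  return `${bucketKey(bucketIndex)}!${lexi.pack(lastUpdated, 'hex')}!${nodeId}`;
}
```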
With the merge of #10, this can now use a special root level dedicated to indexing.
Automatic indexing is made easier with snapshot isolation, because we can expect consistent reads: while traversing an index, we can easily look up what the index points to.
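To make this concrete, a sketch under the assumption of a transactional API with snapshot isolation; `withTransactionF` and the domains used here are hypothetical:

```ts
// `db` stands in for this library's database handle; the transaction
// method name and key domains are assumptions for illustration.
async function lookupByName(db: any, name: string): Promise<unknown> {
  return db.withTransactionF(async (tran: any) => {
    // both reads see one snapshot: the index entry and the record it
    // points to cannot drift apart mid-traversal
    const primaryKey = await tran.get(['index', 'name', name]);
    if (primaryKey == null) return undefined;
    return tran.get(['people', primaryKey]);
  });
}
```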
Since we're on rocksdb now, we can make use of some prior art for secondary indexing:
There are too many ways we use indexing for it to be automated unless we first have a structured schema. So we won't need this for now, since we have already done all the manual indexing work.
Specification
Indexing is the act of creating a separate data structure that makes it faster to look up data in a primary data structure.
PK has several domains that require secondary indexing:
Right now secondary indexing is implemented manually in ACL by persisting bidirectional maps in leveldb.
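For concreteness, a minimal sketch of that manual pattern, with made-up domain names and key shapes: two writes go into one batch so the forward and reverse maps cannot diverge.

```ts
import type { LevelUp } from 'levelup';

// Associate a node id with a permission id in both directions.
async function setPermission(
  db: LevelUp,
  nodeId: string,
  permId: string,
): Promise<void> {
  await db.batch([
    // forward map: node -> permission
    { type: 'put', key: `perms-by-node!${nodeId}`, value: permId },
    // reverse map: permission -> node (one entry per associated node)
    { type: 'put', key: `nodes-by-perm!${permId}!${nodeId}`, value: '' },
  ]);
}
```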
Indexing is a tricky topic, and it's better to create a single indexing mechanism in DB so that all domains in PK can benefit from it.
Generic indexing can rely on LevelDB's eventemitter interface. The good thing is that other people have already created good libraries for this. The second library automates some of `level-auto-index`; in fact it binds secondary index interfaces to the leveldb object. For `@matrixai/db` it makes more sense to use `level-auto-index` itself so we can more tightly control how the indexes are stored, most likely via sublevel interfaces.
Additional context
Tasks
- [ ] Integrate `level-auto-index` into this library's domain/level creation; each level, including the root level, should have the ability of attaching a secondary index (or multiple secondary indexes)