-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BACKEND: Bypass literal indexing? #119
Comments
Hello @gsvarovsky !
That is definitely unexpected and kinda worrying, even more so if batching doesn't seem to affect these latencies. On one side, it reminds me that we should really add a way to run unit and performance tests in the browser automatically to keep track of improvements and regressions. On the other side, perhaps we should have a closer look potential alternatives for browser-side storage. Or, storage alternatives in general. Conceptually,
Interesting use cases!
Breaking the abstraction does not sound that good to me, too, but I would still like to consider it as it might be the simplest and most performant option.
Funny that you mention this as a while ago I experimented with a generalized version of the approach you describe that would deduplicate all terms, basically leading quads to become tuples of references to a I wonder if we could apply this to literals only: we could have a All in all, it sounds to me like the root problems, literal duplication and write performance, must be addressed separately. The latter should probably be addressed with a replacement for The former should probably be addressed via a dedicated index, perhaps even limited to literals exceeding a certain length. |
Thanks for considering, @jacoscaz! OK, considering the literal duplication first. Here is an idea for a bit more detail on the "omit indexing" option. For brevity, I'll only consider my 3 indexes. At present, a quad
The solution involves noticing the presence of the binary literal in O when writing the quad, and truncating the index representation suitably, thus:
Then notice that When matching, the inclusion of a binary literal in the match pattern simply errors immediately. Otherwise, I have the possible index selection options (noting that I always include a Graph term if anything else is present, hence my index choice):
|
With regard to dropping According to some reports, the batch approach in |
Hello again! WRT omitting indexing, I fear truncating index representations would lead to issues with quads overriding one another. Let's assume index One relatively simple and very flexible feature we could add is the option of passing completely arbitrary |
WRT performance, yes, reimplementing atomic transactions would definitely be a non-trivial endeavor. I can't wrap my mind around the numbers you've encountered; I routinely see 15k quad/second in my imports and that translates to 15 quad/millisecond, meaning that going through
|
Have you tried profiling in the browser to figure out what functions are taking the most time? |
Yes, that's actually where I'm getting my timings from, in Chrome. It appears to be faster in Firefox but the performance tab is not as easy to parse. |
|
Ah, of course, sorry. We could proceed by only hashing the selected big literals in the indexes, not the whole quad. It has the advantage that the big literals are then findable by exact-match. (This means this proposal actually becomes "Use Hash Indexes".) No need to hash anything that isn't a big literal. If
I've been looking at SubtleCrypto in the browser and the indications are it's really fast. The trade-off of storage against CPU would be tune-able by configuration. Looking at this might also help with #95, if you treat a quad as analogous to a 'big literal' (although, not straightforward for matching inside the nested triple, need a join). The arbitrary key-value pairs would not help a user of mine with a big literal value, and it's quite likely with text fields. |
Just thinking about this a little more. With an arbitrary key-value interface I could implement my own hash-based index, and only reveal the hash values to quadstore, instead of the original literals. That could be used to remove all duplication. By the way, I am already hashing all (user) triples anyway, for my own purposes (which is why I'm looking at SubtleCrypto). |
Yeah, I that’s what I had in mind. However, I also agree that hashing exceedingly long terms would be a good feature. These two are probably more orthogonal than they are overlapping. I guess you’d be even better off with both? |
I guess I would start with the key/value interface as I find that more generic and flexible. Should that turn out not to work, I would then move to implement hashing directly in the store. What do you think? |
#119: Key-value interface: optional pre-write
We've just merged atomic writes of quads coupled with custom |
As expected, using raw KVPs instead of quads for my journal has removed all duplication overhead in storage. 💯
|
That's great to hear!
Having done some testing on IndexedDB in the last few days, that doesn't come across as terribly surprising (unfortunately). In my testing, which admittedly is still rather limited, the relative ordering of the keys being added to the store shows a high correlation to the overall latency of the batch. Writing values with keys sorted lexicographically (i.e. Unfortunately, due to how our indexing works, I don't think there would be way to fix this without switching to non-atomic writes, which is obviously a deal-breaker in almost every use case. I really need to find a way to run those performance tests on the browser side but I do think that switching to an AOF + DUMP persistence model (like redis does) instead of hitting IndexedDB directly might be the only way forward in cases of real bad performance. I've opened a dedicated issue for browser-side performance (#121 ) and will now close this one. Please do comment with your use case and as detailed as description as you can of your symptoms and findings so far.
Sounds good! |
I'd like to discuss alternatives to solve a performance issue I have. I would be very happy to submit a PR once we have decided the best approach, or work with you. Quadstore carefully balances performance with features (great job!), and this should enhance it further.
Root Problems
General Affected Scenarios
My Particular Scenarios
Journaling
m-ld uses one graph for the user data, and another graph for journaling. The journal is an append-only log of transactions, to support recovery of peer clones. I can truncate the journal periodically, but in domain with very volatile data, this would have to be unacceptably aggressive.
Not all properties of a journal entry need to be indexed in the graph for querying. In fact there is only one property that is ever matched in a query.
Note that it is necessary, for data safety, to update the journal atomically with user data updates.
User Scenarios
Trivially, as m-ld exposes a graph API, my users' data can be affected by the same issues.
Current Workarounds
graph
is always included in the match pattern).base64Binary
literal containing a JSON object. This helps by reducing the count of triples in each transaction. However this does not help with storage volume, because the bundled properties are still repeated in the graph indexes.My Proposed Alternative Solutions
Non-Quad Data
We could add a way to include non-quad key-value pairs in the batch write APIs. This would allow an API user to perform additional data persistence operations atomically with quad writes, while applying their own optimisations.
Opinion: I think this could be an unacceptable rupture in quadstore's data abstraction. Also, it would not help me with User Scenarios unless I also broke my own abstraction!
Omit Indexing for Some Literals
We could add a configurable way to disable indexing for selected literals. Selection could be configured by data type (e.g.
base64Binary
/hexBinary
), and/or by data size (e.g. >32 bytes), or any selector based on a quad literal. Affected literal values would only appear once, in a backend value. In trade-off, the value would not be queryable, only retrievable.Opinion: this is my preferred approach, although I'm not fully sure yet how to implement it.
Use Hash Indexes for Some Literals
We could add a configurable way to use hash values for indexing of selected literals. As above, by any selector based on quad literal. The unhashed value would only appear once, in a backend value. It would still be possible to exact-match a literal, but not range match.
Note that the hash could be strong (cryptographic, slow) or weak (fast). The latter would require a post-filter on query results to remove hash-colliding values.
Opinion: the complexity of this option is off-putting. It feels like there would have to be quite a bit of analysis of the trade-offs. However, it is possible that m-ld may require this option in future anyway (for reasons that would need another essay!).
The text was updated successfully, but these errors were encountered: