BACKEND: Bypass literal indexing? #119

gsvarovsky · 2020-11-21T15:30:21Z

I'd like to discuss alternatives to solve a performance issue I have. I would be very happy to submit a PR once we have decided the best approach, or work with you. Quadstore carefully balances performance with features (great job!), and this should enhance it further.

Root Problems

IndexedDB performance is terrible. Irrespective of batching, I observe individual key writes at >1ms (including transaction commit). Using 6 indexes, it only takes 16 triples to reach a 100ms transaction time.
With the best literal encoding possible, indexing means that a literal value is repeated multiple times in storage.

General Affected Scenarios

Storage of moderate to high numbers of literals that do not require indexing.
Storage of moderately sized to large literals, e.g. string or binary.

My Particular Scenarios

Journaling

m-ld uses one graph for the user data, and another graph for journaling. The journal is an append-only log of transactions, to support recovery of peer clones. I can truncate the journal periodically, but in domain with very volatile data, this would have to be unacceptably aggressive.

Not all properties of a journal entry need to be indexed in the graph for querying. In fact there is only one property that is ever matched in a query.

Note that it is necessary, for data safety, to update the journal atomically with user data updates.

User Scenarios

Trivially, as m-ld exposes a graph API, my users' data can be affected by the same issues.

Current Workarounds

I customise the indexes (reduce from 6 to 3) because I never query across the user data and the journal data (the graph is always included in the match pattern).
I bundle some properties of the journal into a base64Binary literal containing a JSON object. This helps by reducing the count of triples in each transaction. However this does not help with storage volume, because the bundled properties are still repeated in the graph indexes.

My Proposed Alternative Solutions

Non-Quad Data

We could add a way to include non-quad key-value pairs in the batch write APIs. This would allow an API user to perform additional data persistence operations atomically with quad writes, while applying their own optimisations.

Opinion: I think this could be an unacceptable rupture in quadstore's data abstraction. Also, it would not help me with User Scenarios unless I also broke my own abstraction!

Omit Indexing for Some Literals

We could add a configurable way to disable indexing for selected literals. Selection could be configured by data type (e.g. base64Binary/hexBinary), and/or by data size (e.g. >32 bytes), or any selector based on a quad literal. Affected literal values would only appear once, in a backend value. In trade-off, the value would not be queryable, only retrievable.

Opinion: this is my preferred approach, although I'm not fully sure yet how to implement it.

Use Hash Indexes for Some Literals

We could add a configurable way to use hash values for indexing of selected literals. As above, by any selector based on quad literal. The unhashed value would only appear once, in a backend value. It would still be possible to exact-match a literal, but not range match.

Note that the hash could be strong (cryptographic, slow) or weak (fast). The latter would require a post-filter on query results to remove hash-colliding values.

Opinion: the complexity of this option is off-putting. It feels like there would have to be quite a bit of analysis of the trade-offs. However, it is possible that m-ld may require this option in future anyway (for reasons that would need another essay!).

The text was updated successfully, but these errors were encountered:

jacoscaz · 2020-11-22T10:21:23Z

Hello @gsvarovsky !

IndexedDB performance is terrible. Irrespective of batching, I observe individual key writes at >1ms

That is definitely unexpected and kinda worrying, even more so if batching doesn't seem to affect these latencies. On one side, it reminds me that we should really add a way to run unit and performance tests in the browser automatically to keep track of improvements and regressions. On the other side, perhaps we should have a closer look potential alternatives for browser-side storage. Or, storage alternatives in general. Conceptually, quadstore evolved from the approach first taken by levelgraph but there's no reason we must stick to such approach if something better comes along. Similarly, we could use something else than level-js.

Storage of moderate to high numbers of literals that do not require indexing.
Storage of moderately sized to large literals, e.g. string or binary.

Interesting use cases!

Non-Quad Data

Breaking the abstraction does not sound that good to me, too, but I would still like to consider it as it might be the simplest and most performant option.

Omit Indexing for Some Literals

quadstore does not index single terms but entire quads. Presumably, and I'm just brainstorming here, disabling indexing for a given kind of literals would mean forcing quadstore to add quads matching the criteria only to a subset of its configured indexes. This might slightly lower the latencies you are experiencing but you've already played with dropping indexes; I don't think this would fare any better than that.

Use Hash Indexes for Some Literals

Funny that you mention this as a while ago I experimented with a generalized version of the approach you describe that would deduplicate all terms, basically leading quads to become tuples of references to a terms index. A smart use of (configurable) caching helped in limiting the performance impact of having lookups-on-write. However, the big issue here is the complexity of implementing multi-filter queries, which quickly leads back to per-quad indexing thus making the whole effort kinda moot.

I wonder if we could apply this to literals only: we could have a literals index with keys being length-limited, lossy, sortable encodings of literals which we would also use in place of the actual literal representation in the standard indexes. However, note that this would probably eliminate the issue with repeating multiple large literals but would actually worsen write performance as you'd have an additional index on top of the others. Keeping such index in sync with the others might also turn out to be a non-trivial task.

All in all, it sounds to me like the root problems, literal duplication and write performance, must be addressed separately.

The latter should probably be addressed with a replacement for level-js, perhaps something that writes to memory and then dumps the whole thing to localStorage (snapshotting?).

The former should probably be addressed via a dedicated index, perhaps even limited to literals exceeding a certain length.

gsvarovsky · 2020-11-22T12:44:07Z

Thanks for considering, @jacoscaz! OK, considering the literal duplication first.

Here is an idea for a bit more detail on the "omit indexing" option. For brevity, I'll only consider my 3 indexes. At present, a quad <<S P O G>> looks like this in the backend:

G,S,P,O : <<S P O G>>
G,O,S,P : <<S P O G>>
G,P,O,S : <<S P O G>>

The solution involves noticing the presence of the binary literal in O when writing the quad, and truncating the index representation suitably, thus:

G,S,P,  : <<S P O G>>
G,      : <<S P O G>>
G,P,    : <<S P O G>>

Then notice that G, is redundant and remove that too. A similar operation on the default 6 indexes reduces them to 4. (This also addresses write performance, a bit.)

When matching, the inclusion of a binary literal in the match pattern simply errors immediately. Otherwise, I have the possible index selection options (noting that I always include a Graph term if anything else is present, hence my index choice):

Pattern	Index
`_,_,_,_`	`G,S,P,O` (first found)
`G,S,_,_`	`G,S,P,O`
`G,S,P,_`	`G,S,P,O`
`G,_,P,_`	`G,P,O,S`

gsvarovsky · 2020-11-22T12:50:27Z

With regard to dropping level-js: personally I need IndexedDB because I need atomic transactions, which would be very hard to do in the other common browser option, local storage. WebSQL is very poorly supported. So the only motivation would be if level-js has some problems itself.

According to some reports, the batch approach in level-js, which waits for each put or del to complete before submitting the next one, is the wrong thing to do. But I've experimented with that and doing them all at once only makes a tiny difference to the whole transaction duration.

jacoscaz · 2020-11-22T16:42:42Z

Hello again!

WRT omitting indexing, I fear truncating index representations would lead to issues with quads overriding one another. Let's assume index GSP and quads (s1, p1, o1, g1) and (s1, p1, o2, g1). Both of these quads would be represented as something like g1-s1-p1, leading the store to replace the one indexed first with that indexed later on. This is the reason why quadstore does not support partial indexes. I suspect there's a way to counter this by adding some sort of hashed version of the entire quad at the end of the index representation but that would considerably slow down insertion, more so with large literals involved.

One relatively simple and very flexible feature we could add is the option of passing completely arbitrary (key, value) pairs together with quads in all relevant operations (put, multiPut, patch, multiPatch) and new dedicated methods to retrieve those values. Projects such as m-ld would then be able to use this as the foundation for custom functionality. Does this sound like something that could help?

jacoscaz · 2020-11-22T16:50:59Z

WRT performance, yes, reimplementing atomic transactions would definitely be a non-trivial endeavor.

I can't wrap my mind around the numbers you've encountered; I routinely see 15k quad/second in my imports and that translates to 15 quad/millisecond, meaning that going through level-js and IndexedDB is resulting in whole orders of magnitude of difference. Could you run the dist/perf/loadfile.js script with the 21million.rdf file on the same machine and report what you get?

cat 21million.rdf | head -n 1000000 > 1million.rdf
node dist/perf/loadfile.js ./1million.rdf

jacoscaz · 2020-11-22T17:08:20Z

Have you tried profiling in the browser to figure out what functions are taking the most time?

gsvarovsky · 2020-11-22T17:19:10Z

Have you tried profiling in the browser to figure out what functions are taking the most time?

Yes, that's actually where I'm getting my timings from, in Chrome. It appears to be faster in Firefox but the performance tab is not as easy to parse.

gsvarovsky · 2020-11-22T17:27:23Z

node dist/perf/loadfile.js ./1million.rdf
TIME: 104913 s
DISK: 524912

gsvarovsky · 2020-11-22T17:41:04Z

truncating index representations would lead to issues with quads overriding one another

Ah, of course, sorry. We could proceed by only hashing the selected big literals in the indexes, not the whole quad. It has the advantage that the big literals are then findable by exact-match. (This means this proposal actually becomes "Use Hash Indexes".) No need to hash anything that isn't a big literal.

If H is the the hash of O:

G,S,P,H : <<S P O G>>
G,H,S,P : <<S P O G>>
G,P,H,S : <<S P O G>>

I've been looking at SubtleCrypto in the browser and the indications are it's really fast. The trade-off of storage against CPU would be tune-able by configuration.

Looking at this might also help with #95, if you treat a quad as analogous to a 'big literal' (although, not straightforward for matching inside the nested triple, need a join).

The arbitrary key-value pairs would not help a user of mine with a big literal value, and it's quite likely with text fields.

gsvarovsky · 2020-11-22T18:59:20Z

Just thinking about this a little more. With an arbitrary key-value interface I could implement my own hash-based index, and only reveal the hash values to quadstore, instead of the original literals. That could be used to remove all duplication.

By the way, I am already hashing all (user) triples anyway, for my own purposes (which is why I'm looking at SubtleCrypto).

jacoscaz · 2020-11-22T20:57:56Z

That could be used to remove all duplication.

Yeah, I that’s what I had in mind. However, I also agree that hashing exceedingly long terms would be a good feature.

These two are probably more orthogonal than they are overlapping. I guess you’d be even better off with both?

jacoscaz · 2020-11-22T21:04:57Z

I guess I would start with the key/value interface as I find that more generic and flexible. Should that turn out not to work, I would then move to implement hashing directly in the store. What do you think?

#119: Key-value interface: optional pre-write

jacoscaz · 2020-11-24T14:52:15Z

We've just merged atomic writes of quads coupled with custom key, value pairs (see #120). @gsvarovsky how are things looking on your end? Do you already have an idea of whether that will be enough to address the duplication issues you are experiencing with m-ld?

gsvarovsky · 2020-11-25T10:30:03Z

As expected, using raw KVPs instead of quads for my journal has removed all duplication overhead in storage. 💯

Very strangely, and unhappily, I'm still seeing very similar total batch time in IndexedDB for a user transaction including (atomic) journaling. I'll be looking into that in more detail.
I have not yet attempted to implement a hash index in order to extend the benefit to user quads, but this is a lower priority for me and I will consider it later.

jacoscaz · 2020-11-25T11:16:27Z

As expected, using raw KVPs instead of quads for my journal has removed all duplication overhead in storage. 💯

That's great to hear!

Very strangely, and unhappily, I'm still seeing very similar total batch time in IndexedDB for a user transaction including (atomic) journaling. I'll be looking into that in more detail.

Having done some testing on IndexedDB in the last few days, that doesn't come across as terribly surprising (unfortunately). In my testing, which admittedly is still rather limited, the relative ordering of the keys being added to the store shows a high correlation to the overall latency of the batch. Writing values with keys sorted lexicographically (i.e. aaa, aaab, aaac, aba, abc, abf, ...) seems to be a lot faster than writing values with sparse/unsorted keys - particularly in Chrome.

Unfortunately, due to how our indexing works, I don't think there would be way to fix this without switching to non-atomic writes, which is obviously a deal-breaker in almost every use case. I really need to find a way to run those performance tests on the browser side but I do think that switching to an AOF + DUMP persistence model (like redis does) instead of hitting IndexedDB directly might be the only way forward in cases of real bad performance.

I've opened a dedicated issue for browser-side performance (#121 ) and will now close this one. Please do comment with your use case and as detailed as description as you can of your symptoms and findings so far.

I have not yet attempted to implement a hash index in order to extend the benefit to user quads, but this is a lower priority for me and I will consider it later.

Sounds good!

gsvarovsky mentioned this issue Nov 21, 2020

Basic performance improvements m-ld/m-ld-js#28

Closed

10 tasks

gsvarovsky added a commit to m-ld/node-quadstore that referenced this issue Nov 23, 2020

quadstorejs#119: Key-value interface: optional pre-write

4f3bf45

gsvarovsky mentioned this issue Nov 23, 2020

#119: Key-value interface: optional pre-write #120

Merged

gsvarovsky added a commit to m-ld/node-quadstore that referenced this issue Nov 24, 2020

quadstorejs#119: Code review, docs and tests

43008af

jacoscaz added a commit that referenced this issue Nov 24, 2020

Merge pull request #120 from m-ld/key-value-interface

2cbe497

#119: Key-value interface: optional pre-write

jacoscaz closed this as completed Nov 25, 2020

jacoscaz mentioned this issue Nov 25, 2020

BACKEND: slow browser-side performance when using level-js / IndexedDB #121

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BACKEND: Bypass literal indexing? #119

BACKEND: Bypass literal indexing? #119

gsvarovsky commented Nov 21, 2020

jacoscaz commented Nov 22, 2020 •

edited

Loading

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 22, 2020 •

edited

Loading

jacoscaz commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020 •

edited

Loading

gsvarovsky commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 24, 2020

gsvarovsky commented Nov 25, 2020

jacoscaz commented Nov 25, 2020

BACKEND: Bypass literal indexing? #119

BACKEND: Bypass literal indexing? #119

Comments

gsvarovsky commented Nov 21, 2020

Root Problems

General Affected Scenarios

My Particular Scenarios

Journaling

User Scenarios

Current Workarounds

My Proposed Alternative Solutions

Non-Quad Data

Omit Indexing for Some Literals

Use Hash Indexes for Some Literals

jacoscaz commented Nov 22, 2020 • edited Loading

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 22, 2020 • edited Loading

jacoscaz commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020

gsvarovsky commented Nov 22, 2020 • edited Loading

gsvarovsky commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 22, 2020

jacoscaz commented Nov 24, 2020

gsvarovsky commented Nov 25, 2020

jacoscaz commented Nov 25, 2020

jacoscaz commented Nov 22, 2020 •

edited

Loading

jacoscaz commented Nov 22, 2020 •

edited

Loading

gsvarovsky commented Nov 22, 2020 •

edited

Loading