-
Notifications
You must be signed in to change notification settings - Fork 507
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suboptimal (O(D^2)) memory usage #396
Comments
Summary: Before this change, there are 3 Myers diff algorithms used in the dependency tree: - diff-match-patch (1.0.5) - diff (4.0.1) - diff-sequences (via jest -> jest-diff -> diff-sequences) We'd like to simplify the dependency tree. The short answer is: - Use `diff-sequences`, or `jest-diff` which uses `diff-sequences` internally. For best performance, do: - Strip common prefix and suffix. - Make line comparison O(1), avoid `line1 === line2` which can be O(line length). - Consider skipping "cleanup" in `jest-diff` for long input. ---- Long answer of picking a diff library: I wrote a benchmark script to get some idea about their performance: const fs = require('fs') const dmp = new (require('diff-match-patch').diff_match_patch)(); const diff = require('diff'); const ds = require('diff-sequences').default; const jd = require('jest-diff'); dmp.Diff_Timeout = 120; // Diff functions. Output format: Chunk[] // Chunk is one of: // [0, n]: n common lines (same on both side) // [-1, n]: n left-side-only lines // [1, n]: n right-side-only lines function diff1(chars1, chars2) { return dmp.diff_main(chars1, chars2).map(v => [v[0], v[1].length]); } function diff1a(chars1, chars2) { return dmp.diff_main(chars1, chars2, false).map(v => [v[0], v[1].length]); } function diff2(chars1, chars2) { return diff.diffChars(chars1, chars2).map(v => { const d = v.added ? 1 : (v.removed ? -1 : 0); return [d, v.count]; }); } function diff3(chars1, chars2) { function isCommon(ai, bi) { return chars1[ai] == chars2[bi]; } const r = []; let lastA = 0, lastB = 0; function foundSequence(n, na, nb) { if (na > lastA) { r.push([-1, na - lastA]); lastA = na; } if (nb > lastB) { r.push([1, nb - lastB]); lastB = nb; } if (n > 0) { r.push([0, n]); lastA += n; lastB += n; } } ds(chars1.length, chars2.length, isCommon, foundSequence); foundSequence(0, chars1.length, chars2.length); return r; } function diff3a(chars1, chars2) { return jd.diffStringsRaw(chars1, chars2, false).map((d) => [d[0], d[1].length]); } function diff3b(chars1, chars2) { return jd.diffStringsRaw(chars1, chars2, true).map((d) => [d[0], d[1].length]); } function bench(a, b) { const {chars1, chars2} = dmp.diff_linesToChars_(a, b); function stringify(obj) { if (obj.length > 20) { return `${obj.length} items`; } else { return JSON.stringify(obj); } } [ ['diff-match-patch', diff1], ['diff-match-patch (checklines=false)', diff1a], ['diff-sequences', diff3], ['jest-diff (diff-sequences), no cleanup', diff3a], ['jest-diff (diff-sequences), with cleanup', diff3b], ['jsdiff', diff2], ].forEach(([name, diffFunc]) => { // node --expose_gc if (global.gc) { gc(); } const label = ` ${name}`; console.time(label); console.log(' ', stringify(diffFunc(chars1, chars2))); console.timeEnd(label); }); } let a, b; console.log('\nwith common prefix and suffix 1'); a = 'aaaaaaa\n'.repeat(50000) + 'bbbb\n' + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50000) + 'cccc\n' + 'dddd\n'.repeat(50000); bench(a, b); console.log('\nwith common prefix and suffix 2'); a = 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900); bench(a, b); console.log('\nwithout common prefix or suffix 1'); a = 'c\n' + 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50000); b = 'aaaaaaa\n'.repeat(50000) + 'dddd\n'.repeat(50100) + 'z\n'; bench(a, b); console.log('\nwithout common prefix or suffix 2'); a = 'cccc\n' + 'aaaaaaa\n'.repeat(50000) + 'bbbbbbb\n' + 'dddd\n'.repeat(50000) + 'z\n'; b = 'aaaaaaa\n'.repeat(50100) + 'cccc\n' + 'dddd\n'.repeat(49900) + 'z\ny\n'; bench(a, b); // Hearthstone cards.json in different languages. // This is somewhat challenging since many lines are changed. // wget https://api.hearthstonejson.com/v1/168129/enUS/cards.json -O 1 // wget https://api.hearthstonejson.com/v1/168129/zhCN/cards.json -O 2 // python3 -m json.tool < 1 > 1.json // python3 -m json.tool < 2 > 2.json console.log('\ncards.json with different languages'); a = fs.readFileSync('1.json', {encoding: 'utf-8'}); b = fs.readFileSync('2.json', {encoding: 'utf-8'}); bench(a, b); The output looks like: with common prefix and suffix 1 [[0,50000],[-1,1],[1,1],[0,50000]] diff-match-patch: 5.073ms [[0,50000],[-1,1],[1,1],[0,50000]] diff-match-patch (checklines=false): 0.481ms [[0,50000],[-1,1],[1,1],[0,50000]] diff-sequences: 7.589ms [[0,50000],[-1,1],[1,1],[0,50000]] jest-diff (diff-sequences), no cleanup: 10.915ms [[0,50000],[-1,1],[1,1],[0,50000]] jest-diff (diff-sequences), with cleanup: 10.588ms [[0,50000],[-1,1],[1,1],[0,50000]] jsdiff: 22.664ms with common prefix and suffix 2 [[0,50000],[-1,101],[1,101],[0,49900]] diff-match-patch: 10.688ms [[0,50000],[-1,101],[1,101],[0,49900]] diff-match-patch (checklines=false): 2.619ms [[0,50000],[-1,101],[1,101],[0,49900]] diff-sequences: 12.687ms [[0,50000],[-1,101],[1,101],[0,49900]] jest-diff (diff-sequences), no cleanup: 11.055ms [[0,50000],[-1,101],[1,101],[0,49900]] jest-diff (diff-sequences), with cleanup: 4.356ms [[0,50000],[-1,1],[1,101],[0,49900],[-1,100]] jsdiff: 59.359ms without common prefix or suffix 1 [[-1,1],[0,100000],[1,101]] diff-match-patch: 632.863ms [[-1,1],[0,100000],[1,101]] diff-match-patch (checklines=false): 607.796ms [[-1,1],[0,50000],[1,51],[0,50000],[1,50]] diff-sequences: 12.366ms [[-1,1],[0,50000],[1,51],[0,50000],[1,50]] jest-diff (diff-sequences), no cleanup: 11.096ms [[-1,1],[0,100000],[1,51],[1,50]] jest-diff (diff-sequences), with cleanup: 1.029s [[-1,1],[0,100000],[1,101]] jsdiff: 13.163ms without common prefix or suffix 2 [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-match-patch: 2.773s [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-match-patch (checklines=false): 1.402s [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] diff-sequences: 22.216ms [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] jest-diff (diff-sequences), no cleanup: 20.546ms [[-1,1],[0,50000],[-1,101],[1,101],[0,49901],[1,1]] jest-diff (diff-sequences), with cleanup: 19.222ms [[-1,1],[0,50000],[-1,1],[1,101],[0,49900],[-1,100],[0,1],[1,1]] jsdiff: 33.82ms cards.json with different languages 67781 items diff-match-patch: 1:04.122 (m:ss.mmm) 57514 items diff-match-patch (checklines=false): 2:00.283 (m:ss.mmm) 67781 items diff-sequences: 1:09.486 (m:ss.mmm) 67781 items jest-diff (diff-sequences), no cleanup: 1:06.452 (m:ss.mmm) 52937 items jest-diff (diff-sequences), with cleanup: 1:09.118 (m:ss.mmm) ... (jsdiff cannot complete this test case in 20+ minutes) Observations: - In the last test case, `jsdiff` does not implement O(D^2) -> O(D) space optimization so it is practically unusable (reported as kpdecker/jsdiff#396). `diff-match-patch` and `jest-diff` both implement the linear space optimization, and have similar performance. - `diff-match-patch` strips common prefix and suffix, which makes it faster than `jest-diff` in "common prefix and suffix" test cases. - Both `diff-match-patch` and `jest-diff` can take a long time on "cleanup". See the "without common prefix or suffix 1" test case. We probably want to only enable cleanup for smaller input. - `diff-match-patch` performs visibly worse on the "without common prefix or suffix 2" test case. From the code it looks like `diff-match-patch` uses some kind of heuristics that tries to speed up things but ends up slowing it down. - Without cleanup, `jest-diff` might output `[1,51],[1,50]` that can be "obviously" merged to `[1,101]`. We might use a lightweight cleanup logic for that. - Reading the code, `diff-match-patch` turns lines into char codes. It cannot handle 65536 unique lines. (https://github.com/google/diff-match-patch/blob/62f2e689f498f9c92dbc588c58750addec9b1654/javascript/diff_match_patch_uncompressed.js#L503) Conclusions: - `jest-diff` (and `diff-sequences` under the hood) is overall the best choice. It has expected time and space complexities, and provides flexibility to skip the potentially slow "cleanup", and can support >65k unique lines. - `jest-diff` misses the "skip common prefix / suffix" optimization that `diff-match-patch` has, and seems practically important (editing a line in the editor - all lines are common prefixes and suffixes except for the line being edited). The optimization is not hard to implement. This diff implements it. - For certain use-cases (ex. linelog) where the diff content is not needed (at least for the left / "a" side), it should use `diff-sequences` to avoid overhead preparing the diff content. - `jest-diff`'s `diffLines` outputs one line per `Diff` but we want one chunk per `Diff`. - `jest-diff`'s `diffStringsRaw` produces one `Diff` per chunk, and because [`string.slice` is O(1) in V8](https://stackoverflow.com/a/72545403), it has acceptable performance. But mapping lines to chars would introduce the 65535 unique line limit undesirably. Reviewed By: evangrayk Differential Revision: D43857949 fbshipit-source-id: 9a3d85ebf10c9b82da8ab5cba4e14e519bbf264d
I'm curious to see a better algorithm... |
Hmmmmm... I definitely think it makes sense to implement a version of the algorithm based on the linear space refinement in which we explore the edit graph in both directions (from the top-left, and from the bottom-right) until they join up in the middle. This will have the practical effect of making jsdiff faster and reducing its memory usage (though without, I think, reducing its worst-case time or space complexity). I think the actual linear space refinement proposed by Myers is different to (and cleverer than) this, though, and I'm not convinced of the wisdom of fully implementing it; certainly I think it shouldn't be on by default. If I'm understanding it correctly - and it took me a while to get this - the key idea of that refinement is that you can navigate through the edit graph from both sides without keeping track of the paths you've taken, just how far they've got in each diagonal, until your two paths hit each other somewhere in the middle of the edit graph, and then the snake they hit each other on is guaranteed to be the middle snake of the optimal diff, and then the way you compensate for not having kept a record of the paths you took is by recursively running the algorithm again on the regions before and after the middle snake to find the middle snakes of those, and so on recursively until all the middle snakes you've found together amount to a full path through the edit graph. Effectively you're repeating lots of work, roughly doubling the time taken in some cases, to compensate for not keeping a record of the work you've done. From a computer scientist's perspective, judging performance only in terms of time and space complexity, this is a brilliant trick that gets the space complexity down without worsening the time complexity. But pragmatically speaking, I'm not convinced there's any point? I have encountered situations where jsdiff basically hung "forever" trying to compute a large diff but never any situation where it ran out of memory or even used a nontrivial amount of my computer's memory. Any tradeoff that involves making jsdiff slower in order to reduce its memory requirements intuitively feels like a bad tradeoff to me (though I am open to being persuaded otherwise). I'd be hesitant accept a contribution adding this as an optional off-by-default feature without some persuasion that it's worth it. (Do note that the perf issues around |
I'm gonna label this as planned for next release with the intent that I'll implement the idea of navigating the edit graph from both ends, but not actually implement the linear space algorithm. |
Part of the space complexity can manifest as time complexity via the GC. The hung "forever" cases might be actually caused by the space complexity. I didn't observe "out of memory" error either. The hearthstone |
No, I don't think so - or at least, I don't think GC can possibly increase the time complexity in a way you wouldn't expect if you analyzed the time complexity without regard for GC. Every change object creation is already accounted for in the time complexity and there are only as many object garbage collections are there are object creations. Again, do also note that #411 (merged to master but not yet released to npm) has drastically reduced the amount of garbage collection needed and completely eliminated |
For example, if you append N element to a linked list. It is O(N) time complexity since appending is O(1) per element. However, if for some reason the GC pauses O(N) times and had to scan the whole linked list each time, GC would cause a O(N^2) time complexity. |
Ah, interesting. Yeah, that makes sense; I was assuming the cost of garbage collection would be proportional to the number of objects in need of collecting but of course that's not the case because of the cost of finding the items that can be collected, which I was completely ignoring. Hmm... |
I think I'm persuaded that it'd make sense from a performance perspective to make this change, but one thing still gives me pause, which is that the linear space refinement from the paper gives different results to the core Myers algorithm. Consider diffing old text Here's our edit graph: At D=0, the forward and reverse paths both just preserve one A (i.e. move diagonally): Then when D=1, we do an insertion on the forward path and similarly a vertical move on the reverse path. (I'm omitting paths that go off the edge of the edit graph, although technically we also do those if we're following the algorithm faithfully.) Now the forward path is at (x, y) = (1, 2) and the reverse path is at (u, v) = (0, 1), satisfying the definition of the paths "overlapping" in diagonal k=-1. (Note we use D = 2 when we apply this definition, even though the variable D in the algorithm is 1 at this point; they're not the same D.) But now, per the algorithm, "The last snake of the reverse path is the middle snake." - i.e. the empty snake from (0, 1) to (0, 1). We're thus now forced to build a 2-path that goes through (0, 1), even though the core Myers algorithm would not go through that point. I was slightly doubting my own interpretation, but the interactive visualisation at https://blog.robertelder.org/diff-algorithm/ agrees: So if we made this change we'd be modifying the results jsdiff returns. Do you see a tweak to the published refinement that would let us avoid this? |
Git's |
Hmm. I wonder if that consistently achieves results identical to the core Myers diff algorithm, though? (It's not immediately obvious to me whether there's any reasonable way to achieve that at all.) Would need to scrutinise the post-processing logic carefully... |
It seems https://github.com/kpdecker/jsdiff/blob/3b654c2ed7d5262ed9946de841ad8dae990286c7/src/diff/base.js does not implement "4b. A Linear Space Refinement" mentioned in Myers' paper. So the memory usage is O(D^2). From the code it looks like
O(len(bestPath) * len(components in bestPath))
might be that complexity.This makes it practically too slow to complete diff calculation if there are many changes. For example, comparing the prettified Hearthstone
cards.json
in different languages wouldn't complete in 20 minutes usingjsdiff
:Reproduce:
Profiler shows that nodejs spent 70% time on GC:
As a comparison, it seems
diff-match-patch
implements the linear space refinement and can completes the above test case in ~24s. (For reference,git diff --stat --minimal
completes in ~0.3s).The text was updated successfully, but these errors were encountered: