The performance of DOM Parser and Schema-Based Parser. #52
Would you mind sharing the hardware and Java version you used to run these benchmarks? The output from …
Also, what do you mean by: …
Both attached snippets look the same.
Consider the JSON as a tree; the depth I mean is the depth of a node in that tree. For example, the V2 benchmark visited "statuses.user.screen_name" (depth 3), while V3 visited "statuses.id" (depth 2).
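For concreteness, this is the (abridged) shape of the relevant part of twitter.json; the values here are made up, but the nesting matches the paths above, with "statuses" at depth 1:

```json
{
  "statuses": [
    {
      "id": 12345,
      "user": { "screen_name": "example_user" }
    }
  ]
}
```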
I'm sorry, I pasted the wrong snippets.
I also re-ran the benchmarks in a more stable environment, and the results were similar. The benchmarks I ran are here: …
Thanks for the update. I ran your benchmarks on my desktop using two versions of Java (18 and 21) and got the following results:
JDK 21.0.1, OpenJDK 64-Bit Server VM, 21.0.1+12-LTS: …
JDK 18.0.2.1, OpenJDK 64-Bit Server VM, 18.0.2.1+1: …
In general, Java 21 performs better, which is not surprising. However, the problem you've described is still valid. Let me go through your questions to make sure we are on the same page:
In this question, you are referring to the poor performance of the schema-based parser in comparison to Jackson for shorter JSONs. Overall, in all the above cases, some version of simdjson beats Jackson.
This question is strictly related to the previous one, as the schema-based parser again performs unexpectedly poorly. If my interpretation of the benchmark results and your concerns is correct, then we can narrow the problem down to the performance of the schema-based parser. I've profiled it while running the benchmarks. The flamegraph clearly shows that Java reflection is the culprit behind the poor performance. Simdjson clears its internal cache of resolved classes every time the parse method is called. After commenting out the line in which the cache is cleared, I got the following results: …
I suggest that you comment out this line and rerun the benchmarks in your environment. This is not the ultimate solution, of course. I just want to make sure that we are on the same page and that you don't see any other unexpected disparities between the parsers in terms of performance.
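As a rough illustration of why clearing that cache hurts (a hypothetical micro-example, not the project's code): a reflective lookup such as `getDeclaredFields()` scans class metadata on every call, whereas reading an already resolved entry from a map is cheap.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Hypothetical micro-example: resolving fields reflectively on every call
// versus resolving them once and caching the result.
public class ReflectionCacheSketch {
    private static final Map<Class<?>, Field[]> CACHE = new HashMap<>();

    static Field[] resolveEveryTime(Class<?> type) {
        return type.getDeclaredFields(); // reflective scan on each call
    }

    static Field[] resolveCached(Class<?> type) {
        return CACHE.computeIfAbsent(type, Class::getDeclaredFields); // one-time cost
    }

    public static void main(String[] args) {
        long sink = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sink += resolveEveryTime(String.class).length; // like a cache cleared per parse
        }
        for (int i = 0; i < 1_000_000; i++) {
            sink += resolveCached(String.class).length;    // plain map lookup after the first call
        }
        System.out.println(sink); // keep the JIT from eliminating the loops
    }
}
```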
We are now on the same page! I re-ran the benchmarks after commenting out the `classResolver.reset()` line.
BTW, is there a problem with commenting out the `classResolver.reset()` call?
If we comment it out without changing anything else, there can be a problem because the cache will grow infinitely. In some cases, this is acceptable because the cache can contain as many entries as there are classes in the application. I'll need to think about it. Perhaps the cache needs to have a more sophisticated eviction policy (LRU?).
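A minimal sketch of what such an LRU-bounded cache could look like, built on LinkedHashMap's access-order mode (class and constant names are hypothetical, not from the simdjson-java codebase):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a size-bounded LRU cache for resolved schemas.
public class ClassResolverCache<K, V> extends LinkedHashMap<K, V> {
    private static final int MAX_ENTRIES = 256; // would be configurable in practice

    public ClassResolverCache() {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > MAX_ENTRIES; // evict the least recently used entry
    }
}
```

With accessOrder set to true, every get() moves the accessed entry to the back of the iteration order, so removeEldestEntry evicts the least recently used schema once the limit is exceeded.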
Additionally, consider a real situation instead of a benchmark: I believe the classes resolved by the classResolver differ when parsing different JSONs, so simply commenting out the `classResolver.reset()` call would let the cache keep growing.
Perhaps the cache needs a fixed (configurable) size, regardless of the eviction policy.
This is the penalty for using reflection, so you would need to pay it regardless of which parser you use if the parser relies on it. I have an idea on how to replace reflection with an alternative approach, but it requires more research. Also, I wonder how realistic this problem is. How many different schemas can you have in your system?
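One commonly used alternative, just as a guess at the direction (not necessarily the approach meant here), is to resolve a MethodHandle once per class and cache it, so later invocations avoid reflective lookups entirely:

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;

// Hypothetical sketch: resolve a constructor handle once, then invoke it
// repeatedly without touching reflection again.
public class HandleSketch {
    public static void main(String[] args) throws Throwable {
        MethodHandles.Lookup lookup = MethodHandles.lookup();
        // One-time resolution of the no-arg StringBuilder constructor.
        MethodHandle ctor = lookup.findConstructor(
                StringBuilder.class, MethodType.methodType(void.class));
        // Cached handle; each invoke() is much cheaper than a reflective lookup.
        StringBuilder sb = (StringBuilder) ctor.invoke();
        System.out.println(sb.append("ok"));
    }
}
```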
Great! Looking forward to it.
TBH, I've no idea. I don't know the JSON size distribution in the real world. In my case, I do have some small JSONs.
This is understandable, but I wasn't asking about the size of the JSON. Your question was: …
So, I assumed that you have a situation where there are many different types of JSON schemas, and every time you parse a JSON, you use a different schema. In such a scenario, the cache is useless because the parser cannot reuse the classes that are already in the cache. However, I cannot think of a scenario where you have, say, a million different schemas and use a different one each time. Is this your case?
I get your point. The number of different schemas should not be large.

```java
<T> T walkDocument(byte[] padded, int len, Class<T> expectedType) {
    jsonIterator.init(padded, len);
    classResolver.reset();
```

I think in the current code, the cache would be cleared even when the `expectedType` is the same as in the previous call.
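A hypothetical guard illustrating that point (not the actual simdjson-java code) could skip the reset when the expected root type hasn't changed:

```java
// Hypothetical sketch, not the project's code: reset the resolver only when
// the expected root type changes between parse calls.
private Class<?> lastExpectedType;

<T> T walkDocument(byte[] padded, int len, Class<T> expectedType) {
    jsonIterator.init(padded, len);
    if (expectedType != lastExpectedType) {
        classResolver.reset();           // clear only on a schema change
        lastExpectedType = expectedType;
    }
    // ... rest of walkDocument unchanged ...
}
```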
What I mean by JSON is JSON strings (i.e. …).
Yes, exactly. This is why I mentioned that replacing this simple eviction policy with a more sophisticated one could be a good improvement. The new policy would keep already resolved schemas between parse calls.
Just for interest's sake: Jackson also uses reflection, right? Why doesn't it show the same poor performance?
But is there any benchmark in which Jackson beats simdjson? I thought that for smaller inputs they were on a par.
Before commenting out, yes (https://github.com/ZhaiMo15/simdjson-java/blob/performanceTest/src/jmh/java/org/performance/TwitterBenchmarkV4.java); afterwards, no.
Do you mean that for smaller inputs, parsing is a small fraction of the total time compared to reflection, so even though simdjson speeds up parsing, the overall performance barely changes? I don't know which part is the reflection.
I've been testing the performance of simdjson recently. The basic test is similar to the default test, using twitter.json, as below:
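A minimal JMH sketch of such a test, assuming the DOM-style API from the project's README (class names and the file path here are illustrative; the exact benchmark code is in the TwitterBenchmarkV4.java linked above):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Iterator;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.simdjson.JsonValue;
import org.simdjson.SimdJsonParser;

// Illustrative sketch of the described test: parse twitter.json and walk
// every status down to the depth-3 field statuses[i].user.screen_name.
@State(Scope.Benchmark)
public class TwitterDepth3Benchmark {

    private final SimdJsonParser parser = new SimdJsonParser();
    private byte[] buffer;

    @Setup
    public void setup() throws IOException {
        buffer = Files.readAllBytes(Paths.get("twitter.json")); // path is an assumption
    }

    @Benchmark
    public int visitScreenNames() {
        JsonValue doc = parser.parse(buffer, buffer.length);
        int visited = 0;
        Iterator<JsonValue> tweets = doc.get("statuses").arrayIterator();
        while (tweets.hasNext()) {
            tweets.next().get("user").get("screen_name"); // depth 3
            visited++;
        }
        return visited; // returned so the JIT cannot eliminate the loop
    }
}
```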
What's different is that I shrunk the size of statuses; the default is 101, and I tested 101, 51, and 1 respectively. The results are below:
size 101:
size 51:
size 1:
What's more, I changed the depth of the test; the default is 3 and I changed it to 2, as below:
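In terms of the sketch above, the depth-2 variant would only change the loop body (again illustrative):

```java
while (tweets.hasNext()) {
    tweets.next().get("id"); // depth 2: statuses -> id
    visited++;
}
```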
The results are:
size 101:
size 51:
size 1:
Here are my questions: