Further discussion of schema-based parsing. #47

Open
ZhaiMo15 opened this issue May 11, 2024 · 2 comments


@ZhaiMo15

Good to know that schema-based parsing has been implemented!
I have two more questions:

  1. As line 54 in src/main/java/org/simdjson/SchemaBasedJsonIterator.java says: "Lists at the root are not supported. Consider using an array instead." So the current version cannot handle this case? For example, github_events.json.
  2. Can the schema-based parsing be more powerful?
    In the current version, we have to tell the parser the target class explicitly:
SimdJsonTwitter twitter = simdJsonParser.parse(buffer, buffer.length, SimdJsonTwitter.class);

record SimdJsonUser(boolean default_profile, String screen_name) {
}

record SimdJsonStatus(SimdJsonUser user) {
}

record SimdJsonTwitter(List<SimdJsonStatus> statuses) {
}

In Jackson, we can use readValue to parse JSON into a Map (or List), so we don't need to define many records when the class structure is complicated.
In short, something like Object twitter = simdJsonParser.parse(buffer, buffer.length, Map.class);

@piotrrzysko
Member

> As line 54 in src/main/java/org/simdjson/SchemaBasedJsonIterator.java says: "Lists at the root are not supported. Consider using an array instead." So the current version cannot handle this case? For example, github_events.json.

The current version can handle it:

GithubEvent[] events = parser.parse(json, json.length, GithubEvent[].class);

However, we can try adding support for lists at the root. The reason I haven't done this yet is that, in Java, it's a bit challenging to pass information about a generic type parameter. We cannot do something like:

List<GithubEvent> events = parser.parse(json, json.length, List<GithubEvent>.class);
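
The line above does not compile because of type erasure: only the raw List.class exists as a class literal. One common workaround in Java libraries is a "type token", where an anonymous subclass keeps its generic superclass visible to reflection. The sketch below uses only java.lang.reflect; TypeToken, TypeTokenDemo and the one-field GithubEvent record are hypothetical names for illustration and are not part of simdjson-java.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.List;

// Hypothetical helper (not a simdjson-java class): captures a full generic type
// through an anonymous subclass, whose generic superclass survives erasure.
abstract class TypeToken<T> {
    private final Type type;

    protected TypeToken() {
        Type superclass = getClass().getGenericSuperclass();
        if (!(superclass instanceof ParameterizedType)) {
            throw new IllegalStateException("TypeToken must be created with a type argument");
        }
        // The single type argument of TypeToken<...> is the type we want to keep.
        this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
    }

    Type type() {
        return type;
    }
}

class TypeTokenDemo {
    // Stand-in element type for the example; the real GithubEvent would have more fields.
    record GithubEvent(String type) {}

    // Returns a Type describing List<GithubEvent>, including the element type.
    static Type listOfEvents() {
        return new TypeToken<List<GithubEvent>>() {}.type();
    }
}
```

Whether such a token would fit simdjson-java's API is an open design question; the sketch only shows that the element type can be recovered at runtime.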

We can consider introducing an API like this:

List<GithubEvent> events = parser.parseList(json, json.length, GithubEvent.class);
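
Until something like parseList exists, a similar result can probably be layered on top of the array support shown earlier. The following is a minimal sketch under stated assumptions: ArrayCapableParser is a stand-in interface that models only the parse(buffer, length, type) call used in this thread, and ListParsing.parseList is a hypothetical helper, not an existing simdjson-java method.

```java
import java.lang.reflect.Array;
import java.util.Arrays;
import java.util.List;

// Stand-in for the one capability this sketch relies on:
// parse(buffer, length, SomeType[].class). Not simdjson-java's real interface.
interface ArrayCapableParser {
    <T> T parse(byte[] buffer, int length, Class<T> expectedType);
}

final class ListParsing {
    private ListParsing() {}

    // Hypothetical parseList: derive E[].class from E.class, parse the root
    // as an array, then expose the array as a fixed-size List view.
    static <E> List<E> parseList(ArrayCapableParser parser, byte[] buffer, int length, Class<E> elementType) {
        @SuppressWarnings("unchecked")
        Class<E[]> arrayType = (Class<E[]>) Array.newInstance(elementType, 0).getClass();
        E[] elements = parser.parse(buffer, length, arrayType);
        return Arrays.asList(elements);
    }
}
```

With the real parser, the body would reduce to parsing with GithubEvent[].class and wrapping the result in Arrays.asList. The returned list is a fixed-size view of the array, so callers that need to add or remove elements would have to copy it.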

> Can the schema-based parsing be more powerful?

I'm open to that. However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields. It would be interesting to see how this affects performance.
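
To make the trade-off concrete: a record used as a schema names exactly the fields the caller wants, so every other key in the input is a candidate for skipping, while a Map target names nothing and therefore nothing can be skipped. The sketch below only illustrates that idea with standard reflection. SchemaFields and its methods are hypothetical names, and this is not a description of how simdjson-java decides what to skip internally.

```java
import java.lang.reflect.RecordComponent;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class SchemaFields {
    private SchemaFields() {}

    // Field names that a record-based schema asks for, in declaration order.
    static Set<String> wantedFields(Class<? extends Record> schema) {
        Set<String> names = new LinkedHashSet<>();
        for (RecordComponent component : schema.getRecordComponents()) {
            names.add(component.getName());
        }
        return names;
    }

    // Keys present in the input that the schema does not mention, i.e. the
    // ones a schema-based parser would not need to materialize.
    static List<String> skippableKeys(List<String> keysInJson, Class<? extends Record> schema) {
        Set<String> wanted = wantedFields(schema);
        List<String> skipped = new ArrayList<>();
        for (String key : keysInJson) {
            if (!wanted.contains(key)) {
                skipped.add(key);
            }
        }
        return skipped;
    }
}

// Same shape as the SimdJsonUser record from the question.
record SimdJsonUser(boolean default_profile, String screen_name) {}
```

For a user object with dozens of keys, SimdJsonUser asks for only two of them, which is where the expected saving over a Map target comes from.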

Please let me know what you think about it. Also, would you mind sharing if you use or consider using simdjson-java in any project? That would be very valuable information, especially if you could describe your use case (how much data you process, what your expectations are regarding performance, etc.).

@ZhaiMo15
Author

> However, the power of schema-based parsing is that we can skip parsing fields that are not included in the schema. For a Map, we would need to go through all fields.

Thanks, I get your point.

I bring up Map and List because my current project uses UDFJson.java (https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/UDFJson.java). I'd like to replace all Jackson usage there with simdjson.

Before schema-based parsing, I used simdJsonParser.parse() to parse the JSON into a JsonValue and then used an iterator to build a map, to match objectMapper.readValue(jsonString, MAP_TYPE);. However, because of the two passes (the first while parsing, the second while building the map), the performance is bad.

Therefore, I believe schema-based parsing to a Map, even though it would go through all fields, would be faster than Jackson.
