Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve concat_json #2557

Merged
merged 3 commits into from
Nov 5, 2024
Merged

Conversation

ttnghia
Copy link
Collaborator

@ttnghia ttnghia commented Oct 31, 2024

This PR improves concat_json in json_utils.*:

  • Remove some redundant code, and
  • Adds a boolean parameter nullify_invalid_rows to concat_json, allowing to control whether we should mark the input rows containing invalid JSON objects as "to be nullified" after parsing the input JSON strings.

The added parameter is to facilitate different behaviors of parsing JSON with varying types of schema. In particular, Spark's from_json will produce different null rows when specifying schema as map vs struct types:

scala> val df = Seq("{'teacher': 'abc', 'student': 'abc'}", "invalid", "null", "", "  ").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.show(false)
+------------------------------------+
|value                               |
+------------------------------------+
|{'teacher': 'abc', 'student': 'abc'}|
|invalid                             |
|null                                |
|                                    |
|                                    |
+------------------------------------+


scala> df.selectExpr("from_json(value, 'map<string,string>')").show(false)
+--------------------------------+
|entries                         |
+--------------------------------+
|{teacher -> abc, student -> abc}|
|null                            |
|null                            |
|null                            |
|null                            |
+--------------------------------+


scala> df.selectExpr("from_json(value, 'struct<teacher:string, student:string>')").show(false)
+----------------+
|from_json(value)|
+----------------+
|{abc, abc}      |
|{null, null}    |
|{null, null}    |
|null            |
|null            |
+----------------+

@ttnghia ttnghia requested a review from revans2 October 31, 2024 17:59
@ttnghia ttnghia self-assigned this Oct 31, 2024
@ttnghia ttnghia marked this pull request as ready for review November 1, 2024 05:30
Signed-off-by: Nghia Truong <[email protected]>
@ttnghia ttnghia changed the title Add nullify_invalid_rows parameter to concat_json Enhance concat_json Nov 1, 2024
@ttnghia ttnghia changed the title Enhance concat_json Improve concat_json Nov 1, 2024
Comment on lines +167 to +168
// Currently, set `nullify_invalid_rows = false` as `concatenateJsonStrings` is used only for
// `from_json` with struct schema.
Copy link
Collaborator Author

@ttnghia ttnghia Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...and this Java binding concatenateJsonStrings will be removed completely in the next PR for new from_json implementation.

Comment on lines -113 to -133
if (i + 3 < size &&
(d_str[i] == 'n' && d_str[i + 1] == 'u' && d_str[i + 2] == 'l' && d_str[i + 3] == 'l')) {
i += 4;

// Skip the very last whitespace characters.
bool is_null_literal{true};
for (; i < size; ++i) {
ch = d_str[i];
if (not_whitespace(ch)) {
is_null_literal = false;
break;
}
}

// The current row contains only `null` string literal and not any other non-whitespace
// characters. Such rows need to be masked out as null when doing concatenation.
if (is_null_literal) {
output[idx] = thrust::make_tuple(false, false);
return;
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

null string literal is not any different from other garbage (invalid JSON) strings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI Spark integration tests all passed with this removal.

Signed-off-by: Nghia Truong <[email protected]>
@ttnghia
Copy link
Collaborator Author

ttnghia commented Nov 1, 2024

build

@ttnghia ttnghia merged commit ee4a6c4 into NVIDIA:branch-24.12 Nov 5, 2024
3 checks passed
@ttnghia ttnghia deleted the improve_concat_json branch November 5, 2024 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants