
Optimize get_json_object by calling the main kernel only once #2129

Merged · 19 commits merged into NVIDIA:branch-24.08 on Jun 14, 2024

Conversation

@ttnghia (Collaborator) commented on Jun 7, 2024:

This PR optimizes the host code for the get_json_object function, improving performance by calling the main kernel only once.

Currently, the main kernel is called twice: once to compute the output string sizes and again to write the output strings. The idea of this work is to pre-allocate a temporary char column for writing the output without knowing the output string sizes in advance. The main kernel then writes out both the output strings and their sizes in a single call. A (relatively cheap) gathering step is then executed to extract the final output strings column.
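As a rough illustration of that flow, a minimal CUDA sketch is below. Everything in it (the kernel name, parameters, and the trivial copy used in place of the real JSON path evaluation) is a stand-in, not the actual spark-rapids-jni code.

// Minimal CUDA sketch of the single-pass idea: each row writes into its
// pre-sized slice of a padded scratch buffer and records its real output size;
// a gather step on the host side later compacts the results.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void json_path_kernel(char const* input,
                                 int64_t const* in_offsets,   // row boundaries in `input`
                                 char* scratch,               // pre-allocated, padded output buffer
                                 int64_t const* out_offsets,  // estimated write positions in `scratch`
                                 int64_t* out_sizes,          // actual output sizes, filled here
                                 int num_rows)
{
  int const row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row >= num_rows) { return; }

  char const* in       = input + in_offsets[row];
  int64_t const in_len = in_offsets[row + 1] - in_offsets[row];
  char* out            = scratch + out_offsets[row];

  // Stand-in for the real JSON path evaluation: copy the row through unchanged.
  for (int64_t i = 0; i < in_len; ++i) { out[i] = in[i]; }

  // Record the true output size; the host builds the final offsets from these
  // and gathers the strings out of `scratch` into the result column.
  out_sizes[row] = in_len;
}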

Closes #2124.

@ttnghia ttnghia self-assigned this Jun 7, 2024
Review thread on the scratch-buffer padding constant:

// size so that we will not write output strings into an out-of-bound position.
// Checking out-of-bound needs to be performed in the main kernel to make sure we will not have
// data corruption.
constexpr auto padding_ratio = 1.01;
A collaborator commented:
Question: how do we get this ratio? I think there are cases where the output can be larger than input_size * 1.01, such as number normalization, adding quotes under some styles, and escaping some characters.

@ttnghia (Collaborator, author) replied on Jun 11, 2024:

This is the total size, not one row's size. If we have 1M rows, this will be the size of approximately 1,010,000 rows. In other words, we have around 10,000 extra rows to avoid invalid memory access. We don't care if the data of row n is written across its boundary into row n+1, which causes corruption, since we will discard the entire buffer if such overflow is detected. The extra 10,000 rows at the end just ensure that the data of the last row will not be written out of bounds of the entire buffer.
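For illustration only, a hypothetical host-side check for that rule could look like the sketch below; the names and the exact condition are assumptions, not the project's code.

// A row "overflows" when it writes more bytes than the slot it was assigned in
// the scratch buffer, i.e. it spills into the next row's slot. If any row does
// this, the whole scratch buffer is discarded and the exact size-then-write
// path is taken instead.
#include <cstddef>
#include <cstdint>
#include <vector>

bool any_row_overflowed(std::vector<int64_t> const& actual_sizes,
                        std::vector<int64_t> const& slot_offsets)  // num_rows + 1 entries
{
  for (std::size_t i = 0; i + 1 < slot_offsets.size(); ++i) {
    if (actual_sizes[i] > slot_offsets[i + 1] - slot_offsets[i]) { return true; }
  }
  return false;
}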

@ttnghia (Collaborator, author) added on Jun 11, 2024:

The extra 10,000 rows seem to be too much. We could turn this padding into a dynamic value; for example, the padding could always be the average size of 100 rows. That should be large enough to avoid invalid memory access when writing the last row, but it is not guaranteed.

A collaborator replied:

One extreme case:

scala> val d = Seq("['\n']", "['\n\n\n\n\n\n\n\n\n\n']")
scala> val df = d.toDF
scala> df.createOrReplaceTempView("tab")
scala> spark.sql("select length(value) from tab").show()
+-------------+
|length(value)|
+-------------+
|             5|
|           17|
+-------------+

scala> spark.sql("select length(get_json_object(value, '$')) from tab").show()
+---------------------------------+
|length(get_json_object(value, $))|
+---------------------------------+
|                                 6|
|                               30|
+---------------------------------+

This case will cause invalid memory access:
rounding up, 2 * 1.01 = 3 rows; the average row size is 11, so the total allocated size is 33, but the total write size will be 35 (5 + 30).
This causes overlapped writes as expected (not an issue by itself), but the trailing write runs past the end of the buffer and causes invalid memory access.

One option:
Add a method in json_generator like:

void write_byte(char c) {
  if (curr_size >= max_allowed_size) {
    // Out of space: skip the write but keep counting, so the caller can
    // detect the overflow from curr_size afterwards.
    curr_size++;
  } else {
    *curr_pos++ = c;
    curr_size++;
  }
}

@ttnghia (Collaborator, author) replied:

In my latest code, I pad the buffer by 100 * average_row_size. In this case, avg_row_size = 22 / 2 = 11, so the buffer size will be 22 + 100 * 11, and there is no overflow.
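A back-of-the-envelope sketch of that sizing rule, with a hypothetical helper name:

// Pad the scratch buffer by 100 * average_row_size on top of the total input size.
#include <cstdint>

int64_t padded_scratch_size(int64_t total_input_bytes, int64_t num_rows)
{
  int64_t const avg_row_size = num_rows > 0 ? total_input_bytes / num_rows : 0;
  return total_input_bytes + 100 * avg_row_size;
}

// For the extreme case above: 22 input bytes over 2 rows gives avg_row_size = 11
// and a scratch size of 22 + 100 * 11 = 1122 bytes, comfortably larger than the
// actual output, so no overflow.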

@ttnghia (Collaborator, author) commented on Jun 12, 2024:

I'll add that code and check the benchmark to see how it affects overall performance.
Edit: I did attempt to implement such a bounds-checked write, but gave up because it requires modifying the entire JSON generator, JSON parser, and ftos converter. In the meantime, I have an idea to further optimize writing with shared memory (#2130), so let this be future work.

@ttnghia (Collaborator, author) commented on Jun 13, 2024:

I updated the padding, which is now computed using the max row size. There is some overhead with this, but it is small. This approach avoids invalid memory access in normal situations and also in extreme cases, such as when the column is all nulls except for one very long row.
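One plausible reading of that scheme, sketched with a hypothetical helper (the reduction needed to find the max row size is presumably the small overhead mentioned):

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

int64_t padded_scratch_size_max(std::vector<int64_t> const& row_sizes)
{
  int64_t const total   = std::accumulate(row_sizes.begin(), row_sizes.end(), int64_t{0});
  int64_t const max_row = row_sizes.empty()
                            ? int64_t{0}
                            : *std::max_element(row_sizes.begin(), row_sizes.end());
  // Headroom for one extra max-sized row at the tail, which covers cases like a
  // column that is all nulls except one very long row.
  return total + max_row;
}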

@ttnghia (Collaborator, author) commented on Jun 11, 2024:

build

@res-life (Collaborator) commented on Jun 12, 2024:

It would be nice to add a unit test that:

  • causes the second kernel invocation (see the sketch below for one possible starting point).
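For reference, a self-contained sketch of the arithmetic behind such a test, using the extreme case from the review thread above and the original fixed 1.01 padding; the actual call into the project's get_json_object API is omitted.

// The two rows from the Spark example above have input lengths 5 and 17
// (22 bytes total); the escaped outputs have lengths 6 and 30 (36 bytes total).
// The reviewer's analysis estimates only ~33 bytes of padded scratch under the
// fixed 1.01 ratio, so the output cannot fit: the buffer must be discarded and
// the second (exact) kernel invocation must run, which is the path to test.
#include <cassert>
#include <cstddef>

int main()
{
  std::size_t const estimated_scratch = 33;      // from the analysis above
  std::size_t const actual_output     = 6 + 30;  // lengths reported by Spark above
  assert(actual_output > estimated_scratch);     // overflow, so the fallback pass runs
  return 0;
}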

@ttnghia (Collaborator, author) commented on Jun 13, 2024:

build

@ttnghia merged commit 41c1b52 into NVIDIA:branch-24.08 on Jun 14, 2024 (3 checks passed).
@ttnghia deleted the optimize_get_json branch on June 14, 2024 at 03:02.
wjxiz1992 pushed a commit to wjxiz1992/spark-rapids-jni that referenced this pull request Jun 17, 2024
Optimize get_json_object by calling the main kernel only once (NVIDIA#2129)

* Implement the CPU code
* Reimplement kernel
* Fix the kernel caller
* Optimize validity computation
* Cleanup
* Cleanup
* Cleanup
* Add comment
* Turning kernel occupancy
* Cleanup
* Change padding for the scratch buffer
* Update docs
* Add test for overflow case
* Pad the output buffer using max row size
* Update test
* Change the padding ratio

Signed-off-by: Nghia Truong <[email protected]>
Successfully merging this pull request may close these issues.

[FEA] Improve get_json_object by calling the main kernel only once