
feat: Add Spark from_json function #11709

Status: Open. Wants to merge 5 commits into base: main.

Conversation

@zhli1142015 (Contributor) commented on Dec 2, 2024

Why I Need to Reimplement JSON Parsing Logic Instead of Using CAST(JSON):

Failure Handling:
On failure, from_json returns NULL instead of raising an error. For instance, parsing the malformed input {"a 1} results in {NULL}.
Root Type Restrictions:
Only ROW, ARRAY, and MAP types are allowed as root types.
Boolean Handling:
Only true and false are considered valid boolean values. Numeric values or strings will result in NULL.
Integral Type Handling:
Only integral values are valid for integral types. Floating-point values and strings will produce NULL.
Float/Double Handling:
All numeric values are valid for float/double types. However, for strings, only specific values like "NaN" or "INF" are valid.
Array Handling:
Spark allows a JSON object as input for an array schema only if the array is the root type and its child type is a ROW.
Map Handling:
Keys in a MAP can only be of VARCHAR type. For example, parsing {"3": 3} results in {"3": 3} instead of {3: 3}.
Row Handling:
Spark supports partial output mode. However, it does not allow an input JSON array when parsing a ROW.
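Taken together, the rules above can be illustrated with a small sketch (plain Python, not the Velox C++ implementation; the helper names and the exact set of special float spellings are illustrative):

```python
import json

# Sketch of the strict coercion rules described above. Hypothetical helpers,
# not the actual Velox code; shown only to make the semantics concrete.

def parse_bool(v):
    # Only JSON true/false are valid booleans; numbers and strings yield NULL.
    return v if isinstance(v, bool) else None

def parse_int(v):
    # Only integral values are valid for integral types; floats and strings
    # yield NULL. bool is a subclass of int in Python, so exclude it.
    return v if isinstance(v, int) and not isinstance(v, bool) else None

def parse_double(v):
    # Any JSON number is valid for double; strings are rejected except special
    # spellings such as "NaN" and "INF" (the exact set here is illustrative).
    if isinstance(v, bool) or v is None:
        return None
    if isinstance(v, (int, float)):
        return float(v)
    if isinstance(v, str) and v in ("NaN", "INF", "-INF", "Infinity", "-Infinity"):
        return float(v.replace("INF", "inf"))
    return None

PARSERS = {"boolean": parse_bool, "int": parse_int, "double": parse_double}

def from_json_row(text, schema):
    # On any failure (malformed input, or a JSON array where a ROW is
    # expected), return a row of NULLs instead of raising.
    try:
        obj = json.loads(text)
    except ValueError:
        return {name: None for name in schema}
    if not isinstance(obj, dict):
        return {name: None for name in schema}
    return {name: PARSERS[typ](obj.get(name)) for name, typ in schema.items()}
```

For example, `from_json_row('{"a": 1}', {"a": "boolean"})` yields a NULL field because a numeric value is not a valid boolean under these rules.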

@facebook-github-bot added the CLA Signed label (authors need to sign the CLA before a PR can be reviewed) on Dec 2, 2024
netlify bot commented on Dec 2, 2024

Deploy Preview for meta-velox canceled. Latest commit ccb7ba9; deploy log: https://app.netlify.com/sites/meta-velox/deploys/677f5201d034010008588f83

@zhli1142015 (Contributor, Author) commented:

cc @rui-mo and @PHILO-HE , thanks.

@rui-mo (Collaborator) left a comment

Thanks. Added some initial comments.

Review threads (resolved) on velox/functions/sparksql/specialforms/CMakeLists.txt and velox/functions/sparksql/specialforms/FromJson.cpp.
@jinchengchenghh (Contributor) left a comment

The JSON input is variable; how can we make sure the whole implementation matches Spark? Maybe we need to search for from_json in Spark and make sure the results are correct.

Review threads (resolved) on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@zhli1142015 (Contributor, Author) commented:

The JSON input is variable; how can we make sure the whole implementation matches Spark? Maybe we need to search for from_json in Spark and make sure the results are correct.

The current implementation supports only Spark's default behavior, and we should fall back to Spark's implementation when specific unsupported cases arise. These include situations where user-provided options are non-empty, schemas contain unsupported types, schemas include a column with the same name as spark.sql.columnNameOfCorruptRecord, or the configuration spark.sql.json.enablePartialResults is disabled.
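For illustration, the fallback conditions listed above could be checked roughly like this (a hypothetical helper, not the actual Gluten code; the supported-type set is a placeholder):

```python
# Hypothetical sketch of the fallback decision described above: fall back to
# vanilla Spark when options are non-empty, the schema contains unsupported
# types, a column clashes with spark.sql.columnNameOfCorruptRecord, or
# spark.sql.json.enablePartialResults is disabled. The SUPPORTED set below is
# illustrative, not the real list.
SUPPORTED = {"ROW", "ARRAY", "MAP", "VARCHAR", "BOOLEAN",
             "INTEGER", "BIGINT", "REAL", "DOUBLE"}

def should_fallback(options, schema_types, schema_names,
                    corrupt_record_col="_corrupt_record",
                    partial_results_enabled=True):
    if options:  # user-provided options must be empty
        return True
    if any(t not in SUPPORTED for t in schema_types):
        return True
    if corrupt_record_col in schema_names:  # clash with columnNameOfCorruptRecord
        return True
    if not partial_results_enabled:  # enablePartialResults must be on
        return True
    return False
```

Any single unsupported condition is enough to route the whole expression back to Spark's own evaluator.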

The only existing unit tests in Spark related to this function are found in JsonExpressionsSuite and JsonFunctionsSuite. I have verified that these tests pass and added missing tests to ensure the current implementation aligns with Spark's behavior. For further details, please refer to the new unit tests included in this PR.

@zhli1142015 force-pushed the add_from_json branch 2 times, most recently from a284e49 to 2762885, on December 10, 2024 07:57
@rui-mo (Collaborator) left a comment

Thanks.

Review threads (resolved) on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@rui-mo changed the title from "feat: Add from_json Spark function" to "feat: Add Spark from_json function" on Dec 10, 2024
@zhli1142015 requested a review from rui-mo on December 11, 2024 03:39
@rui-mo (Collaborator) left a comment

Thanks for the update! Added some comments.

Review threads (resolved) on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@zhli1142015 requested a review from rui-mo on December 12, 2024 10:56
@zhli1142015 force-pushed the add_from_json branch 3 times, most recently from c3696df to d5d801b, on December 17, 2024 04:22
@rui-mo (Collaborator) left a comment

Thanks for iterating!

Review threads (resolved) on velox/docs/functions/spark/json.rst and velox/functions/sparksql/specialforms/FromJson.cpp.
@PHILO-HE (Contributor) left a comment

Looks basically good.
Are nested complex types supported? E.g., an array element that is itself an array, struct, or map. It would be better to clarify this in the documentation and add tests if any are lacking. Thanks!

Review thread (resolved) on velox/docs/functions/spark/json.rst.
@zhli1142015 (Contributor, Author) commented:

@zhli1142015 Got the data from the team, so this is the case where it was giving wrong results for row number 6: val data = Seq( (6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""") ); spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created out of the dataframe built from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I updated the logic to address it.

@ayushi-agarwal commented:
@zhli1142015 Got the data from the team, so this is the case where it was giving wrong results for row number 6: val data = Seq( (6, """[{"holidayTag":"1"},{"holidayTag":"2"}]""") ); spark.sql("select from_json(item_array, 'array<string>') AS itemArray from newT") // newT is a temp view created out of the dataframe built from this data. Do we plan to support this, or will it fall back for now?

Thanks, this is actually a missed case; I updated the logic to address it.

Thanks for the quick response and updating it.

@rui-mo (Collaborator) left a comment

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

@zhli1142015 requested a review from rui-mo on December 27, 2024 08:05
@zhli1142015 (Contributor, Author) commented:

Hi @zhli1142015, could you help document the limitations of the current implementation, as mentioned in #11709 (comment) and #11709 (comment)?

Updated, thanks.

address comments

address comments

address comments

address comments

address comments

minor change

address comments

minor change
@zhli1142015 (Contributor, Author) commented:

Kindly ping @rui-mo and @PHILO-HE: do you still have more comments? Thanks.

@rui-mo (Collaborator) left a comment

Added some comments on the documentation. Thanks.

Review threads on velox/docs/functions/spark/json.rst.
@zhli1142015 requested a review from rui-mo on January 8, 2025 06:58
@ayushi-agarwal commented:
I am hitting this error after picking up the recent change, in one case of from_json. @zhli1142015, any idea what might have gone wrong? I will also try to find more details.

Caused by: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (4294967295 vs. 75)
Retriable: False
Expression: idx < children_.size()
Context: Top-level Expression: from_json(n0_1)
Additional Context: Operator: FilterProject[1] 1 Operator: ValueStream[0] 0
Function: childAt
File: /home/cicdkey/gluten-build/incubator-gluten/dev/../ep/build-velox/build/velox_ep/velox/type/Type.h
Line: 1001
Stack trace:
# 0  _ZN8facebook5velox7process10StackTraceC1Ei
# 1  _ZN8facebook5velox14VeloxExceptionC1EPKcmS3_St17basic_string_viewIcSt11char_traitsIcEES7_S7_S7_bNS1_4TypeES7_
# 2  _ZN8facebook5velox6detail14veloxCheckFailINS0_17VeloxRuntimeErrorERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEvRKNS1_18VeloxCheckFailArgsET0_
# 3  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_119ExtractJsonTypeImplIN8simdjson8fallback8ondemand5valueEE14KindDispatcherILNS0_8TypeKindE32EvE5applyES8_RNS0_4exec13GenericWriterEb.isra.0
# 4  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_119ExtractJsonTypeImplIRN8simdjson8fallback8ondemand8documentEE14KindDispatcherILNS0_8TypeKindE32EvE5applyES9_RNS0_4exec13GenericWriterEb.constprop.0
# 5  _ZN8facebook5velox9functions8sparksql12_GLOBAL__N_116FromJsonFunctionILNS0_8TypeKindE32EE19extractJsonToWriterERN8simdjson8fallback8ondemand8documentERNS0_4exec12VectorWriterINS0_7GenericINS0_7AnyTypeELb0ELb0EEEvEE
# 6  _ZNK8facebook5velox9functions8sparksql12_GLOBAL__N_116FromJsonFunctionILNS0_8TypeKindE32EE5applyERKNS0_17SelectivityVectorERSt6vectorISt10shared_ptrINS0_10BaseVectorEESaISD_EERKSB_IKNS0_4TypeEERNS0_4exec7EvalCtxERSD_
# 7  _ZN8facebook5velox4exec4Expr13applyFunctionERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEE
# 8  _ZN8facebook5velox4exec4Expr11evalAllImplERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEE
# 9  _ZN8facebook5velox4exec4Expr4evalERKNS0_17SelectivityVectorERNS1_7EvalCtxERSt10shared_ptrINS0_10BaseVectorEEPKNS1_7ExprSetE
# 10 _ZN8facebook5velox4exec7ExprSet4evalEiibRKNS0_17SelectivityVectorERNS1_7EvalCtxERSt6vectorISt10shared_ptrINS0_10BaseVectorEESaISB_EE
# 11 _ZN8facebook5velox4exec13FilterProject7projectERKNS0_17SelectivityVectorERNS1_7EvalCtxE
# 12 _ZN8facebook5velox4exec13FilterProject9getOutputEv
# 13 _ZZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEEENKUlvE8_clEv
# 14 _ZN8facebook5velox4exec6Driver11runInternalERSt10shared_ptrIS2_ERS3_INS1_13BlockingStateEERS3_INS0_9RowVectorEE
# 15 _ZN8facebook5velox4exec6Driver4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 16 _ZN8facebook5velox4exec4Task4nextEPN5folly10SemiFutureINS3_4UnitEEE
# 17 _ZN6gluten24WholeStageResultIterator4nextEv
# 18 Java_org_apache_gluten_vectorized_ColumnarBatchOutIterator_nativeHasNext
# 19 0x00007faa8c60fa10


	at org.apache.gluten.iterator.ClosableIterator.hasNext(ClosableIterator.java:41)
	at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	at org.apache.gluten.iterator.IteratorsV1$InvocationFlowProtection.hasNext(IteratorsV1.scala:159)
	at org.apache.gluten.iterator.IteratorsV1$IteratorCompleter.hasNext(IteratorsV1.scala:71)
	at org.apache.gluten.iterator.IteratorsV1$PayloadCloser.hasNext(IteratorsV1.scala:37)
	at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.hasNext(IteratorsV1.scala:100)
	at scala.collection.Iterator.isEmpty(Iterator.scala:387)
	at scala.collection.Iterator.isEmpty$(Iterator.scala:387)
	at org.apache.gluten.iterator.IteratorsV1$LifeTimeAccumulator.isEmpty(IteratorsV1.scala:90)
	at org.apache.gluten.execution.VeloxColumnarToRowExec$.toRowIterator(VeloxColumnarToRowExec.scala:121)

@zhli1142015 (Contributor, Author) commented:

Thanks for reporting this. I think you may be hitting the case below. Since the schema field names we receive are all lower-cased, we can't build the correct mapping between fields and data, so we need to fall back in this case.

Seq[(String)](
  ("""{"id":1,"Id":2}"""),
  ("""{"id":3,"Id":4}""")
)
  .toDF("txt")
  .write
  .parquet(path.getCanonicalPath)

spark.read.parquet(path.getCanonicalPath).createOrReplaceTempView("tbl")

runQueryAndCompare("select txt, from_json(txt, 'id INT, Id INT') from tbl") {
  checkSparkOperatorMatch[ProjectExec]
}
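For illustration, the case-collision condition described here amounts to something like the following (a hypothetical sketch, not the actual Gluten validation code):

```python
from collections import Counter

def needs_fallback(field_names):
    # If two schema fields collide after lower-casing (e.g. "id" and "Id"),
    # the engine only sees lower-cased names and cannot map fields to data
    # unambiguously, so the operator should fall back to vanilla Spark.
    lowered = Counter(name.lower() for name in field_names)
    return any(count > 1 for count in lowered.values())
```

A schema like 'id INT, Id INT' would trip this check, while 'id INT, name STRING' would not.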

@ayushi-agarwal commented:
Thanks for reporting this. I think you may be hitting the case below. Since the schema field names we receive are all lower-cased, we can't build the correct mapping between fields and data, so we need to fall back in this case.

Seq[(String)](
  ("""{"id":1,"Id":2}"""),
  ("""{"id":3,"Id":4}""")
)
  .toDF("txt")
  .write
  .parquet(path.getCanonicalPath)

spark.read.parquet(path.getCanonicalPath).createOrReplaceTempView("tbl")

runQueryAndCompare("select txt, from_json(txt, 'id INT, Id INT') from tbl") {
  checkSparkOperatorMatch[ProjectExec]
}

OK. For this case the results also don't match: Spark returns [null,1] and [null,3], whereas Velox returns [0,1] and [0,3]:
runQueryAndCompare("select txt, from_json(txt, 'id INT, id INT') from tbl") {
  checkSparkOperatorMatch[ProjectExec]
}

@zhli1142015 (Contributor, Author) commented on Jan 9, 2025

I think Velox is not involved here, as Gluten falls back in this case; I'm not sure how you got a different result.
apache/incubator-gluten@cc68e23

@ayushi-agarwal commented:
I think Velox is not involved here, as Gluten falls back in this case; I'm not sure how you got a different result. apache/incubator-gluten@cc68e23

It is because of this check, I think: n != f.name. The case I tried uses id and id, so this condition becomes false; it will need modification for this case:
n != f.name && n.toLowerCase(Locale.ROOT) == f.name.toLowerCase(Locale.ROOT)

@zhli1142015 (Contributor, Author) commented:

I think Velox is not involved here, as Gluten falls back in this case; I'm not sure how you got a different result. apache/incubator-gluten@cc68e23

It is because of this check, I think: n != f.name. The case I tried uses id and id, so this condition becomes false; it will need modification for this case: n != f.name && n.toLowerCase(Locale.ROOT) == f.name.toLowerCase(Locale.ROOT)

Does your schema contain duplicate fields? Wouldn't this cause issues for other operations?

@ayushi-agarwal commented on Jan 9, 2025

Does your schema contain duplicate fields? Wouldn't this cause issues for other operations?

Ideally this should not happen in a real-world scenario; I was just creating some random test cases to check the behavior difference.

@zhli1142015 (Contributor, Author) commented:

Got it, thanks for the clarification. I've updated the Gluten PR.

7 participants