Skip to content

Comparing Complex Schema Types

matthewgillett edited this page Jan 4, 2024 · 1 revision

Here we will explain how to compare complex schema sources that may contain nested types, such as Json and DynamoDb.

Creating the complex schema AppleTable

In order to create an AppleTable for a Json or DynamoDb source, we need to provide the first level list of fields for the entire source. All columns and any nested structure types are converted to a string equivalent to simplify large dynamic schemas and to work properly with Spark's strict schema requirement.

For example if we have the following Json file:

[
  {
    "key1": "TEST1",
    "key2": 1,
    "attribute1": "test number 1",
    "attribute2": [
      "1", 
      "2"
    ],
    "attribute3": "1"
  },
  {
    "key1": "TEST2",
    "key2": 1,
    "attribute1": "test number 2",
    "attribute2": {
      "element1": "2"
    },
    "attribute3": "2"
  },
  {
    "key1": "TEST3",
    "key2": 3,
    "attribute1": "test number 3",
    "attribute2": {
      "element1": "3",
      "element2": "4"
    },
    "attribute3": "3"
  }
]

The Spark Dataframe schema has the following structure. Note that all types are string and only first-level columns.

root
 |-- attribute1: string (nullable = true)
 |-- attribute2: string (nullable = true)
 |-- attribute3: string (nullable = true)
 |-- key1: string (nullable = true)
 |-- key2: string (nullable = true)

With the following data. Note that:

  • Actual strings in the data are quoted with double quotes.
  • Lists are represented as comma separated values contained within [ and ].
  • Maps are represented as comma separated "key":values contained with { and }.
+---------------+-------------------------------+----------+-------+----+
|attribute1     |attribute2                     |attribute3|key1   |key2|
+---------------+-------------------------------+----------+-------+----+
|"test number 1"|["1","2"]                      |"1"       |"TEST1"|1   |
|"test number 2"|{"element1":"2"}               |"2"       |"TEST2"|1   |
|"test number 3"|{"element1":"3","element2":"4"}|"3"       |"TEST3"|3   |
+---------------+-------------------------------+----------+-------+----+

To create an AppleTable for a Json source:

val fieldNames = Array(
          "attribute1",
          "attribute2",
          "attribute3",
          "key1",
          "key2"
        )

val appleTable = SparkFactory.parallelizeJSONSource("path/to/file", "json_test", fieldNames)

Comparing complex schema to complex schema

Comparing two complex schema sources is straightforward and no special conversions are applied.

Comparing complex schema to simple schema

Comparing a complex schema source to a simple schema source (such as JDBC with Postgres) requires that the simple schema source is converted behind the scenes to a complex type AppleTable. This conversion of the simple schema source is essentially adding the string quotes to all columns, and is transparent to the user.

For example, if we have the following Spark Dataframe with schema and data from a JDBC connection:

root
 |-- attribute1: string (nullable = true)
 |-- attribute2: string (nullable = true)
 |-- attribute3: string (nullable = true)
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = true)

+-------------+----------+----------+-----+----+
|attribute1   |attribute2|attribute3|key1 |key2|
+-------------+----------+----------+-----+----+
|test number 1|1         |1         |TEST1|1   |
|test number 2|2         |2         |TEST2|1   |
|test number 3|true      |3         |TEST3|3   |
+-------------+----------+----------+-----+----+

MegaSparkDiff will convert the Dataframe to the following upon execution of its AppleTable comparison to appropriately compare to a complex schema source:

root
 |-- attribute1: string (nullable = true)
 |-- attribute2: string (nullable = true)
 |-- attribute3: string (nullable = true)
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = true)

+---------------+----------+----------+-------+----+
|attribute1     |attribute2|attribute3|key1   |key2|
+---------------+----------+----------+-------+----+
|"test number 1"|"1"       |"1"       |"TEST1"|1   |
|"test number 2"|"2"       |"2"       |"TEST2"|1   |
|"test number 3"|"true"    |"3"       |"TEST3"|3   |
+---------------+----------+----------+-------+----+