Replies: 6 comments
-
I only briefly looked at this, but the flamegraph (thanks for that, by the way!) suggests that ~83% of the time is spent in `ak.from_json`, not in the Parquet writing. So, would you be able to repeat the test and cover only the `ak.to_parquet` step? Also, I noticed that the output only has two columns.
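For example, here is a minimal sketch of timing the two steps separately (simplified to the outer parse only; the file name and Parquet settings are taken from your command, the timing harness itself is just illustrative):

```python
import time

import awkward as ak

# Time only the JSON-reading step.
start = time.perf_counter()
with open("California.jsonl", "rb") as f:
    arr = ak.from_json(f, line_delimited=True)
print(f"ak.from_json:  {time.perf_counter() - start:.2f} s")

# Time only the Parquet-writing step, on the already-parsed array.
start = time.perf_counter()
ak.to_parquet(arr, "awkward.snappy.pq", compression="snappy", row_group_size=37738)
print(f"ak.to_parquet: {time.perf_counter() - start:.2f} s")
```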
-
Sorry, I should have mentioned that I'm interested in the end-to-end process rather than just the Parquet writing on its own. Is there anything I can do to speed up the JSON-reading portion of this workload if it's 83% of the work? I didn't notice that it only produced two columns. If you have any tips for the JSON side of things, I can launch a new VM and, at the same time, see if I can figure out why there aren't three columns.
-
Oh, sure: we're interested in the end-to-end workflow too! We cater to a number of different use cases, and at some point we are more concerned with the performance of long-lived data, i.e. data that is transformed / operated upon before serialisation. @jpivarski and @ianna have a better feel for the performance implications of using `ak.from_json` here.
-
There is an option: `ak.from_json` can take a `schema` (a JSONSchema description of the data type), so that it doesn't have to discover the type while parsing. Although the exact speedup will depend on the data type, our tests found a noticeable improvement from supplying one. Also in that test, running RapidJSON (the C++ library that we use for JSON parsing) by itself with no output was not quite twice as fast as running RapidJSON with array output. As a pipeline, walking over the schema description and filling arrays is about as expensive as parsing the JSON itself, and therefore there's not much further you can go. (Even with a perfectly streamlined walk over an optimized schema description, writing to contiguous memory will have some cost.)

Looking at your data type, it's not very complicated, so I doubt walking through the data type is the bottleneck here. Also, it has JSON nested in JSON:

```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```

so your first interpretation makes a big string column for `geom`.

If you're willing to get hacky and optimize this specific case (i.e. write code that won't work for other data types), then I've got some suggestions. I take it from your code that you're interested in the coordinates, and I'm going to assume that the structure of these `geom` strings never changes. So we can start by stripping out just the coordinates with

```python
>>> outer_json[outer_json.index("[") : outer_json.rindex("]") + 1]
'[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]'
```

Just to be clear, if you have any expectation that the data type can change, this is extremely brittle. But fast!

It only produces one of these outer JSON objects, right? If so, then you can immediately parse the inner JSON. If not, you'll have to concatenate many of these strings. String concatenation is probably faster in pure Python (accumulate a list and join it at the end), or you could do the stripping with command-line tools:

```console
% fgrep '[' outer_json.json | sed 's/[^[]*//' | sed 's/}"//'
[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]
```

Once you have a single string or file-like stream (`ak.from_json` accepts any file-like object with a `read` method), you can parse the inner JSON with a schema. Here's the schema for an array of lists of lists of numbers:

```json
{"type": "array", "items": {"type": "array", "items": {"type": "array", "items": {"type": "number"}}}}
```

So this will read your concatenated (or only instance of) inner JSON:

```python
>>> ak.from_json(inner_json, line_delimited=True, schema=that_schema)
<Array [[[-114, 34.3], ..., [-114, 34.3]]] type='1 * var * var * float64'>
```

(The distinction between concatenated and line-delimited JSON doesn't matter here; `line_delimited=True` handles both.)
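Putting those pieces together, here is a minimal end-to-end sketch. It is not a drop-in replacement for the original command: it assumes one record per line, that every `geom` string keeps exactly this Polygon structure, and it reuses the input file name and Parquet settings from this thread (the output name `coords.snappy.pq` is just a placeholder).

```python
import awkward as ak

# JSONSchema from above: an array of lists of lists of numbers.
coords_schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {"type": "array", "items": {"type": "number"}},
    },
}

# Strip out just the coordinate brackets from each line (brittle: relies on
# the first "[" and the last "]" delimiting the coordinates).
chunks = []
with open("California.jsonl", "r") as f:
    for line in f:
        chunks.append(line[line.index("[") : line.rindex("]") + 1])

# Parse all of the concatenated inner JSON documents in one call, using the
# schema so that from_json can skip type discovery.
coords = ak.from_json("\n".join(chunks), line_delimited=True, schema=coords_schema)

ak.to_parquet(coords, "coords.snappy.pq", compression="snappy", row_group_size=37738)
```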
-
I copied and pasted the Python from another ticket that suggested I look at your project. I've corrected it to produce a PQ file that matches what ClickHouse produced. In my latest test, ClickHouse produced a PQ file in 18.49 seconds and Awkward did it in 28.07s. Awkward is the fastest converter in the Python space that I've seen so far. I'll look into your schema specification suggestion, but in the meantime, here are the commands and stats.

```console
$ /usr/bin/time -v \
    python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"

$ sudo su
$ source .pq/bin/activate

$ strace -wc \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'cali10.awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"

$ perf stat -dd \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'cali10.awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```

If you have any further ideas, please share them. If not, I'm okay if you close this ticket.
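For what it's worth, here is a small sketch of how the two Parquet files could be compared for row counts, row groups, and compression with `pyarrow` (the ClickHouse output file name below is a placeholder, not one used in this thread):

```python
import pyarrow.parquet as pq

# Open both files; "clickhouse.snappy.pq" is a placeholder name.
awkward_file = pq.ParquetFile("awkward.snappy.pq")
clickhouse_file = pq.ParquetFile("clickhouse.snappy.pq")

for name, f in [("awkward", awkward_file), ("clickhouse", clickhouse_file)]:
    meta = f.metadata
    # Compression is recorded per column chunk; check the first chunk of the
    # first row group as a representative value.
    compression = meta.row_group(0).column(0).compression
    print(name, meta.num_rows, "rows,", meta.num_row_groups, "row groups,", compression)
    print(name, "schema:", f.schema_arrow)
```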
-
Actually, I think it should remain as a Discussion. I don't have any further ideas, though.
-
Version of Awkward Array
2.0.5
Description and code to reproduce
The following was run on Ubuntu 20 on an `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs. I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with pyarrow, and I attempted to do the same with fastparquet.
Awkward is able to produce a 947 MB Parquet file in 64.60 seconds.
```console
/usr/bin/time -v \
    python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```
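For readability, here is the same one-liner unpacked into a commented script; the variable names are unchanged and the comments are only a gloss on what each step does:

```python
import awkward as ak

with open("California.jsonl", "rb") as f:
    # Parse the outer JSONL records (release, capture_dates_range, geom).
    arr = ak.from_json(f, line_delimited=True)

# Awkward stores strings as lists of characters, so for a freshly parsed array
# the flat character buffer of the geom column is the concatenation of all
# geom strings back to back.
narr = arr.geom.layout.content.to_numpy()

# Parse those concatenated inner JSON documents (the nested GeoJSON strings).
arr2 = ak.from_json(narr.tobytes(), line_delimited=True)

ak.to_parquet(arr2, "awkward.snappy.pq", compression="snappy", row_group_size=37738)
```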
With ClickHouse I'm able to complete the same task in 18.26 seconds. Its resulting file size is 794 MB.
The resulting Awkward Parquet file almost matches ClickHouse's in terms of row-group count and snappy compression.
Below is a flame graph from Awkward's execution.
I ran a 10-line version of the above file through both PyArrow and ClickHouse. This is what `strace` and `perf` reported.

```console
$ perf stat -dd \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```
ClickHouse's syscall counts were all much lower:
As were context switch and page fault counts.
These are the versions of software involved: