Replies: 6 comments
-
I only briefly looked at this, but the flamegraph (thanks for that, by the way!) suggests that ~83% of the time is spent in `ak.from_json`, not in the Parquet writing. So, would you be able to repeat the test and cover only the `ak.to_parquet` step? Also, I noticed that the output only has two columns.
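For example, here is a minimal sketch of timing the two steps separately (simplified to the outer parse only; the file name and Parquet settings are taken from your command, the timing harness itself is just illustrative):

```python
import time

import awkward as ak

# Time only the JSON-reading step.
start = time.perf_counter()
with open("California.jsonl", "rb") as f:
    arr = ak.from_json(f, line_delimited=True)
print(f"ak.from_json:  {time.perf_counter() - start:.2f} s")

# Time only the Parquet-writing step, on the already-parsed array.
start = time.perf_counter()
ak.to_parquet(arr, "awkward.snappy.pq", compression="snappy", row_group_size=37738)
print(f"ak.to_parquet: {time.perf_counter() - start:.2f} s")
```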
-
Sorry, I should have mentioned that I'm interested in the end-to-end process rather than just the Parquet writing on its own. Is there anything I can do to speed up the JSON-reading portion of this workload if it's 83% of the work? I didn't notice that it only produced two columns. If you have any tips for the JSON side of things, I can launch a new VM and, at the same time, see if I can figure out why there aren't three columns.
-
Oh, sure: we're interested in the end-to-end workflow too! We cater to a number of different use cases, and at some point we are more concerned with the performance of long-lived data, i.e. data that is transformed / operated upon before serialisation. @jpivarski and @ianna have a better feel for the performance implications of using `ak.from_json` here.
-
There is an option: `ak.from_json` can take a `schema` (a JSONSchema description of the data type), so that it doesn't have to discover the type while parsing. Although the exact speedup will depend on the data type, our tests found a noticeable improvement from supplying one. Also in that test, running RapidJSON (the C++ library that we use for JSON parsing) by itself with no output was not quite twice as fast as running RapidJSON with array output. As a pipeline, walking over the schema description and filling arrays is about as expensive as parsing the JSON itself, and therefore there's not much further you can go. (Even with a perfectly streamlined walk over an optimized schema description, writing to contiguous memory will have some cost.)

Looking at your data type, it's not very complicated, so I doubt walking through the data type is the bottleneck here. Also, it has JSON nested in JSON:

```json
{
  "release": 1,
  "capture_dates_range": "",
  "geom": "{\"type\":\"Polygon\",\"coordinates\":[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]}"
}
```

so your first interpretation makes a big string column for `geom`.

If you're willing to get hacky and optimize this specific case (i.e. write code that won't work for other data types), then I've got some suggestions. I take it from your code that you're interested in the coordinates, and I'm going to assume that the structure of these `geom` strings never changes. So we can start by stripping out just the coordinates with

```python
>>> outer_json[outer_json.index("[") : outer_json.rindex("]") + 1]
'[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]'
```

Just to be clear, if you have any expectation that the data type can change, this is extremely brittle. But fast!

It only produces one of these outer JSON objects, right? If so, then you can immediately parse the inner JSON. If not, you'll have to concatenate many of these strings. String concatenation is probably faster in pure Python (accumulate a list and join it at the end), or you could do the stripping with command-line tools:

```console
% fgrep '[' outer_json.json | sed 's/[^[]*//' | sed 's/}"//'
[[[-114.127454,34.265674],[-114.127476,34.265839],[-114.127588,34.265829],[-114.127565,34.265663],[-114.127454,34.265674]]]
```

Once you have a single string or file-like stream (`ak.from_json` accepts any file-like object with a `read` method), you can parse the inner JSON with a schema. Here's the schema for an array of lists of lists of numbers:

```json
{"type": "array", "items": {"type": "array", "items": {"type": "array", "items": {"type": "number"}}}}
```

So this will read your concatenated (or only instance of) inner JSON:

```python
>>> ak.from_json(inner_json, line_delimited=True, schema=that_schema)
<Array [[[-114, 34.3], ..., [-114, 34.3]]] type='1 * var * var * float64'>
```

(The distinction between concatenated and line-delimited JSON doesn't matter here; `line_delimited=True` handles both.)
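Putting those pieces together, here is a minimal end-to-end sketch. It is not a drop-in replacement for the original command: it assumes one record per line, that every `geom` string keeps exactly this Polygon structure, and it reuses the input file name and Parquet settings from this thread (the output name `coords.snappy.pq` is just a placeholder).

```python
import awkward as ak

# JSONSchema from above: an array of lists of lists of numbers.
coords_schema = {
    "type": "array",
    "items": {
        "type": "array",
        "items": {"type": "array", "items": {"type": "number"}},
    },
}

# Strip out just the coordinate brackets from each line (brittle: relies on
# the first "[" and the last "]" delimiting the coordinates).
chunks = []
with open("California.jsonl", "r") as f:
    for line in f:
        chunks.append(line[line.index("[") : line.rindex("]") + 1])

# Parse all of the concatenated inner JSON documents in one call, using the
# schema so that from_json can skip type discovery.
coords = ak.from_json("\n".join(chunks), line_delimited=True, schema=coords_schema)

ak.to_parquet(coords, "coords.snappy.pq", compression="snappy", row_group_size=37738)
```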
-
I copied and pasted the Python from another ticket that suggested I look at your project. I've corrected it to produce a PQ file that matches what ClickHouse produced. In my latest test, ClickHouse produced a PQ file in 18.49 seconds and Awkward did it in 28.07s. Awkward is the fastest converter in the Python space that I've seen so far. I'll look into your schema specification suggestion, but in the meantime, here are the commands and stats.

```console
$ /usr/bin/time -v \
    python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"

$ sudo su
$ source .pq/bin/activate

$ strace -wc \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'cali10.awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"

$ perf stat -dd \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); ak.to_parquet(arr, 'cali10.awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```

If you have any further ideas, please share them. If not, I'm okay if you close this ticket.
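For what it's worth, here is a small sketch of how the two Parquet files could be compared for row counts, row groups, and compression with `pyarrow` (the ClickHouse output file name below is a placeholder, not one used in this thread):

```python
import pyarrow.parquet as pq

# Open both files; "clickhouse.snappy.pq" is a placeholder name.
awkward_file = pq.ParquetFile("awkward.snappy.pq")
clickhouse_file = pq.ParquetFile("clickhouse.snappy.pq")

for name, f in [("awkward", awkward_file), ("clickhouse", clickhouse_file)]:
    meta = f.metadata
    # Compression is recorded per column chunk; check the first chunk of the
    # first row group as a representative value.
    compression = meta.row_group(0).column(0).compression
    print(name, meta.num_rows, "rows,", meta.num_row_groups, "row groups,", compression)
    print(name, "schema:", f.schema_arrow)
```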
-
Actually, I think it should remain as a Discussion. I don't have any further ideas, though.
-
Version of Awkward Array
2.0.5
Description and code to reproduce
The following was run on Ubuntu 20 on an `e2-highcpu-32` GCP VM with 32 GB of RAM and 32 vCPUs. I downloaded the California dataset from https://github.com/microsoft/USBuildingFootprints and converted it from JSONL into Parquet with pyarrow, and I attempted to do the same with fastparquet.
Awkward is able to produce a 947 MB Parquet file in 64.60 seconds.
```console
/usr/bin/time -v \
    python3 -c "import awkward as ak; f = open('California.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```
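For readability, here is the same one-liner unpacked into a commented script; the variable names are unchanged and the comments are only a gloss on what each step does:

```python
import awkward as ak

with open("California.jsonl", "rb") as f:
    # Parse the outer JSONL records (release, capture_dates_range, geom).
    arr = ak.from_json(f, line_delimited=True)

# Awkward stores strings as lists of characters, so for a freshly parsed array
# the flat character buffer of the geom column is the concatenation of all
# geom strings back to back.
narr = arr.geom.layout.content.to_numpy()

# Parse those concatenated inner JSON documents (the nested GeoJSON strings).
arr2 = ak.from_json(narr.tobytes(), line_delimited=True)

ak.to_parquet(arr2, "awkward.snappy.pq", compression="snappy", row_group_size=37738)
```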
With ClickHouse I'm able to complete the same task in 18.26 seconds. Its resulting file size is 794 MB.
The resulting Awkward Parquet file almost matches ClickHouse's in terms of row-group count and snappy compression.
Below is a flame graph from Awkward's execution.
I ran a 10-line version of the above file through both PyArrow and ClickHouse. This is what `strace` and `perf` reported.

```console
$ perf stat -dd \
    python3 -c "import awkward as ak; f = open('cali10.jsonl', 'rb'); arr = ak.from_json(f, line_delimited=True); narr = arr.geom.layout.content.to_numpy(); arr2 = ak.from_json(narr.tobytes(), line_delimited=True); ak.to_parquet(arr2, 'awkward.snappy.pq', compression='snappy', row_group_size=37738); f.close()"
```
ClickHouse's syscall counts were all much lower:
As were context switch and page fault counts.
These are the versions of software involved: