Performance degenerates a lot when reading from multiple threads compared with a single thread (when running Clickhouse) #1985

sighingnow · 2022-05-06T04:41:17Z

What happened:

Hi folks,

We are trying to run the ClickHouse benchmark on JuiceFS (with OSS as the underlying object storage). Even when JuiceFS has already cached the whole file on the local disk, we see a large performance gap compared with running the benchmark on a local SSD when ClickHouse executes with 4 threads. The degradation largely disappears if we limit ClickHouse to 1 thread.

More specifically, we are running the ClickHouse benchmark with scale factor 1000 and replaying the 29th query (the involved table, Referer, is around 24Gi, and the query is a full table scan), with a 100Gi local SSD given as the cache directory.

After several runs to make sure the involved files are fully cached locally by JuiceFS, we observed the following performance numbers:

threads	ssd runtime (seconds)	juicefs runtime (seconds)
4	24	56
1	88	100

As you can see, JuiceFS suffers much more performance degradation when the workload runs with multiple threads. Is that behaviour expected for JuiceFS?
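To put numbers on the gap, here is a small sketch that computes the JuiceFS-to-SSD slowdown and the 1-to-4-thread speedup from the table above. The dictionary layout and the function names are only for illustration.

```python
# Runtimes in seconds copied from the table above: threads -> (ssd, juicefs).
RUNTIMES = {4: (24, 56), 1: (88, 100)}


def slowdown(threads):
    """How many times slower JuiceFS is than local SSD at a given thread count."""
    ssd, jfs = RUNTIMES[threads]
    return jfs / ssd


def speedup(column):
    """Speedup from 1 thread to 4 threads; column 0 is SSD, column 1 is JuiceFS."""
    return RUNTIMES[1][column] / RUNTIMES[4][column]


if __name__ == "__main__":
    for threads in sorted(RUNTIMES):
        print(f"{threads} thread(s): JuiceFS is {slowdown(threads):.2f}x slower than SSD")
    print(f"SSD speedup, 1 -> 4 threads:     {speedup(0):.2f}x")
    print(f"JuiceFS speedup, 1 -> 4 threads: {speedup(1):.2f}x")
```

With these numbers the slowdown is about 1.14x at 1 thread and 2.33x at 4 threads. The SSD run speeds up about 3.67x when going to 4 threads, while the JuiceFS run speeds up only about 1.79x.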

Thanks!

What you expected to happen:

The performance gap shouldn't be this large in the 4-thread setting.

How to reproduce it (as minimally and precisely as possible):

Run the ClickHouse benchmark inside a JuiceFS-mounted directory.

Anything else we need to know?

Environment:

JuiceFS version (use juicefs --version) or Hadoop Java SDK version: juicefs version 1.0.0-beta2+2022-03-04T03:00:41Z.9e26080
Cloud provider or hardware configuration running JuiceFS: aliyun ecs.i3g.2xlarge, (local ssd instance with 4 physical cores and 32Gi memory)
OS (e.g cat /etc/os-release): Ubuntu 20.04.3 LTS
Kernel (e.g. uname -a): Linux mk1 5.4.0-100-generic #113-Ubuntu SMP Thu Feb 3 18:43:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Object storage (cloud provider and region, or self maintained): OSS
Metadata engine info (version, cloud provider managed or self maintained): redis
Network connectivity (JuiceFS to metadata engine, JuiceFS to object storage): localhost, redis and juicefs are deployed on the same instance
Others: clickhouse latest version

SandyXSD (Collaborator) · 2022-05-06T05:25:27Z

Could you run juicefs stats <mount_point> --verbosity 1 when running the benchmark and post the result here? It's used to make sure:

all data blocks are cached, no Get from OSS
latency to metadata engine is acceptable

Moreover, increasing metadata cache for read-only workload might help, see: https://juicefs.com/docs/cloud/cache#metadata-cache.

davies (Maintainer) · 2022-05-06T09:39:06Z

@sighingnow You may also want to increase the memory buffer used (--buffer-size) for read-ahead.

sighingnow (Author) · 2022-05-07T03:43:47Z

Thanks for the guidance @davies @SandyXSD.

TL;DR: the root cause in our case is the high CPU usage of JuiceFS when the workload (ClickHouse) reads from 4 threads. JuiceFS consumes about 250% CPU when ClickHouse reads with 4 threads, causing CPU contention with the workload running on top of JuiceFS. Is such CPU usage expected behaviour for JuiceFS?

The juicefs stats results are as follows:

with 1 thread:

24.5%  315M   70M|1654  0.02   205M    0 |  14  0.10 | 204M    0 |   0     0
30.0%  316M   70M|2127  0.02   260M    0 |  18  0.22 | 260M    0 |   0     0
29.2%  317M   74M|1978  0.02   246M    0 |  14  0.11 | 244M    0 |   0     0
29.3%  319M   74M|1966  0.07   241M 1345B|  32  0.11 | 244M 1345B|   0  1345B
29.7%  320M   39M|2042  0.04   250M 3104B|  33  0.10 | 253M 3104B|   0  3104B
28.3%  320M   39M|1950  0.02   243M    0 |   6  0.12 | 244M    0 |   0     0
25.7%  321M   39M|1752  0.02   216M    0 |  20  0.10 | 216M    0 |   0     0

with 2 threads:

39.0%  363M   32M|2618  0.03   320M    0 |  20  0.14 | 324M    0 |   0     0
50.1%  363M   32M|3352  0.03   412M    0 |   8  0.13 | 412M    0 |   0     0
23.0%  363M   29M|1507  0.05   182M    0 |  27  0.62 | 183M    0 |   0     0
30.3%  363M   29M|2064  0.02   257M    0 |   8  0.11 | 256M    0 |   0     0
36.9%  361M   33M|2532  0.02   315M    0 |  12  0.10 | 316M    0 |   0     0
58.1%  361M   57M|3794  0.03   453M    0 |  14  0.13 | 483M    0 |   0     0
69.9%  361M   65M|4755  0.03   565M    0 |  11  0.13 | 564M    0 |   0     0
61.9%  361M   61M|4230  0.03   503M    0 |   9  0.13 | 508M    0 |   0     0
48.3%  361M   65M|3293  0.03   390M    0 |  12  0.14 | 388M    0 |   0     0
51.9%  361M   57M|3460  0.03   404M    0 |  20  0.20 | 408M    0 |   0     0
48.3%  361M   57M|3317  0.03   392M    0 |  13  0.10 | 392M    0 |   0     0
63.7%  361M   57M|4688  0.03   567M    0 |   7  0.14 | 568M    0 |   0     0

with 4 threads:

 252%  290M   17M|3520  0.86   437M    0 |  17  0.18 |4041M    0 |   0     0
 233%  290M   10M|3619  0.70   451M    0 |   8  0.26 |3717M    0 |   0     0
 249%  288M   24M|4036  0.60   501M    0 |  13  0.19 |4098M    0 |   0     0
 256%  288M 9344K|4180  0.64   519M    0 |   8  0.46 |4129M    0 |   0     0
 164%  288M   32M|3686  0.35   456M    0 |  13  0.13 |2478M    0 |   0     0
 124%  290M   36M|3278  0.25   379M    0 |  22  0.48 |1875M    0 |   0     0
 194%  301M   55M|3080  0.66   379M    0 |  31  0.54 |3086M    0 |   0     0
 197%  303M   57M|2888  0.77   359M    0 |  12  0.15 |3077M    0 |   0     0
 212%  308M   63M|3478  0.60   433M    0 |  13  0.27 |3331M    0 |   0     0
 205%  312M   55M|3343  0.62   417M    0 |   7  0.52 |3215M    0 |   0     0
 259%  316M   56M|3521  0.89   438M    0 |  14  0.22 |4238M    0 |   0     0
 268%  322M   53M|3660  0.86   453M    0 |  19  0.19 |4350M    0 |   0     0
 223%  327M   58M|3260  0.79   405M    0 |  12  0.23 |3491M    0 |   0     0
 247%  332M   59M|3531  0.78   440M    0 |   8  0.14 |4053M    0 |   0     0
 256%  339M   62M|3472  0.86   432M    0 |  16  0.22 |4128M    0 |   0     0
 253%  343M   59M|3340  0.98   416M    0 |   6  0.64 |3979M    0 |   0     0

Our machine has 4 physical cores and 2 threads per core

root@mk1:/mnt/scripts# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz
Stepping:                        7
CPU MHz:                         2499.998
BogoMIPS:                        4999.99
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        4 MiB
L3 cache:                        35.8 MiB
NUMA node0 CPU(s):               0-7

When running the workload with 4 threads, JuiceFS and the workload itself (ClickHouse) contend for CPU resources, which yields bad latency numbers and thus poor performance. This can be verified by running ClickHouse with 2 threads, which yields a running time similar to that with 4 threads and about half of that with 1 thread.

threads	running time (seconds)	juicefs average read latency (us, from `juicefs profile`)
1	98.518	25
2	54.316157.842	34
4	50.548	799

> @sighingnow You may also want to increase the memory buffer used (--buffer-size) for read-ahead.

It doesn't help: --buffer-size=300M and --buffer-size=1200M with 4 threads yield the same performance numbers.

SandyXSD (Collaborator) · 2022-05-07T05:41:35Z

It looks like the bandwidth of block cache read is much greater than that of fuse read when running with 4 threads, which might be a reason causing higher CPU usage. We'll look into that to see if this behavior could be improved.
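The gap between the two bandwidths can be read directly off the stats output posted above. The sketch below parses one line and divides block cache read bandwidth by FUSE read bandwidth. The column order is an assumption (it is not stated in this thread): cpu, mem, buf | fuse ops, lat, read, write | meta ops, lat | blockcache read, write | object get, put. The helper names are hypothetical.

```python
# Assumed column order of a `juicefs stats` line (not stated in this thread):
#   cpu mem buf | fuse ops lat read write | meta ops lat | blockcache read write | object get put
UNITS = {"B": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def to_bytes(field):
    """Convert a field such as '437M', '1345B' or '0' to a number of bytes."""
    if field and field[-1] in UNITS:
        return float(field[:-1]) * UNITS[field[-1]]
    return float(field)


def read_amplification(line):
    """Ratio of block cache read bandwidth to FUSE read bandwidth for one line."""
    groups = [group.split() for group in line.split("|")]
    fuse_read = to_bytes(groups[1][2])   # third field of the fuse group
    cache_read = to_bytes(groups[3][0])  # first field of the blockcache group
    return cache_read / fuse_read


if __name__ == "__main__":
    samples = {
        "1 thread": "24.5%  315M   70M|1654  0.02   205M    0 |  14  0.10 | 204M    0 |   0     0",
        "4 threads": " 252%  290M   17M|3520  0.86   437M    0 |  17  0.18 |4041M    0 |   0     0",
    }
    for label, line in samples.items():
        print(f"{label}: block cache read / fuse read = {read_amplification(line):.1f}x")
```

Under that column assumption, the first 4-thread sample works out to about 9.2x, versus about 1.0x for the first 1-thread sample.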

sighingnow (Author) · 2022-05-07T06:23:41Z

> It looks like the bandwidth of block cache read is much greater than that of fuse read when running with 4 threads, which might be a reason causing higher CPU usage. We'll look into that to see if this behavior could be improved.

Thank you! I also noticed similarly high CPU usage in the benchmarking documentation page under the -p 4 setting: https://juicefs.com/docs/community/performance_evaluation_guide/.

Looking forward to your insights!

davies (Maintainer) · 2022-05-08T08:43:13Z

@sighingnow Can you tell us how to reproduce this issue?

sighingnow (Author) · 2022-05-09T02:19:34Z

> @sighingnow Can you tell us how to reproduce this issue?

setup juicefs

sudo juicefs format --force --storage oss \
    --bucket https://..... <an oss endpoint> \
    --block-size 4096 \
    --access-key $OSS_KEY \
    --secret-key $OSS_KEY_SECRET \
    <a localhost redis endpoint> fusejfsc-4096

(we are using OSS as the underlying storage, but I think it doesn't matter because we ensure all data is cached on the local disk)

juicefs mount <a localhost redis endpoint> /mnt/jfs-100g \
      --cache-dir /clickhouse/jfsc_100g_4096 \
      --cache-size 10000000000 \
      -d

(the cache directory /clickhouse/jfsc_100g_4096 is on a local SSD, on aliyun ecs.i3g.2xlarge)

setup clickhouse

prepare the code,

cd /clickhouse/jfsc_100g_4096
git clone https://github.com/ClickHouse/ClickHouse.git --depth=1

prepare the benchmark suite (to save time, you could first modify the SQL file benchmark/clickhouse/queries.sql to keep only the 29th line, SELECT domainWithoutWWW(Referer) AS key, avg(length(Referer)) AS l, count() AS c, any(Referer) FROM {table} WHERE Referer != '' GROUP BY key HAVING c > 100000 ORDER BY l DESC LIMIT 25;, before running the following command)

./hardware.sh 1000

monitor the behaviour of juicefs

The first run of ./hardware.sh 1000 sets up the environment that the benchmark requires. By default ClickHouse uses as many threads as there are physical cores to execute the query. You can then monitor the CPU usage of JuiceFS with juicefs stats, or the latency of operations with juicefs profile, and run ./hardware.sh 1000 again to run query 29 repeatedly.

sighingnow (Author) · 2022-05-09T02:21:46Z

We also noticed similar performance degradation in other queries (e.g., queries 21, 23, 28, and 29), but looking into query 29 alone should be enough to see the high CPU usage of JuiceFS.

sighingnow (Author) · 2022-05-09T06:19:56Z

> @sighingnow Can you tell us how to reproduce this issue?

Also, I think some micro benchmark could reproduce the high CPU usage issue as well :)
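One possible shape for such a micro benchmark is sketched below. It is only an assumption about what matters in the ClickHouse access pattern (several threads each reading a different region of one large file through a single open file descriptor), and it has not been verified to reproduce the CPU usage reported here. The function name and the script usage are hypothetical; os.pread is available on Unix only.

```python
import os
import sys
import threading
import time


def parallel_read(path, threads, chunk_size=1 << 20):
    """Read the whole file through ONE shared file descriptor.

    The file is split into `threads` contiguous regions and each thread reads
    its own region sequentially with os.pread, so the file system sees several
    sequential streams at different offsets on the same open file.
    Returns (total_bytes_read, elapsed_seconds).
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        region = -(-size // threads)  # ceiling division
        counts = [0] * threads

        def worker(index):
            offset = index * region
            end = min(offset + region, size)
            while offset < end:
                data = os.pread(fd, min(chunk_size, end - offset), offset)
                if not data:
                    break
                offset += len(data)
                counts[index] += len(data)

        start = time.perf_counter()
        pool = [threading.Thread(target=worker, args=(i,)) for i in range(threads)]
        for t in pool:
            t.start()
        for t in pool:
            t.join()
        return sum(counts), time.perf_counter() - start
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical usage: python parallel_read.py <file on the mount> [threads ...]
    if len(sys.argv) > 1:
        for n in [int(a) for a in sys.argv[2:]] or [1, 2, 4]:
            total, seconds = parallel_read(sys.argv[1], n)
            print(f"{n} thread(s): {total / seconds / 2**20:.0f} MiB/s")
```

Running it against a large, fully cached file on the mount while watching juicefs stats would show whether CPU usage rises with the thread count. Repeated runs may be served partly from the kernel page cache, so the numbers are only comparable under the same cache conditions.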

sanwan · 2022-05-09T08:52:23Z

I will try the steps above.

SandyXSD (Collaborator) · 2022-05-09T14:31:44Z

The reason is that JuiceFS limits the number of random read threads for each opened file descriptor (to save memory). It looks like the current limit (2) is not enough for scenarios like this ClickHouse query, which concurrently reads a huge file at different offsets in the same process.
For a quick fix, you may try increasing the value of readSessions to a number bigger than Clickhouse threads. For example, readSessions = 6 should be enough for your test with 4 threads.
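Why a small limit hurts interleaved readers can be illustrated with a toy model. This is not JuiceFS's reader code: the LRU eviction policy, the round-robin interleaving, and the function name are all assumptions made for the illustration. It only shows that when more sequential streams are active than can be tracked, each stream is forgotten before its next read arrives.

```python
from collections import OrderedDict


def sequential_hit_rate(streams, sessions, reads_per_stream=1000, block=1):
    """Toy model: fraction of reads recognised as continuing a tracked stream.

    `streams` readers each scan their own region of one file and issue reads in
    round-robin order. At most `sessions` sequential streams are tracked, and
    the least recently used one is dropped when a new one is needed.
    """
    tracked = OrderedDict()  # next expected offset -> None, kept in LRU order
    region = reads_per_stream * block
    hits = 0
    for step in range(reads_per_stream):
        for s in range(streams):
            offset = s * region + step * block
            if offset in tracked:
                hits += 1
                del tracked[offset]
            elif len(tracked) >= sessions:
                tracked.popitem(last=False)  # evict the least recently used
            tracked[offset + block] = None  # this stream is now most recent
    return hits / (streams * reads_per_stream)


if __name__ == "__main__":
    for limit in (2, 6):
        rate = sequential_hit_rate(streams=4, sessions=limit)
        print(f"4 streams, {limit} sessions: hit rate {rate:.3f}")
```

In this model, 4 streams with 2 sessions give a hit rate of 0.000, while 4 streams with 6 sessions give 0.999 (only the first read of each stream misses).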

Btw, the high CPU usage in JuiceFS benchmark doc is caused by getting objects from storage, not related to this issue.

sighingnow (Author) · 2022-05-09T15:04:09Z

> For a quick fix, you may try increasing the value of readSessions to a number bigger than Clickhouse threads. For example, readSessions = 6 should be enough for your test with 4 threads.

@SandyXSD Thanks for the quick investigation! It does work: the 4-thread runtime on JuiceFS improves from 56 seconds to 35 seconds. However, there is still a large gap compared with SSD (around 24 seconds), and I still see 100+% CPU usage with juicefs stats ./. Increasing readSessions further, to 16, doesn't improve things either.

 108%  719M  195M|6864  0.06   781M    0 |   6  0.13 | 784M    0 |   0     0
 120%  719M  191M|7463  0.06   848M    0 |  14  0.16 | 848M    0 |   0     0
 114%  719M  195M|7212  0.05   822M    0 |   6  0.11 | 824M    0 |   0     0
 117%  719M  199M|7329  0.06   826M    0 |  14  0.14 | 836M    0 |   0     0
 105%  719M  195M|6527  0.06   732M    0 |   6  0.14 | 736M    0 |   0     0
 107%  719M  208M|6668  0.05   748M    0 |  14  0.25 | 741M    0 |   0     0

Do you folks have any further insights about the problem?

Thanks!

sighingnow (Author) · 2022-05-09T15:05:04Z

> Btw, the high CPU usage in JuiceFS benchmark doc is caused by getting objects from storage, not related to this issue.

Thanks for the clarification.

SandyXSD (Collaborator) · 2022-05-10T05:04:59Z

Yes, increasing readSessions won't help as long as it's already big enough.

The current result is expected. Because JuiceFS is a network file system built on FUSE, it consumes more CPU (for splitting buffers, copying data between kernel and userspace, etc.) than local kernel file systems, and usually has somewhat higher latency. We have the plan to improve performance after v1.0-GA is released, but for now, it's not the main focus.

sighingnow (Author) · 2022-05-10T06:08:32Z

> We have the plan to improve performance after v1.0-GA is released, but for now, it's not the main focus.

Copy that. Thanks for the information.
