What is blk-mq:
Multi-Queue Block IO Queueing Mechanism in the Linux kernel (see https://www.kernel.org/doc/html/latest/block/blk-mq.html).
What is NIC:
Network Interface Controller
What is numa:
Non-Uniform Memory Access: a design where memory access time depends on the memory location relative to the processor.
Goal: to design a storage stack that achieves both μs-scale tail latency and high throughput.
This is important because no existing state-of-the-art (SOTA) storage stack achieves the two goals simultaneously.
Two SOTA storage stacks: Linux and SPDK.
- Linux: suffers from high tail latencies due to head-of-line (HOL) blocking, and requires careful orchestration of compute, storage, and network resources.
- SPDK: good for one application per core. But when multiple applications share a core, the storage stack and the kernel CPU scheduler interact frequently, inflating latency and degrading throughput.
Goal: to simultaneously enable μs-scale tail latency for latency-sensitive applications (L-apps) and high throughput for throughput-bound applications (T-apps).
In comparison to SPDK, blk-switch improves the average and P99 tail latency by as much as 12× and 18×, respectively, while achieving comparable or higher throughput
Switched-architecture (load balancing):
- blk-switch introduces a per-core, multi-egress-queue block layer architecture. Each egress queue is mapped to a driver queue, and each queue is assigned a priority based on the performance goal of the application it serves (latency-sensitive vs. throughput-bound).
- blk-switch decouples request processing from application cores in order to utilize all cores efficiently: requests submitted at one core can be steered from the ingress core to another core's egress queue, and after processing, the response is routed back to the original application core. A minimal sketch of these per-core queues follows.
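A hypothetical C sketch of the per-core, multi-egress-queue idea; all names, types, and the ring layout here are my assumptions, not blk-switch's actual kernel code:

```c
/* Hypothetical illustration of blk-switch's per-core, multi-egress-queue
 * block layer; names and layout are assumptions, not the real kernel code. */
#include <stddef.h>

enum app_class { APP_LATENCY, APP_THROUGHPUT };   /* L-app vs. T-app */

struct request {
    enum app_class cls;
    /* ... I/O payload ... */
};

#define RING_SIZE 256

struct egress_queue {
    struct request *ring[RING_SIZE];  /* maps 1:1 onto a driver queue */
    unsigned int head, tail;
};

struct core_ctx {
    /* one egress queue per priority class on every core */
    struct egress_queue eq[2];
};

/* Submission path: place the request on the egress queue matching its
 * class; L-app queues are later drained with strict priority. */
static void submit(struct core_ctx *core, struct request *rq)
{
    struct egress_queue *q = &core->eq[rq->cls];
    q->ring[q->tail++ % RING_SIZE] = rq;
}
```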
Request Steering (handling transient loads):
- Steers T-app requests at the ingress queues of transiently overloaded cores to other cores' egress queues (illustrative decision logic below).
- Not implemented on the remote storage server side because it is unnecessary there.
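An assumed version of the steering decision, reusing the types from the sketch above; the load estimate in particular is my simplification of the paper's policy:

```c
/* Assumed request-steering logic: T-app requests on a transiently
 * overloaded core are steered to the least-loaded core's egress queue;
 * L-app requests stay local to preserve latency. */

/* Hypothetical per-core load estimate: outstanding T-app requests. */
static unsigned int core_load(const struct core_ctx *core)
{
    const struct egress_queue *q = &core->eq[APP_THROUGHPUT];
    return q->tail - q->head;
}

static int pick_egress_core(struct core_ctx *cores, int ncores,
                            int ingress_core, const struct request *rq)
{
    int best = ingress_core;

    if (rq->cls == APP_LATENCY)
        return ingress_core;      /* never steer L-app requests */

    for (int c = 0; c < ncores; c++)
        if (core_load(&cores[c]) < core_load(&cores[best]))
            best = c;             /* steer toward the least-loaded core */
    return best;
}
```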
Application Steering (handling persistent loads):
- Steers threads from persistently overloaded cores to underutilized cores (a user-space analogue of the mechanism is sketched below).
- Implemented on both the application side and the remote storage server.
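blk-switch implements this in-kernel; purely to illustrate the mechanism, a user-space analogue using the standard sched_setaffinity(2) API might look like:

```c
/* User-space analogue of application steering: re-pin a persistently
 * overloaded thread onto an underutilized core. blk-switch does the
 * equivalent inside the kernel; this only illustrates the idea. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

static int steer_thread_to_core(pid_t tid, int target_core)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(target_core, &set);
    /* returns 0 on success, -1 (with errno set) on failure */
    return sched_setaffinity(tid, sizeof(set), &set);
}
```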
- Linux’s per-core block layer queues + modern multi-queue storage and network hardware make the storage stack conceptually similar to a network switch.
- Decouple ingress (application-side block layer) queues from egress (device-side driver) queues.
- Different timescales for the two steering mechanisms: per-request timescale for request steering, coarser timescale for application steering.
Goal: μs-scale average and tail latency for L-apps & high throughput for T-apps.
Evaluation Setup
- Evaluated systems: Linux, SPDK, blk-switch
- Performance metrics: average and tail latency (P99 & P99.9) for L-apps, total throughput of all applications, and throughput-per-core
- Default workloads: fio (Linux), SPDK perf (SPDK), RocksDB (blk-switch); an illustrative fio command follows this list
- To push the bottleneck to storage stack processing: two 32-core servers connected directly over a 100Gbps link
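For concreteness, here is the kind of fio invocation that could emulate an L-app issuing small random reads; the device path and every parameter are my assumptions, not the paper's exact configuration:

```sh
# Hypothetical L-app emulation: 4KB random reads at queue depth 1.
fio --name=lapp --filename=/dev/nvme0n1 --direct=1 \
    --ioengine=libaio --rw=randread --bs=4k --iodepth=1 \
    --numjobs=1 --time_based --runtime=30
```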
Achievements
- Increase L-apps competing for host compute and network resources with T-apps:
  - Linux: high throughput but high latency, due to HOL blocking (a large T-app request ahead of an L-app request in a shared queue delays it).
  - SPDK: as L-apps increase, latency inflates and throughput degrades, due to undesirable interactions between the storage stack and the kernel CPU scheduler.
    - Another reason: increased L3 cache miss rate. Higher cache miss rates increase the per-byte CPU overhead of kernel TCP processing (mainly due to data copy), resulting in lower throughput-per-core.
  - blk-switch: achieves both low latency and high throughput:
    - Application steering: isolates L-apps and T-apps onto separate cores.
    - Request steering: utilizes cores left idle by L-apps.
    - Prioritizing: L-app requests are served before T-app requests in egress queues (sketched below).
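The prioritizing point, sketched with the same hypothetical types as above: strict priority across the two per-core egress queues prevents a large T-app request from head-of-line blocking an L-app request queued behind it.

```c
/* Assumed strict-priority egress processing: drain pending L-app
 * requests before any T-app request, so large T-app I/Os cannot
 * head-of-line block small L-app I/Os. */
static struct request *next_request(struct core_ctx *core)
{
    struct egress_queue *lq = &core->eq[APP_LATENCY];
    struct egress_queue *tq = &core->eq[APP_THROUGHPUT];

    if (lq->head != lq->tail)                     /* L-app work pending */
        return lq->ring[lq->head++ % RING_SIZE];
    if (tq->head != tq->tail)
        return tq->ring[tq->head++ % RING_SIZE];
    return NULL;                                  /* both queues empty */
}
```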
Increase L-app threads competing for host compute and network resources with T-apps:
Increase the load induced by T-apps:
blk-switch maintains its μs-scale average and tail latency for L-apps with a varying number of cores, even when scheduling across NUMA nodes.
blk-switch is able to maintain low average and tail latencies even when applications operate at throughput close to 200Gbps.
blk-switch latency is largely overshadowed by SSD/RocksDB access latency
- How to improve the performance of blk-switch on SSD & RocksDB?
- How to adapt blk-switch to hardware-level isolation?
- Is it possible to adopt blk-switch in mainline Linux?
[optional] What questions do you have?
The concept of head-of-line blocking is not very clear to me.