In the combinatorial prime counting algorithms, the computation of the special leaves is the most computationally expensive task.
In order to speed up that computation Lagarias-Miller-Odlyzko [1] have split up the special leaves into easy
special leaves and hard special leaves. The contribution of each easy special leaf can be computed in $O(1)$ using a
PrimePi[n] lookup table, whereas the contribution of each hard special leaf requires evaluating the partial sieve function
$\phi(x, a)$ and cannot be computed in $O(1)$.
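As a minimal illustration of such a lookup table (a sketch, not primecount's actual implementation), PrimePi[n] can be precomputed with a simple sieve of Eratosthenes, after which each lookup is a single $O(1)$ array access:

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch: precompute a PrimePi lookup table for [0, limit]
// using a simple sieve of Eratosthenes. Afterwards pi[n] returns the
// number of primes <= n in O(1).
std::vector<uint64_t> build_prime_pi(uint64_t limit)
{
  std::vector<bool> is_prime(limit + 1, true);
  is_prime[0] = false;
  if (limit >= 1)
    is_prime[1] = false;

  for (uint64_t i = 2; i * i <= limit; i++)
    if (is_prime[i])
      for (uint64_t j = i * i; j <= limit; j += i)
        is_prime[j] = false;

  std::vector<uint64_t> pi(limit + 1);
  uint64_t count = 0;

  for (uint64_t n = 0; n <= limit; n++)
  {
    if (is_prime[n])
      count++;
    pi[n] = count;
  }

  return pi;
}
```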
In the Deleglise-Rivat [2] and Gourdon [3] prime counting algorithms (which are based on the
Lagarias-Miller-Odlyzko algorithm) the computation of the easy special leaves requires looking up the number of primes ≤ $n$
with $n < \sqrt{x}$. Since a PrimePi[n] lookup table of size $\sqrt{x}$ is much too large to be practical, Deleglise-Rivat
[2] have suggested segmenting the interval $[0,\sqrt{x}[$ using a segment size of $y\ (\sim \sqrt[3]{x}\ \log^{3}{x})$.
Hence instead of using a PrimePi[n] lookup table of size $\sqrt{x}$, we now use a SegmentedPrimePi[n] lookup table of size
$y$, which also returns the number of primes ≤ $n$ but requires $n$ to be within the current segment $[low, low + y[$. This approach was
used in primecount up to version 6.4. However, this segment size causes severe scaling issues for large computations ($> 10^{22}$) as the
SegmentedPrimePi[n] lookup table becomes exceedingly large, e.g. at $10^{30}$ its size was 137 GiB in primecount. For this reason
Xavier Gourdon [3] suggested using a smaller segment size of $\sqrt{x/y}$, which is orders of magnitude
smaller and generally a good practical improvement.
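The sketch below illustrates the basic idea of such a table (class and parameter names are hypothetical; primecount's actual implementation is SegmentedPiTable.cpp): it only covers the current segment $[low, low + segment\ size[$, is initialized with a segmented sieve of Eratosthenes, and needs the number of primes below $low$ as an offset:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of a segmented PrimePi lookup table covering [low, low + size[.
// pi_low must be the number of primes < low and primes must contain all
// primes <= sqrt(low + size) in ascending order.
class SegmentedPrimePi
{
public:
  SegmentedPrimePi(uint64_t low, uint64_t size, uint64_t pi_low,
                   const std::vector<uint64_t>& primes)
    : low_(low), counts_(size)
  {
    std::vector<bool> is_prime(size, true);

    // 0 and 1 are not prime
    for (uint64_t n = low; n < std::min(low + size, (uint64_t) 2); n++)
      is_prime[n - low] = false;

    // Cross off the multiples of each sieving prime inside the segment
    for (uint64_t p : primes)
    {
      if (p * p >= low + size)
        break;
      uint64_t start = std::max(p * p, ((low + p - 1) / p) * p);
      for (uint64_t j = start; j < low + size; j += p)
        is_prime[j - low] = false;
    }

    // Prefix sums: counts_[i] = number of primes <= low + i
    uint64_t count = pi_low;
    for (uint64_t i = 0; i < size; i++)
    {
      if (is_prime[i])
        count++;
      counts_[i] = count;
    }
  }

  // Number of primes <= n, n must be inside [low, low + size[
  uint64_t operator()(uint64_t n) const
  {
    return counts_[n - low_];
  }

private:
  uint64_t low_;
  std::vector<uint64_t> counts_;
};
```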
The SegmentedPrimePi[n] lookup table is accessed very frequently in the computation of the easy special leaves (about once for each
easy special leaf) and these memory accesses are non-sequential. It is therefore important that the SegmentedPrimePi[n] lookup table fits into
the CPU's fast cache memory. While Xavier Gourdon's segment size is already considerably smaller, it is still too large for new
record computations. For this reason I suggest using an even smaller segment size of $\sqrt[4]{x}$ for the computation of the easy
special leaves. With a segment size of $\sqrt[4]{x}$ the SegmentedPrimePi[n] lookup table fits into the CPU's cache even for record
computations, e.g. at $10^{30}$ the SegmentedPrimePi[n] lookup table is only about 2 MiB in primecount. A segment
size of $\sqrt[4]{x}$ does not deteriorate the runtime complexity of the algorithm because the segmented sieve of Eratosthenes which is
used to initialize the SegmentedPrimePi[n] lookup table has the same runtime complexity as the sieve of Eratosthenes as long as
the segment size is not smaller than the square root of the total sieving distance, which here is $\sqrt{\sqrt{x}} = \sqrt[4]{x}$.
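For completeness, here is a short sketch of that argument, writing $N = \sqrt{x}$ for the total sieving distance and $s$ for the segment size: each sieving prime $p \le \sqrt{N}$ causes about $N/p$ crossings in total plus $O(1)$ work per segment to locate its first multiple in that segment, hence

$$\sum_{p \le \sqrt{N}} \left( \frac{N}{p} + O\!\left(\frac{N}{s}\right) \right) = O(N \log\log N) + O\!\left(\frac{N}{s}\, \pi(\sqrt{N})\right),$$

and for $s \ge \sqrt{N} = \sqrt[4]{x}$ the second term is bounded by $\pi(\sqrt{N})\, \sqrt{N} = O(N)$, so the total work remains $O(N \log\log N)$, the same as for the non-segmented sieve of Eratosthenes.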
Note that Deleglise-Rivat [2] have split up the easy special leaves into many formulas and suggest using segmentation only for the two
formulas that need to look up the number of primes < $\sqrt{x}$, whereas all other formulas that only need to look up the number of
primes ≤ $y$ should be computed without segmentation. As a PrimePi[n] lookup table of size $y$ is much too large to fit into the CPU's
cache and as the PrimePi[n] lookup table is accessed in random order, I suggest segmenting all easy special leaf formulas that
are computationally expensive using a segment size of $\sqrt[4]{x}$ in order to improve performance. To reduce the amount of work for
the programmer it is best to sieve the interval $[0, \sqrt{x}[$ only once and compute all easy special leaf formulas within that sieve.
Extra care is needed when segmenting the formulas that compute consecutive identical easy leaves more efficiently; these
leaves are sometimes called clustered easy leaves [4]. In the Deleglise-Rivat algorithm the $W_3$ and $W_5$ formulas compute clustered easy
leaves. These formulas need to access PrimePi[n] values with $n$ ≤ $y$, but some of these memory accesses (i.e. those that compute how
many consecutive leaves are identical) may be outside of the segment $[low, low + segment\ size[$. For these memory accesses I suggest using
a PrimePi[n] lookup table of size $y$ instead of the SegmentedPrimePi[n] lookup table. Note that it is important for performance
to segment the clustered easy leaves as there is a proportionally large number of these leaves and their computation is expensive.
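A hedged sketch of this dual-lookup scheme (the function name is hypothetical and it reuses the SegmentedPrimePi class from the sketch above): in-segment prime counts come from the small, cache-resident SegmentedPrimePi table, while the out-of-segment lookups that determine how many consecutive clustered easy leaves are identical fall back to the full PrimePi[n] table of size $y$:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper for the clustered easy leaves formulas (e.g. W3
// and W5 in the Deleglise-Rivat algorithm): returns the number of
// primes <= n. Lookups inside the current segment use the small
// SegmentedPrimePi table, all other lookups (which require n <= y)
// fall back to the full PrimePi table of size y.
uint64_t prime_pi(uint64_t n,
                  uint64_t low,
                  uint64_t segment_size,
                  const SegmentedPrimePi& segment_pi,
                  const std::vector<uint64_t>& pi_y)
{
  if (n >= low && n < low + segment_size)
    return segment_pi(n);
  else
    return pi_y[n]; // requires n <= y
}
```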
Parallel computation and load-balancing
So far we have focused on improving the cache efficiency of the computation of the easy special leaves. Now we will have a look at
how to parallelize the computation of the easy special leaves so that the algorithm scales well. Generally parallel algorithms
scale well on current CPU architectures if they satisfy the following three properties:
1. Each thread only operates on its own tiny chunk of memory that fits into the cache of the corresponding CPU core.
2. All threads must be independent of each other (i.e. require no synchronization).
3. The work must be distributed evenly among all threads in order to avoid load imbalance.
A segment size of $\sqrt[4]{x}$ already satisfies the first property. So next we have to design our parallel algorithm in a way that
all threads are independent of each other. We can reuse an idea from Xavier Gourdon [3] for this purpose:
each thread is assigned a different segment, which it processes exclusively. At the start of
each new segment $[low, low + segment\ size[$ each thread computes $\pi(low)$ using a prime counting function implementation
in $O(low^{\frac{2}{3}})$ or less. The result of $\pi(low)$ is required to initialize the SegmentedPrimePi[n] lookup table
for the current segment $[low, low + segment\ size[$. With the additional $\pi(low)$ initialization step we have now managed to
make the threads independent of each other. This algorithm has been implemented in primecount-7.0
(see SegmentedPiTable.cpp, AC.cpp); it improved performance
by more than 2x at $10^{23}$ on my dual-socket AMD EPYC server compared to primecount-6.4, which used a larger segment size and
required frequent synchronization of threads. It is important to ensure that the additional pre-computations do not deteriorate
the runtime complexity of the algorithm. When sieving up to $\sqrt{x}$ using a segment size of $\sqrt[4]{x}$ there will be exactly $\sqrt[4]{x}$
segments. For each segment we need to compute $\pi(low)$ with $low$ < $\sqrt{x}$. Hence in total the additional pre-computations
have a runtime complexity of less than $O((\sqrt{x})^{\frac{2}{3}} \cdot \sqrt[4]{x}) = O(x^{\frac{1}{3}} \cdot x^{\frac{1}{4}}) = O(x^{\frac{7}{12}})$, which does not deteriorate the overall runtime complexity
of the algorithm.
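Sketched with OpenMP, and under the assumption that pi_small() and process_easy_leaves() are provided elsewhere (both names are hypothetical placeholders, not primecount's actual API), the parallel loop might look as follows:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical placeholders, assumed to be provided elsewhere:
// pi_small(n) returns the number of primes < n (computed using a prime
// counting function in O(n^(2/3)) or less) and process_easy_leaves()
// evaluates all easy special leaf formulas whose PrimePi lookups fall
// into [low, high[.
uint64_t pi_small(uint64_t n);
int64_t process_easy_leaves(uint64_t x, uint64_t low, uint64_t high,
                            const SegmentedPrimePi& segment_pi);

// Hedged sketch of the parallel easy special leaves computation:
// the interval [0, sqrt(x)[ is split into segments of size ~ x^(1/4),
// each thread processes whole segments independently and computes
// pi_small(low) itself, so no synchronization between threads is
// needed. The primes vector must contain the sieving primes <= x^(1/4).
// Note that primecount uses 128-bit integers for the sum.
int64_t easy_special_leaves(uint64_t x, const std::vector<uint64_t>& primes)
{
  uint64_t sqrtx = (uint64_t) std::sqrt((long double) x);
  uint64_t segment_size = (uint64_t) std::sqrt((long double) sqrtx);
  int64_t sum = 0;

  #pragma omp parallel for schedule(dynamic) reduction(+: sum)
  for (uint64_t low = 0; low < sqrtx; low += segment_size)
  {
    uint64_t high = std::min(low + segment_size, sqrtx);

    // Per-thread initialization, independent of all other threads
    uint64_t pi_low = pi_small(low);
    SegmentedPrimePi segment_pi(low, high - low, pi_low, primes);

    sum += process_easy_leaves(x, low, high, segment_pi);
  }

  return sum;
}
```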
Lastly, we have to ensure that the work is distributed evenly among all threads. The easy special leaves are distributed very
unevenly: most of the leaves are located below $y\ (\sim \sqrt[3]{x}\ \log^{3}{x})$, whereas above $y$ the number of leaves slowly decreases and
they become more and more sparse as they approach $\sqrt{x}$. Hence it is critical that the region below $y$ is distributed evenly
among all threads. Based on my benchmarks a small segment size of $\sqrt[4]{x}$ evenly distributes the work even on servers with a
large number of CPU cores such as my dual-socket AMD EPYC server with 196 threads. Using a segment size larger than $\sqrt[4]{x}$ such
as $\sqrt[3]{x}$ or $y$ causes significant load imbalance (i.e. some threads will be assigned much more work than others and keep on
computing after most of the threads have already finished their computations), which severely deteriorates performance, especially
on PCs and servers with a large number of CPU cores. Above $y$ there are far fewer easy special leaves, hence the segment size can
be increased by a small constant factor (16 in primecount) in order to reduce the pre-computation overhead, provided that the
new segment size still fits into the CPU's cache.
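For illustration, the per-region segment size choice could look like the following hedged sketch (the function and parameter names are hypothetical):

```cpp
#include <algorithm>
#include <cstdint>

// Hedged sketch: choose the segment size for the easy special leaves.
// Below y a small segment size of ~ x^(1/4) distributes the work evenly
// among the threads; above y, where the leaves are sparse, the segment
// size is increased by a small constant factor (16 in primecount) to
// reduce the per-segment pi(low) pre-computation overhead, capped so
// that the SegmentedPrimePi lookup table still fits into the CPU's
// cache (max_segment_size is a hypothetical, cache-derived limit).
uint64_t get_segment_size(uint64_t low, uint64_t x14 /* x^(1/4) */,
                          uint64_t y, uint64_t max_segment_size)
{
  if (low < y)
    return x14;

  return std::min(x14 * 16, max_segment_size);
}
```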
References
[1] J. C. Lagarias, V. S. Miller, and A. M. Odlyzko, "Computing pi(x): The Meissel-Lehmer method", Mathematics of Computation, 44 (1985), pp. 537–560.
[2] M. Deleglise and J. Rivat, "Computing pi(x): The Meissel, Lehmer, Lagarias, Miller, Odlyzko Method", Mathematics of Computation, Volume 65, Number 213, 1996, pp. 235–245.
[3] Xavier Gourdon, "Computation of pi(x): improvements to the Meissel, Lehmer, Lagarias, Miller, Odlyzko, Deléglise and Rivat method", February 15, 2001.
[4] Tomás Oliveira e Silva, "Computing pi(x): the combinatorial method", Revista do DETUA, vol. 4, no. 6, March 2006, pp. 759–768.