Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmented prime sieve for fun and profit #87

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

haampie
Copy link

@haampie haampie commented Sep 7, 2020

Closes #45.

I have this code lying around for a bit now, but never turned it into a PR. Now I'm turning it into a PR to force myself to finish it 😅. It could use some tests and maybe some code style improvements / cleanup.

What it brings is an iterator:

using Primes.SegmentedSieve: SegmentIterator, vec_count_ones

function count_primes(a, b)
    cnt = 0
    for segment in SegmentIterator(a:b, 32 * 1024)
        cnt += vec_count_ones(segment.segment)
    end
    return cnt
end

which makes some operations very fast, as they operate on O(1) memory instead of O(n). For reference:

julia> @time count(Primes.primesmask(2^31, 2^32))
  7.362490 seconds (3.86 k allocations: 802.346 MiB, 1.18% gc time)
98182656

julia> @time count_primes(2^31, 2^32)
  0.452503 seconds (2.19 k allocations: 375.266 KiB)
98182656

this is 16x faster.

There's some neat optimizations: you can pick a buffer size to make things fit in L1 cache (above it's 32kb); the inner loop is unrolled by hand such that 8 multiples are removed per iteration; memory is compressed s.t. 1 byte equals an interval of 30 numbers (through a {2, 3, 5}-wheel); multiples of {7, 11, 13, 17} are "pre-sieved" in the sense that a buffer of size 7 * 11 * 13 * 17 is sieved and repeatedly copied over, instead of actually sieving 7, 11, 13 and 17 each over the full range.

You can also print the segment if you like:

julia> for segment in SegmentIterator(1:180, 2)
           println(segment)
       end
    01  07  11  13  17  19  23  29  
00   .   x   x   x   x   x   x   x  
30   x   x   x   x   x   .   x   x  

    01  07  11  13  17  19  23  29  
60   x   x   x   x   .   x   x   x  
90   .   x   x   x   x   x   x   .  

     01  07  11  13  17  19  23  29  
120   .   x   x   .   x   x   .   x  
150   x   x   .   x   x   .   x   x 

So, here it has a buffer of 2 bytes, resembling an interval of 60 numbers, meaning it needs 3 iterations. Every x marks a prime number.

@galanakis
Copy link

Is there a documentation or a paper this algorithm is based on? I would like to help review it.

@haampie
Copy link
Author

haampie commented Dec 11, 2021

I think primesieve.org is a good reference. The wheel is a generalization of the idea of skipping multiples of 2 (which is easy to code), see https://en.wikipedia.org/wiki/Wheel_factorization for a good overview.

If you take a {2, 3, 5}-wheel, then consider any number z = 2*3*5*x + y = 30*x + y, where y = 1, ..., 30. Then z is composite if y is a multiple of 2, 3 or 5. So you only have to verify y = {1, 7, 11, 13, 17, 19, 23, 29} are prime for all x. Those are 8 numbers, and them being prime or not is 1 bit of info, so this neatly compressed to 1 byte of data per interval of 30 numbers on the number line.

The next optimization after compressing the number line by a constant factor is to work on constant size segments of the number line. So instead of 1:n, you work in segments 1:b, b+1:2b, 2b+1:3b, ..., m*b+1:n, and you allocate the boolean array once and for all. Effectively you do:

for (a, b) in segments # sieve a:b
  fill!(segment, 1)
  for p in primes_so_far
    # find the smallest number q st p * q >= a.
    # then remove p * q, p * (q + 1), p * (q + 2), ... until p * (q + m) > b from segment.
  end
  # find all primes in `segment` and push to `primes_so_far`
end

the inner loop is finicky, cause (a) you don't want to always compute the smallest q s.t. p * q >= a, instead you already know what it is thanks to the last iteration in the previous segment. and (b) you aren't going to remove multiples q, q+1, q+2, ... as you're skipping all p * (q+i)'s that aren't in the {1, 7, 11, 13, 17, 19, 23, 29} spokes -- and this is where the loop is unrolled by hand, and where all the pain and suffering comes from. There's going to be 8 loops for all possible p mod 30's, and 8 steps inside that loop for all q mod 30 values, leading to 64 entrypoints into the loops, and the main optimization is going to be here that instead of doing "cross off q * p, increment q, check if p * q is inside the segment" you're adding a hot loop where you're checking if you can cross off e.g. p * [q, q + 6, q + 10, q + 12, q + 16, q + 18, q + 22, q + 28] at once.

@galanakis
Copy link

galanakis commented Dec 12, 2021

Thank you. I am already familiar with wheel factorization and the the Segmented Sieve (Bays-Hudson). I have to think a bit about combining the two. I appreciate the long response and I will think about it. I really like the idea of combining the 8 bits into one byte.

@galanakis
Copy link

I have not thought about it very much, but this paper has some other little optimizations, under Algorithms B and C.

https://link.springer.com/content/pdf/10.1007/BF01932283.pdf

Is your segmented sieve similar to this? Or are you using a completely different approach which is incompatible. Maybe it is incompatible, because this paper basically uses the usual 2-wheel, but I really have to think about it.

@oscardssmith
Copy link
Member

Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.

1 similar comment
@oscardssmith
Copy link
Member

Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reduce memory usage in the sieve
3 participants