Segmented prime sieve for fun and profit #87
Is there documentation or a paper that this algorithm is based on? I would like to help review it.
I think primesieve.org is a good reference. The wheel is a generalization of the idea of skipping multiples of 2 (which is easy to code); see https://en.wikipedia.org/wiki/Wheel_factorization for a good overview. If you take a {2, 3, 5}-wheel, then consider any number n = 30k + r: for n > 5 it can only be prime if r is one of {1, 7, 11, 13, 17, 19, 23, 29}, so each block of 30 numbers has 8 candidates, which fit in the 8 bits of one byte.

The next optimization after compressing the number line by a constant factor is to work on constant-size segments of the number line. So instead of sieving the whole range in one pass, you sieve one segment at a time:

    for (a, b) in segments # sieve a:b
        fill!(segment, 1)
        for p in primes_so_far
            # find the smallest number q s.t. p * q >= a,
            # then remove p * q, p * (q + 1), p * (q + 2), ... until p * (q + m) > b from segment.
        end
        # find all primes in `segment` and push them to `primes_so_far`
    end

The inner loop is finicky, because (a) you don't want to recompute the smallest q s.t. p * q >= a every time; you already know it from the last iteration in the previous segment. And (b) you aren't going to remove the multiples for q, q + 1, q + 2, ..., because you skip every p * (q + i) that isn't on one of the {1, 7, 11, 13, 17, 19, 23, 29} spokes. This is where the loop is unrolled by hand, and where all the pain and suffering comes from.

There are going to be 8 loops, one for each possible value of p mod 30, and 8 steps inside each loop, one for each value of q mod 30, leading to 64 entry points into the loops. The main optimization is here: instead of doing "cross off q * p, increment q, check if p * q is inside the segment", you add a hot loop that checks whether you can cross off e.g. p * [q, q + 6, q + 10, q + 12, q + 16, q + 18, q + 22, q + 28] at once.
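Point (a) above can be illustrated with a minimal sketch. It is not the code in this PR: it uses no wheel and no unrolling, and the names `count_primes_segmented`, `segment_size` and `next_multiple` are made up for illustration. It only shows how the next multiple of each sieving prime can be carried from one segment to the next instead of being recomputed.

```julia
# Count the primes <= n with a plain segmented sieve (no wheel, for clarity).
# All names are illustrative assumptions, not identifiers from this PR.
function count_primes_segmented(n::Int; segment_size::Int = 32 * 1024)
    n < 2 && return 0
    limit = isqrt(n)

    # Small ordinary sieve to find the sieving primes p <= sqrt(n).
    is_small_prime = trues(limit)
    is_small_prime[1] = false
    for i in 2:isqrt(limit)
        if is_small_prime[i]
            for j in i*i:i:limit
                is_small_prime[j] = false
            end
        end
    end
    sieving_primes = findall(is_small_prime)

    # next_multiple[k] is the next multiple of sieving_primes[k] that still
    # has to be crossed off. It survives across segments, so the
    # "smallest q s.t. p * q >= a" is never recomputed.
    next_multiple = [p * p for p in sieving_primes]

    segment = Vector{Bool}(undef, segment_size)
    total = 0
    a = 2
    while a <= n
        b = min(a + segment_size - 1, n)   # this pass sieves a:b
        fill!(segment, true)
        for (k, p) in enumerate(sieving_primes)
            m = next_multiple[k]
            while m <= b
                segment[m - a + 1] = false
                m += p
            end
            next_multiple[k] = m           # carry over to the next segment
        end
        total += count(@view segment[1:(b - a + 1)])
        a = b + 1
    end
    return total
end
```

For example, `count_primes_segmented(100; segment_size = 7)` walks over 15 small segments and should agree with the unsegmented answer. Memory use is the segment buffer plus the sieving primes up to sqrt(n), independent of how many segments are processed.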
Thank you. I am already familiar with wheel factorization and the segmented sieve (Bays-Hudson). I have to think a bit about combining the two. I appreciate the long response and I will think about it. I really like the idea of combining the 8 bits into one byte.
I have not thought about it very much, but this paper has some other little optimizations, under Algorithms B and C: https://link.springer.com/content/pdf/10.1007/BF01932283.pdf Is your segmented sieve similar to this, or are you using a completely different approach that is incompatible? It may be incompatible, because this paper basically uses the usual 2-wheel, but I really have to think about it.
Want to merge update and this? It's a lot of extra code, but I've become convinced that something like this is necessary for optimal performance.
Closes #45.
I have had this code lying around for a while now, but never turned it into a PR. Now I'm turning it into a PR to force myself to finish it 😅. It could use some tests and maybe some code style improvements / cleanup.
What it brings is an iterator:
which makes some operations very fast, as they operate on O(1) memory instead of O(n). For reference:
this is 16x faster.
There are some neat optimizations: you can pick a buffer size that makes things fit in L1 cache (above it's 32 KB); the inner loop is unrolled by hand such that 8 multiples are removed per iteration; memory is compressed s.t. 1 byte covers an interval of 30 numbers (through a {2, 3, 5}-wheel); and multiples of {7, 11, 13, 17} are "pre-sieved", in the sense that a buffer of size 7 * 11 * 13 * 17 is sieved once and repeatedly copied over, instead of actually sieving 7, 11, 13 and 17 each over the full range.
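To make the "1 byte covers 30 numbers" compression concrete, here is a small sketch. It is an illustration under assumptions, not the PR's implementation: `WHEEL_RESIDUES`, `wheel_position`, `packed_sieve` and `unpack_primes` are hypothetical names, and the inner loop deliberately tests every odd multiple instead of using the hand-unrolled 64-entry-point loop described in the conversation.

```julia
# The 8 residues mod 30 that are coprime to 2, 3 and 5: one bit per residue.
const WHEEL_RESIDUES = (1, 7, 11, 13, 17, 19, 23, 29)

# Map n to (byte index, bit mask), or `nothing` if n is a multiple of 2, 3 or 5
# and therefore has no bit in the packed buffer.
function wheel_position(n::Integer)
    bit = findfirst(==(n % 30), WHEEL_RESIDUES)
    bit === nothing && return nothing
    return (n ÷ 30 + 1, 0x01 << (bit - 1))
end

# Sieve the numbers below 30 * nbytes into a packed buffer.
# A set bit means "not crossed off yet".
function packed_sieve(nbytes::Int)
    buffer = fill(0xff, nbytes)
    buffer[1] &= ~0x01                      # 1 is not prime
    limit = 30 * nbytes - 1
    for p in 7:isqrt(limit)
        pos = wheel_position(p)
        pos === nothing && continue         # p is a multiple of 2, 3 or 5
        byte, mask = pos
        (buffer[byte] & mask) == 0 && continue   # p was already crossed off
        # Naive version: visit every odd multiple and skip those off the wheel.
        for m in p*p:2p:limit
            mpos = wheel_position(m)
            mpos === nothing && continue
            buffer[mpos[1]] &= ~mpos[2]
        end
    end
    return buffer
end

# Decode the packed buffer back into a list of primes.
function unpack_primes(buffer::Vector{UInt8})
    primes = [2, 3, 5]                      # the wheel primes have no bit
    for (i, byte) in enumerate(buffer), (bit, r) in enumerate(WHEEL_RESIDUES)
        if (byte & (0x01 << (bit - 1))) != 0
            push!(primes, 30 * (i - 1) + r)
        end
    end
    return primes
end
```

With this layout, `packed_sieve(2)` stores the interval below 60 in two bytes, and `unpack_primes` recovers the primes from the set bits. A real implementation would avoid the per-multiple `wheel_position` lookup, which is exactly what the unrolled loop is for.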
You can also print the segment if you like:
So, here it has a buffer of 2 bytes, representing an interval of 60 numbers, meaning it needs 3 iterations. Every `x` marks a prime number.