This is a capture of an observation that @s-timcheck brought to my attention via email and that I think deserves more attention. The crux of the matter is that a hand-coded lg(n)-step reduction is more efficient than a Chapel reduction for bigint multiplies once the problem size exceeds a certain threshold. My original hypothesis was that we simply had overhead in our reduction implementation (e.g., the use of classes or how we combine values between tasks) that the hand-coded reduction didn't. However, after doing some more measurements, my new guess is that the hand-coded version keeps intermediate results in a smaller, and therefore cheaper, bigint value range for longer. This suggests that the author of a user-defined reduction (or the user of that reduction?) may want more control over the algorithm used by the reduction than we had previously imagined or considered.
In a bit more detail, the hand-coded optimization essentially uses a combining tree: first adjacent elements are multiplied, then adjacent results of that first step, then adjacent results of that step, and so on until the work is done (see testBigintReduce2() below). This requires O(lg n) steps, though each of those steps can be computed in parallel using a forall loop. It also requires something like O(n/2) storage to capture intermediate results.
Meanwhile, the Chapel version effectively breaks the input space across numTasks (~here.maxTaskPar) tasks, each of which computes the product of its local chunk of that space, after which those results are combined. So O(1) steps, but some sort of O(numTasks) combining and coordination at the end, with O(numTasks) storage required. That said, as mentioned, the current implementation also relies on class objects (so, heap-allocated data) and some form of synchronization when combining the tasks' results that isn't really present in the same way in the hand-coded version. These differences are what I originally thought caused the problem, though now I believe I was off-base.
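For intuition, here's a minimal hand-written sketch of that chunked strategy; this is my own approximation for this writeup (the block partitioning and the name chunkedProduct are assumptions), not the actual implementation behind reduce expressions:

```chpl
use BigInteger;

// Rough sketch of the chunked reduction strategy described above; an
// approximation for illustration only, not the real ReduceScanOp machinery.
proc chunkedProduct(const arr: [?D] uint): bigint {
  const numTasks = here.maxTaskPar,
        n = D.size,
        lo = D.low;
  var partial: [0..<numTasks] bigint;
  partial = 1: bigint;                        // O(numTasks) extra storage

  coforall tid in 0..<numTasks {
    // each task accumulates the product of a contiguous block of the input
    for i in lo + tid*n/numTasks ..< lo + (tid+1)*n/numTasks do
      partial[tid] *= arr[i];
  }

  // combine the per-task partial products at the end
  var result = 1: bigint;
  for p in partial do
    result *= p;
  return result;
}
```

The point that matters for what follows is that each partial[tid] grows steadily into a very large bigint, so most of each task's multiplies involve an operand that is already a significant fraction of that chunk's final product size.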
Specifically, what I found was that (a) when running the original comparison at smaller problem sizes, the Chapel version was faster, and (b) when rewriting the comparison and the hand-coded algorithm to do the reduction on uint(64) values rather than bigints, the Chapel version won at all problem sizes I tried. The second of these observations suggests that it's something about the nature of bigints that causes the difference, where the most obvious candidate is that multiplies are potentially much more expensive, particularly for larger values.
So, my current hypothesis is that by combining values pairwise, the hand-coded algorithm keeps more elements at smaller values for longer, only combining into the very largest values in the final stages of the algorithm, and therefore does more of its work with cheaper operations. The Chapel algorithm, by accumulating into numTasks scalars, jumps to larger, more expensive values much more quickly and takes longer overall.
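As a back-of-the-envelope illustration of that hypothesis, here's a toy program of my own that only tallies approximate operand widths (using the rule of thumb that a b1-bit value times a b2-bit value is roughly (b1+b2) bits wide); it performs no bigint math, and the sizes chosen are arbitrary:

```chpl
config const n = 1 << 17,          // number of 64-bit inputs (a power of two)
             numTasks = 8;

proc main() {
  // chunked accumulation: beyond the first few elements, nearly every
  // multiply in a task involves its ever-growing accumulator
  const perTask = n / numTasks;
  writeln("chunked: each task's accumulator grows to ~", perTask * 64,
          " bits and is an operand of nearly all of its ", perTask - 1,
          " multiplies");

  // pairwise tree: half of all multiplies involve only 64-bit operands,
  // a quarter involve ~128-bit operands, and only the last few levels
  // touch very wide values
  var width = 64, remaining = n;
  while remaining > 1 {
    writeln("tree: ", remaining / 2, " multiplies of ~", width, "-bit operands");
    width *= 2;
    remaining /= 2;
  }
}
```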
To the best of my recollection, this is the first time we've encountered a case where different reduction scenarios may want to use different algorithms (and to spend more memory) in order to get better performance, suggesting that it may be attractive to somehow extend the reduction framework to support different algorithms, if that's not considered overly complex. What's not clear to me is whether this should be encoded within the reduction itself (e.g., a multPairwise reduction whose documentation notes the different approach and space requirements), in the application of the reduction (via an argument? an annotation?), or whether such cases should be handled by different approaches entirely (a library routine or …?).
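To make that last option slightly more concrete, here's a sketch of what such a library routine might look like; the name pairwiseProduct and its interface are purely hypothetical, invented here for illustration:

```chpl
use BigInteger;

// A hypothetical generic pairwise (tree) product helper, sketched only to
// make the 'library routine' option concrete; nothing like this exists
// today.  It hardwires '*' and spends O(n) temporary storage to keep the
// operands small until the final levels of the tree.
proc pairwiseProduct(const arr: [?D], type resultType): resultType {
  var tmp: [D] resultType;
  tmp = arr;                          // O(n) intermediate storage
  var step = 1;
  while step < D.size {
    // combine adjacent surviving partial products, one tree level at a time
    forall i in D by (step * 2) do
      if i + step <= D.high then
        tmp[i] *= tmp[i + step];
    step *= 2;
  }
  return tmp[D.low];
}

// hypothetical usage, mirroring testBigintReduce2() below:
//   arrProduct = pairwiseProduct(arr, bigint);
```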
Finally, here is the code. The first file tests several different approaches to bigint reductions and was authored by @s-timcheck (Stephen Timcheck, LPS), with some code refactoring by me as I was getting my head around it.
bigintMultPerf.chpl:
// Authored by Stephen Timcheck (LPL)
// Reformatted by Brad Chamberlain (HPE)

use BigInteger, Math, Random, Time;

config const size = 10_000,
             numTestIters = 1000;

proc main() {
  var arrDom = {0..<size};
  var arr: [arrDom] uint;
  var arrProduct: [0..4] bigint;
  var seed = 19;

  fillRandom(arr, seed);

  testBigintReduce0(arr, arrProduct[0]);
  testBigintReduce1(arr, arrProduct[1]);
  testBigintReduce2(arr, arrProduct[2]);
  testBigintReduce3(arr, arrProduct[3]);

  for i in 0..3 {
    if (arrProduct[0] != arrProduct[i]) {
      writeln("MISMATCH BETWEEN PRODUCT 0 and ", i);
    }
  }
}

// Bigint reduce casting the input 'uint' array to 'bigint'
proc testBigintReduce0(const arr: [] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = * reduce (arr: bigint);
  }
  time_reduce.stop();
  writeln("*\t\treduce\t(arr: bigint)\t", time_reduce.elapsed());
}

// Bigint reduce using 'bigintMult' by Brad
proc testBigintReduce1(const arr: [] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = bigintMult reduce arr;
  }
  time_reduce.stop();
  writeln("bigintMult\treduce\tarr\t\t", time_reduce.elapsed());
}

// Bigint reduce with custom reduction logic by Stephen
proc testBigintReduce2(const arr: [?arrDom] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for j in 0..numTestIters {
    var step = 1;
    var len: uint = log2(size) + 1;
    var tmp_bigProduct: [arrDom] bigint;
    tmp_bigProduct = arr;
    for h in 1..len {
      //forall i in arrDom by (step * 2) do {
      forall i in arrDom by (step * 2) do {
        if i + step >= size { }
        else {
          tmp_bigProduct[i] *= tmp_bigProduct[i + step];
        }
      }
      step = step * 2;
    }
    arrProduct = tmp_bigProduct[0];
  }
  time_reduce.stop();
  writeln("manual reduce\tof\tarr\t\t", time_reduce.elapsed());
}

// Bigint reduce copying all of the 'uint' array's elements to another 'bigint' array, then reducing
proc testBigintReduce3(const arr: [?arrDom] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  var bigArr: [arrDom] bigint;
  bigArr = arr;
  for i in 0..numTestIters {
    arrProduct = * reduce bigArr;
  }
  time_reduce.stop();
  writeln("*\t\treduce\tbigArr\t\t", time_reduce.elapsed());
}

// Custom bigint multiply reduce
class bigintMult: ReduceScanOp {
  type eltType;
  var value: bigint = 1;

  inline proc identity do return 1: bigint;
  inline proc accumulate(x) {
    value *= x;
  }
  inline proc accumulateOntoState(ref state, x) {
    state *= x;
  }
  inline proc combine(x) {
    value *= x.value;
  }
  inline proc generate() do return value;
  inline proc clone() do return new unmanaged bigintMult(eltType=eltType);
}
uintMultPerf.chpl:
// Based on a program authored by Stephen Timcheck (LPL)
// Updated to compute using 'uint' rather than 'bigint' by Brad Chamberlain (HPE)

use Math, Random, Time;

config const size = 10_000,
             numTestIters = 1000;

proc main() {
  var arrDom = {0..<size};
  var arr: [arrDom] uint;
  var arrProduct: [0..4] uint;
  var seed = 19;

  fillRandom(arr, seed);

  testUintReduce0(arr, arrProduct[0]);
  testUintReduce1(arr, arrProduct[1]);
  testUintReduce2(arr, arrProduct[2]);
  testUintReduce3(arr, arrProduct[3]);

  for i in 0..3 {
    if (arrProduct[0] != arrProduct[i]) {
      writeln("MISMATCH BETWEEN PRODUCT 0 and ", i);
    }
  }
}

// Uint reduce casting the input 'uint' array to 'uint'
proc testUintReduce0(const arr: [] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = * reduce (arr: uint);
  }
  time_reduce.stop();
  writeln("*\t\treduce\t(arr: uint)\t", time_reduce.elapsed());
}

// Uint reduce using 'uintMult' by Brad
proc testUintReduce1(const arr: [] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = uintMult reduce arr;
  }
  time_reduce.stop();
  writeln("uintMult\treduce\tarr\t\t", time_reduce.elapsed());
}

// Uint reduce with custom reduction logic by Stephen
proc testUintReduce2(const arr: [?arrDom] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for j in 0..numTestIters {
    var step = 1;
    var len: uint = log2(size) + 1;
    var tmp_bigProduct: [arrDom] uint;
    tmp_bigProduct = arr;
    for h in 1..len {
      //forall i in arrDom by (step * 2) do {
      forall i in arrDom by (step * 2) do {
        if i + step >= size { }
        else {
          tmp_bigProduct[i] *= tmp_bigProduct[i + step];
        }
      }
      step = step * 2;
    }
    arrProduct = tmp_bigProduct[0];
  }
  time_reduce.stop();
  writeln("manual reduce\tof\tarr\t\t", time_reduce.elapsed());
}

// Uint reduce copying all of the 'uint' array's elements to another 'uint' array, then reducing
proc testUintReduce3(const arr: [?arrDom] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  var bigArr: [arrDom] uint;
  bigArr = arr;
  for i in 0..numTestIters {
    arrProduct = * reduce bigArr;
  }
  time_reduce.stop();
  writeln("*\t\treduce\tbigArr\t\t", time_reduce.elapsed());
}

// Custom uint multiply reduce
class uintMult: ReduceScanOp {
  type eltType;
  var value: uint = 1;

  inline proc identity do return 1: uint;
  inline proc accumulate(x) {
    value *= x;
  }
  inline proc accumulateOntoState(ref state, x) {
    state *= x;
  }
  inline proc combine(x) {
    value *= x.value;
  }
  inline proc generate() do return value;
  inline proc clone() do return new unmanaged uintMult(eltType=eltType);
}
And here are some sample timings on my Mac (where I'm just focusing on the bigintMult reduce and uintMult lines from the output, though all Chapel variations were in the same rough ballpark):
| algorithm            | --size=10 | 100    | 1000   | 10,000 | 100,000 | 100,000,000 |
|----------------------|-----------|--------|--------|--------|---------|-------------|
| bigint Chapel reduce | 0.006s    | 0.011s | 0.131s | 4.05s  | 74.0s   | d.n.f.      |
| bigint hand-coded    | 0.012s    | 0.025s | 0.116s | 2.78s  | 34.2s   | d.n.f.      |
| uint Chapel reduce   | 0.004s    | 0.004s | 0.005s | 0.005s | 0.012s  | 6.819s      |
| uint hand-coded      | 0.010s    | 0.019s | 0.032s | 0.051s | 0.079s  | 77.23s      |
Shortly after filing this, it occurred to me that an interesting experiment would be to multiply only zero values in order to check my hypothesis that the growing size of the product accounts for the difference between the algorithms, and this seems to be the case:
```
$ ./bigintMultReducePerf-zero --size=100000
* reduce (arr: bigint)   0.246442
bigintMult reduce arr    0.031382
manual reduce of arr     0.482826
* reduce bigArr          0.062542
```
Also interesting is that this case shows the overheads of casting to bigint or converting to bigArr that I was originally expecting, but wasn't seeing in practice once the execution times exceeded 4 seconds. That said, I'm surprised that the cast-on-the-fly approach is so much slower than the "copy-to-bigint-array" approach, where I would've expected the opposite. Specifically, I was imagining that the additional memory created and touched by the copy to bigint would be saved by casting each element to a bigint just before reducing it (assuming that's what's happening, which has been my assumption).
Here's the diff against the previous code for the version I timed here:
```diff
index 74c6c96aad..c72e57b2ad 100644
--- a/./bigintMultReducePerf.chpl
+++ b/./bigintMultReducePerf-zero.chpl
@@ -11,9 +11,7 @@ proc main() {
   var arr: [arrDom] uint;
   var arrProduct: [0..4] bigint;
 
-  var seed = 19;
-
-  fillRandom(arr, seed);
+  arr = 0;
 
   testBigintReduce0(arr, arrProduct[0]);
   testBigintReduce1(arr, arrProduct[1]);
```
I'm not sure I'm fully understanding, and I definitely don't know as much about our reductions implementation as I could, but would it make sense to have reductions follow different algorithms depending on the type being used? Obviously that wouldn't help for user-defined types that are heavyweight, and we'd want to come up with some design that helps with those, but the desire to trade off space efficiency against time efficiency seems like something that would come up more often whenever bigints are used.
I think you're correct that we could special-case the algorithm used for bigints, but (as you suggest) it'd be preferable to have a more general-purpose solution that bigints could opt into rather than special-casing bigints in the compiler.