This is a capture of an observation that @s-timcheck brought to my attention via email and that I think deserves more attention. The crux of the matter is that a hand-coded lg(n)-step reduction is more efficient than a Chapel reduction for bigint multiplies once the problem size exceeds a certain threshold. My original hypothesis was that we simply had overhead in our reduction implementation (e.g., the use of classes or how we combine values between tasks) that the hand-coded reduction didn't. However, after doing some more measurements, my new guess is that the hand-coded version keeps intermediate results in a smaller, and therefore cheaper, bigint value range for longer. This suggests that the author of a user-defined reduction (or the user of that reduction?) may want more control over the algorithm used by the reduction than we had previously imagined or considered.
In a bit more detail, the hand-coded optimization essentially uses a combining tree: first adjacent elements are multiplied, then adjacent results of that first step, then adjacent results of that step, and so on until the work is done (see testBigintReduce2() below). This requires O(lg n) steps, though each of those steps can be computed in parallel using a forall loop. It also requires something like O(n/2) storage to capture intermediate results.
Meanwhile, the Chapel version effectively breaks the input space across numTasks (~here.maxTaskPar) tasks, each of which computes the product of its local chunk of that space, after which those results are combined. So O(1) steps, but some sort of O(numTasks) combining and coordination at the end, with O(numTasks) storage required. That said, as mentioned, the current implementation also relies on class objects (so, heap-allocated data) and some form of synchronization when combining the tasks' results that isn't really present in the same way in the hand-coded version. These differences are what I originally thought caused the problem, though now I believe I was off-base.
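For intuition, here's a minimal hand-written sketch of that chunked strategy; this is my own approximation for this writeup (the block partitioning and the name chunkedProduct are assumptions), not the actual implementation behind reduce expressions:

```chpl
use BigInteger;

// Rough sketch of the chunked reduction strategy described above; an
// approximation for illustration only, not the real ReduceScanOp machinery.
proc chunkedProduct(const arr: [?D] uint): bigint {
  const numTasks = here.maxTaskPar,
        n = D.size,
        lo = D.low;
  var partial: [0..<numTasks] bigint;
  partial = 1: bigint;                        // O(numTasks) extra storage

  coforall tid in 0..<numTasks {
    // each task accumulates the product of a contiguous block of the input
    for i in lo + tid*n/numTasks ..< lo + (tid+1)*n/numTasks do
      partial[tid] *= arr[i];
  }

  // combine the per-task partial products at the end
  var result = 1: bigint;
  for p in partial do
    result *= p;
  return result;
}
```

The point that matters for what follows is that each partial[tid] grows steadily into a very large bigint, so most of each task's multiplies involve an operand that is already a significant fraction of that chunk's final product size.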
Specifically, what I found was that (a) when running the original comparison at smaller problem sizes, the Chapel version was faster, and (b) when rewriting the comparison and the hand-coded algorithm to do the reduction on uint(64) values rather than bigints, the Chapel version won at all problem sizes I tried. The second of these observations suggests that it's something about the nature of bigints that causes the difference, where the most obvious candidate is that multiplies are potentially much more expensive, particularly for larger values.
So, my current hypothesis is that by combining values pairwise, the hand-coded algorithm keeps more elements at smaller values for longer, only combining into the very largest values in the final stages of the algorithm, and therefore does more of its work with cheaper operations. The Chapel algorithm, by accumulating into numTasks scalars, jumps to larger, more expensive values much more quickly and takes longer overall.
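As a back-of-the-envelope illustration of that hypothesis, here's a toy program of my own that only tallies approximate operand widths (using the rule of thumb that a b1-bit value times a b2-bit value is roughly (b1+b2) bits wide); it performs no bigint math, and the sizes chosen are arbitrary:

```chpl
config const n = 1 << 17,          // number of 64-bit inputs (a power of two)
             numTasks = 8;

proc main() {
  // chunked accumulation: beyond the first few elements, nearly every
  // multiply in a task involves its ever-growing accumulator
  const perTask = n / numTasks;
  writeln("chunked: each task's accumulator grows to ~", perTask * 64,
          " bits and is an operand of nearly all of its ", perTask - 1,
          " multiplies");

  // pairwise tree: half of all multiplies involve only 64-bit operands,
  // a quarter involve ~128-bit operands, and only the last few levels
  // touch very wide values
  var width = 64, remaining = n;
  while remaining > 1 {
    writeln("tree: ", remaining / 2, " multiplies of ~", width, "-bit operands");
    width *= 2;
    remaining /= 2;
  }
}
```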
To the best of my recollection, this is the first time we've encountered a case where different reduction scenarios may want to use different algorithms (and to spend more memory) in order to get better performance, suggesting that it may be attractive to somehow extend the reduction framework to support different algorithms, if that's not considered overly complex. What's not clear to me is whether this should be encoded within the reduction itself (e.g., a multPairwise reduction whose documentation notes the different approach and space requirements), in the application of the reduction (via an argument? an annotation?), or whether such cases should be handled by different approaches entirely (a library routine or …?).
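To make that last option slightly more concrete, here's a sketch of what such a library routine might look like; the name pairwiseProduct and its interface are purely hypothetical, invented here for illustration:

```chpl
use BigInteger;

// A hypothetical generic pairwise (tree) product helper, sketched only to
// make the 'library routine' option concrete; nothing like this exists
// today.  It hardwires '*' and spends O(n) temporary storage to keep the
// operands small until the final levels of the tree.
proc pairwiseProduct(const arr: [?D], type resultType): resultType {
  var tmp: [D] resultType;
  tmp = arr;                          // O(n) intermediate storage
  var step = 1;
  while step < D.size {
    // combine adjacent surviving partial products, one tree level at a time
    forall i in D by (step * 2) do
      if i + step <= D.high then
        tmp[i] *= tmp[i + step];
    step *= 2;
  }
  return tmp[D.low];
}

// hypothetical usage, mirroring testBigintReduce2() below:
//   arrProduct = pairwiseProduct(arr, bigint);
```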
Finally, here is the code. The first file tests several different approaches to bigint reductions and was authored by @s-timcheck (Stephen Timcheck, LPS), with some code refactoring by me as I was getting my head around it.
bigintMultPerf.chpl:
// Authored by Stephen Timcheck (LPL)
// Reformatted by Brad Chamberlain (HPE)

use BigInteger, Math, Random, Time;

config const size = 10_000,
             numTestIters = 1000;

proc main() {
  var arrDom = {0..<size};
  var arr: [arrDom] uint;
  var arrProduct: [0..4] bigint;
  var seed = 19;

  fillRandom(arr, seed);

  testBigintReduce0(arr, arrProduct[0]);
  testBigintReduce1(arr, arrProduct[1]);
  testBigintReduce2(arr, arrProduct[2]);
  testBigintReduce3(arr, arrProduct[3]);

  for i in 0..3 {
    if (arrProduct[0] != arrProduct[i]) {
      writeln("MISMATCH BETWEEN PRODUCT 0 and ", i);
    }
  }
}

// Bigint reduce casting the input 'uint' array to 'bigint'
proc testBigintReduce0(const arr: [] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = * reduce (arr: bigint);
  }
  time_reduce.stop();
  writeln("*\t\treduce\t(arr: bigint)\t", time_reduce.elapsed());
}

// Bigint reduce using 'bigintMult' by Brad
proc testBigintReduce1(const arr: [] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = bigintMult reduce arr;
  }
  time_reduce.stop();
  writeln("bigintMult\treduce\tarr\t\t", time_reduce.elapsed());
}

// Bigint reduce with custom reduction logic by Stephen
proc testBigintReduce2(const arr: [?arrDom] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for j in 0..numTestIters {
    var step = 1;
    var len: uint = log2(size) + 1;
    var tmp_bigProduct: [arrDom] bigint;
    tmp_bigProduct = arr;
    for h in 1..len {
      //forall i in arrDom by (step * 2) do {
      forall i in arrDom by (step * 2) do {
        if i + step >= size { }
        else {
          tmp_bigProduct[i] *= tmp_bigProduct[i + step];
        }
      }
      step = step * 2;
    }
    arrProduct = tmp_bigProduct[0];
  }
  time_reduce.stop();
  writeln("manual reduce\tof\tarr\t\t", time_reduce.elapsed());
}

// Bigint reduce copying all of the 'uint' array's elements to another 'bigint' array, then reducing
proc testBigintReduce3(const arr: [?arrDom] uint, ref arrProduct: bigint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  var bigArr: [arrDom] bigint;
  bigArr = arr;
  for i in 0..numTestIters {
    arrProduct = * reduce bigArr;
  }
  time_reduce.stop();
  writeln("*\t\treduce\tbigArr\t\t", time_reduce.elapsed());
}

// Custom bigint multiply reduce
class bigintMult: ReduceScanOp {
  type eltType;
  var value: bigint = 1;

  inline proc identity do return 1: bigint;
  inline proc accumulate(x) {
    value *= x;
  }
  inline proc accumulateOntoState(ref state, x) {
    state *= x;
  }
  inline proc combine(x) {
    value *= x.value;
  }
  inline proc generate() do return value;
  inline proc clone() do return new unmanaged bigintMult(eltType=eltType);
}
uintMultPerf.chpl:
// Based on a program authored by Stephen Timcheck (LPL)
// Updated to compute using 'uint' rather than 'bigint' by Brad Chamberlain (HPE)

use Math, Random, Time;

config const size = 10_000,
             numTestIters = 1000;

proc main() {
  var arrDom = {0..<size};
  var arr: [arrDom] uint;
  var arrProduct: [0..4] uint;
  var seed = 19;

  fillRandom(arr, seed);

  testUintReduce0(arr, arrProduct[0]);
  testUintReduce1(arr, arrProduct[1]);
  testUintReduce2(arr, arrProduct[2]);
  testUintReduce3(arr, arrProduct[3]);

  for i in 0..3 {
    if (arrProduct[0] != arrProduct[i]) {
      writeln("MISMATCH BETWEEN PRODUCT 0 and ", i);
    }
  }
}

// Uint reduce casting the input 'uint' array to 'uint'
proc testUintReduce0(const arr: [] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = * reduce (arr: uint);
  }
  time_reduce.stop();
  writeln("*\t\treduce\t(arr: uint)\t", time_reduce.elapsed());
}

// Uint reduce using 'uintMult' by Brad
proc testUintReduce1(const arr: [] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for i in 0..numTestIters {
    arrProduct = uintMult reduce arr;
  }
  time_reduce.stop();
  writeln("uintMult\treduce\tarr\t\t", time_reduce.elapsed());
}

// Uint reduce with custom reduction logic by Stephen
proc testUintReduce2(const arr: [?arrDom] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  for j in 0..numTestIters {
    var step = 1;
    var len: uint = log2(size) + 1;
    var tmp_bigProduct: [arrDom] uint;
    tmp_bigProduct = arr;
    for h in 1..len {
      //forall i in arrDom by (step * 2) do {
      forall i in arrDom by (step * 2) do {
        if i + step >= size { }
        else {
          tmp_bigProduct[i] *= tmp_bigProduct[i + step];
        }
      }
      step = step * 2;
    }
    arrProduct = tmp_bigProduct[0];
  }
  time_reduce.stop();
  writeln("manual reduce\tof\tarr\t\t", time_reduce.elapsed());
}

// Uint reduce copying all of the 'uint' array's elements to another 'uint' array, then reducing
proc testUintReduce3(const arr: [?arrDom] uint, ref arrProduct: uint) {
  var time_reduce: stopwatch;
  time_reduce.start();
  var bigArr: [arrDom] uint;
  bigArr = arr;
  for i in 0..numTestIters {
    arrProduct = * reduce bigArr;
  }
  time_reduce.stop();
  writeln("*\t\treduce\tbigArr\t\t", time_reduce.elapsed());
}

// Custom uint multiply reduce
class uintMult: ReduceScanOp {
  type eltType;
  var value: uint = 1;

  inline proc identity do return 1: uint;
  inline proc accumulate(x) {
    value *= x;
  }
  inline proc accumulateOntoState(ref state, x) {
    state *= x;
  }
  inline proc combine(x) {
    value *= x.value;
  }
  inline proc generate() do return value;
  inline proc clone() do return new unmanaged uintMult(eltType=eltType);
}
And here are some sample timings on my Mac (where I'm just focusing on the bigintMult reduce and uintMult lines from the output, though all Chapel variations were in the same rough ballpark):
| algorithm            | --size=10 | 100    | 1000   | 10,000 | 100,000 | 100,000,000 |
|----------------------|-----------|--------|--------|--------|---------|-------------|
| bigint Chapel reduce | 0.006s    | 0.011s | 0.131s | 4.05s  | 74.0s   | d.n.f.      |
| bigint hand-coded    | 0.012s    | 0.025s | 0.116s | 2.78s  | 34.2s   | d.n.f.      |
| uint Chapel reduce   | 0.004s    | 0.004s | 0.005s | 0.005s | 0.012s  | 6.819s      |
| uint hand-coded      | 0.010s    | 0.019s | 0.032s | 0.051s | 0.079s  | 77.23s      |
Shortly after filing this, it occurred to me that an interesting experiment would be to multiply only zero values in order to check my hypothesis that the growing size of the product accounts for the difference between the algorithms, and this seems to be the case:
```
$ ./bigintMultReducePerf-zero --size=100000
* reduce (arr: bigint)   0.246442
bigintMult reduce arr    0.031382
manual reduce of arr     0.482826
* reduce bigArr          0.062542
```
Also interesting is that this case shows the overheads of casting to bigint or converting to bigArr that I was originally expecting, but wasn't seeing in practice once the execution times exceeded 4 seconds. That said, I'm surprised that the cast-on-the-fly approach is so much slower than the "copy-to-bigint-array" approach, where I would've expected the opposite. Specifically, I was imagining that the additional memory created and touched by the copy to bigint would be saved by casting each element to a bigint just before reducing it (assuming that's what's happening, which has been my assumption).
Here's the diff against the previous code for the version I timed here:
```diff
index 74c6c96aad..c72e57b2ad 100644
--- a/./bigintMultReducePerf.chpl
+++ b/./bigintMultReducePerf-zero.chpl
@@ -11,9 +11,7 @@ proc main() {
   var arr: [arrDom] uint;
   var arrProduct: [0..4] bigint;
 
-  var seed = 19;
-
-  fillRandom(arr, seed);
+  arr = 0;
 
   testBigintReduce0(arr, arrProduct[0]);
   testBigintReduce1(arr, arrProduct[1]);
```
I'm not sure I'm fully understanding, and I definitely don't know as much about our reductions implementation as I could, but would it make sense to have reductions follow different algorithms depending on the type being used? Obviously that wouldn't help for user-defined types that are heavyweight, and we'd want to come up with some design that helps with those, but the desire to trade off space efficiency against time efficiency seems like something that would come up more often whenever bigints are used.
I think you're correct that we could special-case the algorithm used for bigints, but (as you suggest) it'd be preferable to have a more general-purpose solution that bigints could opt into rather than special-casing bigints in the compiler.