# Reduce/remove inner allocations (e.g. `gamma_inc_taylor`, `gamma_inc_asym`) #442
Which function are you talking about specifically?

---
Sorry, it's in the title: `gamma_inc_taylor` and `gamma_inc_asym`.

---
My guess is that the purpose is to add up the components from smallest to largest to increase floating-point accuracy.

---
Ahh, I see (`SpecialFunctions.jl/src/gamma_inc.jl`, line 424 at `ae35d10`).

Honestly, if I could figure out what math expression it was actually implementing, I'd just rewrite it, but it seems to be doing something a bit more than the DLMF link.

---
Yeah. The wrinkle for floating-point series like this is that while the terms should strictly decrease, at some point they become unreliable: increasing, going negative, blowing up, etc. So you need to iterate one term at a time and test whether each term still makes sense, and the number of terms at which it breaks down is variable. This loop builds up the terms, checks that each has a sensible magnitude, and stores it (`SpecialFunctions.jl/src/gamma_inc.jl`, lines 426–442 at `ae35d10`).

And, to preserve floating-point accuracy, you add them up in reverse order, small to large (`SpecialFunctions.jl/src/gamma_inc.jl`, lines 453–455 at `ae35d10`).

The middle loop just adds up the smaller terms directly; I guess the accuracy there wasn't deemed as important to the overall result.

---
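The benefit of reverse-order summation can be illustrated with a small self-contained example (an assumed illustration, not code from the thread): summing a slowly decaying series in `Float32` in both orders and comparing against a `Float64` reference.

```julia
# Illustration (assumed example, not from the thread): summing a
# decreasing series smallest-to-largest keeps the tiny tail terms from
# being swallowed by the large running sum.
terms = Float32[1.0f0 / n^2 for n in 1:1_000_000]
ref = sum(Float64.(terms))          # higher-precision reference
fwd = foldl(+, terms)               # large-to-small (natural order)
rev = foldl(+, reverse(terms))      # small-to-large
err_fwd = abs(Float64(fwd) - ref)
err_rev = abs(Float64(rev) - ref)
```

With a million terms the small-to-large sum lands much closer to the reference; whether that matters in practice depends on how many terms are actually accumulated, which is the point under debate below.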
I think my point is that it is inefficient to do this at runtime, and not needed. It should be very predictable when this occurs, depending on whether you are looking at a convergent or divergent series. We will know beforehand whether the series suffers from cancellation, and can set up the algorithm appropriately. Checking all these things at runtime is not very efficient, and it shouldn't be required to store all these values. If the series is decreasing (or positive), it will be just as accurate to sum k = 0…∞ as to sum in reverse order. If the series is diverging, it won't matter how we sum it up; we will need fancier methods like sequence transformations to sum it accurately, which is something different. The unfortunate thing is that it is unclear to me what these two functions are computing, because they both point to the same NIST link. I'm sure I'd have to read the paper, but there isn't an equation reference number either 🤷♂️

---
This is not true for floating-point addition, though. The precision of the smaller numbers gets swallowed by the larger numbers.

---
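The swallowing effect is easy to demonstrate (an illustrative snippet, not from the thread): a term below half an ulp of the running sum contributes nothing, no matter how many such terms are added one at a time.

```julia
big = 1.0f0
small = fill(eps(Float32) / 4, 1000)          # each term < 0.5 ulp of `big`
one_at_a_time = foldl(+, small; init = big)   # every addition rounds back to 1.0
grouped = big + sum(small)                    # small terms accumulate first
```

Here `one_at_a_time == 1.0f0` exactly, while `grouped` is visibly larger, which is exactly why accumulation order can matter.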
I must admit the code itself is a bit of a Chesterton's Fence situation for me, but having dealt with a lot of series like this, I've found it's no more accurate to sum in this way. Of course, from a floating-point standpoint, strict equality will not be met if you sum them differently, but usually the results agree to within <1 ULP. If they don't, the series representation should probably not be used in that domain. I have no idea what you are doing in your example, but just re-implementing their commented example:

```julia
function g(a, z::T) where T
    MaxIter = 5000
    t = one(T)
    s = zero(T)
    for i in 1:MaxIter
        s += t
        abs(t) < eps(T) * abs(s) && break
        t *= z / (a + i)
    end
    p = (SpecialFunctions.rgammax(a, z) / a) * s
    return (p, 1.0 - p)
end
```
Now, I'm not sure why the original idea (the fence) to do it this way came about, so I'm not saying this is the right way to do it. But I think my original point stands: for these types of series, summing in reverse order is not usually needed.

---
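The stopping criterion in the snippet above (terminate once the new term falls below `eps` relative to the running sum) can be sanity-checked on a series with a known closed form. This is a hypothetical sketch with a made-up name (`exp_series`), not package code:

```julia
# Hypothetical sketch: the same early-exit rule applied to the Taylor
# series of exp, whose limit is known exactly.
function exp_series(z::T; maxiter = 5000) where T
    t = one(T)      # current term z^k / k!
    s = zero(T)
    for k in 1:maxiter
        s += t
        abs(t) < eps(T) * abs(s) && break
        t *= z / k
    end
    return s
end
```

For `z = 1.0` this agrees with `exp(1.0)` to within a few ulps after a handful of iterations, with no stored buffer and no reverse pass.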
Alright, yeah, your version works and passes all the tests. Fair enough!

---
A performance issue when running many iterations of the gamma function is the internal allocation of a 30-element buffer. A solution I've implemented is to overwrite the functions to use (1) `StaticArrays.MVector`, and (2) a per-thread buffer pool. However, I don't know if this is the best solution. It would be better to have a solution in here, to avoid recompiling.
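A minimal sketch of the per-thread buffer-pool idea (the names and layout here are assumptions for illustration, not the reporter's actual code or the package's API): preallocate one 30-element buffer per thread and reuse it, so the hot path performs no allocation. Note that indexing by `Threads.threadid()` is only safe when tasks do not migrate between threads (e.g. under `Threads.@threads :static`).

```julia
# Hypothetical per-thread buffer pool (illustrative names, not package API).
const GAMMA_BUFFERS = [Vector{Float64}(undef, 30) for _ in 1:Threads.nthreads()]

# Borrow this thread's scratch buffer and run `f` on it; the buffer is
# reused across calls, so there is no per-call 30-element allocation.
function with_gamma_buffer(f)
    buf = GAMMA_BUFFERS[Threads.threadid()]
    return f(buf)
end

# Example: fill the scratch buffer and reduce it without allocating.
total = with_gamma_buffer() do buf
    for k in eachindex(buf)
        buf[k] = 1.0 / k
    end
    sum(buf)
end
```

An `MVector{30,Float64}` on the stack avoids the pool entirely for buffers this small, at the cost of a StaticArrays dependency, which is the trade-off the report alludes to.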