
ENH: stats.lmoment: add function to calculate sample L-moments #19475

Merged: 18 commits, Nov 8, 2024

Conversation

@mdhaber (Contributor) commented Nov 5, 2023

Reference issue

gh-19460

What does this implement/fix?

The idea of fitting distributions to data with the method of L-moments has been tossed around for a while, and gh-19460 brought it back up again. This PR adds a function to compute sample L-moments, which is an essential step of the method of L-moments.

Additional information

No rush here; I wrote the essential part of the code in #19460 (comment) so I figured I might as well submit a PR.

This needs an email to the mailing list before merge. (Done 3/11/2024.)

I can see arguments for the name being l_moment instead of lmoment. If there is strong support for changing it, that's OK with me.

@mdhaber added the scipy.stats and enhancement (a new feature or improvement) labels Nov 5, 2023
@mdhaber (Contributor Author) left a comment:

Some self-review to aid the reviewer.

x = np.broadcast_to(x, x.shape[:-2] + (len(r), n))  # one row per order in r
x = np.triu(x)  # zero out the terms that don't belong in each row's sum
j = np.arange(n)
return np.sum(special.binom(j, r[:, np.newaxis])*x, axis=-1) / special.binom(n-1, r) / n
@mdhaber (Contributor Author) commented Nov 5, 2023

In the paper, this appears as

$$b_r = \frac{1}{n} \binom{n-1}{r}^{-1} \sum_{j=r+1}^{n} \binom{j-1}{r} \, x_{j:n}$$

It is a little tricky to compare this implementation against the formula. Without thinking carefully:

  • It may look as though special.binom(j, ...) should be something like special.binom(j - 1, ...)
  • It may look as though j = np.arange(n) should be something like j = np.arange(r, n + 1)

But I think closer inspection will reveal that this implementation is equivalent. Note that r is a vector, so we're calculating multiple $b_r$ in one call. The formula is implemented this way to avoid explicit looping over the elements of r. The broadcasting and x = np.triu(x) zero out the terms that shouldn't be part of the sum.
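For anyone who wants to convince themselves, here is a small self-contained check (a sketch, not code from the PR; the names rng, xb, b_vec, and b_ref are mine) that the vectorized expression reproduces a direct translation of the paper's formula:

import numpy as np
from scipy import special

rng = np.random.default_rng(1234)
n = 10
x = np.sort(rng.random(n))   # order statistics x_{1:n} <= ... <= x_{n:n}
r = np.arange(4)             # orders 0..3; as in the PR, r is an arange

# Vectorized path, as in the snippet above: one row per order, with
# np.triu zeroing out the terms that don't belong in each row's sum.
xb = np.triu(np.broadcast_to(x, (len(r), n)))
j = np.arange(n)
b_vec = (np.sum(special.binom(j, r[:, np.newaxis]) * xb, axis=-1)
         / special.binom(n - 1, r) / n)

# Direct translation of the paper's formula (1-based index j):
b_ref = [sum(special.binom(jj - 1, rr) * x[jj - 1]
             for jj in range(rr + 1, n + 1))
         / special.binom(n - 1, rr) / n
         for rr in r]

assert np.allclose(b_vec, b_ref)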

bk = _br(sample, r=k)

n = sample.shape[-1]
bk[..., n:] = 0 # remove NaNs due to n_moments > n
@mdhaber (Contributor Author) commented:

When the number of moments exceeds the number of observations, NaNs appear in the corresponding columns of bk. The weight applied to these values is 0 for L-moments of order less than or equal to the number of observations, and nonzero for L-moments of order greater than the number of observations; therefore, only the L-moments of order greater than the number of observations should be NaN. Unless we're careful, that's not what happens in prk * bk below, because 0 * np.nan is np.nan. Instead, we zero out these NaNs in bk and NaN out the rows of the L-moments for which the order is greater than the number of observations.
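A minimal illustration of the pitfall (a sketch, not code from the PR):

import numpy as np

print(0.0 * np.nan)  # nan: a zero weight does not mask a NaN value

bk = np.array([1.0, 2.0, np.nan])  # NaN column: moment order exceeds n
w = np.array([0.5, 0.5, 0.0])      # the NaN column gets zero weight
print(w * bk)                      # [0.5 1.  nan], the NaN leaks through
bk[2] = 0.0                        # zero out the NaN first...
print(w * bk)                      # [0.5 1.  0. ], ...and the product is clean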

A contributor commented:

Even if vectorized, the number of samples is always constant.
So perhaps we could raise an error here instead? It might help keep the code nice and simple. Plus, there's the whole fail-fast dogma argument.

@mdhaber (Contributor Author) replied:

Even if vectorized, the number of samples is always constant.

Not for masked arrays.

@fancidev (Contributor) commented Mar 13, 2024

I vote for the name lmoment.

The argument order should not have a default value. Explicitly supplying the order makes the code more readable while adding no burden to the writer. And yes, the moment argument of the moment function should read order; but that would be a breaking change.

The axis argument should have default value None, but I guess we’ve passed the point of no return as the moment function has axis=0 by default. None would have been a least-surprise choice, and in C layout working on the 0th axis has the worst performance in terms of memory locality.

The sorted argument is very useful!

The standardize argument should not be there, and in particular should not default to True, because a higher-order L-moment divided by the second L-moment is analogous to the skew and kurtosis of conventional moments; but skew and kurtosis are not “standardized” moments, and in particular they are not the concern of the moment function.

@fancidev (Contributor) commented Mar 13, 2024

Btw, I came across the Python package lmo (BSD license), which deals with L-moments and appears to be well written. It could be used as a reference, or better, we could invite the author to incorporate some of the lower-level stuff into SciPy.

@mdhaber (Contributor Author) commented Mar 15, 2024

The argument order should not have a default value.

This is a justifiable perspective. Please keep in mind that the choice is subjective; other qualified people have thought about the same thing and come to a different conclusion. stats.moment has a default value, so I included it, but I don't care too strongly either way as long as we realize it is not a matter of "right" and "wrong".

And yes, the moment argument of the moment function should read order; but that would be a breaking change.

This can easily be changed in a backward compatible way with _rename_parameter, and I'd be happy to do so if there is some support for it.
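For context, here is a generic sketch of the backward-compatible rename pattern (this is not SciPy's actual _rename_parameter implementation, just the general idea; rename_parameter is a hypothetical name):

import functools
import warnings

def rename_parameter(old, new):
    # Accept the old keyword name, warn, and forward it under the new name.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if old in kwargs:
                warnings.warn(f"`{old}` is deprecated; use `{new}` instead",
                              DeprecationWarning, stacklevel=2)
                kwargs[new] = kwargs.pop(old)
            return func(*args, **kwargs)
        return wrapper
    return decorator

@rename_parameter("moment", "order")
def moment(sample, order=1):
    ...  # compute the order-th moment of sample

Calling moment(data, moment=2) would then emit a DeprecationWarning and behave like moment(data, order=2).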

The axis argument should have default value None, but I guess we’ve passed the point of no return as the moment function has axis=0 by default. None would have been a least-surprise choice, and in C layout working on the 0th axis has the worst performance in terms of memory locality.

Until breaking changes are allowed by a SciPy 2.0 release, scipy.stats as a whole has passed the point of no return in the choice of axis=0. Almost every reducing function has axis=0 as the default, so this does, too. Again, I would caution against the use of "should"; this is a matter of opinion. I've written about this a few times, but here are the main reasons for each possibility:

  • axis=0: most convenient for users who pass pandas DataFrames
  • axis=None: same as NumPy
  • axis=-1: like axis=0, assumes slices are independent, but faster for C arrays. Preferred internally for more natural expression of vectorized code.
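For concreteness, here is how the three conventions behave on a small 2-D array (illustration only):

import numpy as np

x = np.arange(6.0).reshape(2, 3)
np.mean(x, axis=0)     # array([1.5, 2.5, 3.5]): one value per column
np.mean(x, axis=None)  # 2.5: reduce over everything (NumPy's default)
np.mean(x, axis=-1)    # array([1., 4.]): one value per row; contiguous in C order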

I've advocated for axis=-1 in the past, but that's beyond the scope of this PR.

The sorted argument is very useful!

Great!

The standardize argument should not be there, and in particular should not default to True, because a higher-order L-moment divided by the second L-moment is analogous to the skew and kurtosis of conventional moments; but skew and kurtosis are not “standardized” moments, and in particular they are not the concern of the moment function.

Skewness and kurtosis are known as "standardized moments".
Higher-order L-moment ratios are also known as "standardized L-moments".
This may or may not change your opinion on whether the standardize argument should be included, but did you have other rationale? My reason for including it, and for setting the default to True, was that in my limited research, higher-order L-moments seemed more commonly reported as L-moment ratios. For instance, R's lmom package - written by Hosking, who introduced the theory of L-moments - has ratios default to True in its samlmu function.
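A quick sketch of what standardize does, assuming the signature this PR settles on (stats.lmoment with an order array and a standardize keyword):

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.gumbel(size=500)

# Raw L-moments lambda_1..lambda_4:
l1, l2, l3, l4 = stats.lmoment(sample, order=[1, 2, 3, 4], standardize=False)
# With standardize=True (the default), orders >= 3 come back as ratios:
tau3, tau4 = stats.lmoment(sample, order=[3, 4])

assert np.isclose(tau3, l3 / l2)  # tau_3 = lambda_3 / lambda_2 (L-skewness)
assert np.isclose(tau4, l4 / l2)  # tau_4 = lambda_4 / lambda_2 (L-kurtosis)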

Could be used as a reference, or better, invite the author to incorporate some of the lower-level stuff into SciPy.

Please elaborate on the motivation for this comment, since this is already written, and I've used R's lmom to generate reference values for testing. Is there functionality that lmo offers that this does not, or does this code not meet your standards?

@fancidev others may not care as much, but I would appreciate it if you would phrase your opinions as such. Thanks!

@fancidev (Contributor) commented:

Thanks for the reply @mdhaber!

The _rename_parameter thing sounds cool! The reason I think there should not be a default order is that one would say “the sample mean is xxx” but would not say “the sample moment is xxx” (which moment?). The signature of the function should reflect this.

For axis, pandas defaults to 0 because it is natural for panel data, where the columns often mean very different things (e.g. age, weight, height), so it doesn’t make sense to average them. Consistent with this, pandas stores its data by columns. Since numpy copes with both Fortran and C conventions, None looks like a reasonable choice. I also find None the least surprising for a multidimensional array of data of the same nature.

Thanks for pointing out the term “standardized moments”; I hadn’t heard of it before. I mentioned the Python package lmo because it looks well written from a quick look (though I have not verified its calculations). I find your code good enough for computing the sample L-moments. But in case you want to add more L-moment related functionality, such as L-moment estimators or more sophisticated statistics, I guess that package could save you the time of reinventing the wheel :-)

Hosking’s R function samlmu by default returns the 1st and 2nd L-moments together with the 3rd and 4th L-ratios. It may well suit R users’ expectations, but I’m not sure it would be a good choice for scipy, as I see scipy more as providing lower-level building blocks that are easily reusable and composable. Higher-level “off-the-shelf” solutions can be left to, say, statsmodels. (My experience with statsmodels is that when it provides the functionality it’s really nice, but if you want something extra then it’s hard to reuse the existing code and I sometimes have to work it out from scratch. Hence my desire for “lower-level building blocks”.)

@mdhaber (Contributor Author) commented May 25, 2024

@rlucas7 I wonder if I could interest you in reviewing this one?

@rlucas7 (Member) commented May 28, 2024

@rlucas7 I wonder if I could interest you in reviewing this one?

@mdhaber I definitely will but I won't be able to look until the coming weekend.

Thanks for putting this together.

@mdhaber (Contributor Author) commented May 28, 2024

Thanks for being willing to review!

@mdhaber mdhaber requested a review from rlucas7 June 10, 2024 05:34
@mdhaber (Contributor Author) commented Jun 10, 2024

Thanks @rlucas7. I resolved the failures (caused by a hasty merge).

@josef-pkt (Member) commented:

Just on the axis argument: this is statistics, or nowadays data science, which has a very strong preference for axis=0.

axis=None is almost always wrong (except in a few fields) because most data has variables that cannot be compared.
axis=0 is and was forever the default layout for data, dataframes, spreadsheets, ... and is used in almost all the literature.

(I just wrote a shortcut function that I can use as a function argument, median = lambda x: np.median(x, axis=0), because numpy has the wrong axis default for statistics. And statsmodels, for example, uses params = params.reshape(self.K, -1, order='F') to follow the literature.)

The last longer discussion on this that I remember was around 14 years ago.

Fortran, with Fortran-ordered arrays, had it correct for "data" handling.
And statistics did not change just because compiled-language preferences changed.

@rlucas7 (Member) commented Jun 12, 2024

Thanks @rlucas7. I resolved the failures (caused by a hasty merge).

thanks @mdhaber, no worries. I'm on a long weekend/vacation this weekend, so I likely won't be able to take a look until the weekend after this one.

I'd like to build the branch and do some poking to check for edge cases etc.

Hopefully that's ok w/ your timeline on this one, and feel free to ping me later next week.

@lucascolley added the needs-release-note label (a maintainer should add a release note written by a reviewer/author to the wiki) Jun 20, 2024
@mdhaber (Contributor Author) commented Aug 15, 2024

@rlucas7 I just fixed merge conflicts so I thought I'd ping you. Thanks for your help!

@mdhaber (Contributor Author) left a comment:

To fix a lint issue.

@jorenham (Contributor) left a comment:

Hi! It's great to see interest in L-moments, so thanks a lot for this!

I'm the author of the Lmo package, and it's safe to say that I'm a genuine "L-moment fanboy".

I have implemented this L-moment sample estimator in lmo.l_moment and lmo.l_ratio, the latter being a more general version of lmoment(..., standardize=True).
You might find it interesting to see how I implemented it.

I've been using lmo in production for a while now (and quite successfully), and it is rather good at preventing numerical overflow, pretty fast, and can deal with large orders.
So you might wanna "borrow" some of the tricks I've used in the implementation.

Lmo adds some methods to the scipy.stats distributions, so that you can e.g. evaluate scipy.stats.norm().l_ratio(512, 2), which gives 2.7563864394155874e-06; i.e., it's surprisingly robust against numerical errors, and it only took ~100 ms to run on my laptop.
So this might be useful here for verifying the sample L-moments using the theoretical ones.

lmoms[2:] /= lmoms[1]  # standardize: orders >= 3 become L-moment ratios

lmoms[n:] = np.nan  # add NaNs where the order exceeds the number of observations
return lmoms[order-1]  # orders are 1-based; shift to 0-based indexing
@jorenham (Contributor) commented:

With order=0 this effectively boils down to np.arange(0)[-1], and raises an IndexError.

Like I mentioned before, I think that order=0 should be allowed, and return 1 instead.
That way you wouldn't have to deal with the order-1 translation anymore either.

@mdhaber (Contributor Author) replied:

Well, we don't get that index error because it is caught by input validation:

ValueError: `order` must be a scalar or a non-empty array of positive integers.

But as I mentioned, I'll look into allowing 0.

@mdhaber (Contributor Author) commented:

After looking a little further, I see why I didn't include this in the first place.

Typical formulas break down for the zeroth L-moment in ways that they don't for the zeroth (regular) moment. For instance, Hosking defines

$$\lambda_r = \frac{1}{r} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} \, \mathrm{E}\left[X_{r-k:r}\right]$$

With $r = 0$, the sum is over no elements, so we don't get to the fact that binom(-1, k) is NaN, but the denominator of the $1/r$ factor on the left is zero. I don't see a way of taking limits to get around the 0/0 because the sum is discrete.

We run into similar problems with the formulas I use. We'd also need the -1st U-statistic.

Hosking doesn't seem to define the 0th L-moment, or at least doesn't include it in the main discussion. It's clear from his definition (the formula for $\lambda_r$ above) that $r=0$ would be a special case.

His lmom package can't calculate the 0th sample L-moment because it returns an integer number of consecutive sample L-moments beginning with 1. The R package EnvStats' lMoment function raises an error for r=0.

We currently document that the order must be (strictly) positive and raise an error when it is zero. Given the information above, this should not be considered incorrect behavior, and adding support for order 0 would be an enhancement.

I'm sure there are arguments for why the zeroth L-moment should be defined as 1, but does a user need code to tell them that? Do you have a reference? Is that true when one of the observations is infinite or NaN? (If not, we'd need to add special cases on top of the special case.)

I'd prefer to leave it out of this first PR, and if someone has a reference and interest, this could be extended in a follow-up PR.

@jorenham (Contributor) replied:

To my knowledge there aren't any authors who have used "zeroth" L-moment before.
But for the product-moments, this isn't the case; the 0th product moment is defined to be 1:

>>> import scipy.stats
>>> scipy.stats.moment([-1/12, 1/137, 3.14, 42], 0)
1.0

L-moments roughly describe the same properties that product moments describe,
and all L-moments with orders $r < 2$ are equal to the product moments (i.e. the first L-moment == the first product-moment).
So if we simply continue this pattern, then the 0th L-moment would also be 1. Or at least, I can't think of anything else that it would be.

To understand a bit better, consider what the statistical (L- or product-) moments actually do:
They describe how the probability mass of a random variable is distributed.
And the way that (at least) the product-moments do this, is recursive (or inductive or something) by nature:

  0. ???
  1. The location of the probability mass w.r.t. the total mass -> the center of mass
  2. The location of the probability mass w.r.t. the first moment -> the displacement
  3. The location of the probability mass w.r.t. the second moment -> the "displacement of the displacement", i.e. the skewness
  4. etc...

If we consider the 0th moment to be the base case, then there's only one way we can define it:

  0. The something w.r.t. something

Now to figure out what something is, we consider the other moments and work our way backwards, and see that something is "the total mass".
So with that, we get:

  0. The total mass w.r.t. the total mass.

The total mass under a probability distribution is by definition 1, so the "w.r.t ..." part could even be left out.
But if we leave it in, then the 0th moment is also 1 if some extra-terrestrial-quantum-alien decides to use 42 as the total probability mass.

👽
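The claim above that the first L-moment equals the mean is easy to check directly (a sketch assuming this PR's stats.lmoment):

import numpy as np
from scipy import stats

x = np.array([-1/12, 1/137, 3.14, 42.0])
# The first L-moment equals the first product moment (the mean):
assert np.isclose(stats.lmoment(x, order=1), np.mean(x))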

@mdhaber (Contributor Author) left a comment:

Thanks for your interest @jorenham! Even though I'm one of the more active maintainers of scipy.stats, I am definitely not a specialist, so your technical insight is appreciated.

_moment_result_object, n_samples=1, result_to_tuple=lambda x: (x,),
n_outputs=lambda kwds: _moment_outputs(kwds, [1, 2, 3, 4])
)
def lmoment(sample, order=None, *, axis=0, sorted=False, standardize=True):
@mdhaber (Contributor Author) commented Aug 22, 2024

I probably set the default to return more than one moment because the natural single-moment default would be the first L-moment, which is identical to the usual first moment (the mean). I chose the default (the first four L-moments) because that is what Hosking chose in lmom's samlmu.

This and the name of the function are certainly fair game for debate, though. I know I prefer lmoment to lmoments, but I'm not sure about the default order. I'll leave this open to see what @rlucas7 thinks.
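Under the signature quoted above, the defaults would behave like this (a hypothetical usage sketch):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=1000)

stats.lmoment(sample)           # default order: lambda_1, lambda_2, tau_3, tau_4
stats.lmoment(sample, order=2)  # a single order: the L-scale lambda_2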


ref = stats.lmoment(sample.astype(np.float64))
assert res.dtype.type == np.float64
assert_allclose(res, ref, rtol=1e-15)

@jorenham (Contributor) commented Aug 22, 2024

The shift-invariance of the L-scale, and the scale-invariance of the L-ratios, might also be worth verifying. I.e. scale * lmoment(x, 2) == lmoment(loc + scale * x, 2) and lmoment(x, r) == lmoment(loc + scale * x, r) (r: int > 2) for all finite loc, and for all finite, nonzero scale.

Maybe hypothesis could be used for this (although IMO it's rather annoying to fine-tune in order to avoid flakiness, but there's a good chance that you're better at taming it than I am 🤷🏻).
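A plain-NumPy sketch of those checks (no hypothesis; it assumes this PR's stats.lmoment and a positive scale, since a negative scale would flip the sign of the odd-order ratios):

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)
loc, scale = 3.5, 2.0
y = loc + scale * x

# L-scale is shift-invariant and scales linearly:
assert np.isclose(scale * stats.lmoment(x, order=2), stats.lmoment(y, order=2))
# L-moment ratios (order > 2, standardize=True) are shift- and scale-invariant:
assert np.isclose(stats.lmoment(x, order=3), stats.lmoment(y, order=3))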

@mdhaber (Contributor Author) replied:

I added the ability to use hypothesis in SciPy, so I'm a fan of property-based testing. If needed, I can explain why you don't see hypothesis used in this PR. Ultimately, it will depend on the opinion of the reviewing maintainer whether this particular test is important to add. Unless we exercise different code paths with such a property-based test, I think it is unlikely that this test would catch an error while tests that look for precise agreement with a numerical reference pass.

@jorenham (Contributor) replied:

Logically this won't test much, I agree.
But it could help verify the numerical stability.

@mdhaber (Contributor Author) replied:

I'll leave this open for @rlucas7 to assess.

@mdhaber (Contributor Author) left a comment:

Thanks @jorenham. To facilitate further cycles of review, please click "Resolve conversation" when the comment is no longer needed. Please leave those that require action from me or input from @rlucas7 unresolved.

@jorenham (Contributor) commented Aug 22, 2024

Thanks @jorenham. To facilitate further cycles of review, please click "Resolve conversation" when the comment is no longer needed. Please leave those that require action from me or input from @rlucas7 unresolved.

I don't think I can; I'm not a maintainer here.


I'll 👍🏻 the last comment instead

@mdhaber (Contributor Author) commented Aug 22, 2024

Ok, please add an 😄 emoji to the ones I can resolve.

@jorenham (Contributor) replied:

Ok, please add an 😄 emoji to the ones I can resolve.

It apparently became a 👍🏻, is that ok too?

@mdhaber (Contributor Author) commented Sep 29, 2024

Hey @rlucas7, thought I'd ping you again here. Think we can get this into the next version of SciPy?

@tupui (Member) left a comment:

This LGTM; it reads close to the definition, so I don't have much to say. Well tested as usual, and with tons of input validation.

I don't see anything that would be blocking personally so +1 here.

@@ -10199,6 +10199,146 @@ def first_order(t):
return res.root


def _lmoment_iv(sample, order, axis, sorted, standardize):
@tupui (Member) commented:

We really need to think about getting these input-validation functions into their own place. For one, I think you are the only one who does things like this, and second, I would think we could reuse this stuff.

@tupui tupui merged commit 6f0f0b8 into scipy:main Nov 8, 2024
39 checks passed
@tupui tupui added this to the 1.15.0 milestone Nov 8, 2024
@mdhaber (Contributor Author) commented Nov 8, 2024

Oops, the last time we ran tests was a while ago, and this is failing a relatively new test. I'll work on that now.

Labels: scipy.stats, enhancement (a new feature or improvement), needs-release-note (a maintainer should add a release note written by a reviewer/author to the wiki)

7 participants