ENH: stats.lmoment: add function to calculate sample L-moments #19475
Conversation
Some self-review to aid the reviewer.
scipy/stats/_stats_py.py
```python
x = np.broadcast_to(x, x.shape[:-2] + (len(r), n))
x = np.triu(x)
j = np.arange(n)
return np.sum(special.binom(j, r[:, np.newaxis])*x, axis=-1) / special.binom(n-1, r) / n
```
It is a little tricky to compare this implementation against the formula. Without thinking carefully:
- It may look as though `special.binom(j, ...)` should be something like `special.binom(j - 1, ...)`.
- It may look as though `j = np.arange(n)` should be something like `j = np.arange(r, n + 1)`.

But I think closer inspection will reveal that this implementation is equivalent. Note that `r` is a vector, so we're calculating multiple `r` at once. The broadcasting and `x = np.triu(x)` zero out the terms that shouldn't be part of the sum.
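To make that equivalence concrete, here is a minimal self-contained sketch (the helper names `br_loop` and `br_vectorized` are hypothetical; only the four lines in the diff above come from the PR) comparing a direct transcription of the probability-weighted-moment formula against the vectorized version:

```python
import numpy as np
from scipy import special

def br_loop(x, r):
    # direct transcription of b_r = (1/n) * sum_{j=r+1}^{n} C(j-1, r)/C(n-1, r) * x_(j),
    # with 1-based j and x sorted in ascending order
    x = np.sort(x)
    n = len(x)
    total = sum(special.binom(j - 1, r) * x[j - 1] for j in range(r + 1, n + 1))
    return total / special.binom(n - 1, r) / n

def br_vectorized(x, r_max):
    # mirrors the diff above for r = 0, 1, ..., r_max: with 0-based j,
    # binom(j, r) is exactly 0 for integer j < r, and np.triu zeros row r
    # of the broadcast x for columns j < r, so summing over all j = 0..n-1
    # reproduces the restricted sum over j = r+1..n above
    x = np.sort(x)
    n = len(x)
    r = np.arange(r_max + 1)
    x = np.triu(np.broadcast_to(x, (len(r), n)))
    j = np.arange(n)
    return np.sum(special.binom(j, r[:, np.newaxis])*x, axis=-1) / special.binom(n-1, r) / n

rng = np.random.default_rng(19475)
x = rng.standard_normal(10)
print(br_loop(x, 2), br_vectorized(x, 3)[2])  # these agree to floating-point precision
```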
```python
bk = _br(sample, r=k)

n = sample.shape[-1]
bk[..., n:] = 0  # remove NaNs due to n_moments > n
```
When the number of moments exceeds the number of observations, NaNs appear in the corresponding columns of `bk`. The weight applied to these values is 0 for L-moments of order less than or equal to the number of observations, and nonzero for L-moments of order greater than the number of observations; therefore, only the L-moments of order greater than the number of observations should be NaN. Unless we're careful, that's not what happens in `prk * bk` below, because `0 * np.nan` is `np.nan`. Instead, we zero out these NaNs in `bk` and NaN-out the rows of the L-moments for which the order is greater than the number of observations.
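A toy illustration of the pitfall, with made-up numbers rather than the PR's actual weights:

```python
import numpy as np

# say n = 2 observations but 3 moments were requested, so bk[2] is NaN;
# for an L-moment of order <= n, the weight on that column is exactly 0
bk = np.array([1.0, 2.0, np.nan])
prk = np.array([0.5, 0.25, 0.0])

print(np.sum(prk * bk))  # nan -- the NaN leaks through because 0 * np.nan is np.nan

bk[2:] = 0               # zero out the spurious NaNs first...
print(np.sum(prk * bk))  # 1.0 -- the order <= n L-moment is now finite
# ...and separately NaN-out the L-moments whose order exceeds n
```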
Even if vectorized, the number of samples is always a constant.
So perhaps we could raise an error here instead? Maybe it could help to keep the code nice and simple. Plus, there's the whole fail-fast dogma argument.
> Even if vectorized, the number of samples is always a constant.

Not for masked arrays.
I vote for the name … The argument … The … The … The …
Btw, I came across this Python package lmo (BSD license), which deals with L-moments and appears to be well written. It could be used as a reference, or better, invite the author to incorporate some of the lower-level stuff into SciPy.
That is a justifiable perspective. Please keep in mind that the choice is subjective; other qualified people have thought about the same thing and come to a different conclusion.
This can easily be changed in a backward-compatible way with …
Before breaking changes are allowed by a SciPy 2.0 release, …
I've advocated for …
Great!
Skewness and kurtosis are known as "standardized moments".
Please elaborate on the motivation for this comment, since this is already written, and I've used R's … as a reference.
@fancidev, others may not care as much, but I would appreciate it if you would phrase your opinions as such. Thanks!
Thanks for the reply @mdhaber! The … For …
Thanks for pointing out the term "standardized moments"; I hadn't heard of it before. I mentioned Python … Hosking's R function …
@rlucas7 I wonder if I could interest you in reviewing this one?
Thanks for being willing to review!
Thanks @rlucas7. I resolved the failures (caused by a hasty merge).
Just on the axis argument: axis=None is almost always wrong (except in a few fields) because most data has variables that cannot be compared. (I just wrote a shortcut function so that I can use a function argument …) The last longer discussion on this that I remember was around 14 years ago. Fortran with Fortran-ordered arrays had it correct for "data" handling.
Thanks @mdhaber, no worries. I'm on a long weekend/vacation this weekend, so I likely won't be able to take a look until the weekend after this one. I'd like to build the branch and do some poking to check for edge cases etc. Hopefully that's OK with your timeline on this one; feel free to ping me later next week.
@rlucas7 I just fixed merge conflicts, so I thought I'd ping you. Thanks for your help!
To fix a lint issue.
Hi! It's great to see interest in L-moments, so thanks a lot for this!
I'm the author of the Lmo package, and it's safe to say that I'm a genuine "L-moment fanboy".
I have implemented this L-moment sample estimator in `lmo.l_moment` and `lmo.l_ratio`, the latter being a more general version of `lmoment(..., standardize=True)`.
You might find it interesting to see how I implemented it.
I've been using `lmo` in production for a while now (and quite successfully), and it is rather good at preventing numerical overflow, pretty fast, and can deal with large orders.
So you might wanna "borrow" some of the tricks I've used in the implementation.
Lmo adds some methods to the `scipy.stats` distributions, so that you can e.g. call `scipy.stats.norm().l_ratio(512, 2)`, which is `2.7563864394155874e-06`; i.e., it's surprisingly robust against numerical errors, and it only took ~100 ms to run on my laptop.
So this might be useful here for verifying the sample L-moments against the theoretical ones.
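For example, a cross-check along these lines might work (a hedged sketch: the `lmo.l_moment` and `lmo.l_ratio` call signatures are assumed from the description above, not verified, and the package must be installed separately):

```python
import lmo  # the third-party package discussed above
from scipy import stats

x = stats.norm.rvs(size=100_000, random_state=19475)

# sample L-scale of a standard normal should approach 1/sqrt(pi) ~= 0.5642
print(lmo.l_moment(x, 2))

# sample L-kurtosis tau_4 of a normal should approach ~0.1226
print(lmo.l_ratio(x, 4, 2))
```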
```python
lmoms[2:] /= lmoms[1]

lmoms[n:] = np.nan  # add NaNs where appropriate
return lmoms[order-1]
```
With `order=0` this effectively boils down to `np.arange(0)[-1]`, and raises an `IndexError`.
Like I mentioned before, I think that `order=0` should be allowed, and should return 1 instead.
This way you wouldn't have to deal with the `order-1` translation anymore either.
Well, we don't get that `IndexError` because it is caught by input validation:
ValueError: `order` must be a scalar or a non-empty array of positive integers.
But as I mentioned, I'll look into allowing `0`.
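For illustration, the rejection looks something like this on this PR's branch (a sketch; the exact traceback formatting is assumed):

```python
>>> from scipy import stats
>>> stats.lmoment([1.0, 2.0, 3.0], order=0)
Traceback (most recent call last):
  ...
ValueError: `order` must be a scalar or a non-empty array of positive integers.
```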
After looking a little further, I see why I didn't include this in the first place.
Typical formulas break down for the zeroth L-moment in ways that they don't for the zeroth (regular) moment. For instance:
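Presumably the definition in question is Hosking's, which expresses the $r$-th L-moment in terms of expected order statistics:

$$\lambda_r = \frac{1}{r} \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} \operatorname{E}\!\left[X_{r-k:r}\right]$$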
The sum is over no elements, so we don't even get to the fact that `binom(-1, j)` is NaN, but the denominator on the left is zero. I don't see a way of taking limits to get around the 0/0, because the sum is discrete.
We run into similar problems with the formulas I use. We'd also need the -1st U-statistic.
Hosking doesn't seem to define the 0th L-moment, or at least doesn't include it in the main discussion. It's clear from his definition that …
His `lmom` package can't calculate the 0th sample L-moment because it returns an integer number of consecutive sample L-moments beginning with 1. R's `EnvStats` function `lMoment` raises an error for `r=0`.
We currently document that the order must be (strictly) positive and raise an error when it is zero. Given the information above, this should not be considered incorrect behavior, and adding support for order 0 would be an enhancement.
I'm sure there are arguments for why the zeroth L-moment should be defined as 1, but does a user need code to tell them that? Do you have a reference? Is that true when one of the observations is infinite or NaN? (If not, we'd need to add special cases on top of the special case.)
I'd prefer to leave it out of this first PR, and if someone has a reference and interest, this could be extended in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To my knowledge, there aren't any authors who have used a "zeroth" L-moment before.
But for the product-moments, this isn't the case; the 0th product moment is defined to be 1:
>>> scipy.stats.moment([-1/12, 1/137, 3.14, 42], 0)
1.0
L-moments roughly describe the same properties that product moments describe, and all L-moments with orders …
So if we simply continue this pattern, then the 0th L-moment would also be 1. Or at least, I can't think of anything else that it could be.
To understand a bit better, consider what the statistical (L- or product-) moments actually do:
They describe how the probability mass of a random variable is distributed.
And the way that (at least) the product-moments do this is recursive (or inductive or something) by nature:
- ???
- The location of the probability mass w.r.t. the total mass -> The center of mass
- The location of the probability mass w.r.t. the first moment -> The displacement
- The location of the probability mass w.r.t. the second moment -> The "displacement of the displacement", i.e. the skewness
- etc...
If we consider the 0th moment to be the base case, then there's only one way we can define it:
- The `something` w.r.t. the `something`

Now to figure out what `something` is, we consider the other moments and work our way backwards, and see that `something` is "the total mass".
So with that, we get:
- The total mass w.r.t. the total mass.
The total mass under a probability distribution is by definition 1, so the "w.r.t ..." part could even be left out.
But if we leave it in, then the 0th moment is also 1 if some extra-terrestrial-quantum-alien decides to use 42 as the total probability mass.
👽
Thanks for your interest @jorenham! Even though I'm one of the more active maintainers of `scipy.stats`, I am definitely not a specialist, so your technical insight is appreciated.
```python
    _moment_result_object, n_samples=1, result_to_tuple=lambda x: (x,),
    n_outputs=lambda kwds: _moment_outputs(kwds, [1, 2, 3, 4])
)
def lmoment(sample, order=None, *, axis=0, sorted=False, standardize=True):
```
I probably set the default to return more than one moment because the natural default would otherwise be the first L-moment, which is identical to the usual first moment (the mean). I chose the default (the first four L-moments) because that is what Hosking chose in `lmom`'s `samlmu`.
This and the name of the function are certainly fair game for debate, though. I know I prefer `lmoment` to `lmoments`, but I'm not sure about the default `order`. I'll leave this open to see what @rlucas7 thinks.
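For reference, a quick sketch of the chosen default in use (assuming this PR's branch; the sample here is made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(19460)
sample = rng.standard_normal(100)

# order=None returns the first four sample L-moments, following samlmu;
# with standardize=True (the default), the 3rd and 4th entries are the
# L-moment ratios tau_3 and tau_4 rather than raw L-moments
l1, l2, t3, t4 = stats.lmoment(sample)

# a single order can also be requested
l2_alone = stats.lmoment(sample, order=2)
```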
```python
ref = stats.lmoment(sample.astype(np.float64))
assert res.dtype.type == np.float64
assert_allclose(res, ref, rtol=1e-15)
```
The shift-invariance of the L-scale and the scale-invariance of the L-ratios might also be worth verifying, i.e. `scale * lmoment(x, 2) == lmoment(loc + scale * x, 2)` and `lmoment(x, r) == lmoment(loc + scale * x, r)` (for `r: int > 2`), for all finite `loc` and all finite, nonzero `scale`.
Maybe `hypothesis` could be used for this (although IMO it's rather annoying to fine-tune in order to avoid flakiness, but there's a good chance that you're better at taming it than I am 🤷🏻).
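A rough sketch of what such a property-based test could look like (assuming the `lmoment` signature from this PR; the strategy bounds and tolerances are guesses that would need tuning to avoid the flakiness mentioned above):

```python
import numpy as np
from hypothesis import given, strategies as st
from numpy.testing import assert_allclose
from scipy import stats

finite = st.floats(-1e3, 1e3, allow_nan=False, allow_infinity=False)
positive = st.floats(1e-3, 1e3, allow_nan=False, allow_infinity=False)

x = np.arange(10.0)  # any fixed sample would do

@given(loc=finite, scale=positive)
def test_l_scale_equivariance(loc, scale):
    # the L-scale is shift-invariant and scales linearly with `scale`
    assert_allclose(scale * stats.lmoment(x, 2),
                    stats.lmoment(loc + scale * x, 2), rtol=1e-9)

@given(loc=finite, scale=positive)
def test_l_ratio_invariance(loc, scale):
    # standardized L-moments of order r > 2 are invariant under
    # positive affine transformations
    assert_allclose(stats.lmoment(x, 3),
                    stats.lmoment(loc + scale * x, 3), rtol=1e-9)
```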
I added the ability to use `hypothesis` in SciPy, so I'm a fan of property-based testing. If needed, I can explain why you don't see `hypothesis` used in this PR. Ultimately, it will depend on the opinion of the reviewing maintainer whether this particular test is important to add. Unless we exercise different code paths with such a property-based test, I think it is unlikely that this test would catch an error while tests that look for precise agreement with a numerical reference pass.
Logically this won't test much, I agree.
But it could help verify the numerical stability.
I'll leave this open for @rlucas7 to assess.
Ok, please add a 😄 emoji to the ones I can resolve.
It apparently became a 👍🏻, is that ok too?
Hey @rlucas7, thought I'd ping you again here. Think we can get this into the next version of SciPy?
This LGTM; it stays close to the definition, so I don't have much to say. Well tested as usual, and with tons of input validation.
I don't see anything that would be blocking personally so +1 here.
```
@@ -10199,6 +10199,146 @@ def first_order(t):
        return res.root


def _lmoment_iv(sample, order, axis, sorted, standardize):
```
We really need to think about getting these into their own place. For one, I think you are the only one who does things like that, and second, I would think we could reuse stuff.
Oops, the last time we ran tests was a while ago, and this is failing a relatively new test. I'll work on that now.
Reference issue
gh-19460
What does this implement/fix?
The idea of fitting distributions to data with the method of L-moments has been tossed around for a while, and gh-19460 brought it back up again. This PR adds a function to compute sample L-moments, which is an essential step of the method of L-moments.
Additional information
No rush here; I wrote the essential part of the code in #19460 (comment) so I figured I might as well submit a PR.
This needs an email to the mailing list before merge. (Done 3/11/2024.)
I can see arguments for the name being `l_moment` instead of `lmoment`. If there is strong support for changing it, that's OK with me.