
Clustered SE considerations in feols #6

Closed
karldw opened this issue Nov 23, 2019 · 17 comments

@karldw

karldw commented Nov 23, 2019

The details of clustering and degrees-of-freedom corrections are perennial issues, and challenging for all the reasons Sergio pointed out in this thread.

When I run @grantmcdermott's example from that same discussion, feols gives the same results as lfe::felm or Stata's cgmreg, but different results from Stata's reghdfe or Grant's proposed felm(..., cmethod = "reghdfe").
I'm not sure you want to do anything differently, but it would be helpful to document the differences across packages.

Ref: DeclareDesign/estimatr#321, sgaure/lfe#19, FixedEffects/FixedEffectModels.jl#49

@lrberge
Owner

lrberge commented Nov 25, 2019

Thanks for the pointer. Comparability across software is indeed important.

I'll include more options in SE clustering. Then, I'll write a vignette explaining exactly how the clustered SEs are computed in the package, detailing the different options.

The problem is that the theory behind SE clustering is not settled (as mentioned by Sergio). Even nesting is not so clear-cut. Let me use the example you introduced in the thread you referenced.

Let's say you cluster the SEs on a dimension that nests only half the data (e.g. you estimate employee FEs and cluster at the company level, leading to partial nesting):

# var. 'cl' (ranging from 1 to 50) originally nests the var. 'fe' (from 1 to 500)
# now 'cl_half' nests only half the sample (random clusters for the rest)
cl_half <- cl
cl_half[cl < 25] <- 100 + sample(1:30, sum(cl < 25), replace = TRUE)

There, as nesting is not complete, it will pass undetected; but if it were detected, it would lead to a reduction in the number of parameters K from 501 to about 251 (and a corresponding decrease in the value of the SE). Or should K even be reduced to 1 when there is partial nesting (that is, equal to the clustered within estimator)?
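A nestedness check of the kind reghdfe performs can be sketched in a few lines (Python here purely as a language-neutral illustration; the function name nested_fe_levels is mine, not any package's API): for each FE level, record the set of clusters its observations fall into, and flag the levels that sit entirely inside one cluster.

```python
from collections import defaultdict

def nested_fe_levels(fe, cl):
    """FE levels whose observations all belong to a single cluster."""
    clusters_per_fe = defaultdict(set)
    for f, c in zip(fe, cl):
        clusters_per_fe[f].add(c)
    return {f for f, cs in clusters_per_fe.items() if len(cs) == 1}

# Toy data: FE levels 0 and 1 are each contained in one cluster; level 2 straddles two.
fe = [0, 0, 1, 1, 2, 2]
cl = ["A", "A", "B", "B", "A", "B"]
print(nested_fe_levels(fe, cl))  # {0, 1}
```

With full nesting the set covers every FE level and all G coefficients can arguably be dropped from K; with partial nesting only the flagged levels are candidates, which is exactly the gray zone described above.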

Another issue is inverse nesting. If instead of estimating:
reghdfe y x, absorb(fe) vce(cluster cl)
you estimate
reghdfe y x, absorb(cl) vce(cluster fe)
there is no correction in the latter (meaning K equals 51 instead of 1), although cl is strictly included in fe.

My point is that there are still open questions. But I fully agree different software should be able to provide identical results. I'll work on that!

@grantmcdermott
Contributor

@lrberge I need to think a bit more about your first example (partial nesting). But I'm not sure I see a problem with the second (inverse nesting).

We care about FEs being nested within the cluster variables because the former are swept out of the regression prior to estimation. (Unless you're doing something like LSDV estimation which doesn't sweep out the FEs, but that's obviously very inefficient.) So they're not automatically "present" when we make the necessary DoF cluster adjustment. I don't really see how this should have implications going the other way around. Aren't we identifying off of a system where all possible variation in (and within) this hierarchical structure has already been removed? I mean, that loss of information is the standard price we pay for quick estimation of panel models using Frisch-Waugh, etc.

(Practical example: do we care about the strong within-household correlations between individuals (e.g. siblings) when estimating a model at the household level? I don't think we do. Of course, if you have data on individuals then you might as well use that instead of the (collinear) household FEs, but then you're estimating a different model.)

Regardless of what you decide, I strongly agree with you that the clustering of standard errors is much less of a precise science than people like to pretend! A vignette detailing some options for reproducing the results from other SW would be a welcome addition. Let me know if you'd like any help.

PS — Since we're on the subject... If I may, the current documentation (in the introduction for example) is a little confusing because you sometimes refer to groups of fixed effects as "clusters". I understand why you do this from a conceptual standpoint, but it might throw new users off.

@lrberge
Owner

lrberge commented Nov 27, 2019

Thanks for weighing in! Very healthy discussion! :-)

I don't really understand your point. But let me try to make mine clearer.

The main difference between SEs is driven by how we compute K, the number of parameters. So far in reghdfe, felm and feols, we use:

K = number of variables + number of Fixed-Effects coefficients

That is, in @karldw's example, we use K = 501 (1 variable and 500 FE coefs). This is true for the three functions.

Now we differ on how to handle nestedness when clustering the standard errors.
In both felm and feols, nothing special is done: regular clustering with no modification of the DoF -- K remains constant.
On the other hand, in reghdfe, when FEs are found to be nested, there is a big drop in K, sweeping out all the fixed-effect coefficients -- K becomes only equal to the number of variables (1 in our example).

While I'm perfectly fine with this approach, what makes me uncomfortable is the sharp, discontinuous decline in K. Let me take an example to make it clear:

# cl is the variable nesting the FEs
cl_bis = cl
# I change just one obs.
cl_bis[1] = 51

What I have done is create a variable that nests all of the fixed-effects but one. In all logic, we would expect similar results when clustering by cl or cl_bis, since the two variables are identical on 2,499 of the 2,500 observations. Let's see:

| method     | se (clustered: cl) | t-stat (cl) | se (clustered: cl_bis) | t-stat (cl_bis) |
|------------|--------------------|-------------|------------------------|-----------------|
| felm/feols | 0.01484            | 1.91        | 0.01485                | 1.91            |
| reghdfe    | 0.01327            | 2.14        | 0.01485                | 1.91            |

Due to this behavior regarding nestedness, there is a big discrepancy (-11%!) between the two standard errors in reghdfe, even though the variables cl and cl_bis are 99.96% identical.

This discontinuity is a call for theory. I would be much more comfortable with something that removes one coefficient from K for each nested FE coefficient; that way there would be no such discrepancy. But again: theory needed.

This is what I had in mind when talking about partial nesting.

@lrberge
Owner

lrberge commented Nov 27, 2019

In any case, I'll also implement something to take nestedness into account; my only point was to highlight that this is not so clear-cut after all. I welcome any suggestion.

By the way, regarding your post scriptum, you're right: in my mind I use the words cluster and fixed-effects interchangeably, and this trickles down to the description I wrote.
Indeed, terminology is important and using one word for another leads to confusion. I'll amend it, thanks for the comment.

@lrberge
Owner

lrberge commented Nov 27, 2019

Some quick and dirty Monte Carlo simulation:

  • y is white noise
  • x is white noise
  • fe is a fixed-effect, with 10 observations per FE

I estimate the following:

y_it = fe_i + beta * x_it + eps_it

Since y is white noise, we expect a rejection of the null for x 5% of the time using standard testing. Furthermore, clustering the SEs should have no effect since y is just white noise.

Define k = 1 (the number of variables) and G = the number of fixed-effects. Here are the results when using different values for K (either K = k or K = k + G): the table reports the number of times the null hypothesis is rejected for x over 1000 estimations. I use a sample of 1000 observations and replicate the experiment 100 times.

[image: table of rejection rates for x under each choice of K, with and without clustering]

The results clearly show that when not clustering the SEs, we should use K = k + G.
However, for clustered SEs it is less clear: using K = k + G leads to under-rejection, while K = k leads to over-rejection... Any thoughts?
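The unclustered part of the experiment can be sketched along these lines (Python purely for illustration; the exact rejection counts differ from the run above, and the 1.96 normal cutoff stands in for the t(n - K) critical value):

```python
# Rough re-creation of the Monte Carlo above: y and x are white noise, 'fe' has
# 10 observations per level; demean within FE, regress y on x, and test beta = 0
# with the residual variance corrected by either K = k or K = k + G.
import numpy as np

def rejection_rate(n=1000, per_fe=10, n_sim=1000, use_full_K=True, seed=0):
    rng = np.random.default_rng(seed)
    G = n // per_fe
    fe = np.repeat(np.arange(G), per_fe)
    k = 1
    K = k + G if use_full_K else k
    rej = 0
    for _ in range(n_sim):
        y = rng.standard_normal(n)
        x = rng.standard_normal(n)
        for v in (y, x):  # within transformation: demean inside each FE group
            v -= (np.bincount(fe, weights=v) / per_fe)[fe]
        beta = (x @ y) / (x @ x)
        resid = y - beta * x
        sigma2 = resid @ resid / (n - K)  # K enters only through this DoF correction
        se = np.sqrt(sigma2 / (x @ x))
        rej += abs(beta / se) > 1.96      # normal critical value for the 5% test
    return rej / n_sim

print(rejection_rate(use_full_K=True))   # near the nominal 5%
print(rejection_rate(use_full_K=False))  # over-rejects: the absorbed FEs are ignored in K
```

Since the demeaning step uses up one mean per FE group, dividing the residual sum of squares by n - k - G is what makes the variance estimate unbiased, which is why K = k + G recovers the nominal size in the unclustered case.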

@ghost

ghost commented Jan 4, 2020

This is an interesting discussion! Do you have any insights on which degrees of freedom to use for the robust F test in case of multi-way clustering? (See also the question on Cross Validated: https://stats.stackexchange.com/questions/229678/double-clustered-standard-errors-and-degrees-of-freedom-in-wald-style-f-test-for)

@lrberge
Owner

lrberge commented Feb 3, 2020

Sorry @Helix123, I have no clue! Maybe you could just check by running some Monte Carlo experiments?

@lrberge
Owner

lrberge commented Feb 3, 2020

I have updated the package to account for different types of standard-errors. Now the nested way is the default. It will avoid confusion when comparing with Stata output.

I have implemented a new function dof() where you can set how the DoF correction should be done: a nested, full, or no accounting of the fixed-effects parameters; in the case of multiple fixed-effects, whether the exact number of restrictions should be accounted for; and, when clustering the SEs, the adjustment at the cluster level.

@lrberge lrberge closed this as completed Feb 3, 2020
@grantmcdermott
Contributor

@lrberge Looks great.

@Helix123 I don't think this is the right place for the Wald DoF discussion, but see the conversation here. FWIW, lfe::felm currently makes a simple DoF adjustment based on the length of the first cluster (which is an error in my view, as I point out in the link).

@lrberge
Owner

lrberge commented Jul 14, 2020

Hi, just to say that I have finally settled on the way to compute the SEs in fixest. It was a much more complicated journey than I initially thought!

I detailed how they are computed in this vignette: On standard-errors.
I also compare them to lfe and reghdfe. By the way, I couldn't understand how they are currently computed in lfe (version 2.8-5) for clustered standard-errors with multiple FEs; I think there might be a bug.

Feel free to send me any comments; I'll modify the document accordingly. Thanks again for the discussion!

@karldw
Author

karldw commented Jul 14, 2020

This is really useful! I appreciate "You’re already fed up about about these details? I’m sorry but there’s more".

I'm not sure how deep you want to go on how to choose among the different options. As you say, you "don't discuss the why, but only the how." That said, it could be useful to have a couple of pointers for folks who want to learn more about making the right choice for their application. The references at the end are great, but it's unclear which paper addresses which choice in the vignette.

@grantmcdermott
Contributor

Super work, Laurent. I sympathize strongly with your "complicated journey", having easily lost a week+ of my life to these issues before!

The one thing I'll say about lfe is that adding the cmethod = 'reghdfe' option generally (if not always) tends to reconcile the differences between it and reghdfe in the case of multiway clustering. This is, in part, what my PR here was meant to do and is discussed in more depth in the lfe::felm() documentation. Relevant section from the latter:

The cmethod argument may affect the clustered covariance matrix (and thus regressor standard errors), either directly or via adjustments to a degrees of freedom scaling factor. In particular, Cameron, Gelbach and Miller (CGM2011, sec. 2.3) describe two possible small cluster corrections that are relevant in the case of multiway clustering.

The first approach adjusts each component of the cluster-robust variance estimator (CRVE) by its own c_i adjustment factor. For example, the first component (with G clusters) is adjusted by c_1 = G/(G-1)×(N-1)/(N-K), the second component (with H clusters) is adjusted by c_2 = H/(H-1)×(N-1)/(N-K), etc.

The second approach applies the same adjustment to all CRVE components: c = J/(J-1)×(N-1)/(N-K), where J=min(G,H) in the case of two-way clustering, for example.

Any differences resulting from these two approaches are likely to be minor, and they will obviously yield exactly the same results when there is only one cluster dimension. Still, CGM2011 adopt the former approach in their own paper and simulations. This is also the default method that felm uses (i.e. cmethod = 'cgm'). However, the latter approach has since been adopted by several other packages that allow for robust inference with multiway clustering. This includes the popular Stata package reghdfe, as well as the FixedEffectModels.jl implementation in Julia. To match results from these packages exactly, use cmethod = 'cgm2' (or its alias, cmethod = 'reghdfe'). It is possible that some residual differences may still remain; see discussion here.
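The two adjustments quoted there are just scalar factors multiplying the CRVE components, so the difference is easy to inspect numerically. A quick sketch (Python purely for illustration; the dimensions N, K, G, H are made up, and the helper cgm_factor is mine):

```python
# Hypothetical dimensions: N obs., K parameters, and G, H clusters along
# the two dimensions of a two-way clustered model.
N, K, G, H = 2500, 501, 50, 30

def cgm_factor(J, N, K):
    """Small-cluster adjustment c = J/(J-1) * (N-1)/(N-K)."""
    return J / (J - 1) * (N - 1) / (N - K)

# Approach 1 (felm's default, cmethod = 'cgm'): each CRVE component gets its own factor.
c1, c2 = cgm_factor(G, N, K), cgm_factor(H, N, K)

# Approach 2 (reghdfe; felm's cmethod = 'cgm2'/'reghdfe'): one common factor, J = min(G, H).
c_common = cgm_factor(min(G, H), N, K)

print(round(c1, 4), round(c2, 4), round(c_common, 4))  # 1.2756 1.2932 1.2932
```

As the quoted passage says, the gap between the two approaches is small here (about 1.4% on the first component), and it vanishes entirely with a single cluster dimension.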

It's also worth noting that I've also seen xtreg and reghdfe differ on occasion...

Having said all that:

  1. Running felm(y ~ x | id + time | 0 | id, base, cmethod = 'reghdfe') doesn't reconcile the differences here... unsurprisingly, since we're only clustering on a single dimension.
  2. Linking to or mentioning the above may only add to the confusion, so not clear to me that you want to include it!

@grantmcdermott
Contributor

Again at the risk of overkill, but perhaps appropriate for an acknowledgements or suggested further reading section at the end:

I'm not sure if you've seen the new(ish) sandwich vignette on clustered covariances? It's very thorough.

@lrberge
Owner

lrberge commented Jul 15, 2020

Regarding lfe, I think that once one set of fixed-effects is nested, the degrees of freedom associated with the FEs become 1, even if there are other, non-nested FEs (to be confirmed).

@karldw: as you could tell, I... was the one who was fed up with the details!

@grantmcdermott: "I've also seen xtreg and reghdfe differ on occasion" God, please don't tell me things like that! ;-)

I think you're right, providing some pointers is important, and the sandwich vignette (forthcoming in JSS btw) is a fantastic reference (that I didn't know of, thanks for the ref!). Btw the package is finally compatible with sandwich, so even more vcov possibilities! youhou!

@grantmcdermott
Contributor

@lrberge I've just been reading through the vignette again and realised that newcomers to fixest might be confused by the fact that the arguments you refer to at the top, "se" and "dof", apply to the methods (summary.fixest(), etc.), rather than to the original model call(s).

I know that you provide examples further below, but I really think something right out of the gate would help orientate readers.

It could be something as simple as adding a one-liner under the appropriate headings. For example, under "The argument se" you could write

The argument se

The argument se can be equal to either: "standard", "white", "cluster", "twoway", or "threeway". For example, we can adjust the standard-errors for an existing model on the fly as follows:

gravity <- feols(log(Euros) ~ log(dist_km) | Origin + Destination + Product + Year, trade)
summary(gravity, se = "twoway")

And similarly for "The argument dof" heading.

Just a thought!

@grantmcdermott
Contributor

Oh, another thing. Do you think there's any value in showing readers how to extract, say, the degrees of freedom from an existing model? Might be useful to let them know where the numbers are coming from, e.g.

attr(vcov(gravity), "dof.K")

@lrberge
Owner

lrberge commented Aug 6, 2020

You're right indeed: I write as if users were already acquainted with the package, but that's a very wrong assumption!

Yep, looks like a good idea to add the dof.K example too.

Thanks for the suggestions!

pachadotdev referenced this issue in pachadotdev/fixest2 Mar 19, 2022
comment out tests with errors + standard CI