gtools version of merge #76

Open
NilsJPWerner opened this issue Jul 26, 2021 · 5 comments

Comments


NilsJPWerner commented Jul 26, 2021

What would you like gtools to add or change (and why)?
It would be fantastic if gtools had a gmerge command. ftools seems to have join/fmerge, which is a 2x speedup over merge, but since it is implemented in Mata it can't support mixed types.

Please include a specific suggestion
Add gmerge command that implements the standard merge functionality.

NilsJPWerner changed the title from "gtools version of gmerge" to "gtools version of merge" on Jul 26, 2021
@mcaceresb
Owner

@NilsJPWerner In theory I'd like to implement this, but in practice I've looked into it a bit and it's very complicated and not at all clear that I'd get a very large speed improvement. I'd like to look into this again in the future but it won't be any time soon. Sorry!


fpet19 commented Dec 1, 2021

Unrelated to merge but also a suggestion: it would be great if you could provide a gtools enhancement for carryforward. This is an essential (to me) but often overlooked command, and currently extremely slow. Thanks!

@mcaceresb
Owner

@fpet19 I am curious, what is a specific scenario/example where carryforward is very slow? I have not used it, but isn't it a wrapper for replace var = var[_n-1] if mi(var)?

It's surprising that this is especially slow. Or is the issue that if you call it with by, you have to sort the data first? Since it's sensitive to sort order, I would have assumed sorting might be an unavoidable operation.
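For reference, the plain-Stata fill described above can be sketched as follows. This is only an illustration of the replace var = var[_n-1] if mi(var) idea applied by group, not the actual implementation of carryforward; the toy data and the variable names group, index, and var are made up for the example.

```stata
clear
input group index var
1 1 5
1 2 .
1 3 .
1 4 7
2 1 .
2 2 3
2 3 .
end

* Sort within each group by index, then fill every missing value from the
* previous row. replace runs row by row, so fills cascade through runs of
* missing values.
bysort group (index): replace var = var[_n-1] if missing(var)

list, sepby(group)
```

A leading missing value in a group (the first row of group 2 here) stays missing because there is nothing earlier in that group to carry forward. The bysort step is the sorting cost discussed in this comment.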


fpet19 commented Dec 1, 2021

Yes, that seems to be the case, I always call it with by. I use gegen to create a group variable for a subset of the group, and then I populate it for the whole group using carryforward. The second command is over 5 times slower than the first.

I assume that whatever magic gtools does for gegen which does not require sorting and then resorting should be useful here. In very long datasets just avoiding having to xtset after gegen-related commands is worth it.

@mcaceresb
Owner

@fpet19 For a while now I've basically had carryforward implemented without realizing it. (I was doing this in some data cleaning and somehow remembered your comment from years ago.) I haven't exactly optimized it, but it's a byproduct of this gstats command:

gstats moving (lastnm . 0) filled = var, by(group)

This gets the last non-missing value from var; the window . 0 looks backwards but includes the current observation, so if the current value is non-missing it doesn't get replaced. I should point out this is still sort-dependent, and I do have my data sorted. However, if your data is unsorted but you have some index, you could do

gstats range (lastnm . 0 index) filled = var, by(group)

which looks at the observations with values <= index instead. Now, gtools only works with numerical data, so I've actually also had occasion to use the original suggestion I had here, but I've found this command particularly useful when combining it with other calls to gstats or when I don't want to replace the original variable. (Oh, and unlike some of my other commands, this is not a copycat, so there may be some differences from carryforward I haven't thought about, e.g. how it handles if and in.)
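Since differences from a plain fill are possible, one way to check on your own data is to build a reference column with built-in Stata and compare it against the gstats result. The sketch below assumes gtools is installed and uses the gstats moving call exactly as quoted in this comment; the toy data and the names expected and filled are made up for illustration, and the outcome of the comparison is not asserted here.

```stata
clear
input group index var
1 1 5
1 2 .
1 3 .
1 4 7
2 1 .
2 2 3
2 3 .
end
sort group index

* Reference column: plain-Stata fill within each group.
generate expected = var
by group: replace expected = expected[_n-1] if missing(expected)

* The command from the comment above (requires gtools).
gstats moving (lastnm . 0) filled = var, by(group)

* Count the rows where the two columns disagree, then inspect them.
count if filled != expected
list, sepby(group)
```

The original var is left untouched, which matches the use case mentioned in this comment of not wanting to replace the source variable.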


3 participants