Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider .sort = FALSE for summarise(), reframe(), and slice_sample() #6663

Open
DavisVaughan opened this issue Jan 26, 2023 · 2 comments
Open
Labels
feature a feature request or enhancement grouping 👨‍👩‍👧‍👦

Comments

@DavisVaughan
Copy link
Member

With the introduction of .by, we no longer sort group keys automatically. There are a whole host of good reasons for this as outlined here #5664 (comment), and I am mostly confident this is the right long term default for dplyr.

However, I am empathetic to the fact that users do often like to see their summary results sorted in ascending order. Right now, our recommendation is:

df %>%
  summarise(..., .by = c(a, b, c)) %>%
  arrange(a, b, c) # could also come before `summarise()`

This is nice because you get the full power of arrange() including desc() and .locale.

I think we should consider a .sort argument like:

df %>%
  summarise(..., .by = c(a, b, c), .sort = TRUE)
  • .sort = FALSE would be the default for reasons mentioned above.
  • We'd document this as the 100% backwards compatible way to transition from group_by() to .by (even though most of the time the ordering isn't important).
  • You must accept that you get ascending order and the C locale. That makes it compatible with group_by(). If you need anything fancier, call arrange().
  • I do like that you won't have to repeat the group names.
  • Obviously .sort = TRUE errors on unorderable types like clock's year-month-weekday.
  • This would probably only be an argument for the .data.frame method, as opposed to the generic, because dbplyr probably won't want to enforce a sort order? Uncertain.

Basically, this leaves the idea of a groupby + summarise operation theoretically pure (because it shouldn't require orderable keys), but also gives users a convenient way to optionally opt in to sorted results.


There are 3 functions that would get this argument:

The following would not get .sort because they aren't about row ordering:

@ghost
Copy link

ghost commented Oct 19, 2023

I would like this feature, currently the only way I can see achieving this type of sorting is by using the deprecated version of the function

df |> summarize(n = sum(num), .by = c(a, b, c)) |> arrange(a, b, c, n, .by_groups = TRUE)

which I think is a less elegant way to write the above suggestion. A new argument .sort = TRUE being the temporary sorting group of the function in question.

@JosiahParry
Copy link

I would like to note that this caught me by surprise today. I anticipated that group_by() |> summarise() and .by are a 1:1 transition and that if I compared results between two seemingly compatible expressions, that the results would be the same. That is not the case.

Buried deep in the docs is this quote

"Uses ordering by "first appearance" in the original data"

which provides an answer to the quandary.

Earlier in the same doc it says

"Summaries use existing order of group keys"

This doesn't actually make sense to me because we're going from a Many -> 1 so it is not actually possible to keep the order of the keys. Perhaps the "first appearance" portion can be added to this part of the doc?

Repro:

summarise(..., .by = group) and group_by() |> summarise() are not equivalent. This shows that group_by() |> summarise() arranges the results in alphabetical order whereas .by does not.

library(dplyr)

user <- starwars |> 
  select(height, species) |> 
  group_by(species) |> 
  summarise(avg_height = mean(height))

answer <- starwars |> 
  select(height, species) |> 
  summarise(avg_height = mean(height), .by = species)

waldo::compare(user, answer)
#> old vs new
#>                    species avg_height
#> - old[1, ]  Aleena            79.0000
#> + new[1, ]  Human                  NA
#> - old[2, ]  Besalisk         198.0000
#> + new[2, ]  Droid                  NA
#> - old[3, ]  Cerean           198.0000
#> + new[3, ]  Wookiee          231.0000
#> - old[4, ]  Chagrian         196.0000
#> + new[4, ]  Rodian           173.0000
#> - old[5, ]  Clawdite         168.0000
#> + new[5, ]  Hutt             175.0000
#> - old[6, ]  Droid                  NA
#> + new[6, ]  NA               175.0000
#> - old[7, ]  Dug              112.0000
#> + new[7, ]  Yoda's species    66.0000
#> - old[8, ]  Ewok              88.0000
#> + new[8, ]  Trandoshan       190.0000
#> - old[9, ]  Geonosian        183.0000
#> + new[9, ]  Mon Calamari     180.0000
#> - old[10, ] Gungan           208.6667
#> + new[10, ] Ewok              88.0000
#> and 28 more ...
#> 
#>      old$species | new$species                     
#>  [1] "Aleena"    - "Human"          [1]            
#>  [2] "Besalisk"  - "Droid"          [2]            
#>  [3] "Cerean"    - "Wookiee"        [3]            
#>  [4] "Chagrian"  - "Rodian"         [4]            
#>  [5] "Clawdite"  - "Hutt"           [5]            
#>  [6] "Droid"     - NA               [6]            
#>  [7] "Dug"       - "Yoda's species" [7]            
#>  [8] "Ewok"      - "Trandoshan"     [8]            
#>  [9] "Geonosian" - "Mon Calamari"   [9]            
#> [10] "Gungan"    - "Ewok"           [10]           
#>  ... ...           ...              and 28 more ...
#> 
#> `old$avg_height`: 79 198 198 196 168  NA 112  88 183 209 and 28 more...
#> `new$avg_height`: NA  NA 231 173 175 175  66 190 180  88            ...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement grouping 👨‍👩‍👧‍👦
Projects
None yet
Development

No branches or pull requests

2 participants