Consider `.sort = FALSE` for `summarise()`, `reframe()`, and `slice_sample()` #6663

DavisVaughan · 2023-01-26T17:12:12Z

With the introduction of .by, we no longer sort group keys automatically. There are a whole host of good reasons for this as outlined here #5664 (comment), and I am mostly confident this is the right long term default for dplyr.

However, I am empathetic to the fact that users do often like to see their summary results sorted in ascending order. Right now, our recommendation is:

df %>%
  summarise(..., .by = c(a, b, c)) %>%
  arrange(a, b, c) # could also come before `summarise()`

This is nice because you get the full power of arrange() including desc() and .locale.

I think we should consider a .sort argument like:

df %>%
  summarise(..., .by = c(a, b, c), .sort = TRUE)

.sort = FALSE would be the default for reasons mentioned above.
We'd document this as the 100% backwards compatible way to transition from group_by() to .by (even though most of the time the ordering isn't important).
You must accept that you get ascending order and the C locale. That makes it compatible with group_by(). If you need anything fancier, call arrange().
I do like that you won't have to repeat the group names.
Obviously .sort = TRUE errors on unorderable types like clock's year-month-weekday.
This would probably only be an argument for the .data.frame method, as opposed to the generic, because dbplyr probably won't want to enforce a sort order? Uncertain.

Basically, this leaves the idea of a groupby + summarise operation theoretically pure (because it shouldn't require orderable keys), but also gives users a convenient way to optionally opt in to sorted results.

There are 3 functions that would get this argument:

summarise()
reframe()
slice_sample() (goes with slice() and slice_head/tail/min/max() should act like a filter() not a reframe() #6662)

The following would not get .sort because they aren't about row ordering:

filter()
mutate()
slice() and slice_min/max/head/tail() (after slice() and slice_head/tail/min/max() should act like a filter() not a reframe() #6662 is changed)

The text was updated successfully, but these errors were encountered:

ghost · 2023-10-19T20:05:37Z

I would like this feature, currently the only way I can see achieving this type of sorting is by using the deprecated version of the function

df |> summarize(n = sum(num), .by = c(a, b, c)) |> arrange(a, b, c, n, .by_groups = TRUE)

which I think is a less elegant way to write the above suggestion. A new argument .sort = TRUE being the temporary sorting group of the function in question.

JosiahParry · 2024-06-29T16:01:33Z

I would like to note that this caught me by surprise today. I anticipated that group_by() |> summarise() and .by are a 1:1 transition and that if I compared results between two seemingly compatible expressions, that the results would be the same. That is not the case.

Buried deep in the docs is this quote

"Uses ordering by "first appearance" in the original data"

which provides an answer to the quandary.

Earlier in the same doc it says

"Summaries use existing order of group keys"

This doesn't actually make sense to me because we're going from a Many -> 1 so it is not actually possible to keep the order of the keys. Perhaps the "first appearance" portion can be added to this part of the doc?

Repro:

summarise(..., .by = group) and group_by() |> summarise() are not equivalent. This shows that group_by() |> summarise() arranges the results in alphabetical order whereas .by does not.

library(dplyr)

user <- starwars |> 
  select(height, species) |> 
  group_by(species) |> 
  summarise(avg_height = mean(height))

answer <- starwars |> 
  select(height, species) |> 
  summarise(avg_height = mean(height), .by = species)

waldo::compare(user, answer)
#> old vs new
#>                    species avg_height
#> - old[1, ]  Aleena            79.0000
#> + new[1, ]  Human                  NA
#> - old[2, ]  Besalisk         198.0000
#> + new[2, ]  Droid                  NA
#> - old[3, ]  Cerean           198.0000
#> + new[3, ]  Wookiee          231.0000
#> - old[4, ]  Chagrian         196.0000
#> + new[4, ]  Rodian           173.0000
#> - old[5, ]  Clawdite         168.0000
#> + new[5, ]  Hutt             175.0000
#> - old[6, ]  Droid                  NA
#> + new[6, ]  NA               175.0000
#> - old[7, ]  Dug              112.0000
#> + new[7, ]  Yoda's species    66.0000
#> - old[8, ]  Ewok              88.0000
#> + new[8, ]  Trandoshan       190.0000
#> - old[9, ]  Geonosian        183.0000
#> + new[9, ]  Mon Calamari     180.0000
#> - old[10, ] Gungan           208.6667
#> + new[10, ] Ewok              88.0000
#> and 28 more ...
#> 
#>      old$species | new$species                     
#>  [1] "Aleena"    - "Human"          [1]            
#>  [2] "Besalisk"  - "Droid"          [2]            
#>  [3] "Cerean"    - "Wookiee"        [3]            
#>  [4] "Chagrian"  - "Rodian"         [4]            
#>  [5] "Clawdite"  - "Hutt"           [5]            
#>  [6] "Droid"     - NA               [6]            
#>  [7] "Dug"       - "Yoda's species" [7]            
#>  [8] "Ewok"      - "Trandoshan"     [8]            
#>  [9] "Geonosian" - "Mon Calamari"   [9]            
#> [10] "Gungan"    - "Ewok"           [10]           
#>  ... ...           ...              and 28 more ...
#> 
#> `old$avg_height`: 79 198 198 196 168  NA 112  88 183 209 and 28 more...
#> `new$avg_height`: NA  NA 231 173 175 175  66 190 180  88            ...

DavisVaughan added feature a feature request or enhancement grouping 👨‍👩‍👧‍👦 labels Jan 26, 2023

mayer79 mentioned this issue May 7, 2023

.by creates different row order as group_by() #6843

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider `.sort = FALSE` for `summarise()`, `reframe()`, and `slice_sample()` #6663

Consider `.sort = FALSE` for `summarise()`, `reframe()`, and `slice_sample()` #6663

DavisVaughan commented Jan 26, 2023

ghost commented Oct 19, 2023

JosiahParry commented Jun 29, 2024

Consider .sort = FALSE for summarise(), reframe(), and slice_sample() #6663

Consider .sort = FALSE for summarise(), reframe(), and slice_sample() #6663

Comments

DavisVaughan commented Jan 26, 2023

ghost commented Oct 19, 2023

JosiahParry commented Jun 29, 2024

Consider `.sort = FALSE` for `summarise()`, `reframe()`, and `slice_sample()` #6663

Consider `.sort = FALSE` for `summarise()`, `reframe()`, and `slice_sample()` #6663