Propose fill args in dt_unnest() #25

Open
leungi opened this issue Apr 5, 2020 · 6 comments · Fixed by #30

Comments

@leungi

leungi commented Apr 5, 2020

Reprex and proposal below.

library(tidyfast)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# |- data ----
dat <- structure(
  list(
    id = c("11", "22"),
    phase = c("a", "b"),
    values = list(
      structure(
        list(
          a = 0.0584563566053344,
          b = 192,
          c = "50%",
          d = 1,
          e = 0,
          f = 0,
          g = 0
        ),
        row.names = c(NA, -1L),
        class = c("tbl_df",
                  "tbl", "data.frame")
      ),
      structure(
        list(
          c = "50%",
          d = 465L,
          e = 0,
          g = 290514.430137519,
          b = 10961.9288476965,
          a = 0.359973896295374,
          h = 1.46588348984196,
          f = 119.108387941727
        ),
        row.names = c(NA,
                      -1L),
        class = c("tbl_df", "tbl", "data.frame")
      )
    )
  ),
  row.names = c(NA,
                -2L),
  class = c("tbl_df", "tbl", "data.frame")
)

# |- current ----
dat %>% 
  tidyfast::dt_unnest(values)
#> Error in rbindlist(eval(col)): Item 2 has 8 columns, inconsistent with item 1 which has 7 columns. To fill missing columns use fill=TRUE.

# |- proposed ----
dt_unnest.default_edit <- function(dt_, col, fill = FALSE, ...){
  if (isFALSE(data.table::is.data.table(dt_)))
    dt_ <- data.table::as.data.table(dt_)
  
  col    <- substitute(col)
  keep   <- substitute(alist(...))
  print(keep)
  names  <- colnames(dt_)
  others <- names[-match(paste(col), names)]
  rows   <- sapply(dt_[[paste(col)]], NROW)
  
  if (length(keep) > 1)
    others <- others[others %in% paste(keep)[-1]]
  
  others_dt <- dt_[, ..others]
  classes   <- sapply(others_dt, typeof)
  keep      <- names(classes)[classes != "list"]
  others_dt <- others_dt[, ..keep]
  others_dt <- lapply(others_dt, rep, times = rows)
  
  dt_[, list(data.table::as.data.table(others_dt),
             data.table::rbindlist(eval(col),
                                   fill = fill))]
}

dat %>% 
  dt_unnest.default_edit(values, fill = TRUE)
#> alist()
#>    id phase          a        b   c   d e        f        g        h
#> 1: 11     a 0.05845636   192.00 50%   1 0   0.0000      0.0       NA
#> 2: 22     b 0.35997390 10961.93 50% 465 0 119.1084 290514.4 1.465883

Created on 2020-04-05 by the reprex package (v0.3.0)
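
For readers unfamiliar with the underlying call, here is a minimal sketch of what `fill = TRUE` changes in `data.table::rbindlist()`, which the proposed function forwards to. The toy tables `x` and `y` are made up for illustration and are not from the reprex above.

```r
library(data.table)

# Two one-row tables whose columns differ: `y` has an extra column `h`
# and lists `a` and `b` in a different order.
x <- data.table(a = 1, b = 2)
y <- data.table(b = 3, a = 4, h = 5)

# Without fill, rbindlist() stops because the column counts differ.
res_err <- tryCatch(rbindlist(list(x, y)), error = function(e) "error")

# With fill = TRUE, columns are matched by name and gaps become NA.
res_fill <- rbindlist(list(x, y), fill = TRUE)
res_fill
```

Because columns are matched by name when filling, the differing column order in the second nested tibble of the reprex is not a problem.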

@TysonStanley
Owner

I like this idea! It's definitely a natural extension. If you want, feel free to open a pull request with this and I'll merge it and add you to the contributor list.

@markfairbanks
Collaborator

markfairbanks commented Aug 18, 2020

@TysonStanley This feature no longer works with the new version of dt_unnest(). Should this be reopened?

pacman::p_load(tidyfast, data.table, magrittr)

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

nested_df %>%
  dt_unnest(list_col)
#> Error in `[.data.table`(dt_, , eval(col)[[1L]], by = others): j doesn't evaluate to the same number of columns for each group

@TysonStanley
Owner

That is interesting... That was one advantage of using rbindlist(), but if possible I really want to keep the [[ approach. Any ideas?

TysonStanley reopened this Aug 18, 2020

@markfairbanks
Collaborator

markfairbanks commented Aug 18, 2020

Maybe extract the list column and check if the nested data.tables have a consistent number of columns?

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)

test_list <- list(df1, df2)

if (length(unique(lengths(test_list))) > 1) {
  "rbindlist code"
} else {
  "[[1]] code"
}
#> [1] "rbindlist code"
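
One caveat with comparing `lengths()` alone is that two nested tables can have the same number of columns but different names. A slightly stricter check, sketched below, compares the column names themselves. `same_columns()` is a hypothetical helper written for this thread, not part of tidyfast or data.table.

```r
library(data.table)

# Hypothetical helper: TRUE when every nested table has exactly the same
# column names in the same order, FALSE otherwise.
same_columns <- function(tbls) {
  length(unique(lapply(tbls, names))) == 1L
}

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)

mixed   <- same_columns(list(df1, df2))  # columns differ
uniform <- same_columns(list(df1, df1))  # columns match

if (!mixed) "rbindlist code" else "[[1]] code"
```

This still has to touch every element of the list column, so it does not avoid the cost concern for very large data.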

@TysonStanley
Owner

Yeah, I was thinking something similar. I can't find anything about the [[ approach in data.table that we could change. The issue with this approach is the additional cost of getting the lengths, especially with really large data... I wonder how often this comes up. @leungi is this something you encounter a lot?

@leungi
Author

leungi commented Aug 21, 2020

@TysonStanley @markfairbanks: thanks for bringing this up again.

I do encounter this quite often, as a result of a map_*() workflow for parsing large volumes of messy semi-tabular data where column names and column counts vary. Being able to bind everything and then remove non-informative columns based on the amount of parsed data (post-binding) has been very effective.
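
A rough sketch of that bind-then-prune workflow, using data.table directly. The `parsed` list and the 0.5 coverage threshold are invented for illustration; the right cutoff would depend on the data.

```r
library(data.table)

# Stand-in for the output of a map_*() parsing step: tables whose
# columns vary from one element to the next.
parsed <- list(
  data.table(a = 1, b = 2),
  data.table(a = 3, b = 4, h = 5),
  data.table(a = 6, b = 7)
)

# Bind everything, filling columns that are absent in some tables with NA.
bound <- rbindlist(parsed, fill = TRUE)

# Share of non-missing values in each column after binding.
coverage <- vapply(bound, function(x) mean(!is.na(x)), numeric(1))

# Assumed threshold: keep columns populated in at least half of the rows.
keep_cols <- names(coverage)[coverage >= 0.5]
trimmed   <- bound[, keep_cols, with = FALSE]
```

Here `h` is present in only one of three rows, so it is dropped while `a` and `b` are kept.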
