Separate clean cols #8

zkamvar · 2019-01-11T09:01:05Z

This is in accordance with @dirkschumacher's suggestion in #1. Also, I found the `comment()` function, which seems really useful for this task :)

What it does:

- `clean_*` functions no longer overwrite data explicitly. They now create `*_clean` columns that also carry a `"<linelist>clean"` comment to prevent other `clean_*` functions from overwriting them.
- When no changes occur, a `*_clean` column is not created and no label is applied.
- Reduces the number of operations by acting only on the unclean columns.
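For readers unfamiliar with `comment()`, the idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `clean_tag`, `is_tagged_clean()`, and `clean_text_cols()` are hypothetical names, and the "cleaning" step is just lowercasing and trimming character columns.

```r
# Hypothetical sketch of the tagging scheme; not the linelist implementation.
clean_tag <- "<linelist>clean"

# TRUE when a column already carries the clean tag in its comment attribute.
is_tagged_clean <- function(x) {
  identical(comment(x), clean_tag)
}

# Lowercase and trim character columns, writing results to *_clean columns.
clean_text_cols <- function(df) {
  for (nm in names(df)) {
    col <- df[[nm]]
    # Skip non-character columns and columns already marked as clean.
    if (!is.character(col) || is_tagged_clean(col)) next
    cleaned <- tolower(trimws(col))
    # Nothing changed: create no new column and apply no tag.
    if (identical(cleaned, col)) next
    comment(cleaned) <- clean_tag
    df[[paste0(nm, "_clean")]] <- cleaned
  }
  df
}

df <- data.frame(gender = c("male", "FEMALE ", "Female"),
                 site   = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

out  <- clean_text_cols(df)
out2 <- clean_text_cols(out)  # second pass leaves gender_clean alone

names(out)
names(out2)
```

In this sketch, `site` gets no `site_clean` column because nothing changed, and a second pass does not produce `gender_clean_clean` because the tagged column is skipped.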

@thibautjombart, @dirkschumacher, what do you think?

This scheme prevents previously cleaned variables from being "cleaned" twice (e.g. the messy_dates being converted back to messy character data).

zkamvar · 2019-01-18T05:27:30Z

To clarify, this is what the process looks like:

library("linelist")

onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected  ", "Not.a.Case")
messy_dates <- sample(
                 c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
                   "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
                 20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "GENDER_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)
head(toy_data)
#>   Date.of.Onset. DisCharge.. GENDER_. Épi.Case_définition
#> 1     2018-01-08  18/01/2018   FEMALE            PROBABLE
#> 2     2018-01-06  16/01/2018   female          Not.a.Case
#> 3     2018-01-04  14/01/2018     male           Confirmed
#> 4     2018-01-04  14/01/2018   female           Confirmed
#> 5     2018-01-08  18/01/2018   FEMALE           confirmed
#> 6     2018-01-05  15/01/2018   female         suspected  
#>          messy.dates
#> 1         2018-10-18
#> 2    // 24//12//1989
#> 3         2018 10 19
#> 4 that's 24/12/1989!
#> 5 that's 24/12/1989!
#> 6             female
head(cd <- clean_data(toy_data))
#>   date_of_onset  discharge gender epi_case_definition        messy_dates
#> 1    2018-01-08 18/01/2018 FEMALE            PROBABLE         2018-10-18
#> 2    2018-01-06 16/01/2018 female          Not.a.Case    // 24//12//1989
#> 3    2018-01-04 14/01/2018   male           Confirmed         2018 10 19
#> 4    2018-01-04 14/01/2018 female           Confirmed that's 24/12/1989!
#> 5    2018-01-08 18/01/2018 FEMALE           confirmed that's 24/12/1989!
#> 6    2018-01-05 15/01/2018 female         suspected               female
#>   discharge_clean messy_dates_clean gender_clean epi_case_definition_clean
#> 1      2018-01-18        2018-10-18       female                  probable
#> 2      2018-01-16        1989-12-24       female                not_a_case
#> 3      2018-01-14        2018-10-19         male                 confirmed
#> 4      2018-01-14        1989-12-24       female                 confirmed
#> 5      2018-01-18        1989-12-24       female                 confirmed
#> 6      2018-01-15              <NA>       female                 suspected
lapply(cd, comment)
#> $date_of_onset
#> NULL
#> 
#> $discharge
#> NULL
#> 
#> $gender
#> NULL
#> 
#> $epi_case_definition
#> NULL
#> 
#> $messy_dates
#> NULL
#> 
#> $discharge_clean
#> [1] "<linelist>clean"
#> 
#> $messy_dates_clean
#> [1] "<linelist>clean"
#> 
#> $gender_clean
#> [1] "<linelist>clean"
#> 
#> $epi_case_definition_clean
#> [1] "<linelist>clean"

Created on 2019-01-18 by the reprex package (v0.2.1)
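Since the tag lives in the `comment` attribute, the cleaned columns can also be picked out programmatically. A small self-contained sketch, assuming the same `"<linelist>clean"` tag; the data frame here is built by hand rather than by `clean_data()`:

```r
# Hand-built example; the tag value is taken from the output above.
clean_tag <- "<linelist>clean"

df <- data.frame(discharge = c("18/01/2018", "16/01/2018"),
                 stringsAsFactors = FALSE)

# Parse the dates and tag the result the way a clean_* function would.
discharge_clean <- as.Date(df$discharge, format = "%d/%m/%Y")
comment(discharge_clean) <- clean_tag
df$discharge_clean <- discharge_clean

# Logical vector marking which columns carry the clean tag.
is_clean <- vapply(df, function(x) identical(comment(x), clean_tag), logical(1))

clean_cols <- names(df)[is_clean]
clean_cols
```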

zkamvar · 2019-02-21T05:02:32Z

I'm going to close this PR since it is woefully out of date and will not be merged.

zkamvar added 4 commits January 11, 2019 15:54

modify clean_ fns to write to separate columns

dcfac75

This is in accordance with @dirkschumacher's suggestion in #1. Also, I found the `comment()` function, which seems really useful for this task :)

add more stable checks for comments

34fc4dc

This scheme prevents previously cleaned variables from being "cleaned" twice (e.g. the messy_dates being converted back to messy character data).

better preserve user comments if they exist

d8b4b8b

update some documentation

99259ae

zkamvar requested review from thibautjombart and dirkschumacher January 16, 2019 06:47

zkamvar closed this Feb 21, 2019

zkamvar deleted the separate-clean-cols branch March 1, 2019 15:40
