Separate clean cols #8

Closed
wants to merge 4 commits into from

zkamvar commented Jan 11, 2019

This is in accordance with @dirkschumacher's suggestion in #1. Also, I found the comment() function, which seems really useful for this task :)

What it does:

  • clean_* functions no longer overwrite data explicitly. They now create *_clean columns that also contain a "<linelist>clean" comment to prevent other clean_* functions from overwriting them.
  • when no changes occur, a *_clean column is not created and no label is applied
  • reduces the number of operations by only acting on the unclean columns
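
As a rough illustration of the behaviour listed above (not the code in this PR), the labelling could be sketched with base R's `comment()`. The helper name `add_clean_column()` and its arguments are hypothetical; only `comment()` and the `"<linelist>clean"` label come from the description.

```r
# Hypothetical helper, NOT the linelist implementation: apply a cleaning
# function `f` to column `col` of `df` and store the result in `<col>_clean`.
add_clean_column <- function(df, col, f, label = "<linelist>clean") {
  original <- df[[col]]
  cleaned  <- f(original)
  # When no changes occur, no *_clean column is created and no label is applied.
  if (identical(cleaned, original)) {
    return(df)
  }
  comment(cleaned) <- label                # base R: sets the "comment" attribute
  df[[paste0(col, "_clean")]] <- cleaned   # the original column is left untouched
  df
}

df <- data.frame(gender = c("male", "FEMALE", "Male"),
                 id     = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
df <- add_clean_column(df, "gender", tolower)  # changes values: adds gender_clean
df <- add_clean_column(df, "id", tolower)      # already lower case: nothing added
names(df)
comment(df$gender_clean)
```

In this sketch only `gender` gains a `gender_clean` companion, and that new column is the only one carrying the label.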

@thibautjombart, @dirkschumacher, what do you think?

This scheme prevents previously cleaned variables from being "cleaned"
twice (e.g. the messy_dates being converted back to messy character
data).
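
One way the "don't clean twice" guard might look, assuming the label is checked through `comment()` before a column is processed. `is_clean()` is a hypothetical name used for illustration and is not taken from the linelist source.

```r
# Hypothetical check, NOT taken from the linelist source: a column counts as
# already cleaned when its comment attribute equals the label.
is_clean <- function(x, label = "<linelist>clean") {
  identical(comment(x), label)
}

# A column that has already been converted to Date and labelled.
cleaned_dates <- as.Date(c("2018-10-18", "1989-12-24"))
comment(cleaned_dates) <- "<linelist>clean"

df <- data.frame(messy_dates = c("2018 10 18", "// 24//12//1989"),
                 stringsAsFactors = FALSE)
df$messy_dates_clean <- cleaned_dates

# Only the columns without the label would be handed to later clean_* steps.
unclean <- names(df)[!vapply(df, is_clean, logical(1))]
unclean
```

Here `unclean` holds only `"messy_dates"`, so a later step would leave the labelled Date column alone instead of turning it back into character data.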

zkamvar commented Jan 18, 2019

To clarify, this is what the process looks like:

library("linelist")

onsets <- as.Date("2018-01-01") + sample(1:10, 20, replace = TRUE)
discharge <- format(as.Date(onsets) + 10, "%d/%m/%Y")
genders <- c("male", "female", "FEMALE", "Male", "Female", "MALE")
gender <- sample(genders, 20, replace = TRUE)
case_types <- c("confirmed", "probable", "suspected", "not a case",
                "Confirmed", "PROBABLE", "suspected  ", "Not.a.Case")
messy_dates <- sample(
                 c("01-12-2001", "male", "female", "2018-10-18", "2018_10_17",
                   "2018 10 19", "// 24//12//1989", NA, "that's 24/12/1989!"),
                 20, replace = TRUE)
case <- factor(sample(case_types, 20, replace = TRUE))
toy_data <- data.frame("Date of Onset." = onsets,
                       "DisCharge.." = discharge,
                       "GENDER_ " = gender,
                       "Épi.Case_définition" = case,
                       "messy/dates" = messy_dates)
head(toy_data)
#>   Date.of.Onset. DisCharge.. GENDER_. Épi.Case_définition
#> 1     2018-01-08  18/01/2018   FEMALE            PROBABLE
#> 2     2018-01-06  16/01/2018   female          Not.a.Case
#> 3     2018-01-04  14/01/2018     male           Confirmed
#> 4     2018-01-04  14/01/2018   female           Confirmed
#> 5     2018-01-08  18/01/2018   FEMALE           confirmed
#> 6     2018-01-05  15/01/2018   female         suspected  
#>          messy.dates
#> 1         2018-10-18
#> 2    // 24//12//1989
#> 3         2018 10 19
#> 4 that's 24/12/1989!
#> 5 that's 24/12/1989!
#> 6             female
head(cd <- clean_data(toy_data))
#>   date_of_onset  discharge gender epi_case_definition        messy_dates
#> 1    2018-01-08 18/01/2018 FEMALE            PROBABLE         2018-10-18
#> 2    2018-01-06 16/01/2018 female          Not.a.Case    // 24//12//1989
#> 3    2018-01-04 14/01/2018   male           Confirmed         2018 10 19
#> 4    2018-01-04 14/01/2018 female           Confirmed that's 24/12/1989!
#> 5    2018-01-08 18/01/2018 FEMALE           confirmed that's 24/12/1989!
#> 6    2018-01-05 15/01/2018 female         suspected               female
#>   discharge_clean messy_dates_clean gender_clean epi_case_definition_clean
#> 1      2018-01-18        2018-10-18       female                  probable
#> 2      2018-01-16        1989-12-24       female                not_a_case
#> 3      2018-01-14        2018-10-19         male                 confirmed
#> 4      2018-01-14        1989-12-24       female                 confirmed
#> 5      2018-01-18        1989-12-24       female                 confirmed
#> 6      2018-01-15              <NA>       female                 suspected
lapply(cd, comment)
#> $date_of_onset
#> NULL
#> 
#> $discharge
#> NULL
#> 
#> $gender
#> NULL
#> 
#> $epi_case_definition
#> NULL
#> 
#> $messy_dates
#> NULL
#> 
#> $discharge_clean
#> [1] "<linelist>clean"
#> 
#> $messy_dates_clean
#> [1] "<linelist>clean"
#> 
#> $gender_clean
#> [1] "<linelist>clean"
#> 
#> $epi_case_definition_clean
#> [1] "<linelist>clean"

Created on 2019-01-18 by the reprex package (v0.2.1)

zkamvar commented Feb 21, 2019

I'm going to close this PR since it is woefully out of date and will not be merged.

zkamvar closed this Feb 21, 2019
zkamvar deleted the separate-clean-cols branch March 1, 2019 15:40