[traits.build adding studies functions] Refinements to automated substitutions #21

Open
yangsophieee opened this issue Jul 13, 2023 · 4 comments
Assignees
Labels
enhancement New feature or request

Comments

@yangsophieee
Collaborator

Arising from #613 in austraits.build via @ehwenk:

There are certain circumstances where the automated substitutions code (process.R, line 971) currently requires long lists of substitutions - but maybe could be refined...

Because it only matches entire strings, a value that combines several categorical terms needs its own substitution for every combination containing the term to be changed. For instance, to change procumbent to prostrate, you'd only have to replace the term 6 times through some variant of str_replace, but you'd have to add 97 different substitutions.

From growth_form branch:

> austraits$traits %>%
+   filter(trait_name == "stem_growth_habit") %>% filter(value == "procumbent") %>% distinct(dataset_id,value)
# A tibble: 6 × 2
  dataset_id         value     
  <chr>              <chr>     
1 Flora_Florabase    procumbent
2 Flora_NT           procumbent
3 Flora_of_Australia procumbent
4 Flora_PlantNet     procumbent
5 Flora_SA           procumbent
6 Flora_VicFlora     procumbent
> austraits$traits %>%
+   filter(trait_name == "stem_growth_habit") %>% filter(str_detect(value, "procumbent")) %>% distinct(dataset_id,value)
# A tibble: 97 × 2
   dataset_id      value                             
   <chr>           <chr>                             
 1 Flora_Florabase procumbent scrambling             
 2 Flora_Florabase procumbent spreading              
 3 Flora_Florabase compact erect procumbent sprawling
 4 Flora_Florabase bushy erect procumbent            
 5 Flora_Florabase bushy procumbent spreading        
 6 Flora_Florabase erect procumbent spreading        
 7 Flora_Florabase erect procumbent                  
 8 Flora_Florabase procumbent prostrate              
 9 Flora_Florabase procumbent                        
10 Flora_Florabase decumbent procumbent prostrate    
# … with 87 more rows
# ℹ Use `print(n = ...)` to see more rows

This gets even harder to fix when the words are entered into the data.csv file in non-alphabetical order, because the output is alphabetical and it is tedious to look up each term in the data.csv file to figure out why the substitution isn't "working".

Could the code be rewritten to replace all instances of a term, rather than an exact string match?
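
As a rough illustration of the difference being asked about, here is a minimal sketch in base R. It is not the traits.build code: `replace_term` is a hypothetical helper, and dropping duplicate terms and re-sorting alphabetically after the swap are assumptions made for the example.

```r
# Hypothetical helper (not part of traits.build): replace one term inside
# space-delimited trait values instead of matching the whole string.
replace_term <- function(values, find, replacement) {
  vapply(
    strsplit(values, " ", fixed = TRUE),
    function(tokens) {
      tokens[tokens == find] <- replacement
      # Assumption: drop duplicates created by the swap and keep terms sorted
      paste(sort(unique(tokens)), collapse = " ")
    },
    character(1)
  )
}

values <- c("procumbent", "erect procumbent spreading", "procumbent prostrate")

# Exact matching only changes the value that equals "procumbent" in full
exact <- ifelse(values == "procumbent", "prostrate", values)
# "prostrate" "erect procumbent spreading" "procumbent prostrate"

# Term-wise replacement changes every value that contains the term
by_term <- replace_term(values, "procumbent", "prostrate")
# "prostrate" "erect prostrate spreading" "prostrate"
```

With term-wise replacement a single rule covers every combination, which is what would collapse the 97 substitutions above into one.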

(I also occasionally struggle with capital letters in the input causing substitutions to fail, but this shouldn't be a problem, should it?)

@ehwenk
Collaborator

ehwenk commented Aug 29, 2023

@dfalster - I think this would be a quite useful change to make to the add_substitutions functions. For datasets where many of the observations are multiple space-delimited trait values I now mostly bypass the substitutions section of the metadata file and use custom_R_code to avoid the long, long list of substitutions. Is there a reason not to do this?

@dfalster
Member

dfalster commented Aug 29, 2023

Sounds good, no reason not to do it. Relevant code is here.

https://github.com/traitecoevo/traits.build/blob/develop/R/process.R#L1177

We can swap out the code requiring an exact match for a call to str_replace_all.

The only possible complications I can imagine are where you have codes that could be confused with other things, e.g. P for perennial: you might not want to replace all instances of P. So we'd need to include delimiters in the search, i.e. search for " P " (with its surrounding delimiters), not "P".
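
A small sketch of the delimiter concern, using base R `gsub` as a stand-in for `str_replace_all`. The example values and the choice of a single space as the only delimiter are assumptions for illustration.

```r
# Illustrative values only; "P" is the short code for perennial
values <- c("P", "A P", "AP", "P shrub", "Perennial")

# Naive replacement touches every "P", including those inside other terms
naive <- gsub("P", "perennial", values, fixed = TRUE)
# "perennial" "A perennial" "Aperennial" "perennial shrub" "perennialerennial"

# Delimited replacement: match "P" only when it is bounded on each side by a
# space or by the start/end of the value (lookarounds need perl = TRUE)
delimited <- gsub("(?<![^ ])P(?![^ ])", "perennial", values, perl = TRUE)
# "perennial" "A perennial" "AP" "perennial shrub" "Perennial"
```

Lookarounds are used rather than pasting literal spaces around the code so that a code at the very start or end of a value is still matched.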

ehwenk added a commit that referenced this issue Aug 29, 2023
Have substitutions process work on individual strings, not entire values.

For austraits this will require a bit of playing around with substitutions, but is well worth it for the long-term gains/simplicity.

solves issue #21
@dfalster dfalster added the enhancement New feature or request label Aug 30, 2023
@ehwenk ehwenk self-assigned this Aug 30, 2023
@ehwenk
Collaborator

ehwenk commented Aug 30, 2023

I thought this was a good idea and it does greatly reduce the number of substitutions required for some studies -- or eliminate the use of str_replace in custom_R_code, but as I've been running through all of AusTraits a few problems are emerging:

  • special characters (i.e. all punctuation) are not being recognised for replacement, so those substitutions aren't working
  • there are of course substitutions where the entire string needs to be recognised, such as mapping "small shrub" to "subshrub". I can re-sort the substitutions to have "complete cells" replaced before individual words, but that is clunky. So maybe I should work out if there is some way to indicate "whole cells" vs "words" (quotes?).
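
Both problems can be reproduced in a small base R sketch. The cause of the punctuation failures is not stated in this thread; the sketch assumes it is punctuation being read as regex metacharacters. All three helpers are hypothetical, and the "small" to "short" rule is invented purely to show the ordering effect.

```r
# Hypothetical helper: escape regex metacharacters so punctuation is literal
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\]|\\\\)", "\\\\\\1", x, perl = TRUE)
}

# Hypothetical helper: replace `find` only where it is bounded by spaces
# or by the start/end of the value
replace_word <- function(values, find, replacement) {
  pattern <- paste0("(?<![^ ])", escape_regex(find), "(?![^ ])")
  gsub(pattern, replacement, values, perl = TRUE)
}

# Hypothetical helper: replace only values that equal `find` in full
replace_cell <- function(values, find, replacement) {
  values[values == find] <- replacement
  values
}

values <- c("small shrub", "small tree", "tree (small)")

# Punctuation: "(small)" is matched literally once it has been escaped
punct <- replace_word(values, "(small)", "small")
# "small shrub" "small tree" "tree small"

# Whole cells first, then words: the "small shrub" rule still sees its input
cells_first <- replace_word(
  replace_cell(values, "small shrub", "subshrub"), "small", "short"
)
# "subshrub" "short tree" "tree (small)"

# Words first: "small shrub" is already "short shrub", so the cell rule misses
words_first <- replace_cell(
  replace_word(values, "small", "short"), "small shrub", "subshrub"
)
# "short shrub" "short tree" "tree (small)"
```

The last two results differ only in the first value, which is why the order of the substitutions list starts to matter once word-level replacement is introduced.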

@ehwenk
Collaborator

ehwenk commented Sep 6, 2023

See commit fbe21b7 for code

I think a solution would be an if/else branch: substitution cells that begin with a hook such as "^w " would indicate that the word-replacing code should be used for whatever comes after the hook, and all other cells would keep the whole-cell match.
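
A minimal sketch of how such a hook could be dispatched, in base R. This is only an illustration of the proposal, not the code in commit fbe21b7: `apply_substitution` is a hypothetical function, and treating a single space as the word delimiter is an assumption.

```r
# Hypothetical dispatcher: a `find` cell starting with the "^w " hook is
# applied word by word; any other cell must match the whole value.
apply_substitution <- function(values, find, replacement) {
  if (startsWith(find, "^w ")) {
    term <- substring(find, 4)  # drop the 3-character "^w " hook
    vapply(
      strsplit(values, " ", fixed = TRUE),
      function(tokens) {
        tokens[tokens == term] <- replacement
        paste(tokens, collapse = " ")
      },
      character(1)
    )
  } else {
    ifelse(values == find, replacement, values)
  }
}

values <- c("small shrub", "procumbent", "erect procumbent")

# Word-level rule, flagged by the hook
by_word <- apply_substitution(values, "^w procumbent", "prostrate")
# "small shrub" "prostrate" "erect prostrate"

# Whole-cell rule, no hook
by_cell <- apply_substitution(values, "small shrub", "subshrub")
# "subshrub" "procumbent" "erect procumbent"
```

Because each rule declares its own mode, whole-cell and word-level substitutions no longer depend on the order in which they are listed.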

@ehwenk ehwenk added the on-hold label Sep 19, 2023
@ehwenk ehwenk changed the title from "Refinements to automated substitutions" to "[minor enhancements] Refinements to automated substitutions" Jul 31, 2024
@ehwenk ehwenk changed the title from "[minor enhancements] Refinements to automated substitutions" to "[traits.build adding studies functions] Refinements to automated substitutions" Jul 31, 2024
@ehwenk ehwenk removed the on hold label Jul 31, 2024
@ehwenk ehwenk added this to AusTraits Jul 31, 2024
@ehwenk ehwenk moved this to Backlog in AusTraits Jul 31, 2024
Projects
Status: Backlog
Development

No branches or pull requests

3 participants