Typos in correspondence with GIFT #28

Open
Rekyt opened this issue May 24, 2024 · 3 comments

Rekyt commented May 24, 2024

Hi @ehwenk & @dfalster,

While building the trait correspondence network, I noticed some issues in the correspondence with GIFT traits (I still have to do the same for BIEN and TRY).
I checked that the GIFT trait codes and trait names provided by AusTraits exist in GIFT, and that each provided GIFT trait name matches its provided GIFT trait code.

The script I used is below. But I'll first detail my findings.

  1. For trait_0030020 & trait_0030015 the GIFT_close contains multiple traits as a single line. Is it on purpose? Because other matched traits span multiple lines.
  2. For trait_0030215, there is a typo in the GIFT_exact name: it is written "Fuiting time", missing an "r".
  3. For trait_0030020, there is a typo in the GIFT code 'leaf_thorns_1 [GIFT:4:14.1]' which should be 'leaf_thorns_1 [GIFT:4.14.1]'.
  4. Several GIFT trait names are written following AusTraits' convention rather than GIFT's, e.g. 'seed_height' instead of 'Seed height'.
  5. Capitalization of trait names doesn't follow GIFT's names: APD tends to use snake_case while GIFT uses Camel_snake_case. For example, GIFT's name referenced in APD is 'flower_colour' [APD:trait_0012417] while in GIFT the trait is 'Flower_colour' [GIFT:3.21.1].
  6. There is an error in the GIFT match for trait_0030060: GIFT_close matches with GIFT 1.4.1 (Climber_1) while it should match with GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, as the correspondence network contained a much larger connected component than expected, linking traits that shouldn't match.
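
A malformed code like the one in finding 3 (a colon in place of a dot) could be caught by a format check before any lookup against GIFT. Below is a minimal base R sketch. It assumes that valid GIFT codes are dot-separated integers with two or three parts, as in the examples above, and the function name is hypothetical.

```r
# Hypothetical helper: flag GIFT codes that are not dot-separated integers.
# Assumption: valid codes have 2 or 3 numeric parts (e.g. "1.10" or "4.14.1").
is_valid_gift_code = function(code) {
  grepl("^[0-9]+(\\.[0-9]+){1,2}$", code)
}

codes = c("4:14.1", "4.14.1", "3.21.1", "1.10")

# Codes that fail the format check and need a manual look
codes[!is_valid_gift_code(codes)]
```

This only checks the shape of the code; whether a well-formed code actually exists in GIFT still needs the join against the GIFT metadata.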

Maybe you could use an adaptation of the script below to perform semi-automated quality checks when updating the APD?

For the sake of completeness, I'll try performing the same checks for TRY and BIEN.

Matching script
library("dplyr")

gift_trait_meta = GIFT::GIFT_traits_meta()

apd_gift_detailed = tibble::as_tibble(read.csv("APD_traits_input.csv")) |>
  select(identifier:label, starts_with("GIFT")) |>
  rename(trait_id = identifier) |>
  tidyr::pivot_longer(
    starts_with("GIFT"), names_to = "match_type", values_to = "matched_trait"
  ) |>
  filter(matched_trait != "") |>
  mutate(
    # Split for traits that have multiple matches on one line
    split_traits = purrr::map(stringr::str_split(matched_trait, ";"), trimws),
    # Extract GIFT trait name
    extracted_trait = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "^(.*)\\s\\[", group = 1)
    ),
    # Extract GIFT trait code
    extracted_code = purrr::map(
      split_traits, \(x) stringr::str_extract(x, "\\[GIFT:(.+)\\]", group = 1)
    ),
    # Get level
    gift_lvl = purrr::map(
      extracted_code, \(x) stringr::str_count(x, stringr::fixed(".")) + 1L
    )
  ) |>
  # Put everything in a tidy format
  tidyr::unnest(split_traits:gift_lvl)

## Level 2 traits
# Matching code at level 2
apd_gift_lvl2 = apd_gift_detailed |>
  filter(gift_lvl == 2) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl2, Trait1),
    by = c(extracted_code = "Lvl2")
  )

# Problematic traits
apd_gift_lvl2 |>
  filter((extracted_trait != Trait1) | is.na(Trait1))


## Level 3 traits
apd_gift_lvl3 = apd_gift_detailed |>
  filter(gift_lvl == 3) |>
  left_join(
    gift_trait_meta |>
      distinct(Lvl3, Trait2),
    by = c(extracted_code = "Lvl3")
  )

# Problematic traits
apd_gift_lvl3 |>
  filter((extracted_trait != Trait2) | is.na(Trait2))

ehwenk commented May 27, 2024

@Rekyt Thank you for documenting these! I'll make the changes on a branch tomorrow.

It is sometimes intentional to have multiple close matches within a single cell in the csv files. The code that builds the formal ontology will split those into multiple lines, each assigned as an example of type "close_match". But I'll double check the ones you mentioned to ensure there isn't something else wrong.
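
For illustration, splitting such a cell could look like the base R sketch below. It is not the actual ontology build code: it assumes ";" as the separator (the one the matching script in this issue splits on), and the cell content is made up from trait names and codes listed in GIFT's metadata.

```r
# Made-up cell content: two close matches stored in a single CSV cell.
# Assumption: matches within a cell are separated by ";".
cell = "Growth_form_1 [GIFT:1.2.1]; Growth_form_2 [GIFT:1.2.2]"

# Split on the separator and trim surrounding whitespace
matches = trimws(strsplit(cell, ";", fixed = TRUE)[[1]])

# One row per match, each tagged with the same match type
close_matches = data.frame(
  match_type = "close_match",
  matched_trait = matches
)
close_matches
```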

ehwenk added the bug ("Something isn't working") label Jul 31, 2024
ehwenk added this to AusTraits Jul 31, 2024
ehwenk moved this to Backlog in AusTraits Jul 31, 2024
ehwenk added a commit that referenced this issue Aug 15, 2024

ehwenk commented Aug 15, 2024

For trait_0030020 & trait_0030015 the GIFT_close contains multiple traits as a single line. Is it on purpose? Because other matched traits span multiple lines.

  • this is correct and the multiple close matches are correctly documented in the *.ttl file

For trait_0030215, there is a typo in the GIFT_exact name: it is written "Fuiting time", missing an "r".

  • fixed

For trait_0030020, there is a typo in the GIFT code 'leaf_thorns_1 [GIFT:4:14.1]' which should be 'leaf_thorns_1 [GIFT:4.14.1]'.

  • fixed; there was a second error in these: they linked to the TRY website rather than the GIFT website.

Several GIFT trait names are written following AusTraits' convention rather than GIFT's, e.g. 'seed_height' instead of 'Seed height'.

  • @Rekyt I am confused by how this "AusTraits" convention was introduced. Before I change the many instances of this, can you confirm that all GIFT traits should be written in "Sentence case"? As with the next comment down, should there be underscores or spaces between the words?

Capitalization of trait names doesn't follow GIFT's names: APD tends to use snake_case while GIFT uses Camel_snake_case. For example, GIFT's name referenced in APD is 'flower_colour' [APD:trait_0012417] while in GIFT the trait is 'Flower_colour' [GIFT:3.21.1].

  • Please see comment above

There is an error in the GIFT match for trait_0030060: GIFT_close matches with GIFT 1.4.1 (Climber_1) while it should match with GIFT 3.4.1 (Reproduction_sexual_1). This was the trait that triggered my systematic search for potential mismatches, as the correspondence network contained a much larger connected component than expected, linking traits that shouldn't match.

  • fixed

ehwenk removed the status in AusTraits Aug 23, 2024
ehwenk moved this to In Progress in AusTraits Aug 23, 2024
ehwenk self-assigned this Aug 23, 2024

Rekyt commented Aug 30, 2024

First of all, thank you for making all the fixes!

I'm checking the correspondence with GIFT traits with the script I shared at the end of the initial message in this issue.

For GIFT traits there is an additional complexity: GIFT offers three levels of trait granularity. Level 1 traits are very broad categories, level 2 are the types of traits, and level 3 are detailed down to the exact meaning of the trait. Level 2 and Level 3 traits have their respective names in the columns Trait1 and Trait2.

This is what I obtain for example for the first few traits referenced in GIFT:

head(GIFT::GIFT_traits_meta())[,1:6]
#> You are asking for the latest stable version of GIFT which is 3.2.
#>   Lvl1   Category Lvl2       Trait1   Lvl3           Trait2
#> 1    1 Morphology  1.1    Woodiness  1.1.1      Woodiness_1
#> 2    1 Morphology 1.10 Shoot length 1.10.1 Shoot_length_min
#> 3    1 Morphology 1.10 Shoot length 1.10.2 Shoot_length_max
#> 4    1 Morphology  1.2  Growth form  1.2.1    Growth_form_1
#> 5    1 Morphology  1.2  Growth form  1.2.2    Growth_form_2
#> 6    1 Morphology  1.3     Epiphyte  1.3.1       Epiphyte_1

Created on 2024-08-30 with reprex v2.1.0

The Level 2 traits have a "Sentence case" naming convention while the Level 3 traits have a Capitalized_snake_case one.
So that's why I think there might be a systematic matching problem depending on which convention you follow.
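
To make the two conventions concrete, here is a minimal base R sketch that converts an APD-style snake_case name into each GIFT style. The helper names are hypothetical, and the conversion only changes case and separators, so it cannot repair names that are misspelled or that differ in content.

```r
# Hypothetical helpers, base R only.

# Level 2 style ("Sentence case"): underscores become spaces,
# and the first letter is upper-cased.
to_sentence_case = function(x) {
  x = gsub("_", " ", x, fixed = TRUE)
  paste0(toupper(substr(x, 1, 1)), substring(x, 2))
}

# Level 3 style (Capitalized_snake_case): underscores are kept,
# and the first letter is upper-cased.
to_capitalized_snake = function(x) {
  paste0(toupper(substr(x, 1, 1)), substring(x, 2))
}

to_sentence_case("seed_height")        # "Seed height"
to_capitalized_snake("growth_form_2")  # "Growth_form_2"
```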

For level 2 traits these are the ones I obtained that are still mismatching using the updated APD file:

# A tibble: 12 × 3
   APD_trait_identifier trait_as_extracted_in_APD trait_name_as_in_GIFT
   <chr>                <chr>                     <chr>                
 1 trait_0012512        fruit_length              Fruit length         
 2 trait_0012513        fruit_width               Fruit width          
 3 trait_0012514        fruit_height              Fruit height         
 4 trait_0012610        seed_mass                 Seed mass            
 5 trait_0012613        seed_length               Seed length          
 6 trait_0012614        seed_width                Seed width           
 7 trait_0012615        seed_height               Seed height          
 8 trait_0012616        seed_volume               Seed volume          
 9 trait_0030011        life_form                 Life form            
10 trait_0030061        pollination_syndrom       Pollination syndrome 
11 trait_0030211        dispersal_syndrome        Dispersal syndrome   
12 trait_0030214        flowering_time            Flowering time 

For the level 3 traits this is what I'm getting:

# A tibble: 27 × 3
   APD_trait_identifier trait_as_extracted_in_APD trait_name_as_in_GIFT 
   <chr>                <chr>                     <chr>                 
 1 trait_0010023        plant_height_max          Plant_height_max      
 2 trait_0011310        leaf_form_1               Leaf_form_1           
 3 trait_0011313        leaf_margin_1             Leaf_margin_1         
 4 trait_0011410        leaf_arrangement          Leaf_arrangement      
 5 trait_0011411        leaf_arrangement          Leaf_arrangement      
 6 trait_0012417        flower_colour             Flower_colour         
 7 trait_0012418        flower_colour             Flower_colour         
 8 trait_0012516        fruit_type_1              Fruit_type_1          
 9 trait_0012517        fruit_dryness_1           Fruit_dryness_1       
10 trait_0012518        dehiscence_1              Dehiscence_1          
11 trait_0012519        fruit_colour              Fruit_colour          
12 trait_0020221        photosynthetic_pathway    Photosynthetic_pathway
13 trait_0030010        growth_form_2             Growth_form_2         
14 trait_0030012        lifecycle                 Lifecycle_1           
15 trait_0030014        lifespan_1                Lifespan_1            
16 trait_0030015        epiphyte_2                Epiphyte_2            
17 trait_0030015        aquatic_1                 Aquatic_1             
18 trait_0030022        plant_succulence          Succulence_1          
19 trait_0030023        shoot_orientation         Shoot_orientation     
20 trait_0030023        climber_2                 Climber_2             
21 trait_0030024        deciduousnes_1            Deciduousness_1       
22 trait_0030027        nitrogen_fix_1            Nitrogen_fix_1        
23 trait_0030028        mycorrhiza_1              Mycorrhiza_1          
24 trait_0030029        parasite_1                Parasite_1            
25 trait_0030060        reproduction_sexual_1     Reproduction_sexual_1 
26 trait_0030062        self_fertilization_1      Self_fertilization_1  
27 trait_0030511        reproduction_asexual_2    Reproduction_asexual_2

I may have missed something regarding GIFT's traits naming convention, but AFAIK, the traits in the database are indexed by the Lvl2 and Lvl3 columns which correspond to the Trait1 and Trait2 names. So I would try to make sure the codes and the names are matching.
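
One way to prioritise the remaining mismatches is to separate names that differ only by convention (case, or underscores versus spaces) from names that are actually different. Below is a base R sketch using a few rows copied from the tables above; the column and function names are hypothetical.

```r
# A few rows copied from the mismatch tables above
mismatches = data.frame(
  apd_name  = c("seed_height", "pollination_syndrom", "flower_colour",
                "lifecycle", "deciduousnes_1"),
  gift_name = c("Seed height", "Pollination syndrome", "Flower_colour",
                "Lifecycle_1", "Deciduousness_1")
)

# Normalise: lower-case and treat spaces and underscores as equivalent
normalise = function(x) gsub(" ", "_", tolower(x), fixed = TRUE)

# TRUE when the two names differ only by naming convention
mismatches$convention_only =
  normalise(mismatches$apd_name) == normalise(mismatches$gift_name)

# Rows that still differ after normalisation need a manual fix
mismatches[!mismatches$convention_only, ]
```

In this sample, 'seed_height' and 'flower_colour' differ from GIFT only by convention, while 'pollination_syndrom', 'lifecycle' and 'deciduousnes_1' differ in the name itself.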
