bionty Ontology .from_values behaves strangely #2255

Closed
Koncopd opened this issue Dec 4, 2024 · 5 comments · Fixed by #2310

@Koncopd
Member

Koncopd commented Dec 4, 2024

Also occurred here:

This causes problems for curation.

Example during census curation in laminlabs/cellxgene:

bt.CellType.from_values(
    ['adipocyte', 'lactocyte', 'L2/3-6 intratelencephalic projecting glutamatergic neuron', 'subcutaneous adipocyte'],
    field=bt.CellType.name
)

results in

! did not create CellType records for 2 non-validated names: 'L2/3-6 intratelencephalic projecting glutamatergic neuron', 'lactocyte'
RecordList([CellType(uid='17EZdnrz', name='subcutaneous fat cell', ontology_id='CL:0002521', synonyms='subcutaneous adipocyte', description='A Fat Cell That Is Part Of Subcutaneous Adipose Tissue.', created_by_id=1, source_id=48, created_at=2023-11-28 22:27:55 UTC),
            CellType(uid='wdLgwUXo', name='fat cell', ontology_id='CL:0000136', synonyms='adipocyte|adipose cell', description='A Fat-Storing Cell Found Mostly In The Abdominal Cavity And Subcutaneous Tissue Of Mammals. Fat Is Usually Stored In The Form Of Triglycerides.', created_by_id=1, source_id=48, created_at=2023-11-28 22:27:55 UTC)])

So this returns only the 2 cell types that already exist in the instance (matched through their synonyms "adipocyte" and "subcutaneous adipocyte"). However, the curator code clearly assumes that .from_values returns both existing and public records.

When the list contains no existing records, only public ones, .from_values does return them:

bt.CellType.from_values(
    ['lactocyte', 'L2/3-6 intratelencephalic projecting glutamatergic neuron'],
    field=bt.CellType.name
)

results in

RecordList([CellType(uid='4bOFH2wt', name='lactocyte', ontology_id='CL:0002325', synonyms='lactation-derived mammary cell|mammary alveolar epithelial cell|epithelial cell of lactiferous gland|luminal cell of alveolus of lactiferous gland|lactaction-associated mammary epithelial cell|mammary gland alveolar epithelial cell', description='A Milk-Producing Glandular Epithelial Cell That Is Part Of A Mammary Gland Alveolus And Differentiates From A Luminal Adaptive Secretory Precursor Cell During Secretory Differentiation (Also Termed Lactogenesis I). Following Secretory Activation (Also Termed Lactogenesis Ii), A Lactocyte Is Involved In The Synthesis And/Or Transport Of Milk Constituents Including Proteins, Oligosaccharides, Lactose, Micronutrients, Fat, Hormones, Immunoglobulins, And Cytokines Into The Lumen Of The Lactating Mammary Gland.', created_by_id=6, source_id=60),
            CellType(uid='3x04dOgW', name='L2/3-6 intratelencephalic projecting glutamatergic neuron', ontology_id='CL:4023040', synonyms='L2/3-6 IT projecting neuron', description='A Intratelencephalic-Projecting Glutamatergic Neuron With A Soma Found In Cortical Layers L2/3-6', created_by_id=6, source_id=60)])

These are from the public ontology.

So I believe we have a problem: .from_values returns public records only when the provided list contains nothing but public records. If they are mixed with existing records, then only the existing ones are returned, and thus curators can't add public records.
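The difference between the observed and the expected behavior can be sketched with a toy model. This is only an illustration of the symptom described above, not bionty's implementation: the function names and the `existing` / `public` dictionaries are hypothetical stand-ins for the instance registry and the public ontology, and only the ontology IDs are taken from the outputs in this report.

```python
# Toy model of the lookup semantics discussed above; NOT bionty's implementation.
# `existing` and `public` are hypothetical stand-ins mapping a name to an ontology ID.


def from_values_reported(values, existing, public):
    """Mimic the reported symptom: public is consulted only if nothing exists."""
    from_instance = [existing[v] for v in values if v in existing]
    if from_instance:
        return from_instance
    return [public[v] for v in values if v in public]


def from_values_expected(values, existing, public):
    """What the curator assumes: existing records plus public matches."""
    from_instance = [existing[v] for v in values if v in existing]
    from_public = [public[v] for v in values if v not in existing and v in public]
    return from_instance + from_public


existing = {"adipocyte": "CL:0000136", "subcutaneous adipocyte": "CL:0002521"}
public = {
    "lactocyte": "CL:0002325",
    "L2/3-6 intratelencephalic projecting glutamatergic neuron": "CL:4023040",
}
values = [
    "adipocyte",
    "lactocyte",
    "L2/3-6 intratelencephalic projecting glutamatergic neuron",
    "subcutaneous adipocyte",
]

reported = from_values_reported(values, existing, public)  # existing records only
expected = from_values_expected(values, existing, public)  # existing + public
```

With the mixed list, `reported` holds two entries while `expected` holds four, which is the gap the curator runs into.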

@Zethson
Member

Zethson commented Dec 19, 2024

To reproduce:

!lamin init --storage run-tests --schema bionty,wetlab

import lamindb as ln
import bionty as bt

mouse = bt.Organism.from_source(name="mouse").save()

bt.Gene(ensembl_gene_id="ENSMUSG00000035310", organism=mouse).save()
bt.Gene.from_source(ensembl_gene_id="ENSG00000139618").save()

# should return 1 val
bt.Gene.from_values(["ENSMUSG00000035310"], field=bt.Gene.ensembl_gene_id, organism=mouse)
# RecordList([Gene(uid='6KCoBil6asUg', ensembl_gene_id='ENSMUSG00000035310', created_by_id=1, organism_id=1, created_at=2024-12-19 10:34:19 UTC)])

# should also return 1 val
bt.Gene.from_values(["ENSG00000139618"], field=bt.Gene.ensembl_gene_id, organism=mouse)
# RecordList([Gene(uid='1DQZiYQ1wP6x', symbol='BRCA2', ensembl_gene_id='ENSG00000139618', ncbi_gene_ids='675', biotype='protein_coding', synonyms='FANCD|FANCD1|FAD|BRCC2|FACD|XRCC11|FAD1', description='BRCA2 DNA repair associated ', created_by_id=1, source_id=11, organism_id=2, created_at=2024-12-19 10:30:49 UTC)])

# but this should now return 2 val
bt.Gene.from_values(["ENSMUSG00000035310", "ENSG00000139618"], field=bt.Gene.ensembl_gene_id, organism=mouse)
# RecordList([Gene(uid='6KCoBil6asUg', ensembl_gene_id='ENSMUSG00000035310', created_by_id=1, organism_id=1, created_at=2024-12-19 10:34:19 UTC)])

@sunnyosun
Member

This example itself is wrong: ENSMUSG00000035310 is a mouse gene, but ENSG00000139618 is a human gene. So bt.Gene.from_values(["ENSMUSG00000035310", "ENSG00000139618"], field=bt.Gene.ensembl_gene_id, organism=mouse) returns only the single mouse gene, because you specified the organism. There's no bug.

@sunnyosun
Member

The CellType behavior in the original report is a bug: previously, the source was determined by the source of the existing records. 'lactocyte' is a term in a newer version of the ontology, which is set as the default but is not linked to the existing records. I fixed it in #2310.
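A minimal sketch of the source-selection problem described here, under stated assumptions. It is not the actual bionty code and not the change made in #2310. The source IDs 48 and 60 come from the outputs in this thread; the set of terms assigned to each version, the `follow_existing` flag, and all function names are hypothetical.

```python
# Hypothetical model of the source-selection bug; not bionty code, not the #2310 diff.
# Keys are source IDs seen in the thread; the term sets per version are assumed.
PUBLIC_BY_SOURCE = {
    48: {"fat cell", "subcutaneous fat cell"},               # older ontology version
    60: {"fat cell", "subcutaneous fat cell", "lactocyte"},  # newer, default version
}
DEFAULT_SOURCE = 60


def pick_source(existing_source_ids, follow_existing):
    """Choose which ontology version to search for names not yet in the instance."""
    if follow_existing and existing_source_ids:
        # Behavior described as the bug: inherit the source of the existing records.
        return min(existing_source_ids)
    # Otherwise search the default (newest) source.
    return DEFAULT_SOURCE


def find_public(names, existing_source_ids, follow_existing):
    """Return the names that the chosen ontology version knows about."""
    source = pick_source(existing_source_ids, follow_existing)
    return [name for name in names if name in PUBLIC_BY_SOURCE[source]]


missing = ["lactocyte"]
buggy = find_public(missing, {48}, follow_existing=True)         # searches source 48
fixed = find_public(missing, {48}, follow_existing=False)        # searches source 60
no_existing = find_public(missing, set(), follow_existing=True)  # falls back to default
```

In this model the term is found whenever no existing records are involved, which matches the observation that a list containing only public terms worked.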

@falexwolf
Member

This example itself is wrong, ENSMUSG00000035310 is a mouse gene, but ENSG00000139618 is a human gene. So bt.Gene.from_values(["ENSMUSG00000035310", "ENSG00000139618"], field=bt.Gene.ensembl_gene_id, organism=mouse) will only return a single mouse gene because you specified the organism. There's no bug.

Is there a way that Lukas could have been made aware of this through the API? I fear that's difficult but many users might not have the awareness to readily spot differences by eye.

@sunnyosun
Member

This example itself is wrong, ENSMUSG00000035310 is a mouse gene, but ENSG00000139618 is a human gene. So bt.Gene.from_values(["ENSMUSG00000035310", "ENSG00000139618"], field=bt.Gene.ensembl_gene_id, organism=mouse) will only return a single mouse gene because you specified the organism. There's no bug.

Is there a way that Lukas could have been made aware of this through the API? I fear that's difficult but many users might not have the awareness to readily spot differences by eye.

There's a dedicated organism parameter to pass. Also, we don't yet support multiple organisms within the same dataset.
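One way a caller could catch such a mix-up before querying is to compare Ensembl ID prefixes against the organism. The helper below is a hypothetical, user-side sketch and not part of the bionty API. It assumes unversioned Ensembl gene IDs and covers only the two organisms that appear in this thread (human genes start with ENSG, mouse genes with ENSMUSG).

```python
# Hypothetical user-side check; NOT a bionty function.
# The prefix table is deliberately limited to the two organisms in this thread.
ENSEMBL_GENE_PREFIXES = {"human": "ENSG", "mouse": "ENSMUSG"}


def ids_not_matching_organism(ensembl_gene_ids, organism):
    """Return the IDs whose Ensembl prefix does not belong to `organism`.

    Assumes unversioned IDs such as "ENSG00000139618" (no ".5" suffix).
    """
    prefix = ENSEMBL_GENE_PREFIXES[organism]
    mismatched = []
    for gene_id in ensembl_gene_ids:
        # Strip the trailing digits and compare what is left with the prefix.
        if gene_id.rstrip("0123456789") != prefix:
            mismatched.append(gene_id)
    return mismatched


ids = ["ENSMUSG00000035310", "ENSG00000139618"]
wrong_for_mouse = ids_not_matching_organism(ids, "mouse")
wrong_for_human = ids_not_matching_organism(ids, "human")
```

Here `wrong_for_mouse` flags the human BRCA2 ID, so the mismatch shows up without having to compare the IDs by eye.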
