Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Gene score edge case where gene_list gene is chosen as control gene #2875

Merged
merged 25 commits into from
Jul 4, 2024

Conversation

mumichae
Copy link
Contributor

In some edge cases, the control gene selection retrieves the same gene(s) that are also in the gene_list used for scoring.
As a result, when the following line is called, we end up with an empty control gene set, causing the downstream error in #2153

control_genes = list(control_genes - gene_list)

  • Release notes not necessary because:

Sorry, something went wrong.

@mumichae mumichae changed the title Bugfix: Gene score edge case when no control genes are found Bugfix: Gene score edge case where gene_list gene is chosen as control gene Feb 21, 2024
Copy link

codecov bot commented Feb 21, 2024

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 76.50%. Comparing base (4b090c0) to head (97e4024).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2875   +/-   ##
=======================================
  Coverage   76.50%   76.50%           
=======================================
  Files         109      109           
  Lines       12474    12485   +11     
=======================================
+ Hits         9543     9552    +9     
- Misses       2931     2933    +2     
Files Coverage Δ
src/scanpy/preprocessing/_simple.py 87.95% <ø> (ø)
src/scanpy/tools/_score_genes.py 85.10% <86.66%> (-0.44%) ⬇️

@mumichae
Copy link
Contributor Author

Tests are failing, I believe, because the gene_list genes are removed from control genes before random sampling, not after, resulting in a different control gene set. Not quite sure why the difference in scores is so large though.

@ivirshup
Copy link
Member

@flying-sheep, I have never been too familiar with the score_genes code. Could you take a look at this?

@flying-sheep
Copy link
Member

There’s two differences:

  1. np.unique(...) returns the values sorted, pd.Series(...).unique() returns them in original order (this already makes the scores not match)

    This probably changes the sampling, but I wonder why the score difference is so large here! With only that change, I get:

    Arrays are not equal

    Mismatched elements: 2730 / 2730 (100%)
    Max absolute difference: 0.22674037
    Max relative difference: 1581.75673912

  2. what you said: The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

    So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

    So the questions is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.

scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
@mumichae
Copy link
Contributor Author

mumichae commented Mar 22, 2024

To answer the following question:

  1. what you said: The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

    So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

    So the questions is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.

After going through the original code from Seurat, it seems to me that there's not equivalent to removing genes to be scored from the control gene set.
From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation.
But please correct me if that's not the case.

So if the original implementation does not remove score genes from the control gene set, we would simply need to remove the following line:

control_genes = control_genes.union(r_genes.difference(gene_list))

(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant)

Copy link
Member

@flying-sheep flying-sheep left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for taking so long to look at this.

From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation.

So you’re saying that the seurat code neither does & ~obs_cut.index.isin(gene_list) nor
r_genes.difference(gene_list).

Therefore if we’re going with Seurat compat, we’d end up with difference scores than the failing reference test as well, so the reference doesn’t actually reflect any kind of gold standard, only what we did before at some point?

If I’m right, we should probably add an option to get the new better behavior you propose, and in Scanpy 2.0 we’d switch to that by default.

scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
@flying-sheep flying-sheep self-assigned this Apr 25, 2024
@mumichae
Copy link
Contributor Author

About question 1: I find also it strange that the order of cuts that you iterate though should have such a large effect on sampling. Should I change it back to np.unique?

scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
scanpy/tools/_score_genes.py Outdated Show resolved Hide resolved
@flying-sheep flying-sheep added this to the 1.10.2 milestone Apr 29, 2024
mumichae and others added 3 commits June 4, 2024 16:13
@ilan-gold ilan-gold modified the milestones: 1.10.2, 1.10.3 Jun 25, 2024
fmt
cmt
Copy link
Contributor Author

@mumichae mumichae left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I like, that we keep the old way of running! Just a question on the naming.

@@ -161,14 +165,29 @@ def get_subset(genes: pd.Index[str]):

n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")
obs_cut_is_ctrl = False if ctrl_as_ref else obs_cut.index.isin(gene_list)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I fully understand the naming here. I might go with something like keep_ctrl_in_obs_cut

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great name, much better!

@flying-sheep flying-sheep merged commit db2118e into scverse:main Jul 4, 2024
3 of 4 checks passed
meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Jul 4, 2024
flying-sheep pushed a commit that referenced this pull request Jul 4, 2024
…is chosen as control gene (#3142)

Co-authored-by: Michaela Müller <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Score genes - control genes index of wrong format
4 participants