Bugfix: Gene score edge case where gene_list gene is chosen as control gene #2875

mumichae · 2024-02-21T16:34:30Z

In some edge cases, the control gene selection retrieves only gene(s) that are also in the gene_list used for scoring.
As a result, when the following line runs, we end up with an empty control gene set, which causes the downstream error in #2153.

scanpy/scanpy/tools/_score_genes.py

Line 173 in 383a61b

control_genes = list(control_genes - gene_list)
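A minimal sketch of the edge case, using plain Python sets and made-up gene names (not taken from scanpy or its test data): if every sampled control gene is also in gene_list, the set difference leaves nothing behind.

```python
# Hypothetical toy data; the gene names are illustrative only.
gene_list = {"g1", "g2"}

# Suppose the random control selection happened to return only scored genes.
control_genes = {"g1", "g2"}

# Same operation as the referenced line: drop scored genes from the controls.
control_genes = list(control_genes - gene_list)

print(control_genes)  # [] -> nothing left to use as a reference
```

Any later step that indexes the data by an empty control list then has no reference set to average over.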

Closes Score genes - control genes index of wrong format #2153
Tests included

Release notes not necessary because:

codecov · 2024-02-21T16:47:43Z

Codecov Report

Attention: Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 76.50%. Comparing base (4b090c0) to head (97e4024).

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2875   +/-   ##
=======================================
  Coverage   76.50%   76.50%           
=======================================
  Files         109      109           
  Lines       12474    12485   +11     
=======================================
+ Hits         9543     9552    +9     
- Misses       2931     2933    +2

| Files | Coverage Δ | |
|---|---|---|
| src/scanpy/preprocessing/_simple.py | `87.95% <ø> (ø)` | |
| src/scanpy/tools/_score_genes.py | `85.10% <86.66%> (-0.44%)` | ⬇️ |

mumichae · 2024-02-21T16:48:38Z

Tests are failing, I believe, because the gene_list genes are removed from control genes before random sampling, not after, resulting in a different control gene set. Not quite sure why the difference in scores is so large though.

ivirshup · 2024-02-22T12:33:02Z

@flying-sheep, I have never been too familiar with the score_genes code. Could you take a look at this?

flying-sheep · 2024-02-26T10:26:40Z

There are two differences:

1. np.unique(...) returns the values sorted, while pd.Series(...).unique() returns them in their original order (this alone already makes the scores not match).

   This probably changes the sampling, but I wonder why the score difference is so large here! With only that change, I get:

       Arrays are not equal

       Mismatched elements: 2730 / 2730 (100%)
       Max absolute difference: 0.22674037
       Max relative difference: 1581.75673912

2. What you said: the original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.

   So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.

So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.
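The two sampling orders can be compared with a rough sketch. The helper names, the toy bin, and the use of `numpy.random.Generator.choice` are assumptions made for illustration; this is not the scanpy implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bin of 10 genes, 8 of which are scored genes ("invalid" as controls).
bin_genes = np.array([f"g{i}" for i in range(10)])
gene_list = {f"g{i}" for i in range(8)}
ctrl_size = 2


def sample_then_filter(genes, gene_list, size, rng):
    """Original order: sample from the whole bin, then drop scored genes."""
    picked = rng.choice(genes, size=min(size, len(genes)), replace=False)
    return [str(g) for g in picked if str(g) not in gene_list]


def filter_then_sample(genes, gene_list, size, rng):
    """Proposed order: drop scored genes first, then sample from what is left."""
    valid = np.array([g for g in genes if str(g) not in gene_list])
    picked = rng.choice(valid, size=min(size, len(valid)), replace=False)
    return [str(g) for g in picked]


old = sample_then_filter(bin_genes, gene_list, ctrl_size, rng)
new = filter_then_sample(bin_genes, gene_list, ctrl_size, rng)

print(len(old), len(new))
```

In this toy bin only two genes are valid controls, so `filter_then_sample` always returns both of them, while `sample_then_filter` returns anywhere from zero to two depending on the draw.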

scanpy/tools/_score_genes.py

mumichae · 2024-03-22T15:12:54Z

To answer the following question:

> what you said: The original approach samples from the full list of genes in each bin, then restricts the sample to valid ones. Your approach samples from the valid genes in each bin.
>
> So if a bin e.g. contains mostly invalid genes, the original code adds only a few genes for that bin, while yours adds the maximum possible number.
>
> So the question is: is the sampling bias introduced in the original code wanted? If not, you not only made the code more resilient, but also more objective.

After going through the original code from Seurat, it seems to me that there's no equivalent to removing the genes to be scored from the control gene set.
From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation.
But please correct me if that's not the case.

So if the original implementation does not remove score genes from the control gene set, we would simply need to remove the following line:

scanpy/scanpy/tools/_score_genes.py

Line 169 in ec44574

control_genes = control_genes.union(r_genes.difference(gene_list))

(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant)
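The redundancy mentioned in the note can be checked on toy data. This sketch assumes the candidates are filtered before sampling, as proposed in this PR; the gene names and the slice standing in for the random sample are illustrative.

```python
import pandas as pd

# Hypothetical toy data; names are illustrative only.
gene_list = pd.Index(["g1", "g2"])
bin_genes = pd.Index(["g1", "g3", "g4", "g5"])

# Filter first: candidates no longer contain any scored gene.
valid = bin_genes[~bin_genes.isin(gene_list)]

# Stand-in for the random sample drawn from the filtered candidates.
r_genes = valid[:2]

# A second difference() against gene_list has nothing left to remove.
print(list(r_genes.difference(gene_list)))  # ['g3', 'g4']
print(list(r_genes))                        # ['g3', 'g4']
```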

flying-sheep

Sorry for taking so long to look at this.

> From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation.

So you’re saying that the Seurat code does neither `& ~obs_cut.index.isin(gene_list)` nor `r_genes.difference(gene_list)`.

Therefore, if we’re going for Seurat compatibility, we’d end up with different scores from the failing reference test as well, so the reference doesn’t actually reflect any kind of gold standard, only what we did before at some point?

If I’m right, we should probably add an option to get the new better behavior you propose, and in Scanpy 2.0 we’d switch to that by default.

scanpy/tools/_score_genes.py

mumichae · 2024-04-29T14:46:32Z

About question 1: I also find it strange that the order of the cuts you iterate through should have such a large effect on sampling. Should I change it back to np.unique?
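For reference, the ordering difference between the two calls can be seen on a small made-up series of cut labels:

```python
import numpy as np
import pandas as pd

# Hypothetical cut labels in the order they appear in the data.
cuts = pd.Series([3, 1, 2, 1, 3])

sorted_cuts = np.unique(cuts)   # unique values, sorted ascending
ordered_cuts = cuts.unique()    # unique values, in order of first appearance

print(sorted_cuts)   # [1 2 3]
print(ordered_cuts)  # [3 1 2]
```

If each cut draws from a shared random state while iterating, a different iteration order means each cut sees a different part of the random stream, which would plausibly change the sampled genes.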

scanpy/tools/_score_genes.py

mumichae

LGTM, I like that we keep the old way of running! Just a question on the naming.

mumichae · 2024-07-04T12:46:06Z

src/scanpy/tools/_score_genes.py

@@ -161,14 +165,29 @@ def get_subset(genes: pd.Index[str]):

    n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
    obs_cut = obs_avg.rank(method="min") // n_items
-    control_genes = pd.Index([], dtype="string")
+    obs_cut_is_ctrl = False if ctrl_as_ref else obs_cut.index.isin(gene_list)


Not sure if I fully understand the naming here. I might go with something like keep_ctrl_in_obs_cut

great name, much better!
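A rough sketch of how the binning and the mask from the hunk above could interact. The function name `candidates_per_cut`, the toy expression values, and the way the mask is applied are assumptions for illustration; only the `n_items` and `obs_cut` lines follow the diff. The sketch reads `ctrl_as_ref=True` as "leave gene_list genes in the candidate pool", which is an interpretation of the hunk, not documented behaviour.

```python
import numpy as np
import pandas as pd


def candidates_per_cut(obs_avg, gene_list, n_bins, ctrl_as_ref):
    """Return the control-gene candidates for each expression bin hit by gene_list."""
    # Binning as in the diff: rank genes by average expression, then cut into bins.
    n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
    obs_cut = obs_avg.rank(method="min") // n_items

    # Assumed use of the mask: exclude scored genes up front unless ctrl_as_ref is set.
    if ctrl_as_ref:
        exclude = np.zeros(len(obs_cut), dtype=bool)
    else:
        exclude = obs_cut.index.isin(gene_list)

    return {
        float(cut): list(obs_cut[(obs_cut == cut) & ~exclude].index)
        for cut in np.unique(obs_cut.loc[gene_list])
    }


# Hypothetical average expression per gene.
obs_avg = pd.Series(
    [0.1, 0.5, 0.2, 0.9, 0.4, 0.7],
    index=[f"g{i}" for i in range(6)],
)
gene_list = pd.Index(["g1", "g3"])

print(candidates_per_cut(obs_avg, gene_list, n_bins=3, ctrl_as_ref=True))
print(candidates_per_cut(obs_avg, gene_list, n_bins=3, ctrl_as_ref=False))
```

With `ctrl_as_ref=False`, the top bin contains only the scored gene g3, so its candidate list comes back empty. That is the kind of empty cut the warning added in this PR refers to.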

mumichae added 3 commits February 21, 2024 15:36

Ensure that control gene selection does not contain genes from gene_l…

7c65aaf

…ist during control gene selection

omit converting gene_list to set

294bc70

add test for case where no control genes are available and when contr…

970b728

…ol gene == gene_list gene

mumichae changed the title ~~Bugfix: Gene score edge case when no control genes are found~~ Bugfix: Gene score edge case where gene_list gene is chosen as control gene Feb 21, 2024

ivirshup requested a review from flying-sheep February 23, 2024 11:41

Merge branch 'main' into fix_score_genes_no_control_genes

7d118c0

flying-sheep requested changes Feb 26, 2024

flying-sheep added 2 commits February 26, 2024 11:49

use intersection for faster assert

dfa0545

Merge branch 'main' into fix_score_genes_no_control_genes

ec44574

flying-sheep and others added 3 commits April 5, 2024 14:00

Merge branch 'main' into fix_score_genes_no_control_genes

5a3315d

Merge branch 'main' into fix_score_genes_no_control_genes

b7fdcfb

Merge branch 'main' into fix_score_genes_no_control_genes

2283fdc

flying-sheep reviewed Apr 25, 2024

flying-sheep self-assigned this Apr 25, 2024

mumichae commented Apr 29, 2024

flying-sheep added 5 commits April 29, 2024 17:11

remove redundancy

d2bedbc

Fix references (scverse#3032)

9ecbd22

Relax pytest version restriction (scverse#3034)

df6a3af

assert to runtime error

f9e7589

Merge branch 'main' into fix_score_genes_no_control_genes

856da25

flying-sheep added this to the 1.10.2 milestone Apr 29, 2024

mumichae and others added 3 commits June 4, 2024 16:13

update warning message for empty cuts

5fd31b1

Merge branch 'main' into fix_score_genes_no_control_genes

078e74b

relnote

750d148

Merge branch 'main' into fix_score_genes_no_control_genes

6142a07

ilan-gold modified the milestones: 1.10.2, 1.10.3 Jun 25, 2024

flying-sheep added 6 commits July 2, 2024 12:16

Merge branch 'main' into pr/mumichae/2875

40b84f9

fix test

74094af

fmt

6f62f4f

add param

d35e05e

cmt

f4d1238

Fix relnotes

97e4024

mumichae commented Jul 4, 2024

rename

06129af

flying-sheep merged commit db2118e into scverse:main Jul 4, 2024
3 of 4 checks passed

meeseeksmachine pushed a commit to meeseeksmachine/scanpy that referenced this pull request Jul 4, 2024

Backport PR scverse#2875: Bugfix: Gene score edge case where gene_lis…

a37840f

…t gene is chosen as control gene

meeseeksmachine mentioned this pull request Jul 4, 2024

Backport PR #2875: Bugfix: Gene score edge case where gene_list gene is chosen as control gene #3142

Merged

flying-sheep pushed a commit that referenced this pull request Jul 4, 2024

Backport PR #2875: Bugfix: Gene score edge case where gene_list gene …

55541fc

…is chosen as control gene (#3142) Co-authored-by: Michaela Müller <[email protected]>

This was referenced Jul 26, 2024

score_genes fails completely when the gene set has zero expression in some cells #3169

Closed

Bug fix for scanpy.tl.score_genes #3167

Closed
