Bugfix: Gene score edge case where gene_list gene is chosen as control gene #2875
…ist during control gene selection
…ol gene == gene_list gene
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #2875 +/- ##
=======================================
Coverage 76.50% 76.50%
=======================================
Files 109 109
Lines 12474 12485 +11
=======================================
+ Hits 9543 9552 +9
- Misses 2931 2933 +2
Tests are failing, I believe, because the gene_list genes are now removed from the control genes before random sampling rather than after, which results in a different control gene set. I'm not quite sure why the difference in scores is so large, though.
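To illustrate the ordering issue, here is a minimal pure-Python sketch. It is not scanpy's code: the two helper functions, the gene names, and the seed are hypothetical, and it only assumes that control genes are drawn at random from one expression bin.

```python
import random


def sample_then_remove(bin_genes, gene_list, ctrl_size, seed=0):
    """Sample control genes from the bin first, then drop the scored genes."""
    rng = random.Random(seed)
    picked = rng.sample(bin_genes, min(ctrl_size, len(bin_genes)))
    return [g for g in picked if g not in gene_list]


def remove_then_sample(bin_genes, gene_list, ctrl_size, seed=0):
    """Drop the scored genes from the bin first, then sample control genes."""
    rng = random.Random(seed)
    candidates = [g for g in bin_genes if g not in gene_list]
    return rng.sample(candidates, min(ctrl_size, len(candidates)))


# Hypothetical bin of ten genes, two of which are being scored.
bin_genes = [f"gene{i}" for i in range(10)]
gene_list = {"gene3", "gene7"}

after = sample_then_remove(bin_genes, gene_list, ctrl_size=5)
before = remove_then_sample(bin_genes, gene_list, ctrl_size=5)

print("removed after sampling: ", after)
print("removed before sampling:", before)
```

Even with the same seed, the two orderings draw from different candidate pools, so the resulting control sets need not match. Removing after sampling can also return fewer than ctrl_size genes, while removing before sampling returns exactly ctrl_size as long as enough candidates remain.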
@flying-sheep, I have never been too familiar with the
There are two differences:
To answer the following question:
After going through the original code from Seurat, it seems to me that there is no equivalent to removing the genes to be scored from the control gene set. So if the original implementation does not remove scored genes from the control gene set, we would simply need to remove the following line: scanpy/scanpy/tools/_score_genes.py Line 169 in ec44574
(Note: if we want to keep the current behaviour, we should still remove the line above, since it would be redundant.)
Sorry for taking so long to look at this.
From what I can tell, if one of the genes to be scored happens to be chosen as the background, it will be included in the calculation.
So you’re saying that the Seurat code does neither & ~obs_cut.index.isin(gene_list) nor r_genes.difference(gene_list).
Therefore, if we’re going for Seurat compatibility, we’d end up with different scores than the failing reference test as well, so the reference doesn’t actually reflect any kind of gold standard, only what we did before at some point?
If I’m right, we should probably add an option to get the new better behavior you propose, and in Scanpy 2.0 we’d switch to that by default.
About question 1: I also find it strange that the order of the cuts you iterate through should have such a large effect on sampling. Should I change it back to
LGTM, I like that we keep the old way of running! Just a question on the naming.
src/scanpy/tools/_score_genes.py
Outdated
@@ -161,14 +165,29 @@ def get_subset(genes: pd.Index[str]):

n_items = int(np.round(len(obs_avg) / (n_bins - 1)))
obs_cut = obs_avg.rank(method="min") // n_items
control_genes = pd.Index([], dtype="string")
obs_cut_is_ctrl = False if ctrl_as_ref else obs_cut.index.isin(gene_list)
Not sure if I fully understand the naming here. I might go with something like keep_ctrl_in_obs_cut
great name, much better!
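For readers following the naming discussion, here is a rough pure-Python reading of the binning and mask lines in the hunk above. It is a sketch under assumptions, not the scanpy implementation: bin_by_rank and candidates are hypothetical helpers, plain dicts stand in for the pandas objects, and the expression values are made up.

```python
def bin_by_rank(obs_avg, n_bins):
    """Assign each gene to an expression bin, mimicking rank(method="min") // n_items."""
    n_items = int(round(len(obs_avg) / (n_bins - 1)))
    values = list(obs_avg.values())
    # "min" ranking: 1 + number of strictly smaller values, so ties share the lowest rank.
    return {g: (1 + sum(v < x for v in values)) // n_items for g, x in obs_avg.items()}


def candidates(obs_cut, gene_list, cut, ctrl_as_ref):
    """Genes of one bin that may be sampled as control genes."""
    # ctrl_as_ref=True: scored genes stay in the sampling pool (the old behaviour).
    # ctrl_as_ref=False: scored genes are masked out before sampling.
    return [
        g
        for g, c in obs_cut.items()
        if c == cut and (ctrl_as_ref or g not in gene_list)
    ]


# Hypothetical average expression per gene.
obs_avg = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4, "e": 0.5, "f": 0.6}
gene_list = {"c"}

obs_cut = bin_by_rank(obs_avg, n_bins=4)
cut_of_c = obs_cut["c"]

print(candidates(obs_cut, gene_list, cut_of_c, ctrl_as_ref=True))   # ['b', 'c']
print(candidates(obs_cut, gene_list, cut_of_c, ctrl_as_ref=False))  # ['b']
```

With ctrl_as_ref=True the scored gene "c" can still be drawn and, going by the discussion above, would only be dropped afterwards via r_genes.difference(gene_list). With ctrl_as_ref=False it never enters the pool.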
…t gene is chosen as control gene
…is chosen as control gene (#3142) Co-authored-by: Michaela Müller <[email protected]>
In some edge cases, the control gene selection retrieves the same gene(s) that are also in the gene_list used for scoring.
As a result, when the following line is called, we end up with an empty control gene set, causing the downstream error in #2153
scanpy/scanpy/tools/_score_genes.py
Line 173 in 383a61b
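The edge case can be reproduced in miniature with the sketch below. It is a simplified stand-in rather than the scanpy code path: pick_controls and control_mean are hypothetical helpers, and the bin is assumed to contain only the gene being scored.

```python
import random


def pick_controls(bin_genes, gene_list, ctrl_size, rng):
    """Sample up to ctrl_size genes from one bin, then drop the scored genes."""
    sampled = rng.sample(bin_genes, min(ctrl_size, len(bin_genes)))
    return [g for g in sampled if g not in gene_list]


def control_mean(expression, controls):
    """Average expression over the control genes, failing loudly if there are none."""
    if not controls:
        raise ValueError("empty control gene set")
    return sum(expression[g] for g in controls) / len(controls)


# Hypothetical bin whose only member is also the gene being scored.
bin_genes = ["geneA"]
gene_list = {"geneA"}
expression = {"geneA": 2.5}

controls = pick_controls(bin_genes, gene_list, ctrl_size=50, rng=random.Random(0))
print(controls)  # []

try:
    control_mean(expression, controls)
except ValueError as err:
    print(f"downstream failure: {err}")
```

Checking for an empty control set right after selection, as control_mean does here, surfaces the problem at its source instead of as an opaque downstream error.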