[r] add tf-idf and log normalization functions #168
a93f2c4 to f1c33cd
f1c33cd to 98675d0
# Test that removing the add_one works
# log of 0 is -inf, but we don't do that on the c side, and just have really large negative numbers.
res_3 <- as(normalize_log(m2, add_one = FALSE), "dgCMatrix")
res_3@x[res_3@x < -60] <- -Inf
Any better way of doing this?
As suggested above, I think we just get rid of the add_one option
This overall looks reasonable, though I'm not sure if the formulas implemented are quite what we want.
For documenting the formulas, I'd suggest taking advantage of the fact that we can render LaTeX inside \eqn{} blocks from docstrings. Due to a current pkgdown bug we'll need to edit _pkgdown.yml to un-break our equation rendering:
includes:
  in_header: |
    <script defer data-domain="benparks.net" src="https://plausible.benparks.net/js/visit-counts.js"></script>
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Htz9HMhiwV8GuQ28Xr9pEs1B4qJiYu/nYLLwlDklR53QibDfmQzi7rYxXhMH/5/u" crossorigin="anonymous">
    <!-- The loading of KaTeX is deferred to speed up page rendering -->
    <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-bxmi2jLGCvnsEqMuYLKE/KsVCxV3PqmKeK6Y6+lmNXBry6+luFkEOsmp5vD9I/7+" crossorigin="anonymous"></script>
    <!-- To automatically render math in text elements, include the auto-render extension: -->
    <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-hCXGrW6PitJEwbkoStFjeJxv+fSOOQKOPbJxSfM6G5sWZjAyWhXiTIIAmQqnlLlh" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>
I'd suggest documenting the transformation equation for a single matrix element, like this for tfidf: \eqn{\tilde{x}_{ij} = \log(\frac{x_{ij} \cdot \text{scaleFactor}}{\text{rowMean}_i \cdot \text{colSum}_j} + 1)}
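To make the suggested element-wise formula concrete, here is a minimal dense-matrix sketch in base R. It assumes rows are features (index i) and columns are cells (index j); the name `tfidf_reference` and its `scale_factor` default are illustrative and not part of BPCells. It is only a way to check the equation on a toy matrix, not the package implementation.

```r
# Dense reference for the element-wise tf-idf formula (illustrative only).
# Assumption: rows are features (i), columns are cells (j).
tfidf_reference <- function(x, scale_factor = 1e4) {
  row_mean <- rowMeans(x)  # rowMean_i
  col_sum <- colSums(x)    # colSum_j
  # outer(row_mean, col_sum)[i, j] equals row_mean[i] * col_sum[j]
  log1p(x * scale_factor / outer(row_mean, col_sum))
}

# Toy matrix, filled column by column: 2 features x 3 cells
x <- matrix(c(1, 0, 2, 3, 4, 0), nrow = 2,
            dimnames = list(c("f1", "f2"), c("c1", "c2", "c3")))
res <- tfidf_reference(x)
```

Because the formula ends in a `+ 1` inside the logarithm, zero entries of `x` map to exactly zero in the result.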
r/R/transforms.R
Outdated
#' @param add_one (logical) Add one to the matrix before log normalization
#' @returns log normalized matrix.
#' @export
normalize_log <- function(mat, scale_factor = 1e4, add_one = TRUE) {
- We should remove the add_one parameter and always just do log1p. Every time I've seen this normalization it's done with a log1p, as otherwise the zero values would become -Inf (dgCMatrix actually messes this up)
- We should divide by the colSums prior to multiplying by scale_factor. It might make sense to use colMeans from matrix_stats() so we can do multi-threading
- We should give the specific normalization formula in the docs (perhaps in the returns section)
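One reading of the points above, sketched on a dense base-R matrix: divide each column by its sum, multiply by the scale factor, then apply log1p. The function name `normalize_log_reference` is hypothetical, columns are assumed to be cells, and this does not use the BPCells `matrix_stats()` path mentioned in the comment.

```r
# Illustrative dense version of the normalization described above.
# Assumption: columns are cells; `normalize_log_reference` is a made-up name.
normalize_log_reference <- function(x, scale_factor = 1e4) {
  # Divide each column by its column sum first, then scale
  scaled <- sweep(x, 2, colSums(x), "/") * scale_factor
  # log1p(v) computes log(1 + v), so zeros stay zero
  log1p(scaled)
}

# Toy matrix, filled column by column: 3 features x 2 cells
x <- matrix(c(1, 0, 3, 2, 2, 0), nrow = 3)
res <- normalize_log_reference(x)

# Zero entries remain exactly zero under log1p;
# a plain log() would send them to -Inf instead.
stopifnot(res[2, 1] == 0)
stopifnot(is.infinite(log(0)))
```

Keeping zeros at zero is what lets a sparse representation stay sparse after the transform, which is the motivation given for always using log1p.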
Agree on removing the add_one, and providing formulas in docs
r/R/transforms.R
Outdated
#' Else, map each feature name to its mean value.
#' @returns tf-idf normalized matrix.
#' @export
normalize_tfidf <- function(mat, feature_means = NULL, threads = 1L) {
- Nice touch thinking of the name-matching option for feature_means
- I think we might want to follow the ArchR/Signac TF-IDF default formulas which include some logarithms. Perhaps you were thinking users would combine normalize_log with normalize_tfidf but I think it's better to keep each standalone
- We should also give the specific normalization formula here
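The name-matching idea for feature_means can be sketched as below. This is a hypothetical helper in base R (the name `match_feature_means` and its error message are assumptions, not the code in this PR): a named vector is reordered to follow the row names of the matrix, and an unnamed vector is assumed to already be in row order.

```r
# Hypothetical helper: align a vector of feature means with the rows of `mat`.
match_feature_means <- function(mat, feature_means) {
  if (is.null(names(feature_means))) {
    # Unnamed input: assume it is already in row order
    stopifnot(length(feature_means) == nrow(mat))
    return(feature_means)
  }
  # Named input: fail loudly if any matrix feature has no mean
  absent <- setdiff(rownames(mat), names(feature_means))
  if (length(absent) > 0) {
    stop("feature_means is missing: ", paste(absent, collapse = ", "))
  }
  # Indexing by name reorders the means to match the matrix rows
  feature_means[rownames(mat)]
}

mat <- matrix(1:6, nrow = 3, dimnames = list(c("a", "b", "c"), NULL))
means <- c(c = 3, a = 1, b = 2)  # deliberately out of order
matched <- match_feature_means(mat, means)
```

Indexing by name makes the result independent of the order in which the means were supplied.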
Thanks for the review Ben! I agree with the styling changes. I believe a small discussion on the logic of these functions would be a good idea to make sure we're aligned. My initial thought was that Similarly, from what I recall, tf-idf normalization doesn't necessarily indicate that a log normalization will take place after it. I was thinking in the case of LSI/iterative LSI, we would again just have the log normalization as a required step post tf-idf normalization. I think having a boolean flag for cell normalization for
I'm thinking of
Okay, my bad on the ArchR thing! You're right, I will address these changes
d8d9ed2 to 6381f74
There was a previous decision to try to split PRs into more bite-sized pieces. We decided that this was going too far in that direction, as it leads to many layers of PRs stacked on top of each other. While #169 is downstream of this PR, this PR and the feature selection PR will both point to main.
Details
As discussed, we are looking to add normalization functions so that transformations can be passed into orchestrator functions like LSI/iterative LSI. I add two functions, normalize_tfidf() and normalize_log() (shown in #167).
There were a few departures from the design doc, in order to provide a little more flexibility. In particular, I was thinking about the case where the feature means are not ordered in the same way as the matrix features. To add a little safety, I added some logic for index-invariant matching of feature means to matrix features.
Other than that, I also provided an option to do a traditional log transform by boolean flag, rather than log1p. As we don't directly expose a log function on the BPCells C++ side, I just subtracted 1 before applying log1p. However, I'm noticing that zeros are not translated into -Inf as in a dgCMatrix/generic matrix, and instead become very large negative numbers. We might need to evaluate whether this is something we want to support.
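The subtract-one trick rests on the identity log1p(x - 1) = log(x). A small base-R check on a plain numeric vector (not a BPCells matrix) is sketched below; it also shows that ordinary R arithmetic does produce -Inf for a zero input, which is the behavior the BPCells path was observed not to reproduce.

```r
# Check that log1p(x - 1) matches log(x), including at zero.
x <- c(0, 1, 10)
via_log1p <- log1p(x - 1)

# Same values as a direct log()
stopifnot(isTRUE(all.equal(via_log1p, log(x))))

# In base R the zero entry becomes -Inf, since log1p(-1) is -Inf
stopifnot(via_log1p[1] == -Inf)
```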