Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] add tf-idf and log normalization functions #168

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

immanuelazn
Copy link
Collaborator

@immanuelazn immanuelazn commented Dec 12, 2024

Details

As discussed, we are looking to add in normalization functions to allow for passing in transformations into orchestrator functions like LSI/iterative LSI. I add in two functions, normalize_tfidf(), and normalize_log() (shown in #167).

There were a few departures from the design doc, in order to provide a little bit more flexibility. Particularly, I was thinking about the case where the feature means are not ordered in the same way. To add in a little bit of safety, I added some logic for index invariant matching for feature means to matrix features.

Other than that, I also provided an option to do a traditional log transform by boolean flag, rather than log1p. As we don't directly expose a log function in BPCells C++ side, I just added a - 1. However, I'm noticing that this isn't translated into a -Inf like in dgCMatrix/generic matrix, and is instead a very small number. Might need to evaluate if this is something we would want to support

@immanuelazn immanuelazn force-pushed the ia/normalizations branch 2 times, most recently from a93f2c4 to f1c33cd Compare December 12, 2024 23:11
# Test that removing the add_one works
# log of 0 is -inf, but we don't do that on the c side, and just have really large negative numbers.
res_3 <- as(normalize_log(m2, add_one = FALSE), "dgCMatrix")
res_3@x[res_3@x < -60] <- -Inf
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any better way of doing this?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested above, I think we just get rid of the add_one option

Copy link
Owner

@bnprks bnprks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This overall looks reasonable, though I'm not sure if the formulas implemented are quite what we want.

For documenting the formulas, I'd suggest taking advantage of the fact that we can render latex inside \eqn{} blocks from docstrings. Due to a current pkgdown bug we'll need to edit _pkgdown.yml to un-break our equation rendering:

includes:
    in_header: |
       <script defer data-domain="benparks.net" src="https://plausible.benparks.net/js/visit-counts.js"></script>
       <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-Htz9HMhiwV8GuQ28Xr9pEs1B4qJiYu/nYLLwlDklR53QibDfmQzi7rYxXhMH/5/u" crossorigin="anonymous">
       <!-- The loading of KaTeX is deferred to speed up page rendering -->
       <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-bxmi2jLGCvnsEqMuYLKE/KsVCxV3PqmKeK6Y6+lmNXBry6+luFkEOsmp5vD9I/7+" crossorigin="anonymous"></script>
       <!-- To automatically render math in text elements, include the auto-render extension: -->
       <script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-hCXGrW6PitJEwbkoStFjeJxv+fSOOQKOPbJxSfM6G5sWZjAyWhXiTIIAmQqnlLlh" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>

I'd suggest documenting the transformation equation for a single matrix element, like this for tfidf: eqn{\tilde{x}_{ij} = \log(\frac{x_{ij} \cdot \text{scaleFactor}}{ \text{rowMean}_i\cdot \text{colSum}_j} + 1)}

This would render as:
image

r/R/transforms.R Outdated
#' @param add_one (logical) Add one to the matrix before log normalization
#' @returns log normalized matrix.
#' @export
normalize_log <- function(mat, scale_factor = 1e4, add_one = TRUE) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • We should remove the add_one parameter and always just do log1p. Every time I've seen this normalization it's done with a log1p, as otherwise the zero values would become -Inf (dgCMatrix actually messes this up)
  • We should divide by the colSums prior to multiplying by scale_factor. It might make sense to use colMeans from matrix_stats() so we can do multi-threading
  • We should give the specific normalization formula in the docs (perhaps in the returns section)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree on removing the add_one, and providing formulas in docs

r/R/transforms.R Outdated
#' Else, map each feature name to its mean value.
#' @returns tf-idf normalized matrix.
#' @export
normalize_tfidf <- function(mat, feature_means = NULL, threads = 1L) {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Nice touch thinking of the name-matching option for feature_means
  • I think we might want to follow the ArchR/Signac TF-IDF default formulas which include some logarithms. Perhaps you were thinking users would combine normalize_log with normalize_tfidf but I think it's better to keep each standalone
  • We should also give the specific normalization formula here

# Test that removing the add_one works
# log of 0 is -inf, but we don't do that on the c side, and just have really large negative numbers.
res_3 <- as(normalize_log(m2, add_one = FALSE), "dgCMatrix")
res_3@x[res_3@x < -60] <- -Inf
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As suggested above, I think we just get rid of the add_one option

@immanuelazn
Copy link
Collaborator Author

Thanks for the review Ben! I agree with the styling changes. I believe a small discussion on the logic of these functions would be a good idea to make sure we're aligned.

My initial thought was that normalize_log() and normalize_tfidf() would be to the tee of what these functions are named as. In the case of log normalization we would have the division by cell sums as built in the wrapper function (such as in select_features_by_variance() instead). What I'm worried about is in cases where we would want to pass the function, but we don't want a separate cell sum normalization as well. select_features_by_dispersion() is an example in the design docs, which has an optional log normalization without doing cell normalization This reflects the design decisions in ArchR here.

Similarly, from what I recall, tf-idf normalization doesn't necessarily indicate that a log normalization will take place after it. I was thinking in the case of LSI/iterative LSI, we would again just have the log normalization as a required step post tf-idf normalization.

I think having a boolean flag for cell normalization for normalize_log(), as well as having a boolean flag for log normalization in normalize_tfidf() would provide a little bit more flexibility. What do you think?

@bnprks
Copy link
Owner

bnprks commented Dec 13, 2024

I'm thinking of normalize_log() and normalize_tfidf() as trying to simplify single cell workflows by implementing common normalizations. We don't need these to cater to every possible need, just focus on making common cases easy (hard cases can be implemented by hand using our existing math operators). I think the versions I've suggested are the common cases

  • I think the ArchR example you link is actually an example of ArchR performing no normalization rather than doing a log without normalizing by read depth. That's a use-case we can easily support by allowing normalize=NULL or something to that effect.
  • I don't think I've ever seen a case of log normalization not include normalizing by colSums first. (e.g. this is what Seurat does, when it says "logNormalize"). I really don't think people will be confused here, and we can include the formula in the docs for clarity. I don't think we need a boolean flag unless we have a known example of a common analysis approach it would enable.
  • For tfidf, it would be okay to implement method argument from Signac, though just providing the default method is okay, as probably 99% of people just use the default that both ArchR/Signac use. Annoyingly Signac and ArchR disagree on method and LSIMethod numbering scheme, so we might need to use string naming if we implement this. (again, optional for now to implement the other variants. Implementing the ArchR/Signac default is most important)

@immanuelazn
Copy link
Collaborator Author

Okay, my bad on the ArchR thing! You're right, I will address these changes

@immanuelazn
Copy link
Collaborator Author

immanuelazn commented Jan 8, 2025

There was a previous decision to try to split PRs into more bite-sized pieces. We decided that this was going too far in this direction, as there becomes many layers of PRs stacked on top of each other. While #169 is downstream of this PR, this PR and the feature selection PR will both point to main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants