
pyNBS.pyNBS_core.mixed_netNMF

Tongqiu (Iris) Jia edited this page Jan 29, 2018 · 17 revisions

This function solves the network-regularized non-negative matrix factorization (netNMF) problem at the center of the NBS algorithm. In the original Matlab code released by Hofree et al., the implementation of this netNMF function was adapted from the Matlab NMF Toolbox by Li and Ngom.

Solving Graph- (or Network-) Regularized Non-Negative Matrix Factorization

The graph-regularized NMF (GNMF) technique was described in detail by Cai et al. It is a variation on traditional NMF with an additional regularization term that constrains the solution of the matrix factors to a nearest-neighbor graph, derived in this case from the molecular network using the network_inf_KNN_glap function. The goal of GNMF/netNMF is to minimize the following objective function:

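$$\min_{W \ge 0,\, H \ge 0} \; \lVert X - WH \rVert_F^2 + \lambda \, \mathrm{tr}\left(W^{T} L W\right)$$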

Here, X is the data to be factorized (in this case a genes-by-patients matrix, typically the quantile-normalized, network-smoothed somatic mutation data). W and H are the basis and patient cluster (or factor) matrices, with dimensions genes-by-k and k-by-patients respectively. L is the graph Laplacian of the K-nearest-neighbor (KNN) network constructed by network_inf_KNN_glap. L is then decomposed as L = D - A, where A is the adjacency matrix of the KNN network and D is the diagonal matrix of its node degrees.
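The quantities defined above can be sketched in NumPy as follows. This is a minimal illustration, not the pyNBS source: it assumes the standard GNMF form of the objective (squared Frobenius reconstruction error plus λ times the trace of WᵀLW) and a network without self-loops, and the helper names `split_laplacian` and `netnmf_objective` are hypothetical.

```python
import numpy as np

def split_laplacian(L):
    """Recover D and A from L = D - A (assumes A has a zero diagonal)."""
    D = np.diag(np.diag(L)).astype(float)  # node degrees sit on the diagonal of L
    A = D - L                              # off-diagonal entries of L are -A
    return D, A

def netnmf_objective(X, W, H, L, lam):
    """Assumed objective: ||X - WH||_F^2 + lam * tr(W^T L W)."""
    recon = np.linalg.norm(X - W @ H, "fro") ** 2
    penalty = lam * np.trace(W.T @ L @ W)
    return recon + penalty

# Toy 3-gene path network: gene0 - gene1 - gene2
A_toy = np.array([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
L_toy = np.diag(A_toy.sum(axis=1)) - A_toy
D, A = split_laplacian(L_toy)

rng = np.random.default_rng(0)
X = rng.random((3, 4))  # genes-by-patients
W = rng.random((3, 2))  # genes-by-k
H = rng.random((2, 4))  # k-by-patients
objective = netnmf_objective(X, W, H, L_toy, lam=200)
```

Because a graph Laplacian is positive semi-definite, the penalty term is never negative, so the regularizer can only add to the plain reconstruction error.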

This function initializes random H and W matrices and then mixes two techniques to update the factor matrices. To update the basis (genes-by-k) factor matrix, the function uses a multiplicative update rule (as described by Cai et al.), applying the following element-wise update to W:

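$$W \leftarrow W \odot \frac{X H^{T} + \lambda A W}{W H H^{T} + \lambda D W}$$

where $\odot$ and the fraction bar denote element-wise multiplication and division.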
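As a rough sketch, a multiplicative update of this kind can be written in NumPy as below. It assumes the standard GNMF update from Cai et al. with the regularizer applied to W, and uses `eps` to keep entries strictly positive. The function names `update_W` and `objective` are hypothetical and this is not the pyNBS implementation.

```python
import numpy as np

def update_W(X, W, H, A, D, lam, eps=1e-15):
    """One element-wise multiplicative update of W with H held fixed."""
    numer = X @ H.T + lam * (A @ W) + eps
    denom = W @ (H @ H.T) + lam * (D @ W) + eps
    return np.maximum(W * (numer / denom), eps)  # clamp zeros up to eps

def objective(X, W, H, A, D, lam):
    """Assumed objective: ||X - WH||_F^2 + lam * tr(W^T (D - A) W)."""
    return np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ (D - A) @ W)

# Toy 3-gene path network and random non-negative data
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
D = np.diag(A.sum(axis=1))
rng = np.random.default_rng(0)
X = rng.random((3, 5))
H = rng.random((2, 5))
W = rng.random((3, 2))

history = [objective(X, W, H, A, D, lam=200)]
for _ in range(20):
    W = update_W(X, W, H, A, D, lam=200)
    history.append(objective(X, W, H, A, D, lam=200))
```

With H fixed, repeated updates of this form should not increase the objective, which is why the rule needs no step size.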

Then, the function uses a non-negative least squares (NNLS) approach to update the patient cluster (k-by-patients) factor matrix (also referred to as the H matrix throughout the wiki). The original Matlab code used a custom implementation of this step developed by Li and Ngom, but we use the scipy implementation of the non-negative least squares algorithm.
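One way to picture the NNLS step is to solve a separate non-negative least squares problem for each patient column of the data. The sketch below uses `scipy.optimize.nnls`; the column-by-column loop and the helper name `update_H` are assumptions for illustration and may differ from how pyNBS organizes the call.

```python
import numpy as np
from scipy.optimize import nnls

def update_H(X, W):
    """Solve min ||W h - x_j|| subject to h >= 0 for every column x_j of X."""
    k, n_patients = W.shape[1], X.shape[1]
    H = np.zeros((k, n_patients))
    for j in range(n_patients):
        H[:, j], _ = nnls(W, X[:, j])  # nnls returns (solution, residual norm)
    return H

# Toy check: build X from a known non-negative H and recover it
rng = np.random.default_rng(0)
W = rng.random((6, 2))       # genes-by-k
H_true = rng.random((2, 4))  # k-by-patients
X = W @ H_true
H = update_H(X, W)
```

Each column of H depends only on W and the matching column of X, so the per-patient problems are independent of one another.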

Function Call:

mixed_netNMF(data, KNN_glap, k=3, l=200, maxiter=250, eps=1e-15, err_tol=1e-4, err_delta_tol=1e-8, verbose=False)

Parameters:

  • data (required, numpy.ndarray): Transposed binary somatic mutation matrix loaded from file. This is the matrix of binary somatic mutation profiles of the cohort to perform pyNBS on, transposed to align with the objective function described above. The rows of data are therefore genes and the columns are patients/samples (transposed relative to the output of load_binary_mutation_data). The rows of this matrix must be in the same order as the rows/columns of KNN_glap (see below).
  • KNN_glap (required, numpy.ndarray): This is the numpy array of the graph Laplacian (gene-by-gene) of the KNN influence network constructed by the network_inf_KNN_glap function. This is the regularization matrix for this step (the L matrix in the equations above). The rows and columns of this array must follow the same gene order as the rows of the data array.
  • k (optional, int, default=3): Number of components to decompose patient mutation data into during the netNMF. This is also the same as the number of clusters of patients to separate data into.
  • l (optional, float, default=200): This is the regularization constant (the λ value in the equations above) that scales the network regularizer (KNN_glap or L) term in netNMF. The value must be convertible to a Python int. We have found that larger positive integers produce better and more robust results, and we suggest a value between 100 and 1000. Setting this value to 0 performs netNMF with no network regularization penalty (similar to a non-network-regularized NMF).
  • maxiter (optional, int, default=250): Maximum number of update steps to perform if no other convergence criterion is met first.
  • eps (optional, float, default=1e-15): Epsilon error value to adjust 0 (or very small) values during multiplicative matrix updates in netNMF. Essentially this is a parameter to define the machine precision for the netNMF step.
  • err_tol (optional, float, default=1e-4): This is the minimum error tolerance for matrix reconstruction of original data for this function to reach convergence. If the decomposition has reached a sufficiently close estimation of data, the function will return the H factor matrix from that decomposition at that time.
  • err_delta_tol (optional, float, default=1e-8): This is the minimum error tolerance for the L2 norm of difference in matrix reconstructions between iterations of netNMF for convergence. If the reconstruction error of the decomposition is not improving significantly, the function will return the H factor matrix from the decomposition at that time.
  • verbose (optional, bool, default=False): Verbosity flag for determining whether or not to have the netNMF function report intermediate progress at each iteration.
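The interplay of maxiter, err_tol and err_delta_tol can be sketched as a small stopping test. This is an illustrative reading of the parameter descriptions, not the pyNBS source: the helper `stop_reason` is hypothetical, and it assumes the err_delta_tol check compares the absolute change in the reconstruction residual between consecutive iterations.

```python
def stop_reason(residual, last_residual, iteration,
                maxiter=250, err_tol=1e-4, err_delta_tol=1e-8):
    """Return which criterion stops the loop, or None to keep iterating."""
    if residual <= err_tol:
        return "err_tol"        # reconstruction is already close enough
    if abs(last_residual - residual) <= err_delta_tol:
        return "err_delta_tol"  # residual has stopped improving (assumed form)
    if iteration + 1 >= maxiter:
        return "maxiter"        # update budget exhausted
    return None

# Made-up residual trace that plateaus at 0.25
residuals = [1.0, 0.5, 0.25, 0.25, 0.25]
last = float("inf")
stopped_at, reason = None, None
for i, r in enumerate(residuals):
    reason = stop_reason(r, last, i)
    if reason is not None:
        stopped_at = i
        break
    last = r
```

In this toy trace the residual never drops below err_tol, so the loop ends through the err_delta_tol check as soon as two consecutive residuals are equal.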

Returns:

  • W (numpy.ndarray): The converged (genes-by-k) array of the basis factor matrix from this function.
  • H (numpy.ndarray): The converged (k-by-patients) array of the patient cluster factor matrix from this function. Multiple instances of this H matrix will be combined during the consensus clustering step of the algorithm by the consensus_hclust_hard function.
  • numIter (int): The number of update steps performed by this function before the result converges.
  • finalResidual (float): The residual reconstruction error of the two factor matrices as compared to the original data at convergence of this function. This is the following L2 norm: $\lVert X - WH \rVert$.

Additional notes about this function:

This function is a required step in the NBS algorithm. It is also called within the NBS_single function.
