Regularization FALCON
Regularization is a technique to include additional constraints in an optimization problem. This is done by adding to the cost function a function of the parameter values themselves. For example, Tikhonov regularization, also called L2 regularization, ridge regression, or weight decay, considers the sum of the squares of the parameter values. In general, regularizations materialize prior knowledge about the parameter distributions, and are therefore very Bayesian in nature. For example, the Tikhonov penalty induces smaller parameter weights, which usually makes machine learning models more generalizable and moderates overfitting. By incrementally increasing the importance of this additional term in the objective function while controlling for the total goodness of fit, it is possible to choose a threshold and determine the 'best' model. The following regularization schemes are now available for FALCON (illustrative sketches of the corresponding penalty terms are given after the list):
- L1/2: introduces a partial (L1/2) norm on all parameters. This induces a simplification of the network topology, in the sense that stronger regularization forces the optimization to choose between additive edges pointing to the same node and keeps the most important of the competing interactions. It also has a pruning effect on the negative edges.
- L1: introduces a simple L1 norm on all parameters. Because of the Bayesian structure in FALCON (the sum of activating edges must equal 1), this is not useful for activating edges. It will, however, prune out less important inhibiting edges.
- L2: introduces a simple L2 norm on all parameters. Because of the Bayesian structure in FALCON, this is not useful for activating edges either; it actually pushes competing edges towards more equal contributions, and it is not clear what biological assumption this materializes. It will, however, decrease the weight of less important inhibiting edges.
- LGroups: penalizes the sum of the absolute differences to the average parameter value for groups of parameters. It makes the values of the same edge more similar across contexts, where contexts are for example cell lines, patients, or treatments. By screening regularization strengths, this can determine which edges are probably context-specific and which ones are probably not, given the data. In most cases, however, we think this is not the best option when there are more than two contexts; see LCluster below.
- LSmooth: penalizes the sum of cumulative differences across contexts when the contexts have a natural order, for example timepoints, spatial locations, or increasing amounts of genetic damage. This 'smooths' parameter values and assumes that differences between two successive contexts should be minimized. By screening regularization strengths, this can determine which edges are probably context-specific (for example, the timepoint at which a change occurs) and which ones are probably not, given the data. In spirit, this is similar to using convolutions in neural networks, and in general it materializes the assumption of local signal correlation.
- A scheme combining L1/2 with LGroups.
- A scheme combining L1/2 with LGroups and LSmooth. Not well tested.
- LCluster: an original contribution of ours (see De Landtsheer et al., 2018, Frontiers in Physiology, xxx). By summing the local deviations from the expected density over all n-dimensional rectangles in the parameter space, we obtain a penalty which we demonstrated to be useful for grouping contexts together when there is a high number of them and a certain level of heterogeneity is therefore expected. By screening regularization strengths, this can determine which edges are probably context-specific and which ones are probably not, given the data.
- PruneCluster: combines L1/2 with LCluster. Not well tested.
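As an illustration only, and not the toolbox's exact formulas, the penalty terms added to the cost function can be sketched in MATLAB as follows, with k a vector of parameter values and, for the context-dependent schemes, K a matrix with one row per edge and one column per context:
L2pen      = @(k) sum(k.^2);                       % L2 (Tikhonov / ridge): sum of squares
L1pen      = @(k) sum(abs(k));                     % L1: sum of absolute values
Lhalfpen   = @(k) sum(sqrt(abs(k)));               % L1/2: partial norm
LGroupsPen = @(K) sum(sum(abs(K - mean(K, 2))));   % LGroups-like: deviations from the per-edge average
LSmoothPen = @(K) sum(sum(abs(diff(K, 1, 2))));    % LSmooth-like: differences between successive contexts
% The LCluster penalty, which sums local deviations from the expected density over
% all n-dimensional rectangles of the parameter space, is not reproduced here.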
See the driver script Driver_Regularization.m. In order to conduct a regularized optimization of a network model over a series of contexts, it is necessary to optimize all context-specific models simultaneously, as the regularization function is computed from the homologous parameters across the contexts. In other words, we create a super-model with shared input nodes but parallel signal processing through topologically identical, parametrically independent networks, and fit each sub-network to its own output nodes.
For example, if we have the conceptual network A -> B -> C, an experimental setup with presence or absence of A, and measurements of C over 3 cell lines, we can construct the following network:
- A -> B1 -> C1
- A -> B2 -> C2
- A -> B3 -> C3.
This network can then be contextualized on the dataset [C1, C2, C3], and the values of the context-specific edges will only depend on the datapoints of that particular context. This is the same as performing three independent optimizations. However, if we have prior knowledge that the edges across the contexts should have a certain structure (for example, that they should be similar), we can add a function of the context-specific parameters (k1, k2, k3) to the objective function. This set-up is done with the FalconMakeGlobalModel() function.
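Conceptually, and only as a sketch of the idea rather than FALCON's actual implementation, the objective for this three-context example becomes the sum of the three context-specific fitting errors plus a weighted penalty on the homologous parameters (all names below are illustrative):
lambda  = 2^-6;                                   % regularization strength (example value)
penalty = @(k) sum(abs(k - mean(k)));             % e.g. deviations from the common mean
cost    = @(MSE1, MSE2, MSE3, k1, k2, k3) ...
          MSE1 + MSE2 + MSE3 + lambda * penalty([k1, k2, k3]);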
The following run-through of the script explains the main steps, taking as an example a toy model designed to represent a simple signaling pathway over four contexts. The four cell lines form two groups, which is the information we are going to retrieve through modeling. First, we define the network filename, the list of topologically locked interactions (empty in our case), an array containing the measurement filenames, and an array of strings used to name the contexts. We recommend the following code when the files are named consistently:
InputFile = 'ToyDiff.xlsx';                       % network structure file
GlobalEdgesList = 'ToyDiff_fixed.xlsx';           % topologically locked interactions
ContextsList = {'CL1a','CL2a','CL3a','CL4a'};     % context (cell line) names
MeasFileList = {};
for f = 1:length(ContextsList)
    % one measurement file per context: 'ToyDiff_meas_<context>.xlsx'
    MeasFileList = [MeasFileList, ['ToyDiff_meas_', char(ContextsList(f)), '.xlsx']];
end
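With this naming convention, MeasFileList evaluates to {'ToyDiff_meas_CL1a.xlsx', 'ToyDiff_meas_CL2a.xlsx', 'ToyDiff_meas_CL3a.xlsx', 'ToyDiff_meas_CL4a.xlsx'}.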
The network is then constructed in a similar way to a standard optimization, including the default values of the optimization parameters.
estim = FalconMakeGlobalModel(InputFile, GlobalEdgesList, MeasFileList, ContextsList);
Network loaded: 57 nodes and 96 interactions (24 Boolean gates)
Data loaded: 5 inputs, 16 outputs and 16 experimental conditions
The model has 32 parameters for 256 datapoints
There are 4 different contexts
Then, in this experiment, we add 5% Gaussian noise to the measurements:
estim = FalconPreProcess(estim, 'noise', 5);
The network can be optimized and re-simulated as before:
estim = FalconOptimize(estim);
[estim, StateValues, MSE] = FalconSimul(estim);
Without noise, the MSE should be about 5E-9; with noise, about 2E-3.
FALCON can use a number of different regularization schemes, detailed above. The type of regularization is chosen with estim.Reg = S, where S is one of the scheme names listed above. The relative weight of the regularization term in the objective function is set by the parameter lambda. Here, performing a single optimization does not make sense, as we are interested in the behavior of the objective function over a range of regularization strengths. We therefore define a vector of regularization values and loop over them. We apply exponentially increasing regularization strengths with:
estim.Reg = 'LCluster';
PowerLambda1 = -12:0.5:-2;             % exponents (base 2), in half-log steps
ListLambda1 = [0, 2.^PowerLambda1];    % lambda values, plus the unregularized model (lambda = 0)
This parametrization will screen 21 values of lambda, from 2^-12 to 2^-2 in half-log2 steps, in addition to the unregularized model. The scheme is the LCluster regularization, which applies a penalty related to the uniformity of the parameter values in the parameter space for each interaction in the network. It needs one more piece of information, namely which parameters are supposed to cluster together. This is specified with estim.RegMatrix.Cluster = M, where M is an N-by-P matrix of parameter ID numbers for N edges and P contexts. For example, a 5-edge network across 3 contexts would have the matrix [1 2 3; 4 5 6; 7 8 9; 10 11 12; 13 14 15]. The matrix is generated automatically but can be customized before optimization, for example to exclude some edges from regularization or to group parameters even further. In the example above we could use [1 2 3 4 5 6; 7 8 9 10 11 12], grouping the first two and the next two edges together and leaving the last edge unregularized. For mixed regularizations, we need two or three vectors of lambda values (depending on the case), as we will perform a grid search in two or three dimensions. Optimization is performed with the familiar command estim = FalconOptimize(estim);
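Since we want to screen the whole range of regularization strengths rather than perform a single optimization, the call to FalconOptimize() is typically wrapped in a loop. A minimal sketch, assuming estim.Lambda holds the current regularization strength and using an illustrative estim.bic field (not necessarily the toolbox's actual field name) to collect the selection criterion:
BICs = zeros(size(ListLambda1));                  % one criterion value per lambda
for i = 1:length(ListLambda1)
    estim.Lambda = ListLambda1(i);                % current regularization strength
    estim = FalconOptimize(estim);                % regularized optimization
    BICs(i) = estim.bic;                          % illustrative field storing the BIC
end
[~, best] = min(BICs);                            % lambda with the best fit/size trade-off
BestLambda = ListLambda1(best);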
After the optimizations are over, the main thing to look for is the most appropriate balance between goodness of fit (mean squared error) and model size (number of non-null parameters). Different metrics can be used; we usually prefer the BIC, although the AIC often leads to the same conclusion about the model topology. We consider 0.01 as the threshold for a parameter to be different from zero.
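For least-squares fits with Gaussian errors, the two criteria can be sketched as follows, with n the number of datapoints (256 in the toy example) and k the number of parameters above the 0.01 threshold (the parameter vector OptimalParams is illustrative):
n = 256;                                          % number of fitted datapoints
k = sum(abs(OptimalParams) > 0.01);               % non-null parameters (illustrative vector)
BIC = n * log(MSE) + k * log(n);                  % Bayesian Information Criterion (up to a constant)
AIC = n * log(MSE) + 2 * k;                       % Akaike Information Criterion (up to a constant)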
The top-left plot displays the differences in parametrization compared to the non-regularized model, and the top-right plot displays the parameter values; the X-axis represents the regularization strength. In these graphs we can observe how the parameter values change as regularization is applied. The same parameter has different values across the four contexts at first, and these values start to converge to a common value as regularization increases; the MSE, however, does not increase much, indicating that the initial differences were due to noise in the measurements. Accordingly, the number of free parameters in the model starts to decrease. At some point the cost incurred by the residuals and the regularization function balance out and we reach the model that is probably optimal, where the BIC is minimal (around log(lambda) = -6). The topology of the model at this point (which parameters are actually different across contexts and which ones can be reduced to a single degree of freedom) is presumably the best adapted to the dataset. When regularization is increased further, the MSE continues to increase and the number of parameters continues to decrease, but this parametrization probably results in an overly simplistic model, underfitting the dataset and erasing the actual differences between the cell lines.
Regularization can also be applied in two dimensions, each dimension materializing a set of assumptions about the underlying phenomenon being modeled. For example, one assumption could be that some of the edges in the network are not active at all, for instance because these interactions were inferred from a generic protein-protein interaction network but are not active in this cell type. Another assumption would be that the cell lines differ in only a small subset of the regulatory network, with most edges being equally parametrized across all cell lines or forming a limited number of clusters.
estim.Reg = 'PruneCluster';            % combined L1/2 + LCluster regularization
PowerLambda1 = -16:1:-6;               % exponents for the L1/2 strength
PowerLambda2 = -12:1:-2;               % exponents for the LCluster strength
ListLambda1 = [0, 2.^PowerLambda1];
ListLambda2 = [0, 2.^PowerLambda2];
The optimization is then performed as usual, with estim.Lambda = [L1, L2], where L1 and L2 are the regularization strengths for the L1/2 and the LCluster schemes, respectively.
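In practice this means a nested loop over both lambda vectors; a minimal sketch, again using an illustrative estim.bic field:
BICs = zeros(length(ListLambda1), length(ListLambda2));
for i = 1:length(ListLambda1)
    for j = 1:length(ListLambda2)
        estim.Lambda = [ListLambda1(i), ListLambda2(j)];   % L1/2 and LCluster strengths
        estim = FalconOptimize(estim);                     % regularized optimization
        BICs(i, j) = estim.bic;                            % illustrative field storing the BIC
    end
end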
This analysis generates a matrix of BIC values, which can be represented as a heatmap or a surface. The minimum BIC identifies the model with probably the best trade-off between model accuracy and model sparsity, accounting for two sources of sparsity: the removal of parameters that are unnecessary in all contexts, and a context-specific reduction of the parameter space.