diff --git a/content/15.raw-data-analysis.md b/content/15.raw-data-analysis.md index 0406199b..e0d1c46d 100644 --- a/content/15.raw-data-analysis.md +++ b/content/15.raw-data-analysis.md @@ -247,26 +247,26 @@ Data normalization, the process for adjusting data to be comparable between samp Normalization removes systematic bias in peptide/protein abundances that could mask true biological discoveries or give rise to false conclusions [@DOI:10.1038/s41587-022-01440-w]. Bias may be due to factors such as measurement errors and protein degradation [@DOI:10.1021/pr050300l], although the causes for these variations are often unknown [@DOI:10.1186/1471-2105-13-S16-S5]. As data scaling methods should be kept at a minimum [@DOI:10.15252/msb.202110240], a normalization technique well suited to address the nuances specific to one’s data should be selected. -The assumptions for a given normalization technique should not be violated, otherwise choosing the wrong technique can lead to misleading conclusions [@DOI:10.1093/bib/bbx008]. -There are a multitude of data normalization techniques and knowing the most suitable one for a dataset can be challenging. +The assumptions for a given normalization technique should not be violated, otherwise choosing the wrong technique can lead to misleading conclusions [@DOI:10.1093/bib/bbx008]. +There are a multitude of data normalization techniques and knowing the most suitable one for a dataset can be challenging. Visualization of peptide or protein intensity distributions among samples is an important step prior to selecting a normalization technique. -Normalization is suggested to be done on the peptide level [@DOI:10.15252/msb.202110240]. +Normalization is suggested to be done on the peptide level [@DOI:10.15252/msb.202110240]. If the technical variability causes the peptide/protein abundances from each sample to be different by a constant factor, and thus intensities are graphically similar across samples, then a central tendency normalization method such as mean, median or quantile normalization may be sufficient [@DOI:10.15252/msb.202110240; @DOI:10.1021/pr050300l]. However, if there is a linear relationship between bias and the peptide/protein abundances, a different method may be more appropriate. -To visualize linear and nonlinear trends due to bias, we can plot the data in a ratio versus intensity, or a M (minus) versus A (average), plot [@DOI:10.1021/pr050300l; @DOI:10.1093/bioinformatics/19.2.185]. -Linear regression normalization is an available technique if bias is linearly dependent on peptide/protein abundance magnitudes [@DOI:10.1093/bib/bbw095; @DOI: 10.1021/pr050300l]. +To visualize linear and nonlinear trends due to bias, we can plot the data in a ratio versus intensity, or a M (minus) versus A (average), plot [@DOI:10.1021/pr050300l; @DOI:10.1093/bioinformatics/19.2.185]. +Linear regression normalization is an available technique if bias is linearly dependent on peptide/protein abundance magnitudes [@DOI:10.1093/bib/bbw095; @DOI:10.1021/pr050300l]. Alternatively, local regression (LOESS) normalization assumes nonlinearity between protein intensity and bias [@DOI:10.1093/bib/bbw095]. Another method, removal of unwanted variation (RUV), uses information from negative controls and a linear mixed effect model to estimate unwanted noise, which is then removed from the data [@DOI:10.1038/nbt.2931]. -If sample distributions are drastically different, for example due to different treatments or samples are obtained from various tissues, one must use a method that preserves the heterogeneity in the data, including information present in outliers, meanwhile reducing systematic bias [@DOI:10.15252/msb.202110240]. +If sample distributions are drastically different, for example due to different treatments or samples are obtained from various tissues, one must use a method that preserves the heterogeneity in the data, including information present in outliers, meanwhile reducing systematic bias [@DOI:10.15252/msb.202110240]. For example, Hidden Markov Model (HMM)-assisted normalization [@DOI:10.15252/msb.202110240], RobNorm [@DOI:10.1093/bioinformatics/btaa904] or EigenMS [@DOI:10.1093/bioinformatics/btp426] may be suitable for this type of data. These techniques assume error is only due to the batch and order of processing. The first method that addresses correlation of errors between compounds by using the information from the variation of one variable to predict another is systematic error removal using random forest (SERRF) [@DOI:10.1021/acs.analchem.8b05592]. SERRF, among 14 normalization methods, was the most effective in significantly reducing systematic error [@DOI:10.1021/acs.analchem.8b05592]. Studies aiming to compare these methods for omics data normalization have come to different conclusions. -Ranking of different normalization methods can be done by assessing the percent decrease in median log2(standard deviation) and log2 pooled estimate of variance (PEV) in comparison to the raw data [@DOI:10.1074/ mcp.M800514-MCP200]. +Ranking of different normalization methods can be done by assessing the percent decrease in median log2(standard deviation) and log2 pooled estimate of variance (PEV) in comparison to the raw data [@DOI:10.1074/mcp.M800514-MCP200]. One study found linear regression ranked the highest compared to central tendency, LOESS and quantile normalization for peptide abundance normalization for replicate samples with and without biological differences [@DOI:10.1021/pr050300l]. A paper comparing multiple normalization methods using a large proteomic dataset found that mean/median centering, quantile normalization and RUV had the highest known associations between proteins and clinical variables [@DOI:10.1016/j.biosystems.2022.104661]. Rather than individually implementing normalization techniques, which can be challenging for non-domain experts, there are several R and Python packages that automate mass spectrometry data analysis and visualization. @@ -284,7 +284,7 @@ Imputing these censored values will lead to bias as the imputed values will be o However, if the quantity is present at detectable limits but was missed due to a problem with the instrument, this peptide is missing completely at random (MCAR) [@DOI:10.1186/1471-2105-13-S16-S5]. While imputation of values that are MCAR using observed values would be a reasonable approach, censored peptides should not be imputed because their missingness is informative [@DOI:10.1186/1471-2105-13-S16-S5]. Peptides MCAR are a less frequent problem compared to censored peptides [@DOI:10.1186/1471-2105-13-S16-S5]. -Understanding why the peptide is missing can be challenging [@DOI:10.1186/1471-2105-13-S16-S5], however there are techniques such as maximum likelihood model [@DOI:10.1093/bioinformatics/btp3620] or logistic regression [@DOI:10.1007/s12561-009-9013-2] that may distinguish censored versus MCAR values. +Understanding why the peptide is missing can be challenging [@DOI:10.1186/1471-2105-13-S16-S5], however there are techniques such as maximum likelihood model [@DOI:10.1093/bioinformatics/btp362] or logistic regression [@DOI:10.1007/s12561-009-9013-2] that may distinguish censored versus MCAR values. Commonly used imputation methods for omics data are random forest (RF) imputation[@DOI:10.1093/bioinformatics/btr597], k-nearest neighbors (kNN) imputation [@DOI:10.1093/bioinformatics/17.6.520], and single value decomposition (SVD) [@DOI:10.1093/bioinformatics/btm069]. Using the mean or median of the non-missing values for a variable is an easy approach to imputation but may lead to underestimating the true biological differences [@DOI:10.1186/1471-2105-13-S16-S5]. @@ -326,8 +326,8 @@ There are many additional batch effect correction methods for single cell data, Prior to conducting any statistical analysis, the raw data matrix should be compared to the data after the above-described pre-processing steps have been performed to ensure bias is removed. We can compare data using boxplots of peptide intensities from the raw data matrix versus corrected data in sample running order to look at batch associated patterns; after correction, we should see uniform intensities across batches [@DOI:10.15252/msb.202110240]. We can also use clustering methods such as Principal Component analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), or t-SNE (t-Distribute Stochastic Neighbor Embedding) and plot protein quantities colored by batches or technical versus biological samples to see how proteins cluster in space based on similarity. -We can measure the variability each PC contributes; we want to see similar variability among all PCs, however if see one PC contributing to overall variability highly then means variables are dependent [@DOI: 10.1002/pmic.202100103]. -tSNE and UMAP allow for non-linear transformations and allow for clusters to be more visually distinct [@DOI: 10.1002/pmic.202100103]. +We can measure the variability each PC contributes; we want to see similar variability among all PCs, however if see one PC contributing to overall variability highly then means variables are dependent [@DOI:10.1002/pmic.202100103]. +tSNE and UMAP allow for non-linear transformations and allow for clusters to be more visually distinct [@DOI:10.1002/pmic.202100103]. Grouping of similar samples by batch or other non-biological factors, such as time or plate, indicates bias [@DOI:10.15252/msb.202110240]. Quantitative measures of whether batch effects have been removed are principal variance components analysis (PVCA), which provides information on factors driving variance between biological and technical differences, and checking correlation of samples between different batches, within the same batch and between replicates. When batch effects are present, samples in the same batch will have higher correlation than samples from different batches and between replicates [@DOI:10.15252/msb.202110240]. @@ -375,7 +375,7 @@ Despite ML methods being highly effective in finding signals in a high dimension Supervised classification is the most common type of ML used for proteomic biomarker discovery, where an algorithm has been trained on variables to predict the class labels of unseen test data [@DOI:10.1089/omi.2013.0017]. Supervised means the class labels, such as disease versus controls, is known [@DOI:10.1016/j.xcrp.2022.101069]. Decision trees are common model choice due to their many advantages: variables are not assumed to be linearly related, models are able to rank more important variables on their own, and interactions between variables do not need to be pre-specified by the user [@DOI:10.1016/j.euprot.2015.08.001]. -There are three phases of model development and evaluation [@DOI: 10.1038/s41598-022-09954-8]. +There are three phases of model development and evaluation [@DOI:10.1038/s41598-022-09954-8]. In the first step, the dataset is split into training and testing splits, commonly 70% training and 30% testing. Second, the model is constructed using only the training data, which is further subdivided into training and test sets. During this process, an internal validation strategy, or cross-validation (CV), is employed [@DOI:10.1016/j.jprot.2018.12.004]. @@ -389,9 +389,9 @@ However, addition of EHR data may not be informative in some instances; in study A common mistake in proteomics ML studies is allowing the test data to leak into the feature selection step [@DOI:10.1021/acs.jproteome.2c00117; @DOI:10.1016/j.xcrp.2022.101069]. It has been reported that 80% of ML studies on the gut microbiome performed feature selection using all the data, including test data [@DOI:10.1021/acs.jproteome.2c00117]. -Including the testing data in the feature selection step leads to development of an artificially inflated model [@DOI:10.1021/acs.jproteome.2c00117] that is overfit on the training data and performs poorly on new data [@DOI: 10.1016/j.cels.2021.06.006]. +Including the testing data in the feature selection step leads to development of an artificially inflated model [@DOI:10.1021/acs.jproteome.2c00117] that is overfit on the training data and performs poorly on new data [@DOI:10.1016/j.cels.2021.06.006]. Feature selection should occur only on the training set and final model performance should be reported using the unseen testing set. -The number of samples should be ten times the number of features to make statistically valid comparisons, however this may not be possible in many cases [@DOI:10.1021/acs.jproteome.2c00117, @DOI:10.1038/nrc1322]. +The number of samples should be ten times the number of features to make statistically valid comparisons, however this may not be possible in many cases [@DOI:10.1021/acs.jproteome.2c00117, @DOI:10.1038/nrc1322]. If a study is limited by its number of samples, one can perform classification without feature selection [@DOI:10.1016/j.xcrp.2022.101069, @DOI:10.1021/acs.jproteome.2c00117]. Pitfalls also arise when a ML classifier is trained using an imbalanced dataset [@DOI:10.1038/s41598-022-09954-8].