Data from high-throughput technologies that assess global patterns of biomolecules (omic data) are often afflicted with missing values and with measurement-specific biases (batch effects) that hinder the quantitative comparison of independently acquired datasets. This repository provides the BERT algorithm, a high-performance method for data integration of incomplete omic profiles.
Important
This repository is primarily intended for development purposes. For typical users, BERT is provided via Bioconductor. Note that repository badges refer to the release version of BERT, which may be multiple commits behind the source code provided here. The latest CI/CD results for BERT may be obtained here.
Warning
The R package provided here is neither affiliated with nor related to Bidirectional Encoder Representations from Transformers, as published by Devlin et al. in 2019 (arXiv:1810.04805).
Tip
It is recommended to install BERT via Bioconductor as described here.
For development purposes, the BERT package can be installed directly from this repository using devtools.
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c('S4Vectors', 'S4Arrays', 'XVector', 'genefilter', 'SparseArray'))
devtools::install_github('HSU-HPC/BERT')
Please compare the installed version of R to the required version for Bioconductor and install all build dependencies if compilation from source is required for your target [1].
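As a minimal sketch of the version comparison mentioned above, the following snippet prints the version of the running R interpreter and, if BiocManager is already installed, the Bioconductor version it is configured for. It only reports the values; which R version a given Bioconductor release requires still has to be looked up on the Bioconductor website.

```r
# Version of the running R interpreter (base R, no extra packages needed)
r_version <- getRversion()
cat("Installed R version:", as.character(r_version), "\n")

# BiocManager is optional here; only query it if it is already installed
if (requireNamespace("BiocManager", quietly = TRUE)) {
  cat("Bioconductor version in use:", as.character(BiocManager::version()), "\n")
} else {
  cat("BiocManager is not installed; see the installation snippet above.\n")
}
```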
The BERT library is designed to offer high user friendliness whilst providing maximum flexibility. The following example demonstrates how to use the software on a simulated dataset with batch-effects and missing values:
# import library
library(BERT)
# simulate dataset with 10% missing values
dataset_raw <- generate_dataset(features=60, batches=10, samplesperbatch=10, mvstmt=0.1, classes=2)
# apply BERT with default arguments
dataset_corrected <- BERT(dataset_raw)
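To get a first impression of the result, the raw and corrected objects can be compared with base R. The sketch below repeats the example and reports the dimensions and the share of missing entries before and after correction. `missing_fraction` is a hypothetical helper written for this illustration and is not part of BERT; it counts `NA` values over every column of the object, including any annotation columns, so the reported value need not match the simulated missingness exactly.

```r
# Hypothetical helper (not part of BERT): fraction of NA entries
# across all cells of a data.frame or matrix
missing_fraction <- function(dataset) {
  mean(is.na(dataset))
}

# Only run the BERT part if the package is installed
if (requireNamespace("BERT", quietly = TRUE)) {
  dataset_raw <- BERT::generate_dataset(features = 60, batches = 10,
                                        samplesperbatch = 10, mvstmt = 0.1,
                                        classes = 2)
  dataset_corrected <- BERT::BERT(dataset_raw)

  cat("Dimensions (raw):            ", dim(dataset_raw), "\n")
  cat("Dimensions (corrected):      ", dim(dataset_corrected), "\n")
  cat("Missing fraction (raw):      ", missing_fraction(dataset_raw), "\n")
  cat("Missing fraction (corrected):", missing_fraction(dataset_corrected), "\n")
}
```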
Tip
A detailed explanation of all available parameters, their default values and optimal configurations for typical scenarios can be found in the Bioconductor vignette.
Users may ask for assistance via the Bioconductor support site. Bug reports may be filed via the Issues tab of this repository. For confidential or security-related problems, please send an email to yannis [dot] schumann [at] desy [dot] de.
This code is published under the GPLv3.0 License.
Citations make research visible. If you use BERT for your research, please cite the following publication:
- Computational Methods for Data Integration and Imputation of Missing Values in Omics Datasets, Y. Schumann Gocke / A. Gocke / J. E. Neumann, 2024-12 PROTEOMICS, Wiley, https://doi.org/10.1002/pmic.202400100
Footnotes
1. On Ubuntu 24.04, a complete list of dependencies would be: wget, curl, build-essential, libssl-dev, libcurl4-openssl-dev, pkg-config, git, ca-certificates, libxml2, libxml2-dev, gnupg, software-properties-common, libfontconfig1-dev, libharfbuzz-dev, libfribidi-dev, libfreetype6-dev, libpng-dev, libtiff5-dev, libjpeg-dev