AI2BMD: AI-powered ab initio biomolecular dynamics simulation

Overview

AI2BMD is a program for efficiently simulating protein molecular dynamics with ab initio accuracy. This repository contains the simulation program, datasets, and public materials related to AI2BMD.

AI2BMD Setup Guide

The source code of AI2BMD is hosted in this repository. We package the source code and runtime libraries into a Docker image, and provide a Python launcher program to simplify the setup process. To run the simulation program, you don't need to clone this repository. Simply download scripts/ai2bmd and launch it (Python >=3.7 is required).

wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --prot-file path/to/target-protein.pdb --sim-steps nnn  ...
#        '-------- required argument ---------' '-- optional arguments --'
#
# Notable optional arguments:
#
# [Simulation directory mapping options]
#   --base-dir path/to/base-dir    Directory for running simulation (default: current directory)
#   --log-dir  path/to/log-dir     Directory for logs, results (default: base-dir/Logs-protein-name)
#   --src-dir  path/to/src-dir     Mount src-dir in place of src/ from this repository (default: not used)
#
# [Simulation parameter options]
#   --sim-steps nnn                Simulation steps
#   --temp-k nnn                   Simulation temperature in Kelvin
#   --timestep nnn                 Time-step (fs) for simulation
#   --preeq-steps nnn              Pre-equilibration simulation steps for each constraint
#   --max-cyc nnn                  Maximum energy minimization cycles in preprocessing
#
# [Performance tweaks]
#   --device-strategy [strategy]   The compute device allocation strategy
#       excess-compute                 Reserve the last GPU for non-bonded/solvent computation
#       small-molecule                 Maximize resources for model inference
#       large-molecule                 Improve performance for large molecules
#   --chunk-size nnn               Number of atoms in each batch (reduces memory consumption)
#
# [Additional launcher options]
#   --software-update              When specified, updates the program in the Docker image before running
#   --download-training-data       When specified, downloads the AI2BMD training data, and unpacks it in the working directory. 
#                                  Ignores all other options.
#   --gpus                         Specifies the GPU devices to pass through to the program. Can be one of the following:
#                                  all:        Pass through all available GPUs to the program.
#                                  none:       Disable GPU passthrough.
#                                  i[,j,k...]  Pass through selected GPUs. Example: --gpus 0,1
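
For scripted or batch runs, the launcher invocation can also be assembled from Python. The sketch below is a minimal illustration and not part of AI2BMD: `build_ai2bmd_command` is a hypothetical helper, the step count and temperature are placeholder values, and only flags listed in the help text above are used.

```python
import shlex
from typing import List, Optional


def build_ai2bmd_command(
    prot_file: str,
    sim_steps: Optional[int] = None,
    temp_k: Optional[float] = None,
    gpus: Optional[str] = None,
    launcher: str = "./ai2bmd",
) -> List[str]:
    """Assemble an argument list for the launcher (hypothetical helper)."""
    cmd = [launcher, "--prot-file", prot_file]  # the only required argument
    if sim_steps is not None:
        cmd += ["--sim-steps", str(sim_steps)]
    if temp_k is not None:
        cmd += ["--temp-k", str(temp_k)]
    if gpus is not None:
        cmd += ["--gpus", gpus]  # "all", "none", or a list such as "0,1"
    return cmd


# Placeholder values, chosen only for illustration.
cmd = build_ai2bmd_command("chig.pdb", sim_steps=1000, temp_k=300, gpus="0,1")
printable = " ".join(shlex.quote(arg) for arg in cmd)
print(printable)

# To launch for real (requires Docker and the downloaded launcher):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell-quoting problems when the protein file path contains spaces.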

Running Simulation

We can run a molecular dynamics simulation as follows.

# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# download the Chignolin protein structure data file
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/resources/samples/chig.pdb'
# launch the program, with all simulation parameters set to default values
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --prot-file chig.pdb

Here we use Chignolin, a very small protein, as an example. The program runs the simulation with the default parameters.

The results will be placed in a new directory Logs-chig. The directory contains the simulation trajectory file:

  • chig-traj.traj: The full trajectory file in ASE binary format.
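
Because the trajectory is stored in ASE's binary format, it should be readable with the ASE Python package. The sketch below assumes ASE is installed in your own environment (`pip install ase`), which the launcher does not do for you, and that the run used the default output location. `load_frames` and `summarize` are hypothetical helper names.

```python
def load_frames(traj_path="Logs-chig/chig-traj.traj"):
    """Read every frame of an ASE trajectory as a list of ase.Atoms objects."""
    from ase.io import read  # third-party dependency, imported lazily

    return read(traj_path, index=":")  # ":" selects all frames


def summarize(frames):
    """Return (number of frames, atoms per frame) for a list of frames."""
    n_atoms = len(frames[0]) if frames else 0
    return len(frames), n_atoms


# Example usage, once a simulation has produced the trajectory:
#   frames = load_frames()
#   n_frames, n_atoms = summarize(frames)
#   print(f"{n_frames} frames, {n_atoms} atoms each")
#   print(frames[-1].get_positions())  # Cartesian coordinates of the last frame
```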

Note: Currently, AI2BMD only supports simulations of single-chain proteins that consist of standard amino acids and contain no disulfide bonds.

Datasets

Protein Unit Dataset

The protein unit dataset covers a wide range of conformations for dipeptides. It can be downloaded with the following commands:

# skip the following two lines if you've already set up the launcher
wget 'https://raw.githubusercontent.com/microsoft/AI2BMD/main/scripts/ai2bmd'
chmod +x ai2bmd
# you may need to "sudo" the following line if the docker group is not configured for the user
./ai2bmd --download-training-data

When the download finishes, the current working directory will contain the NumPy data files (*.npz).
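
The array names inside the downloaded files are not documented in this section, so a reasonable first step is to list them. An `.npz` file is a zip archive of `.npy` members, which means the names can be inspected with the standard library alone. The sketch below does that; `list_npz_arrays` and `inspect_downloads` are hypothetical helper names.

```python
import glob
import os
import zipfile


def list_npz_arrays(path):
    """Return the array names stored in one .npz file (a zip of .npy members)."""
    with zipfile.ZipFile(path) as archive:
        return [name[:-4] for name in archive.namelist() if name.endswith(".npy")]


def inspect_downloads(directory="."):
    """Map each *.npz file in `directory` to the array names it contains."""
    pattern = os.path.join(directory, "*.npz")
    return {path: list_npz_arrays(path) for path in sorted(glob.glob(pattern))}


# Example usage in the directory where the data was unpacked:
#   for path, names in inspect_downloads().items():
#       print(path, names)
```

To read the arrays themselves, `numpy.load(path)` returns an object whose `.files` attribute lists the same names and which can be indexed by name.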

AIMD-Chig Dataset

The AIMD-Chig dataset consists of 2 million conformations of the 166-atom Chignolin, along with their corresponding potential energy and atomic forces calculated using Density Functional Theory (DFT) at the M06-2X/6-31G* level.

System Requirements

Hardware Requirements

The AI2BMD program runs on x86-64 GNU/Linux systems. We recommend a machine with the following specs:

  • CPU: 8+ cores
  • Memory: 32+ GB
  • GPU: CUDA-enabled GPU with 8+ GB memory

The program has been tested on the following GPUs:

  • A100
  • V100
  • RTX A6000
  • Titan RTX

Software Requirements

The program has been tested on the following systems:

  • OS: Ubuntu 20.04, Docker: 27.1
  • OS: ArchLinux, Docker: 26.1

AI2BMD Related Research

Model Architectures

ViSNet

ViSNet (Vector-Scalar interactive graph neural Network) is an equivariant, geometry-enhanced graph neural network for molecules that significantly alleviates the trade-off between computational cost and sufficient utilization of geometric information.

Geoformer

Geoformer (Geometric Transformer) is a geometric Transformer that models molecular structures for various molecular property predictions. It introduces a novel positional encoding method, Interatomic Positional Encoding (IPE), to parameterize atomic environments in the Transformer. By incorporating IPE, Geoformer captures valuable geometric information beyond pairwise distances within a Transformer-based architecture. Geoformer can be regarded as a Transformer variant of ViSNet.

Fine-grained force metrics for MLFF

Machine learning force fields (MLFFs) have gained popularity in recent years as a cost-effective alternative to ab initio molecular dynamics (MD) simulations. Despite their small errors on test sets, MLFFs inherently suffer from generalization and robustness issues during MD simulations.

To alleviate these issues, we propose the use of global force metrics and fine-grained metrics from elemental and conformational aspects to systematically measure MLFFs for every atom and conformation of molecules. Furthermore, the performance of MLFFs and the stability of MD simulations can be enhanced by employing the proposed force metrics during model training. This includes training MLFF models using these force metrics as loss functions, fine-tuning by reweighting samples in the original dataset, and continued training by incorporating additional unexplored data.
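
As an illustration of what an element-wise metric can look like, the sketch below computes the mean absolute error of force components grouped by element. This is a generic per-element MAE chosen for illustration; the exact metric definitions used in the paper are not given in this README, and `per_element_force_mae` and the toy data are assumptions.

```python
from collections import defaultdict


def per_element_force_mae(elements, pred_forces, ref_forces):
    """Mean absolute error of force components, grouped by element symbol."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for symbol, pred, ref in zip(elements, pred_forces, ref_forces):
        for p, r in zip(pred, ref):  # x, y, z components of one atom
            totals[symbol] += abs(p - r)
            counts[symbol] += 1
    return {symbol: totals[symbol] / counts[symbol] for symbol in totals}


# Toy data (made up): one carbon and two hydrogens, forces in arbitrary units.
elements = ["C", "H", "H"]
predicted = [(1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
mae = per_element_force_mae(elements, predicted, reference)
print(mae)
```

Breaking the error down this way shows whether a small global error hides a large error on one element type.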

Stochastic lag time parameterization for Markov State Model

Markov state models (MSMs) play a key role in studying protein conformational dynamics. A sliding count window with a fixed lag time is commonly used to sample sub-trajectories for transition counting and MSM construction. However, sub-trajectories sampled with a fixed lag time may not perform well under different selections of lag time, requiring strong prior experience and resulting in less robust estimations.

To alleviate this, we propose a novel stochastic method based on a Poisson process to generate perturbative lag times for sub-trajectory sampling and use it to construct a Markov chain. Comprehensive evaluations on the double-well system, WW domain, BPTI, and RBD–ACE2 complex of SARS-CoV-2 reveal that our algorithm significantly increases the robustness and accuracy of the constructed MSM without disrupting its Markovian properties. Furthermore, the advantages of our algorithm are especially pronounced for slow dynamic modes in complex biological processes.
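
To make the idea of perturbed lag times concrete, the sketch below counts transitions in a discrete state trajectory using a lag drawn from a Poisson distribution at each window start, instead of one fixed lag. This is one loose reading of the description above, not the published algorithm; the function names, the clipping of lags to at least 1, and the toy trajectory are all assumptions.

```python
import math
import random


def sample_poisson(mean, rng):
    """Draw one Poisson variate with Knuth's method (adequate for small means)."""
    threshold = math.exp(-mean)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def count_transitions(states, n_states, mean_lag, seed=0):
    """Build a transition count matrix using a random lag per window start."""
    rng = random.Random(seed)
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(states) - 1):
        lag = max(1, sample_poisson(mean_lag, rng))  # assumption: lag >= 1
        if t + lag < len(states):  # skip windows that run off the end
            counts[states[t]][states[t + lag]] += 1
    return counts


# Toy two-state trajectory (made up) and a mean lag of 3 frames.
trajectory = [0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1]
print(count_transitions(trajectory, n_states=2, mean_lag=3))
```

Row-normalizing the count matrix would give a transition probability estimate; that step is omitted here.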

Citation

(#: co-first author; *: corresponding author)

Yusong Wang#, Tong Wang#*, Shaoning Li#, Xinheng He, Mingyu Li, Zun Wang, Nanning Zheng, Bin Shao*, Tie-Yan Liu. Enhancing geometric representations for molecules with equivariant vector-scalar interactive message passing, Nature Communications, 15.1 (2024): 313.

Yusong Wang#, Shaoning Li#, Tong Wang*, Bin Shao, Nanning Zheng, Tie-Yan Liu. Geometric Transformer with Interatomic Positional Encoding. NeurIPS 2023.

Zun Wang#, Hongfei Wu#, Lixin Sun, Xinheng He, Zhirong Liu, Bin Shao, Tong Wang*, Tie-Yan Liu. Improving machine learning force fields for molecular dynamics simulations with fine-grained force metrics, The Journal of Chemical Physics, Volume 159, Issue 3, Cover Story.

Tong Wang#*, Xinheng He#, Mingyu Li#, Bin Shao*, Tie-Yan Liu. AIMD-Chig: Exploring the conformational space of a 166-atom protein Chignolin with ab initio molecular dynamics, Scientific Data 10, 549 (2023).

Shiqi Gong#, Xinheng He#, Qi Meng, Zhiming Ma, Bin Shao*, Tong Wang*, Tie-Yan Liu. Stochastic Lag Time Parameterization for Markov State Models of Protein Dynamics, The Journal of Physical Chemistry B 126 (46), 2022, Cover Story.

License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT license.

Disclaimer

AI2BMD is a research project. It is not an officially supported Microsoft product.

Contacts

Please contact the AI2BMD Team with any questions or suggestions. The main team members include: