---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```


<img src="inst/logo/ampir_hex.png" width="90" align="right" height="100" />

# Introduction to ampir

<!-- badges: start -->
[![](https://img.shields.io/badge/Shiny-ampir-blue?style=flat&labelColor=white&logo=RStudio&logoColor=blue)](https://ampir.marine-omics.net/)
[![](https://img.shields.io/badge/doi-10.1093/bioinformatics/btaa653-yellow.svg)](https://doi.org/10.1093/bioinformatics/btaa653)
[![Travis build status](https://travis-ci.com/Legana/ampir.svg?branch=master)](https://travis-ci.com/Legana/ampir) [![codecov](https://codecov.io/gh/Legana/ampir/branch/master/graph/badge.svg)](https://codecov.io/gh/Legana/ampir) [![License: GPL v2](https://img.shields.io/badge/License-GPL%20v2-blue.svg)](https://www.gnu.org/licenses/old-licenses/gpl-2.0.en.html) [![CRAN\_Release\_Badge](http://www.r-pkg.org/badges/version-ago/ampir)](https://CRAN.R-project.org/package=ampir?color=yellow) ![CRAN\_Download\_Badge](http://cranlogs.r-pkg.org/badges/grand-total/ampir?color=red) 
[![](http://cranlogs.r-pkg.org/badges/last-month/ampir?color=green)](https://cran.r-project.org/package=ampir)
<!-- badges: end -->


The **ampir** (short for **a**nti**m**icrobial **p**eptide prediction **i**n **r**) package was designed to be a fast and user-friendly method to predict antimicrobial peptides (AMPs) from protein datasets of any size. **ampir** uses a *supervised statistical machine learning* approach to predict AMPs. It incorporates two support vector machine classification models, "precursor" and "mature", that have been trained on publicly available antimicrobial peptide data. The default model, "precursor", is best suited for full-length proteins, and the "mature" model is best suited for small mature proteins (<60 amino acids). **ampir** also accepts custom (user-trained) models based on the [caret](https://github.com/topepo/caret) package. Please see the **ampir** *"How to train your model"* [vignette](https://CRAN.R-project.org/package=ampir/vignettes/train_model.html) for details.

ampir's associated paper is published in the *Bioinformatics* journal as [btaa653](https://academic.oup.com/bioinformatics/article-abstract/doi/10.1093/bioinformatics/btaa653/5873588). Please cite this paper if you use ampir in your research.

ampir is also available via a Shiny-based GUI at [https://ampir.marine-omics.net/](https://ampir.marine-omics.net/), where users can submit protein sequences in FASTA file format to be classified by either the "precursor" or "mature" model. The prediction results can then be downloaded as a CSV file.

## Installation

You can install the released version of ampir from [CRAN](https://CRAN.R-project.org) with:

``` r
install.packages("ampir")
```

And the development version from [GitHub](https://github.com/) with:

``` r
# install.packages("devtools")
devtools::install_github("Legana/ampir")
```

```{r setup, warning=FALSE, message=FALSE}
library(ampir)
```

## Usage

Standard input to **ampir** is a `data.frame` with sequence names in the first column and protein sequences in the second column. 
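
If your sequences are already in R rather than in a FASTA file, a data frame in this layout can be built by hand. The following is a minimal sketch: the column names `seq_name` and `seq_aa` match those used in the examples in this README, while the sequence names and sequences are made up for illustration.

```r
# Hypothetical sequences, for illustration only
my_manual_df <- data.frame(
  seq_name = c("example_protein_1", "example_protein_2"),
  seq_aa   = c("MKLFVLLALVAVASAQECKYRGLCRKR", "GIGKFLHSAKKFGKAFVGEIMNS"),
  stringsAsFactors = FALSE  # keep sequences as character strings
)

# Inspect the structure: two character columns, one row per protein
str(my_manual_df)
```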

Read in a FASTA-formatted file as a `data.frame` with `read_faa()`:

```{r, warning=FALSE, message=FALSE}
my_protein_df <- read_faa(system.file("extdata/little_test.fasta", package = "ampir"))
```

```{r, echo=FALSE}
display_df <- my_protein_df
display_df$seq_aa <- paste(substring(display_df$seq_aa,1,45),"...",sep="")
display_df$seq_name <- gsub("tr\\|[^\\|]*\\|","",display_df$seq_name)
knitr::kable(display_df)
```

Calculate the probability that each protein is an antimicrobial peptide with `predict_amps()`. Since these proteins are all full-length precursors rather than mature peptides, we use `ampir`'s built-in precursor model.

*Note that amino acid sequences that are shorter than 10 amino acids and/or contain anything other than the 20 standard amino acids are not evaluated and will have `NA` as their `prob_AMP` value.*

```{r}
my_prediction <- predict_amps(my_protein_df, model = "precursor")
```

```{r, echo=FALSE}
my_prediction$seq_aa <- paste(substring(my_prediction$seq_aa,1,45),"...",sep="")
my_prediction$seq_name <- gsub("tr\\|[^\\|]*\\|","",my_prediction$seq_name)
knitr::kable(my_prediction, digits = 3)
```
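
To anticipate which sequences will come back with `NA`, the two conditions in the note above can be checked in base R before prediction. This is a rough sketch, not ampir's own validation code: it assumes sequences use uppercase one-letter amino acid codes, and the example data frame is made up for illustration.

```r
# Hypothetical input, for illustration only
example_df <- data.frame(
  seq_name = c("ok_protein", "too_short", "nonstandard_residue"),
  seq_aa   = c("MKLFVLLALVAVASAQECKYRG", "MKLFV", "MKLFVLLALVXVASAQECKYRG"),
  stringsAsFactors = FALSE
)

# TRUE when a sequence consists only of the 20 standard amino acid letters
only_standard <- grepl("^[ACDEFGHIKLMNPQRSTVWY]+$", example_df$seq_aa)

# TRUE when a sequence has at least 10 residues
long_enough <- nchar(example_df$seq_aa) >= 10

# Sequences failing either check are the ones expected to receive NA
example_df$expect_na <- !(only_standard & long_enough)
example_df[, c("seq_name", "expect_na")]
```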

Proteins with a predicted probability at or above a chosen threshold (0.8 in this example) can then be extracted for writing to a FASTA file:

```{r}
my_predicted_amps <- my_protein_df[which(my_prediction$prob_AMP >= 0.8),]
```

```{r, echo=FALSE}
my_predicted_amps$seq_aa <- paste(substring(my_predicted_amps$seq_aa,1,45),"...",sep="")
my_predicted_amps$seq_name <- gsub("tr\\|[^\\|]*\\|","",my_predicted_amps$seq_name)
knitr::kable(my_predicted_amps)
```
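
The `which()` call in the filter matters when some `prob_AMP` values are `NA`. The small base R sketch below, using made-up probabilities, shows the difference: plain logical indexing keeps an all-`NA` row for every `NA` comparison, whereas `which()` keeps only the rows where the comparison is `TRUE`.

```r
# Hypothetical probabilities, including one NA for an unevaluated sequence
na_demo_df <- data.frame(
  seq_name = c("a", "b", "c"),
  prob_AMP = c(0.95, NA, 0.10),
  stringsAsFactors = FALSE
)

# Plain logical indexing keeps an all-NA row for the NA comparison
with_na_row <- na_demo_df[na_demo_df$prob_AMP >= 0.8, ]

# which() returns only the positions where the comparison is TRUE
without_na_row <- na_demo_df[which(na_demo_df$prob_AMP >= 0.8), ]

nrow(with_na_row)     # 2
nrow(without_na_row)  # 1
```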

Write a `data.frame` with sequence names in the first column and protein sequences in the second column to a FASTA-formatted file with `df_to_faa()`:

```{r, eval=FALSE}
df_to_faa(my_predicted_amps, "my_predicted_amps.fasta")
```
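
For readers unfamiliar with the format, a FASTA file stores each record as a header line starting with `>` followed by the sequence. The sketch below produces that layout in base R for a made-up data frame. It only illustrates the file format and is not how `df_to_faa()` is implemented (for example, it does not wrap long sequences across lines).

```r
# Hypothetical sequences, for illustration only
fasta_demo_df <- data.frame(
  seq_name = c("example_protein_1", "example_protein_2"),
  seq_aa   = c("MKLFVLLALVAVASAQECKYRG", "GIGKFLHSAKKFGKAFVGEIMNS"),
  stringsAsFactors = FALSE
)

# Interleave ">name" header lines with their sequences
fasta_lines <- as.vector(
  rbind(paste0(">", fasta_demo_df$seq_name), fasta_demo_df$seq_aa)
)

# Write to a temporary file so the working directory is left untouched
out_file <- tempfile(fileext = ".fasta")
writeLines(fasta_lines, out_file)

# Read the file back to see the resulting layout
readLines(out_file)
```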