Skip to content

4. tutorial on GECCO based analysis of BGCs from Cutibacterium avidum and Cutibacterium acnes

Rauf Salamzade edited this page Aug 4, 2024 · 11 revisions

Overview

This tutorial uses the testing dataset of 7 genomes belonging to the species Cutibacterium avidum and Cutibacterium acnes, which are common species on the healthy human skin. As you will see, these genomes do not have a ton of predicted BGCs and so it should run in 10-20 minutes using just 4 threads.

Download input dataset

# get the input dataset
wget https://github.com/Kalan-Lab/lsaBGC-Pan/raw/main/test_case.tar.gz

# uncompress it
tar -zcvf test_case.tar.gz 

# view the genomes
ls -lht test_case/input_genomes/

Part 1

Let's get to it and run lsaBGC-Pan, you can view all the options by issuing: lsaBGC-Pan -h:

lsaBGC-Pan -g test_case/input_genomes/ -o CaCa_Pan_Results/ --threads 6

lsaBGC-Pan will process the genomes, perform gene calling, and run GECCO for BGC predictions in step 1. Then in step 2, it will perform ortholog group inference across the genomes using OrthoFinder and process the results to account for applying the program on bacterial genomes. In step 3 and 4, lsaBGC-Pan will run phylogenate and popstrat - programs within lsaBGC - for customized phylogeny construction and delineation of populations/clades based on core-genome identity thresholds. Clustering of BGCs into GCFs is the final step performed in part 1 of the workflow before lsaBGC-Pan exits with the following message to allow users to alter parameters controlling GCF clustering and population stratification:

Breaking workflow! 
------------------
This is to give you an opportunity to adjust parameters for GCF
clustering and population stratification based on manual assessment
of the PDF report showing the impact of parameters on clustering
found at: /home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/GCF_Clustering/ 
and PDF graphics of the species phylogeny with 
different population divisions marked which can be found in the 
directory /home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/. 
For more insight into how to select appropriate parameters, 
check out: 
https://github.com/Kalan-Lab/lsaBGC-Pan/wiki/7.-guide-to-parameter-selections-during-midway-break

Selecting population stratification parameters

We can take a look at the directory mentioned with the sample population classifications overlaid upon the species phylogeny and we will see that at a 95% protein identity, we get two clades corresponding to the two species. We also find only 6 genomes instead of the 7 we input in the phylogeny. This is because one of the genomes is dropped because no BGCs were found in it.

Customized editing of Rscripts for plots

We also see that the plots don't look great, labels are cutoff and the legend is overlaying on top of the phylogeny. However, we can easily fix this and simply recreate the figure via adapting the Rscript used to create the plots. This is the file located in the parent directory of the PDFs, called phylo_plotting.R.

NOTE: All plots in lsaBGC-Pan are made using customizable Rscripts!

We can change the code chunk in it from

track.data <- read.table("/home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/Population_Listings/Sample_to_Population_95.txt", header=F, sep="\t")
colnames(track.data) <- c("name", "population")
tree.labels <- phylo.tree$tip.label
gg_tr <- ggtree(phylo.tree) + geom_tiplab() + ggtitle("Identity Cutoff95")
gg_tr <- gg_tr %<+% track.data + geom_tippoint(aes(color=as.factor(population)), show.legend=T, size=3)
pdf("/home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/Phylogenetic_Views/Phylo_View_of_Population_Designations_95.pdf", height=10, width=10)
gg_tr + coord_cartesian(clip="off")
dev.off()

to

track.data <- read.table("/home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/Population_Listings/Sample_to_Population_95.0.txt", header=F, sep="\t")
colnames(track.data) <- c("name", "population")
tree.labels <- phylo.tree$tip.label
gg_tr <- ggtree(phylo.tree) + geom_tiplab() + ggtitle("Identity Cutoff95.0")
gg_tr <- gg_tr %<+% track.data + geom_tippoint(aes(color=as.factor(population)), show.legend=F, size=5)
pdf("/home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/Phylogenetic_Views/Phylo_View_of_Population_Designations_95.0.pdf", height=5, width=10)
gg_tr + coord_cartesian(clip="off")+ ggplot2::xlim(0, 0.2)
dev.off()

and rerun by issuing Rscript phylo_plotting.R to get a much more appealing figure that can now go in a publication:

Selecting parameters controlling clustering of BGCs into GCFs

Ok, next we want to assess which parameters to use for BGC clustering into GCFs. Do this we will take a look at the file: Plots_Depicting_Parameter_Influence_on_GCF_Clustering.pdf in the GCF_Clusterng/ directory the message pointed us to. These plots are described on the lsaBGC-Cluster page at: https://github.com/Kalan-Lab/lsaBGC/wiki/05.-Clustering-BGCs-into-GCFs. There are some minor differences in lsaBGC-Cluster and these plots between what we had in the original lsaBGC suite and what is present in lsaBGC-Pan.

NOTE: lsaBGC-Cluster in lsaBGC-Pan also has an option to control leniency in allowing BGCs which are near scaffold/contig edges to be grouped in with GCFs so that you do not get a ton of singleton GCFs due to technical artifacts if you have some low quality assemblies incorporated in your analysis.

While there are different types of plots, we can clearly see that using an MCL inflation parameter of 0.8 and a Jaccard Index cutoff of 0% results in a single GCF with multiple annotation categories whereas increasing the Jaccard Index cutoff to 20% results in a cleaner partition:

Part 2

Now that we have selected values for parameters controlling how populations are defined and GCFs clustered, we can simply restart lsaBGC-Pan as we did before. Usage of checkpoints within lsaBGC-Pan will avoiding re-running steps which had already completed successfully, however, GCF clustering and population delineation will always be re-run!

lsaBGC-Pan -g input_genomes/ -o lsaBGC-Pan_Results/ -threads 6 -ci 0.8 -cj 20.0 -pic 95.0

And that's it! This will run the full pipeline from start to finish. To learn how to interpret the major result files check out the wiki page: 8. explanation of final spreadsheet and visual reports.