
4. tutorial on GECCO based analysis of BGCs from Cutibacterium avidum and Cutibacterium acnes

Rauf Salamzade edited this page Aug 3, 2024 · 11 revisions

Overview

This tutorial uses the testing dataset of 7 genomes from the species Cutibacterium avidum and Cutibacterium acnes, both of which are common on healthy human skin. As you will see, these genomes have few predicted BGCs, so the workflow should finish in 10-20 minutes using just 4 threads.

Download input dataset

# get the input dataset
wget https://github.com/Kalan-Lab/lsaBGC-Pan/raw/main/test_case.tar.gz

# uncompress it
tar -zxvf test_case.tar.gz

# view the genomes
ls -lht test_case/input_genomes/
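
Before starting the run, it can help to confirm that all 7 genomes were extracted. The following is a minimal sketch, not part of lsaBGC-Pan: the helper name count_genomes is hypothetical, and it assumes each genome is a single file directly under test_case/input_genomes/.

```shell
# Hypothetical helper: count regular files directly under a directory.
# Assumes one file per genome and no subdirectories of genomes.
count_genomes() {
    find "$1" -maxdepth 1 -type f | wc -l | tr -d ' '
}

# Only report when the tutorial directory is present.
if [ -d test_case/input_genomes ]; then
    echo "Genomes found: $(count_genomes test_case/input_genomes)"
fi
```

If the reported number is not 7, the download or extraction step most likely did not complete.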

Part 1:

Let's get to it and run lsaBGC-Pan (you can view all the options by issuing lsaBGC-Pan -h):

lsaBGC-Pan -g test_case/input_genomes/ -o CaCa_Pan_Results/ --threads 8
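
The command above requests 8 threads. On a machine with fewer cores, you may want to cap that value. The sketch below is one way to do so: pick_threads is a hypothetical helper, the core count comes from getconf (assumed to be available on your system), and the final line only prints the command instead of running it.

```shell
# Hypothetical helper: return the smaller of the requested thread count
# and the number of available cores.
pick_threads() {
    requested=$1
    available=$2
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Fall back to 1 core if getconf cannot report the online CPU count.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
threads=$(pick_threads 8 "$cores")

# Print the resulting command so it can be reviewed before running it.
echo "lsaBGC-Pan -g test_case/input_genomes/ -o CaCa_Pan_Results/ --threads $threads"
```

Only the -g, -o and --threads options shown in this tutorial are used here.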

lsaBGC-Pan will process the genomes, perform gene calling, and run GECCO for BGC prediction in step 1. In step 2, it will infer ortholog groups across the genomes using OrthoFinder and process the results to account for applying the program on bacterial genomes. In steps 3 and 4, lsaBGC-Pan will run phylogenate and popstrat, programs within lsaBGC, for customized phylogeny construction and delineation of populations/clades based on core-genome identity thresholds. Clustering of BGCs into GCFs is the final step of part 1 of the workflow, after which lsaBGC-Pan exits with the following message so that users can adjust the parameters controlling GCF clustering and population stratification:

Breaking workflow! 
------------------
This is to give you an opportunity to adjust parameters for GCF
clustering and population stratification based on manual assessment
of the PDF report showing the impact of parameters on clustering
found at: /home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/GCF_Clustering/ 
and PDF graphics of the species phylogeny with 
different population divisions marked which can be found in the 
directory /home/rauf/Projects/KalanLab/update_lsabgc/test/test_case/lsaBGC-Pan_Results/Delineate_Populations/. 
For more insight into how to select appropriate parameters, 
check out: 
https://github.com/Kalan-Lab/lsaBGC-Pan/wiki/7.-guide-to-parameter-selections-during-midway-break
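
The paths in the message above come from the author's own run; yours will sit under whatever directory you passed to -o. As a minimal sketch for locating the reports, the snippet below lists PDF files in the two subdirectories named in the message. The helper name list_pdfs is hypothetical, and it assumes the reports carry a .pdf extension and that the results directory is CaCa_Pan_Results.

```shell
# Hypothetical helper: list PDF files directly under a directory, sorted by name.
list_pdfs() {
    find "$1" -maxdepth 1 -type f -name '*.pdf' | sort
}

# Assumed location: the -o value used earlier in this tutorial.
results_dir="CaCa_Pan_Results"

# Subdirectory names are taken from the midway-break message.
for sub in GCF_Clustering Delineate_Populations; do
    if [ -d "$results_dir/$sub" ]; then
        echo "== $sub =="
        list_pdfs "$results_dir/$sub"
    fi
done
```

Nothing is printed if the directories do not exist, so the snippet is safe to run before the workflow has finished.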

We can take a look at the Delineate_Populations/ directory mentioned in the message, which contains the sample population classifications overlaid on the species phylogeny. At a 95% protein identity threshold, we get two clades corresponding to the two species. The phylogeny also contains only 6 of the 7 input genomes, because one genome was dropped after no BGCs were found in it.

We also see that the