Run FastOMA on your own grouping
Users can provide their own initial grouping of proteins (rootHOGs) for use with FastOMA. This can be done in two ways:
- Running the two processes hog_rest and collect_subhog in FastOMA.nf on the user's protein family in FASTA format.
- Providing the group mapping of proteins in OMAmer format.
For the first approach, note that each FASTA record must have a header of the form >gene_name||species_name||unique_integerID, as in the example below.
>ANAPLA_R14405||ANAPLA||1134003114 ANAPLA_R14405
SPMFDGKVPHWHHYSCFWKRARIVSHTDIDGFPELRWEDQEKIKKAIETGGPGGGGDQEG
GGKAEKSLNDFAAEYAKSNRSTCKGCEQKIEK
>OREMEL_R06256||OREMEL||1323005702 OREMEL_R06256
MASKRHAVPPKQQDGKGKKVKRGEEDDVWSSTLAALKTAPKEKPPATIDGLCPLSSMPGA
QVYEDYDCTLNQTNISANNNKFYIIQLLEHDGAYSVW
Whatever comes after the space in the record ID does not matter.
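A minimal sketch of how existing FASTA headers could be rewritten into this layout. The helper names `format_header` and `rewrite_fasta` are hypothetical and not part of FastOMA. The sketch assumes the double-pipe separator seen in the example records, takes the first whitespace-delimited token of the old header as the gene name, and uses a simple running counter as the unique integer ID.

```python
from typing import Iterable, Iterator


def format_header(gene_name: str, species_name: str, unique_id: int) -> str:
    """Build a header line in the layout of the example records."""
    # Assumption: fields are joined with "||", as in the example records.
    # The text after the space is free-form; here it repeats the gene name.
    return f">{gene_name}||{species_name}||{unique_id} {gene_name}"


def rewrite_fasta(
    lines: Iterable[str], species_name: str, first_id: int = 1
) -> Iterator[str]:
    """Yield FASTA lines with every header replaced by a formatted one."""
    next_id = first_id
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("FASTA header without a record name")
            # The first token of the old header is used as the gene name.
            yield format_header(tokens[0], species_name, next_id)
            next_id += 1
        else:
            # Sequence lines are passed through unchanged.
            yield line
```

For example, `list(rewrite_fasta(open("family.fa"), "ANAPLA"))` would return the rewritten lines for a hypothetical input file `family.fa`.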
In the manuscript, we described the InterProScan tool as an alternative to OMAmer+OMAdb.
For the second approach, you could write an adapter that produces, for each genome, a “.hogmap” file in TSV format with at least the columns qseqid, hogid, family_p, qseqlen and subfamily_medianseqlen.
The hogid column must be in the format “HOG:[A-Z][0-9]{7}..*”. Everything after the dot is truncated, so only the root HOG ID is used. The family_p, qseqlen and subfamily_medianseqlen columns are used to identify the best isoform when a gene has several, so you will probably not need them, at least not in the beginning.
See example:
# qseqid hogid family_p qseqlen subfamily_medianseqlen
sp|P15943|APLP2_RAT HOG:B0595810.1a 1.0 0.9889574448284 0.9876786945621626 766 733
tr|H2Q546|H2Q546_PANTR HOG:B0595810.1a.1b 1.0 0.9986343792025645 0.9986343792025645 762 733
tr|E3L250|E3L250_PUCGT HOG:B0811161 0.6180555555555556 0.08359449706661991 0.0161118229169709 145 438
With that, you should in principle be able to use FastOMA with your own initial rootHOGs.
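The adapter described above could be sketched as follows. This is an illustration under stated assumptions, not FastOMA code: the names `root_hog_id` and `write_hogmap` are hypothetical, and only the five listed columns are written. The regular expression follows the hogid format given above, with the dotted suffix treated as optional because the example contains an ID without one. Whether the header line should carry the leading `#` shown in the example is not confirmed here, so the sketch writes plain column names.

```python
import csv
import re
from typing import Iterable, Tuple

# Assumption: a root HOG ID is "HOG:" + one capital letter + seven digits,
# optionally followed by a dot and a sub-HOG suffix.
HOGID_PATTERN = re.compile(r"^(HOG:[A-Z][0-9]{7})(\..*)?$")

COLUMNS = ["qseqid", "hogid", "family_p", "qseqlen", "subfamily_medianseqlen"]


def root_hog_id(hogid: str) -> str:
    """Return the root HOG ID, i.e. the part before the first dot."""
    match = HOGID_PATTERN.match(hogid)
    if match is None:
        raise ValueError(f"unexpected hogid format: {hogid!r}")
    return match.group(1)


def write_hogmap(
    path: str, rows: Iterable[Tuple[str, str, float, int, int]]
) -> None:
    """Write one tab-separated .hogmap file for a single genome."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(COLUMNS)
        for qseqid, hogid, family_p, qseqlen, median_len in rows:
            # Validate the ID format; the ID itself is written unchanged.
            root_hog_id(hogid)
            writer.writerow([qseqid, hogid, family_p, qseqlen, median_len])
```

Calling `write_hogmap("mygenome.hogmap", rows)` once per genome, with `rows` taken from your own grouping, would produce one file per genome; the file name here is only a placeholder.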
If you run into any difficulties, feel free to contact us through a GitHub issue.