Releases: iquasere/KEGGCharter
"--resume" now evaluates the files already produced
For both data_for_charting.tsv and taxon_to_mmap_to_orthologs.json:
- if the --resume parameter was used and the file is found, KEGGCharter won't generate it again.
- else, KEGGCharter will generate the data again, and overwrite the file if it exists.
Also, an important fix
On retrieving KEGG taxa prefixes: checks with type(taxa) == str now, instead of taxa != np.nan (a comparison that is always true, since NaN is unequal to every value).
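A minimal sketch of why the old check let missing values through. The function names are hypothetical, and float("nan") stands in for np.nan (which is an ordinary float NaN) so that the example needs no imports.

```python
nan = float("nan")  # same kind of value that numpy exposes as np.nan

def is_valid_old(taxa):
    # NaN compares unequal to everything, itself included,
    # so this returns True even when taxa is a missing value
    return taxa != nan

def is_valid_new(taxa):
    # Only real strings pass; NaN is a float and is rejected
    return type(taxa) == str

print(is_valid_old(nan))    # True: the missing value slips through
print(is_valid_new(nan))    # False
print(is_valid_new("eco"))  # True
```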
New needs, new regexes
Changed regex for EC numbers to account for provisional ECs
Changed ^(\d+)(\.(\d+|-)){3}$ to ^(\d+)(\.(\d+|-)){2}(\.(.*))?$, which accepts provisional EC numbers (e.g., 1.1.1.n1).
Changed regex for KEGG IDs to account for other taxonomy codes
Changed ^[A-Za-z]{3}:.+$ to ^[A-Za-z]+:.+$ to accept taxonomy codes that have fewer or more than three characters (e.g., pall:UYA_22060).
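The effect of both regex changes can be checked directly with Python's re module. The patterns and example values are the ones quoted in these notes; the variable names are only illustrative.

```python
import re

OLD_EC = re.compile(r"^(\d+)(\.(\d+|-)){3}$")
NEW_EC = re.compile(r"^(\d+)(\.(\d+|-)){2}(\.(.*))?$")
OLD_KEGG = re.compile(r"^[A-Za-z]{3}:.+$")
NEW_KEGG = re.compile(r"^[A-Za-z]+:.+$")

checks = [
    ("EC number", OLD_EC, NEW_EC, "1.1.1.n1"),
    ("KEGG ID", OLD_KEGG, NEW_KEGG, "pall:UYA_22060"),
]
for label, old, new, value in checks:
    # bool(...) turns the match object (or None) into True / False
    print(label, value, "old:", bool(old.match(value)), "new:", bool(new.match(value)))
```

Note that the new EC pattern is more permissive than the old one in other ways as well: the fourth field is optional and unconstrained, so a three-field value such as 1.1.1 also matches.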
Also, some bug fixes
- One of the weirdest bugs ever - pandas.DataFrame.groupby has a maximum number of columns (20).
- Fix on saving box2taxon when it is empty
- Also removed some code from the time only one functional column was considered at a time
Important fixes on ID cross-referencing, validation of functional ID columns and colormap picking
Validation of input data columns implemented
Four regexes check whether the values in the columns are valid.
KEGG ID: ^[A-Za-z]{3}:.+$
KO: ^K\d{5}$
EC number: ^(\d+)(\.(\d+|-)){3}$
COG: ^COG\d{4}$
A cell can hold several comma-separated values, but each value between commas must match the regex.
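A minimal sketch of this kind of per-value validation, using the four regexes listed above. The PATTERNS dictionary and the cell_is_valid helper are hypothetical names, not KEGGCharter's own code.

```python
import re

PATTERNS = {
    "KEGG ID": re.compile(r"^[A-Za-z]{3}:.+$"),
    "KO": re.compile(r"^K\d{5}$"),
    "EC number": re.compile(r"^(\d+)(\.(\d+|-)){3}$"),
    "COG": re.compile(r"^COG\d{4}$"),
}

def cell_is_valid(cell, kind):
    """Return True if every comma-separated value in `cell` matches the regex for `kind`."""
    return all(PATTERNS[kind].match(value) for value in cell.split(","))

print(cell_is_valid("K00001,K00002", "KO"))   # both values match
print(cell_is_valid("K00001, K00002", "KO"))  # the space makes the second value invalid
print(cell_is_valid("K00001;K00002", "KO"))   # semicolon is not a delimiter
```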
Also, several fixes
Fix on adding new ids from API
Merging new IDs with old ones was creating some disconnect between old and new columns, and the new IDs were being placed in new columns disconnected from the rest of the dataframe. It's fixed now.
Differential colormap starts at 0
Before, the colormap was being generated between the minimum and maximum values of the dataframe. Now, it begins at 0, up to the maximum of the dataframe.
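To illustrate the difference, the sketch below assumes a plain linear normalisation of values onto the [0, 1] range over which a colormap is sampled. The scale function and the numbers are made up for illustration, and this is not KEGGCharter's actual code.

```python
def scale(value, vmin, vmax):
    """Map `value` linearly onto [0, 1], given the range [vmin, vmax]."""
    return (value - vmin) / (vmax - vmin)

values = [10, 15, 20]  # made-up quantification values

# Old behaviour: range spans the minimum and maximum of the data,
# so the smallest value always lands on the first colour
old = [scale(v, min(values), max(values)) for v in values]

# New behaviour: range starts at 0, so colours reflect absolute magnitude
new = [scale(v, 0, max(values)) for v in values]

print(old)  # [0.0, 0.5, 1.0]
print(new)  # [0.5, 0.75, 1.0]
```

With matplotlib, the same idea would typically be expressed by fixing vmin to 0 in the normalisation passed to the colormap.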
Implemented new parameter for choosing colormap of differential maps
--differential-colormap allows choosing a different colormap instead of the default (summer). Valid values can be consulted at matplotlib.
Also, KEGGCharter now only creates output dirs when it passes input file validation
Fix on having cog2ko available
Must be updated on the meta. Lines change:
cp *.py resources/KEGGCharter_prokaryotic_maps.txt resources/cog2ko_keggcharter.tsv $PREFIX/share &&
to
cp *.py *.txt *.tsv $PREFIX/share &&
Fix on checking for columns of functional IDs
KEGGCharter was only looking for KEGG IDs, KOs and EC numbers columns when checking whether some functional IDs column was inputted.
This would make it exit with an error if only a column with COG IDs was inputted.
Now it also looks for COGs columns, and accepts input with only a COG IDs column.
Also am trying to understand why it doesn't find cog2ko.tsv.
KEGGCharter as a proper tool of science
Implemented COG2KO
This idea belongs to Lovro Grum. For each KO, COGs are extracted from their KEGG HTML page. This information is reversed, and becomes COG to KO conversion.
New database, making KEGGCharter far more powerful! Makes for a great synergy with reCOGnizer.
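The reversal step can be sketched as follows. The reverse_mapping helper is a hypothetical name, and the IDs are placeholders chosen only to show the shape of the data, not real KO-COG associations.

```python
from collections import defaultdict

def reverse_mapping(ko_to_cogs):
    """Turn {KO: [COGs]} into {COG: [KOs]}."""
    cog_to_kos = defaultdict(set)
    for ko, cogs in ko_to_cogs.items():
        for cog in cogs:
            cog_to_kos[cog].add(ko)
    # Sorted lists make the result deterministic and easy to write to a table
    return {cog: sorted(kos) for cog, kos in cog_to_kos.items()}

# Placeholder data: COGs as they might be extracted per KO
ko_to_cogs = {"K00001": ["COG0001", "COG0002"], "K00002": ["COG0001"]}
print(reverse_mapping(ko_to_cogs))
```

One COG can map to several KOs after the reversal, which is why the values are lists.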
Because this is web scraping, 403 - Forbidden and Timeouts may often occur.
KEGGCharter waits some time between failed tries, and at the end checks for any KOs whose HTMLs were not retrieved. It tries to retrieve those as well.
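A retry strategy of this kind might look like the sketch below. Everything here is an assumption made for illustration: the fetch_all name, the number of tries, the waiting time, and the fetch callable that stands in for the actual page download.

```python
import time

def fetch_all(kos, fetch, tries=3, wait=5.0, sleep=time.sleep):
    """Fetch one page per KO, waiting between failed tries.

    `fetch` is any callable that returns the page for a KO or raises on failure.
    Returns the retrieved pages and the KOs that could not be retrieved.
    """
    pages, failed = {}, []
    for ko in kos:
        for attempt in range(tries):
            try:
                pages[ko] = fetch(ko)
                break
            except Exception:
                if attempt < tries - 1:
                    sleep(wait)  # give the server some time before retrying
        else:
            failed.append(ko)  # no try succeeded

    # Final pass: one more attempt for every KO that is still missing
    still_missing = []
    for ko in failed:
        try:
            pages[ko] = fetch(ko)
        except Exception:
            still_missing.append(ko)
    return pages, still_missing
```

Passing the sleep function as a parameter keeps the sketch easy to test without real waiting.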
Sanitization of input file
Checks if:
- inputted columns exist in the input file
- --kegg-column, --ko-column, --ec-column, --cog-column columns don't have invalid values / bad characters (" " and ";").
Added parameter for dividing quantification of each enzyme by the KOs assigned to it
When set, the --distribute-quantification parameter will instruct KEGGCharter to split the quantification of each enzyme by all the KOs that were assigned to it.
This information is outputted in data_for_charting.tsv.
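As a worked example, the sketch below assumes the split is an equal share per KO; the notes do not state the exact rule, so treat this as one plausible reading. The distribute helper and the values are hypothetical.

```python
def distribute(quantification, kos):
    """Split one enzyme's quantification equally among the KOs assigned to it."""
    share = quantification / len(kos)
    return {ko: share for ko in kos}

# An enzyme quantified at 90.0 and assigned to three KOs
print(distribute(90.0, ["K00001", "K00002", "K00003"]))
# {'K00001': 30.0, 'K00002': 30.0, 'K00003': 30.0}
```

Under this reading, the shares of one enzyme always add up to its original quantification, so nothing is counted more than once.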
New tests for several different parameters' combinations
- show-available-maps for --show-available-maps parameter.
- input-quantification-and-taxonomy for --input-taxonomy and --input-quantification parameters.
- include-missing-genomes for --include-missing-genomes parameter.
- map-all for --map-all parameter.
New output folders and writing of JSON information
KEGGCharter now stores metabolic maps representations in a maps folder. No brainer.
KEGGCharter additionally stores the information concerning the maps in a json folder. This folder will contain the dictionaries used for generating both the potential and differential maps.
"Potential" JSONs come in the form {box_id: [tax1, tax2, ...]}.
"Differential" JSONs come in the form {box_id: [col1, col2, ...]}. In the future, these should include the quantification value instead.
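A short sketch of how such a dictionary could be consumed after loading it from JSON. The box IDs and taxon names are placeholders, and boxes_per_taxon is a hypothetical helper that inverts the "potential" form.

```python
import json

# Placeholder content in the "potential" form: box ID -> taxa for that box
potential = {"12": ["Taxon A", "Taxon B"], "15": ["Taxon A"]}

def boxes_per_taxon(box_to_taxa):
    """Invert {box_id: [taxa]} into {taxon: [box_ids]}."""
    result = {}
    for box_id, taxa in box_to_taxa.items():
        for taxon in taxa:
            result.setdefault(taxon, []).append(box_id)
    return result

text = json.dumps(potential, indent=2)  # the kind of text stored in a JSON file
loaded = json.loads(text)               # what a later analysis would read back
print(boxes_per_taxon(loaded))
```

JSON object keys are always strings, so box IDs come back as strings even if they were integers when written.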
Also added lxml as a dependency.
Sanitization of input file
Forces input file to have the columns specified through the command line.
Applies to taxa-column, kegg-column, ko-column, ec-column and columns specified through --quantification-columns.
Information from "kegg-column", "ko-column" and "ec-column" is now all combined
Multiple new columns are now outputted, depending on the source of information, e.g., KO (kegg-column) contains the KOs obtained from the IDs on the column specified with -keggc.
All KOs obtained are grouped into the KO (KEGGCharter) column, now the only one used for charting functions.
Multiple IDs in the same cell now accepted and considered properly
Comma (,) is the only delimiter accepted for parsing multiple IDs inside the same cell.
Multiple KEGG IDs were accepted before, if separated by semicolon (;). This is now deprecated, and they must come comma-separated.
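Handling several IDs per cell amounts to expanding a row into one row per ID, working on the single IDs, and then grouping them back. The sketch below shows that cycle on plain dictionaries; the expand and compress helpers and the column names are hypothetical, and KEGGCharter itself works on a dataframe.

```python
def expand(rows, column):
    """One output row per comma-separated value found in `column`."""
    out = []
    for row in rows:
        for value in row[column].split(","):
            out.append({**row, column: value})
    return out

def compress(rows, key, column):
    """Group rows by `key` and join the values of `column` with commas.

    Only `key` and `column` are kept in this simplified sketch.
    """
    grouped = {}
    for row in rows:
        grouped.setdefault(row[key], []).append(row[column])
    # dict.fromkeys drops duplicates while keeping the original order
    return [{key: k, column: ",".join(dict.fromkeys(v))} for k, v in grouped.items()]

rows = [{"protein": "P1", "KO": "K00001,K00002"}, {"protein": "P2", "KO": "K00003"}]
expanded = expand(rows, "KO")              # three rows, one KO each
print(compress(expanded, "protein", "KO")) # back to one row per protein
```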
"Data" dataframe extends and compresses with each cycle of ID conversion.
Simplified input of quantification columns
No more --genomic-columns or --transcriptomic-columns, only --quantification-columns (-tcols) now.
All maps ("potential" and "differential") are produced for those columns.
"gene" features now also mapped
KEGGCharter was only considering the orthologs attribute of the Pathway instances, but some boxes are present in the KGML as gene features. Now, KEGGCharter considers those as well.
Restructured the repo, simplified CICD, improved output to the command line, performance improvements
Maps inside resources folder, all yamls and CI files in cicd folder.
Much smaller keggcharter_input.tsv is still enough to build nice maps.
Had to specify version of libarchive (3.6.2=h039dbb9_1) in the Dockerfile.
More comprehensive messages.
Lighter progress bars.
--map-all workflow was running the write_kgmls function for all taxa. It now simply runs for ko, and associates information to all taxa. Much faster, less dumb.
New options for dealing with tax information
The original workflow of KEGGCharter attempts to download taxon-specific KGMLs for organisms in KEGG Genomes (Fig. 1).
Fig. 1 - Original KEGGCharter workflow. Only arcticus had KOs with functions for the TCA cycle attributed that, simultaneously, were present in the KGML for the TCA cycle and the taxon arcticus.
This type of workflow uses both taxon-specific information and results from the datasets inputted. All functions represented are validated by KEGG (i.e., those functions are available for those organisms), but many identifications may be lacking, since information at KEGG is often incomplete.
Setting "--include-missing-genomes" represents organisms that are not in KEGG Genomes
Organisms that are not identified in KEGG Genomes can still be represented, if running KEGGCharter with the option --include-missing-genomes. All functions for the KOs identified for that organism will be represented (Fig. 2).
Fig. 2 - KEGGCharter output expanded with --include-missing-genomes parameter. hydrocola is not present in KEGG Genomes, but all functions attributed to its KOs are still represented.
This setting still yields validated information for the taxonomies that are present in KEGG Genomes, while also allowing for representation of organisms not present in KEGG Genomes. It should offer the best compromise between false positives and false negatives.
Setting "--map-all" ignores KEGG Genomes completely, and represents all functions identified
Functions that are not present in organism-specific KGMLs can still be represented, if running KEGGCharter with the option --map-all. This will bypass all taxon-specific KGMLs, and map all functions for all KOs present in the input dataset (Fig. 3).
Fig. 3 - KEGGCharter output expanded with --map-all parameter. No functions for oleophylus and franklandus were simultaneously present in the KOs identified and available in their KGMLs. In this case, the requirement for presence in the KGMLs is bypassed, and all functions are represented for all taxa.
This setting represents the most information on the KEGG maps, and will produce the most colourful representations, but will likely return many false positives. Maps produced should be analyzed with caution. This setting may be required, however, if information for organisms in KEGG Genomes is very incomplete.
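The three behaviours described above can be summarised as set logic. This is only a sketch of the decision rule as described in these notes: the kos_to_chart function, its mode strings and the KO values are hypothetical, and None is used here to mark a taxon that is absent from KEGG Genomes.

```python
def kos_to_chart(identified_kos, taxon_kgml_kos, mode="default"):
    """Return the KOs represented on the map for one taxon.

    identified_kos: KOs found for the taxon in the input dataset.
    taxon_kgml_kos: KOs present in the taxon-specific KGML,
                    or None if the taxon is not in KEGG Genomes.
    mode: "default", "include-missing-genomes" or "map-all".
    """
    identified = set(identified_kos)
    if mode == "map-all":
        return identified  # taxon-specific KGMLs are bypassed entirely
    if taxon_kgml_kos is None:
        # Taxon unknown to KEGG Genomes: only charted when explicitly allowed
        return identified if mode == "include-missing-genomes" else set()
    # Taxon known to KEGG Genomes: keep only the KOs validated by its KGML
    return identified & set(taxon_kgml_kos)

found = {"K00001", "K00002"}
print(kos_to_chart(found, {"K00001"}))                      # validated KOs only
print(kos_to_chart(found, None))                            # nothing is charted
print(kos_to_chart(found, None, "include-missing-genomes")) # all identified KOs
print(kos_to_chart(found, {"K00001"}, "map-all"))           # all identified KOs
```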
Fixed mapping boxes' IDs and submitting too many IDs to KEGG
Major fix in mapping boxes IDs and positions in orthologs array
Difference between mapping by box.id and by the index in the pathway.orthologs array.
Also changed default "step" to 40
KEGG's API reports fewer ID mappings if many IDs are submitted in the same request.
Smaller requests take much longer overall, but all information will be obtained.
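Submitting IDs in groups of at most "step" can be sketched as below. The batches helper and the ID values are hypothetical; only the step of 40 comes from these notes.

```python
def batches(ids, step=40):
    """Yield consecutive slices of `ids` with at most `step` elements each."""
    for start in range(0, len(ids), step):
        yield ids[start:start + step]

ids = [f"id{i}" for i in range(100)]  # placeholder IDs
sizes = [len(batch) for batch in batches(ids)]
print(sizes)  # [40, 40, 20]
```

Each batch would then go into its own request, trading more requests for complete results.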