Maps identifiers and names to other identifiers and names.
$ echo -e "\tHomo sapiens" | nomer append itis
Homo sapiens SAME_AS ITIS:180092 Homo sapiens species Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Primates | Haplorrhini | Simiiformes | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180089 | ITIS:943773 | ITIS:943778 | ITIS:943782 | ITIS:180090 | ITIS:943805 | ITIS:180091 | ITIS:180092 kingdom | subkingdom | infrakingdom | phylum | subphylum | infraphylum | superclass | class | subclass | infraclass | order | suborder | infraorder | superfamily | family | subfamily | genus | species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180092
Standard out (stdout) is used for results, and standard error (stderr) is used for logging (e.g., progress reporting). Nomer is designed to work with *nix pipes or as a simple command-line tool.
Nomer expects tab-separated input in the form [term id]\t[term name]. To change this default behavior, select the columns used for id/name matching by defining alternate nomer.schema.* properties. Run the properties command to list the available properties.
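As a minimal sketch of the default two-column layout, the helper functions below turn a plain list of names or ids (one per line) into rows that Nomer can read. The function names are hypothetical and not part of Nomer; they use printf rather than echo -e so that the tab is emitted consistently across shells.

```shell
# Hypothetical helpers (not part of Nomer): build rows in the default
# [term id]\t[term name] layout from one value per line on stdin.

# One name per line -> rows with an empty id column.
names_to_rows() {
  while IFS= read -r name; do
    printf '\t%s\n' "$name"
  done
}

# One id per line -> rows with an empty name column.
ids_to_rows() {
  while IFS= read -r id; do
    printf '%s\t\n' "$id"
  done
}

# Example (requires Nomer to be installed):
#   printf 'Homo sapiens\nCanis lupus\n' | names_to_rows | nomer append itis
```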
Different kinds of matchers can be selected to do the term matching. Offline matching is supported by some matchers, like col and itis. Note that offline-enabled matchers initially download versioned taxonomic resources (e.g., an ITIS data dump as included in Nomer's Corpus of Taxonomic Resources*) and reuse the indexes until the cache is cleaned up. The cache itself can be archived so that results can be reproduced in a different environment without needing to rebuild the term match index. For prebuilt indexes, please inspect the release assets at https://github.com/globalbioticinteractions/nomer/releases/0.5.13 (e.g., Catalogue of Life: prebuilt index Nomer v0.5.13). Note, however, that indexes may take some time to build from scratch and, once built, can take hundreds of megabytes of disk space.
Matchers can be added by writing Java code that implements an interface.
Note that a Python wrapper was made available by nleguillarme at https://github.com/nleguillarme/pynomer.
Note that a NodeJS wrapper was made available by zedomel at https://github.com/zedomel/nodejs-nomer.
*Poelen, J. H. (ed.). (2024). Nomer Corpus of Taxonomic Resources hash://sha256/b60c0d25a16ae77b24305782017b1a270b79b5d1746f832650f2027ba536e276 hash://md5/17f1363a277ee0e4ecaf1b91c665e47e (0.27) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.12695629
Nomer needs Java 8+ and is developed and tested on OpenJDK 8. Why OpenJDK 8? Because OpenJDK 8 has long-term support (LTS) and, at the time of writing (Nov 2021), is supported until at least May 2026, longer than any of the newer OpenJDK versions.
Please see https://github.com/nleguillarme/pynomer for a Python wrapper.
Nomer is a stand-alone Java application, packaged in a jar file. You can build your own (see building) or download a prebuilt jar at releases.
On Linux and macOS, you can use the following script to install Nomer:
sudo sh -c '(curl -L https://github.com/globalbioticinteractions/nomer/releases/download/0.5.13/nomer.jar) > /usr/local/bin/nomer && chmod +x /usr/local/bin/nomer && nomer install-manpage' && nomer clean && nomer version
With this, you can now run things like nomer version instead of java -jar [some long dir path]/nomer.jar version.
Note that a Debian package (Debian, Ubuntu, etc.) is also available for use with the Advanced Package Tool (apt) via:
sudo apt update
sudo apt upgrade
curl -L https://github.com/globalbioticinteractions/nomer/releases/download/0.5.13/nomer.deb > nomer.deb
sudo apt install ./nomer.deb
To remove Nomer, run sudo apt remove nomer.
Nomer is made available through a Maven repository.
To include nomer in your project, add the following sections to your pom.xml (or the equivalent for sbt, Gradle, etc.):
<repositories>
  <repository>
    <id>depot.globalbioticinteractions.org</id>
    <url>https://depot.globalbioticinteractions.org/release</url>
  </repository>
</repositories>
<dependencies>
  <dependency>
    <groupId>org.globalbioticinteractions</groupId>
    <artifactId>nomer</artifactId>
    <version>0.0.1</version>
  </dependency>
</dependencies>
Please use Maven version 3.3+; otherwise you might run into build issues.
- Clone this repository
- Run tests using mvn test (optional)
- Run mvn package -DskipTests to build the (standalone) jar
- Copy nomer/target/nomer-[version]-jar-with-dependencies.jar to [some dir]/nomer.jar
For documentation, see docs/nomer.adoc or type man nomer in the terminal after installing Nomer.
As generated using:
nomer help
Usage: nomer [-hV] [COMMAND]
maps identifiers and names to other identifiers and names
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
version Show Version
replace Replace exact term matches in
row from stdin. The input
schema is used to select the
id and/or name to match to.
The output schema is used to
select the columns to write
into. If a term has multiple
matches, first match is used.
For example:
echo -e '\tHomo sapiens' | nomer
replace col
has expected result:
COL:6MB3T Homo sapiens
append Append term match to row from
stdin using id and name
columns specified in input
schema. Multiple matches
result in multiple rows.
For example:
echo -e '\tHomo sapiens' | nomer
append col
has expected result:
Homo sapiens HAS_ACCEPTED_NAME
COL:6MB3T Homo sapiens
Linnaeus, 1758 species Biota
| Animalia | Chordata |
Vertebrata | Gnathostomata |
Osteichthyes | Sarcopterygii |
Tetrapoda | Amniota | Mammalia
| Theria | Eutheria | Primates
| Haplorrhini | Simiiformes |
Hominoidea | Hominidae |
Homininae | Homo | Homo
sapiens COL:5T6MX | COL:N |
COL:CH2 | COL:8V4V3 | COL:
8V4V5 | COL:8VVWB | COL:8VSMX
| COL:9CK8W | COL:8VLBH | COL:
6224G | COL:924GT | COL:LG |
COL:8ZXYB | COL:4DT | COL:4PM
| COL:58L | COL:6256T | COL:
JPH | COL:636X2 | COL:6MB3T
unranked | kingdom | phylum |
subphylum | infraphylum |
parvphylum | gigaclass |
megaclass | superclass | class
| subclass | infraclass |
order | suborder | infraorder
| superfamily | family |
subfamily | genus | species |
| | | | | | | |
Linnaeus, 1758 | Parker &
Haswell, 1897 | Gill, 1872 |
Linnaeus, 1758 | Pocock, 1918
| Haeckel, 1866 | Gray, 1825 |
Gray, 1825 | Gray, 1825 |
Linnaeus, 1758 | Linnaeus,
1758 https://www.
catalogueoflife.
org/data/taxon/6MB3T
list, ls, dump, export Dumps all terms into the defined
output schema.
For example:
nomer ls col | head -n2
has expected result:
providedExternalId providedName
providedAuthorship
relationName
resolvedExternalId
resolvedName ...
resolvedAuthorship
resolvedRank
resolvedCommonNames
resolvedPath resolvedPathIds
resolvedPathNames
resolvedPathAuthorships
resolvedExternalUrlCOL:
001417c6-d3fc-4f42-aa3d-b1de3a5
92e58 Cheilostomatida incertae
sedis HAS_ACCEPTED_NAME COL:
001417c6-d3fc-4f42-aa3d-b1de3a5
92e58 Cheilostomatida incertae
sedis suborder Biota |
Animalia | Bryozoa |
Gymnolaemata | Cheilostomatida
| Cheilostomatida incertae
sedis COL:5T6MX | COL:N | COL:
622CG | COL:8ZXG2 | COL:84JWL
| COL:
001417c6-d3fc-4f42-aa3d-b1de3a5
92e58 unranked | kingdom |
phylum | class | order |
suborder | | | Allman, 1856
| Busk, 1852 | https://www.
catalogueoflife.
org/data/taxon/001417c6-d3fc-4f
42-aa3d-b1de3a592e58
matchers Lists supported matcher and
(optionally) their
descriptions.
properties Lists configuration properties.
Can be used to make a local
copy and override default
settings using the
[--properties=[local copy]]
option.
input-schema Show input schema in JSON.
output-schema Show output schema in JSON.
validate-terms Validate terms.
validate-term-link Validate term links.
clean Cleans term matcher cache.
config-man, config-manpage, install-manpage Installs/configures Nomer man
page, so you can type [man
nomer] on unix-like system to
learn more about Nomer.
gen-manpage Generates man pages for all
commands in the specified
directory.
help Displays help information about
the specified command
nomer version
produces:
0.5.13
nomer matchers -v
Result as of v0.5.13 (July 2024) is formatted as a table below:
matcher | description |
---|---|
ala | Lookup taxon in Atlas of Living Australia by name or by id using ALATaxon:* prefix. |
batnames | Lookup BatNames taxa by name, synonym using offline-enabled database dump |
bold-web | Use BOLD webservice to lookup taxa by bin/taxon id using BOLD:* and BOLDTaxon:* prefixes. |
col | Lookup Catalogue of Life taxon by name or COL:* prefixed ids using offline-enabled database dump |
crossref-doi | uses api.crossref.org to resolve doi associated with human readable citation |
discoverlife | Lookup DiscoverLife taxa by name, synonym using offline-enabled database dump |
envo | Lookup envo terms by name or by id using ENVO:* prefix. |
eol | Lookup EOL pages by id with EOL:* prefix using offline-enabled database dump |
gbif | Lookup GBIF taxa by name, synonym or id using offline-enabled database dump |
gbif-parse | Attempts extract canonical taxonomic name from name string using https://github.com/gbif/name-parser . |
gbif-web | Web-based taxon id/name lookup using GBIF backbone API and GBIF:* prefix. |
globalnames | Uses https://resolver.globalnames.org to match taxon names. Searches by name only (not id). |
globi | Uses GloBI's Taxon Graph to lookup terms by id or name across many taxonomies / ontologies. Caches a copy locally on first use to allow for subsequent offline usage. Use properties [nomer.term.cache.url] and [nomer.term.map.url] to override default cache and map locations. See https://doi.org/10.5281/zenodo.755513 for more information. |
globi-correct | Scrubs names using GloBI's (taxonomic) name scrubber. Scrubbing includes removing of stopwords (e.g., undefined), correcting common typos using a "crappy" names list, parse to canonical name using gnparser (see https://github.com/GlobalNamesArchitecture/gnparser), and more. |
globi-enrich | Uses GloBI's taxon enricher to find first term match by id or name. Uses various web apis like Encyclopedia of Life, World Registry of Marine Species (WoRMS), Integrated Taxonomic Information System (ITIS), National Biodiversity Network (NBN) and more. |
globi-rank | Finds taxonomic rank identifiers by rank commons name (e.g., species, order, soort). Uses Wikidata taxon rank items. Caches a copy locally on first usage to allow for subsequent offline usage. |
globi-suggest | Scrubs names using GloBI's (taxonomic) name scrubber. Scrubbing includes removing of stopwords (e.g., undefined), correcting common typos using a "crappy" names list, parse to canonical name using gnparser (see https://github.com/GlobalNamesArchitecture/gnparser), and more. |
gn-parse | Attempts extract canonical taxonomic name from name string using https://github.com/GlobalNamesArchitecture/gnparser . |
gulfbase | Look up taxa of https://gulfbase.org by name or id with BioGoMx:* prefix. |
inaturalist-id | Lookup taxon in iNaturalist by id with INAT_TAXON:* prefix. |
indexfungorum | Lookup Index Fungorum taxon by name or id using offline-enabled database dump |
itis | Lookup ITIS taxon by name or id using offline-enabled database dump |
itis-web | Use itis webservice to lookup taxa by id using ITIS:* prefix. |
mdd | Lookup Mammal Diversity Database (MDD) taxon by name or id using offline-enabled database dump |
nbn | Lookup taxon of National Biodiversity Network by id with NBN:* prefix. |
ncbi | Lookup NCBI taxa by name, synonym or id using offline-enabled database dump |
ncbi-web | Lookup NCBI taxon by id with NCBI:* prefix using web apis. |
nodc | Lookup taxon in the Taxonomic Code of the National Oceanographic Data Center (NODC) by id with prefix NODC: . Maps to ITIS terms if possible. |
openbiodiv | uses openbiodiv sparql endpoint to resolve openbiodiv terms |
orcid-web | Lookup ORCID by id with ORCID:* prefix. |
ott | Lookup Open Tree of Life taxon by name or (OTT|GBIF|WORMS|IF|NCBI|IRMNG)* prefixed ids using offline-enabled database dump |
pbdb | Lookup Paleobio Database taxon by name or id using offline-enabled database dump |
plazi | Lookup Plazi taxon treatment by name or id using offline-enabled database dump |
pmid-doi | resolves pubmed id to doi using https://www.ncbi.nlm.nih.gov/pmc/pmctopmid/ |
remove-stop-words | Removes stop words (e.g., undefined) using a stop word list specified by property [nomer.taxon.name.stopword.url] . |
tpt | Lookup TPT taxon by name or id using offline-enabled database dump |
translate-names | Translates incoming names using a two column csv file specified by property [nomer.taxon.name.correction.url] . |
uksi-current-name | Use UK Species Inventory to find current taxonomic name. |
wfo | Lookup World of Flora Online taxon by name or WFO:* prefixed ids using offline-enabled database dump |
wikidata | Lookup Wikidata taxon by name or id using offline-enabled database dump |
wikidata-web | uses wikidata to cross-walk taxon id across taxonomies |
worms | Lookup World Register of Marine Species by name or WORMS:* prefixed ids using offline-enabled database dump |
worms-web | Lookup taxon in WoRMS by name or by id with WORMS:* prefix. |
If you'd like to add new matchers, please open a new issue and describe your desires.
echo -e "NCBI:9606\t"\
| nomer append ncbi-web\
> matches.tsv
echo -e "\tHomo sapiens"\
| nomer append ncbi-web\
> matches.tsv
matches.tsv should now include entries like
$ cat matches.tsv
NCBI:9606 SAME_AS NCBI:9606 Homo sapiens species human @en cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606 | superkingdom | clade | kingdom | clade | clade | clade | phylum | subphylum | clade | clade | clade | clade | superclass | clade | clade | clade | class | clade | clade | clade | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
Similarly, you can match terms by id and produce JSON output, instead of tab-separated values using:
echo -e "NCBI:9606\tHomo sapiens"\
| nomer append ncbi-web -o json\
> matches.json
Now matches.json looks something like:
{
"species": {
"@id": "NCBITaxon:9606",
"name": "Homo sapiens",
"equivalent_to": {
"@id": "NCBITaxon:9606",
"name": "Homo sapiens"
}
},
"norank": {
"@id": "NCBITaxon:131567",
"name": "cellular organisms"
},
"superkingdom": {
"@id": "NCBITaxon:2759",
"name": "Eukaryota"
},
"clade": {
"@id": "NCBITaxon:33154",
"name": "Opisthokonta"
},
"kingdom": {
"@id": "NCBITaxon:33208",
"name": "Metazoa"
},
"phylum": {
"@id": "NCBITaxon:7711",
"name": "Chordata"
},
"subphylum": {
"@id": "NCBITaxon:89593",
"name": "Craniata"
},
"superclass": {
"@id": "NCBITaxon:8287",
"name": "Sarcopterygii"
},
"class": {
"@id": "NCBITaxon:40674",
"name": "Mammalia"
},
"superorder": {
"@id": "NCBITaxon:314146",
"name": "Euarchontoglires"
},
"order": {
"@id": "NCBITaxon:9443",
"name": "Primates"
},
"suborder": {
"@id": "NCBITaxon:376913",
"name": "Haplorrhini"
},
"infraorder": {
"@id": "NCBITaxon:314293",
"name": "Simiiformes"
},
"parvorder": {
"@id": "NCBITaxon:9526",
"name": "Catarrhini"
},
"superfamily": {
"@id": "NCBITaxon:314295",
"name": "Hominoidea"
},
"family": {
"@id": "NCBITaxon:9604",
"name": "Hominidae"
},
"subfamily": {
"@id": "NCBITaxon:207598",
"name": "Homininae"
},
"genus": {
"@id": "NCBITaxon:9605",
"name": "Homo"
},
"path": {
"names": [
"cellular organisms",
"Eukaryota",
"Opisthokonta",
"Metazoa",
"Eumetazoa",
"Bilateria",
"Deuterostomia",
"Chordata",
"Craniata",
"Vertebrata",
"Gnathostomata",
"Teleostomi",
"Euteleostomi",
"Sarcopterygii",
"Dipnotetrapodomorpha",
"Tetrapoda",
"Amniota",
"Mammalia",
"Theria",
"Eutheria",
"Boreoeutheria",
"Euarchontoglires",
"Primates",
"Haplorrhini",
"Simiiformes",
"Catarrhini",
"Hominoidea",
"Hominidae",
"Homininae",
"Homo",
"Homo sapiens"
],
"ids": [
"NCBI:131567",
"NCBI:2759",
"NCBI:33154",
"NCBI:33208",
"NCBI:6072",
"NCBI:33213",
"NCBI:33511",
"NCBI:7711",
"NCBI:89593",
"NCBI:7742",
"NCBI:7776",
"NCBI:117570",
"NCBI:117571",
"NCBI:8287",
"NCBI:1338369",
"NCBI:32523",
"NCBI:32524",
"NCBI:40674",
"NCBI:32525",
"NCBI:9347",
"NCBI:1437010",
"NCBI:314146",
"NCBI:9443",
"NCBI:376913",
"NCBI:314293",
"NCBI:9526",
"NCBI:314295",
"NCBI:9604",
"NCBI:207598",
"NCBI:9605",
"NCBI:9606"
],
"ranks": [
"",
"superkingdom",
"clade",
"kingdom",
"clade",
"clade",
"clade",
"phylum",
"subphylum",
"clade",
"clade",
"clade",
"clade",
"superclass",
"clade",
"clade",
"clade",
"class",
"clade",
"clade",
"clade",
"superorder",
"order",
"suborder",
"infraorder",
"parvorder",
"superfamily",
"family",
"subfamily",
"genus",
"species"
]
}
}
Using tools like jq, you can now do things like:
echo -e "NCBI:9606\tHomo sapiens"\
| nomer append -o json ncbi-web\
| jq .family
to list all the family taxa associated with the term.
$ echo -e "ITIS:180547" | nomer append itis
ITIS:180547 SAME_AS ITIS:180547 Enhydra lutris Species Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Carnivora | Caniformia | Mustelidae | Lutrinae | Enhydra | Enhydra lutris ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180539 | ITIS:552303 | ITIS:180545 | ITIS:552326 | ITIS:180546 | ITIS:180547 Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Family | Subfamily | Genus | Species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180547
$ echo -e "NCBI:9606" | nomer append ncbi
NCBI:9606 SAME_AS NCBI:9606 Homo sapiens species man @en | human @en cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Euarchontoglires | Primates | Haplorrhini | Simiiformes | Catarrhini | Hominoidea | Hominidae | Homininae | Homo | Homo sapiens NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314146 | NCBI:9443 | NCBI:376913 | NCBI:314293 | NCBI:9526 | NCBI:314295 | NCBI:9604 | NCBI:207598 | NCBI:9605 | NCBI:9606 | superkingdom | | kingdom | | | | phylum | subphylum | | | | | | | | | class | | | | superorder | order | suborder | infraorder | parvorder | superfamily | family | subfamily | genus | species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9606
$ echo -e "\tCanis lupus" | nomer append globalnames
Canis lupus SAME_AS NCBI:9612 Canis lupus species | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Laurasiatheria | Carnivora | Caniformia | Canidae | Canis | Canis lupus NCBI:131567 | NCBI:2759 | NCBI:33154 | NCBI:33208 | NCBI:6072 | NCBI:33213 | NCBI:33511 | NCBI:7711 | NCBI:89593 | NCBI:7742 | NCBI:7776 | NCBI:117570 | NCBI:117571 | NCBI:8287 | NCBI:1338369 | NCBI:32523 | NCBI:32524 | NCBI:40674 | NCBI:32525 | NCBI:9347 | NCBI:1437010 | NCBI:314145 | NCBI:33554 | NCBI:379584 | NCBI:9608 | NCBI:9611 | NCBI:9612 | superkingdom | | kingdom | | | | phylum | subphylum | | | | | | | | | class | | | | superorder | order | suborder | family | genus | species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=9612
Canis lupus SAME_AS OTT:247341 Canis lupus species | | Eukaryota | Opisthokonta | Holozoa | Metazoa | Eumetazoa | Bilateria | Deuterostomia | Chordata | Craniata | Vertebrata | Gnathostomata | Teleostomi | Euteleostomi | Sarcopterygii | Dipnotetrapodomorpha | Tetrapoda | Amniota | Mammalia | Theria | Eutheria | Boreoeutheria | Laurasiatheria | Carnivora | Caniformia | Canidae | Canis | Canis lupus OTT:805080 | OTT:93302 | OTT:304358 | OTT:332573 | OTT:5246131 | OTT:691846 | OTT:641038 | OTT:117569 | OTT:147604 | OTT:125642 | OTT:947318 | OTT:801601 | OTT:278114 | OTT:114656 | OTT:114654 | OTT:458402 | OTT:4940726 | OTT:229562 | OTT:229560 | OTT:244265 | OTT:229558 | OTT:683263 | OTT:5334778 | OTT:392223 | OTT:44565 | OTT:827263 | OTT:770319 | OTT:372706 | OTT:247341 no rank | no rank | domain | no rank | no rank | kingdom | no rank | no rank | no rank | phylum | subphylum | subphylum | superclass | no rank | no rank | class | no rank | superclass | no rank | class | subclass | no rank | no rank | superorder | order | suborder | family | genus | species https://tree.opentreeoflife.org/opentree/ottol@247341
Canis lupus SAME_AS INAT_TAXON:42048 Canis lupus species Animalia | Chordata | Mammalia | Carnivora | Canidae | Canis | Canis lupus kingdom | phylum | class | order | family | genus | species http://inaturalist.org/taxa/42048
Canis lupus SAME_AS ITIS:180596 Canis lupus Species Animalia | Bilateria | Deuterostomia | Chordata | Vertebrata | Gnathostomata | Tetrapoda | Mammalia | Theria | Eutheria | Carnivora | Caniformia | Canidae | Canis | Canis lupus ITIS:202423 | ITIS:914154 | ITIS:914156 | ITIS:158852 | ITIS:331030 | ITIS:914179 | ITIS:914181 | ITIS:179913 | ITIS:179916 | ITIS:179925 | ITIS:180539 | ITIS:552303 | ITIS:180594 | ITIS:180595 | ITIS:180596 Kingdom | Subkingdom | Infrakingdom | Phylum | Subphylum | Infraphylum | Superclass | Class | Subclass | Infraclass | Order | Suborder | Family | Genus | Species http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=180596
Canis lupus SAME_AS IRMNG:11407661 Canis lupus species Animalia | Chordata | Mammalia | Carnivora | Canidae | Canis | Canis lupus IRMNG:11 | IRMNG:148 | IRMNG:1310 | IRMNG:12116 | IRMNG:104585 | IRMNG:1282727 | IRMNG:11407661 kingdom | phylum | class | order | family | genus | species http://www.marine.csiro.au/mirrorsearch/ir_search.list_species?sp_id=11407661
Canis lupus SAME_AS GBIF:5219173 Canis lupus species Animalia | Chordata | Mammalia | Carnivora | Canidae | Canis | Canis lupus GBIF:1 | GBIF:44 | GBIF:359 | GBIF:732 | GBIF:9701 | GBIF:5219142 | GBIF:5219173 kingdom | phylum | class | order | family | genus | species http://www.gbif.org/species/5219173
The expected output consists of tab-separated lines like those above, where the first two columns are the input and the following columns are the match results.
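Because the rows are tab-separated, standard tools such as cut can trim them down to the columns of interest. The sketch below assumes the default two-column input, so that the appended match starts at column 3 (relation), followed by the resolved id (column 4) and resolved name (column 5), as in the example rows above. The function name is hypothetical, and the column positions should be checked against nomer output-schema if you use a custom schema.

```shell
# Hypothetical helper (not part of Nomer): reduce `nomer append` output to
# provided name, relation, resolved id and resolved name.
# Assumes two input columns, so the appended match starts at column 3.
match_summary() {
  cut -f2,3,4,5
}

# Example (requires Nomer to be installed):
#   printf '\tCanis lupus\n' | nomer append globalnames | match_summary
```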
In addition to appending the found matches to a provided input row, Nomer also supports replacing the matched values.
Looking up Canis lupus using globalnames with the replace command would look like:
echo -e "\tCanis lupus"\
| nomer replace globalnames
which produces:
NCBI:9612 Canis lupus
If multiple matches for the term are available, the first match will be replaced.
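A similar "first match only" reduction can be applied to append output, which keeps the full match details. The sketch below keeps the first row seen for each distinct input (id, name) pair. It is an illustration with a hypothetical function name, and it assumes that the order in which append prints matches is the order you want to rely on.

```shell
# Hypothetical helper (not part of Nomer): keep only the first appended match
# for each distinct input (id, name) pair and drop any later matches.
first_match_only() {
  awk -F'\t' '!seen[$1 FS $2]++'
}

# Example (requires Nomer to be installed):
#   printf '\tCanis lupus\n' | nomer append globalnames | first_match_only
```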
The replace command also supports pipe-delimited paths, like:
$ echo -e "\tAnimalia | Mammalia | Canis lupus" | nomer replace globalnames
ITIS:202423 | NCBI:40674 | NCBI:9612 Animalia | Mammalia | Canis lupus
Or when using a matcher that supports lookup by id:
echo -e "ITIS:202423 | NCBI:40674 | NCBI:9612\t"\
| nomer replace globi-enrich
would produce:
ITIS:202423 | NCBI:40674 | NCBI:9612 Animalia | Mammalia | Canis lupus
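To work with the individual path elements, the pipe-delimited id and name columns can be split and paired up again. The following is a minimal sketch using awk; the function name is hypothetical, and it assumes the ids are in the first column and the names in the second, with the same number of elements in each.

```shell
# Hypothetical helper (not part of Nomer): expand a row of the form
# "[id | id | ...]\t[name | name | ...]" into one "id\tname" line per element.
expand_path() {
  awk -F'\t' '{
    n = split($1, ids, / *[|] */)
    split($2, names, / *[|] */)
    for (i = 1; i <= n; i++) print ids[i] "\t" names[i]
  }'
}

# Example (requires Nomer to be installed):
#   printf '\tAnimalia | Mammalia | Canis lupus\n' | nomer replace globalnames | expand_path
```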
If you have an existing tabular file where the id and name columns are not the first and second columns, respectively, you can change the input/output schema. For instance, to match on ids in the third column (index 2) and write the matching id and name to the first (index 0) and second (index 1) columns (the default), you can do something like:
echo -e "\t\tNCBI:9606"\
| nomer replace --properties <(echo 'nomer.schema.input=[{\"column\":2,\"type\":\"externalId\"}]') ncbi-web
which would produce:
NCBI:9606 Homo sapiens NCBI:9606
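Process substitution (the <(...) syntax) is not available in every shell. As a sketch of an alternative, the same property line can be written to a file with a quoted here-document and then passed via --properties. The function name and file argument are hypothetical; the property line itself is copied verbatim from the command above.

```shell
# Hypothetical helper (not part of Nomer): write the input schema from the
# example above to a properties file. The quoted 'EOF' delimiter keeps the
# backslashes and double quotes exactly as typed.
write_input_schema() {
  cat > "$1" <<'EOF'
nomer.schema.input=[{\"column\":2,\"type\":\"externalId\"}]
EOF
}

# Example (requires Nomer to be installed):
#   write_input_schema my.properties
#   printf '\t\tNCBI:9606\n' | nomer replace --properties=my.properties ncbi-web
```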
To avoid escaping double quotes (i.e., " -> \") and to keep your commands relatively short, an easier way to change the input/output schema is to save the default properties to a file using nomer properties > my.properties.
Now, edit the properties nomer.schema.input and nomer.schema.output to suit your needs. After you are done, you can use the properties by running something like:
$ echo -e "\t\tNCBI:9606" | nomer replace --properties=my.properties ncbi-web
NCBI:9606 Homo sapiens NCBI:9606
... to reproduce the results from the previous example.
To validate terms (aka TaxonCache) and term linkages (aka TaxonMap) to be used with the offline term matchers, you can use the validate-terms and validate-term-link commands.
For instance, if you'd like to validate the first 10 lines of the taxonCache as published in https://zenodo.org/record/1213465, do:
curl -L "https://zenodo.org/record/1213465/files/taxonCacheFirst10.tsv" | nomer validate-terms
The expected result looks something like [FAIL|OK]\t[validation test]\t[...] where [...] is the validated line. Part of the result of the above command is:
OK 9 columns 4701dc84-660a-4c51-bd16-593997f2370b Coelomomyces iliensis species Fungi | Chytridiomycota | Blastocladiomycetes | Blastocladiales | Coelomomycetaceae | Coelomomyces | Coelomomyces iliensis urn:lsid:indexfungorum.org:names:90156 | urn:lsid:indexfungorum.org:names:90736 | urn:lsid:indexfungorum.org:names:90742 | urn:lsid:indexfungorum.org:names:90414 | urn:lsid:indexfungorum.org:names:80619 | urn:lsid:indexfungorum.org:names:20136 | 4701dc84-660a-4c51-bd16-593997f2370b kingdom | phylum | class | order | family | genus | species
FAIL supported id 4701dc84-660a-4c51-bd16-593997f2370b Coelomomyces iliensis species Fungi | Chytridiomycota | Blastocladiomycetes | Blastocladiales | Coelomomycetaceae | Coelomomyces | Coelomomyces iliensis urn:lsid:indexfungorum.org:names:90156 | urn:lsid:indexfungorum.org:names:90736 | urn:lsid:indexfungorum.org:names:90742 | urn:lsid:indexfungorum.org:names:90414 | urn:lsid:indexfungorum.org:names:80619 | urn:lsid:indexfungorum.org:names:20136 | 4701dc84-660a-4c51-bd16-593997f2370b kingdom | phylum | class | order | family | genus | species
FAIL prefixed id 4701dc84-660a-4c51-bd16-593997f2370b Coelomomyces iliensis species Fungi | Chytridiomycota | Blastocladiomycetes | Blastocladiales | Coelomomycetaceae | Coelomomyces | Coelomomyces iliensis urn:lsid:indexfungorum.org:names:90156 | urn:lsid:indexfungorum.org:names:90736 | urn:lsid:indexfungorum.org:names:90742 | urn:lsid:indexfungorum.org:names:90414 | urn:lsid:indexfungorum.org:names:80619 | urn:lsid:indexfungorum.org:names:20136 | 4701dc84-660a-4c51-bd16-593997f2370b kingdom | phylum | class | order | family | genus | species
This validation report tells us that the line starting with 4701dc84-660a-4c51-bd16-593997f2370b Coelomomyces iliensis has the expected 9 columns, but has an id that is not supported by Nomer and does not conform to the [some namespace]:[some id] format. Note that the GloBI Taxon Graph publication at http://doi.org/10.5281/zenodo.1213465 prompted the development of the validation features. For more historical context, please see #5.
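For larger files, a per-test tally is often easier to read than the full report. The sketch below counts report lines per status and validation test, assuming the [FAIL|OK]\t[validation test]\t[...] layout described above. The function name is hypothetical and not part of Nomer.

```shell
# Hypothetical helper (not part of Nomer): count validation report lines per
# (status, validation test) pair. Assumes the status is in column 1 and the
# test name is in column 2.
validation_summary() {
  awk -F'\t' '{ count[$1 "\t" $2]++ }
    END { for (key in count) print key "\t" count[key] }' | LC_ALL=C sort
}

# Example (requires Nomer to be installed):
#   nomer validate-terms < taxonCacheFirst10.tsv | validation_summary
```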
A similar feature for term links (aka TaxonMap) is available through the validate-term-link command.
Feel free to join in. All welcome. Open an issue!