Smasher to do list
Here are some reflections on smasher, based on experience working on it through about 2017.
There should be a smasher repository, separate from the Open Tree reference taxonomy repository, since smasher is now used by multiple taxonomy projects.
It might finally be time to create a 'src' tree for smasher and a more conventional repository structure.
An alignment should take a Node to a Node, not a Taxon to a Taxon. That is, Synonyms should map to either Synonyms or Taxons, and it should be possible for a Taxon to map to a Synonym. (I think that if we did this there would be a direct or very close name match any time two Nodes were aligned.) This change would require an overhaul of all the code that makes use of alignments (to move synonym link logic out into the code making use of the alignment), but I think the result would be cleaner, more accurate, and easier to understand, and perhaps could provide better metadata for downstream users.
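One way to picture the Node-to-Node idea is the minimal Python sketch below. The class names `Node`, `Taxon`, and `Synonym` follow the wording above, but the fields, the `taxon_of` helper, and the example names are hypothetical illustrations, not smasher's actual classes or API.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Node:
    """Anything that carries a name: a Taxon or a Synonym (hypothetical model)."""
    name: str


@dataclass(frozen=True)
class Taxon(Node):
    pass


@dataclass(frozen=True)
class Synonym(Node):
    accepted: Taxon  # the Taxon this synonym points to


def taxon_of(node: Node) -> Node:
    """Synonym-link logic lives in the consumer, not in the alignment itself."""
    return node.accepted if isinstance(node, Synonym) else node


# Source taxonomy: 'Aa bb' is an accepted taxon.
source_taxon = Taxon("Aa bb")

# Target taxonomy: 'Zz bb' is accepted, and 'Aa bb' is one of its synonyms.
target_taxon = Taxon("Zz bb")
target_synonym = Synonym("Aa bb", target_taxon)

# The alignment records the direct name match, Node to Node.
alignment: Dict[Node, Node] = {source_taxon: target_synonym}

aligned = alignment[source_taxon]
print(aligned.name)            # the name that actually matched
print(taxon_of(aligned).name)  # the accepted taxon, resolved by the consumer
```

Here the alignment itself keeps the exact name match, and the step from synonym to accepted taxon is an explicit operation in the code that uses the alignment.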
More effort should be made to distinguish objective synonyms from subjective synonyms, and to make use of that information. (Of course in many cases we don't know which of these is the case, so the default of 'unknown' would probably have to be treated the same as subjective.)
Authority information should be retained and exploited in alignment. It can be parsed using the parsing module from the Global Names project. E.g. we ought to be able to match 'Aa bb Smith 1992' with 'ZZ bb (Smith 1992)'. Note that NCBI has authority information for basically all taxa with rank 'species'.
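As a rough illustration of authority-based matching, the sketch below uses a simplified regular expression as a stand-in for a real name parser; it is not the Global Names parser and handles only the plain 'Genus epithet Author Year' shape, with or without parentheses. The function and field names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    genus: str
    epithet: str
    author: str
    year: str
    parenthesized: bool  # True when the authority is written in parentheses


# Simplified pattern (an assumption): one genus word, one epithet word,
# one author string without digits, and a four-digit year.
_NAME_RE = re.compile(
    r"^(?P<genus>\S+)\s+(?P<epithet>\S+)\s+"
    r"(?P<open>\()?(?P<author>[^()\d]+?),?\s*(?P<year>\d{4})\)?$"
)


def parse_name(namestring: str) -> Optional[ParsedName]:
    """Return the parsed parts, or None if the string has no usable authority."""
    m = _NAME_RE.match(namestring.strip())
    if m is None:
        return None
    return ParsedName(
        m.group("genus"),
        m.group("epithet"),
        m.group("author").strip(),
        m.group("year"),
        m.group("open") is not None,
    )


def same_original_description(a: str, b: str) -> bool:
    """Match on epithet + author + year, ignoring the genus.

    This ignores real-world complications such as gender agreement of the
    epithet and variant spellings or abbreviations of author names.
    """
    pa, pb = parse_name(a), parse_name(b)
    if pa is None or pb is None:
        return False
    return (pa.epithet, pa.author, pa.year) == (pb.epithet, pb.author, pb.year)


print(same_original_description("Aa bb Smith 1992", "ZZ bb (Smith 1992)"))
```

A production version would delegate the parsing step to a proper parser and would probably treat an authority match as one piece of evidence among several rather than as a decision on its own.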
To assist in finding alignments, subspecies and other infraspecific-rank taxa should not be pruned until final output (if at all).
Ideally the system should be able to detect and handle species redescriptions. For example, if Aa bb and Aa cc are described as separate species in taxonomy 1, and then taxonomy 2 has Aa bb with subspecies bb and cc, then the appropriate biological correspondences should be inferred and represented. The namestring 'Aa bb' should be known to denote different circumscriptions in the two taxonomies, united only by their common reference to the original description (type series) of 'Aa bb'.
There should be some way to make better use of parallelism. I think we're now getting only 2- to 3-fold parallelism, even on computers with 16 cores or more. This is a matter of figuring out how to split up the work (e.g. dividing the nodes to be processed among multiple threads) and adding synchronization as needed to prevent races.
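The "divide the nodes among threads" pattern might look roughly like the sketch below. Everything in it is hypothetical: `align_chunk` stands in for the real per-node alignment work, and the target index is assumed to be read-only while workers run. Note that in CPython the global interpreter lock limits the speedup for CPU-bound work, so this shows the shape of the partitioning and merging rather than a measured gain.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, nearly equal slices."""
    size, extra = divmod(len(items), n_chunks)
    start = 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when there are few items
            yield items[start:end]
        start = end


def align_chunk(nodes, target_index):
    """Hypothetical per-chunk work: look each node up in a read-only index."""
    return {node: target_index.get(node) for node in nodes}


def align_parallel(nodes, target_index, n_workers=4):
    """Process chunks in worker threads; merge results in the calling thread.

    Workers only read shared data and return private dictionaries, so no
    locking is needed. The merge is the single place where state is written.
    """
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(
            lambda chunk: align_chunk(chunk, target_index),
            chunks(nodes, n_workers),
        )
        for partial in partials:
            merged.update(partial)
    return merged


result = align_parallel(["a", "b", "c", "d", "e"], {"a": 1, "c": 3}, n_workers=2)
print(result)
```

The design choice here is to avoid synchronization by giving each worker its own output and merging afterwards; work that must update shared structures during alignment would need explicit locks or a different partitioning.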
Before smasher was cleaned up so that it could be described in the BDJ paper, it ran much faster, perhaps by 30% to 40%. This suggests that there are some constant-factor changes, probably in the heart of the alignment logic, that would improve speed significantly.
I have been thinking about rewriting smasher, or parts of it, in some other language, for performance reasons.
There should be more careful treatment of taxonomies, possibly using a registry (a maintained list of all versions of all published taxonomies and the source taxonomies they're made from). Different versions of source taxonomies should be distinguished and processed using machine-processable metadata. Actually, I think something like this is already used for the Open Tree taxonomy; it just needs to be untangled from Open Tree and documented better.
Smasher should have a general way to both use and publish Darwin Core archives, for better interop with GBIF and EOL. We might even want to replace GBIF with GBIF's sources, since this could bypass bugs in GBIF's synthesis code and would give us more specific metadata. (Although maybe not - GBIF has a large number of source taxonomies.)
We might want to put the alignment information (source taxonomies other than the first in the list) in a separate file, so that users of the taxonomy don't have to parse out this comma-separated field of limited value.
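A minimal sketch of that split is shown below, assuming each taxonomy row carries a taxon id and a comma-separated source field whose first entry is the primary source. The function name, the tuple layout, and the example ids are hypothetical.

```python
def split_sourceinfo(rows):
    """Separate the first source of each row from the remaining ones.

    rows: iterable of (taxon_id, sourceinfo), where sourceinfo is a
    comma-separated string such as 'ncbi:1,gbif:2' (assumed layout).
    Returns (primary_rows, alignment_rows):
      primary_rows   -> [(taxon_id, first_source)]  stays in the taxonomy file
      alignment_rows -> [(taxon_id, other_source)]  goes to the separate file
    """
    primary_rows = []
    alignment_rows = []
    for taxon_id, sourceinfo in rows:
        sources = [s for s in sourceinfo.split(",") if s]
        primary_rows.append((taxon_id, sources[0] if sources else ""))
        alignment_rows.extend((taxon_id, s) for s in sources[1:])
    return primary_rows, alignment_rows


rows = [
    ("100", "ncbi:1,gbif:2,irmng:3"),
    ("101", "gbif:7"),
    ("102", ""),
]
primary_rows, alignment_rows = split_sourceinfo(rows)
print(primary_rows)
print(alignment_rows)
```

With one row per (taxon, source) pair in the second file, users who only need the taxonomy never have to parse the list, and users who do need the alignments get them in a form that is easy to join on the taxon id.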
More use should be made of ad hoc taxonomies, rather than patch directives ('alignments' and 'adjustments'). So, Open Tree should have its own special source taxonomy / synonyms file expressing as much as possible of the information that is now in Python code. Not everything can be handled this way, but many homonyms, synonyms, and taxonomic relationships could be.
We resisted doing this because we thought we could stay out of the curation business, but that was not to be. This curation taxonomy should also contain the separation taxa.
It would include only the names occurring in edit directives (and in the separation taxa, and maybe a handful of others) - there is no need to repeat information found in other taxonomies, except for structural reasons. This requires designing a nice readable and editable textual format for the taxonomy (maybe reuse proposition.py somehow? but should be more concise).
It should probably contain all of the separation taxonomy.
The taxonomy would be written out in the usual form and could become the first taxonomy in the priority order. There could still be propositions as post-assembly edits if necessary, but when possible an edit should be made on the curation taxonomy, not as a patch.
As a starting point, you could just make a list of all the taxa mentioned in patches, then derive a spanning tree from OTT somehow, so that the topology relating them is correct (including where they attach to the separation taxonomy).
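One possible reading of that starting point is sketched below: given a child-to-parent map for the full taxonomy and the set of taxa mentioned in patches, give each mentioned taxon its nearest mentioned ancestor as its parent. The function name, the map layout, and the example names are hypothetical, and this simple version does not add branching points that were never mentioned.

```python
def induced_tree(parent, mentioned):
    """Build a small tree over `mentioned` that respects the full topology.

    parent:    dict mapping each taxon to its parent (root maps to None)
    mentioned: set of taxa to keep, e.g. names occurring in patches plus
               the separation taxa
    Returns a dict mapping each kept taxon to its nearest kept ancestor,
    or None if it has no kept ancestor.
    """
    result = {}
    for taxon in mentioned:
        ancestor = parent.get(taxon)
        # Walk up the full tree until we reach another kept taxon or the top.
        while ancestor is not None and ancestor not in mentioned:
            ancestor = parent.get(ancestor)
        result[taxon] = ancestor
    return result


# Toy stand-in for the full taxonomy (made-up names).
full_parent = {
    "life": None,
    "Eukaryota": "life",
    "Fungi": "Eukaryota",
    "Metazoa": "Eukaryota",
    "Aa": "Metazoa",
    "Aa bb": "Aa",
    "Cc": "Fungi",
}
kept = {"Eukaryota", "Fungi", "Aa bb", "Cc"}
small_tree = induced_tree(full_parent, kept)
print(small_tree)
```

Including the separation taxa in the kept set means every patched taxon ends up beneath the separation taxon that contains it, which is one way to answer the question of where each one belongs.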