Releases: AmyOlex/Chrono
Chrono BERT Release
New TimeML Annotation Output
Major updates to Chrono that enable it to output annotations of temporal expressions in both SCATE and TimeML formats.
ML Bug Fix
One section of the Machine Learning module was not properly updated to be compatible with the SVM and RF implementations and was causing errors for some files. This has been corrected in this version. No other changes were made from the major release v3.0.
Major algorithm improvements and bug fixes for parsing clinical data from the THYME corpus.
This release includes major algorithm and structural changes to Chrono to improve temporal parsing of clinical data in the THYME corpus. Modifications are listed below and arranged in 5 categories:
- Algorithmic - Changes to the parsing algorithms in Chrono.
- Implementation - Changes to how an algorithm is implemented, but the core algorithm is not changed.
- Lexical - Additions to Chrono's temporal lexicon.
- ML Training Data - Additions to the prepared ML training data input files.
- Structural - Changes to the structure and organization of Chrono.
ALGORITHMIC Modifications
- 2-digit-year parsing improvements include utilizing regex groups, and removing colons from the formatted date regex so that only dashes and forward slashes are considered.
- Added parsing module for the Last entity with the lexical terms last, previously, pre, later, earlier, early, previous, past, lately, final, latest, prior, recent, and recently all being automatically tagged as a Last entity.
- AMPM-Of-Day is now capturing AM mentions.
- AMPM-Of-Day parsing now checks to see if an HourOfDay entity has already been added before adding another one, which was originally creating duplicate hour entities.
- Changed how the number associated with a Period or Calendar-Interval is identified. Previously, Chrono attempted to convert the entire string preceding the tagged term into a number, which fails for phrases like “over six weeks”. Now Chrono considers only the single token immediately before the tagged token and ignores the rest of the preceding string. For example, “one-time daily” no longer associates the number mention “one” with the Calendar-Interval “daily”.
- Custom implementation of converting a textual month representation to the correct number representations.
- Expanded formatted date parsing to identify alphanumeric formats like “21-SEP-2009”.
- Expanded special cases for Period and Calendar-Interval parsing. Tokens such as “tomorrow” or “today” are always classified as Calendar Intervals using rule-based methods, and not passed to the ML module.
- Expanded subinterval linking to include linking Modifiers.
- HourOfDay and MinuteOfHour updated to identify time phrases formatted as hh:mm instead of requiring hh:mm:ss.
- Implemented Random Forests ML algorithm.
- Improved identification of spans for Year entities.
- Phrases such as “four weeks time” are now identified and parsed correctly in the Period and Calendar-Interval methods.
- Prior to tokenization, all equal signs, “=”, are replaced by spaces. This is to separate temporal phrases in clinical header lines in the THYME data.
- Sentence boundary identification is now implemented. Sentences and their boundaries are identified first, and the text of each sentence is then tokenized on whitespace. Note that the ML method still does not utilize the sentence boundaries.
- Special case parsing of the phrase “one-time daily” and similar phrases is now done when identifying Periods or Calendar Intervals.
- Sub-interval linking is now linking AMPM entities to the HourOfDay entities.
- Temporal phrase extraction now utilizes sentence boundaries and will not allow a temporal phrase to cross a sentence boundary.
- Temporal phrase extraction now utilizes tagged linking terms when identifying temporal phrases. If a linking term is identified, the temporal phrase is extended to include it. As a result, much longer temporal phrases are now being identified.
- Text is now lowercased before parsing out Periods and Calendar-Intervals.
- Text month and day parsing now utilizes flags to ensure entities are not duplicated.
- The entity type “Unknown” is now being identified for Periods and Calendar-Intervals for the tokens “time”, “shortly”, “soon”, “briefly”, “awhile”, “future”, and “lately”.
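As an illustration of the formatted-date items above (regex groups, dash and forward-slash separators only, and alphanumeric formats such as “21-SEP-2009”), here is a minimal sketch. It is not Chrono's actual code: the function name, the month table, the patterns, and the month/day/year ordering for numeric dates are all assumptions made for the example.

```python
import re

# Hypothetical month lookup for this sketch; Chrono keeps its own lexicon.
MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}

# Only "-" and "/" are accepted as separators. Colons are excluded so that
# times such as "10:30:15" are not mistaken for dates. The backreference \2
# requires the same separator in both positions.
NUMERIC_DATE = re.compile(r"\b(\d{1,2})([-/])(\d{1,2})\2(\d{2}|\d{4})\b")
ALPHA_DATE = re.compile(r"\b(\d{1,2})([-/])([a-z]{3})\2(\d{2}|\d{4})\b")


def parse_formatted_date(text):
    """Return (month, day, year_text) for the first formatted date, or None.

    The year is returned as text so the caller can still tell a 2-digit
    year from a 4-digit year.
    """
    text = text.lower()
    match = ALPHA_DATE.search(text)
    if match:
        month = MONTHS.get(match.group(3))
        if month is None:
            return None
        return month, int(match.group(1)), match.group(4)
    match = NUMERIC_DATE.search(text)
    if match:
        # Assumption for this sketch: numeric dates are month/day/year.
        return int(match.group(1)), int(match.group(3)), match.group(4)
    return None
```

Reading each component from a regex group means the string is matched once and never re-parsed.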
IMPLEMENTATION Modifications
- Added re-indexing to list of entities in sub-interval linking when multiple entities are deleted.
- AMPM parsing was changed to use regular expressions instead of using intersecting lists of terms. This implementation now captures more AMPM instances in a larger variety of contexts.
- Changed implementation of Hour, Minute and Second parsing to utilize regex groups instead of re-parsing multiple times.
- Converted Calendar-Interval parsing to use regular expressions instead of list intersection.
- Fixed bug in hasTextYear() method where the regex match was not being referred to correctly.
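The switch to regex groups for Hour, Minute, and Second parsing can be pictured roughly as follows. This is a sketch under assumptions, not the code in Chrono: the pattern and the function name `parse_time` are made up for illustration, and range checking is left out.

```python
import re

# The seconds group is optional, so both "hh:mm" and "hh:mm:ss" are accepted.
TIME = re.compile(r"\b(\d{1,2}):(\d{2})(?::(\d{2}))?\b")


def parse_time(token):
    """Return (hour, minute, second) from one match, or None if no time is found.

    second is None when the token has no seconds component.
    """
    match = TIME.search(token)
    if match is None:
        return None
    hour = int(match.group(1))
    minute = int(match.group(2))
    second = int(match.group(3)) if match.group(3) is not None else None
    return hour, minute, second
```

A single match yields all three components, so the string does not need to be parsed once per entity.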
LEXICAL Modifications
- Added “/min” and “/week” as valid terms for Calendar-Interval.
- DayOfMonth parsing now includes parsing for the token “may” only in alphanumeric formats such as “12-may-2010”.
- Expanded lexicon of text month representations to include additional abbreviations.
- Expanded Modifier entity lexicon to include “approximatly” and “beginning”.
- Expanded PartOfDay lexicon to include “bedtime”, “eve”, and “midnight”.
- Expanded Period and Calendar-Interval lexicon to include “tomorrow”, “tomorrows”, “half century”, “quarter century”, “point”, “long”, “period”, “lately”, “future”, “awhile”, “briefly”, “longstanding”, “soon”, “shortly”, and “length”.
- Implemented tagging of “linking terms”. These currently include “a”, “an”, “of”, “in”, and “on”. A token is identified as a linking term iff it has temporal and/or numeric tokens on either side.
- Now able to identify abbreviated months without periods, such as “Dec” instead of requiring a period as in “Dec.”
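The linking-term rule above can be sketched as follows. This example reads “on either side” as requiring a temporal or numeric token on both the left and the right, which is an assumption. The function name and the per-token flag lists are also hypothetical and do not reflect Chrono's internal data structures.

```python
LINKING_TERMS = {"a", "an", "of", "in", "on"}


def flag_linking_terms(tokens, is_temporal, is_numeric):
    """Return one boolean per token: True if the token acts as a linking term.

    is_temporal and is_numeric are parallel lists of booleans giving the
    flags already assigned to each token.
    """
    flags = [False] * len(tokens)
    # The first and last tokens cannot have a neighbor on both sides.
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() not in LINKING_TERMS:
            continue
        left = is_temporal[i - 1] or is_numeric[i - 1]
        right = is_temporal[i + 1] or is_numeric[i + 1]
        flags[i] = left and right
    return flags
```

With tokens `["seen", "first", "of", "march", "in", "clinic"]` and only “first” and “march” flagged as temporal, “of” is flagged as a linking term, while “in” is not because “clinic” is neither temporal nor numeric.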
STRUCTURAL Modifications
- Added a run.sh script for development purposes and to make running Chrono easier for different configurations and dataset locations.
- Added an xml2ann.py script that converts SCATE XML to .ann annotation format.
- Added ExtractParsedContext.py script that takes in a list of XML or .ann files and a list of text files and outputs the specified token with the surrounding context. The lexicon for a given entity can also be retrieved with this script if the context is set to zero.
- All “hasX” and “buildX” methods in BuildEntities.py have been moved to their own file to aid in modularity.
- Extracted all lexical term lists into separate dictionary files for ease of customization.
- Global renaming of functions and methods for clarity.
- README updates
ML TRAINING DATA Updates
- Created 2 new ML training data sets for Periods and Calendar Intervals: one using only the THYME data (clinical), and another that merges the THYME data with newswire data (newswire-clinical). The merged training data is used for evaluations.
Starting Release for THYME Data Analysis
This is the starting release for the THYME data analysis and is what was used for the initial submission to SemEval Post-Eval phase.
Bug fixes to temporal entity flagging
Includes temporal parsing bug fixes done after SemEval 2018 submission.
After conducting mutation testing on temporalTest, the following changes were made:
- Removed unused import statements
- Fixed flagging in 24 hour time to correctly identify values less than 24 for hours and less than 60 for minutes
- Removed unnecessary if statements
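The corrected 24-hour time check amounts to a range test on the hour and minute fields. A minimal sketch, assuming the time arrives as a 4-digit token such as “0800” (the function name is hypothetical, not the one used in Chrono):

```python
import re


def is_24_hour_time(token):
    """Return True if token is four ASCII digits forming a valid hhmm time."""
    if re.fullmatch(r"[0-9]{4}", token) is None:
        return False
    hour, minute = int(token[:2]), int(token[2:])
    # Hours must be less than 24 and minutes less than 60.
    return hour < 24 and minute < 60
```

Note that a token like “2013” also passes this check, which is why the order in which year and time parsers run matters.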
Official public release of Chrono
This is the official first public release of Chrono after submission of the systems description paper to SemEval 2018.
Added k-fold cross validation
K-fold cross validation was implemented in a new script. No existing files were changed; only the new script and a CSV file for the cross validation were added. Currently, the cross-validation settings must still be set manually.
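For readers unfamiliar with the technique, the splitting step of k-fold cross validation can be sketched in plain Python as below. This is a generic illustration only. It does not reproduce the script added in this release, and the function name is an assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds.

    Every sample appears in exactly one test fold. When n_samples is not
    divisible by k, the first n_samples % k folds get one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size
```

Each pair would be used to train a classifier on the training rows and score it on the held-out rows, with the k scores averaged at the end.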
Logic corrections
- We have separate parsing methods for various types of temporal components that are executed in a specific order. Isolated 4-digit years, like "2013", were supposed to be parsed before 24-hour time expressions like "0800"; however, these two lines of code somehow got swapped at some point and we were prioritizing 24-hour times over years. Thus, a mention of "2013" was being parsed as 20:13 instead of a year, since the hour and minute values were within 24-hour time parameters. The fix was to just swap the lines of code so years were parsed out first.
- Our system is supposed to assume there is only one year, one month, and one day mentioned per temporal phrase (a limitation I know, but it will get fixed down the road). Thus, when the phrase "2013-03-22" is parsed, it should have found only one year, one month, and one day. This worked for the month and day; however, it was coming back with a 4-digit year of "2013" and a 2-digit year of "22". This is because we missed checking a year flag when the 2-digit year parser runs (4-digit years take precedence over 2-digit years). Therefore, the fix was simply to insert a single control statement to only execute the 2-digit year parser if a 4-digit year was not found.
- Finally, for most parsing methods we loop through each token in the temporal phrase; however, when identifying full numeric expressions, like "1953" or "08091998", we missed inserting this loop. Thus, phrases like "Last 1953" were not being counted as having any numeric values in them, because the parser was testing whether "Last 1953" as a whole was numeric. Therefore, the "1953" was missed as a year and got caught as a 24-hour time, because that method did loop through all tokens correctly. The fix was to insert the for loop through all tokens of a temporal phrase in the full numeric expression parser.
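The first and third corrections can be illustrated together with a small sketch. It is not taken from Chrono: the function name, both patterns, and especially the assumed year range (1900 to 2099) are hypothetical.

```python
import re

# Assumed plausible-year range for this sketch only.
FOUR_DIGIT_YEAR = re.compile(r"(?:19|20)[0-9]{2}")
# hhmm with hours 00-23 and minutes 00-59.
TWENTY_FOUR_HOUR = re.compile(r"(?:[01][0-9]|2[0-3])[0-5][0-9]")


def classify_numeric_tokens(phrase):
    """Label each 4-digit token in a phrase as a year or a 24-hour time."""
    labels = []
    # Test each token separately instead of the phrase as a whole.
    for token in phrase.split():
        # Years are checked first, so "2013" is never read as 20:13.
        if FOUR_DIGIT_YEAR.fullmatch(token):
            labels.append((token, "year"))
        elif TWENTY_FOUR_HOUR.fullmatch(token):
            labels.append((token, "24-hour time"))
    return labels
```

Both “1953” and “2013” also satisfy the 24-hour pattern, so swapping the two branches would reproduce the original bug.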