diff --git a/tuw_nlp/sem/hrg/Documentation.md b/tuw_nlp/sem/hrg/Documentation.md
index 13ecd2d..1b2badb 100644
--- a/tuw_nlp/sem/hrg/Documentation.md
+++ b/tuw_nlp/sem/hrg/Documentation.md
@@ -4,16 +4,10 @@
 
 First, we try to find a top estimate for our system in order to validate our concept.
 
-Our system on the 23rd of Oct. 2024:
-
 ### Train a grammar
 
 We [train](steps/train/train.py) a hyperedge replacement [grammar](pipeline/output/grammar) (HRG) using the [lsoie dataset](https://github.com/Jacobsolawetz/large-scale-oie/tree/master/dataset_creation/lsoie_data) on the triplet induced sub-graphs of the UD graph of a sentence. We create one rule per word and use the nonterminals `S`, `A`, `P` and `X` (no label).
 
-We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules by keeping the original distribution of nonterminals and norming the weighs per nonterminal.
-
-#### Run the whole train pipeline
-
 ```bash
 # Get the data
 export DATA_DIR=$HOME/data
@@ -21,6 +15,22 @@
 mkdir $DATA_DIR
 cd $DATA_DIR
 # Download and unzip the lsoie data into a folder called lsoie_data
+# Preprocess the train data
+python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_train.json
+
+# Train the grammar
+python steps/train/train.py -d $DATA_DIR -c pipeline/config/train_per_word.json
+```
+
+We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules by keeping the original distribution of nonterminals and normalizing the weights per nonterminal.
+
+```bash
+python steps/train/hrg.py -d $DATA_DIR -c pipeline/config/hrg.json
+```
+
+#### Run the whole train pipeline
+
+```bash
 python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
 ```
 
@@ -29,38 +39,51 @@ python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
 
 First, we [preprocess](steps/preproc/preproc.py) the dev data as well.
 
 ```bash
-python preproc/preproc.py -d $DATA_DIR -c preproc/config/preproc_dev.json
+python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_dev.json
 ```
 
-Using the grammar, first we [parse](steps/bolinas/parse/parse.py) the UD graphs on the dev set, saving the resulting charts as an intermediary output. We prune the parsing above 10.000 steps for gr100 and gr200 and above 50.000 steps for gr300. The parsing takes from 1 hour to one day (gr300), see more [here](pipeline/log).
+Using the grammar, we first [parse](steps/bolinas/parse/parse.py) the UD graphs of the dev set, saving the resulting charts as an intermediate output. We prune the parsing above 50,000 steps. The parsing takes from one hour to one day; see more [here](pipeline/log).
 
 ```bash
- python bolinas/parse/parse.py -d $DATA_DIR -c bolinas/parse/config/parse_gr100.json
+python steps/bolinas/parse/parse.py -d $DATA_DIR -c pipeline/config/parse_100.json
 ```
 
-We [search](bolinas/kbest/kbest.py) for the top k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps for a sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](postproc/postproc.py), a predicate resolution in case no predicate is found and an argument grouping and indexing step, since we only have `A` as nonterminal. For `precision`, `recall` and `f1-score` filters this postprocessing step has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexing for the identified argument groups. This search takes from 1 our to 2.5 days, see more [here](bolinas/kbest/log).
+We [search](steps/bolinas/kbest/kbest.py) for the top k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or the classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps for a sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](steps/postproc/postproc.py): predicate resolution in case no predicate is found, and an argument grouping and indexing step, since `A` is our only argument nonterminal. For the `precision`, `recall` and `f1-score` filters, this postprocessing has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexings for the identified argument groups. This search takes from one hour to 2.5 days; see more [here](pipeline/log).
 
 ```bash
-python bolinas/kbest/kbest.py -d $DATA_DIR -c bolinas/kbest/config/kbest_gr100.json
+python steps/bolinas/kbest/kbest.py -d $DATA_DIR -c pipeline/config/kbest_100.json
 ```
 
-After the k best derivations are found we [predict](predict/predict.py) the labels, where we apply the necessary [postprocessing](postproc/postproc.py) steps (for `basic` and `max`). There is a possibility to implement further postprocessing strategies, as for now `keep` (resolving predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.
+After the k best derivations are found, we [predict](steps/predict/predict.py) the labels, applying the necessary [postprocessing](steps/postproc/postproc.py) steps (for `basic` and `max`). Further postprocessing strategies could be implemented; for now `keep` (resolving the predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.
 
-```python
-# TBD
+```bash
+python steps/predict/predict.py -d $DATA_DIR -c pipeline/config/predict_100.json
 ```
 
-Once all sentences are predicted, we [merge](predict/merge.py) them into one json per model.
+Once all sentences are predicted, we [merge](steps/predict/merge.py) them into one JSON file per model.
 
-```python
-# TBD
+```bash
+python steps/predict/merge.py -d $DATA_DIR -c pipeline/config/merge_100.json
+```
+
+#### Run the whole predict pipeline on dev
+
+```bash
+# Hrg - 100
+python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_100.json
+
+# Hrg - 200
+python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_200.json
+
+# Hrg - 300
+python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_300.json
 ```
 
 ### Create a random predictions for comparison
 
 We implement a [random extractor](random/random_extractor.py) that uses the [artefacts](random/train_stat) of the training dataset (distribution of the number of extractions per sentence, and distribution of labels per length of the sentence) and assures that the predicate is a verb.
 
-```python
+```bash
 # TBD
 ```
diff --git a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr100.json b/tuw_nlp/sem/hrg/pipeline/config/kbest_100.json
similarity index 73%
rename from tuw_nlp/sem/hrg/pipeline/config/kbest_gr100.json
rename to tuw_nlp/sem/hrg/pipeline/config/kbest_100.json
index 1c4b2ac..9728821 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr100.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/kbest_100.json
@@ -1,34 +1,29 @@
 {
     "in_dir": "dev_preproc",
-    "out_dir": "dev_gr100",
+    "out_dir": "dev_100",
     "arg_permutation": false,
     "filters": {
         "basic": {
-            "ignore": false,
             "chart_filter": "basic",
             "k": 10
         },
         "max": {
-            "ignore": false,
             "chart_filter": "max",
             "k": 10
         },
         "prec": {
-            "ignore": false,
             "pr_metric": "prec"
         },
         "rec": {
-            "ignore": false,
             "pr_metric": "rec"
         },
         "f1": {
-            "ignore": false,
             "pr_metric": "f1"
         }
     }
diff --git a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr200.json b/tuw_nlp/sem/hrg/pipeline/config/kbest_200.json
similarity index 73%
rename from tuw_nlp/sem/hrg/pipeline/config/kbest_gr200.json
rename to tuw_nlp/sem/hrg/pipeline/config/kbest_200.json
index 96e9637..d2c1bb1 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr200.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/kbest_200.json
@@ -1,34 +1,29 @@
 {
     "in_dir": "dev_preproc",
-    "out_dir": "dev_gr200",
+    "out_dir": "dev_200",
     "arg_permutation": false,
     "filters": {
         "basic": {
-            "ignore": false,
             "chart_filter": "basic",
             "k": 10
         },
         "max": {
-            "ignore": false,
             "chart_filter": "max",
             "k": 10
         },
         "prec": {
-            "ignore": false,
             "pr_metric": "prec"
         },
         "rec": {
-            "ignore": false,
             "pr_metric": "rec"
         },
         "f1": {
-            "ignore": false,
             "pr_metric": "f1"
         }
     }
diff --git a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr300.json b/tuw_nlp/sem/hrg/pipeline/config/kbest_300.json
similarity index 73%
rename from tuw_nlp/sem/hrg/pipeline/config/kbest_gr300.json
rename to tuw_nlp/sem/hrg/pipeline/config/kbest_300.json
index b0750da..ebb89cb 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/kbest_gr300.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/kbest_300.json
@@ -1,34 +1,29 @@
 {
     "in_dir": "dev_preproc",
-    "out_dir": "dev_gr300",
+    "out_dir": "dev_300",
     "arg_permutation": false,
     "filters": {
         "basic": {
-            "ignore": false,
             "chart_filter": "basic",
             "k": 10
         },
         "max": {
-            "ignore": false,
             "chart_filter": "max",
             "k": 10
         },
         "prec": {
-            "ignore": false,
             "pr_metric": "prec"
         },
         "rec": {
-            "ignore": false,
             "pr_metric": "rec"
         },
         "f1": {
-            "ignore": false,
             "pr_metric": "f1"
         }
     }
diff --git a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg200.json b/tuw_nlp/sem/hrg/pipeline/config/merge_100.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/merge_hrg200.json
rename to tuw_nlp/sem/hrg/pipeline/config/merge_100.json
index 7f68c5e..ce8f12e 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg200.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/merge_100.json
@@ -1,5 +1,5 @@
 {
-    "in_dir": "dev_gr200",
+    "in_dir": "dev_100",
     "k": 10,
     "bolinas_chart_filters": [
diff --git a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg300.json b/tuw_nlp/sem/hrg/pipeline/config/merge_200.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/merge_hrg300.json
rename to tuw_nlp/sem/hrg/pipeline/config/merge_200.json
index 7610c77..71defd9 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg300.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/merge_200.json
@@ -1,5 +1,5 @@
 {
-    "in_dir": "dev_gr300",
+    "in_dir": "dev_200",
     "k": 10,
     "bolinas_chart_filters": [
diff --git a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg100.json b/tuw_nlp/sem/hrg/pipeline/config/merge_300.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/merge_hrg100.json
rename to tuw_nlp/sem/hrg/pipeline/config/merge_300.json
index e09e4c5..757cf60 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/merge_hrg100.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/merge_300.json
@@ -1,5 +1,5 @@
 {
-    "in_dir": "dev_gr100",
+    "in_dir": "dev_300",
     "k": 10,
     "bolinas_chart_filters": [
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_100.json b/tuw_nlp/sem/hrg/pipeline/config/parse_100.json
new file mode 100644
index 0000000..6ffd11c
--- /dev/null
+++ b/tuw_nlp/sem/hrg/pipeline/config/parse_100.json
@@ -0,0 +1,6 @@
+{
+    "in_dir": "dev_preproc",
+    "grammar_file": "hrg_100.hrg",
+    "out_dir": "dev_100",
+    "max_steps": 50000
+}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_200.json b/tuw_nlp/sem/hrg/pipeline/config/parse_200.json
new file mode 100644
index 0000000..d9a0351
--- /dev/null
+++ b/tuw_nlp/sem/hrg/pipeline/config/parse_200.json
@@ -0,0 +1,6 @@
+{
+    "in_dir": "dev_preproc",
+    "grammar_file": "hrg_200.hrg",
+    "out_dir": "dev_200",
+    "max_steps": 50000
+}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_300.json b/tuw_nlp/sem/hrg/pipeline/config/parse_300.json
new file mode 100644
index 0000000..fc9458f
--- /dev/null
+++ b/tuw_nlp/sem/hrg/pipeline/config/parse_300.json
@@ -0,0 +1,6 @@
+{
+    "in_dir": "dev_preproc",
+    "grammar_file": "hrg_300.hrg",
+    "out_dir": "dev_300",
+    "max_steps": 50000
+}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_gr100.json b/tuw_nlp/sem/hrg/pipeline/config/parse_gr100.json
deleted file mode 100644
index 47d6441..0000000
--- a/tuw_nlp/sem/hrg/pipeline/config/parse_gr100.json
+++ /dev/null
@@ -1,6 +0,0 @@
-{
-    "in_dir": "dev_preproc",
-    "grammar_file": "grammar_100.hrg",
-    "out_dir": "dev_gr100",
-    "max_steps": 50000
-}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_gr200.json b/tuw_nlp/sem/hrg/pipeline/config/parse_gr200.json
deleted file mode 100644
index 157607f..0000000
--- a/tuw_nlp/sem/hrg/pipeline/config/parse_gr200.json
+++ /dev/null
@@ -1,6 +0,0 @@
-{
-    "in_dir": "dev_preproc",
-    "grammar_file": "grammar_200.hrg",
-    "out_dir": "dev_gr200",
-    "max_steps": 50000
-}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/parse_gr300.json b/tuw_nlp/sem/hrg/pipeline/config/parse_gr300.json
deleted file mode 100644
index bad6d4c..0000000
--- a/tuw_nlp/sem/hrg/pipeline/config/parse_gr300.json
+++ /dev/null
@@ -1,6 +0,0 @@
-{
-    "in_dir": "dev_preproc",
-    "grammar_file": "grammar_300.hrg",
-    "out_dir": "dev_gr300",
-    "max_steps": 50000
-}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_100.json b/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_100.json
new file mode 100644
index 0000000..0a19c75
--- /dev/null
+++ b/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_100.json
@@ -0,0 +1,25 @@
+{
+    "steps":
+    [
+        {
+            "step_name": "parse",
+            "script_name": "parse",
+            "config": "parse_100.json"
+        },
+        {
+            "step_name": "kbest",
+            "script_name": "kbest",
+            "config": "kbest_100.json"
+        },
+        {
+            "step_name": "predict",
+            "script_name": "predict",
+            "config": "predict_100.json"
+        },
+        {
+            "step_name": "merge",
+            "script_name": "merge",
+            "config": "merge_100.json"
+        }
+    ]
+}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_preproc.json b/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_preproc.json
deleted file mode 100644
index 023ae75..0000000
--- a/tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_preproc.json
+++ /dev/null
@@ -1,10 +0,0 @@
-{
-    "steps":
-    [
-        {
-            "step_name": "preproc",
-            "script_name": "preproc",
-            "config": "preproc_dev.json"
-        }
-    ]
-}
\ No newline at end of file
diff --git a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg200.json b/tuw_nlp/sem/hrg/pipeline/config/predict_100.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/predict_hrg200.json
rename to tuw_nlp/sem/hrg/pipeline/config/predict_100.json
index b43cc9c..4567e20 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg200.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/predict_100.json
@@ -1,6 +1,6 @@
 {
     "preproc_dir": "dev_preproc",
-    "in_dir": "dev_gr200",
+    "in_dir": "dev_100",
     "bolinas_chart_filters": [
         "basic",
diff --git a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg300.json b/tuw_nlp/sem/hrg/pipeline/config/predict_200.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/predict_hrg300.json
rename to tuw_nlp/sem/hrg/pipeline/config/predict_200.json
index 04d7c17..2cbb1c6 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg300.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/predict_200.json
@@ -1,6 +1,6 @@
 {
     "preproc_dir": "dev_preproc",
-    "in_dir": "dev_gr300",
+    "in_dir": "dev_200",
     "bolinas_chart_filters": [
         "basic",
diff --git a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg100.json b/tuw_nlp/sem/hrg/pipeline/config/predict_300.json
similarity index 88%
rename from tuw_nlp/sem/hrg/pipeline/config/predict_hrg100.json
rename to tuw_nlp/sem/hrg/pipeline/config/predict_300.json
index 2f5f02b..ae5f512 100644
--- a/tuw_nlp/sem/hrg/pipeline/config/predict_hrg100.json
+++ b/tuw_nlp/sem/hrg/pipeline/config/predict_300.json
@@ -1,6 +1,6 @@
 {
     "preproc_dir": "dev_preproc",
-    "in_dir": "dev_gr100",
+    "in_dir": "dev_300",
     "bolinas_chart_filters": [
         "basic",
diff --git a/tuw_nlp/sem/hrg/pipeline/log/pipeline_dev_preproc.log b/tuw_nlp/sem/hrg/pipeline/log/pipeline_dev_preproc.log
deleted file mode 100644
index 97897ba..0000000
--- a/tuw_nlp/sem/hrg/pipeline/log/pipeline_dev_preproc.log
+++ /dev/null
@@ -1,16 +0,0 @@
-Execution start: 2024-11-14 15:26:56.870978
-{
-    "steps": [
-        {
-            "step_name": "preproc",
-            "script_name": "preproc",
-            "config": "preproc_dev.json"
-        }
-    ]
-}
-
-Processing step preproc: 2024-11-14 15:26:56.871153
-
-Execution finish: 2024-11-14 16:17:38.632683
-Elapsed time: 51 min 42 sec
-
diff --git a/tuw_nlp/sem/hrg/pipeline/pipeline.py b/tuw_nlp/sem/hrg/pipeline/pipeline.py
index 21ef222..bcb05bf 100644
--- a/tuw_nlp/sem/hrg/pipeline/pipeline.py
+++ b/tuw_nlp/sem/hrg/pipeline/pipeline.py
@@ -17,7 +17,6 @@ class Pipeline(Script):
 
     def __init__(self, log=True, config=None):
         super().__init__("Script to run a pipeline.", log, config)
-        self.pipeline_name = self.config["name"]
         self.steps = self.config["steps"]
         self.name_to_class = {
             "preproc": Preproc,
diff --git a/tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py b/tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py
index 4003631..4cf5950 100644
--- a/tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py
+++ b/tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py
@@ -131,6 +131,8 @@ def _do_for_sen(self, sen_idx, sen_dir):
                 top_order,
                 self.config["arg_permutation"],
             )
+            if len(k_best_unique_derivations) == 0:
+                sen_log_lines.append("No matches\n")
         else:
             print("Neither 'k' nor 'pr_metric' is set")
             continue
diff --git a/tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py b/tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py
index 5ed568a..f2f3ae8 100644
--- a/tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py
+++ b/tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py
@@ -1,8 +1,6 @@
-import os.path
 import fileinput
 import pickle
 
-from tuw_nlp.sem.hrg.common.io import log_to_console_and_log_lines
 from tuw_nlp.sem.hrg.common.script.loop_on_sen_dirs import LoopOnSenDirs
 from tuw_nlp.sem.hrg.steps.bolinas.common.grammar import Grammar
 from tuw_nlp.sem.hrg.steps.bolinas.common.hgraph.hgraph import Hgraph
diff --git a/tuw_nlp/sem/hrg/steps/predict/predict.py b/tuw_nlp/sem/hrg/steps/predict/predict.py
index bd22d99..d119d78 100644
--- a/tuw_nlp/sem/hrg/steps/predict/predict.py
+++ b/tuw_nlp/sem/hrg/steps/predict/predict.py
@@ -12,7 +12,6 @@ class Predict(LoopOnSenDirs):
 
     def __init__(self, config=None):
         super().__init__(description="Script to create wire jsons from predicted bolinas labels.", config=config)
-        self.out_dir += self.in_dir
         self.preproc_dir = f"{self.data_dir}/{self.config['preproc_dir']}"
         self.chart_filters = self.config["bolinas_chart_filters"]
         self.postprocess = self.config["postprocess"]
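A note on the grammar-cut step described in Documentation.md above: keeping the top-N rules while preserving the per-nonterminal distribution and renormalizing the weights within each nonterminal can be sketched as follows. This is an illustrative sketch only; the `cut_grammar` name and the `(nonterminal, rule, weight)` triple representation are assumptions, and the actual implementation lives in `steps/train/hrg.py`.

```python
from collections import defaultdict


def cut_grammar(rules, n):
    """Keep roughly the top-n rules while preserving each nonterminal's
    share of the grammar, then renormalize weights per nonterminal.

    `rules` is a list of (nonterminal, rule, weight) triples
    (hypothetical representation, not the real HRG rule format).
    """
    by_nt = defaultdict(list)
    for nt, rule, w in rules:
        by_nt[nt].append((rule, w))

    total = len(rules)
    cut = []
    for nt, nt_rules in by_nt.items():
        # Give each nonterminal its proportional share of the n kept rules.
        keep = max(1, round(n * len(nt_rules) / total))
        top = sorted(nt_rules, key=lambda rw: rw[1], reverse=True)[:keep]
        # Renormalize so the kept weights sum to 1 within the nonterminal.
        z = sum(w for _, w in top)
        cut.extend((nt, rule, w / z) for rule, w in top)
    return cut
```

Under this scheme a cut grammar stays a valid probability distribution per nonterminal, which is what the weight-normalization wording in the documentation implies.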