Commit: Prepare dev_100 pipeline

Eszti committed Nov 18, 2024
1 parent ef0e325 commit 7c8ee65
Showing 23 changed files with 95 additions and 90 deletions.
59 changes: 41 additions & 18 deletions tuw_nlp/sem/hrg/Documentation.md
@@ -4,23 +4,33 @@

First, we try to find an upper-bound estimate for our system in order to validate our concept.

Our system as of Oct 23, 2024:

### Train a grammar

We [train](steps/train/train.py) a hyperedge replacement [grammar](pipeline/output/grammar) (HRG) using the [lsoie dataset](https://github.com/Jacobsolawetz/large-scale-oie/tree/master/dataset_creation/lsoie_data) on the triplet-induced subgraphs of the UD graph of a sentence. We create one rule per word and use the nonterminals `S`, `A`, `P` and `X` (no label).

We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules, keeping the original distribution of nonterminals and normalizing the weights per nonterminal.

#### Run the whole train pipeline

```bash
# Get the data
export DATA_DIR=$HOME/data
mkdir $DATA_DIR
cd $DATA_DIR
# Download and unzip the lsoie data into a folder called lsoie_data

# Preprocess the train data
python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_train.json

# Train the grammar
python steps/train/train.py -d $DATA_DIR -c pipeline/config/train_per_word.json
```

We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules, keeping the original distribution of nonterminals and normalizing the weights per nonterminal.

```bash
python steps/train/hrg.py -d $DATA_DIR -c pipeline/config/hrg.json
```
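
The cutting logic can be illustrated with a short sketch. This is a hypothetical reconstruction, not the actual hrg.py code; the `(lhs, rule, weight)` rule representation is an assumption:

```python
from collections import defaultdict

def cut_grammar(rules, top_n):
    """Keep roughly the top_n highest-weight rules while preserving the
    nonterminal distribution of the full grammar, then renormalize the
    weights per nonterminal. rules: list of (lhs, rule, weight)."""
    by_nt = defaultdict(list)
    for lhs, rule, weight in rules:
        by_nt[lhs].append((lhs, rule, weight))
    cut = []
    for lhs, nt_rules in by_nt.items():
        # Each nonterminal keeps a share of top_n proportional to its
        # share in the full grammar, so the distribution is preserved.
        share = round(top_n * len(nt_rules) / len(rules))
        nt_rules.sort(key=lambda r: r[2], reverse=True)
        kept = nt_rules[:share]
        total = sum(w for _, _, w in kept)
        if not kept or total == 0:
            continue
        # Renormalize so the weights per nonterminal sum to 1 again.
        cut += [(lhs, rule, w / total) for lhs, rule, w in kept]
    return cut
```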

#### Run the whole train pipeline

```bash
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
```

@@ -29,38 +39,51 @@ python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
First, we [preprocess](steps/preproc/preproc.py) the dev data as well.

```bash
- python preproc/preproc.py -d $DATA_DIR -c preproc/config/preproc_dev.json
+ python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_dev.json
```

Using the grammar, first we [parse](steps/bolinas/parse/parse.py) the UD graphs on the dev set, saving the resulting charts as an intermediary output. We prune the parsing above 10,000 steps for gr100 and gr200 and above 50,000 steps for gr300. The parsing takes from one hour to one day (gr300); see more [here](pipeline/log).
Using the grammar, first we [parse](steps/bolinas/parse/parse.py) the UD graphs on the dev set, saving the resulting charts as an intermediary output. We prune the parsing above 50,000 steps. The parsing takes from one hour to one day; see more [here](pipeline/log).

```bash
- python bolinas/parse/parse.py -d $DATA_DIR -c bolinas/parse/config/parse_gr100.json
+ python steps/bolinas/parse/parse.py -d $DATA_DIR -c pipeline/config/parse_100.json
```
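
The pruning itself is only a step budget on chart construction; schematically (the parser object and its methods below are placeholders, not the Bolinas API):

```python
def parse_with_budget(parser, max_steps=50000):
    # Process at most max_steps chart items, then give up on the
    # sentence, so pathological inputs cannot stall the whole run.
    steps = 0
    while not parser.done():
        if steps >= max_steps:
            return None  # pruned: the sentence counts as unparsed
        parser.process_next_item()
        steps += 1
    return parser.chart
```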

We [search](bolinas/kbest/kbest.py) for the k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or the classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps per sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](postproc/postproc.py): predicate resolution in case no predicate is found, and an argument grouping and indexing step, since we only have `A` as a nonterminal. For the `precision`, `recall` and `f1-score` filters this postprocessing step has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexings for the identified argument groups. This search takes from one hour to 2.5 days; see more [here](bolinas/kbest/log).
We [search](steps/bolinas/kbest/kbest.py) for the k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or the classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps per sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](steps/postproc/postproc.py): predicate resolution in case no predicate is found, and an argument grouping and indexing step, since we only have `A` as a nonterminal. For the `precision`, `recall` and `f1-score` filters this postprocessing step has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexings for the identified argument groups. This search takes from one hour to 2.5 days; see more [here](pipeline/log).

```bash
- python bolinas/kbest/kbest.py -d $DATA_DIR -c bolinas/kbest/config/kbest_gr100.json
+ python steps/bolinas/kbest/kbest.py -d $DATA_DIR -c pipeline/config/kbest_100.json
```
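
For the metric filters, the selection amounts to the following sketch; `score` stands for the shared triplet-matching function used in evaluation, and the data shapes are assumptions:

```python
def best_per_gold(gold_extractions, derivations, score):
    # For each gold extraction keep only the single derivation with the
    # highest score against it ("cheating", since gold data is used).
    best = []
    for gold in gold_extractions:
        if derivations:
            best.append(max(derivations, key=lambda d: score(d, gold)))
    return best
```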

After the k best derivations are found, we [predict](predict/predict.py) the labels, applying the necessary [postprocessing](postproc/postproc.py) steps (for `basic` and `max`). Further postprocessing strategies could be implemented; for now `keep` (resolving the predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.
After the k best derivations are found, we [predict](steps/predict/predict.py) the labels, applying the necessary [postprocessing](steps/postproc/postproc.py) steps (for `basic` and `max`). Further postprocessing strategies could be implemented; for now `keep` (resolving the predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.

- ```python
- # TBD
+ ```bash
+ python steps/predict/predict.py -d $DATA_DIR -c pipeline/config/predict_100.json
```
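
A minimal sketch of the `keep` strategy on a word-level label sequence; the helper below is illustrative, not the actual postproc.py implementation, and the first-verb fallback for predicate resolution is an assumption:

```python
def keep_strategy(labels, pos_tags):
    # labels: one of 'P', 'A' or 'X' per word.
    out = list(labels)
    if 'P' not in out:
        # Predicate resolution: fall back to the first verb, if any
        # (assumed fallback for illustration).
        for i, tag in enumerate(pos_tags):
            if tag == 'VERB':
                out[i] = 'P'
                break
    arg_idx = -1
    prev_is_arg = False
    for i, label in enumerate(out):
        if label == 'A':
            if not prev_is_arg:
                arg_idx += 1  # a new continuous A-span starts here
            out[i] = f'A{arg_idx}'  # index argument groups left to right
            prev_is_arg = True
        else:
            prev_is_arg = False
    return out
```

For example, `['A', 'A', 'X', 'P', 'A']` becomes `['A0', 'A0', 'X', 'P', 'A1']`.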

Once all sentences are predicted, we [merge](predict/merge.py) them into one json per model.
Once all sentences are predicted, we [merge](steps/predict/merge.py) them into one json per model.

- ```python
- # TBD
+ ```bash
+ python steps/predict/merge.py -d $DATA_DIR -c pipeline/config/merge_100.json
```
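
Conceptually the merge is a simple fold over the per-sentence outputs; the file layout and key names below are assumptions for illustration:

```python
import json
from pathlib import Path

def merge_predictions(sen_dirs, model_name, out_file):
    # Collect the per-sentence prediction jsons into one json per model,
    # keyed by sentence id.
    merged = {}
    for sen_dir in sen_dirs:
        with open(Path(sen_dir) / f"{model_name}.json") as f:
            merged.update(json.load(f))
    with open(out_file, "w") as f:
        json.dump(merged, f, indent=4)
```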

#### Run the whole predict pipeline on dev

```bash
# Hrg - 100
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_100.json

# Hrg - 200
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_200.json

# Hrg - 300
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_300.json
```

### Create random predictions for comparison

We implement a [random extractor](random/random_extractor.py) that uses the [artefacts](random/train_stat) of the training dataset (the distribution of the number of extractions per sentence and the distribution of labels per sentence length) and ensures that the predicate is a verb.

- ```python
+ ```bash
# TBD
```
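
Until the command is filled in, the idea behind the extractor can be sketched as follows; the shapes of the two train_stat artefacts and all names here are assumptions:

```python
import random

def random_extraction(words, pos_tags, n_extr_dist, label_dist_by_len):
    # n_extr_dist: {number_of_extractions: probability}
    # label_dist_by_len: {sentence_length: {label: probability}}
    verbs = [i for i, tag in enumerate(pos_tags) if tag == "VERB"]
    if not verbs:
        return []  # no verb, no extraction with a verbal predicate
    n = random.choices(list(n_extr_dist), weights=list(n_extr_dist.values()))[0]
    extractions = []
    for _ in range(n):
        dist = label_dist_by_len[len(words)]
        labels = random.choices(list(dist), weights=list(dist.values()), k=len(words))
        # Ensure the predicate is a verb: demote sampled 'P' labels and
        # place the predicate on a randomly chosen verb instead.
        labels = ['X' if l == 'P' else l for l in labels]
        labels[random.choice(verbs)] = 'P'
        extractions.append(labels)
    return extractions
```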

@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr100",
"out_dir": "dev_100",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr200",
"out_dir": "dev_200",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr300",
"out_dir": "dev_300",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr200",
"in_dir": "dev_100",
"k": 10,
"bolinas_chart_filters":
[
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr300",
"in_dir": "dev_200",
"k": 10,
"bolinas_chart_filters":
[
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr100",
"in_dir": "dev_300",
"k": 10,
"bolinas_chart_filters":
[
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_100.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_100.hrg",
"out_dir": "dev_100",
"max_steps": 50000
}
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_200.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_200.hrg",
"out_dir": "dev_200",
"max_steps": 50000
}
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_300.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_300.hrg",
"out_dir": "dev_300",
"max_steps": 50000
}
6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr100.json

This file was deleted.

6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr200.json

This file was deleted.

6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr300.json

This file was deleted.

25 changes: 25 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_100.json
@@ -0,0 +1,25 @@
{
"steps":
[
{
"step_name": "parse",
"script_name": "parse",
"config": "parse_100.json"
},
{
"step_name": "kbest",
"script_name": "kbest",
"config": "kbest_100.json"
},
{
"step_name": "predict",
"script_name": "predict",
"config": "predict_100.json"
},
{
"step_name": "merge",
"script_name": "merge",
"config": "merge_100.json"
}
]
}
10 changes: 0 additions & 10 deletions tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_preproc.json

This file was deleted.

@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr200",
"in_dir": "dev_100",
"bolinas_chart_filters":
[
"basic",
@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr300",
"in_dir": "dev_200",
"bolinas_chart_filters":
[
"basic",
@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr100",
"in_dir": "dev_300",
"bolinas_chart_filters":
[
"basic",
16 changes: 0 additions & 16 deletions tuw_nlp/sem/hrg/pipeline/log/pipeline_dev_preproc.log

This file was deleted.

1 change: 0 additions & 1 deletion tuw_nlp/sem/hrg/pipeline/pipeline.py
@@ -17,7 +17,6 @@
class Pipeline(Script):
def __init__(self, log=True, config=None):
super().__init__("Script to run a pipeline.", log, config)
self.pipeline_name = self.config["name"]
self.steps = self.config["steps"]
self.name_to_class = {
"preproc": Preproc,
2 changes: 2 additions & 0 deletions tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py
@@ -131,6 +131,8 @@ def _do_for_sen(self, sen_idx, sen_dir):
top_order,
self.config["arg_permutation"],
)
if len(k_best_unique_derivations) == 0:
sen_log_lines.append("No matches\n")
else:
print("Neither 'k' nor 'pr_metric' is set")
continue
2 changes: 0 additions & 2 deletions tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py
@@ -1,8 +1,6 @@
import os.path
import fileinput
import pickle

from tuw_nlp.sem.hrg.common.io import log_to_console_and_log_lines
from tuw_nlp.sem.hrg.common.script.loop_on_sen_dirs import LoopOnSenDirs
from tuw_nlp.sem.hrg.steps.bolinas.common.grammar import Grammar
from tuw_nlp.sem.hrg.steps.bolinas.common.hgraph.hgraph import Hgraph
1 change: 0 additions & 1 deletion tuw_nlp/sem/hrg/steps/predict/predict.py
@@ -12,7 +12,6 @@ class Predict(LoopOnSenDirs):

def __init__(self, config=None):
super().__init__(description="Script to create wire jsons from predicted bolinas labels.", config=config)
self.out_dir += self.in_dir
self.preproc_dir = f"{self.data_dir}/{self.config['preproc_dir']}"
self.chart_filters = self.config["bolinas_chart_filters"]
self.postprocess = self.config["postprocess"]