Commit: Prepare dev_100 pipeline

Eszti committed Nov 18, 2024
1 parent ef0e325 commit 7c8ee65
Showing 23 changed files with 95 additions and 90 deletions.
59 changes: 41 additions & 18 deletions tuw_nlp/sem/hrg/Documentation.md
@@ -4,23 +4,33 @@

First, we try to find an upper-bound estimate for our system in order to validate our concept.

Our system as of Oct 23, 2024:

### Train a grammar

We [train](steps/train/train.py) a hyperedge replacement [grammar](pipeline/output/grammar) (HRG) using the [lsoie dataset](https://github.com/Jacobsolawetz/large-scale-oie/tree/master/dataset_creation/lsoie_data) on the triplet-induced subgraphs of the UD graph of a sentence. We create one rule per word and use the nonterminals `S`, `A`, `P` and `X` (no label).

We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules, keeping the original distribution of nonterminals and normalizing the weights per nonterminal.

#### Run the whole train pipeline

```bash
# Get the data
export DATA_DIR=$HOME/data
mkdir $DATA_DIR
cd $DATA_DIR
# Download and unzip the lsoie data into a folder called lsoie_data

# Preprocess the train data
python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_train.json

# Train the grammar
python steps/train/train.py -d $DATA_DIR -c pipeline/config/train_per_word.json
```

We [create](steps/train/hrg.py) different cuts of this grammar using the top 100, 200 and 300 rules, keeping the original distribution of nonterminals and normalizing the weights per nonterminal.

```bash
python steps/train/hrg.py -d $DATA_DIR -c pipeline/config/hrg.json
```
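
The cutting logic can be illustrated with a short sketch. This is a hypothetical reconstruction, not the actual hrg.py code; the `(lhs, rule, weight)` rule representation is an assumption:

```python
from collections import defaultdict

def cut_grammar(rules, top_n):
    """Keep roughly the top_n highest-weight rules while preserving the
    nonterminal distribution of the full grammar, then renormalize the
    weights per nonterminal. rules: list of (lhs, rule, weight)."""
    by_nt = defaultdict(list)
    for lhs, rule, weight in rules:
        by_nt[lhs].append((lhs, rule, weight))
    cut = []
    for lhs, nt_rules in by_nt.items():
        # Each nonterminal keeps a share of top_n proportional to its
        # share in the full grammar, so the distribution is preserved.
        share = round(top_n * len(nt_rules) / len(rules))
        nt_rules.sort(key=lambda r: r[2], reverse=True)
        kept = nt_rules[:share]
        total = sum(w for _, _, w in kept)
        if not kept or total == 0:
            continue
        # Renormalize so the weights per nonterminal sum to 1 again.
        cut += [(lhs, rule, w / total) for lhs, rule, w in kept]
    return cut
```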

#### Run the whole train pipeline

```bash
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
```

@@ -29,38 +39,51 @@ python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_train.json
First, we [preprocess](steps/preproc/preproc.py) the dev data as well.

```bash
- python preproc/preproc.py -d $DATA_DIR -c preproc/config/preproc_dev.json
+ python steps/preproc/preproc.py -d $DATA_DIR -c pipeline/config/preproc_dev.json
```

Using the grammar, first we [parse](steps/bolinas/parse/parse.py) the UD graphs on the dev set, saving the resulting charts as an intermediary output. We prune the parsing above 10,000 steps for gr100 and gr200 and above 50,000 steps for gr300. The parsing takes from one hour to one day (gr300); see more [here](pipeline/log).
Using the grammar, first we [parse](steps/bolinas/parse/parse.py) the UD graphs on the dev set, saving the resulting charts as an intermediary output. We prune the parsing above 50,000 steps. The parsing takes from one hour to one day; see more [here](pipeline/log).

```bash
- python bolinas/parse/parse.py -d $DATA_DIR -c bolinas/parse/config/parse_gr100.json
+ python steps/bolinas/parse/parse.py -d $DATA_DIR -c pipeline/config/parse_100.json
```
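
The pruning itself is only a step budget on chart construction; schematically (the parser object and its methods below are placeholders, not the Bolinas API):

```python
def parse_with_budget(parser, max_steps=50000):
    # Process at most max_steps chart items, then give up on the
    # sentence, so pathological inputs cannot stall the whole run.
    steps = 0
    while not parser.done():
        if steps >= max_steps:
            return None  # pruned: the sentence counts as unparsed
        parser.process_next_item()
        steps += 1
    return parser.chart
```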

We [search](bolinas/kbest/kbest.py) for the k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or the classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps per sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](postproc/postproc.py): predicate resolution in case no predicate is found, and an argument grouping and indexing step, since we only have `A` as a nonterminal. For the `precision`, `recall` and `f1-score` filters this postprocessing step has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexings for the identified argument groups. This search takes from one hour to 2.5 days; see more [here](bolinas/kbest/log).
We [search](steps/bolinas/kbest/kbest.py) for the k best derivations in the chart. We apply different filters on the chart: `basic` (no filtering), `max` (searching only among the largest derivations), or the classic retrieval metrics `precision`, `recall` and `f1-score`, where we cheat by using the gold data and returning for each gold entry only the one derivation with the highest respective score. To calculate these scores we use the same triplet matching and scoring function as in the evaluation step. Our system returns at most k node-label maps per sentence, where a label corresponds to a nonterminal symbol. This mapping requires some [postprocessing](steps/postproc/postproc.py): predicate resolution in case no predicate is found, and an argument grouping and indexing step, since we only have `A` as a nonterminal. For the `precision`, `recall` and `f1-score` filters this postprocessing step has to be done before calculating the scores. We also try argument permutation, in which case we try all possible argument indexings for the identified argument groups. This search takes from one hour to 2.5 days; see more [here](pipeline/log).

```bash
- python bolinas/kbest/kbest.py -d $DATA_DIR -c bolinas/kbest/config/kbest_gr100.json
+ python steps/bolinas/kbest/kbest.py -d $DATA_DIR -c pipeline/config/kbest_100.json
```
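
For the metric filters, the selection amounts to the following sketch; `score` stands for the shared triplet-matching function used in evaluation, and the data shapes are assumptions:

```python
def best_per_gold(gold_extractions, derivations, score):
    # For each gold extraction keep only the single derivation with the
    # highest score against it ("cheating", since gold data is used).
    best = []
    for gold in gold_extractions:
        if derivations:
            best.append(max(derivations, key=lambda d: score(d, gold)))
    return best
```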

After the k best derivations are found, we [predict](predict/predict.py) the labels, applying the necessary [postprocessing](postproc/postproc.py) steps (for `basic` and `max`). Further postprocessing strategies could be implemented; for now `keep` (resolving the predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.
After the k best derivations are found, we [predict](steps/predict/predict.py) the labels, applying the necessary [postprocessing](steps/postproc/postproc.py) steps (for `basic` and `max`). Further postprocessing strategies could be implemented; for now `keep` (resolving the predicate only if not present, forming argument groups as continuous A-label word spans and indexing these groups from left to right) is our only strategy.

- ```python
- # TBD
+ ```bash
+ python steps/predict/predict.py -d $DATA_DIR -c pipeline/config/predict_100.json
```
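
A minimal sketch of the `keep` strategy on a word-level label sequence; the helper below is illustrative, not the actual postproc.py implementation, and the first-verb fallback for predicate resolution is an assumption:

```python
def keep_strategy(labels, pos_tags):
    # labels: one of 'P', 'A' or 'X' per word.
    out = list(labels)
    if 'P' not in out:
        # Predicate resolution: fall back to the first verb, if any
        # (assumed fallback for illustration).
        for i, tag in enumerate(pos_tags):
            if tag == 'VERB':
                out[i] = 'P'
                break
    arg_idx = -1
    prev_is_arg = False
    for i, label in enumerate(out):
        if label == 'A':
            if not prev_is_arg:
                arg_idx += 1  # a new continuous A-span starts here
            out[i] = f'A{arg_idx}'  # index argument groups left to right
            prev_is_arg = True
        else:
            prev_is_arg = False
    return out
```

For example, `['A', 'A', 'X', 'P', 'A']` becomes `['A0', 'A0', 'X', 'P', 'A1']`.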

Once all sentences are predicted, we [merge](predict/merge.py) them into one json per model.
Once all sentences are predicted, we [merge](steps/predict/merge.py) them into one json per model.

- ```python
- # TBD
+ ```bash
+ python steps/predict/merge.py -d $DATA_DIR -c pipeline/config/merge_100.json
```
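
Conceptually the merge is a simple fold over the per-sentence outputs; the file layout and key names below are assumptions for illustration:

```python
import json
from pathlib import Path

def merge_predictions(sen_dirs, model_name, out_file):
    # Collect the per-sentence prediction jsons into one json per model,
    # keyed by sentence id.
    merged = {}
    for sen_dir in sen_dirs:
        with open(Path(sen_dir) / f"{model_name}.json") as f:
            merged.update(json.load(f))
    with open(out_file, "w") as f:
        json.dump(merged, f, indent=4)
```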

#### Run the whole predict pipeline on dev

```bash
# Hrg - 100
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_100.json

# Hrg - 200
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_200.json

# Hrg - 300
python pipeline/pipeline.py -d $DATA_DIR -c pipeline/config/pipeline_dev_300.json
```

### Create random predictions for comparison

We implement a [random extractor](random/random_extractor.py) that uses the [artefacts](random/train_stat) of the training dataset (the distribution of the number of extractions per sentence and the distribution of labels per sentence length) and ensures that the predicate is a verb.

- ```python
+ ```bash
# TBD
```
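
Until the command is filled in, the idea behind the extractor can be sketched as follows; the shapes of the two train_stat artefacts and all names here are assumptions:

```python
import random

def random_extraction(words, pos_tags, n_extr_dist, label_dist_by_len):
    # n_extr_dist: {number_of_extractions: probability}
    # label_dist_by_len: {sentence_length: {label: probability}}
    verbs = [i for i, tag in enumerate(pos_tags) if tag == "VERB"]
    if not verbs:
        return []  # no verb, no extraction with a verbal predicate
    n = random.choices(list(n_extr_dist), weights=list(n_extr_dist.values()))[0]
    extractions = []
    for _ in range(n):
        dist = label_dist_by_len[len(words)]
        labels = random.choices(list(dist), weights=list(dist.values()), k=len(words))
        # Ensure the predicate is a verb: demote sampled 'P' labels and
        # place the predicate on a randomly chosen verb instead.
        labels = ['X' if l == 'P' else l for l in labels]
        labels[random.choice(verbs)] = 'P'
        extractions.append(labels)
    return extractions
```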

@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr100",
"out_dir": "dev_100",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr200",
"out_dir": "dev_200",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,34 +1,29 @@
{
"in_dir": "dev_preproc",
"out_dir": "dev_gr300",
"out_dir": "dev_300",
"arg_permutation": false,
"filters":
{
"basic":
{
"ignore": false,
"chart_filter": "basic",
"k": 10
},
"max":
{
"ignore": false,
"chart_filter": "max",
"k": 10
},
"prec":
{
"ignore": false,
"pr_metric": "prec"
},
"rec":
{
"ignore": false,
"pr_metric": "rec"
},
"f1":
{
"ignore": false,
"pr_metric": "f1"
}
}
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr200",
"in_dir": "dev_100",
"k": 10,
"bolinas_chart_filters":
[
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr300",
"in_dir": "dev_200",
"k": 10,
"bolinas_chart_filters":
[
@@ -1,5 +1,5 @@
{
"in_dir": "dev_gr100",
"in_dir": "dev_300",
"k": 10,
"bolinas_chart_filters":
[
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_100.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_100.hrg",
"out_dir": "dev_100",
"max_steps": 50000
}
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_200.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_200.hrg",
"out_dir": "dev_200",
"max_steps": 50000
}
6 changes: 6 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/parse_300.json
@@ -0,0 +1,6 @@
{
"in_dir": "dev_preproc",
"grammar_file": "hrg_300.hrg",
"out_dir": "dev_300",
"max_steps": 50000
}
6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr100.json

This file was deleted.

6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr200.json

This file was deleted.

6 changes: 0 additions & 6 deletions tuw_nlp/sem/hrg/pipeline/config/parse_gr300.json

This file was deleted.

25 changes: 25 additions & 0 deletions tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_100.json
@@ -0,0 +1,25 @@
{
"steps":
[
{
"step_name": "parse",
"script_name": "parse",
"config": "parse_100.json"
},
{
"step_name": "kbest",
"script_name": "kbest",
"config": "kbest_100.json"
},
{
"step_name": "predict",
"script_name": "predict",
"config": "predict_100.json"
},
{
"step_name": "merge",
"script_name": "merge",
"config": "merge_100.json"
}
]
}
10 changes: 0 additions & 10 deletions tuw_nlp/sem/hrg/pipeline/config/pipeline_dev_preproc.json

This file was deleted.

@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr200",
"in_dir": "dev_100",
"bolinas_chart_filters":
[
"basic",
@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr300",
"in_dir": "dev_200",
"bolinas_chart_filters":
[
"basic",
@@ -1,6 +1,6 @@
{
"preproc_dir": "dev_preproc",
"in_dir": "dev_gr100",
"in_dir": "dev_300",
"bolinas_chart_filters":
[
"basic",
16 changes: 0 additions & 16 deletions tuw_nlp/sem/hrg/pipeline/log/pipeline_dev_preproc.log

This file was deleted.

1 change: 0 additions & 1 deletion tuw_nlp/sem/hrg/pipeline/pipeline.py
@@ -17,7 +17,6 @@
class Pipeline(Script):
def __init__(self, log=True, config=None):
super().__init__("Script to run a pipeline.", log, config)
self.pipeline_name = self.config["name"]
self.steps = self.config["steps"]
self.name_to_class = {
"preproc": Preproc,
2 changes: 2 additions & 0 deletions tuw_nlp/sem/hrg/steps/bolinas/kbest/kbest.py
@@ -131,6 +131,8 @@ def _do_for_sen(self, sen_idx, sen_dir):
top_order,
self.config["arg_permutation"],
)
if len(k_best_unique_derivations) == 0:
sen_log_lines.append("No matches\n")
else:
print("Neither 'k' nor 'pr_metric' is set")
continue
2 changes: 0 additions & 2 deletions tuw_nlp/sem/hrg/steps/bolinas/parse/parse.py
@@ -1,8 +1,6 @@
import os.path
import fileinput
import pickle

from tuw_nlp.sem.hrg.common.io import log_to_console_and_log_lines
from tuw_nlp.sem.hrg.common.script.loop_on_sen_dirs import LoopOnSenDirs
from tuw_nlp.sem.hrg.steps.bolinas.common.grammar import Grammar
from tuw_nlp.sem.hrg.steps.bolinas.common.hgraph.hgraph import Hgraph
1 change: 0 additions & 1 deletion tuw_nlp/sem/hrg/steps/predict/predict.py
@@ -12,7 +12,6 @@ class Predict(LoopOnSenDirs):

def __init__(self, config=None):
super().__init__(description="Script to create wire jsons from predicted bolinas labels.", config=config)
self.out_dir += self.in_dir
self.preproc_dir = f"{self.data_dir}/{self.config['preproc_dir']}"
self.chart_filters = self.config["bolinas_chart_filters"]
self.postprocess = self.config["postprocess"]