
How to Fine Tune the translation task #5

Open
OWaheed opened this issue Apr 13, 2022 · 4 comments

@OWaheed

OWaheed commented Apr 13, 2022

I need to fine-tune the model for the translation task. Should I prepare the data in a specific format, and how do I fine-tune your model for that task?

@salma-elshafey

You can put your parallel data in a tab-separated (TSV) file.
Assuming the source sentence column is called 'input_text' and the target sentence column is called 'target_text', the fine-tuning command should look similar to this:

!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --learning_rate 5e-5 \
  --max_target_length 128 --max_source_length 128 \
  --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
  --model_name_or_path "UBC-NLP/AraT5-msa-small" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --output_dir "AraT5-translation" --overwrite_output_dir \
  --num_train_epochs 5 \
  --train_file "train.tsv" \
  --validation_file "valid.tsv" \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
  --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
  --do_train --do_eval
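
For reference, a minimal sketch (assuming pandas) of writing parallel data into the tab-separated format expected above, using the 'input_text'/'target_text' column names and the train.tsv file name from the command; the sentence pairs are placeholders:

import pandas as pd

# Hypothetical parallel sentence pairs; replace with your own data.
pairs = [
    {"input_text": "جملة عربية", "target_text": "An Arabic sentence."},
    {"input_text": "جملة أخرى", "target_text": "Another sentence."},
]

# Write a tab-separated file with the column names used by the fine-tuning command.
pd.DataFrame(pairs).to_csv("train.tsv", sep="\t", index=False)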

@hust-kevin

@salma-elshafey @Nagoudi How do I use AraT5 for machine translation? What is the inference script?

@BarahFazili

@hust-kevin You simply add the argument --do_predict to write the predictions (test_preds_seq2seq.txt) for the provided test_file.
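
A sketch of what such a prediction run might look like, reusing the flags from the fine-tuning command above; the --model_name_or_path (pointing at the fine-tuned checkpoint) and --output_dir values here are illustrative:

!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --model_name_or_path "AraT5-translation" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --max_target_length 128 --max_source_length 128 \
  --per_device_eval_batch_size 8 \
  --output_dir "AraT5-translation-preds" --overwrite_output_dir \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --predict_with_generate \
  --do_predict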

@ss8319

ss8319 commented Aug 3, 2022

@hust-kevin @BarahFazili
If we were to just run inference (e.g. provide a dialectal Arabic document to AraT5 and obtain an English translation) without fine-tuning, could we reuse the script python araT5/examples/run_trainier_seq2seq_huggingface.py?

Do I need to specify --source_lang if I am using dialectal Arabic as input? For instance, arz for Egyptian Arabic, which was part of the training dataset? What should my source language tag be if I am running inference on another Arabic dialect that was not part of the training set, e.g. acw (Hijazi Arabic)?

I am not sure if the way I am running inference is correct. I am using pipeline(). Is there an ideal way to run inference?

from tqdm.auto import tqdm
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-msa-base")
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-msa-base")
pipe = pipeline("translation_arz_to_en", model=model, tokenizer=tokenizer, max_length=60)

# src_text: list of dialectal Arabic source sentences, loaded elsewhere
with open('/content/gdrive/Shareddrives/Gutenberg/MT/experiments/HuggingFace/AraT5-msa-base-acw-v2-en_JHN/acw_pred.txt', 'w', encoding='utf-8') as f:
    for line in src_text:
        for out in tqdm(pipe(line)):
            for value in out.values():  # out is a dict; write() takes a str
                f.write('%s\n' % value)
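
For comparison, a minimal sketch of running the same model without pipeline(), by calling the tokenizer and model.generate directly; the batch size and generation length here are illustrative assumptions, not values from the AraT5 authors:

# model, tokenizer, and src_text as defined above
batch_size = 8
translations = []
for i in range(0, len(src_text), batch_size):
    batch = src_text[i:i + batch_size]
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
    out = model.generate(**enc, max_length=60)
    translations.extend(tokenizer.batch_decode(out, skip_special_tokens=True))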
