
How to Fine Tune the translation task #5

Open
OWaheed opened this issue Apr 13, 2022 · 4 comments

@OWaheed

OWaheed commented Apr 13, 2022

I need to fine-tune the model for the translation task. Should I prepare the data in a specific format, and how do I fine-tune your model for that task?

@salma-elshafey

You can put your parallel data in a tab-separated (TSV) file.
Assuming the source sentence column is called 'input_text' and the target sentence column is called 'target_text', the fine-tuning command should look similar to this:

!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --learning_rate 5e-5 \
  --max_target_length 128 --max_source_length 128 \
  --per_device_train_batch_size 8 --per_device_eval_batch_size 8 \
  --model_name_or_path "UBC-NLP/AraT5-msa-small" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --output_dir "AraT5-translation" --overwrite_output_dir \
  --num_train_epochs 5 \
  --train_file "train.tsv" \
  --validation_file "valid.tsv" \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --load_best_model_at_end --metric_for_best_model "eval_bleu" --greater_is_better True \
  --evaluation_strategy epoch --logging_strategy epoch --predict_with_generate \
  --do_train --do_eval
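
For reference, a minimal sketch (assuming pandas) of writing parallel data into the tab-separated format expected above, using the 'input_text'/'target_text' column names and the train.tsv file name from the command; the sentence pairs are placeholders:

import pandas as pd

# Hypothetical parallel sentence pairs; replace with your own data.
pairs = [
    {"input_text": "جملة عربية", "target_text": "An Arabic sentence."},
    {"input_text": "جملة أخرى", "target_text": "Another sentence."},
]

# Write a tab-separated file with the column names used by the fine-tuning command.
pd.DataFrame(pairs).to_csv("train.tsv", sep="\t", index=False)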

@hust-kevin

@salma-elshafey @Nagoudi How do I use AraT5 for machine translation? What is the inference script?

@BarahFazili

@hust-kevin You simply add the argument --do_predict to write the predictions (test_preds_seq2seq.txt) for the provided test_file.
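
A sketch of what such a prediction run might look like, reusing the flags from the fine-tuning command above; the --model_name_or_path (pointing at the fine-tuned checkpoint) and --output_dir values here are illustrative:

!python araT5/examples/run_trainier_seq2seq_huggingface.py \
  --model_name_or_path "AraT5-translation" \
  --source_lang "ar_AR" --target_lang "en_XX" \
  --max_target_length 128 --max_source_length 128 \
  --per_device_eval_batch_size 8 \
  --output_dir "AraT5-translation-preds" --overwrite_output_dir \
  --test_file "test.tsv" \
  --task "translation" --text_column "input_text" --summary_column "target_text" \
  --predict_with_generate \
  --do_predict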

@ss8319

ss8319 commented Aug 3, 2022

@hust-kevin @BarahFazili
If we were to just run inference (e.g. provide a dialectal Arabic document to AraT5 and obtain an English translation) without fine-tuning, could we reuse the script python araT5/examples/run_trainier_seq2seq_huggingface.py?

Do I need to specify --source_lang if I am using dialectal Arabic as input? For instance, arz for Egyptian Arabic, which was part of the training dataset? What should my source language tag be if I am running inference on another Arabic dialect that was not part of the training set, e.g. acw (Hijazi Arabic)?

I am not sure if the way I am running inference is correct. I am using pipeline(). Is there an ideal way to run inference?

from tqdm.auto import tqdm
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("UBC-NLP/AraT5-msa-base")
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/AraT5-msa-base")
pipe = pipeline("translation_arz_to_en", model=model, tokenizer=tokenizer, max_length=60)

# src_text: list of dialectal Arabic source sentences, loaded elsewhere
with open('/content/gdrive/Shareddrives/Gutenberg/MT/experiments/HuggingFace/AraT5-msa-base-acw-v2-en_JHN/acw_pred.txt', 'w', encoding='utf-8') as f:
    for line in src_text:
        for out in tqdm(pipe(line)):
            for value in out.values():  # out is a dict; write() takes a str
                f.write('%s\n' % value)
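
For comparison, a minimal sketch of running the same model without pipeline(), by calling the tokenizer and model.generate directly; the batch size and generation length here are illustrative assumptions, not values from the AraT5 authors:

# model, tokenizer, and src_text as defined above
batch_size = 8
translations = []
for i in range(0, len(src_text), batch_size):
    batch = src_text[i:i + batch_size]
    enc = tokenizer(batch, return_tensors="pt", padding=True, truncation=True, max_length=128)
    out = model.generate(**enc, max_length=60)
    translations.extend(tokenizer.batch_decode(out, skip_special_tokens=True))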
