Initial commit
pprobst committed Oct 13, 2023
1 parent 0f1ce93 commit 6a9b346
Showing 10 changed files with 735 additions and 60 deletions.
72 changes: 12 additions & 60 deletions README.md
@@ -1,68 +1,20 @@
-<p align="center">
-<img width="600" src=".github/logo.png" title="Project logo"><br /><br />
-<img src="https://img.shields.io/maintenance/yes/2022?style=for-the-badge" title="Project status">
-<img src="https://img.shields.io/github/workflow/status/iarahealth/template/CI?label=Build&logo=github&logoColor=white&style=for-the-badge" title="Build status">
-</p>
+# iaraugen
+Data augmentation/generation utilities for Iara.

-# Title
+Can be used standalone (instructions below) or as part of other programs.

-Project description goes here. This description is usually two to three lines long. It should give an overview of what the project is, eg technology used, philosophy of existence, what problem it is trying to solve, etc. If you need to write more than 3 lines of description, create subsections.
-
-> **NOTICE:** put here a message that is very relevant to users of the project, if any.
-
-## ✨Features
-
-Here you can place screenshots of the project. Also describe your features using a list:
-
-* ✔️ Easy integration;
-* 🥢 Few dependencies;
-* 🎨 Beautiful template with a nice `README`;
-* 🖖 Great documentation and testing?
-
-## 🚀 Getting started
-
-### 1. First step to get started
-
-Usually the first step to get started is to install dependencies to run the project. Run:
-
-```
-apt get install dependency
+## Offline text augmentation
 ```
+help: ./txt_aug.py -h

-It is recommended to place each command on a different line:
-
+example usage:
+./txt_aug.py corpus_1br_10pt_15sept.tok --aug translate --maxs 10 --lang en --translate_mode local --append --output out.txt
 ```
-apt get install something else
-```

-This way users can copy and paste without reading the documentation (which is what usually happens).
-
-### 2. Other step(s)
-
-Usually the next steps teach you how to install and configure the project for use / development. Run:
-
-```
-git clone https://github.com/iarahealth/template template
+## Offline text generation
 ```
+help: ./txt_gen.py -h

-## 🤝 Contribute
-
-Your help is most welcome regardless of form! Check out the [CONTRIBUTING.md](CONTRIBUTING.md) file for all ways you can contribute to the project. For example, [suggest a new feature](https://github.com/iarahealth/template/issues/new?assignees=&labels=&title=), [report a problem/bug](https://github.com/iarahealth/template/issues/new?assignees=&labels=bug&title=), [submit a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests), or simply use the project and comment your experience. You are encourage to participate as much as possible, but stay tuned to the [code of conduct](./CODE_OF_CONDUCT.md) before making any interaction with other community members.
-
-See the [ROADMAP.md](ROADMAP.md) file for an idea of how the project should evolve.
-
-## 🎫 License
-
-This project is proprietary and confidential. Unauthorized copying of any file in this repository, via any medium is strictly prohibited. Contact [[email protected]](mailto:[email protected]) for inquiries or reports.
-
-## 🧬 Changelog
-
-See all changes to this project in the [CHANGELOG.md](CHANGELOG.md) file.
-
-## 🧪 Similar projects
-
-Below is a list of interesting links and similar projects:
-
-* [Other project](https://github.com/project)
-* [Project inspiration](https://github.com/project)
-* [Similar tool](https://github.com/project)
+example usage:
+./txt_gen.py --input_file palavras.txt --context "radiologia médica" --num 2 --return_type "frases" --api_key "YOUR_OPENAI_API_KEY" --output query.txt
+```
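The `txt_aug.py` example above passes `--maxs 10`, and the script's help text says the value may also be a percentage of the corpus (e.g. `5%`). The helper that interprets this value lives in `util/files.py`, which is not shown in this commit view, so the following is only a minimal sketch of how such a value could be resolved. The function name `resolve_max_sentences` and its rounding and clamping behaviour are assumptions, not the project's actual implementation.

```python
def resolve_max_sentences(maxs: str, total: int) -> int:
    """Turn a --maxs style value ("10" or "5%") into a sentence count.

    Hypothetical helper: the real parsing in util/files.py may differ.
    """
    text = maxs.strip()
    if text.endswith("%"):
        # Percentage of the corpus, rounded down.
        percent = float(text[:-1])
        if not 0 <= percent <= 100:
            raise ValueError(f"Percentage out of range: {maxs!r}")
        count = int(total * percent / 100)
    else:
        # Absolute number of sentences.
        count = int(text)
        if count < 0:
            raise ValueError(f"Negative sentence count: {maxs!r}")
    # Never ask for more sentences than the corpus contains.
    return min(count, total)
```

Keeping the option as a string on the command line and resolving it once the corpus size is known is one way to let a single flag accept both forms.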
Empty file added __init__.py
Empty file.
10 changes: 10 additions & 0 deletions requirements.txt
@@ -0,0 +1,10 @@
torch
deep-translator
transformers
sentencepiece
tenacity
openai
nlpaug
nltk
num2words
tqdm
23 changes: 23 additions & 0 deletions setup.py
@@ -0,0 +1,23 @@
from setuptools import setup, find_packages

authors = ["Pedro Probst", "Bernardo Henz"]

setup(
    name="iaraugen",
    version="1.0.0",
    author=", ".join(authors),
    description="Data augmentation/generation functions used at Iara Health (speech-to-text).",
    packages=find_packages(),
    install_requires=[
        "torch",
        "deep-translator",
        "transformers",
        "sentencepiece",
        "tenacity",
        "openai",
        "nlpaug",
        "nltk",
        "num2words",
        "tqdm",
    ],
)
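The dependency list in `setup.py` repeats the ten entries of `requirements.txt`. One common way to avoid keeping the two in sync by hand is to read the requirements file at build time. The sketch below shows the idea; `read_requirements` is a hypothetical helper that is not part of this commit, and it only handles plain package lines (no `-r` includes or pip options).

```python
from pathlib import Path
from typing import List


def read_requirements(path: str = "requirements.txt") -> List[str]:
    """Return the package lines of a requirements file.

    Hypothetical helper: blank lines and '#' comment lines are skipped;
    pip options such as '-r other.txt' are not handled.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    stripped = (line.strip() for line in lines)
    return [line for line in stripped if line and not line.startswith("#")]
```

With such a helper, the `setup()` call could pass `install_requires=read_requirements()` in place of the literal list, provided `requirements.txt` is shipped alongside `setup.py`.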
236 changes: 236 additions & 0 deletions txt_aug.py
@@ -0,0 +1,236 @@
#!/usr/bin/env python3
import argparse
import random
import torch
from typing import List
from deep_translator import GoogleTranslator
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from tqdm import tqdm
from util.text_augmenter import SentenceAugmenter
from util.text import (
    post_process_sentences,
    print_sentences_comparison,
    remove_equal_sentences,
)
from util.files import read_sentences_corpus, append_sentences_to_file

"""
example usage:
./txt_aug.py corpus.tok --aug translate random --action delete --maxs 10 --lang en --append
"""


def backtranslate_sentences_api(
    sentences: List[str], source_lang: str, target_lang: str
) -> List[str]:
    """
    Backtranslates a list of sentences from the source language to the target language,
    and then back to the source language, using the Google Translator API.
    Args:
        sentences (List[str]): The list of sentences to be backtranslated.
        source_lang (str): The source language code (e.g., "pt" for Portuguese).
        target_lang (str): The target language code (e.g., "en" for English).
    Returns:
        List[str]: A list of backtranslated sentences.
    """
    # Forward pass: source -> target.
    translator = GoogleTranslator(source=source_lang, target=target_lang)
    translations = translator.translate_batch(sentences)
    # Backward pass: target -> source.
    backtranslator = GoogleTranslator(source=target_lang, target=source_lang)
    backtranslations = backtranslator.translate_batch(translations)

    return backtranslations


def backtranslate_sentences_local(
    sentences: List[str], source_lang: str, target_lang: str, device: str = "cpu"
) -> List[str]:
    """
    Backtranslates a list of sentences from the source language to the target language,
    and then back to the source language using a local model.
    Args:
        sentences (List[str]): The list of sentences to be backtranslated.
        source_lang (str): The source language code (e.g., "pt" for Portuguese).
        target_lang (str): The target language code (e.g., "en" for English).
        device (str): The device to run the model on (e.g., "cpu" or "cuda").
    Returns:
        List[str]: A list of backtranslated sentences.
    Note:
        nlpaug has a backtranslation module, but it only officially supports
        Helsinki-NLP models, and we do not have a Helsinki model for
        Portuguese -> English. So we use the T5 model directly from HuggingFace.
    """
    # Forward model: source -> target.
    tokenizer = AutoTokenizer.from_pretrained(
        f"unicamp-dl/translation-{source_lang}-{target_lang}-t5"
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(
        f"unicamp-dl/translation-{source_lang}-{target_lang}-t5"
    )
    model.to(torch.device(device))
    # Backward model: target -> source.
    backtokenizer = AutoTokenizer.from_pretrained(
        f"unicamp-dl/translation-{target_lang}-{source_lang}-t5"
    )
    backmodel = AutoModelForSeq2SeqLM.from_pretrained(
        f"unicamp-dl/translation-{target_lang}-{source_lang}-t5"
    )
    backmodel.to(torch.device(device))
    pten_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
    enpt_pipeline = pipeline(
        "text2text-generation", model=backmodel, tokenizer=backtokenizer
    )

    print(f"Backtranslating {len(sentences)} sentences...")
    translations: List[str] = []
    # NOTE: the prompts below are hard-coded for Portuguese <-> English, even
    # though the model names are built from source_lang and target_lang.
    for sentence in tqdm(sentences):
        transl = pten_pipeline("translate Portuguese to English: " + sentence)[0][
            "generated_text"
        ]
        backtransl = enpt_pipeline("translate English to Portuguese: " + transl)[0][
            "generated_text"
        ]
        translations.append(backtransl)

    return translations


def translation_pipeline(
    sentences: List[str], translate_mode: str, lang: str, device: str
) -> List[str]:
    """
    Runs the translation pipeline to backtranslate a list of sentences.
    Args:
        sentences (List[str]): The list of sentences to be backtranslated.
        translate_mode (str): Translation backend: "local" (model) or "google" (API).
        lang (str): The target language code (e.g., "en" for English).
        device (str): The device to run the model on (e.g., "cpu" or "cuda").
    Returns:
        List[str]: A list of backtranslated sentences.
    """
    augmented_sentences: List[str] = []
    print(f"Backtranslating sentences pt->{lang}->pt...")
    if translate_mode == "local":
        augmented_sentences = backtranslate_sentences_local(
            sentences, "pt", lang, device
        )
    elif translate_mode == "google":
        augmented_sentences = backtranslate_sentences_api(sentences, "pt", lang)
    else:
        # Any other mode (including "openai", which the CLI accepts) has no
        # backend here, so fail with an explicit message.
        raise ValueError(f"Unsupported translate_mode: {translate_mode!r}")
    assert len(augmented_sentences), "Backtranslation returned no sentences"
    return augmented_sentences


def create_augmentation_sequence(
    augmentations: List[str], action: str, translate_mode: str, lang: str, device: str
) -> List[callable]:
    augmentation_sequence = []
    for aug in augmentations:
        if aug == "random" or aug == "synonym":
            augmenter = SentenceAugmenter(aug, action=action)
            # Bind the current augmenter as a default argument; a plain closure
            # would make every lambda use the augmenter of the last iteration.
            augmentation_sequence.append(
                lambda x, augmenter=augmenter: augmenter.augment_sentences(x)
            )
        elif aug == "translate":
            augmentation_sequence.append(
                lambda x: translation_pipeline(x, translate_mode, lang, device)
            )
    return augmentation_sequence


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Sentence augmentation: back-translate, random delete, random swap, synonym replacement."
    )
    parser.add_argument("corpus", type=str, help="Input corpus file")
    parser.add_argument(
        "--aug",
        nargs="+",
        type=str,
        required=True,
        choices=["random", "translate", "synonym"],
        help="Augmentation type(s) to perform, applied in the given order",
    )
    parser.add_argument(
        "--action",
        type=str,
        choices=["delete", "swap"],
        default="delete",
        help="Action to perform (default: delete)",
    )
    parser.add_argument(
        "--maxs",
        type=str,
        default="10",
        help="Maximum number of sentences to process. Can be a percentage of the total, e.g., 5%% (default: 10)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=451,
        help="Random seed (default: 451)",
    )
    parser.add_argument(
        "--lang",
        type=str,
        default="en",
        help="Target language for translation (default: en)",
    )
    parser.add_argument(
        "--translate_mode",
        type=str,
        choices=["google", "local", "openai"],
        default="local",
        help="Translation backend (default: local)",
    )
    parser.add_argument(
        "--device",
        type=str,
        default="cpu",
        help="Process on CPU or CUDA (default: cpu)",
    )
    parser.add_argument(
        "--output",
        type=str,
        default=None,
        help="Output file to append the augmented sentences to",
    )
    parser.add_argument("--append", action="store_true", help="Append to corpus file")
    args = parser.parse_args()

    random.seed(args.seed)

    sentences = read_sentences_corpus(args.corpus, max_sentences=args.maxs)
    print(f"Read {len(sentences)} sentences from {args.corpus}")

    augmentation_sequence = create_augmentation_sequence(
        args.aug, args.action, args.translate_mode, args.lang, args.device
    )

    # Apply the augmentations in order, each one on the previous step's output.
    augmented_sentences = sentences
    for i, aug_fn in enumerate(augmentation_sequence):
        print(f"Augmentation step {i + 1} of {len(augmentation_sequence)}:")
        augmented_sentences = aug_fn(augmented_sentences)

    augmented_sentences = post_process_sentences(augmented_sentences)
    sentences = post_process_sentences(sentences)
    print_sentences_comparison(sentences, augmented_sentences)

    print("Removing equal sentences...")
    augmented_sentences = remove_equal_sentences(sentences, augmented_sentences)

    print("\nFinal results:")
    print("-------------------")
    for sentence in augmented_sentences:
        print(sentence)
    print(f"\nTotal: {len(augmented_sentences)} sentences")
    print("-------------------\n")

    if args.append:
        print(f"Appending augmented sentences to {args.corpus}...")
        append_sentences_to_file(args.corpus, augmented_sentences)

    if args.output:
        print(f"Appending augmented sentences to {args.output}...")
        append_sentences_to_file(args.output, augmented_sentences)
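A general Python point applies to code that builds callables inside a loop, as `create_augmentation_sequence` does: a lambda looks up the variables it closes over when it is called, not when it is defined. The self-contained sketch below illustrates this with a stand-in for the augmenter; the names `build_steps_late_binding`, `build_steps_bound` and the `tag` suffix are illustrative only and do not come from the repository.

```python
from typing import Callable, List

Step = Callable[[List[str]], List[str]]


def build_steps_late_binding(names: List[str]) -> List[Step]:
    """Every lambda shares the same `tag` variable (late binding)."""
    steps: List[Step] = []
    for name in names:
        tag = f"[{name}]"  # stand-in for a per-step augmenter object
        steps.append(lambda xs: [x + tag for x in xs])
    return steps


def build_steps_bound(names: List[str]) -> List[Step]:
    """Each lambda keeps its own `tag` via a default argument."""
    steps: List[Step] = []
    for name in names:
        tag = f"[{name}]"
        steps.append(lambda xs, tag=tag: [x + tag for x in xs])
    return steps


late = build_steps_late_binding(["random", "synonym"])
bound = build_steps_bound(["random", "synonym"])

# With late binding, the first step sees the value from the last iteration.
first_late = late[0](["a"])
# With the default-argument binding, the first step keeps its own value.
first_bound = bound[0](["a"])
```

This only matters when more than one closure is created in the loop, for example when both `random` and `synonym` are passed to `--aug` in the same run.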