Initial commit

iarahealth · Oct 13, 2023 · 6a9b346
1 parent 0f1ce93

Showing 10 changed files with 735 additions and 60 deletions.
diff --git a/README.md b/README.md
@@ -1,68 +1,20 @@
-<p align="center">
-    <img width="600" src=".github/logo.png" title="Project logo"><br /><br />
-    <img src="https://img.shields.io/maintenance/yes/2022?style=for-the-badge" title="Project status">
-    <img src="https://img.shields.io/github/workflow/status/iarahealth/template/CI?label=Build&logo=github&logoColor=white&style=for-the-badge" title="Build status">
-</p>
+# iaraugen
+Data augmentation/generation utilities for Iara.
 
-# Title
+Can be used standalone (instructions below) or as part of other programs.
 
-Project description goes here. This description is usually two to three lines long. It should give an overview of what the project is, eg technology used, philosophy of existence, what problem it is trying to solve, etc. If you need to write more than 3 lines of description, create subsections.
-
-> **NOTICE:** put here a message that is very relevant to users of the project, if any.
-
-## ✨Features
-
-Here you can place screenshots of the project. Also describe your features using a list:
-
-* ✔️ Easy integration;
-* 🥢 Few dependencies;
-* 🎨 Beautiful template with a nice `README`;
-* 🖖 Great documentation and testing?
-
-## 🚀 Getting started
-
-### 1. First step to get started
-
-Usually the first step to get started is to install dependencies to run the project. Run:
-
-```
-apt get install dependency
+## Offline text augmentation
 ```
+help: ./txt_aug.py -h
 
-It is recommended to place each command on a different line:
-
+example usage:
+./txt_aug.py corpus_1br_10pt_15sept.tok --aug translate --maxs 10 --lang en --translate_mode local --append --output out.txt
 ```
-apt get install something else
-```
-
-This way users can copy and paste without reading the documentation (which is what usually happens).
-
-### 2. Other step(s)
 
-Usually the next steps teach you how to install and configure the project for use / development. Run:
-
-```
-git clone https://github.com/iarahealth/template template
+## Offline text generation
 ```
+help: ./txt_gen.py -h
 
-## 🤝 Contribute
-
-Your help is most welcome regardless of form! Check out the [CONTRIBUTING.md](CONTRIBUTING.md) file for all ways you can contribute to the project. For example, [suggest a new feature](https://github.com/iarahealth/template/issues/new?assignees=&labels=&title=), [report a problem/bug](https://github.com/iarahealth/template/issues/new?assignees=&labels=bug&title=), [submit a pull request](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests), or simply use the project and comment your experience. You are encourage to participate as much as possible, but stay tuned to the [code of conduct](./CODE_OF_CONDUCT.md) before making any interaction with other community members.
-
-See the [ROADMAP.md](ROADMAP.md) file for an idea of how the project should evolve.
-
-## 🎫 License
-
-This project is proprietary and confidential. Unauthorized copying of any file in this repository, via any medium is strictly prohibited. Contact [[email protected]](mailto:[email protected]) for inquiries or reports.
-
-## 🧬 Changelog
-
-See all changes to this project in the [CHANGELOG.md](CHANGELOG.md) file.
-
-## 🧪 Similar projects
-
-Below is a list of interesting links and similar projects:
-
-* [Other project](https://github.com/project)
-* [Project inspiration](https://github.com/project)
-* [Similar tool](https://github.com/project)
+example usage:
+./txt_gen.py --input_file palavras.txt --context "radiologia médica" --num 2 --return_type "frases" --api_key "YOUR_OPENAI_API_KEY" --output query.txt
+```
diff --git a/__init__.py b/__init__.py
diff --git a/requirements.txt b/requirements.txt
@@ -0,0 +1,10 @@
+torch
+deep-translator
+transformers
+sentencepiece
+tenacity
+openai
+nlpaug
+nltk
+num2words
+tqdm
diff --git a/setup.py b/setup.py
@@ -0,0 +1,23 @@
+from setuptools import setup, find_packages
+
+authors = ["Pedro Probst", "Bernardo Henz"]
+
+setup(
+    name="iaraugen",
+    version="1.0.0",
+    author=", ".join(authors),
+    description="Data augmentation/generation functions used at Iara Health (speech-to-text).",
+    packages=find_packages(),
+    install_requires=[
+        "torch",
+        "deep-translator",
+        "transformers",
+        "sentencepiece",
+        "tenacity",
+        "openai",
+        "nlpaug",
+        "nltk",
+        "num2words",
+        "tqdm",
+    ],
+)
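
requirements.txt and setup.py above list the same ten packages. One possible way to avoid maintaining two lists is sketched below. It assumes requirements.txt stays a flat list of package specifiers (no `-r` includes, no URLs); `read_requirements` is a hypothetical helper, not part of this commit.

```python
from pathlib import Path
from typing import List


def read_requirements(path: str = "requirements.txt") -> List[str]:
    """Return the non-empty, non-comment lines of a flat requirements file.

    Hypothetical helper: assumes one package specifier per line.
    """
    requirements: List[str] = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Drop inline comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            requirements.append(line)
    return requirements
```

In setup.py the result could then be passed as `install_requires=read_requirements()`, provided requirements.txt is shipped alongside setup.py.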
diff --git a/txt_aug.py b/txt_aug.py
@@ -0,0 +1,236 @@
+#!/usr/bin/env python3
+import argparse
+import random
+import torch
+from typing import List
+from deep_translator import GoogleTranslator
+from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
+from tqdm import tqdm
+from util.text_augmenter import SentenceAugmenter
+from util.text import (
+    post_process_sentences,
+    print_sentences_comparison,
+    remove_equal_sentences,
+)
+from util.files import read_sentences_corpus, append_sentences_to_file
+
+"""
+example usage:
+./txt_aug.py corpus.tok --aug translate random --action delete --maxs 10 --lang en --append
+"""
+
+
+def backtranslate_sentences_api(
+    sentences: List[str], source_lang: str, target_lang: str
+) -> List[str]:
+    """
+    Backtranslates a list of sentences from the source language to the target language
+    and back to the source language using Google Translate (via deep-translator).
+
+    Args:
+        sentences (List[str]): The list of sentences to be backtranslated.
+        source_lang (str): The source language code (e.g., "pt" for Portuguese).
+        target_lang (str): The target language code (e.g., "en" for English).
+
+    Returns:
+        List[str]: A list of backtranslated sentences.
+    """
+    translator = GoogleTranslator(source=source_lang, target=target_lang)
+    translations = translator.translate_batch(sentences)
+    backtranslator = GoogleTranslator(source=target_lang, target=source_lang)
+    backtranslations = backtranslator.translate_batch(translations)
+
+    return backtranslations
+
+
+def backtranslate_sentences_local(
+    sentences: List[str], source_lang: str, target_lang: str, device: str = "cpu"
+) -> List[str]:
+    """
+    Backtranslates a list of sentences from the source language to the target language,
+    and then back to the source language using a local model.
+
+    Args:
+        sentences (List[str]): The list of sentences to be backtranslated.
+        source_lang (str): The source language code (e.g., "pt" for Portuguese).
+        target_lang (str): The target language code (e.g., "en" for English).
+        device (str): The device to run the model on (e.g., "cpu" or "cuda").
+
+    Returns:
+        List[str]: A list of backtranslated sentences.
+
+    Note:
+        nlpaug has a backtranslation module, but it only officially supports Helsinki-NLP
+        models, and we do not have one for Portuguese -> English, so we use T5 models
+        directly from HuggingFace. The prompts below are hard-coded for pt <-> en.
+    """
+    tokenizer = AutoTokenizer.from_pretrained(
+        f"unicamp-dl/translation-{source_lang}-{target_lang}-t5"
+    )
+    model = AutoModelForSeq2SeqLM.from_pretrained(
+        f"unicamp-dl/translation-{source_lang}-{target_lang}-t5"
+    )
+    model.to(torch.device(device))
+    backtokenizer = AutoTokenizer.from_pretrained(
+        f"unicamp-dl/translation-{target_lang}-{source_lang}-t5"
+    )
+    backmodel = AutoModelForSeq2SeqLM.from_pretrained(
+        f"unicamp-dl/translation-{target_lang}-{source_lang}-t5"
+    )
+    backmodel.to(torch.device(device))
+    pten_pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
+    enpt_pipeline = pipeline(
+        "text2text-generation", model=backmodel, tokenizer=backtokenizer
+    )
+
+    print(f"Backtranslating {len(sentences)} sentences...")
+    translations: List[str] = []
+    for sentence in tqdm(sentences):
+        transl = pten_pipeline("translate Portuguese to English: " + sentence)[0][
+            "generated_text"
+        ]
+        backtransl = enpt_pipeline("translate English to Portuguese: " + transl)[0][
+            "generated_text"
+        ]
+        translations.append(backtransl)
+
+    return translations
+
+
+def translation_pipeline(
+    sentences: List[str], translate_mode: str, lang: str, device: str
+) -> List[str]:
+    """
+    Runs the translation pipeline to backtranslate a list of sentences.
+
+    Args:
+        sentences (List[str]): The list of sentences to be backtranslated.
+        translate_mode (str): Translation backend: "local" (T5 model) or "google" (API).
+        lang (str): The target language code (e.g., "en" for English).
+        device (str): The device to run the model on (e.g., "cpu" or "cuda").
+
+    Returns:
+        List[str]: A list of backtranslated sentences.
+    """
+    augmented_sentences: List[str] = []
+    print(f"Backtranslating sentences pt->{lang}->pt...")
+    if translate_mode == "local":
+        augmented_sentences = backtranslate_sentences_local(
+            sentences, "pt", lang, device
+        )
+    elif translate_mode == "google":
+        augmented_sentences = backtranslate_sentences_api(sentences, "pt", lang)
+    assert augmented_sentences, f"no output for translate_mode={translate_mode!r}"
+    return augmented_sentences
+
+
+def create_augmentation_sequence(
+    augmentations: List[str], action: str, translate_mode: str, lang: str, device: str
+) -> List[callable]:
+    augmentation_sequence = []
+    for aug in augmentations:
+        if aug == "random" or aug == "synonym":
+            augmenter = SentenceAugmenter(aug, action=action)
+            augmentation_sequence.append(augmenter.augment_sentences)
+        elif aug == "translate":
+            augmentation_sequence.append(
+                lambda x: translation_pipeline(x, translate_mode, lang, device)
+            )
+    return augmentation_sequence
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(
+        description="Sentence augmentation: back-translate, random delete, random swap, synonym replacement."
+    )
+    parser.add_argument("corpus", type=str, help="Input corpus file")
+    parser.add_argument(
+        "--aug",
+        nargs="+",
+        type=str,
+        required=True,
+        choices=["random", "translate", "synonym"],
+        help="Augmentation type(s) to perform, applied in the order given",
+    )
+    parser.add_argument(
+        "--action",
+        type=str,
+        choices=["delete", "swap"],
+        default="delete",
+        help="Action for the 'random' augmentation (default: delete)",
+    )
+    parser.add_argument(
+        "--maxs",
+        type=str,
+        default="10",
+        help="Maximum number of sentences to process. Can be a percentage of the total, e.g., 5%% (default: 10)",
+    )
+    parser.add_argument(
+        "--seed",
+        type=int,
+        default=451,
+        help="Random seed (default: 451)",
+    )
+    parser.add_argument(
+        "--lang",
+        type=str,
+        default="en",
+        help="Target language for translation (default: en)",
+    )
+    parser.add_argument(
+        "--translate_mode",
+        type=str,
+        choices=["google", "local", "openai"],
+        default="local",
+        help="Translation backend to use (default: local)",
+    )
+    parser.add_argument(
+        "--device",
+        type=str,
+        default="cpu",
+        help="Process on CPU or CUDA (default: cpu)",
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default=None,
+        help="Output file to append the augmented sentences to (independent of --append)",
+    )
+    parser.add_argument("--append", action="store_true", help="Append to corpus file")
+    args = parser.parse_args()
+
+    random.seed(args.seed)
+
+    sentences = read_sentences_corpus(args.corpus, max_sentences=args.maxs)
+    print(f"Read {len(sentences)} sentences from {args.corpus}")
+
+    augmentation_sequence = create_augmentation_sequence(
+        args.aug, args.action, args.translate_mode, args.lang, args.device
+    )
+
+    augmented_sentences = sentences
+    for i, aug_fn in enumerate(augmentation_sequence):
+        print(f"Augmentation step {i + 1} of {len(augmentation_sequence)}:")
+        augmented_sentences = aug_fn(augmented_sentences)
+
+    augmented_sentences = post_process_sentences(augmented_sentences)
+    sentences = post_process_sentences(sentences)
+    print_sentences_comparison(sentences, augmented_sentences)
+
+    print("Removing equal sentences...")
+    augmented_sentences = remove_equal_sentences(sentences, augmented_sentences)
+
+    print("\nFinal results:")
+    print("-------------------")
+    for sentence in augmented_sentences:
+        print(sentence)
+    print(f"\nTotal: {len(augmented_sentences)} sentences")
+    print("-------------------\n")
+
+    if args.append:
+        print(f"Appending augmented sentences to {args.corpus}...")
+        append_sentences_to_file(args.corpus, augmented_sentences)
+
+    if args.output:
+        print(f"Appending augmented sentences to {args.output}...")
+        append_sentences_to_file(args.output, augmented_sentences)
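
The main block of txt_aug.py threads the sentence list through each augmentation step in order, then discards outputs that are identical to their inputs. Below is a minimal, dependency-free sketch of that flow. `apply_steps` and `drop_unchanged` are hypothetical stand-ins: the real `remove_equal_sentences` lives in util/text.py, which is not shown in this excerpt, and may behave differently. The two toy steps stand in for the real augmenters.

```python
from typing import Callable, List

# An augmentation step maps a list of sentences to a list of sentences.
Step = Callable[[List[str]], List[str]]


def apply_steps(sentences: List[str], steps: List[Step]) -> List[str]:
    """Feed the output of each step into the next one."""
    result = sentences
    for step in steps:
        result = step(result)
    return result


def drop_unchanged(original: List[str], augmented: List[str]) -> List[str]:
    """Keep augmented sentences that differ from the sentence at the same index.

    Assumption: every step returns one output per input, so the lists align.
    """
    return [aug for orig, aug in zip(original, augmented) if aug != orig]


def delete_last_word(sentences: List[str]) -> List[str]:
    """Toy step: drop the final word of sentences with more than one word."""
    return [" ".join(s.split()[:-1]) if len(s.split()) > 1 else s for s in sentences]


def lowercase(sentences: List[str]) -> List[str]:
    """Toy step: lowercase every sentence."""
    return [s.lower() for s in sentences]


if __name__ == "__main__":
    corpus = ["exame normal", "laudo"]
    augmented = apply_steps(corpus, [delete_last_word, lowercase])
    print(drop_unchanged(corpus, augmented))  # prints ['exame']
```

Storing each step as a plain callable keeps the loop independent of how a step is implemented, whether it is a bound method of an augmenter object or a function wrapping a translation backend.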