Add support for parallel data curation #193

Open
wants to merge 50 commits into base: main

Commits on Aug 8, 2024

  1. add data interface to read simple bitext

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    c7a6423
  2. adding ParallelScoreFilter

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    4b3dc97
  3. add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    114716e
  4. allow ParallelScoreFilter to take different filters for source and target
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    cbab143
  5. add JointScoreFilter and LengthRatioFilter

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    82f5486
  6. [WIP] add heuristic filter w/o test

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    f9a0535
  7. merge with main

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    8f25988
  8. add test for histogram filter, fix a few bugs

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    612249c
  9. length ratio, joint score filter testing

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    2fe4973
  10. fix typing in joint test

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    b61d7f1
  11. add a fake comet qe filter as an initial step

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    f63a1f9
  12. [WIP] adding bitext cleaning tutorial

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    76bced7
  13. [WIP] fixing example

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    1a2bb1e
  14. fix slow histogram filter, fix faulty bitext loading

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    74698d5
  15. tutorial running

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    bf2e6ac
  16. [WIP] documentation of bitext tutorial

    Signed-off-by: Shuoyang Ding <[email protected]>
    nverma1 authored and shuoyangd committed Aug 8, 2024
    62d1242
  17. add tested version of comet-qe filter

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    c413ea2
  18. fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    5a90038
  19. add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    f8046dd
  20. remove print line for debug

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    6c7aea4
  21. add comet filter to tutorial

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    a457995
  22. refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    c5a6f1c
  23. use refactored qe filter

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    61713e4
  24. wrap_qe_input should be a static method

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    a4d2bb3
  25. use conditional import for comet, formatting changes

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    0674400
  26. [WIP] add cometoid

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    6936f9a
  27. [WIP] attempt to resolve device conflict but is failing

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    da96d29
  28. [WIP] playing with cometoid arguments

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    14b7d70
  29. [WIP] -d 0 doesn't look necessary

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    b02b56d
  30. tested arguments for Cometoid

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    6c1e719
  31. 70a7fe8
  32. falling back to comet for tutorial since that's easier to set up, update README
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    c66d7f9
  33. give credit to original fairseq implementation of histogram filtering, run black formatter
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    861bd4d
  34. fix pre-commit complaint

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 8, 2024
    52ba08e

Commits on Aug 11, 2024

  1. fix small bug

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 11, 2024
    62c254b

Commits on Aug 13, 2024

  1. fix another occurrence of the same bug

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 13, 2024
    91ea9fa
  2. introduce shard limit to a single PyMarian API call to avoid memory leakage
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 13, 2024
    12783ec

Commits on Aug 16, 2024

  1. repartition after reading simple bitext data

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 16, 2024
    a65588a
  2. -d 0 is actually needed for pymarian

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Aug 16, 2024
    3f1d09b

Commits on Sep 5, 2024

  1. remove duplicate LengthRatioFilter definition

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Sep 5, 2024
    102429a

Commits on Sep 20, 2024

  1. refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Sep 20, 2024
    8a367dd
  2. [WIP] addressed comments in NVIDIA#193 apart from resolving .iloc pattern, test currently failing
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Sep 20, 2024
    396d7ba

Commits on Oct 1, 2024

  1. refactor to resolve .loc pattern, test passing

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 1, 2024
    eb4f4df
  2. add missing file

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 1, 2024
    3addf44
  3. revert changes in setup.py

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 1, 2024
    a14a78a
  4. fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial
    
    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 1, 2024
    6b8dfa0
  5. bb4f148
  6. fix read_simple_bitext test issues

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 1, 2024
    d309744

Commits on Oct 2, 2024

  1. Merge branch 'main' into main

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 2, 2024
    21676bd
  2. reinstate dependencies lost during merging

    Signed-off-by: Shuoyang Ding <[email protected]>
    shuoyangd committed Oct 2, 2024
    7797925