Add support for parallel data curation #193
base: main
Commits on Aug 8, 2024
- c7a6423: add data interface to read simple bitext
  Signed-off-by: Shuoyang Ding <[email protected]>
- 4b3dc97
  Signed-off-by: Shuoyang Ding <[email protected]>
- 114716e: add test for ParallelScoreFilter, small style change for ParallelDataset test, fix a few data and import bugs
  Signed-off-by: Shuoyang Ding <[email protected]>
- cbab143: allow ParallelScoreFilter to take different filters for source and target
  Signed-off-by: Shuoyang Ding <[email protected]>
- 82f5486: add JointScoreFilter and LengthRatioFilter
  Signed-off-by: Shuoyang Ding <[email protected]>
- f9a0535: [WIP] add heuristic filter w/o test
  Signed-off-by: Shuoyang Ding <[email protected]>
- 8f25988
- 612249c: add test for histogram filter, fix a few bugs
  Signed-off-by: Shuoyang Ding <[email protected]>
- 2fe4973: length ratio, joint score filter testing
  Signed-off-by: Shuoyang Ding <[email protected]>
- b61d7f1
  Signed-off-by: Shuoyang Ding <[email protected]>
- f63a1f9: add a fake comet qe filter as an initial step
  Signed-off-by: Shuoyang Ding <[email protected]>
- 76bced7: [WIP] adding bitext cleaning tutorial
  Signed-off-by: Shuoyang Ding <[email protected]>
- 1a2bb1e
  Signed-off-by: Shuoyang Ding <[email protected]>
- 74698d5: fix slow histogram filter, fix faulty bitext loading
  Signed-off-by: Shuoyang Ding <[email protected]>
- bf2e6ac
  Signed-off-by: Shuoyang Ding <[email protected]>
- 62d1242: [WIP] documentation of bitext tutorial
  Signed-off-by: Shuoyang Ding <[email protected]>
- c413ea2: add tested version of comet-qe filter
  Signed-off-by: Shuoyang Ding <[email protected]>
- 5a90038: fix ParallelDataset bug where single file name is not accepted, and dataset is sometimes turned into its parent class by mistake, add write to simple bitext functionality, update bitext tutorial
  Signed-off-by: Shuoyang Ding <[email protected]>
- f8046dd: add docstring to explain simple bitext format, fix a bug where file extensions are removed twice before writing
  Signed-off-by: Shuoyang Ding <[email protected]>
- 6c7aea4
  Signed-off-by: Shuoyang Ding <[email protected]>
- a457995
  Signed-off-by: Shuoyang Ding <[email protected]>
- c5a6f1c: refactor COMET QE filter to decouple model from filter, make sure JointScoreFilter can take more than one field for source and target
  Signed-off-by: Shuoyang Ding <[email protected]>
- 61713e4
  Signed-off-by: Shuoyang Ding <[email protected]>
- a4d2bb3: wrap_qe_input should be a static method
  Signed-off-by: Shuoyang Ding <[email protected]>
- 0674400: use conditional import for comet, formatting changes
  Signed-off-by: Shuoyang Ding <[email protected]>
- 6936f9a
- da96d29: [WIP] attempt to resolve device conflict but is failing
  Signed-off-by: Shuoyang Ding <[email protected]>
- 14b7d70: [WIP] playing with cometoid arguments
  Signed-off-by: Shuoyang Ding <[email protected]>
- b02b56d: [WIP] -d 0 doesn't look necessary
  Signed-off-by: Shuoyang Ding <[email protected]>
- 6c1e719
  Signed-off-by: Shuoyang Ding <[email protected]>
- 70a7fe8: use proper safe import, make sure test doesn't crash sans comet/pymarian
  Signed-off-by: Shuoyang Ding <[email protected]>
- c66d7f9: falling back to comet for tutorial since that's easier to set up, update README
  Signed-off-by: Shuoyang Ding <[email protected]>
- 861bd4d: give credit to original fairseq implementation of histogram filtering, run black formatter
  Signed-off-by: Shuoyang Ding <[email protected]>
- 52ba08e
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Aug 11, 2024
- 62c254b
Commits on Aug 13, 2024
- 91ea9fa: fix another occurrence of the same bug
  Signed-off-by: Shuoyang Ding <[email protected]>
- 12783ec: introduce shard limit to a single PyMarian API call to avoid memory leakage
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Aug 16, 2024
- a65588a: repartition after reading simple bitext data
  Signed-off-by: Shuoyang Ding <[email protected]>
- 3f1d09b: -d 0 is actually needed for pymarian
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Sep 5, 2024
- 102429a: remove duplicate LengthRatioFilter definition
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Sep 20, 2024
- 8a367dd: refactor repeated code segment in file writing, change classifier to accommodate custom field names, pause doc repartition since it causes problems
  Signed-off-by: Shuoyang Ding <[email protected]>
- 396d7ba: [WIP] addressed comments in NVIDIA#193 apart from resolving .iloc pattern, test currently failing
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Oct 1, 2024
- eb4f4df: refactor to resolve .loc pattern, test passing
  Signed-off-by: Shuoyang Ding <[email protected]>
- 3addf44
- a14a78a
  Signed-off-by: Shuoyang Ding <[email protected]>
- 6b8dfa0: fix a small bug in parallel dataset, explain why repartition is disabled, fix tutorial
  Signed-off-by: Shuoyang Ding <[email protected]>
- bb4f148: add api guide, small change on bitext/parallel score filter docstring
  Signed-off-by: Shuoyang Ding <[email protected]>
- d309744: fix read_simple_bitext test issues
  Signed-off-by: Shuoyang Ding <[email protected]>
Commits on Oct 2, 2024
- 21676bd
  Signed-off-by: Shuoyang Ding <[email protected]>
- 7797925: reinstate dependencies lost during merging
  Signed-off-by: Shuoyang Ding <[email protected]>