TPS-free 2D bucket estimation and filtering #11738

Open
wants to merge 19 commits into
base: main

Collaborator

@pzelasko pzelasko commented Jan 2, 2025

What does this PR do?

Improves the UX of estimating 2D buckets and preparing the training configuration. Key changes:

  • 2D bucket estimation now finds a suitable maximum tokens-per-second (max-TPS) setting for each bucket separately instead of assuming a single global max-TPS constant. The per-bucket max-TPS is determined as 4*stddev(bucket_tps).
  • Training with 2D bucketing works without setting the max_tps filter.
  • Data that doesn't fit the 2D buckets (in either the 1st or the 2nd dimension) is filtered out automatically during sampling.
    • strict mode (default): first select the duration bucket, then try to place the example in one of its max_tokens sub-buckets. If the example doesn't fit, discard it.
    • lenient mode (model.train_ds.bucketing_2d_strict_mode=False): find any bucket that fits a given example. This means tokens-per-second outliers are pushed to buckets with higher durations, which increases padding but reduces the amount of discarded data. Use at your own risk: in the setups I have tested so far, it may cause training instability (likely due to the inclusion of lower-quality data).
  • Fixes some issues with the estimate_duration_bins_2d.py script.
  • For data with complex TPS distributions, max-TPS can be defined as a list with the same length as the list of buckets for fine-grained control. A best-guess setting is suggested by estimate_duration_bins_2d.py.
  • Renamed tarred_random_access to skip_missing_manifest_entries and adjusted the logic to reduce CPU memory usage and speed up loading.
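
As a rough illustration of the per-bucket max-TPS rule in the first bullet, the sketch below groups examples by duration bucket and applies 4*stddev(bucket_tps) literally. It is not the code from estimate_duration_bins_2d.py: the function name, the (duration, num_tokens) input format, the choice of population standard deviation, and the handling of empty or oversized inputs are all assumptions made for the example.

```python
from bisect import bisect_left
from statistics import pstdev


def estimate_max_tps_per_bucket(examples, duration_bins, num_stddev=4.0):
    """Hypothetical helper: return one max-TPS threshold per duration bucket.

    examples:      iterable of (duration_seconds, num_tokens) pairs
    duration_bins: sorted upper bounds (in seconds) of the duration buckets
    """
    tps_per_bucket = [[] for _ in duration_bins]
    for duration, num_tokens in examples:
        # Smallest bucket whose upper bound is >= the example's duration.
        idx = bisect_left(duration_bins, duration)
        if idx == len(duration_bins):
            continue  # assumption: examples longer than the last bin are skipped
        tps_per_bucket[idx].append(num_tokens / duration)
    # Literal reading of the PR description: threshold = 4 * stddev(bucket_tps).
    # Assumption: an empty bucket gets a threshold of 0.0.
    return [num_stddev * pstdev(tps) if tps else 0.0 for tps in tps_per_bucket]


thresholds = estimate_max_tps_per_bucket(
    examples=[(1.0, 2), (1.0, 4), (3.0, 9)],
    duration_bins=[2.0, 4.0],
)
```

The returned list has one entry per bucket, which matches the shape of the list-valued max-TPS setting mentioned in the bullets above.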

Collection: ASR, TTS, SpeechLLM

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 
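
A toy sketch of how the strict and lenient modes described in the PR summary could treat the same example differently. The bucket layout and the pick_bucket helper are hypothetical and exist only for this illustration; the actual selection logic lives in NeMo's sampling code and may differ in detail.

```python
from bisect import bisect_left

# Hypothetical 2D buckets as (max_duration_seconds, max_num_tokens),
# sorted by duration first and token limit second.
BUCKETS = [(5.0, 10), (5.0, 20), (10.0, 20), (10.0, 40)]


def pick_bucket(duration, num_tokens, buckets=BUCKETS, strict=True):
    """Return the index of the chosen bucket, or None if the example is discarded."""
    if strict:
        # Step 1: choose the duration bucket (smallest max_duration >= duration).
        durations = sorted({max_dur for max_dur, _ in buckets})
        i = bisect_left(durations, duration)
        if i == len(durations):
            return None  # longer than every bucket
        chosen = durations[i]
        # Step 2: only the token sub-buckets of that duration bucket are eligible.
        for idx, (max_dur, max_tok) in enumerate(buckets):
            if max_dur == chosen and num_tokens <= max_tok:
                return idx
        return None  # tokens-per-second outlier: discard
    # Lenient: the first bucket that fits in both dimensions, even if it
    # belongs to a longer duration bucket (more padding, less discarded data).
    for idx, (max_dur, max_tok) in enumerate(buckets):
        if duration <= max_dur and num_tokens <= max_tok:
            return idx
    return None


strict_choice = pick_bucket(4.0, 30, strict=True)
lenient_choice = pick_bucket(4.0, 30, strict=False)
```

Here a 4-second example with 30 tokens is dropped in strict mode but lands in the (10.0, 40) bucket in lenient mode. In a training config, the switch between the two behaviors is the model.train_ds.bucketing_2d_strict_mode option.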

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the common label Jan 2, 2025
Signed-off-by: Piotr Żelasko <[email protected]>
@@ -15,9 +15,11 @@
import bisect
import logging
import math
from bisect import bisect_left, bisect_right

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'bisect_right' is not used.
Signed-off-by: Piotr Żelasko <[email protected]>
from lhotse.testing.dummies import DummyManifest, dummy_cut
from nemo.collections.common.data.lhotse.sampling import FixedBucketBatchSizeConstraint2D
from lhotse.testing.dummies import dummy_cut
from lhotse.testing.random import deterministic_rng

Check notice

Code scanning / CodeQL

Unused import Note test

Import of 'deterministic_rng' is not used.
filter_2d = BucketingFilter(constraint)
cut = make_cut(duration=2.0, num_tokens=20)
assert filter_2d(cut) == False
assert constraint.select_bucket(constraint.max_seq_len_buckets, cut) == None

Check notice

Code scanning / CodeQL

Testing equality to None Note test

Testing for None should use the 'is' operator.
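
The reason behind this notice can be shown with a small, self-contained example: == dispatches to __eq__, which any class may override, whereas is compares object identity. The AlwaysEqual class below is made up for the illustration.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


obj = AlwaysEqual()

# Equality check: goes through AlwaysEqual.__eq__, so it reports True
# even though obj is clearly not None.
equals_none = (obj == None)  # noqa: E711  (deliberately the flagged pattern)

# Identity check: cannot be overridden, so it gives the expected answer.
is_none = obj is None
```

Applied to the flagged test, the second assertion would read `assert constraint.select_bucket(constraint.max_seq_len_buckets, cut) is None`.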
@pzelasko pzelasko requested a review from nithinraok January 9, 2025 18:04
@pzelasko pzelasko marked this pull request as ready for review January 9, 2025 18:04
Signed-off-by: Piotr Żelasko <[email protected]>
Collaborator

@nithinraok nithinraok left a comment

LGTM

@@ -698,28 +709,32 @@ def make_structured_with_schema_warnings(config: DictConfig | dict) -> DictConfi
    if not isinstance(config, DictConfig):
        config = DictConfig(config)

    if config.get("tarred_random_access", False):
        logging.warning(
            "Option 'tarred_random_access' is deprecated and replaced with 'skip_missing_manifest_entries'.",
Collaborator

Maybe also add the version in which this will be removed.
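
One way the suggestion could look, sketched on a plain dict rather than a DictConfig. The helper name, the "X.Y" removal version placeholder, and the decision to carry the old value over to the new key are assumptions for illustration, not what the PR implements.

```python
import warnings


def migrate_deprecated_option(config: dict, removal_version: str = "X.Y") -> dict:
    """Hypothetical helper: map the deprecated option onto its replacement."""
    config = dict(config)  # work on a copy; leave the caller's dict untouched
    if "tarred_random_access" in config:
        old_value = config.pop("tarred_random_access")
        warnings.warn(
            "Option 'tarred_random_access' is deprecated and replaced with "
            "'skip_missing_manifest_entries'; it is scheduled for removal in "
            f"version {removal_version}.",  # placeholder version, not a real one
            DeprecationWarning,
            stacklevel=2,
        )
        # Assumption: an explicitly set new option takes precedence.
        config.setdefault("skip_missing_manifest_entries", old_value)
    return config
```

Naming the removal version in the message gives users a concrete deadline for updating their configs.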

nithinraok previously approved these changes Jan 10, 2025
pzelasko and others added 2 commits January 12, 2025 17:22
Signed-off-by: Piotr Żelasko <[email protected]>
nithinraok previously approved these changes Jan 14, 2025
Signed-off-by: Piotr Żelasko <[email protected]>
@pzelasko pzelasko requested a review from nithinraok January 17, 2025 01:24
@github-actions github-actions bot added core Changes to NeMo Core NLP CI Multi Modal labels Jan 17, 2025
@pzelasko pzelasko force-pushed the 2d-bucketing-and-tps-improvements-2 branch from 49a1536 to a238da2 Compare January 17, 2025 01:27
@github-actions github-actions bot removed core Changes to NeMo Core NLP CI Multi Modal labels Jan 17, 2025
@pzelasko pzelasko enabled auto-merge (squash) January 17, 2025 01:28
Signed-off-by: Piotr Żelasko <[email protected]>
Contributor

beep boop 🤖: 🚨 The following files must be fixed before merge!


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.common.data.lhotse.sampling
nemo/collections/common/data/lhotse/sampling.py:15:0: W0611: Unused import bisect (unused-import)
nemo/collections/common/data/lhotse/sampling.py:16:0: W0611: Unused import logging (unused-import)
nemo/collections/common/data/lhotse/sampling.py:18:0: W0611: Unused bisect_right imported from bisect (unused-import)

-----------------------------------
Your code has been rated at 9.87/10

Mitigation guide:

  • Add sensible and useful docstrings to functions and methods
  • For trivial methods like getter/setters, consider adding # pylint: disable=C0116 inside the function itself
  • To disable multiple functions/methods at once, put a # pylint: disable=C0116 before the first and a # pylint: enable=C0116 after the last.
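
A minimal sketch of the two suppression styles named in the guide, following its wording. The Bucket class and its methods are hypothetical and exist only to show where the comments go.

```python
class Bucket:
    """A duration bucket with a token limit (hypothetical example class)."""

    def __init__(self, max_duration: float, max_tokens: int) -> None:
        self._max_duration = max_duration
        self._max_tokens = max_tokens

    # Style 1: the disable comment sits inside the trivial method itself.
    def max_duration(self) -> float:
        # pylint: disable=C0116
        return self._max_duration

    # Style 2: a disable/enable pair wraps several methods at once.
    # pylint: disable=C0116
    def max_tokens(self) -> int:
        return self._max_tokens

    def capacity(self) -> float:
        return self._max_duration * self._max_tokens

    # pylint: enable=C0116


bucket = Bucket(max_duration=5.0, max_tokens=10)
```

The comments only affect what PyLint reports; the methods behave the same with or without them.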

By applying these rules, we reduce the occurrence of this message in the future.

Thank you for improving NeMo's documentation!

Contributor

beep boop 🤖: 🙏 The following files have warnings. In case you are familiar with these, please try helping us to improve the code base.


Your code was analyzed with PyLint. The following annotations have been identified:

************* Module nemo.collections.asr.data.audio_to_text_lhotse_prompted
nemo/collections/asr/data/audio_to_text_lhotse_prompted.py:62:0: C0301: Line too long (133/119) (line-too-long)
nemo/collections/asr/data/audio_to_text_lhotse_prompted.py:29:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/asr/data/audio_to_text_lhotse_prompted.py:114:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/asr/data/audio_to_text_lhotse_prompted.py:15:0: W0611: Unused Callable imported from typing (unused-import)
nemo/collections/asr/data/audio_to_text_lhotse_prompted.py:24:0: W0611: Unused CanaryPromptFormatter imported from nemo.collections.common.prompts (unused-import)
************* Module nemo.collections.asr.parts.mixins.mixins
nemo/collections/asr/parts/mixins/mixins.py:614:0: C0301: Line too long (127/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:617:0: C0301: Line too long (200/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:618:0: C0301: Line too long (129/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:623:0: C0301: Line too long (124/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:624:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:629:0: C0301: Line too long (127/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:639:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:676:0: C0301: Line too long (134/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:748:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/asr/parts/mixins/mixins.py:870:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/asr/parts/mixins/mixins.py:886:0: C0115: Missing class docstring (missing-class-docstring)
************* Module nemo.collections.common.data.lhotse.cutset
nemo/collections/common/data/lhotse/cutset.py:626:0: C0301: Line too long (139/119) (line-too-long)
nemo/collections/common/data/lhotse/cutset.py:64:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/common/data/lhotse/cutset.py:203:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:224:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:239:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:258:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:273:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:289:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:296:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:342:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:439:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/cutset.py:488:0: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.collections.common.data.lhotse.dataloader
nemo/collections/common/data/lhotse/dataloader.py:144:0: C0301: Line too long (123/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:160:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:168:0: C0301: Line too long (148/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:340:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:350:0: C0301: Line too long (206/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:387:0: C0301: Line too long (178/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:469:0: C0301: Line too long (121/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:470:0: C0301: Line too long (126/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:471:0: C0301: Line too long (123/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:733:0: C0301: Line too long (122/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:808:0: C0301: Line too long (213/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:815:0: C0301: Line too long (143/119) (line-too-long)
nemo/collections/common/data/lhotse/dataloader.py:203:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/dataloader.py:444:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/dataloader.py:740:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/dataloader.py:752:0: C0116: Missing function or method docstring (missing-function-docstring)
************* Module nemo.collections.common.data.lhotse.nemo_adapters
nemo/collections/common/data/lhotse/nemo_adapters.py:48:0: C0301: Line too long (124/119) (line-too-long)
nemo/collections/common/data/lhotse/nemo_adapters.py:281:0: C0301: Line too long (125/119) (line-too-long)
nemo/collections/common/data/lhotse/nemo_adapters.py:282:0: C0301: Line too long (133/119) (line-too-long)
nemo/collections/common/data/lhotse/nemo_adapters.py:351:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:480:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:484:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:493:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:498:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:502:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:506:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:528:0: C0115: Missing class docstring (missing-class-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:550:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:572:0: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/data/lhotse/nemo_adapters.py:23:0: W0611: Unused import lhotse.serialization (unused-import)
************* Module nemo.collections.common.tokenizers.canary_tokenizer
nemo/collections/common/tokenizers/canary_tokenizer.py:133:0: C0301: Line too long (120/119) (line-too-long)
nemo/collections/common/tokenizers/canary_tokenizer.py:56:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:60:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:64:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:68:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:158:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:164:4: C0116: Missing function or method docstring (missing-function-docstring)
nemo/collections/common/tokenizers/canary_tokenizer.py:209:4: C0116: Missing function or method docstring (missing-function-docstring)
************* Module scripts.speech_recognition.estimate_duration_bins_2d
scripts/speech_recognition/estimate_duration_bins_2d.py:61:0: C0301: Line too long (162/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:68:0: C0301: Line too long (227/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:128:0: C0301: Line too long (137/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:137:0: C0301: Line too long (141/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:215:0: C0301: Line too long (234/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:263:0: C0301: Line too long (161/119) (line-too-long)
scripts/speech_recognition/estimate_duration_bins_2d.py:42:0: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:142:0: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:250:0: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:257:0: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:272:0: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:283:0: C0115: Missing class docstring (missing-class-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:297:4: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/estimate_duration_bins_2d.py:302:0: C0116: Missing function or method docstring (missing-function-docstring)
************* Module scripts.speech_recognition.oomptimizer
scripts/speech_recognition/oomptimizer.py:253:0: C0301: Line too long (125/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:264:0: C0301: Line too long (121/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:273:0: C0301: Line too long (178/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:281:0: C0301: Line too long (135/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:316:0: C0301: Line too long (126/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:468:0: C0301: Line too long (183/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:490:0: C0301: Line too long (185/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:507:0: C0301: Line too long (139/119) (line-too-long)
scripts/speech_recognition/oomptimizer.py:226:4: C0116: Missing function or method docstring (missing-function-docstring)
scripts/speech_recognition/oomptimizer.py:20:0: W0611: Unused Iterable imported from typing (unused-import)
scripts/speech_recognition/oomptimizer.py:20:0: W0611: Unused Literal imported from typing (unused-import)

-----------------------------------
Your code has been rated at 9.47/10


2 participants