
Add distribution_strategy and all_reduce_alg flags to TensorFlow BERT pretraining #745

Open
wants to merge 2 commits into master
Conversation

@rapsealk (Member) commented Jun 5, 2024

Hello mlcommons team!

I've noticed that the BERT pretraining reference is missing some utilities, such as the additional flags that the ResNet50 image classification reference already provides. This pull request adds them to make it easier to run BERT training in a distributed environment.

References are below:

```python
if distribution_strategy:
    flags.DEFINE_string(
        name="distribution_strategy", short_name="ds", default="mirrored",
        help=help_wrap("The Distribution Strategy to use for training. "
                       "Accepted values are 'off', 'one_device', "
                       "'mirrored', 'parameter_server', 'collective', "
                       "case insensitive. 'off' means not to use "
                       "Distribution Strategy; 'default' means to choose "
                       "from `MirroredStrategy` or `OneDeviceStrategy` "
                       "according to the number of GPUs."))
```
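For context, here is a minimal sketch of how a flag value like this could be turned into a `tf.distribute` strategy. The helper name `get_distribution_strategy` follows the TensorFlow Models utilities this flag comes from, but the body below is an illustrative assumption, not the exact upstream implementation:

```python
import tensorflow as tf

def get_distribution_strategy(distribution_strategy="mirrored", num_gpus=0):
    """Illustrative sketch: map the flag value onto a tf.distribute strategy."""
    distribution_strategy = distribution_strategy.lower()
    if distribution_strategy == "off":
        return None  # Run without any Distribution Strategy.
    if distribution_strategy == "one_device" or (
            distribution_strategy == "default" and num_gpus <= 1):
        device = "/device:GPU:0" if num_gpus == 1 else "/device:CPU:0"
        return tf.distribute.OneDeviceStrategy(device)
    if distribution_strategy in ("mirrored", "default"):
        # Synchronous replication across the local GPUs.
        return tf.distribute.MirroredStrategy()
    if distribution_strategy == "collective":
        # Multi-worker synchronous training over collective ops.
        return tf.distribute.experimental.MultiWorkerMirroredStrategy()
    if distribution_strategy == "parameter_server":
        # ParameterServerStrategy needs a cluster resolver; omitted in this sketch.
        raise NotImplementedError("parameter_server not covered here")
    raise ValueError(
        "Unrecognized distribution_strategy: %r" % distribution_strategy)
```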

```python
if all_reduce_alg:
    flags.DEFINE_string(
        name="all_reduce_alg", short_name="ara", default=None,
        help=help_wrap("Defines the algorithm to use for performing all-reduce. "
                       "When specified with MirroredStrategy for single "
                       "worker, this controls "
                       "tf.contrib.distribute.AllReduceCrossTowerOps. When "
                       "specified with MultiWorkerMirroredStrategy, this "
                       "controls "
                       "tf.distribute.experimental.CollectiveCommunication; "
                       "valid options are `ring` and `nccl`."))
```

Refs

@rapsealk requested a review from a team as a code owner June 5, 2024 00:51

github-actions bot commented Jun 5, 2024

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@rapsealk marked this pull request as draft June 5, 2024 03:35
@rapsealk marked this pull request as ready for review June 5, 2024 04:21