Currently, FRNN reports the following metrics related to computational speed/efficiency during training, per step and per epoch:
- Examples/sec
- sec/batch
- % of batch time spent in calculation vs. synchronization
- overall batch size = batch size per GPU x N_GPU
As we become more cognizant of our performance expectations for the code on various architectures, I think it would be valuable to make these metrics more informative:
- At the end of each epoch: summarize the min/max, mean, and std-dev of examples/sec, sec/batch, % calc, and % sync across all steps within that epoch
- At the end of all epochs: the same statistics across epochs?
- Add greater granularity of timing information within an epoch?
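The per-epoch summary could be computed from the per-step measurements along these lines (a minimal sketch; the function names and the sample numbers are illustrative, not part of the FRNN code):

```python
import statistics

def summarize(values):
    """Return min/max/mean/std-dev of a list of per-step measurements."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values) if len(values) > 1 else 0.0,
    }

def epoch_report(examples_per_sec, sec_per_batch, pct_calc, pct_sync):
    """Summarize all four per-step metrics at the end of an epoch."""
    return {
        "examples/sec": summarize(examples_per_sec),
        "sec/batch": summarize(sec_per_batch),
        "% calc": summarize(pct_calc),
        "% sync": summarize(pct_sync),
    }

# Illustrative per-step numbers for a 4-step epoch:
report = epoch_report(
    examples_per_sec=[900.0, 1010.0, 995.0, 1005.0],
    sec_per_batch=[0.071, 0.063, 0.064, 0.0635],
    pct_calc=[88.0, 91.0, 90.5, 90.8],
    pct_sync=[12.0, 9.0, 9.5, 9.2],
)
```

The same `summarize()` could then be reused at the end of training, applied to the per-epoch means rather than the per-step values.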
For the final performance metrics (at the end of all epochs), we should probably exclude the first epoch or so, due to TensorFlow's invocation of the cuDNN autotuner on the first call to `tf.Session.run()` when the undocumented environment variable `TF_CUDNN_USE_AUTOTUNE=1` is set. See https://github.com/tensorflow/tensorflow/blob/fddd829a0795a98b1bdac63c5acaed2c3d8122ff/tensorflow/core/util/use_cudnn.cc#L36 and https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why for an explanation.

Although, I wonder if the initial run that is thrown away in order to force compilation already accomplishes this?
plasma-python/plasma/models/mpi_runner.py, lines 549 to 564 in c82ba61 (since it calls Keras `train_on_batch()`)
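Either way, excluding the warm-up from the end-of-training statistics could be as simple as dropping the first epoch(s) before aggregating (a sketch; `n_warmup` is a hypothetical knob, not an existing FRNN option):

```python
import statistics

def final_summary(per_epoch_examples_per_sec, n_warmup=1):
    """Aggregate per-epoch mean examples/sec across a training run,
    skipping the first n_warmup epochs so the one-time cuDNN
    autotuner overhead does not skew the final numbers."""
    kept = per_epoch_examples_per_sec[n_warmup:]
    return {
        "mean": statistics.mean(kept),
        "std": statistics.stdev(kept) if len(kept) > 1 else 0.0,
        "n_epochs_used": len(kept),
    }

# Illustrative numbers: the first epoch is slow while the autotuner
# benchmarks cuDNN algorithm choices, later epochs are steady-state.
summary = final_summary([650.0, 1000.0, 1010.0, 990.0])
```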
Note: I have tested the effects of the `TF_CUDNN_USE_AUTOTUNE` variable on https://github.com/tensorflow/benchmarks, specifically on Traverse V100s and TigerGPU P100s, and disabling the autotuner leads to a loss of about 10% performance:
P100: (collapsed benchmark output)
V100: (collapsed benchmark output)
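For anyone reproducing this comparison: the variable has to be present in the environment before TensorFlow creates its first session, e.g. at the very top of the launch script, before `import tensorflow` (a minimal sketch):

```python
import os

# "0" disables the cuDNN autotuner; "1" (the default) enables it.
# This must be set before TensorFlow initializes its first session,
# so it belongs before `import tensorflow` in the entry script
# (or exported in the shell before launching the job).
os.environ["TF_CUDNN_USE_AUTOTUNE"] = "0"
```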