
Hands on for Data Parallel Deep Learning on ThetaGPU

  1. Modify tensorflow2_mnist_orig.py and instrument the code with Horovod; a minimal sketch of the required changes is shown after this list. [This can be done on the login node!]

  2. Request an interactive session on ThetaGPU:

# Login to theta
ssh -CY <username>@theta.alcf.anl.gov
# Login to ThetaGPU login node
ssh -CY thetagpusn1 
# Requesting 1 node  
qsub -n 1 -q training -A ALCFAITP -I -t 15 --attrs=pubnet

You can also log in from thetagpusn2; either ThetaGPU service node can be used to submit jobs.
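For step 1, the sketch below shows the standard Horovod instrumentation for a TensorFlow 2 training loop. The model and input pipeline here are generic placeholders (the real ones live in tensorflow2_mnist_orig.py); only the numbered comments mark the Horovod-specific additions: initialize Horovod, pin one GPU per rank, scale the learning rate, wrap the gradient tape, broadcast the initial state from rank 0, and restrict printing/checkpointing to rank 0. Treat it as a sketch, not a drop-in replacement for the provided script.

    # Sketch: Horovod data-parallel instrumentation for TensorFlow 2.
    # Model and dataset are placeholders; keep whatever tensorflow2_mnist_orig.py defines.
    import tensorflow as tf
    import horovod.tensorflow as hvd

    # 1. Initialize Horovod.
    hvd.init()

    # 2. Pin each process to a single GPU (one MPI rank per GPU).
    gpus = tf.config.experimental.list_physical_devices('GPU')
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    # Placeholder dataset and model (use the ones from the original script).
    (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
    dataset = tf.data.Dataset.from_tensor_slices(
        (tf.cast(x_train[..., tf.newaxis] / 255.0, tf.float32),
         tf.cast(y_train, tf.int64))).shuffle(10000).batch(128)

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    # 3. Scale the learning rate by the number of workers.
    opt = tf.optimizers.Adam(0.001 * hvd.size())

    @tf.function
    def training_step(images, labels, first_batch):
        with tf.GradientTape() as tape:
            loss = loss_fn(labels, model(images, training=True))
        # 4. Wrap the tape so gradients are allreduced across workers.
        tape = hvd.DistributedGradientTape(tape)
        grads = tape.gradient(loss, model.trainable_variables)
        opt.apply_gradients(zip(grads, model.trainable_variables))
        # 5. Broadcast initial model/optimizer state from rank 0 after the first step.
        if first_batch:
            hvd.broadcast_variables(model.variables, root_rank=0)
            hvd.broadcast_variables(opt.variables(), root_rank=0)
        return loss

    for batch, (images, labels) in enumerate(dataset.take(500)):
        loss = training_step(images, labels, batch == 0)
        # 6. Print (and checkpoint) only on rank 0 to avoid duplicated output.
        if batch % 100 == 0 and hvd.rank() == 0:
            print('step %d, loss = %.4f' % (batch, loss.numpy()))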

  3. Set up the Python environment to include TensorFlow, Keras, PyTorch, and Horovod:

    . /etc/profile.d/z00_lmod.sh
    module load conda
    conda activate

    Note that the first line (sourcing z00_lmod.sh) is only needed when setting up the environment inside a submission script; it is not needed in an interactive session.

  4. Run examples on a single node:

    • TensorFlow MNIST, scaling from 1 to 8 GPUs

      mpirun -np 1 python tensorflow2_mnist.py --device gpu
      mpirun -np 2 python tensorflow2_mnist.py --device gpu
      mpirun -np 4 python tensorflow2_mnist.py --device gpu
      mpirun -np 8 python tensorflow2_mnist.py --device gpu
    • PyTorch MNIST, scaling from 1 to 8 GPUs

      mpirun -np 1 python pytorch_mnist.py --device gpu
      mpirun -np 2 python pytorch_mnist.py --device gpu
      mpirun -np 4 python pytorch_mnist.py --device gpu
      mpirun -np 8 python pytorch_mnist.py --device gpu

    We have prepared some (non-interactive) submission scripts in ./submissions/qsub_*; a sketch of such a script is shown after the timing tables below.

    Time for 16 epochs

    GPUs | Total time (s)
    -----|---------------
       1 | 27.59
       2 | 23.31
       4 | 12.80
       8 | 14.41

    Time per epoch

    GPUs | Time per epoch (s)
    -----|-------------------
       1 | 1.31
       2 | 1.01
       4 | 0.37
       8 | 0.21
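
    For reference, a submission script along the lines of those in ./submissions/ might look like the sketch below. The Cobalt directives simply mirror the interactive qsub request from step 2; the exact directive set and file layout here are assumptions, so consult the provided qsub_* scripts for the real versions.

    #!/bin/bash
    # Hypothetical sketch of a ThetaGPU (Cobalt) submission script; compare ./submissions/qsub_*.
    #COBALT -n 1
    #COBALT -t 15
    #COBALT -q training
    #COBALT -A ALCFAITP
    #COBALT --attrs pubnet

    # Sourcing z00_lmod.sh is required here because this runs as a batch script (see step 3).
    . /etc/profile.d/z00_lmod.sh
    module load conda
    conda activate

    # Run the TensorFlow example on all 8 GPUs of the single node.
    mpirun -np 8 python tensorflow2_mnist.py --device gpu
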
  5. Profiling

    • MPI profiling: preload the HPCTW MPI trace library
      LD_PRELOAD=/soft/perftools/hpctw/lib/libmpitrace.so
    • Horovod timeline trace: point Horovod at an output file
      HOROVOD_TIMELINE=timeline.json
    For example, to collect both at once:
      LD_PRELOAD=/soft/perftools/hpctw/lib/libmpitrace.so HOROVOD_TIMELINE=timeline.json mpirun -np 8 python pytorch_mnist.py

    One can then open timeline.json in Chrome via the chrome://tracing page.