DEBUG {2023.06}[foss/2023a] TensorFlow v2.15.1 w/ CUDA 12.1.1 #808

Open
wants to merge 3 commits into
base: 2023.06-software.eessi.io

Conversation

trz42
Collaborator

@trz42 trz42 commented Nov 12, 2024

PR to debug issues building TensorFlow v2.15.1 with CUDA v12.1.1

  1. Uses an updated tensorflow.py easyblock that fixes an ImportError involving libnccl.so.2; see "tweak libpaths in TensorFlow easyblock by adding directory containing libnccl.so.2" (easybuilders/easybuild-easyblocks#3497).

Notes:

  • we should build Bazel, ml_dtypes and tensorboard first and install them in the directory for CPU-only software (double-check whether and why they are not there yet)
    • Bazel/6.3.1 is installed, but not Bazel/6.1.0, which is a dependency for this PR
    • ml_dtypes is not installed ... not sure if it should be (see the comment/question for tensorboard below) ... OR it is a new dependency of TensorFlow (check the easyconfig for the CPU-only version)
    • tensorboard/2.13.0 is available as an extension of the CPU-only installation of TensorFlow/2.13.0-foss-2023a ... we might want to install the extension under the GPU directory?
  • we should check why cuDNN is installed again (in the directory for CPU-only software) ... maybe related to switching to EESSI-extend/2023.06-easybuild and the installation path not being configured correctly
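
To check whether a package landed in the CPU-only tree or the GPU tree, one could classify the entries of a build tarball by path prefix. The following is only a rough sketch: the function name `classify_entries` and the two prefixes are illustrative assumptions modelled on the EESSI directory layout (`.../amd/zen2` for CPU-only software and `.../amd/zen2/accel/nvidia/cc80` for GPU software), not part of any EESSI tool.

```python
from collections import Counter

# Example prefixes (assumptions for illustration; adjust to the target build).
CPU_PREFIX = "2023.06/software/linux/x86_64/amd/zen2"
ACCEL_PREFIX = CPU_PREFIX + "/accel/nvidia/cc80"


def classify_entries(entries, cpu_prefix=CPU_PREFIX, accel_prefix=ACCEL_PREFIX):
    """Count tarball entries under the accelerator tree, the CPU-only tree, or elsewhere."""
    counts = Counter()
    for path in entries:
        # Test the accelerator prefix first: it is a sub-path of the CPU prefix.
        if path.startswith(accel_prefix + "/"):
            counts["accel"] += 1
        elif path.startswith(cpu_prefix + "/"):
            counts["cpu"] += 1
        else:
            counts["other"] += 1
    return counts


if __name__ == "__main__":
    sample = [
        CPU_PREFIX + "/modules/all/cuDNN/8.9.2.26-CUDA-12.1.1.lua",
        CPU_PREFIX + "/software/cuDNN/8.9.2.26-CUDA-12.1.1/LICENSE",
    ]
    print(dict(classify_entries(sample)))
```

A non-zero "cpu" count for a build that was requested with an accelerator target would point to a misconfigured installation path.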

@trz42 trz42 added 2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia labels Nov 12, 2024
@riscv-eessi-io-bot

Instance eessi-bot-riscv is configured to build for:

  • architectures: riscv64/generic
  • repositories: riscv.eessi.io-20240402


eessi-bot bot commented Nov 12, 2024

Instance eessi-bot-mc-aws is configured to build for:

  • architectures: x86_64/generic, x86_64/intel/haswell, x86_64/intel/skylake_avx512, x86_64/amd/zen2, x86_64/amd/zen3, aarch64/generic, aarch64/neoverse_n1, aarch64/neoverse_v1
  • repositories: eessi.io-2023.06-compat, eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software


eessi-bot bot commented Nov 12, 2024

Instance eessi-bot-mc-azure is configured to build for:

  • architectures: x86_64/amd/zen4
  • repositories: eessi-hpc.org-2023.06-software, eessi-hpc.org-2023.06-compat, eessi.io-2023.06-software, eessi.io-2023.06-compat

@trz42
Collaborator Author

trz42 commented Nov 12, 2024

Just building for a single CPU architecture...

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot bot commented Nov 12, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from trz42

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Nov 12, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from trz42

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted
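
The expansion of the abbreviated command arguments reported above (`repo` to `repository`, `arch` to `architecture`, `accel` to `accelerator`) can be sketched as follows. This is a minimal illustration, not the bot's actual code: the alias table is inferred only from the "expanded format" lines in this conversation, and the function name `expand_build_command` is hypothetical.

```python
# Alias table inferred from the bot's "expanded format" output (an assumption,
# not taken from the bot's source code).
ALIASES = {
    "repo": "repository",
    "arch": "architecture",
    "accel": "accelerator",
}


def expand_build_command(comment):
    """Expand abbreviated keys in a 'bot: <action> key:value ...' comment line."""
    prefix = "bot:"
    text = comment.strip()
    if not text.startswith(prefix):
        raise ValueError("not a bot command")
    # First token is the action (e.g. 'build'); the rest are key:value arguments.
    action, *args = text[len(prefix):].split()
    expanded = [action]
    for arg in args:
        key, sep, value = arg.partition(":")
        if not sep:
            raise ValueError(f"argument without ':' separator: {arg!r}")
        expanded.append(f"{ALIASES.get(key, key)}:{value}")
    return " ".join(expanded)


if __name__ == "__main__":
    print(expand_build_command(
        "bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80"
    ))
```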


eessi-bot bot commented Nov 12, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_808/28458

date | job status | comment
Nov 12 11:32:37 UTC 2024 | submitted | job id 28458 awaits release by job manager
Nov 12 11:32:49 UTC 2024 | released | job awaits launch by Slurm scheduler
Nov 12 11:37:52 UTC 2024 | running | job 28458 is running
Nov 12 11:52:06 UTC 2024 | finished | 😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-28458.out
❌ found message matching ERROR:
✅ no message matching FAILED:
✅ no message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.gz created!
Artefacts
eessi-2023.06-software-linux-x86_64-amd-zen2-1731411645.tar.gz
size: 698 MiB (732482400 bytes)
entries: 71
modules under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/modules/all
no module files in tarball
software under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80/software
no software packages in tarball
other under 2023.06/software/linux/x86_64/amd/zen2/accel/nvidia/cc80
2023.06/software/linux/x86_64/amd/zen2/modules/all/cuDNN/8.9.2.26-CUDA-12.1.1.lua
2023.06/software/linux/x86_64/amd/zen2/modules/numlib/cuDNN/8.9.2.26-CUDA-12.1.1.lua
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/cuDNN-8.9.2.26-CUDA-12.1.1-easybuild-devel
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/cuDNN-8.9.2.26-CUDA-12.1.1.eb
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/easybuild-cuDNN-8.9.2.26-20241112.113857.log.bz2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/easybuild-cuDNN-8.9.2.26-20241112.113857_test_report.md
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/cuDNN-8.9.2.26-CUDA-12.1.1.eb
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/cuDNN-8.9.2.26-CUDA-12.1.1.env
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/easyblocks/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/easyblocks/cudnn.py
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/easyblocks/tarball.py
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/hooks/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/easybuild/reprod/hooks/eb_hooks.py
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_adv_infer.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_adv_infer_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_adv_train.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_adv_train_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_backend.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_backend_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_cnn_infer.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_cnn_infer_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_cnn_train.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_cnn_train_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_ops_infer.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_ops_infer_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_ops_train.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_ops_train_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_version.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/include/cudnn_version_v8.h
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib64
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_infer.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_infer.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_infer.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_infer_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_infer_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_train.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_train.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_train.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_train_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_adv_train_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_infer.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_infer.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_infer.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_infer_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_infer_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_train.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_train.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_train.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_train_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_cnn_train_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_infer.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_infer.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_infer.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_infer_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_infer_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_train.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_train.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_train.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_train_static.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn_ops_train_static_v8.a
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn.so
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn.so.8
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/lib/libcudnn.so.8.9.2
2023.06/software/linux/x86_64/amd/zen2/software/cuDNN/8.9.2.26-CUDA-12.1.1/LICENSE
Nov 12 11:52:06 UTC 2024 | test result | 😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] ( 1/10) EESSI_LAMMPS_lj %scale=1_node %device_type=cpu %module_name=LAMMPS/29Aug2024-foss-2023b-kokkos /aeb2d9df @BotBuildTests:x86-64-amd-zen2-node+default
P: perf: 440.373 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 2/10) EESSI_LAMMPS_lj %scale=1_node %device_type=cpu %module_name=LAMMPS/2Aug2023_update2-foss-2023a-kokkos /04ff9ece @BotBuildTests:x86-64-amd-zen2-node+default
P: perf: 433.835 timesteps/s (r:0, l:None, u:None)
[ OK ] ( 3/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_allreduce %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /31ac6ab9 @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 4.69 us (r:0, l:None, u:None)
[ OK ] ( 4/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_allreduce %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /f3be40a2 @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 4.52 us (r:0, l:None, u:None)
[ OK ] ( 5/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_alltoall %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /10e66fba @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 8.96 us (r:0, l:None, u:None)
[ OK ] ( 6/10) EESSI_OSU_Micro_Benchmarks_coll %benchmark_info=mpi.collective.osu_alltoall %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /5be57ae7 @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 8.28 us (r:0, l:None, u:None)
[ OK ] ( 7/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_latency %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /c8c9aff5 @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 0.33 us (r:0, l:None, u:None)
[ OK ] ( 8/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_latency %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /9795e491 @BotBuildTests:x86-64-amd-zen2-node+default
P: latency: 0.31 us (r:0, l:None, u:None)
[ OK ] ( 9/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=1_node %module_name=OSU-Micro-Benchmarks/7.2-gompi-2023b %device_type=cpu /48da21c5 @BotBuildTests:x86-64-amd-zen2-node+default
P: bandwidth: 7853.4 MB/s (r:0, l:None, u:None)
[ OK ] (10/10) EESSI_OSU_Micro_Benchmarks_pt2pt %benchmark_info=mpi.pt2pt.osu_bw %scale=1_node %module_name=OSU-Micro-Benchmarks/7.1-1-gompi-2023a %device_type=cpu /1b8c1ca2 @BotBuildTests:x86-64-amd-zen2-node+default
P: bandwidth: 7703.42 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 10/10 test case(s) from 10 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-28458.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
  • The build failed with
    Usage: eb [options] easyconfig [...]
    
    eb: error: no such option: --include-easyblock-from-commit
    
  • NOTE: it seems odd that cuDNN is "built" again. Something must be off with the build settings.
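
The pass/fail checks listed in the job result above (messages that must or must not appear in the job output file) can be approximated with a few lines of Python. This is a hedged sketch: it uses plain substring matching, whereas the bot may well use regular expressions, and the names `CHECKS`, `check_log` and `build_succeeded` are hypothetical.

```python
# (message, must_be_present) pairs modelled on the checks shown in the report.
# Plain substring matching is an assumption made for this sketch.
CHECKS = [
    ("ERROR:", False),
    ("FAILED:", False),
    ("required modules missing:", False),
    ("No missing installations", True),
    (".tar.gz created!", True),
]


def check_log(text):
    """Return (message, found, ok) for every check against the log text."""
    results = []
    for message, must_be_present in CHECKS:
        found = message in text
        results.append((message, found, found == must_be_present))
    return results


def build_succeeded(text):
    """A build counts as successful only if every check passes."""
    return all(ok for _, _, ok in check_log(text))


if __name__ == "__main__":
    log = "ERROR: something went wrong\n/tmp/example.tar.gz created!\n"
    for message, found, ok in check_log(log):
        status = "ok" if ok else "NOT ok"
        print(f"{status:6} found={found!s:5} {message}")
    print("success:", build_succeeded(log))
```

With these rules, a log that contains an `ERROR:` line and lacks `No missing installations` is reported as a failure even though a tarball was created, which matches the outcome of job 28458.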

@trz42
Collaborator Author

trz42 commented Nov 12, 2024

Rebuilding after the argument typo was fixed...

bot: build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80

@riscv-eessi-io-bot

Updates by the bot instance eessi-bot-riscv (click for details)
  • account trz42 has NO permission to send commands to the bot


eessi-bot bot commented Nov 12, 2024

Updates by the bot instance eessi-bot-mc-aws (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from trz42

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:


eessi-bot bot commented Nov 12, 2024

Updates by the bot instance eessi-bot-mc-azure (click for details)
  • received bot command build repo:eessi.io-2023.06-software arch:x86_64/amd/zen2 accel:nvidia/cc80 from trz42

    • expanded format: build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80
  • handling command build repository:eessi.io-2023.06-software architecture:x86_64/amd/zen2 accelerator:nvidia/cc80 resulted in:

    • no jobs were submitted


eessi-bot bot commented Nov 12, 2024

New job on instance eessi-bot-mc-aws for CPU micro-architecture x86_64-amd-zen2 and accelerator nvidia/cc80 for repository eessi.io-2023.06-software in job dir /project/def-users/SHARED/jobs/2024.11/pr_808/28459

date | job status | comment
Nov 12 13:10:47 UTC 2024 | submitted | job id 28459 awaits release by job manager
Nov 12 13:11:15 UTC 2024 | released | job awaits launch by Slurm scheduler
Nov 12 13:17:18 UTC 2024 | running | job 28459 is running
  • job failed, and the job manager crashed when trying to update the above table with an update that was too large (~300 KB) ... this might be related to the install path being wrong (CPU vs GPU directory)

…-layer into 2023.06-software.eessi.io-TensorFlow-2.15.1-2023a-CUDA-12.1.1-debug
Labels
2023.06-software.eessi.io 2023.06 version of software.eessi.io accel:nvidia
2 participants