Partitions aren't created, but getting "MIG configuration applied successfully" message #18

Open
alekraus opened this issue May 2, 2023 · 8 comments


alekraus commented May 2, 2023

I have installed the mig-parted tool as root on a node. I can run the sample commands listed in the README, and each one prints "MIG configuration applied successfully" after applying a configuration from the config.yaml file. However, the partitions do not appear to be created: neither nvidia-smi nor nvidia-mig-parted export shows them (mig-devices is "{}"). Do you have any guidance on what could be going on here?

@elezar
Member

elezar commented May 3, 2023

@alekraus did applying the configuration require a mig mode change? On A100 and A30 devices, this would require a reboot. Also note that the MIG configuration (after a mode change) does not persist across reboots and would require a config to be applied at startup.

@alekraus
Author

alekraus commented May 8, 2023

Apologies for the delayed response, @elezar. The node automatically enables MIG mode on reboot. When applying the configuration, no MIG mode change or reboot was requested by the node, which has two A100 GPUs.

The first block below has the output when attempting to apply the all-2g.10gb configuration from the example config file. The second block below has the output of nvidia-smi, which is identical after reboot and after attempting to apply the configuration. The third block below has the output of nvidia-mig-parted export. Note that the same output is obtained when attempting to apply any other configuration from the example config file.

Output after running nvidia-mig-parted -d apply -f examples/config.yaml -c all-2g.10gb:

DEBU[0000] Parsing config file...                       
DEBU[0000] Selecting specific MIG config...             
DEBU[0000] Running apply-start hook                     
DEBU[0000] Checking current MIG mode...                 
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     Asserting MIG mode: Enabled              
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Current MIG mode: Enabled                
DEBU[0000]   GPU 1: 0x20B510DE                          
DEBU[0000]     Asserting MIG mode: Enabled              
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Current MIG mode: Enabled                
DEBU[0000] Checking current MIG device configuration... 
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     Asserting MIG config: map[2g.10gb:3]     
DEBU[0000]   GPU 1: 0x20B510DE                          
DEBU[0000]     Asserting MIG config: map[2g.10gb:3]     
DEBU[0000] Running pre-apply-config hook                
DEBU[0000] Applying MIG device configuration...         
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Updating MIG config: map[2g.10gb:3]      
DEBU[0000]   GPU 1: 0x20B510DE                          
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Updating MIG config: map[2g.10gb:3]      
DEBU[0000] Running apply-exit hook                      
MIG configuration applied successfully

Output of nvidia-smi both after reboot and after attempting to apply the configuration (both outputs are the same):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                   On |
| N/A   35C    P0    45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                   On |
| N/A   34C    P0    45W / 300W |      0MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  No MIG devices found                                                       |
+-----------------------------------------------------------------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Output of nvidia-mig-parted export:

version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices: {}
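
One way to catch this kind of silent no-op is to compare the requested mig-devices map against what export reports. The following is a minimal sketch and not part of mig-parted: the function name missing_mig_devices is hypothetical, and the two dictionaries are typed in by hand from the config and the export output above rather than parsed from YAML.

```python
def missing_mig_devices(requested, exported):
    """Return {profile: shortfall} for profiles with fewer devices than requested.

    Both arguments are plain dicts mapping a MIG profile name to a device
    count, mirroring the `mig-devices` mapping in a mig-parted config.
    """
    missing = {}
    for profile, want in requested.items():
        have = exported.get(profile, 0)
        if have < want:
            missing[profile] = want - have
    return missing


# Values copied by hand from this report (assumption: one GPU's view).
requested = {"2g.10gb": 3}   # from the all-2g.10gb config
exported = {}                # `mig-devices: {}` in the export output

shortfall = missing_mig_devices(requested, exported)
if shortfall:
    print("requested MIG devices were not created:", shortfall)
```

mig-parted also ships an assert subcommand that appears to serve a similar purpose; see the README for its exact usage.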

@elezar
Member

elezar commented May 11, 2023

Hi @alekraus. Which version of mig-parted are you using in this case? I have just done a sanity check on my side with an executable built off 9ab5c663d6570cb1bd4979b66e831391fae5d265 (v0.5.2) and this seems to apply the config correctly. This is on a 525.85.12 driver though and on NVIDIA A100-SXM4-40GB devices.

Note that for an 80GB device, the 2g.10gb profile does not exist and should be 2g.20gb. Could you create a config file with the following contents:

$ cat config.yaml
version: v1
mig-configs:
  all-2g.20gb:
  - devices: all
    mig-enabled: true
    mig-devices:
     "2g.20gb": 3

and check whether applying this works as expected.

If it does, then we have to improve our checks around valid profile names.
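
To illustrate what such a check could look like, here is a minimal sketch in Python. It is not how mig-parted validates profiles: the KNOWN_PROFILES table and the unsupported_profiles function are hypothetical, and the table is a hand-written, non-exhaustive subset of A100 profile names keyed by memory size. On a real system the driver is the source of truth (for example, nvidia-smi mig -lgip lists the GPU instance profiles a device supports).

```python
# Illustrative, non-exhaustive subset of A100 MIG profile names, keyed by
# GPU memory size in GB. Assumption: a real check should query the driver.
KNOWN_PROFILES = {
    40: {"1g.5gb", "2g.10gb", "3g.20gb", "4g.20gb", "7g.40gb"},
    80: {"1g.10gb", "2g.20gb", "3g.40gb", "4g.40gb", "7g.80gb"},
}


def unsupported_profiles(requested, gpu_memory_gb):
    """Return the sorted profile names in `requested` not known for this GPU.

    `requested` maps profile name to device count, like `mig-devices`
    in a mig-parted config.
    """
    known = KNOWN_PROFILES.get(gpu_memory_gb, set())
    return sorted(name for name in requested if name not in known)


# The config from the original report, applied to an 80GB A100.
bad = unsupported_profiles({"2g.10gb": 3}, 80)
if bad:
    print("unsupported MIG profiles for this GPU:", bad)
```

With a check like this, the all-2g.10gb config would be rejected on an 80GB device instead of being reported as applied.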

@alekraus
Author

Hi @elezar, your suggestion appears to have made the process work as expected. I created a file called config2.yaml with the content for 2g.20gb. Below is the resulting output after running nvidia-mig-parted -d apply -f examples/config2.yaml -c all-2g.20gb, followed by the output from running nvidia-mig-parted export and nvidia-smi right after:

DEBU[0000] Parsing config file...                       
DEBU[0000] Selecting specific MIG config...             
DEBU[0000] Running apply-start hook                     
DEBU[0000] Checking current MIG mode...                 
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     Asserting MIG mode: Enabled              
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Current MIG mode: Enabled                
DEBU[0000]   GPU 1: 0x20B510DE                          
DEBU[0000]     Asserting MIG mode: Enabled              
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Current MIG mode: Enabled                
DEBU[0000] Checking current MIG device configuration... 
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     Asserting MIG config: map[2g.20gb:3]     
DEBU[0000]   GPU 1: 0x20B510DE                          
DEBU[0000]     Asserting MIG config: map[2g.20gb:3]     
DEBU[0000] Running pre-apply-config hook                
DEBU[0000] Applying MIG device configuration...         
DEBU[0000] Walking MigConfig for (devices=all)          
DEBU[0000]   GPU 0: 0x20B510DE                          
DEBU[0000]     MIG capable: true                        
DEBU[0000]     Updating MIG config: map[2g.20gb:3]      
DEBU[0003]   GPU 1: 0x20B510DE                          
DEBU[0003]     MIG capable: true                        
DEBU[0003]     Updating MIG config: map[2g.20gb:3]      
DEBU[0006] Running apply-exit hook                      
MIG configuration applied successfully

Output from running nvidia-mig-parted export right after:

version: v1
mig-configs:
  current:
  - devices: all
    mig-enabled: true
    mig-devices:
      2g.20gb: 3

Output from running nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  On   | 00000000:65:00.0 Off |                   On |
| N/A   34C    P0    45W / 300W |     39MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100 80G...  On   | 00000000:CA:00.0 Off |                   On |
| N/A   34C    P0    45W / 300W |     39MiB / 81920MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    4   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    5   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    5   0   1  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    6   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
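
For anyone who wants to script this check, the MIG device rows in the table above can be counted per GPU. This is a rough sketch only: the count_mig_devices function is hypothetical, the regular expression is fitted to the table layout shown in this comment (driver 515.43.04), and parsing human-readable nvidia-smi output is fragile across driver versions.

```python
import re
from collections import Counter

# A MIG device row starts with four integer columns: GPU, GI ID, CI ID, MIG Dev.
MIG_ROW = re.compile(r"^\|\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+\|")


def count_mig_devices(smi_output):
    """Count MIG device rows per GPU index in nvidia-smi table output."""
    counts = Counter()
    for line in smi_output.splitlines():
        match = MIG_ROW.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return dict(counts)


# Three rows copied from the table above; the middle one is a
# continuation row and is not counted.
sample = """\
|  0    3   0   0  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
|                  |      0MiB / 32767MiB |           |                       |
|  1    6   0   2  |     13MiB / 19968MiB | 28      0 |  2   0    1    0    0 |
"""
print(count_mig_devices(sample))
```

Run against the full output above, this should report three MIG devices on each of the two GPUs, matching the requested 2g.20gb: 3.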

Thank you very much for your help! Much appreciated.

@elezar
Member

elezar commented May 15, 2023

Thanks for confirming that this works. I have created an internal ticket to track adding a more verbose error if an unsupported profile name is requested.

Could you please confirm the version of mig-parted that you were using?

@klueska
Contributor

klueska commented May 15, 2023

As of v0.5.2 (i.e. the very latest), mig-parted should already error out if the requested MIG profile is not valid for the current platform. @alekraus, can you verify which version of mig-parted you were using that didn't do this?

@klueska
Contributor

klueska commented May 15, 2023

Also note that this is not quite accurate:

"did applying the configuration require a mig mode change? On A100 and A30 devices, this would require a reboot."

A GPU reset is necessary, not a reboot, and mig-parted should automatically take care of bringing all GPU clients down and back up so that the reset can happen when necessary. This is actually one of its major value-adds over raw nvidia-smi, which cannot do this on its own.

@alekraus
Author

Hi @elezar and @klueska, thank you for the follow-up. These errors occurred while using nvidia-mig-parted v0.5.2. Let me know if you'd like any further information.
