The VM dies and never comes back to life. #110

Open
jclab-joseph opened this issue Nov 16, 2022 · 0 comments

The VM crashed suddenly while it was running.
Kernel log at the time of the crash:

[111183.875832] [nvidia-vgpu-vfio] 3b5b5055-de1f-4535-8af3-b443eb10c6bb: ERESTARTSYS received during open, waiting for 25000 milliseconds for operation to complete
[111183.891179] [nvidia-vgpu-vfio] 3b5b5055-de1f-4535-8af3-b443eb10c6bb: vGPU migration disabled
[111304.691470] [nvidia-vgpu-vfio] 3b5b5055-de1f-4535-8af3-b443eb10c6bb: vGPU migration disabled

After that, the VM can no longer be started. It only works again after a full reboot of the host.
Kernel log from the restart attempt:

[173695.627192] [nvidia-vgpu-vfio] 3b5b5055-de1f-4535-8af3-b443eb10c6bb: ERESTARTSYS received during open, waiting for 25000 milliseconds for operation to complete
[173695.659145] [nvidia-vgpu-vfio] 3b5b5055-de1f-4535-8af3-b443eb10c6bb: start failed. status: 0x0 

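nvidia-vgpu-mgr log (the "11월" in the timestamps is Korean for November; the first start on Nov 15 succeeded, the restart on Nov 16 failed):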

11월 15 16:00:51 nvidia-vgpu-mgr[1740]: VgpuStart {
                                                           uuid: {3b5b5055-de1f-4535-8af3-b443eb10c6bb},
                                                           config_params: "vgpu_type_id=46",
                                                           qemu_pid: 3472883,
                                                           unknown_414: [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
                                                       }
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 3b5b5055-de1f-4535-8af3-b443eb10c6bb GPU PCI id 00:01:00.0 config params vgpu_type_id=46
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=46
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_env_log: Successfully updated env symbols!
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: cmd: 0x20801322 failed.
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: cmd: 0x2080014b failed.
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: VgpuConfig {
                                                              vgpu_type: 46,
                                                              vgpu_name: "GRID P40-1Q",
                                                              vgpu_class: "Quadro",
                                                              vgpu_signature: [],
                                                              features: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                              max_instances: 24,
                                                              num_heads: 4,
                                                              max_resolution_x: 5120,
                                                              max_resolution_y: 2880,
                                                              max_pixels: 17694720,
                                                              frl_config: 60,
                                                              cuda_enabled: 1,
                                                              ecc_supported: 1,
                                                              mig_instance_size: 0,
                                                              multi_vgpu_supported: 0,
                                                              vdev_id: 0x1b3811e8,
                                                              pdev_id: 0x1b38,
                                                              fb_length: 0x38000000,
                                                              mappable_video_size: 0x400000,
                                                              fb_reservation: 0x8000000,
                                                              encoder_capacity: 0x64,
                                                              bar1_length: 0x100,
                                                              frl_enable: 1,
                                                              adapter_name: "GRID P40-1Q",
                                                              adapter_name_unicode: "GRID P40-1Q",
                                                              short_gpu_name_string: "GP107-A",
                                                              licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                              vgpu_extra_params: [],
                                                          }
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Applying profile nvidia-46 overrides
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/num_heads: 4 -> 1
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/max_resolution_x: 5120 -> 1920
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/max_resolution_y: 2880 -> 1080
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/max_pixels: 17694720 -> 2073600
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/cuda_enabled: 1 -> 1
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: Patching nvidia-46/frl_enable: 1 -> 0
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: cmd: 0xa0810115 failed.
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Setting mappable_cpu_host_aperture to 10M
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): gpu-pci-id : 0x100
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): vgpu_type : Quadro
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Framebuffer: 0x38000000
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e8
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: ######## vGPU Manager Information: ########
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: Driver Version: 510.85.03
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0xd0001)
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: cmd: 0x2080012f failed.
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Init frame copy engine: syncing...
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): vGPU migration enabled
11월 15 16:00:51 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: display_init inst: 0 successful
11월 15 16:01:08 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: ######## Guest NVIDIA Driver Information: ########
11월 15 16:01:08 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: Driver Version: 496.49
11월 15 16:01:08 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: vGPU version: 0xc0001
11월 15 16:01:08 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Current max guest pfn = 0x46fe6e!
11월 15 16:01:12 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Current max guest pfn = 0x47c113!
11월 15 16:05:38 nvidia-vgpu-mgr[3472905]: notice: vmiop_log: (0x0): Current max guest pfn = 0x47ffea!
11월 16 09:20:39 nvidia-vgpu-mgr[1740]: VgpuStart {
                                                           uuid: {3b5b5055-de1f-4535-8af3-b443eb10c6bb},
                                                           config_params: "vgpu_type_id=46",
                                                           qemu_pid: 1199000,
                                                           unknown_414: [0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
                                                       }
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 3b5b5055-de1f-4535-8af3-b443eb10c6bb GPU PCI id 00:01:00.0 config params vgpu_type_id=46
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=46
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_env_log: Successfully updated env symbols!
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: cmd: 0x20801322 failed.
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: cmd: 0x2080014b failed.
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: VgpuConfig {
                                                              vgpu_type: 46,
                                                              vgpu_name: "GRID P40-1Q",
                                                              vgpu_class: "Quadro",
                                                              vgpu_signature: [],
                                                              features: "Quadro-Virtual-DWS,5.0;GRID-Virtual-WS,2.0;GRID-Virtual-WS-Ext,2.0",
                                                              max_instances: 24,
                                                              num_heads: 4,
                                                              max_resolution_x: 5120,
                                                              max_resolution_y: 2880,
                                                              max_pixels: 17694720,
                                                              frl_config: 60,
                                                              cuda_enabled: 1,
                                                              ecc_supported: 1,
                                                              mig_instance_size: 0,
                                                              multi_vgpu_supported: 0,
                                                              vdev_id: 0x1b3811e8,
                                                              pdev_id: 0x1b38,
                                                              fb_length: 0x38000000,
                                                              mappable_video_size: 0x400000,
                                                              fb_reservation: 0x8000000,
                                                              encoder_capacity: 0x64,
                                                              bar1_length: 0x100,
                                                              frl_enable: 1,
                                                              adapter_name: "GRID P40-1Q",
                                                              adapter_name_unicode: "GRID P40-1Q",
                                                              short_gpu_name_string: "GP107-A",
                                                              licensed_product_name: "NVIDIA RTX Virtual Workstation",
                                                              vgpu_extra_params: [],
                                                          }
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Applying profile nvidia-46 overrides
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/num_heads: 4 -> 1
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/max_resolution_x: 5120 -> 1920
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/max_resolution_y: 2880 -> 1080
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/max_pixels: 17694720 -> 2073600
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/cuda_enabled: 1 -> 1
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: Patching nvidia-46/frl_enable: 1 -> 0
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: cmd: 0xa0810115 failed.
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): Setting mappable_cpu_host_aperture to 10M
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): gpu-pci-id : 0x100
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): vgpu_type : Quadro
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): Framebuffer: 0x38000000
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): Virtual Device Id: 0x1b38:0x11e8
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: ######## vGPU Manager Information: ########
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: Driver Version: 510.85.03
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): vGPU supported range: (0x70001, 0xd0001)
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: cmd: 0x2080012f failed.
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: notice: vmiop_log: (0x0): Cannot query ECC status. vGPU ECC support will be disabled.
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: NVOS status 0x51
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: Assertion Failed at 0xf92b27a8:112
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: 13 frames returned by backtrace
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv007217vgpu+0x35) [0x7f99f92e5805]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv007253vgpu+0x16c) [0x7f99f929ee3c]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv007328vgpu+0xf8) [0x7f99f92b27a8]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xb47b6) [0x7f99f92b47b6]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xb62c8) [0x7f99f92b62c8]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0x13b7e) [0x563dddc13b7e]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0x14c79) [0x563dddc14c79]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0xeb43) [0x563dddc0eb43]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0xc3b6) [0x563dddc0c3b6]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0x3bda) [0x563dddc03bda]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f99f9a1dd90]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f99f9a1de40]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: vgpu(+0x3c1d) [0x563dddc03c1d]
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Failed to alloc guest FB memory
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 2 (vmiop-display: error allocating framebuffer)
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 2
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Thread for engine 0x0 could not join with error 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Failed to free thread event for engine 0x0. Error: 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Thread for engine 0x4 could not join with error 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Failed to free thread event for engine 0x4. Error: 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Thread for engine 0x5 could not join with error 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Failed to free thread event for engine 0x5. Error: 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Thread for engine 0x6 could not join with error 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: (0x0): Failed to free thread event for engine 0x6. Error: 0x5
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_log: display_init failed for inst: 0
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
11월 16 09:20:39 nvidia-vgpu-mgr[1199023]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1a

Host GPU: NVIDIA GeForce GTX 1050 Ti

nvidia-smi:

$ sudo nvidia-smi
Wed Nov 16 11:22:19 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:01:00.0  On |                  N/A |
| 35%   40C    P0    N/A /  75W |   1743MiB /  4096MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

$ sudo nvidia-smi vgpu
Wed Nov 16 11:22:22 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02              Driver Version: 510.85.03                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA GeForce GTX 105...  | 00000000:01:00.0             |   1%       |
+---------------------------------+------------------------------+------------+
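
As a rough sanity check on the "Failed to alloc guest FB memory" error, the hex sizes from the VgpuConfig dump can be converted to MiB and compared with the memory figures nvidia-smi reports. This is only a sketch: it assumes the fb_* fields are byte counts (which matches the 1 GiB of the P40-1Q profile) and that the guest framebuffer request is roughly fb_length + fb_reservation. Neither assumption is confirmed by the driver.

```python
# Values copied from the VgpuConfig dump in the nvidia-vgpu-mgr log.
# Assumption: these fields are sizes in bytes.
vgpu_config = {
    "fb_length": 0x38000000,
    "fb_reservation": 0x8000000,
    "mappable_video_size": 0x400000,
}

MIB = 1024 * 1024

# Convert every field to MiB for easier comparison with nvidia-smi.
sizes_mib = {name: value // MIB for name, value in vgpu_config.items()}
for name, mib in sizes_mib.items():
    print(f"{name}: {mib} MiB")

# Assumption: the guest FB request is about fb_length + fb_reservation.
guest_fb_mib = sizes_mib["fb_length"] + sizes_mib["fb_reservation"]

# Memory-Usage column from the nvidia-smi output above: 1743MiB / 4096MiB.
total_mib, used_mib = 4096, 1743
free_mib = total_mib - used_mib

print(f"estimated guest FB request: {guest_fb_mib} MiB")
print(f"reported free on host GPU:  {free_mib} MiB")
```

Note that nvidia-smi still reports 1743 MiB in use even though "nvidia-smi vgpu" lists no vGPU and no running processes are shown, so part of that memory may still be held from the crashed VM.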