Merge pull request #392 from tomstocker-ethz/main
added template_nvidia-smi_integration_v7_vGPU.yaml, added trigger des…
abakaldin authored Oct 22, 2024
2 parents d616b3c + 0134e88 commit 2a5a14d
Showing 3 changed files with 377 additions and 0 deletions.
@@ -0,0 +1,68 @@
# NVidia Sensors

## Overview

This template integrates NVidia SMI for a single graphics card with Zabbix.

The template adds monitoring of:

* GPU Utilisation
* GPU Power Consumption
* GPU Memory (Used, Free, Total)
* GPU Temperature
* GPU Fan Speed

The following agent parameters can be used to add the metrics into Zabbix.

UserParameter=gpu.temp,nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0
UserParameter=gpu.memtotal,nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0
UserParameter=gpu.used,nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0
UserParameter=gpu.free,nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0
UserParameter=gpu.fanspeed,nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i 0
UserParameter=gpu.utilisation,nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i 0
UserParameter=gpu.power,nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -i 0
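
The UserParameter lines above all follow one pattern: a single `--query-gpu` field, CSV output without header or units, and GPU index 0. To check the values outside Zabbix, a minimal Python sketch of that pattern might look like the following. The helper names `build_command`, `parse_value`, and `query` are hypothetical and not part of the template, and `query` assumes `nvidia-smi` is available on the PATH.

```python
import subprocess
from typing import List


def build_command(field: str, gpu_index: int = 0) -> List[str]:
    """Build the nvidia-smi argument list used by the UserParameter lines."""
    return [
        "nvidia-smi",
        f"--query-gpu={field}",
        "--format=csv,noheader,nounits",
        "-i",
        str(gpu_index),
    ]


def parse_value(output: str) -> float:
    """Convert the single bare number printed by nvidia-smi into a float."""
    return float(output.strip())


def query(field: str, gpu_index: int = 0) -> float:
    """Run nvidia-smi for one field (assumes nvidia-smi is installed)."""
    result = subprocess.run(
        build_command(field, gpu_index),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_value(result.stdout)
```

On a host with a supported GPU, `query("temperature.gpu")` should return the same number the `gpu.temp` item collects.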

## Author

Richard Kavanagh
Updated & prettified by Tom Stocker, [email protected]

## Macros used

There are no macros in this template.

## Template links

There are no template links in this template.

## Discovery rules

There are no discovery rules in this template.

## Items collected

|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|GPU Power|<p>-</p>|`Zabbix agent`|gpu.power<p>Update: 30</p>|
|GPU Free Memory|<p>-</p>|`Zabbix agent`|gpu.free<p>Update: 30</p>|
|GPU Utilisation|<p>-</p>|`Zabbix agent`|gpu.utilisation<p>Update: 30</p>|
|GPU Total Memory|<p>-</p>|`Zabbix agent`|gpu.memtotal<p>Update: 30</p>|
|GPU Temperature|<p>-</p>|`Zabbix agent`|gpu.temp<p>Update: 30</p>|
|GPU Used Memory|<p>-</p>|`Zabbix agent`|gpu.used<p>Update: 30</p>|
|GPU Fan Speed|<p>-</p>|`Zabbix agent`|gpu.fanspeed<p>Update: 30</p>|


## Triggers

GPU Temperature over 95c {HOSTNAME}
last(/NVidia Sensors/gpu.temp,#2)>95
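
In Zabbix expressions, `last(/host/key,#N)` refers to the N-th most recent value, so the trigger above compares the second most recent temperature reading with 95, not the newest one. The Python sketch below only illustrates that reading of the expression; the function names and the oldest-to-newest list layout are assumptions made for the example, not Zabbix internals.

```python
from typing import Sequence


def last(history: Sequence[float], n: int = 1) -> float:
    """Return the n-th most recent value; history is ordered oldest to newest."""
    return history[-n]


def temperature_trigger(history: Sequence[float], threshold: float = 95.0) -> bool:
    """Mimic last(/NVidia Sensors/gpu.temp,#2)>95 on a list of readings."""
    return last(history, 2) > threshold
```

With readings `[90, 96, 94]` the sketch reports a problem, because the second most recent value is 96; with `[90, 94, 96]` it does not, even though the newest reading is above the threshold.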

## Notes about GRID driver for virtual GPUs

With the GRID driver for virtual GPUs, the following items produce no output, so you may want to disable them. Alternatively, import template_nvidia-smi_integration_v7_vGPU.yaml directly, where these items are already removed:

|Name|Description|Type|Key and additional info|
|----|-----------|----|----|
|GPU Power|<p>-</p>|`Zabbix agent`|gpu.power<p>Update: 30</p>|
|GPU Temperature|<p>-</p>|`Zabbix agent`|gpu.temp<p>Update: 30</p>|
|GPU Fan Speed|<p>-</p>|`Zabbix agent`|gpu.fanspeed<p>Update: 30</p>|
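
To find out which items produce no usable value on a given host, one option is to collect the raw output of each agent parameter and flag every key whose output is not a number. The sketch below assumes that unsupported fields come back as a non-numeric placeholder such as `[N/A]`; that placeholder and the helper name `items_without_output` are assumptions for illustration, not something this template defines.

```python
from typing import Dict, List


def items_without_output(raw_outputs: Dict[str, str]) -> List[str]:
    """Return the item keys whose raw output cannot be parsed as a number."""
    unsupported = []
    for key, output in raw_outputs.items():
        try:
            float(output.strip())
        except ValueError:
            # Empty or placeholder output: candidate for disabling.
            unsupported.append(key)
    return sorted(unsupported)
```

Passing a mapping such as `{"gpu.power": "[N/A]", "gpu.utilisation": "12"}` returns `["gpu.power"]`, which is the kind of item the note above suggests disabling.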
@@ -0,0 +1,203 @@
zabbix_export:
  version: '7.0'
  template_groups:
    - uuid: 5bb2de64036c44f793c9f82c25ea9fdc
      name: Templates
  templates:
    -
      uuid: 499019c3dfce41dfa20c6052b49e6eea
      template: 'NVidia Sensors'
      name: 'NVidia Sensors'
      description: |
        ## Overview
        This template integrates NVidia SMI for a single graphics card with Zabbix.
        The template adds monitoring of:
        * GPU Utilisation
        * GPU Power Consumption
        * GPU Memory (Used, Free, Total)
        * GPU Temperature
        * GPU Fan Speed
        The following agent parameters can be used to add the metrics into Zabbix.
        UserParameter=gpu.temp,nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits -i 0
        UserParameter=gpu.memtotal,nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0
        UserParameter=gpu.used,nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0
        UserParameter=gpu.free,nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0
        UserParameter=gpu.fanspeed,nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i 0
        UserParameter=gpu.utilisation,nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i 0
        UserParameter=gpu.power,nvidia-smi --query-gpu=power.draw --format=csv,noheader,nounits -i 0
        ## Author
        Richard Kavanagh
        Updated & prettyfied by Tom Stocker, [email protected]
      groups:
        -
          name: Templates
      items:
        -
          uuid: 847765be9f1b45358653eb1561e69876
          name: 'GPU Fan Speed'
          key: gpu.fanspeed
          delay: '30'
          value_type: FLOAT
          units: '%'
          tags:
            -
              tag: Application
              value: Nvidia
        -
          uuid: f512bb1996184861a37119aa81c260aa
          name: 'GPU Free Memory'
          key: gpu.free
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            -
              type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            -
              tag: Application
              value: Nvidia
        -
          uuid: 1d82cd528b15453ab7433b91f6dd1b29
          name: 'GPU Total Memory'
          key: gpu.memtotal
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            -
              type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            -
              tag: Application
              value: Nvidia
        -
          uuid: ef1723bd932a4efdb6bb98feabad8b9d
          name: 'GPU Power'
          key: gpu.power
          delay: '30'
          value_type: FLOAT
          units: W
          tags:
            -
              tag: Application
              value: Nvidia
        -
          uuid: c866b39f748a471c97511940ae637db9
          name: 'GPU Temperature'
          key: gpu.temp
          delay: '30'
          value_type: FLOAT
          units: C
          tags:
            -
              tag: Application
              value: Nvidia
          triggers:
            -
              uuid: ce8aabec9b4345ca9beeba9075901d78
              expression: 'last(/NVidia Sensors/gpu.temp,#2)>95'
              name: 'GPU Temperature over 95c {HOSTNAME}'
              priority: AVERAGE
        -
          uuid: 79c04191119a4390a98aa1a97f17bc21
          name: 'GPU Used Memory'
          key: gpu.used
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            -
              type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            -
              tag: Application
              value: Nvidia
        -
          uuid: d8cf86331a54458e8a9a09ffea7f295e
          name: 'GPU Utilisation'
          key: gpu.utilisation
          delay: '30'
          value_type: FLOAT
          units: '%'
          tags:
            -
              tag: Application
              value: Nvidia
  graphs:
    -
      uuid: 0b1890f24cff4e29b32ac6d6e94b4590
      name: 'GPU Memory'
      graph_items:
        -
          color: C80000
          item:
            host: 'NVidia Sensors'
            key: gpu.free
        -
          sortorder: '1'
          color: 00C800
          item:
            host: 'NVidia Sensors'
            key: gpu.memtotal
        -
          sortorder: '2'
          color: 0000C8
          item:
            host: 'NVidia Sensors'
            key: gpu.used
    -
      uuid: fac5f402ae1345e2a9102a0d470167f7
      name: 'GPU Power'
      graph_items:
        -
          color: C80000
          item:
            host: 'NVidia Sensors'
            key: gpu.power
    -
      uuid: 07f45984487a4d82a487fdba9b73f2d4
      name: 'GPU Temperature'
      graph_items:
        -
          color: C80000
          item:
            host: 'NVidia Sensors'
            key: gpu.temp
        -
          sortorder: '1'
          color: 0000EE
          yaxisside: RIGHT
          item:
            host: 'NVidia Sensors'
            key: gpu.fanspeed
    -
      uuid: ce35cca2d64f46bdaff181aad5bbd4d5
      name: 'GPU Utilisation'
      graph_items:
        -
          color: C80000
          item:
            host: 'NVidia Sensors'
            key: gpu.utilisation
        -
          sortorder: '1'
          color: 33FF33
          yaxisside: RIGHT
          item:
            host: 'NVidia Sensors'
            key: gpu.power
@@ -0,0 +1,106 @@
zabbix_export:
  version: '7.0'
  template_groups:
    - uuid: 0fc75b67ced9466abe64c1a8c83d46e9
      name: Templates
  templates:
    - uuid: 0794f363201240ab92c4daf13c12f1b5
      template: 'NVidia Sensors vGPU'
      name: 'NVidia Sensors vGPU'
      description: |
        ## Overview
        This template integrates NVidia SMI for a single virtual graphics card with Zabbix.
        The template adds monitoring of:
        * GPU Utilisation
        * GPU Memory (Used, Free, Total)
        The following agent parameters can be used to add the metrics into Zabbix.
        UserParameter=gpu.memtotal,nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0
        UserParameter=gpu.used,nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits -i 0
        UserParameter=gpu.free,nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i 0
        UserParameter=gpu.utilisation,nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits -i 0
        ## Author
        Richard Kavanagh
        Updated & prettyfied by Tom Stocker, [email protected]
      groups:
        - name: Templates
      items:
        - uuid: 742df99113b44f149539ec5fb7eb1f2e
          name: 'GPU Free Memory'
          key: gpu.free
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            - type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            - tag: Application
              value: Nvidia
        - uuid: 9a8ed162aae342f1b63986b6a156bff1
          name: 'GPU Total Memory'
          key: gpu.memtotal
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            - type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            - tag: Application
              value: Nvidia
        - uuid: 649f37f9b0fd4daea820a980ad5b1a54
          name: 'GPU Used Memory'
          key: gpu.used
          delay: '30'
          value_type: FLOAT
          units: B
          preprocessing:
            - type: MULTIPLIER
              parameters:
                - '1048576'
          tags:
            - tag: Application
              value: Nvidia
        - uuid: 244a7ef6b66e4e02b535914ebf78a96f
          name: 'GPU Utilisation'
          key: gpu.utilisation
          delay: '30'
          value_type: FLOAT
          units: '%'
          tags:
            - tag: Application
              value: Nvidia
  graphs:
    - uuid: 14a164bdb28648b7a9f57aa9c0b3d78c
      name: 'GPU Memory'
      graph_items:
        - color: C80000
          item:
            host: 'NVidia Sensors vGPU'
            key: gpu.free
        - sortorder: '1'
          color: 00C800
          item:
            host: 'NVidia Sensors vGPU'
            key: gpu.memtotal
        - sortorder: '2'
          color: 0000C8
          item:
            host: 'NVidia Sensors vGPU'
            key: gpu.used
    - uuid: d3f857f7b7444c4faff21b736bdca130
      name: 'GPU Utilisation'
      graph_items:
        - color: C80000
          item:
            host: 'NVidia Sensors vGPU'
            key: gpu.utilisation
