Taken from https://github.com/opiproject/otel
OPI standardized on OTEL, but is agnostic to the actual collector or agent implementation that each vendor decides to run inside its DPU or IPU.
For example, Nvidia uses the Nvidia DOCA Telemetry Service; unfortunately, it does not support OTEL at the time of writing.
Create a telegraf.conf file, see example here
- change 172.22.0.1 in outputs.opentelemetry to the correct management server name/IP
- change 192.168.240.1 and the credentials to those of the internal DPU/IPU AMC/BMC for Redfish collection
Run telegraf container:
sudo docker run -d --restart=always --network=host -v ./telegraf.d/telegraf.conf.bf2:/etc/telegraf/telegraf.conf docker.io/library/telegraf:1.31
To monitor SPDK storage metrics, make sure the correct service is running:
systemctl stop mlnx_snap
systemctl start spdk_tgt
Make sure a few block devices exist to monitor, for example:
spdk_rpc.py bdev_malloc_create -b Malloc0 64 512
spdk_rpc.py bdev_malloc_create -b Malloc1 64 512
Make sure the proxy script is running:
# TODO: make it a service
spdk_rpc_http_proxy.py 0.0.0.0 9009 spdkuser spdkpass
And add this to your config file:
[[inputs.http]]
urls = ["http://localhost:9009"]
headers = {"Content-Type" = "application/json"}
method = "POST"
username = "spdkuser"
password = "spdkpass"
body = '{"id":1, "method": "bdev_get_iostat"}'
data_format = "json"
name_override = "spdk"
json_strict = true
tag_keys = ["name"]
json_query = "result.bdevs"
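To check the proxy independently of Telegraf, the same JSON-RPC call can be issued by hand. The following is a minimal Python sketch, not part of the original setup: the helper names build_iostat_request, bdev_counters and fetch_bdev_counters are hypothetical, and the filtering only approximates what json_query = "result.bdevs" with tag_keys = ["name"] does (numeric values become fields, name becomes a tag).

```python
import base64
import json
import urllib.request


def build_iostat_request(url="http://localhost:9009",
                         user="spdkuser", password="spdkpass"):
    """Build the POST request described by the inputs.http block above."""
    body = json.dumps({"id": 1, "method": "bdev_get_iostat"}).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


def bdev_counters(reply):
    """Return {bdev name: {counter: value}} from a bdev_get_iostat reply.

    Approximation of json_query = "result.bdevs": only numeric values are
    kept as fields, and "name" is used as the key (the tag).
    """
    counters = {}
    for bdev in reply["result"]["bdevs"]:
        counters[bdev["name"]] = {
            key: value
            for key, value in bdev.items()
            if isinstance(value, (int, float)) and not isinstance(value, bool)
        }
    return counters


def fetch_bdev_counters(**kwargs):
    """Send the request to the running proxy and return the parsed counters."""
    with urllib.request.urlopen(build_iostat_request(**kwargs), timeout=5) as resp:
        return bdev_counters(json.load(resp))
```

Calling fetch_bdev_counters() on the DPU should return one entry per bdev (Malloc0 and Malloc1 in the example above) if the proxy and spdk_tgt are up.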
For regular servers, add to your telegraf config file:
[[inputs.temp]]
# no configuration
For Nvidia BlueField cards, to monitor temperature, add to your telegraf config file:
[[inputs.file]]
files = ["/run/emu_param/bluefield_temp"]
name_override = "temp"
value_field_name = "temp"
data_format = "value"
data_type = "integer"
file_tag = "sensor"
and add to your docker run command:
-v /run/emu_param:/run/emu_param
and make sure the emulation service is running:
systemctl start set_emu_param
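To sanity-check the BlueField temperature file before pointing Telegraf at it, the parsing can be reproduced in a few lines. This is a rough Python sketch under assumptions: temp_metric and read_temp are hypothetical helpers, the output only approximates the line protocol Telegraf would emit (host tag and timestamp are omitted), and the file is assumed to hold a single integer.

```python
from pathlib import PurePosixPath


def temp_metric(path, text):
    """Approximate the metric produced by the inputs.file block above."""
    # data_format = "value", data_type = "integer": the file holds one integer
    value = int(text.strip())
    # file_tag = "sensor": the tag value is the base name of the file
    sensor = PurePosixPath(path).name
    # name_override = "temp" names the metric, value_field_name names the field
    return f"temp,sensor={sensor} temp={value}i"


def read_temp(path="/run/emu_param/bluefield_temp"):
    """Read the file written by set_emu_param and format it as a metric line."""
    with open(path, encoding="ascii") as handle:
        return temp_metric(path, handle.read())
```

If read_temp() raises FileNotFoundError or ValueError, the emulation service is most likely not running or the volume is not mounted into the container.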
For Intel MEV cards, the temperature sensor is on the ICC chip and is not easily accessible, see #72:
[[inputs.exec]]
commands = ["iset-cli get-temperature"]
name_override = "temp"
data_format = "json"
See management server details here
Run docker compose up -d or docker-compose up -d
❗ docker-compose is deprecated. For details, see Migrate to Compose V2.
This will start the following services:
- OTEL Gateway Collector to aggregate telemetry from all DPUs and IPUs.
- Prometheus: monitoring system & time series database.
- Grafana: open source analytics & monitoring solution for every database.
- http://172.22.0.1:13133 - OTEL Gateway Collector health check
- http://172.22.0.1:8888/metrics - the collector's own metrics
- http://172.22.0.1:8889/metrics - the telemetry metrics aggregated from DPUs and IPUs
- Open http://172.22.0.1:9091 to explore Prometheus UI
- or query the API, for example (URLs are quoted so the shell does not treat ? as a glob character):
curl --fail "http://172.22.0.1:9091/api/v1/query?query=mem_free" | grep mem_free
curl --fail "http://172.22.0.1:9091/api/v1/query?query=cpu_usage_user" | grep cpu_usage_user
curl --fail "http://172.22.0.1:9091/api/v1/query?query=spdk_num_read_ops" | grep spdk_num_read_ops
curl --fail "http://172.22.0.1:9091/api/v1/query?query=nstat_TcpActiveOpens" | grep nstat_TcpActiveOpens
curl --fail "http://172.22.0.1:9091/api/v1/query?query=redfish_power_powercontrol_interval_in_min" | grep redfish_power_powercontrol_interval_in_min
- Open http://172.22.0.1:3000 to explore Grafana UI
- or query the API, for example:
curl -s http://172.22.0.1:3000/api/datasources | jq .
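The Prometheus API checks above can also be scripted, which helps when verifying many metrics at once. A minimal Python sketch, assuming the same 172.22.0.1:9091 address; query_url, series_count and missing_metrics are hypothetical helper names, and a metric counts as present when the instant query returns at least one series (the same condition under which the grep in the curl examples matches).

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://172.22.0.1:9091"
METRICS = [
    "mem_free",
    "cpu_usage_user",
    "spdk_num_read_ops",
    "nstat_TcpActiveOpens",
    "redfish_power_powercontrol_interval_in_min",
]


def query_url(metric, base=PROMETHEUS):
    """Build the instant-query URL used in the curl examples."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": metric})


def series_count(payload):
    """Number of series in a Prometheus query response (0 on error)."""
    if payload.get("status") != "success":
        return 0
    return len(payload["data"]["result"])


def missing_metrics(metrics=METRICS, base=PROMETHEUS):
    """Query each metric and return the names that have no series yet."""
    missing = []
    for metric in metrics:
        with urllib.request.urlopen(query_url(metric, base), timeout=5) as resp:
            if series_count(json.load(resp)) == 0:
                missing.append(metric)
    return missing
```

An empty list from missing_metrics() means every metric in METRICS has reached Prometheus through the gateway collector.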