Releases: leptonai/gpud
gpud-v0.3.9
GPUd release notes (2025-01-10T14:50:02Z)
Welcome to this new release!
What's Changed
- fix(ci): bump up linux header deps by @gyuho in #292
- fix(nvml): handle "not supported" error to not fail-fast for NVML get calls by @gyuho in #291
Full Changelog: v0.3.8...v0.3.9
gpud-v0.3.8
GPUd release notes (2025-01-08T13:27:48Z)
Welcome to this new release!
What's Changed
- fix(pkg/process): gracefully handle read operations on aborted process, Read to return error if not started by @gyuho in #276
- fix(package-controller): invoke process start before process read by @cardyok in #277
- fix(os): fetch system manufacturer once for linux by @gyuho in #274
- fix(disk/lsblk): support older lsblk without JSON mode, using --pairs by @gyuho in #278
- feat(nvml): include xid events JSON, dmesg xid/sxid to include device UUID field, fix flaky tests, clean up lsblk logs by @gyuho in #279
- feat(fuse): track connections with /metrics (for waiting/congested FUSE connection, per fuse device), lower hw-slowdown event level from warning to info by @gyuho in #268
- fix(systemd): set shorter context timeout for dbus calls by @gyuho in #280
- fix(pkg/disk): skip usage table output render if unmounted by @gyuho in #283
- fix(dmesg): "journalctl" as fallback, when older dmesg does not support "--since" flag (<2.37) by @sunhailin-Leo in #282
- feat(cpu/dmesg): add regex to catch hung tasks, soft lockup by @gyuho in #285
- nit(nvidia/xid-sxid-state): make purge tests less flaky by @gyuho in #286
- feat(go module): upgrade dependencies fsnotify, grpc, k8s*, prom by @gyuho in #289
- feat(nvidia/peermem): explicitly skip "invalid context" errors by @gyuho in #288
- feat(cpu,memory): return hung task, soft lockup, oom from dmesg via /events, fix log item error type to "*string" by @gyuho in #287
- feat(state): separate read-only sqlite instance for better concurrency by @gyuho in #281
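
For readers unfamiliar with the `--pairs` fallback mentioned in #278: older `lsblk` builds without JSON mode can still emit one `KEY="value"` line per device. The following is a minimal sketch of parsing that format, not GPUd's actual parser. The function name `parse_lsblk_pairs` and the sample columns are assumptions chosen for illustration.

```python
import shlex


def parse_lsblk_pairs(output: str) -> list[dict[str, str]]:
    """Parse `lsblk --pairs` style output into one dict per device line.

    Hypothetical helper for illustration; not taken from the GPUd source.
    """
    devices = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        device = {}
        # shlex honors the double quotes, so values containing spaces
        # (and empty values such as MOUNTPOINT="") survive intact.
        for token in shlex.split(line):
            key, _, value = token.partition("=")
            device[key] = value
        devices.append(device)
    return devices


if __name__ == "__main__":
    sample = (
        'NAME="sda" TYPE="disk" SIZE="1.8T" MOUNTPOINT=""\n'
        'NAME="sda1" TYPE="part" SIZE="512M" MOUNTPOINT="/boot/efi"\n'
    )
    for row in parse_lsblk_pairs(sample):
        print(row["NAME"], row["TYPE"], repr(row["MOUNTPOINT"]))
```

The key/value format needs no JSON support from `lsblk`, which is why it suits older util-linux releases.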
New Contributors
- @sunhailin-Leo made their first contribution in #282
Full Changelog: v0.3.7...v0.3.8
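
As an illustration of the hung-task and soft-lockup matching added in #285, the sketch below classifies kernel log lines with two regular expressions. The patterns are assumptions modeled on common Linux kernel messages and are not the expressions GPUd ships. The labels `hung_task` and `soft_lockup` are likewise hypothetical.

```python
import re
from typing import Optional

# Hypothetical patterns, modeled on typical kernel log lines.
HUNG_TASK = re.compile(r"task \S+:\d+ blocked for more than \d+ seconds")
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#\d+ stuck for \d+s")


def classify(line: str) -> Optional[str]:
    """Return a label for a matching dmesg line, or None if nothing matches."""
    if HUNG_TASK.search(line):
        return "hung_task"
    if SOFT_LOCKUP.search(line):
        return "soft_lockup"
    return None


if __name__ == "__main__":
    lines = [
        "[ 1234.567890] INFO: task kworker/u16:3:4021 blocked for more than 120 seconds.",
        "[ 2345.678901] watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [python3:9876]",
        "[ 3456.789012] usb 1-1: new high-speed USB device number 2 using xhci_hcd",
    ]
    for line in lines:
        print(classify(line), "<-", line)
```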
gpud-v0.3.7
GPUd release notes (2024-12-27T00:56:51Z)
Welcome to this new release!
What's Changed
- fix(disk): exit on lsblk success during retries by @gyuho in #263
- feat(nvidia/query): bump up nvidia-smi cmd timeout, better debugging info by @gyuho in #261
- feat(pkg/process): label process for better debugging info by @gyuho in #264
- fix(query/log/tail): fix time parser for initial lines, use correct time for fabric manager /events by @gyuho in #260
- feat(dmesg): log watch command only up to 1 hour by @gyuho in #266
- feat(nvidia/xid, sxid): support query by event type by @gyuho in #267
- feat(nvidia): use last successful data in shared poller, shared nvidia-smi/nvml poller to still return data if one operation fails by @gyuho in #265
- fix(containerd): skip podsandboxstatus failure by @cardyok in #269
- fix(containerd/pod): add missing import line by @gyuho in #270
- fix(nvidia/query): poller to return error on nvidia-smi failure by @gyuho in #271
Full Changelog: v0.3.6...v0.3.7
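
The "last successful data" behavior described in #265 can be pictured with a small cache around a fetch function. This is a sketch under assumptions, not the GPUd poller: the class name `LastSuccessCache` and its attributes are hypothetical. It keeps the most recent error alongside the cached data, so a caller can still surface the failure.

```python
from typing import Any, Callable, Optional


class LastSuccessCache:
    """Remember the last successful result of a fetch function (hypothetical)."""

    def __init__(self, fetch: Callable[[], Any]) -> None:
        self._fetch = fetch
        self._last: Optional[Any] = None
        self.last_error: Optional[Exception] = None

    def poll(self) -> Optional[Any]:
        """Run the fetch; on failure, record the error and return older data."""
        try:
            self._last = self._fetch()
            self.last_error = None
        except Exception as err:
            # Keep the stale data, but expose the failure to the caller.
            self.last_error = err
        return self._last


if __name__ == "__main__":
    outcomes = iter([{"gpus": 8}, RuntimeError("query timed out"), {"gpus": 7}])

    def fake_fetch() -> Any:
        item = next(outcomes)
        if isinstance(item, Exception):
            raise item
        return item

    cache = LastSuccessCache(fake_fetch)
    for _ in range(3):
        print(cache.poll(), cache.last_error)
```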
gpud-v0.3.6
GPUd release notes (2024-12-18T09:14:10Z)
Welcome to this new release!
What's Changed
- feat(poll): set default "get" operation timeout, higher timeout for latency checks by @gyuho in #247
- feat(process): read stderr in case of command failures, improve disk get error handling by @gyuho in #248
- feat(disk): use "findmnt --target" to find filesystem usage by @gyuho in #249
- feat(lsblk): add more test cases, clarify parse error by @gyuho in #251
- nit(disk): rename state key to disk_ext_partition by @gyuho in #254
- fix(controller): only read stdout for run command by @gyuho in #253
- fix(nvidia): report installed when nvml return unknown error on device by @cardyok in #255
- fix(os): run machine/boot id get calls only for linux, gpud run exit 1 on non-linux platform by @gyuho in #250
- fix(containerd): use consistent state name by @cardyok in #258
- nit(containerd/pod): use id package for state name by @gyuho in #259
- feat(query): support getErrHandler func, log/ignore disk component error by @gyuho in #257
- fix(disk): add retries for lsblk by @gyuho in #256
Full Changelog: v0.3.5...v0.3.6
gpud-v0.3.5
GPUd release notes (2024-12-13T02:34:12Z)
Welcome to this new release!
What's Changed
- nit(gpud): fix flag description --expected-port-states-nvidia-infiniband by @gyuho in #231
- feat(components/os): detect virt environment, system manufacturer by @gyuho in #235
- nit(diagnose): print matched dmesg line in scan command by @gyuho in #237
- feat(components): add missing event type in /events by @gyuho in #233
- nit(dmesg): add more regex OOM matcher test cases with timestamps by @gyuho in #238
- feat(components/dmesg): simplify /events fields by @gyuho in #234
- nit(containerd/pod): rename state keys by @gyuho in #239
- feat(components/disk): track total mounted ext partitions, block "disk" devices, "scan --diskcheck" by @gyuho in #232
- feat(go.mod): upgrade go sqlite3 by @gyuho in #241
- feat(components/pci): check PCI access control services for baremetal systems by @gyuho in #236
- feat(components/os): use os machine id for uuid as fallback, support reboot events using boot id by @gyuho in #240
- feat(components/dmesg): catch EDAC correctable errors in dmesg by @gyuho in #242
- chore(deps): bump golang.org/x/crypto from 0.25.0 to 0.31.0 by @dependabot in #244
- fix(process/virt): handle systemd-detect-virt exit code 1, simplify process calls by @gyuho in #243
- feat(pci): move /states to /events for acs srv-valid checks by @gyuho in #245
- feat(components): define event type enum, fix os component context setup, adjust hw slowdown event type, simplify PCI reason message by @gyuho in #246
New Contributors
- @dependabot made their first contribution in #244
Full Changelog: v0.3.4...v0.3.5
gpud-v0.3.4
GPUd release notes (2024-12-05T00:55:41Z)
Welcome to this new release!
What's Changed
- feat(nvidia/hw-slowdown): include GPU UUID in /events, persist smi for /events, dedup hw slowdown events by data source and nearest minutes, do not return hw slowdown clock events in /states, fix nvidia query get function context timeout by @gyuho in #229
- feat(nvidia/infiniband): better ib ports/rate checking based on port physical/state by @gyuho in #230
Full Changelog: v0.3.3...v0.3.4
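
To make "dedup hw slowdown events by data source and nearest minutes" (#229) concrete, here is a minimal sketch of one way such deduplication could work. The event fields (`source`, `gpu_uuid`, `time`) and the choice to include the GPU UUID in the key are assumptions for illustration. The actual GPUd implementation may differ.

```python
from datetime import datetime, timedelta, timezone


def nearest_minute(ts: datetime) -> datetime:
    """Round a timestamp to the nearest minute (30 seconds rounds up)."""
    floored = ts.replace(second=0, microsecond=0)
    if ts - floored >= timedelta(seconds=30):
        floored += timedelta(minutes=1)
    return floored


def dedup_events(events: list[dict]) -> list[dict]:
    """Keep the first event per (source, gpu_uuid, rounded minute) key."""
    seen = set()
    kept = []
    for event in events:
        key = (event["source"], event["gpu_uuid"], nearest_minute(event["time"]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(event)
    return kept


if __name__ == "__main__":
    base = datetime(2024, 12, 5, 0, 0, tzinfo=timezone.utc)
    events = [
        {"source": "nvml", "gpu_uuid": "GPU-0", "time": base + timedelta(seconds=10)},
        {"source": "nvml", "gpu_uuid": "GPU-0", "time": base + timedelta(seconds=25)},
        {"source": "nvidia-smi", "gpu_uuid": "GPU-0", "time": base + timedelta(seconds=25)},
        {"source": "nvml", "gpu_uuid": "GPU-0", "time": base + timedelta(seconds=40)},
    ]
    for event in dedup_events(events):
        print(event["source"], event["time"].isoformat())
```

In this sketch the second `nvml` event is dropped because it rounds to the same minute as the first, while the `nvidia-smi` event survives because its data source differs.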
gpud-v0.3.3
GPUd release notes (2024-12-03T14:25:05Z)
Welcome to this new release!
What's Changed
Full Changelog: v0.3.2...v0.3.3
gpud-v0.3.2
GPUd release notes (2024-12-03T11:27:28Z)
Welcome to this new release!
What's Changed
- fix(nvidia/hw-slowdown): rename from "clock" to only expose hardware slowdown issues, convert to events by @gyuho in #225
- feat(server): send components in gossip by @cardyok in #226
- feat(nvidia): set components/events timestamp in UTC explicitly by @gyuho in #227
Full Changelog: v0.3.1...v0.3.2
gpud-v0.3.1
GPUd release notes (2024-12-02T11:43:12Z)
Welcome to this new release!
What's Changed
- fix(nvidia/clock): use nvml clock events, fall back to nvidia-smi parsing by @gyuho in #220
- fix(nvidia/query): only evaluate memory error management capabilities when product name found, add missing GPU ID in nvidia-smi parsing remapped rows by @gyuho in #221
- fix(nvidia): derive product name using NVML results first by @gyuho in #222
- feat(session): make context local to each session for flexibility by @cardyok in #223
- feat(fd): monitor VFS file-max limit with allocated file handles on Linux by @gyuho in #224
Full Changelog: v0.3.0...v0.3.1
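
For context on the file-handle monitoring in #224: on Linux, `/proc/sys/fs/file-nr` reports three numbers (allocated handles, allocated-but-unused handles, and the system-wide maximum). The sketch below parses that format and computes a usage percentage. The helper names are hypothetical, and this is not GPUd's code.

```python
def parse_file_nr(content: str) -> tuple[int, int, int]:
    """Parse /proc/sys/fs/file-nr content into (allocated, unused, maximum)."""
    allocated, unused, maximum = (int(field) for field in content.split())
    return allocated, unused, maximum


def usage_percent(allocated: int, maximum: int) -> float:
    """Share of the file-max limit taken by allocated file handles."""
    return 100.0 * allocated / maximum if maximum else 0.0


if __name__ == "__main__":
    # Example content; on a real Linux host, read /proc/sys/fs/file-nr instead.
    allocated, unused, maximum = parse_file_nr("9344\t0\t1000000\n")
    print(f"allocated={allocated} max={maximum} used={usage_percent(allocated, maximum):.2f}%")
```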
gpud-v0.3.0
GPUd release notes (2024-11-30T13:06:09Z)
Welcome to this new release!
What's Changed
- fix(nvidia/nvml): correct boolean checks on whether clock events supported by @gyuho in #215
- fix(session): close reader channel on fast return by @cardyok in #214
- fix(cmd/gpud): handle "run --expected-port-states-nvidia-infiniband" flag by @gyuho in #212
- fix(nvidia/remapped-rows): surface product name as reason regardless of its healthy-ness by @gyuho in #216
- fix(nvml/clock_events): enable clock events component when a single GPU device supports it by @gyuho in #218
- feat(components/memory): track current jit alloc buffer size, vm alloc status by @gyuho in #213
- fix(client): adding get states decode call, status command to check local gpud "/states", add sub-command aliases by @gyuho in #211
- feat(nvidia/xid-sxid): increase xid/sxid table retention period to 3-hour by @gyuho in #217
- fix(nvidia): remove error count "8" threshold for row remapping failures to qualify for RMA by @gyuho in #219
Full Changelog: v0.2.5...v0.3.0