Releases: leptonai/gpud

gpud-v0.3.9

10 Jan 14:51
b794199

GPUd release notes (2025-01-10T14:50:02Z)

Welcome to this new release!

What's Changed

  • fix(ci): bump up linux header deps by @gyuho in #292
  • fix(nvml): handle "not supported" error to not fail-fast for NVML get calls by @gyuho in #291

Full Changelog: v0.3.8...v0.3.9

gpud-v0.3.8

08 Jan 13:26
cdebb0a

GPUd release notes (2025-01-08T13:27:48Z)

Welcome to this new release!

What's Changed

  • fix(pkg/process): gracefully handle read operations on aborted process, Read to return error if not started by @gyuho in #276
  • fix(package-controller): invoke process start before process read by @cardyok in #277
  • fix(os): fetch system manufacturer once for linux by @gyuho in #274
  • fix(disk/lsblk): support older lsblk without JSON mode, using --pairs by @gyuho in #278
  • feat(nvml): include xid events JSON, dmesg xid/sxid to include device UUID field, fix flaky tests, clean up lsblk logs by @gyuho in #279
  • feat(fuse): track connections with /metrics (for waiting/congested FUSE connection, per fuse device), lower hw-slowdown event level from warning to info by @gyuho in #268
  • fix(systemd): set shorter context timeout for dbus calls by @gyuho in #280
  • fix(pkg/disk): skip usage table output render if unmounted by @gyuho in #283
  • fix(dmesg): "journalctl" as fallback, when older dmesg does not support "--since" flag (<2.37) by @sunhailin-Leo in #282
  • feat(cpu/dmesg): add regex to catch hung tasks, soft lockup by @gyuho in #285
  • nit(nvidia/xid-sxid-state): make purge tests less flaky by @gyuho in #286
  • feat(go module): upgrade dependencies fsnotify, grpc, k8s*, prom by @gyuho in #289
  • feat(nvidia/peermem): explicitly skip "invalid context" errors by @gyuho in #288
  • feat(cpu,memory): return hung task, soft lockup, oom from dmesg via /events, fix log item error type to "*string" by @gyuho in #287
  • feat(state): separate read-only sqlite instance for better concurrency by @gyuho in #281
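
The hung-task and soft-lockup entries above describe regex matching over kernel log lines. A minimal sketch of that idea follows, assuming the usual Linux kernel wording for these messages. The patterns, the function name classifyDmesgLine, and the event names are hypothetical and are not taken from GPUd's source.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for illustration; GPUd's actual matchers may differ.
var (
	// Matches e.g. "INFO: task kworker/0:1:1234 blocked for more than 120 seconds."
	hungTaskRe = regexp.MustCompile(`task .+:\d+ blocked for more than \d+ seconds`)
	// Matches e.g. "watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [python:4567]"
	softLockupRe = regexp.MustCompile(`soft lockup - CPU#\d+ stuck for \d+s`)
)

// classifyDmesgLine returns a hypothetical event name for a kernel log line,
// or an empty string when the line matches neither pattern.
func classifyDmesgLine(line string) string {
	switch {
	case hungTaskRe.MatchString(line):
		return "hung_task"
	case softLockupRe.MatchString(line):
		return "soft_lockup"
	default:
		return ""
	}
}

func main() {
	lines := []string{
		"[ 1234.5] INFO: task kworker/0:1:1234 blocked for more than 120 seconds.",
		"watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [python:4567]",
		"usb 1-1: new high-speed USB device number 2 using xhci_hcd",
	}
	for _, l := range lines {
		fmt.Printf("%q -> %q\n", l, classifyDmesgLine(l))
	}
}
```

Matching on a substring rather than the full line keeps the sketch tolerant of the timestamp prefixes that dmesg and journalctl add.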

Full Changelog: v0.3.7...v0.3.8

gpud-v0.3.7

27 Dec 00:56
fa9c0e5

GPUd release notes (2024-12-27T00:56:51Z)

Welcome to this new release!

What's Changed

  • fix(disk): exit on lsblk success during retries by @gyuho in #263
  • feat(nvidia/query): bump up nvidia-smi cmd timeout, better debugging info by @gyuho in #261
  • feat(pkg/process): label process for better debugging info by @gyuho in #264
  • fix(query/log/tail): fix time parser for initial lines, use correct time for fabric manager /events by @gyuho in #260
  • feat(dmesg): log watch command only up to 1 hour by @gyuho in #266
  • feat(nvidia/xid, sxid): support query by event type by @gyuho in #267
  • feat(nvidia): use last successful data in shared poller, shared nvidia-smi/nvml poller to still return data if one operation fails by @gyuho in #265
  • fix(containerd): skip podsandboxstatus failure by @cardyok in #269
  • fix(containerd/pod): add missing import line by @gyuho in #270
  • fix(nvidia/query): poller to return error on nvidia-smi failure by @gyuho in #271

Full Changelog: v0.3.6...v0.3.7

gpud-v0.3.6

18 Dec 09:14
7b51852

GPUd release notes (2024-12-18T09:14:10Z)

Welcome to this new release!

What's Changed

  • feat(poll): set default "get" operation timeout, higher timeout for latency checks by @gyuho in #247
  • feat(process): read stderr in case of command failures, improve disk get error handling by @gyuho in #248
  • feat(disk): use "findmnt --target" to find filesystem usage by @gyuho in #249
  • feat(lsblk): add more test cases, clarify parse error by @gyuho in #251
  • nit(disk): rename state key to disk_ext_partition by @gyuho in #254
  • fix(controller): only read stdout for run command by @gyuho in #253
  • fix(nvidia): report installed when nvml returns an unknown error on device by @cardyok in #255
  • fix(os): run machine/boot id get calls only for linux, gpud run exit 1 on non-linux platform by @gyuho in #250
  • fix(containerd): use consistent state name by @cardyok in #258
  • nit(containerd/pod): use id package for state name by @gyuho in #259
  • feat(query): support getErrHandler func, log/ignore disk component error by @gyuho in #257
  • fix(disk): add retries for lsblk by @gyuho in #256

Full Changelog: v0.3.5...v0.3.6

gpud-v0.3.5

13 Dec 02:35
32430a7

GPUd release notes (2024-12-13T02:34:12Z)

Welcome to this new release!

What's Changed

  • nit(gpud): fix flag description --expected-port-states-nvidia-infiniband by @gyuho in #231
  • feat(components/os): detect virt environment, system manufacturer by @gyuho in #235
  • nit(diagnose): print matched dmesg line in scan command by @gyuho in #237
  • feat(components): add missing event type in /events by @gyuho in #233
  • nit(dmesg): add more regex OOM matcher test cases with timestamps by @gyuho in #238
  • feat(components/dmesg): simplify /events fields by @gyuho in #234
  • nit(containerd/pod): rename state keys by @gyuho in #239
  • feat(components/disk): track total mounted ext partitions, block "disk" devices, "scan --diskcheck" by @gyuho in #232
  • feat(go.mod): upgrade go sqlite3 by @gyuho in #241
  • feat(components/pci): check PCI access control services for baremetal systems by @gyuho in #236
  • feat(components/os): use os machine id for uuid as fallback, support reboot events using boot id by @gyuho in #240
  • feat(components/dmesg): catch EDAC correctable errors in dmesg by @gyuho in #242
  • chore(deps): bump golang.org/x/crypto from 0.25.0 to 0.31.0 by @dependabot in #244
  • fix(process/virt): handle systemd-detect-virt exit code 1, simplify process calls by @gyuho in #243
  • feat(pci): move /states to /events for acs srv-valid checks by @gyuho in #245
  • feat(components): define event type enum, fix os component context setup, adjust hw slowdown event type, simplify PCI reason message by @gyuho in #246

Full Changelog: v0.3.4...v0.3.5

gpud-v0.3.4

05 Dec 00:56
a0cb519

GPUd release notes (2024-12-05T00:55:41Z)

Welcome to this new release!

What's Changed

  • feat(nvidia/hw-slowdown): include GPU UUID in /events, persist smi for /events, dedup hw slowdown events by data source and nearest minutes, do not return hw slowdown clock events in /states, fix nvidia query get function context timeout by @gyuho in #229
  • feat(nvidia/infiniband): better ib ports/rate checking based on port physical/state by @gyuho in #230

Full Changelog: v0.3.3...v0.3.4

gpud-v0.3.3

03 Dec 14:22
75deab4

GPUd release notes (2024-12-03T14:25:05Z)

Welcome to this new release!

What's Changed

  • fix(components): separate timeout for poller get function calls by @gyuho in #228

Full Changelog: v0.3.2...v0.3.3

gpud-v0.3.2

03 Dec 11:28
e46a8f0

GPUd release notes (2024-12-03T11:27:28Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/hw-slowdown): rename from "clock" to only expose hardware slowdown issues, convert to events by @gyuho in #225
  • feat(server): send components in gossip by @cardyok in #226
  • feat(nvidia): set components/events timestamp in UTC explicitly by @gyuho in #227

Full Changelog: v0.3.1...v0.3.2

gpud-v0.3.1

02 Dec 11:38
76a8775

GPUd release notes (2024-12-02T11:43:12Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/clock): use nvml clock events, fall back to nvidia-smi parsing by @gyuho in #220
  • fix(nvidia/query): only evaluate memory error management capabilities when product name found, add missing GPU ID in nvidia-smi parsing remapped rows by @gyuho in #221
  • fix(nvidia): derive product name using NVML results first by @gyuho in #222
  • feat(session): make context local to each session for flexibility by @cardyok in #223
  • feat(fd): monitor VFS file-max limit with allocated file handles on Linux by @gyuho in #224
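
For the file-handle entry above: on Linux, /proc/sys/fs/file-nr reports three numbers, which are the allocated file handles, the allocated-but-unused handles, and the system-wide maximum (file-max). The sketch below parses that file and computes a usage percentage. The type fileNr and the function parseFileNr are hypothetical names, and whether GPUd reads this exact file is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// fileNr holds the three fields of /proc/sys/fs/file-nr (hypothetical type).
type fileNr struct {
	Allocated uint64 // allocated file handles
	Unused    uint64 // allocated but unused file handles
	Max       uint64 // system-wide limit (fs.file-max)
}

// parseFileNr parses whitespace-separated content such as "1952\t0\t100000\n".
func parseFileNr(content string) (fileNr, error) {
	fields := strings.Fields(content)
	if len(fields) != 3 {
		return fileNr{}, fmt.Errorf("expected 3 fields, got %d", len(fields))
	}
	var vals [3]uint64
	for i, f := range fields {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return fileNr{}, fmt.Errorf("field %d: %w", i, err)
		}
		vals[i] = v
	}
	return fileNr{Allocated: vals[0], Unused: vals[1], Max: vals[2]}, nil
}

// UsedPercent returns allocated handles as a percentage of the limit.
func (f fileNr) UsedPercent() float64 {
	if f.Max == 0 {
		return 0
	}
	return float64(f.Allocated) / float64(f.Max) * 100
}

func main() {
	b, err := os.ReadFile("/proc/sys/fs/file-nr")
	if err != nil {
		fmt.Println("cannot read file-nr (probably not Linux):", err)
		return
	}
	nr, err := parseFileNr(string(b))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("allocated=%d max=%d used=%.4f%%\n", nr.Allocated, nr.Max, nr.UsedPercent())
}
```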

Full Changelog: v0.3.0...v0.3.1

gpud-v0.3.0

30 Nov 13:07
6d1d318

GPUd release notes (2024-11-30T13:06:09Z)

Welcome to this new release!

What's Changed

  • fix(nvidia/nvml): correct boolean checks on whether clock events supported by @gyuho in #215
  • fix(session): close reader channel on fast return by @cardyok in #214
  • fix(cmd/gpud): handle "run --expected-port-states-nvidia-infiniband" flag by @gyuho in #212
  • fix(nvidia/remapped-rows): surface product name as reason regardless of its healthiness by @gyuho in #216
  • fix(nvml/clock_events): enable clock events component when a single GPU device supports it by @gyuho in #218
  • feat(components/memory): track current jit alloc buffer size, vm alloc status by @gyuho in #213
  • fix(client): adding get states decode call, status command to check local gpud "/states", add sub-command aliases by @gyuho in #211
  • feat(nvidia/xid-sxid): increase xid/sxid table retention period to 3-hour by @gyuho in #217
  • fix(nvidia): remove error count "8" threshold for row remapping failures to qualify for RMA by @gyuho in #219

Full Changelog: v0.2.5...v0.3.0