Skip to content

tool to help troubleshoot hpc cluster (slurm for now). for lazy sys admins :)

License

Notifications You must be signed in to change notification settings

nicolaschan/bofhbot

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

master branch: status1 // nodelist branch: status2

bofhbot output screenshot

BofhBot

An enhanced sinfo -R for lazy sys admins--aka bofh :)

BOFH is known to be a lazy slacker. Laziness is a virtue. which is why a bot is born to do the job for BOFH. In this case, look at sinfo -R and help repair slurm nodes that are sick. At first it would just help with some of the initial troubleshooting manial work, such as see if node:

  • is pingable
  • can be ssh to
  • nhc report status
  • some manual mount and checks not corrently configured in nhc cuz of histerical reason
  • ipmi reachability (eg can ipmi power status/cycle) revive the node
  • slurm pid
  • node status (presumably no running job on the node)
  • recommended action
  • confidence of recommendation

Example output

sinfo -R -S %E --format="%9u %19H %6t %N %E"                                    # + extra columns


USER      TIMESTAMP           STATE  NODELIST REASON                            # fixid ping  ssh nhc slurm-pid ipmi-powerstatus recommendation  confidence
Node unexpectedly re root      2018-07-31T20:06:16 n0181.savio2                   hash1 ok    ok  ok  nnnn      on               scontrol... state=resume 80%

### this last one would benefit from historical record of how many times node upped to avoid getting stuck in a loop.

root      2018-08-11T10:22:31 down*  n0152.savio1 Not responding                # hashB no    no  na  na        on               ipmi cycle 99%
root      2018-07-24T09:38:02 down*  n0165.savio1 Not responding                # hashC ok    cant login na     not responding   wwsh ipmi cycle  80%

root      2018-05-21T10:20:52 down*  n0030.savio1 HW. Tin. need open case.   wont power on, even after reseating blade
root      2018-06-26T10:08:33 down*  n0107.savio2 HW.ReseatedManyTime_WontPowerOn.Tin
root      2018-05-21T10:48:18 down*  n0000.savio2 HW: Tin. check console/bios boot settings.  when boot again check hw health/wonkiness
root      2018-05-21T11:20:32 drain* n0096.savio2 HW: need HT off. Tin
root      2018-06-26T10:50:56 down*  n0114.savio2 HW:54gRAMonly
root      2018-07-30T15:57:35 drain* n0133.savio1 HW:replace sda,io err-Sn
root      2018-07-30T12:26:45 drain* n0108.savio2 HW:sda missing after reseat-Sn
root      2018-06-26T10:44:57 down*  n0113.savio2 HW:shutdownAgainWontPowerOn_openCase
root      2018-05-11T14:44:33 down   n0166.savio2 INC0661212_nick
root      2018-07-26T15:53:16 drain  n0124.savio2 NHC: check_fs_mount:  /global/scratch not mounted
root      2018-07-26T15:53:16 drain  n0117.savio2 NHC: check_hw_physmem:  Actual RAM size (57459408 kB) less than minimum allowed (67108864 kB).
root      2018-04-06T13:04:50 down*  n0234.savio2 NoBootDev_Tin
root      2018-03-15T18:37:30 drain  n0182.savio2 NoIB_Tin


root      2018-08-06T07:47:26 down*  n0035.savio2,
n0087.savio2,
n0091.savio2 Not responding

Known issues

  • std out redirect may truncate output. Use | tee filename.out instead. Affects v 0.2 with parallelized ssh.
./bofhbot.py > result.out      # this will get last few lines truncated
./bofhbot.py | tee result.out  # this works correctly.

TL;DR

Future output possibilities

append several columns to it:

ping  ssh   nhc   slurm-pid  ipmi-powerStatus   recommed-action
3ms   5min  na    na         on                 wwsh ipmi cycle
no    no    na    na         time-out           check net cable

Instead of report ssh up or down, just have it run uptime command:

uptime
1day
ssh-DOWN

output as quickly as node responses are gathered. probably avoid ncurses/top interface, so easier to capture output to file.

each row add an index number? short hash? so that future can do --fixit option to such record?

this for now don't have a database of recommended actions. but still allow for cron every few hours to see its recommended action.

add an action confidence% :)

colored output maybe nice. future default to --color best if output to console and TERM support color, automatically do it. But if redirecting or piping, then color is off automagically. If too much work, start with just supporting --color to explicitly trigger coloring of output (think of ls and grep)

Motivation

a bot that fixes slurm problem reported by sinfo -R (and possibly other source of complains).

some problems should be fairly non-harmful to fix.

  • resume node.
  • login to node, do a few sanity check.
    • nhc
    • expected mounts (esp those not checked by nhc)
    • df /global/software
    • ibstat
    • df /global/scratch
  • if pass check, can resume.

Infrastructure:

  • history check for each node.
    • if repeatedly needing same fix, say within last 24 hours. then get sys admin help
    • this also allow for query history of repair of each node
      • bofh node=n0000.testbed history...
      • something after wwsh :)
  • state machine of each node's health
    • to help determine if reboot, etc action would impact user.
    • similar to nhc?
      • checked ib
      • checked eth
    • any job running on the node? (squeue -p PART | grep node; more direct way to query?)
  • history of (recommended/emailed) fixes:
    • email report at first with recomended action for sys admin
    • exact cmd for cut-n-paste (prefixed with sudo when needed)
    • email cannot nag. send out only once
      • setting to remember 4 days? 7 days?
        • do not nag about a problem if reported before
        • reminder of old problems if not fixed after X remembered days?
      • command to clear out all alerts
      • command to reduce "remembered" days, ie, if change remember from 4 to 3 days, would trim db records for anything older than 3 days. (is this really needed?)
      • commands to re-list all pending fix recommendations on demand
        • for last 1, 2, 3 days.

EXAMPLE cmd

bofhbot -R
a better output than sinfo -R add basic troublshoot info as extra columns as defined above
bofhbot --list
show more extensive problem. eg,
NHC: check_fs_mount
becomes
/tmp not mounted. sda missing.

but nhc may have the info? or need more extensive config than nhc?

no, can look at fstab, just do things sys admin would do...

show recommended actions from its history db

bofhbot -i sinfo-RSE.txt
Use input file containing list of problem nodes. This allow for offline development without needing a cluster Also allow new user to have a 'safe mode' to run bofhbot to gain familiarity to its functioning without worry of it wrecking havoc on the production HPC.

Low hanging fruits

example of sinfo -R that are easy to fix:

Node unexpectedly re slurm 2017-11-21T09:23:16 n0012.etna0,n0016.etna0,n0017.etna0
scontrol update node=... state=resume
batch job complete f root 2018-07-22T15:10:04 n0032.savio2
scontrol update node=n0032.savio2 state=resume

Not responding root 2018-07-24T10:48:02 n0283.savio2

  • if not pingable (param to set CanRelyOnPing=True)
  • not ssh-able
  • can then ipmi power cycle the node
  • NodeBootWaitTime=180 (seconds)
  • beyond this, email sys admin and ask for manual intervention.

API vs CLI

  • Programmatically, getting input data from API provided by Slurm or DRMAA should provide for a more stable input interface.
  • But I want to minimize on requirement needed to run tool, thus handling output from cli tool like "sinfo -RSE" is easier on the user
  • Would be nice if sinfo or qhost can output in machne format like json, xml or even comma- or pipe- delimited.
  • since using cli output, may need a way to specify what input tool and version is using. may need to handle --input-format="slurm_17.1" and the like in case format changes. things to worry when that problem arises...

Division of labors

  • cli parser
  • case statement of all sinfo -R message and dispatch what module to call
  • function for each kind of check:
    • nodePingable()
    • nodeSshable()
    • nodeIpmiable()
  • node status health state machine
    • maybe needed before can create a recommended fix action
    • confidence level. should actually start out small, not very confident :)
  • history of recommended actions db (as sqlite db file? in a high level $HOME/.bofhbot/ dir??)

Branches

Don't know... I suppose should have a dev branch that is less stable than master...

License

BSD 3-clause, as indicated in the github license choice for this project.

other names

  • slurmbot
  • sinfobot
  • hpcbot
  • bofhbot - yeah, i like this!

.rst reference

apparently boxing title with ===== above and below a line could throw off validator. was that a .md feature? but it had worked on short rst... validate rst as:

pip install rstvalidator
python -m rstvalidator README.rst

About

tool to help troubleshoot hpc cluster (slurm for now). for lazy sys admins :)

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • Python 89.2%
  • HTML 8.9%
  • Jupyter Notebook 1.5%
  • Other 0.4%