Include CI for loadbalancer and hyperqueue testing #67

Draft
wants to merge 6 commits into
base: main
42 changes: 42 additions & 0 deletions .github/workflows/hpc-load-balancer.yml
@@ -0,0 +1,42 @@
+name: hpc-load-balancer
+
+on:
+  push:
+  pull_request:
+    branches:
+      - 'main'
+
+
+jobs:
+
+  build-and-setup:
+    runs-on: ubuntu-latest
+    container: ubuntu:latest
+
+    steps:
+      - name: Checkout
+        uses: actions/checkout@v3
+
+      - name: Dependencies
+        run: |
+          apt update -qq && DEBIAN_FRONTEND="noninteractive" apt install -yq g++ make wget curl tar
+
+      - name: Build load balancer binary
+        run: |
+          cd hpc && make build-load-balancer
+
+      - name: Download and setup hq binary
+        run: |
+          url=$(curl -sSL https://api.github.com/repos/It4innovations/hyperqueue/releases/latest | \
+            grep -o "\"browser_download_url\": \"https://[^\"]*-linux-x64.tar.gz\"" | \
+            cut -d '"' -f 4)
+          if [ -z "$url" ]; then
+            echo "Error: URL not found"
+            exit 1
+          fi
+
+          filename="hq-linux-x64.tar.gz"
+          wget -q $url -O $filename
+          tar xzf $filename
+          ./hq --version

15 changes: 13 additions & 2 deletions hpc/LoadBalancer.cpp
@@ -25,15 +25,19 @@ void clear_url(std::string directory) {
 }

 void launch_hq_with_alloc_queue() {
-    std::system("hq server stop &> /dev/null");
+    std::system("./hq server stop &> /dev/null");

-    std::system("hq server start &");
+    std::system("./hq server start &");
     sleep(1); // Workaround: give the HQ server enough time to start.

     // Create HQ allocation queue
     std::system("hq_scripts/allocation_queue.sh");
 }

+bool file_exists(const std::string& path) {
Member:
Kind of redundant?

Collaborator Author:
Ah, true. Habit: I thought I would use it more.

+    return std::filesystem::exists(path);
+}
+
 const std::vector<std::string> get_model_names() {
     // Don't start a client, always use the default job submission script.
     HyperQueueJob hq_job("", false, true);
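Note on the `sleep(1)` workaround in `launch_hq_with_alloc_queue()`: a fixed delay can be too short on a slow CI runner and needlessly long elsewhere. One possible alternative is to poll for readiness with a timeout. The sketch below is not part of this PR; `run_quietly` and `wait_until` are hypothetical helper names. It also uses the POSIX form `> /dev/null 2>&1`, because `std::system` typically runs `/bin/sh`, and `&>` is a bash extension that a POSIX shell such as dash parses as "run in background, then redirect".

```cpp
#include <chrono>
#include <cstdlib>
#include <functional>
#include <string>
#include <thread>

// Hypothetical helper: run a shell command with stdout and stderr discarded.
// Uses POSIX redirection so it behaves the same when /bin/sh is not bash.
int run_quietly(const std::string& command) {
    return std::system((command + " > /dev/null 2>&1").c_str());
}

// Hypothetical helper: poll `ready` until it returns true or `timeout` elapses.
// Returns true if the condition was met, false on timeout.
bool wait_until(const std::function<bool()>& ready,
                std::chrono::milliseconds timeout,
                std::chrono::milliseconds interval = std::chrono::milliseconds(100)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (ready()) return true;
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(interval);
    }
}
```

A call such as `wait_until([] { return run_quietly("./hq server info") == 0; }, std::chrono::seconds(10))` could then replace the fixed sleep, assuming `hq server info` exits with a non-zero status while no server is reachable; that behaviour should be checked against the HyperQueue version in use.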
@@ -49,6 +53,13 @@ int main(int argc, char *argv[])
     create_directory_if_not_existing("sub-jobs");
     clear_url("urls");

+    // Check if the hq binary exists
+    std::string hq_binary_path = "./hq";
+    if (!file_exists(hq_binary_path)) {
Member:
Good idea! Could do similar checks for job.sh and the alloc script as well!

+        std::cerr << "Error: hq binary does not exist at " << hq_binary_path << std::endl;
+        return 1;
+    }
+
     launch_hq_with_alloc_queue();

     // Read environment variables for configuration
6 changes: 3 additions & 3 deletions hpc/LoadBalancer.hpp
@@ -90,7 +90,7 @@ class HyperQueueJob
     ~HyperQueueJob()
     {
         // Cancel the SLURM job
-        std::system(("hq job cancel " + job_id).c_str());
+        std::system(("./hq job cancel " + job_id).c_str());

         // Delete the url text file
         std::system(("rm ./urls/url-" + job_id + ".txt").c_str());
@@ -113,7 +113,7 @@ class HyperQueueJob
         const std::filesystem::path submission_script_generic("job.sh");
         const std::filesystem::path submission_script_model_specific("job_" + model_name + ".sh");

-        std::string hq_command = "hq submit --output-mode=quiet ";
+        std::string hq_command = "./hq submit --output-mode=quiet ";
         hq_command += "--priority=" + std::to_string(job_count) + " ";
         if (std::filesystem::exists(submission_script_dir / submission_script_model_specific) && !force_default_submission_script)
         {
@@ -154,7 +154,7 @@ class HyperQueueJob
     // state = ["WAITING", "RUNNING", "FINISHED", "CANCELED"]
     bool waitForHQJobState(const std::string &job_id, const std::string &state)
     {
-        const std::string command = "hq job info " + job_id + " | grep State | awk '{print $4}'";
+        const std::string command = "./hq job info " + job_id + " | grep State | awk '{print $4}'";
         // std::cout << "Checking runtime: " << command << std::endl;
         std::string job_status;

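The `grep State | awk '{print $4}'` pipeline in `waitForHQJobState` picks the fourth whitespace-separated field of the line containing `State`. The same extraction can be sketched in C++, which avoids depending on `grep` and `awk` being installed. `extract_job_state` is a hypothetical name, and the table layout used in the comment (`| State | FINISHED |`) is an assumption inferred from the `$4` index, not taken from HyperQueue documentation. Unlike the pipeline, this version only looks at the first matching line.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper mimicking `grep State | awk '{print $4}'`:
// returns the 4th whitespace-separated field of the first line that
// contains "State", or an empty string if there is no such field.
// Assumed input line shape: "| State | FINISHED |".
std::string extract_job_state(const std::string& job_info_output) {
    std::istringstream lines(job_info_output);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.find("State") == std::string::npos) continue;
        std::istringstream fields(line);
        std::string field;
        for (int i = 0; i < 4; ++i) {
            if (!(fields >> field)) return "";
        }
        return field;
    }
    return "";
}
```

The raw output of `./hq job info <id>` would still need to be captured first, for instance with `popen`, before being passed to this function.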
8 changes: 6 additions & 2 deletions hpc/hq_scripts/allocation_queue.sh
@@ -1,10 +1,14 @@
 #! /bin/bash

 # Note: For runs on systems without SLURM, replace the slurm allocator by
-# hq worker start &
+# ./hq worker start &

+if [[ ! -f "./hq" ]]; then
+    echo "Error: hq binary does not exist at ./hq"
+    exit 1
+fi

-hq alloc add slurm --time-limit 10m \
+./hq alloc add slurm --time-limit 10m \
     --idle-timeout 3m \
     --backlog 1 \
     --workers-per-alloc 1 \
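This PR repeats the hard-coded `./hq` prefix in `LoadBalancer.cpp`, `LoadBalancer.hpp`, and `allocation_queue.sh`. On the C++ side, one way to keep the path in a single place is a small helper that builds every command line. The sketch below is an illustration only: `hq_binary`, `hq_command`, and the `HQ_BINARY` environment variable are hypothetical and do not exist in the repository.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: resolve the hq executable. Uses the (assumed)
// HQ_BINARY environment variable if it is set and non-empty, otherwise
// falls back to the "./hq" path used throughout this PR.
std::string hq_binary() {
    const char* from_env = std::getenv("HQ_BINARY");
    if (from_env != nullptr && *from_env != '\0') {
        return std::string(from_env);
    }
    return "./hq";
}

// Hypothetical helper: build a full command line such as "./hq job cancel 42".
std::string hq_command(const std::string& arguments) {
    return hq_binary() + " " + arguments;
}
```

With this in place, a call such as `std::system(hq_command("job cancel " + job_id).c_str())` would behave like the line in the destructor above by default, while a system-wide `hq` could be selected without recompiling.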