HPC: Fix port checking #86

Merged (1 commit) on Oct 21, 2024

@Schlevidon (Collaborator) commented Oct 16, 2024

Summary

Fixed the issue where a job would sometimes fail to start the model server due to the port being occupied.

Details

  • Previously, the lsof command was used to check whether a port was free. However, without root permissions, lsof only shows open connections belonging to the user who ran the command, not those of other users on the HPC cluster, so a port occupied by another user could appear free.
  • The fix replaces lsof with a simple C++ program that attempts to bind a socket to an address on the given port. The bind fails if the port is already occupied, regardless of which user holds it.
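
The bind-based check described above could look roughly like the following. This is a minimal sketch, not the program merged in this PR: the function name `is_port_free` is hypothetical, and the sketch assumes a POSIX system, TCP, and IPv4 with a bind to all interfaces.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper (not the code from this PR): returns true if a TCP
// socket can currently be bound to `port` on all IPv4 interfaces.
bool is_port_free(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return false;  // Could not create a socket; treat the port as unusable.
    }

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    // bind() fails (typically with EADDRINUSE) when another socket already
    // holds the port, regardless of which user owns that socket.
    bool is_free =
        bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0;

    close(fd);  // Release the port again so the model server can bind it.
    return is_free;
}
```

A small `main` that parses the port from `argv` and exits with status 0 when `is_port_free` returns true would let a job script call the check from the shell. Unlike lsof, this approach needs no elevated permissions, because the kernel rejects the bind no matter who owns the conflicting socket.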

Warning

A race condition remains: in the window between the job script checking the port and the model server actually binding it, another process can take the port. I have not encountered this in my tests so far, and fixing it would require major changes to the UM-Bridge interface used for serving models.
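
To make the remaining window concrete, the sketch below replays the sequence inside a single process: a probe finds a free port and releases it, a competing socket takes the port, and the later "server" bind then fails. All names (`bind_listener`, `bound_port`, `race_makes_server_bind_fail`) are hypothetical and exist only for illustration. On a real cluster the competitor would be another user's job, and the timing would be a matter of chance.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>

// Hypothetical helper: binds a listening TCP socket to `port` (0 lets the OS
// choose a port). Returns the socket descriptor, or -1 on failure.
int bind_listener(std::uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        return -1;
    }
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
        listen(fd, 1) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

// Hypothetical helper: returns the port a bound socket is using, or 0 on error.
std::uint16_t bound_port(int fd) {
    sockaddr_in addr{};
    socklen_t len = sizeof(addr);
    if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
        return 0;
    }
    return ntohs(addr.sin_port);
}

// Replays "check, lose the port, start the server" and returns true if the
// final bind fails, i.e. if the earlier check result had become stale.
bool race_makes_server_bind_fail() {
    // Step 1: the check. Bind to an OS-chosen port, then release it, which
    // mirrors a job script concluding that the port is free.
    int probe = bind_listener(0);
    if (probe < 0) {
        return false;
    }
    std::uint16_t port = bound_port(probe);
    close(probe);

    // Step 2: a competing process takes the same port inside the window.
    int competitor = bind_listener(port);
    if (competitor < 0) {
        return false;
    }

    // Step 3: the model server now tries to bind the "free" port.
    int server = bind_listener(port);
    bool failed = server < 0;
    if (server >= 0) {
        close(server);
    }
    close(competitor);
    return failed;
}
```

The check result is only valid at the instant the probe socket is closed, which is why the window cannot be closed from the job script alone.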

Related Issues

closes #83, closes #48

@linusseelinger (Member)

Great, thanks! :)

@linusseelinger linusseelinger merged commit 660d146 into main Oct 21, 2024
26 checks passed
@Schlevidon Schlevidon deleted the hpc-fix-port-issue branch November 19, 2024 12:47
Issues closed by merging this pull request:

  • HPC: Port selection bug in job.sh
  • HPC: Sporadic model crashes on Helix