HPC: Sporadic model crashes on Helix #48

Closed
linusseelinger opened this issue Jan 23, 2024 · 0 comments · Fixed by #86 · May be fixed by #71

linusseelinger commented Jan 23, 2024

The "multiply by 2" test jobs (modified to take 10 seconds per evaluation) occasionally log "Quit" right after "Listening on port x...". I'm running 100 instances via HQ, queried from the test script below. The failure occurs roughly once every 300 runs. When it does, the test script waits indefinitely for the failed job to return a result.

#!/bin/bash 
 
echo "Sending requests..." 
 
for i in {1..100} 
do 
   # Expected output: {"output":[[200.0]]} 
   # Check if curl output equals expected output 
   # If not, print error message 
 
   if [ "$(curl -s http://localhost:4242/Evaluate -X POST -d '{"name": "forward", "input": [[100.0]]}')" == '{"output":[[200.0]]}' ]; then 
       echo -n "y" 
   else 
       echo -n "n" 
       #echo "Error: curl output does not equal expected output" 
   fi & 
 
done 
 
echo "Requests sent. Waiting for responses..." 
 
wait
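
To keep the test script from hanging when a job has quit, each request could be given an upper time bound. The following is only a sketch, not part of the original test: the helper names `classify` and `send_request` and the 60-second default are assumptions, and the timeout would need to be tuned to the queue depth and the 10-second evaluation time. It relies on curl's `--max-time` option and on curl exit status 28, which indicates a timed-out operation.

```bash
#!/bin/bash
# Sketch: bounded variant of a single request from the loop above.
# classify, send_request and the 60 s default are hypothetical.

EXPECTED='{"output":[[200.0]]}'

# Print "y" for the expected response, "t" for a timeout, "n" otherwise.
# $1: response body, $2: curl exit status (28 = operation timed out).
classify() {
    local body="$1" status="$2"
    if [ "$status" -eq 28 ]; then
        echo -n "t"
    elif [ "$body" == "$EXPECTED" ]; then
        echo -n "y"
    else
        echo -n "n"
    fi
}

# Send one Evaluate request and give up after timeout_s seconds.
# $1: URL, $2: timeout in seconds (assumed default: 60).
send_request() {
    local url="$1" timeout_s="${2:-60}" body status
    body="$(curl -s --max-time "$timeout_s" "$url" -X POST \
        -d '{"name": "forward", "input": [[100.0]]}')"
    status=$?
    classify "$body" "$status"
}
```

Inside the loop, `send_request http://localhost:4242/Evaluate 60 &` would take the place of the `if ... fi &` block, so a crashed job shows up as a `t` instead of blocking the final `wait`.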

A possible workaround is to set the minimum port to 60000, which is why I suspect the issue is simply a port that is already occupied.

Either the port finder does not work as intended on Helix, or there is a race condition in the short window between the port finder picking a port and the model reserving it at launch (which seems unlikely).
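
One way to tell the two hypotheses apart would be to log, right before launching a model, whether the chosen port already shows up as listening on the node. The sketch below is a diagnostic under assumptions, not the project's port finder: the function names are hypothetical, and it assumes the iproute2 `ss` tool is available on the compute nodes. It also only sees sockets in LISTEN state at the moment of the check, so it cannot close the race window itself.

```bash
#!/bin/bash
# Sketch of an occupied-port check; all function names are hypothetical.

# Succeed if port $1 appears in the newline-separated port list $2.
port_listed() {
    grep -qx -- "$1" <<< "$2"
}

# Print the local TCP ports currently in LISTEN state, one per line.
# Assumes `ss` (iproute2) is installed; the port is the last
# colon-separated field of the local address column.
listening_ports() {
    ss -Htln | awk '{ n = split($4, a, ":"); print a[n] }' | sort -un
}

# Print the first port in [min, max] that is not in the list $3.
# Returns non-zero if every port in the range is listed.
first_free_port() {
    local min="$1" max="$2" used="$3" port
    for (( port = min; port <= max; port++ )); do
        if ! port_listed "$port" "$used"; then
            echo "$port"
            return 0
        fi
    done
    return 1
}
```

For example, `first_free_port 60000 60100 "$(listening_ports)"` prints a candidate port in that range. If a job still quits on a port this check reported as free, that would point toward the race condition and away from a faulty port finder.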

@LennoxLiu LennoxLiu self-assigned this Feb 12, 2024
@LennoxLiu LennoxLiu linked a pull request Apr 18, 2024 that will close this issue