spread fit log file location on remote machines #6

Open
achubaty opened this issue Apr 12, 2021 · 0 comments

achubaty commented Apr 12, 2021

When setting up the cluster object, the log file (e.g., fireSense_SpreadFit_2021-04-12_124702_pid27400.log) is specified with an explicit full path, which requires that path to exist on the remote cluster nodes.

This location should perhaps be in /tmp on the remote machine(s), because there is no guarantee that the project output path exists there. If the path does not exist, fitting fails with the following error:

Error in file(outfile, open = "a") : cannot open the connection
Calls: workRSOCK -> sinkWorkerOutput -> file
In addition: Warning message:
In file(outfile, open = "a") :
  cannot open file '/mnt/projects2/WBI_SBW/outputs/AB/fireSense_SpreadFit_2021-04-12_104043_pid27400.log': No such file or directory
Execution halted
Error in file(outfile, open = "a") : cannot open the connection
Calls: workRSOCK -> sinkWorkerOutput -> file
In addition: Warning message:
In file(outfile, open = "a") :
  cannot open file '/mnt/projects2/WBI_SBW/outputs/AB/fireSense_SpreadFit_2021-04-12_104043_pid27400.log': No such file or directory
Execution halted
Error in file(outfile, open = "a") : cannot open the connection
Calls: workRSOCK -> sinkWorkerOutput -> file
In addition: Warning message:
In file(outfile, open = "a") :
  cannot open file '/mnt/projects2/WBI_SBW/outputs/AB/fireSense_SpreadFit_2021-04-12_104043_pid27400.log': No such file or directory
Execution halted
2021-04-12 10:47:45 ERROR::e: Failed to launch and connect to R worker on remote machine ‘forcast01.local’ from local machine ‘forcast02’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11604 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/bin/ssh' -R 11672:localhost:11604 forcast01.local "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=11672 OUT=/mnt/projects2/WBI_SBW/outputs/AB/fireSense_SpreadFit_2021-04-12_104043_pid27400.log TIMEOUT=2592000 XDR=FALSE".
 * Troubleshooting suggestions:
   - Suggestion #1: Set 'verbose=TRUE' to see more details.
   - Suggestion #2: Set 'outfile=NULL' to see output from worker.
   - Suggestion #3: Set 'rshlogfile=TRUE' to enable logging for ‘/usr/bin/ssh’.

 * Number of attempts: 3 (15s delay)
 2021-04-12 10:47:45 ERROR::e: socketConnection("localhost", port = port, server = TRUE, blocking = TRUE, open = "a+b", timeout = timeout)
Error in socketConnection("localhost", port = port, server = TRUE, blocking = TRUE,  : 
  Failed to launch and connect to R worker on remote machine ‘forcast01.local’ from local machine ‘forcast02’.
 * The error produced by socketConnection() was: ‘reached elapsed time limit’ (which suggests that the connection timeout of 120 seconds (argument 'connectTimeout') kicked in)
 * The localhost socket connection that failed to connect to the R worker used port 11604 using a communication timeout of 2592000 seconds and a connection timeout of 120 seconds.
 * Worker launch call: '/usr/bin/ssh' -R 11672:localhost:11604 forcast01.local "'Rscript' --default-packages=datasets,utils,grDevices,graphics,stats,methods -e 'workRSOCK <- tryCatch(parallel:::.slaveRSOCK, error=function(e) parallel:::.workRSOCK); workRSOCK()' MASTER=localhost PORT=11672 OUT=/mnt/projects2/WBI_SBW/outputs/AB/fireSense_SpreadFit_2021-04-12_104043_pid27400.log TIMEOUT=2592000 XDR=FALSE".

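The timeout error is a red herring. Each worker halts as soon as it fails to open the log file, so the main process never receives a connection and eventually times out. The real culprit is the warning about the path not existing.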
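
One possible workaround, sketched below, is to keep the log file name but point it at a directory that should exist on every node. This is only an illustration, not the module's current code: `remoteLogPath()` is a hypothetical helper, and `/tmp` is an assumption about the remote machines.

```r
# Hypothetical helper (not part of the module): keep the log file's name,
# but place it in a directory assumed to exist on every node.
remoteLogPath <- function(logFile, remoteDir = "/tmp") {
  file.path(remoteDir, basename(logFile))
}

# Path taken from the error message above.
logFile <- file.path("/mnt/projects2/WBI_SBW/outputs/AB",
                     "fireSense_SpreadFit_2021-04-12_104043_pid27400.log")

outfile <- remoteLogPath(logFile)
print(outfile)

# The relocated path could then be passed as the 'outfile' argument when the
# cluster is created. Not run here, because it needs the remote machines:
if (FALSE) {
  cl <- parallelly::makeClusterPSOCK("forcast01.local", outfile = outfile)
}
```

With this approach each node writes its worker output to its own local /tmp, so the logs are no longer collected under the project output path and would have to be retrieved from each machine.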
