Specifying network interface #260
Comments
If I understand correctly, you are creating a cluster manually (by calling
Yes. I use ray start to spawn Ray processes on multiple machines:
I also changed Ray's get_node_ip_address() function to always return the IP address of the internal network interface, so that Ray actors, tasks, and the object store communicate over the internal network. However, xgboost_ray/xgboost seems to use its own collective communication framework, which automatically chooses the public network interface.
Best,
Hmm, xgboost_ray should use the IP returned by
It returns "192.168.6.1" on node-0 and "192.168.6.2" on node-1, etc.
How do you detect that xgboost chooses the wrong interface? One place where extra logging can be added is
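Rather than hard-coding one address per node, the internal address can be selected by subnet. A minimal sketch, assuming the internal network is 192.168.6.0/24 (as the addresses above suggest) and that the node's local addresses are already known. `pick_internal_ip` is a hypothetical helper, not part of Ray or xgboost_ray, and 203.0.113.10 is just a documentation-range stand-in for a public address.

```python
import ipaddress
from typing import Iterable, Optional

# Assumption: the internal (high-bandwidth) network is 192.168.6.0/24,
# matching the node addresses reported in this thread.
INTERNAL_SUBNET = ipaddress.ip_network("192.168.6.0/24")


def pick_internal_ip(candidates: Iterable[str],
                     subnet=INTERNAL_SUBNET) -> Optional[str]:
    """Return the first candidate address inside `subnet`, or None.

    Hypothetical helper: `candidates` would be the node's local addresses
    (e.g. as listed by ifconfig); collecting them is out of scope here.
    """
    for addr in candidates:
        try:
            ip = ipaddress.ip_address(addr)
        except ValueError:
            continue  # skip entries that are not valid IP addresses
        if ip in subnet:  # False for addresses of the other IP version
            return str(ip)
    return None


if __name__ == "__main__":
    # One public-looking address and one internal address on the same node.
    print(pick_internal_ip(["203.0.113.10", "192.168.6.1"]))
```

Selecting by subnet keeps the same code valid on every node, since only the host part of the address differs. Note that this only controls which address Ray reports; as discussed in this thread, it does not by itself change which interface xgboost's own communication layer binds to.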
I
I am not an expert in Rabit, but it looks like you may need to set up OS-level routing for it to work, or disable the other network interface. I'd also consider opening an issue in the xgboost repository.
Hi Xgboost_ray authors,
I wonder whether it is possible to specify the network interface used by xgboost_ray/xgboost. I am currently running xgboost_benchmark.py on a shared testbed (https://www.cloudlab.us/) where each machine has one public network interface and one internal network interface. However, xgboost_ray/xgboost automatically chooses the public network interface, which has much lower bandwidth than the internal one.
This is what my machine shows from ifconfig; I would like to use interface ens1f1 instead of eno49. Is there any way to achieve that? Thanks in advance!
Best,
Yang