This folder contains information about Distributed XGBoost (Distributed GBDT).
- The distributed version is built on Rabit: Reliable Allreduce and Broadcast Library
- Rabit is a portable library that provides fault tolerance for Allreduce calls in distributed machine learning
- This makes xgboost portable and fault-tolerant against node failures
- You can run Distributed XGBoost on platforms including Hadoop (see the hadoop folder) and MPI
- Rabit only relies on a platform to start the programs, so it should be easy to port xgboost to most platforms
- In the root folder, type
make
- If you have a C++11 compiler, it is recommended to use
make cxx11=1
- Rabit handles all the fault tolerance and communication efficiently; we only use platform-specific commands to start programs
- The Hadoop version does not rely on MapReduce to do iterations
- You can expect xgboost not to suffer the drawbacks of iterative MapReduce programs
- The design choice was made because Allreduce is very natural and efficient for distributed tree building
- In the current version of xgboost, the distributed version only adds several lines of Allreduce synchronization code
- The multi-threading nature of xgboost is inherited in distributed mode
- This means xgboost efficiently uses all the threads in one machine, and communicates only between machines
- Remember to run one xgboost process per machine and this will give you maximum speedup
- For more information about Rabit and how it works, see the Rabit tutorial
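
To illustrate why Allreduce fits distributed tree building, here is a minimal single-process Python sketch of an Allreduce-sum. It is not Rabit's API: the function `allreduce_sum` and the three-worker layout are hypothetical, and the "workers" are plain lists in one process. The idea it shows is that each worker contributes a local statistics buffer and every worker receives the same element-wise sum.

```python
from typing import List


def allreduce_sum(local_buffers: List[List[float]]) -> List[List[float]]:
    """Simulate Allreduce(sum): every worker ends up with the element-wise sum.

    Hypothetical helper for illustration only; a real run would call the
    communication library instead of looping over in-memory lists.
    """
    if not local_buffers:
        return []
    length = len(local_buffers[0])
    if any(len(buf) != length for buf in local_buffers):
        raise ValueError("all workers must contribute buffers of equal length")
    total = [sum(buf[i] for buf in local_buffers) for i in range(length)]
    # Each worker gets its own copy of the reduced result.
    return [list(total) for _ in local_buffers]


# Three simulated workers, each holding local gradient sums for three buckets.
worker_stats = [
    [1.0, 2.0, 0.5],
    [0.0, 1.5, 2.5],
    [3.0, 0.5, 1.0],
]
reduced = allreduce_sum(worker_stats)
print(reduced[0])  # [4.0, 4.0, 4.0], identical on every worker
```

After the call, every worker holds the same global statistics and can make the same tree-building decision locally, which is why only a few synchronization points are needed.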
There are two solvers in distributed xgboost. For local demos of the two solvers, see row-split and col-split
- The column-based solver splits data by column; each node works on a subset of columns and uses exactly the same algorithm as the single-node version.
- The row-based solver splits data by row; each node works on a subset of rows.
it uses an approximate histogram count algorithm, and only examines a subset of
potential split points as opposed to all split points.
- This is the mode used by the current Hadoop version, since data is usually stored by rows in many industry systems
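
The row-based approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the solver's actual code: the candidate cut points are fixed by hand, the helper names (`local_histogram`, `merge_histograms`, `best_split`) are hypothetical, the merge step stands in for an Allreduce-sum, and the gain uses the usual second-order formula with an assumed regularization term `reg_lambda`.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

Row = Tuple[float, float, float]  # (feature value, gradient, hessian)


def local_histogram(rows: List[Row], cuts: List[float]) -> List[List[float]]:
    """Accumulate [grad_sum, hess_sum] per bin on one worker's rows.

    Bin i holds values v with cuts[i-1] < v <= cuts[i]; the last bin holds
    values greater than every cut.
    """
    hist = [[0.0, 0.0] for _ in range(len(cuts) + 1)]
    for value, grad, hess in rows:
        b = bisect_left(cuts, value)
        hist[b][0] += grad
        hist[b][1] += hess
    return hist


def merge_histograms(hists: List[List[List[float]]]) -> List[List[float]]:
    """Element-wise sum of per-worker histograms (stand-in for Allreduce)."""
    merged = [[0.0, 0.0] for _ in hists[0]]
    for hist in hists:
        for i, (g, h) in enumerate(hist):
            merged[i][0] += g
            merged[i][1] += h
    return merged


def best_split(hist: List[List[float]], cuts: List[float],
               reg_lambda: float = 1.0) -> Tuple[Optional[float], float]:
    """Scan only the candidate cuts and return (best cut, gain)."""
    g_total = sum(g for g, _ in hist)
    h_total = sum(h for _, h in hist)
    parent = g_total ** 2 / (h_total + reg_lambda)
    best_gain, best_cut = 0.0, None
    g_left = h_left = 0.0
    for i, cut in enumerate(cuts):
        g_left += hist[i][0]
        h_left += hist[i][1]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + reg_lambda)
                + g_right ** 2 / (h_right + reg_lambda) - parent)
        if gain > best_gain:
            best_gain, best_cut = gain, cut
    return best_cut, best_gain


# Two simulated workers, each holding a subset of the rows.
cuts = [1.0, 2.0, 3.0]
worker_rows = [
    [(0.5, -1.0, 1.0), (1.5, -1.0, 1.0)],
    [(2.5, 1.0, 1.0), (3.5, 1.0, 1.0)],
]
merged = merge_histograms([local_histogram(r, cuts) for r in worker_rows])
best_cut, best_gain = best_split(merged, cuts)
print(best_cut)  # 2.0
```

Only the three candidate cuts are examined, rather than every distinct feature value, and the data exchanged between workers is one small histogram per feature instead of the rows themselves.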