What happens when you have a multi-target regression problem and you still want to use XGBoost's implementation of gradient boosted trees (or any other regressor that does not natively support multi-target regression)? Probably the quickest solution is scikit-learn's MultiOutputRegressor(), a wrapper that fits one regressor per target.
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
# some function to return your training data
X, y = import_and_prepare_your_training_data()
print("Training samples: %d" % X.shape[0])
print("Input features: %d" % X.shape[1])
print("Targets: %d" % y.shape[1])
# wrap the regressor so that one model is fit per target
model1 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror'))
model1.fit(X, y)
>>>
Training samples: 1000
Input features: 145
Targets: 118
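The fitted wrapper behaves like any other scikit-learn estimator, with predict() returning one column per target. A quick sanity check on the shapes (using the toy numbers above):
# predictions come back as one column per target
preds = model1.predict(X)
print("Prediction shape: (%d, %d)" % preds.shape)
>>>
Prediction shape: (1000, 118)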
Quite easily, one can now use XGBoost for multi-target regression. However, when dealing with large training sets, a little performance optimization can go a long way toward reducing training time, and one must be careful to parallelize the workload at the appropriate level. Recall that scikit-learn's MultiOutputRegressor() trains one regressor per target: its built-in multiprocessing distributes the training of one XGBoost regressor to each of n_jobs CPUs until all targets have been trained. By contrast, XGBoost uses OpenMP threading to parallelize the training of a single regressor across the nthread CPUs allocated to the job, while scikit-learn then loops over the targets serially. Making 10 CPUs available for training, we can see that scikit-learn's parallelization scheme provides a speed-up in training time of about 4x in this multi-target regression application. This is quite nice, as my dataset actually contains ~400,000 training examples (not the 1,000 shown above), saving me hours of training time.
import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
import time
# some function to return your training data
X, y = import_and_prepare_your_training_data()
# model1 parallelizes across targets: one regressor per CPU
model1 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror'), n_jobs=10)
# model2 parallelizes within each regressor: targets fit serially
model2 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror', nthread=10))
start1 = time.time()
model1.fit(X, y)
end1 = time.time()
start2 = time.time()
model2.fit(X, y)
end2 = time.time()
print("Model 1 execution time: %.2f s" % (end1 - start1))
print("Model 2 execution time: %.2f s" % (end2 - start2))
>>>
Model 1 execution time: 14.74 s
Model 2 execution time: 65.21 s
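One caveat: combining both levels of parallelism can oversubscribe the CPU (10 scikit-learn workers each spawning 10 XGBoost threads). A minimal sketch of one way to avoid this (model3 is just an illustrative name): pin each inner regressor to a single thread and let MultiOutputRegressor() handle the distribution.
# parallelize across targets only: each of the 10 workers
# trains its XGBoost regressor on a single thread
model3 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror', nthread=1),
    n_jobs=10)
model3.fit(X, y)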