Multi-target regression with XGBoost

Determining the parallelization scheme that minimizes training time

Posted by Matt Witman on October 21, 2019

What happens when you have a multi-target regression problem but still want to use XGBoost's implementation of gradient boosted trees (or any other regressor that does not natively support multi-target regression)? Probably the quickest solution is scikit-learn's MultiOutputRegressor(), a wrapper that fits one regressor per target.

import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor

# some function to return your training data
X, y = import_and_prepare_your_training_data()
print("Training samples: %d" % X.shape[0])
print("Input features: %d" % X.shape[1])
print("Targets: %d" % y.shape[1])

# wrap a single-target XGBoost regressor so that one copy
# is fit per target column of y
model1 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror'))

model1.fit(X, y)
            
 >>>
Training samples: 1000
Input features: 145
Targets: 118
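
For completeness, the fitted wrapper predicts all targets at once: predict() stacks the outputs of the per-target regressors into a 2D array with one column per target. A minimal sketch, assuming the model and data from above:

# each underlying XGBoost regressor contributes one column
y_pred = model1.predict(X)
print(y_pred.shape)  # (1000, 118) for the data above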
          

One can now quite easily use XGBoost for multi-target regression. When dealing with large training sets, however, a little performance optimization can go a long way toward reducing training time, and one must be careful to parallelize the workload at the appropriate level. Recall that scikit-learn's MultiOutputRegressor() trains one regressor per target: its built-in multiprocessing distributes these single-target training jobs across n_jobs CPUs until all targets have been trained. By contrast, XGBoost uses OpenMP threading to parallelize the training of a single regressor across nthread CPUs, in which case scikit-learn must fit the regressors serially, one per target.

Making 10 CPUs available for training, we can see below that scikit-learn's parallelization scheme provides roughly a 4x speedup in training time for this multi-target regression application. This is quite nice, as my dataset actually contains ~400,000 training examples (not the 1,000 shown above), saving me hours of training time.

import xgboost as xgb
from sklearn.multioutput import MultiOutputRegressor
import time

# some function to return your training data
X, y = import_and_prepare_your_training_data()

# model1: parallelize across targets -- scikit-learn fits up to
# 10 single-target regressors at a time
model1 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror'), n_jobs=10)

# model2: parallelize within each fit -- XGBoost uses 10 threads
# per regressor while scikit-learn loops over targets serially
model2 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror', nthread=10))

start1 = time.time()
model1.fit(X, y)
end1 = time.time()

start2 = time.time()
model2.fit(X, y)
end2 = time.time()

# wall-clock training time in seconds
print("Model 1 execution time: %.2f" % (end1 - start1))
print("Model 2 execution time: %.2f" % (end2 - start2))

            
 >>>
Model 1 execution time: 14.74
Model 2 execution time: 65.21
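
One caveat worth adding: the two levels of parallelism can be combined, but if the product of n_jobs and nthread exceeds the number of available CPUs, the machine is oversubscribed and training can actually slow down. A conservative sketch (my suggested configuration, not benchmarked above) pins each per-target regressor to a single thread, assuming the same 10 CPUs:

# parallelize across targets only: 10 scikit-learn workers,
# each running a single-threaded XGBoost regressor
model3 = MultiOutputRegressor(
    xgb.XGBRegressor(objective='reg:squarederror', nthread=1),
    n_jobs=10)
model3.fit(X, y)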