XGBoost has gotten a lot of attention recently as the algorithm has been very successful in machine learning competitions. We in Aster engineering have been getting a lot of requests to provide this function to our customers. In AA 7.0, we’ve released an XGBoost/Gradient Boosting function.
The techniques of XGBoost can be used to improve the performance of any classifier. Most often, it’s used with decision trees, which is how we’ve built it in Aster.
Decision trees are a supervised learning technique that tries to develop rules (“decisions”) to predict the outcome associated with an observation. Each rule is a binary choice based on the value of a single predictor: the next binary choice depends on the value of that predictor, and so on, until a prediction can be made. The rules can be easily summarized and visualized as a tree, as shown below.
In this tree, the outcome is 0, 1, 2, 3, or 4, where 0 indicates no heart disease, and 1 through 4 represent increasing severity of heart disease. The first “rule” is based on the value of the “Thal” column. If it is anything other than 6 or 7, the predicted outcome is 0. If the value in the Thal column is 6 or 7, the next step is to look at the value in the STDep column. If it is less than 0.7, the next step is to look at the value in the Ca column; if it is greater than or equal to 0.7, the next step depends on the value in the ChestPain column. To make a prediction for an observation, follow the rules down the tree until you reach a leaf node. The number at the leaf node is the predicted result for that observation.
A couple of techniques that can significantly improve the performance of decision trees are bagging and boosting. Bagging stands for “bootstrap aggregation”. Bootstrapping is a statistical technique where multiple datasets are created from a single dataset by taking repeated random samples, with replacement, from the original dataset. In this way you create a large number of slightly different datasets. Bagging starts by bootstrapping a large number of datasets and creating a decision tree for each one. Then, combine the trees by either majority vote (for classification problems) or averaging (for regression problems).
Random forest is a very popular variant of bagging. With random forests, you use bootstrapping to create new datasets as you do with bagging, but at each split, you only consider a subset of the predictors. This forces the algorithm to consider a wider range of predictors, creating a more diverse set of trees and a more robust model.
Boosting is a different approach. With boosting, you build trees sequentially. Each tree focuses specifically on the errors made by the previous tree. The idea is to gradually build a better model by improving the performance of the model at each step. This is different from bagging and random forest because at each stage you try to improve the model, by specifically looking at the points that the previous model didn’t predict correctly, instead of just creating a bunch of models and averaging them all together.
There are several approaches to boosting. XGBoost is based on gradient boosting.
The gradient boosting process starts by creating a decision tree to fit the data. Then, you use this tree to make a prediction for each observation and calculate the error for each prediction. Even though you’re predicting the same data that you used to build the tree, the tree is not a perfect model, so there will be some error. In the next iteration, this set of prediction errors becomes the new dataset. That is, each data point in the data set is replaced by the delta between the actual result and the predicted result. At each iteration, you replace the dataset with the errors made by the previous iteration. Then, you build a tree that tries to fit this new dataset of the deltas, make new predictions, and so on. When you add these trees together, the result should be closer to the original actual value that you were trying to fit, because you’re adding a model of the error. This process is repeated for a specified number of iterations.
Gradient boosting and XGBoost use a number of other optimizations to further improve performance.
Regularization is a common technique in machine learning. It refers to penalizing the number or the magnitude of the model parameters. It’s a way to prevent overfitting, or building a model that fits the training data so closely that it becomes unflexible and doesn’t perform well on different data.
When working with decision trees, regularization can be used to control the complexity of the tree, either by reducing the number of leaf nodes or the values assigned to each leaf node.
Typically in gradient boosting, when you add the trees together, each tree is multiplied, by a number less than 1 to slow the learning process down (boosting is often described as a way to “learn slowly”). The idea is that moving gradually toward an optimal solution is better than taking large steps which might lead you to overshoot the optimal result.
Subsampling is also a common technique in machine learning. It refers to building trees using only a subset of the rows or columns. The idea is to force the process to consider a more diverse set of observations (rows) or predictors (columns), so that it builds a more robust model.
The Aster XGBoost function also boosts trees in parallel. This is a form of row subsampling, where each vworker gets assigned a subset of the rows, and creates a set of boosted trees based on that data.
Stopping criteria are another important factor when building decision trees. In the Aster XGBoost function, you specify the exact number of boosting steps. The function also has stopping criteria that control the size of each tree; these arguments are analogous to those used in the other Aster decision tree functions Single_Tree_Drive, Forest_Drive, and AdaBoost_Drive.
Here’s the syntax of XGBoost_Drive. Refer to the Aster Analytics Foundation User Guide (Release 7.00.02, September 2017) for more information about the function arguments.
Here’s an example. The dataset is available from the UCI Machine Learning Repository. It’s a set of fetal monitoring observations classified into 3 categories. There are 2126 observations and 21 numeric attributes. The first few rows are shown below.
As usual when training a model, we divide the dataset into training and test sets, and use the training set to build the model. Here’s a sample function call:
The function displays a message when it finishes:
We can use the XGBoost_Predict function to try out the model on the test dataset:
Here are the first few rows of the output:
select id, nsp, prediction from ctg_predict;
To conclude, we’re very excited to make this algorithm available to our customers. Try it out!
James, G., Witten, D., Hastie, T., & Tibshirani R. (2013). An Introduction to Statistical Learning with Applications in R. Available at: http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf
Friedman, J. “Greedy Function Approximation: A Gradient Boosting Machine.” IMS 1999 Reitz Lecture. Available at: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
Chen, T., Guestrin, C. “XGBoost: A Scalable Tree Boosting System.” KDD ’16, August 13-17, 2016, San Francisco, CA. Available at: http://www.kdd.org/kdd2016/papers/files/rfp0697-chenAemb.pdf
Dataset used in example:
Cardiotocography Data Set. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.