Gradient boosting is a machine learning technique for building predictive models, used mostly for regression and classification. The resulting model is an ensemble of weak prediction models, typically decision trees. Like other boosting methods, gradient boosting builds the model in stages, but it generalizes them by allowing the optimization of an arbitrary differentiable loss function.
The idea of gradient boosting originated with American statistician Leo Breiman, who observed that boosting can be interpreted as an optimization algorithm on a suitable cost function. The method was subsequently developed into one that optimizes a cost function by iteratively choosing a weak hypothesis (a function) that points in the negative gradient direction.
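The stagewise idea can be written out as a short derivation. This is a sketch of the usual textbook formulation rather than anything specific to this guide. The symbols are assumptions introduced here: L is the differentiable loss, F_m is the model after m stages, h_m is the weak learner added at stage m, and n is the number of training observations (x_i, y_i).

```latex
\begin{aligned}
r_{im} &= -\left[\frac{\partial L\bigl(y_i, F(x_i)\bigr)}{\partial F(x_i)}\right]_{F = F_{m-1}},
  \qquad i = 1, \dots, n \\
h_m &= \text{weak learner fit to the pairs } \{(x_i, r_{im})\}_{i=1}^{n} \\
\gamma_m &= \arg\min_{\gamma} \sum_{i=1}^{n} L\bigl(y_i,\; F_{m-1}(x_i) + \gamma\, h_m(x_i)\bigr) \\
F_m(x) &= F_{m-1}(x) + \gamma_m\, h_m(x)
\end{aligned}
```

For squared-error loss, L = (y - F)^2 / 2, the negative gradient r_im is simply the residual y_i - F_{m-1}(x_i), so each stage fits the weak learner to the current residuals.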
Gradient boosting is a method used in building predictive models.
Regularization techniques reduce overfitting by constraining the fitting procedure.
The stochastic gradient boosting algorithm is faster than the conventional gradient boosting procedure, since the regression trees are fit to smaller data sets at each iteration.
Let j denote the number of terminal nodes (leaves) in each tree. The parameter j is adjustable, depending on the data being handled, and controls the maximum level of interaction between variables in the model. With j = 2, each tree is a decision stump, and no interactions between variables are allowed.
With j = 3, the model can include interaction effects between up to two variables. The trend continues in that manner as the number of terminal nodes grows.
In practice, values of j between four and eight tend to work well. Decision stumps (j = 2) are insufficient for many applications, while values well above eight are rarely necessary.
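The effect of j = 2 can be seen in a small experiment. The sketch below is illustrative only: it uses squared-error loss, a tiny hand-made data set, and helper names (`fit_stump`, `boost_stumps`, `mse`) that are made up for this example. Boosted stumps fit an additive target exactly but cannot improve on the mean for a pure interaction (XOR) target.

```python
from itertools import product


def fit_stump(X, r):
    """Return (feature, threshold, left_value, right_value) of the
    single-split tree that minimizes squared error on targets r."""
    best = None
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [ri for row, ri in zip(X, r) if row[f] <= t]
            right = [ri for row, ri in zip(X, r) if row[f] > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((ri - lv) ** 2 for ri in left)
                   + sum((ri - rv) ** 2 for ri in right))
            if best is None or sse < best[0]:
                best = (sse, f, t, lv, rv)
    return best[1:]


def boost_stumps(X, y, n_rounds=20):
    """Gradient boosting for squared error with two-leaf trees (j = 2)."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        f, t, lv, rv = fit_stump(X, residuals)
        pred = [pi + (lv if row[f] <= t else rv) for row, pi in zip(X, pred)]
    return pred


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


X = [list(p) for p in product([0, 1], repeat=2)]
y_additive = [x1 + 2 * x2 for x1, x2 in X]          # no interaction
y_interaction = [float(x1 != x2) for x1, x2 in X]   # XOR: pure interaction

mse_additive = mse(y_additive, boost_stumps(X, y_additive))
mse_interaction = mse(y_interaction, boost_stumps(X, y_interaction))
print(mse_additive, mse_interaction)
```

Because every stump depends on a single variable, the sum of stumps is an additive model, which is why the XOR error stays at the level of predicting the mean.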
Gradient Boosting Regularization
Fitting the training set too closely can degrade the model's ability to generalize. Regularization techniques reduce this overfitting effect by constraining the fitting procedure.
One popular regularization parameter is M, which denotes the number of gradient boosting iterations. When the base learner is a decision tree, M is the number of trees in the entire model.
A larger number of gradient boosting iterations reduces the error on the training set, but raising the number too high increases overfitting. Monitoring the prediction error on a separate validation data set can help choose the optimal number of iterations.
In addition to the number of iterations, the depth of the trees can be used as a regularization parameter. The deeper the trees, the more likely the model is to overfit the training data.
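Choosing M from a validation set can be sketched as follows. This is a minimal illustration under stated assumptions: squared-error loss, one-dimensional inputs, two-leaf trees, synthetic sine data with Gaussian noise, and hypothetical helper names (`fit_stump_1d`, `boost_with_validation`). It is not a reference implementation.

```python
import math
import random


def fit_stump_1d(x, r):
    """Best single-split regression stump on 1-D inputs x for targets r."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((ri - lv) ** 2 for ri in left)
               + sum((ri - rv) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


def boost_with_validation(x_tr, y_tr, x_val, y_val,
                          max_rounds=100, learning_rate=0.1):
    """Run max_rounds boosting iterations and record both error curves."""
    base = sum(y_tr) / len(y_tr)
    pred_tr = [base] * len(y_tr)
    pred_val = [base] * len(y_val)
    train_err, val_err = [], []
    for _ in range(max_rounds):
        residuals = [yi - pi for yi, pi in zip(y_tr, pred_tr)]
        t, lv, rv = fit_stump_1d(x_tr, residuals)
        pred_tr = [p + learning_rate * (lv if xi <= t else rv)
                   for xi, p in zip(x_tr, pred_tr)]
        pred_val = [p + learning_rate * (lv if xi <= t else rv)
                    for xi, p in zip(x_val, pred_val)]
        train_err.append(mse(y_tr, pred_tr))
        val_err.append(mse(y_val, pred_val))
    # M is chosen as the iteration count with the lowest validation error.
    best_m = min(range(max_rounds), key=val_err.__getitem__) + 1
    return best_m, train_err, val_err


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
x_train = [i / 40 for i in range(40)]
x_val = [(i + 0.5) / 40 for i in range(40)]
y_train = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in x_train]
y_val = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in x_val]

best_m, train_err, val_err = boost_with_validation(x_train, y_train, x_val, y_val)
print("chosen M:", best_m)
```

The training error can only go down as iterations are added, so it cannot be used to pick M. The validation curve is the one that signals when further iterations stop helping.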
Gradient Boosting Shrinkage
Shrinkage is a gradient boosting regularization procedure that modifies the update rule by scaling each update by a parameter known as the learning rate. Learning rates below 0.1 have been found to produce significant improvements in a model's ability to generalize.
The improvements are dramatic compared with gradient boosting without shrinkage, where the learning rate equals 1. They come at the cost of more computational time during both training and querying, because a lower learning rate requires more iterations.
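Written as a formula, shrinkage changes only the update step. The notation is an assumption introduced for illustration: F_m is the model after m iterations, h_m is the base learner fit at iteration m, gamma_m is its step size, and nu is the learning rate.

```latex
F_m(x) = F_{m-1}(x) + \nu \, \gamma_m \, h_m(x), \qquad 0 < \nu \le 1
```

Setting nu = 1 recovers the update without shrinkage. Smaller values of nu make each tree contribute less, which is why more iterations are needed to reach the same training error.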
Stochastic Gradient Boosting
Friedman proposed an improvement to the gradient boosting algorithm, motivated by Breiman's bootstrap aggregation (bagging) technique. Specifically, he proposed that at each iteration the base learner be fit on a subsample of the training set drawn at random without replacement. Friedman observed that this modification improved the algorithm's accuracy significantly.
The size of the subsample is a constant fraction of the training set size. When the fraction equals 1, the algorithm is deterministic and identical to the conventional procedure. Smaller fractions introduce randomness into the algorithm, which reduces the chances of overfitting and acts as a form of regularization. The resulting algorithm is known as stochastic gradient boosting.
The stochastic gradient boosting algorithm is also faster than the conventional gradient boosting procedure, because the regression trees are fit to smaller data sets at every iteration.
As in bagging, subsampling makes it possible to define an out-of-bag error: the improvement in prediction performance is estimated by evaluating predictions on the observations that were not used to build the next base learner. Out-of-bag estimates help avoid the need for an independent validation data set.
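The sampling step and the out-of-bag bookkeeping can be sketched in a few lines. The function names (`subsample_indices`, `oob_improvement`) and the toy numbers are hypothetical and chosen only for illustration. Squared error is assumed as the loss.

```python
import random


def subsample_indices(n_samples, fraction, rng):
    """Draw a fixed fraction of row indices without replacement.

    Returns (in_bag, out_of_bag); the base learner for the current
    iteration would be fit on the in-bag rows only.
    """
    k = max(1, int(fraction * n_samples))
    in_bag = sorted(rng.sample(range(n_samples), k))
    in_bag_set = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in in_bag_set]
    return in_bag, out_of_bag


def oob_improvement(y, pred_before, pred_after, out_of_bag):
    """Drop in squared error on the out-of-bag rows after the newest
    base learner has been added to the model."""
    def sse(pred):
        return sum((y[i] - pred[i]) ** 2 for i in out_of_bag)
    return sse(pred_before) - sse(pred_after)


rng = random.Random(0)
in_bag, out_of_bag = subsample_indices(10, 0.5, rng)

# With fraction = 1 every row is in the bag and nothing is left out.
full_bag, empty_oob = subsample_indices(10, 1.0, rng)

# Toy numbers: predictions before and after one boosting iteration.
y = [1.0, 2.0, 3.0, 4.0]
improvement = oob_improvement(y, [0.0] * 4, [1.0, 2.0, 2.0, 2.0], [2, 3])
```

A positive improvement on the out-of-bag rows suggests that the latest base learner helps on data it was not fit to.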
Tree Complexity Penalization
Another gradient boosting regularization method is to penalize the complexity of the trees. The complexity of a model can be defined as proportional to the number of leaves in its trees. The model is then optimized jointly over loss and complexity, which corresponds to pruning the trees by removing any branches that fail to reduce the loss by a threshold.
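The link between a per-leaf penalty and a pruning threshold can be made concrete. In the sketch below, `gamma` is a hypothetical name for the penalty charged per leaf, and both function names are made up for illustration. The sketch assumes the simplest possible penalty, proportional to the leaf count only.

```python
def penalized_objective(training_loss, n_leaves, gamma):
    """Training loss plus a complexity penalty proportional to the leaf count."""
    return training_loss + gamma * n_leaves


def should_prune_split(loss_before_split, loss_after_split, gamma):
    """A split adds one leaf, so it lowers the penalized objective only if
    it reduces the training loss by more than gamma."""
    gain = loss_before_split - loss_after_split
    return gain <= gamma


# Toy numbers: the split reduces the loss by 0.3 but costs 0.5 in penalty.
before = penalized_objective(10.0, n_leaves=3, gamma=0.5)
after = penalized_objective(9.7, n_leaves=4, gamma=0.5)
prune = should_prune_split(10.0, 9.7, gamma=0.5)
```

Here the penalized objective is higher after the split than before it, so the branch would be removed. With a smaller `gamma` the same split would be kept.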