GBDT (2017)

Slide Content

Paper report for the 财大 group meeting, December 2017, presenting the gradient boosted decision tree (GBDT) algorithm paper.

Greedy Function Approximation: A Gradient Boosting Machine

Paper Information

Abstract

Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Shapire and Friedman, Hastie and Tibshirani are discussed.

Slides - gbdt.pptx

View online



Download materials: GBDT.tar

Further Reading - Regression Trees

Regression Trees (Further Reading)

Suppose our data consist of p inputs and one response, for each of N observations: $(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$.
The algorithm needs to automatically decide on the splitting variables and split points.
Suppose first that we have a partition into M regions $R_1, R_2, \ldots, R_M$, and we model the response as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m).$$

If we adopt as our criterion minimization of the sum of squares $\sum (y_i - f(x_i))^2$, it is easy to see that the best $\hat{c}_m$ is just the average of $y_i$ in region $R_m$:

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m).$$
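As a quick sanity check on this fact, here is a minimal numpy sketch (the toy data and variable names are illustrative, not from the paper) computing the SSE-optimal constant in each of two regions:

```python
import numpy as np

# Toy data: one input, responses that cluster by whether x <= 0.5.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])

# The SSE-optimal constant in a region is the mean of the responses there.
mask = x <= 0.5
c1_hat = y[mask].mean()   # ave(y_i | x_i in R1)
c2_hat = y[~mask].mean()  # ave(y_i | x_i in R2)
```

Any other constant in a region can only increase the region's sum of squares, since the mean is the least-squares minimizer.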

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}.$$

Then we seek the splitting variable j and split point s that solve

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right].$$

For any choice j and s, the inner minimization is solved by

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$$

For each splitting variable, the determination of the split point s can be done very quickly and hence by scanning through all of the inputs, determination of the best pair (j,s) is feasible.
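The scan described above can be sketched in a few lines of numpy. This is a hedged illustration, not the book's code: the function name is mine, and I take midpoints between consecutive sorted feature values as the candidate split points (any point between two consecutive values gives the same partition).

```python
import numpy as np

def best_split(X, y):
    """Greedy scan over every feature j and candidate split point s.

    For each (j, s) the inner minimization is solved by the region
    means; return the (j, s, sse) triple minimizing the total SSE.
    """
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        # Midpoints between consecutive distinct values suffice as candidates.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= s]
            right = y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best
```

For p features and N observations this scans O(pN) candidate pairs, each solved in closed form by the region means, which is what makes the greedy search feasible.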

Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions.
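The recursive partitioning just described can be sketched as a short function. This is a minimal self-contained illustration assuming numpy, nested dicts for nodes, and a simple minimum-node-size stopping rule; it is not the paper's exact procedure.

```python
import numpy as np

def grow(X, y, min_size=5):
    """Recursively split the data greedily, stopping at small nodes."""
    if len(y) <= min_size:
        return {"leaf": True, "c": y.mean(), "n": len(y)}
    # Greedy scan for the best (j, s) pair, as in the splitting step above.
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= s
            l, r = y[mask], y[~mask]
            sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
            if best is None or sse < best[2]:
                best = (j, s, sse)
    if best is None:  # all feature values identical: no split possible
        return {"leaf": True, "c": y.mean(), "n": len(y)}
    j, s, _ = best
    mask = X[:, j] <= s
    return {"leaf": False, "j": j, "s": s,
            "left": grow(X[mask], y[mask], min_size),
            "right": grow(X[~mask], y[~mask], min_size)}
```

Each recursive call repeats the same greedy search on its own subset of the data, so the tree is built depth-first, one split at a time.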

How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure.

Tree size is a tuning parameter governing the model’s complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree T0, stopping the splitting process only when some minimum node size (say 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.
We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, that is, collapsing any number of its internal (non-terminal) nodes. We index terminal nodes by $m$, with node $m$ representing region $R_m$. Let $|T|$ denote the number of terminal nodes in $T$. Letting

$$N_m = \#\{x_i \in R_m\}, \qquad \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \qquad Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2,$$

we define the cost complexity criterion:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|.$$

The idea is to find, for each $\alpha$, the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$.

The tuning parameter $\alpha \ge 0$ governs the tradeoff between tree size and its goodness of fit to the data. Large values of $\alpha$ result in smaller trees $T_\alpha$, and conversely for smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$ the solution is the full tree $T_0$. We discuss how to adaptively choose $\alpha$ below.
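The criterion itself is a one-liner once a tree is summarized by its terminal nodes. In the sketch below (my representation, not the paper's), a tree is just a list of $(N_m, Q_m)$ pairs, one per leaf; the toy trees illustrate how $\alpha$ trades fit against size:

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|.

    `leaves` is a list of (N_m, Q_m) pairs, one per terminal node.
    """
    return sum(n * q for n, q in leaves) + alpha * len(leaves)

# A deep tree that fits perfectly vs. a single-leaf tree with residual error.
deep = [(2, 0.0), (2, 0.0), (2, 0.0)]  # 3 leaves, zero training error
shallow = [(6, 0.5)]                    # 1 leaf, nonzero error
```

At $\alpha = 0$ the deep tree wins on pure fit; at a large enough $\alpha$ the penalty $\alpha|T|$ dominates and the shallow tree is preferred.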

For each $\alpha$ one can show that there is a unique smallest subtree $T_\alpha$ that minimizes $C_\alpha(T)$. To find $T_\alpha$ we use weakest-link pruning: we successively collapse the internal node that produces the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain $T_\alpha$. Estimation of $\alpha$ is achieved by five- or tenfold cross-validation: we choose the value $\hat{\alpha}$ to minimize the cross-validated sum of squares. Our final tree is $T_{\hat{\alpha}}$.
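The weakest-link sweep can be sketched on a hand-built tree. This is an illustrative toy, assuming a nested-dict representation where each node stores, under an "sse" key, its training sum of squares $N_m Q_m$ if that node were made terminal:

```python
import copy

def leaves(t):
    """Terminal nodes of a nested-dict tree."""
    return [t] if "left" not in t else leaves(t["left"]) + leaves(t["right"])

def internal(t):
    """Internal (non-terminal) nodes of the tree."""
    if "left" not in t:
        return []
    return [t] + internal(t["left"]) + internal(t["right"])

def weakest_link_prune(root):
    """Successively collapse the internal node giving the smallest
    per-node increase in total leaf SSE, down to the root tree.

    Collapsing a node raises the tree's total leaf SSE by
    sse(node) - sum of its descendant leaves' sse, spread over the
    (num leaves - 1) terminal nodes removed.
    """
    sequence = [copy.deepcopy(root)]
    while "left" in root:
        node = min(
            internal(root),
            key=lambda t: (t["sse"] - sum(l["sse"] for l in leaves(t)))
                          / (len(leaves(t)) - 1),
        )
        del node["left"], node["right"]  # collapse to a terminal node
        sequence.append(copy.deepcopy(root))
    return sequence

# Hand-built tree: the inner node is the weakest link (increase 1 < 3),
# so it is collapsed before the root.
tree = {"sse": 10.0,
        "left": {"sse": 3.0, "left": {"sse": 1.0}, "right": {"sse": 1.0}},
        "right": {"sse": 2.0}}
seq = weakest_link_prune(tree)
```

The returned sequence runs from the full tree down to the single-node tree; in practice each $T_\alpha$ is then picked out of this sequence and scored by cross-validation.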