GBDT (2017)

Slide Content

Paper report for the 财大 group meeting, December 2017, presenting the gradient boosted decision tree (GBDT) algorithm paper.

Greedy Function Approximation: A Gradient Boosting Machine

Paper Information

Abstract

Function estimation/approximation is viewed from the perspective of numerical optimization in function space, rather than parameter space. A connection is made between stagewise additive expansions and steepest-descent minimization. A general gradient descent “boosting” paradigm is developed for additive expansions based on any fitting criterion. Specific algorithms are presented for least-squares, least absolute deviation, and Huber-M loss functions for regression, and multiclass logistic likelihood for classification. Special enhancements are derived for the particular case where the individual additive components are regression trees, and tools for interpreting such “TreeBoost” models are presented. Gradient boosting of regression trees produces competitive, highly robust, interpretable procedures for both regression and classification, especially appropriate for mining less than clean data. Connections between this approach and the boosting methods of Freund and Shapire and Friedman, Hastie and Tibshirani are discussed.

Slides - gbdt.pptx

View online



Download materials: GBDT.tar

Further Reading - Regression Trees

Regression Trees (Further Reading)

Suppose our data consist of p inputs and one response, for each of N observations: $(x_i, y_i)$ for $i = 1, 2, \ldots, N$, with $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$.
The algorithm needs to automatically decide on the splitting variables and split points.
Suppose first that we have a partition into M regions $R_1, R_2, \ldots, R_M$, and we model the response as a constant $c_m$ in each region:

$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m).$$

If we adopt as our criterion minimization of the sum of squares $\sum (y_i - f(x_i))^2$, it is easy to see that the best $\hat{c}_m$ is just the average of $y_i$ in region $R_m$:

$$\hat{c}_m = \operatorname{ave}(y_i \mid x_i \in R_m).$$
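As a quick sanity check on this fact, here is a minimal numpy sketch (the toy data and variable names are illustrative, not from the paper) computing the SSE-optimal constant in each of two regions:

```python
import numpy as np

# Toy data: one input, responses that cluster by whether x <= 0.5.
x = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 0.9])
y = np.array([1.0, 1.2, 0.8, 3.0, 3.2, 2.8])

# The SSE-optimal constant in a region is the mean of the responses there.
mask = x <= 0.5
c1_hat = y[mask].mean()   # ave(y_i | x_i in R1)
c2_hat = y[~mask].mean()  # ave(y_i | x_i in R2)
```

Any other constant in a region can only increase the region's sum of squares, since the mean is the least-squares minimizer.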

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

$$R_1(j, s) = \{X \mid X_j \le s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}.$$

Then we seek the splitting variable j and split point s that solve

$$\min_{j,\, s} \left[ \min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2 \right].$$

For any choice j and s, the inner minimization is solved by

$$\hat{c}_1 = \operatorname{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \operatorname{ave}(y_i \mid x_i \in R_2(j, s)).$$

For each splitting variable, the determination of the split point s can be done very quickly and hence by scanning through all of the inputs, determination of the best pair (j,s) is feasible.
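The scan described above can be sketched in a few lines of numpy. This is a hedged illustration, not the book's code: the function name is mine, and I take midpoints between consecutive sorted feature values as the candidate split points (any point between two consecutive values gives the same partition).

```python
import numpy as np

def best_split(X, y):
    """Greedy scan over every feature j and candidate split point s.

    For each (j, s) the inner minimization is solved by the region
    means; return the (j, s, sse) triple minimizing the total SSE.
    """
    best = (None, None, np.inf)
    n, p = X.shape
    for j in range(p):
        # Midpoints between consecutive distinct values suffice as candidates.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left = y[X[:, j] <= s]
            right = y[X[:, j] > s]
            sse = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, s, sse)
    return best
```

For p features and N observations this scans O(pN) candidate pairs, each solved in closed form by the region means, which is what makes the greedy search feasible.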

Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions.
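The recursive partitioning just described can be sketched as a short function. This is a minimal self-contained illustration assuming numpy, nested dicts for nodes, and a simple minimum-node-size stopping rule; it is not the paper's exact procedure.

```python
import numpy as np

def grow(X, y, min_size=5):
    """Recursively split the data greedily, stopping at small nodes."""
    if len(y) <= min_size:
        return {"leaf": True, "c": y.mean(), "n": len(y)}
    # Greedy scan for the best (j, s) pair, as in the splitting step above.
    best = None
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            mask = X[:, j] <= s
            l, r = y[mask], y[~mask]
            sse = ((l - l.mean()) ** 2).sum() + ((r - r.mean()) ** 2).sum()
            if best is None or sse < best[2]:
                best = (j, s, sse)
    if best is None:  # all feature values identical: no split possible
        return {"leaf": True, "c": y.mean(), "n": len(y)}
    j, s, _ = best
    mask = X[:, j] <= s
    return {"leaf": False, "j": j, "s": s,
            "left": grow(X[mask], y[mask], min_size),
            "right": grow(X[~mask], y[~mask], min_size)}
```

Each recursive call repeats the same greedy search on its own subset of the data, so the tree is built depth-first, one split at a time.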

How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure.

Tree size is a tuning parameter governing the model’s complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree T0, stopping the splitting process only when some minimum node size (say 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.
We define a subtree $T \subset T_0$ to be any tree that can be obtained by pruning $T_0$, that is, collapsing any number of its internal (non-terminal) nodes. We index terminal nodes by $m$, with node $m$ representing region $R_m$. Let $|T|$ denote the number of terminal nodes in $T$. Letting

$$N_m = \#\{x_i \in R_m\}, \qquad \hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \qquad Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2,$$

we define the cost complexity criterion:

$$C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|.$$

The idea is to find, for each $\alpha$, the subtree $T_\alpha \subseteq T_0$ that minimizes $C_\alpha(T)$.

The tuning parameter $\alpha \ge 0$ governs the tradeoff between tree size and its goodness of fit to the data. Large values of $\alpha$ result in smaller trees $T_\alpha$, and conversely for smaller values of $\alpha$. As the notation suggests, with $\alpha = 0$ the solution is the full tree $T_0$. We discuss how to adaptively choose $\alpha$ below.
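The criterion itself is a one-liner once a tree is summarized by its terminal nodes. In the sketch below (my representation, not the paper's), a tree is just a list of $(N_m, Q_m)$ pairs, one per leaf; the toy trees illustrate how $\alpha$ trades fit against size:

```python
def cost_complexity(leaves, alpha):
    """C_alpha(T) = sum_m N_m * Q_m(T) + alpha * |T|.

    `leaves` is a list of (N_m, Q_m) pairs, one per terminal node.
    """
    return sum(n * q for n, q in leaves) + alpha * len(leaves)

# A deep tree that fits perfectly vs. a single-leaf tree with residual error.
deep = [(2, 0.0), (2, 0.0), (2, 0.0)]  # 3 leaves, zero training error
shallow = [(6, 0.5)]                    # 1 leaf, nonzero error
```

At $\alpha = 0$ the deep tree wins on pure fit; at a large enough $\alpha$ the penalty $\alpha|T|$ dominates and the shallow tree is preferred.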

For each $\alpha$ one can show that there is a unique smallest subtree $T_\alpha$ that minimizes $C_\alpha(T)$. To find $T_\alpha$ we use weakest-link pruning: we successively collapse the internal node that produces the smallest per-node increase in $\sum_m N_m Q_m(T)$, and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain $T_\alpha$. Estimation of $\alpha$ is achieved by five- or tenfold cross-validation: we choose the value $\hat{\alpha}$ to minimize the cross-validated sum of squares. Our final tree is $T_{\hat{\alpha}}$.
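The weakest-link sweep can be sketched on a hand-built tree. This is an illustrative toy, assuming a nested-dict representation where each node stores, under an "sse" key, its training sum of squares $N_m Q_m$ if that node were made terminal:

```python
import copy

def leaves(t):
    """Terminal nodes of a nested-dict tree."""
    return [t] if "left" not in t else leaves(t["left"]) + leaves(t["right"])

def internal(t):
    """Internal (non-terminal) nodes of the tree."""
    if "left" not in t:
        return []
    return [t] + internal(t["left"]) + internal(t["right"])

def weakest_link_prune(root):
    """Successively collapse the internal node giving the smallest
    per-node increase in total leaf SSE, down to the root tree.

    Collapsing a node raises the tree's total leaf SSE by
    sse(node) - sum of its descendant leaves' sse, spread over the
    (num leaves - 1) terminal nodes removed.
    """
    sequence = [copy.deepcopy(root)]
    while "left" in root:
        node = min(
            internal(root),
            key=lambda t: (t["sse"] - sum(l["sse"] for l in leaves(t)))
                          / (len(leaves(t)) - 1),
        )
        del node["left"], node["right"]  # collapse to a terminal node
        sequence.append(copy.deepcopy(root))
    return sequence

# Hand-built tree: the inner node is the weakest link (increase 1 < 3),
# so it is collapsed before the root.
tree = {"sse": 10.0,
        "left": {"sse": 3.0, "left": {"sse": 1.0}, "right": {"sse": 1.0}},
        "right": {"sse": 2.0}}
seq = weakest_link_prune(tree)
```

The returned sequence runs from the full tree down to the single-node tree; in practice each $T_\alpha$ is then picked out of this sequence and scored by cross-validation.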