XGBoost Derivation

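Objective

Goal: we want to learn a model that is both accurate and simple.
The objective function can therefore be defined as a training loss plus a complexity penalty on every tree: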
$$\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}\right)+\sum_{k} \Omega\left(f_{k}\right), \quad f_{k} \in \mathcal{F}$$
Because the model consists of trees rather than a weight vector, SGD cannot be used to find the function $f$. Instead, $f$ can be found with additive training (boosting).

Additive Training (Boosting)

Start from a constant prediction and add one new function in each training round:
$$\begin{array}{l}
\hat{y}_{i}^{(0)}=0 \\
\hat{y}_{i}^{(1)}=f_{1}\left(x_{i}\right)=\hat{y}_{i}^{(0)}+f_{1}\left(x_{i}\right) \\
\hat{y}_{i}^{(2)}=f_{1}\left(x_{i}\right)+f_{2}\left(x_{i}\right)=\hat{y}_{i}^{(1)}+f_{2}\left(x_{i}\right) \\
\hat{y}_{i}^{(t)}=\sum_{k=1}^{t} f_{k}\left(x_{i}\right)=\hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)
\end{array}$$
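
The additive scheme can be sketched in a few lines of Python. The "trees" below are hypothetical stand-in functions, not fitted models; the only point is that the prediction after round $t$ is the prediction after round $t-1$ plus $f_t(x)$.

```python
def additive_predictions(trees, x):
    """Return [y_hat^(0), y_hat^(1), ..., y_hat^(t)] for a single input x."""
    y_hat = 0.0                  # y_hat^(0) = 0
    history = [y_hat]
    for f in trees:              # round t adds f_t(x) to y_hat^(t-1)
        y_hat = y_hat + f(x)
        history.append(y_hat)
    return history


# Hypothetical stand-ins for three fitted trees.
trees = [lambda x: 0.5 * x, lambda x: 1.0, lambda x: -0.25]
history = additive_predictions(trees, 2.0)
print(history)
```

The last entry of `history` equals the sum of all three functions evaluated at the input, which is the final line of the recursion.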

How to choose the function added in each round

The objective function decides!
In round $t$, the prediction is $\hat{y}_{i}^{(t)}=\hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)$,
so the objective can be written as:
$$\begin{aligned}
Obj^{(t)} &=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t)}\right)+\sum_{k=1}^{t} \Omega\left(f_{k}\right) \\
&=\sum_{i=1}^{n} l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}\left(x_{i}\right)\right)+\Omega\left(f_{t}\right)+\text{constant}
\end{aligned}$$

The models from the first $t-1$ rounds are already fixed, so their complexity is fixed as well: $\sum_{k=1}^{t-1}\Omega(f_k) = \text{constant}$.

Taylor expansion of the objective

Second-order Taylor expansion
One variable:
$$f(x+\Delta x) \simeq f(x)+f^{\prime}(x) \Delta x+\frac{1}{2} f^{\prime \prime}(x) \Delta x^{2}$$
Two variables, expanding in the second argument only:
$$f(x, y+\Delta y) \simeq f(x,y) + \frac{\partial f(x,y)}{\partial y} \Delta y + \frac{1}{2}\frac{\partial^2 f(x, y)}{\partial y^2}\Delta y^2$$

Apply this to $l\left(y_{i}, \hat{y}_{i}^{(t-1)}+f_{t}(x_{i})\right)$, treating $f_t(x_i)$ as the increment $\Delta y$. With $g_{i}=\partial_{\hat{y}_{i}^{(t-1)}} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)$ and $h_{i}=\partial_{\hat{y}_{i}^{(t-1)}}^{2} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)$, the objective becomes:
$$Obj^{(t)} \simeq \sum_{i=1}^{n}\left[l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)+g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)+\text{constant}$$

After removing the constant terms (including $l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)$, which does not depend on $f_t$), the objective is:
$$\sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right)$$
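
To make $g_i$ and $h_i$ concrete, here is a minimal sketch for two common losses. The loss choices and function names are assumptions for illustration and are not part of the derivation: squared error $(y-\hat{y})^2$ gives $g = 2(\hat{y}-y)$, $h = 2$, and logistic loss on a raw score gives $g = p - y$, $h = p(1-p)$ with $p = \sigma(\hat{y})$. The sketch also compares the second-order approximation with the exact loss after a small step.

```python
import math


def squared_error(y, y_hat):
    return (y - y_hat) ** 2


def squared_error_grad_hess(y, y_hat):
    # First and second derivative of (y - y_hat)^2 with respect to y_hat.
    return 2.0 * (y_hat - y), 2.0


def logistic_loss(y, y_hat):
    # y is 0 or 1; y_hat is the raw score (logit), not a probability.
    return y * math.log(1 + math.exp(-y_hat)) + (1 - y) * math.log(1 + math.exp(y_hat))


def logistic_grad_hess(y, y_hat):
    p = 1.0 / (1.0 + math.exp(-y_hat))
    return p - y, p * (1.0 - p)


def second_order_approx(loss, grad_hess, y, y_hat_prev, f_t):
    """l(y, y_hat_prev) + g * f_t + 0.5 * h * f_t^2."""
    g, h = grad_hess(y, y_hat_prev)
    return loss(y, y_hat_prev) + g * f_t + 0.5 * h * f_t ** 2


y, y_hat_prev, f_t = 1.0, 0.3, 0.1
for name, loss, grad_hess in [
    ("squared", squared_error, squared_error_grad_hess),
    ("logistic", logistic_loss, logistic_grad_hess),
]:
    exact = loss(y, y_hat_prev + f_t)
    approx = second_order_approx(loss, grad_hess, y, y_hat_prev, f_t)
    print(name, exact, approx)
```

For squared error the expansion is exact because the loss is itself quadratic; for logistic loss the two values differ only by a third-order remainder.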

Defining the complexity of a tree

Write the mapping from a sample to its leaf score as:
$$f_t(x) = w_{q(x)}, \quad q(x) \in \{1,2,\dots,T\}$$

Here $q(x)$ is the index of the leaf that $x$ falls into, $w$ is the vector of leaf weights, and $T$ is the total number of leaves.

Define the complexity of a tree as:
$$\Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T}w_j^2$$
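
As a small illustration of $f_t(x) = w_{q(x)}$ and $\Omega(f_t)$, the sketch below uses a hypothetical three-leaf tree on a scalar input. The thresholds, leaf weights, and the values of $\gamma$ and $\lambda$ are made up for the example.

```python
def q(x):
    """Hypothetical tree structure: map a scalar input to a leaf index in {1, 2, 3}."""
    if x < 10:
        return 1
    if x < 20:
        return 2
    return 3


# Hypothetical leaf weights w_1, w_2, w_3.
w = {1: 2.0, 2: 0.1, 3: -1.0}


def f_t(x):
    # The tree's output is the weight of the leaf the sample lands in.
    return w[q(x)]


def tree_complexity(leaf_weights, gamma, lam):
    """Omega(f_t) = gamma * T + 0.5 * lambda * sum_j w_j^2."""
    weights = list(leaf_weights)
    T = len(weights)
    return gamma * T + 0.5 * lam * sum(wj * wj for wj in weights)


print(f_t(5), f_t(15), f_t(25))
print(tree_complexity(w.values(), gamma=1.0, lam=1.0))
```

With $\gamma = \lambda = 1$ the penalty is $3 + \frac{1}{2}(4 + 0.01 + 1)$: one unit per leaf plus half the squared norm of the leaf weights.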

Solving the objective

Group the samples by the leaf they fall into: $I_j = \left\{ i \mid q(x_i)=j \right\}$. Samples in the same leaf form one group, giving $T$ groups in total.

$$\begin{aligned}
Obj^{(t)} &\simeq \sum_{i=1}^{n}\left[g_{i} f_{t}\left(x_{i}\right)+\frac{1}{2} h_{i} f_{t}^{2}\left(x_{i}\right)\right]+\Omega\left(f_{t}\right) \\
&=\sum_{i=1}^{n}\left[g_{i} w_{q\left(x_{i}\right)}+\frac{1}{2} h_{i} w_{q\left(x_{i}\right)}^{2}\right]+\gamma T+\frac{1}{2}\lambda \sum_{j=1}^{T} w_{j}^{2} \\
&=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T
\end{aligned}$$

Let $G_{j}=\sum_{i \in I_{j}} g_{i}$ and $H_{j}=\sum_{i \in I_{j}} h_{i}$.

The objective then simplifies to
$$\begin{aligned}
Obj^{(t)} &=\sum_{j=1}^{T}\left[\left(\sum_{i \in I_{j}} g_{i}\right) w_{j}+\frac{1}{2}\left(\sum_{i \in I_{j}} h_{i}+\lambda\right) w_{j}^{2}\right]+\gamma T \\
&=\sum_{j=1}^{T}\left[G_{j} w_{j}+\frac{1}{2}\left(H_{j}+\lambda\right) w_{j}^{2}\right]+\gamma T
\end{aligned}$$

For each $w_j$ this is a quadratic function of a single variable (convex as long as $H_j+\lambda>0$). At
$$w_j^* = - \frac{G_j}{2 \times \frac{1}{2}(H_j+\lambda)} = -\frac{G_j}{H_j + \lambda}$$
the objective attains its minimum:
$$\begin{aligned}
Obj^{(t)} &= \sum_{j=1}^T\left[-\frac{G_j^2}{4 \cdot\frac{1}{2} (H_j+\lambda)}\right] + \gamma T \\
&= -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T
\end{aligned}$$
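
The closed-form solution can be checked numerically. The sketch below is an illustration with made-up gradient values and 0-based leaf indices (both are assumptions of the example). It sums $g_i$ and $h_i$ per leaf, computes $w_j^* = -G_j/(H_j+\lambda)$, and confirms that plugging $w^*$ into the quadratic objective gives $-\frac{1}{2}\sum_j G_j^2/(H_j+\lambda) + \gamma T$.

```python
def leaf_sums(g, h, leaf_of, T):
    """Sum g_i and h_i over the samples of each leaf (leaves indexed 0..T-1 here)."""
    G = [0.0] * T
    H = [0.0] * T
    for gi, hi, j in zip(g, h, leaf_of):
        G[j] += gi
        H[j] += hi
    return G, H


def optimal_weights(G, H, lam):
    # w_j* = -G_j / (H_j + lambda)
    return [-Gj / (Hj + lam) for Gj, Hj in zip(G, H)]


def objective(G, H, w, lam, gamma):
    # sum_j [G_j w_j + 0.5 (H_j + lambda) w_j^2] + gamma * T
    quad = sum(Gj * wj + 0.5 * (Hj + lam) * wj * wj for Gj, Hj, wj in zip(G, H, w))
    return quad + gamma * len(w)


def optimal_objective(G, H, lam, gamma):
    # -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T
    return -0.5 * sum(Gj * Gj / (Hj + lam) for Gj, Hj in zip(G, H)) + gamma * len(G)


# Made-up gradient statistics for four samples in two leaves.
g = [1.0, -2.0, 3.0, -1.0]
h = [1.0, 1.0, 1.0, 1.0]
leaf_of = [0, 0, 1, 1]
lam, gamma = 1.0, 0.5

G, H = leaf_sums(g, h, leaf_of, T=2)
w_star = optimal_weights(G, H, lam)
print(w_star)
print(objective(G, H, w_star, lam, gamma), optimal_objective(G, H, lam, gamma))
```

Moving any leaf weight away from `w_star` increases `objective`, which is a quick sanity check on the sign of $w_j^*$.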

Growing a tree

  • Start from the root node (all data in a single node) at depth 0
  • For every leaf node, try to split it into two leaves. The resulting reduction in the objective is:
    $$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
  • Keep splitting until the split condition is no longer met
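
The gain formula is the optimal objective of the unsplit leaf minus the optimal objective of the two new leaves. A minimal sketch with made-up values of $G$, $H$, $\lambda$, and $\gamma$ makes this explicit:

```python
def score(G, H, lam):
    # G^2 / (H + lambda): the structure-score term of a single leaf.
    return G * G / (H + lam)


def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Reduction in the optimal objective when one leaf is split into two."""
    parent = score(G_L + G_R, H_L + H_R, lam)
    return 0.5 * (score(G_L, H_L, lam) + score(G_R, H_R, lam) - parent) - gamma


# Made-up statistics for the left and right child.
G_L, H_L, G_R, H_R = -1.0, 2.0, 2.0, 2.0
lam, gamma = 1.0, 0.5

obj_before = -0.5 * score(G_L + G_R, H_L + H_R, lam) + gamma * 1   # one leaf
obj_after = -0.5 * (score(G_L, H_L, lam) + score(G_R, H_R, lam)) + gamma * 2  # two leaves
print(obj_before - obj_after, split_gain(G_L, H_L, G_R, H_R, lam, gamma))
```

The single $-\gamma$ in the formula comes from the leaf count going from one to two.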

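How to find the best split feature

  • For every feature, sort the samples by feature value
  • Try every feature value as a split point
  • Across all features and all feature values, choose the split with the largest gain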
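
The three steps above can be sketched as an exact greedy search. This is an illustration under stated assumptions, not the XGBoost library's implementation: features are passed as plain Python lists (one list per feature), thresholds are placed midway between adjacent distinct values, and the function names are hypothetical.

```python
def best_split_for_feature(values, g, h, lam, gamma):
    """Scan one feature left to right; return (best_gain, threshold)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(g), sum(h)
    parent = G * G / (H + lam)
    best_gain, best_threshold = float("-inf"), None
    G_L = H_L = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]              # running sums for the left child
        H_L += h[i]
        if values[i] == values[nxt]:
            continue             # no threshold fits between equal values
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = 0.5 * (values[i] + values[nxt])
    return best_gain, best_threshold


def best_split(columns, g, h, lam, gamma):
    """Return (gain, feature_index, threshold) of the best split over all features."""
    best = (float("-inf"), None, None)
    for k, values in enumerate(columns):
        gain, threshold = best_split_for_feature(values, g, h, lam, gamma)
        if gain > best[0]:
            best = (gain, k, threshold)
    return best


# Made-up data: feature 0 separates the gradients cleanly, feature 1 does not.
columns = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
g = [-2.0, -2.0, 2.0, 2.0]
h = [1.0, 1.0, 1.0, 1.0]
print(best_split(columns, g, h, lam=1.0, gamma=0.0))
```

Keeping running sums of $G_L$ and $H_L$ means each feature costs one sort plus one linear scan, rather than recomputing the sums for every candidate threshold.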

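Pruning and regularization

  • A split is accepted only if its gain is not negative. The gain drops below zero when the reduction in training loss is smaller than the penalty $\gamma$ for the extra leaf, so this rule balances training loss against tree complexity.
    $$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
  • Pre-stopping: stop growing when the best split has negative gain. (That split might, however, have enabled useful later splits.)
  • Post-pruning: grow the tree to a maximum depth, then prune every leaf split whose gain is negative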
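
The difference between the two strategies can be shown on a toy tree. The sketch below is hypothetical: trees are nested dicts, leaves carry only their $G$ and $H$ sums, and all numbers are made up. In `tree_a` the root split has negative gain but enables a useful split underneath, so pre-stopping would never grow it while post-pruning keeps it.

```python
def leaf(G, H):
    return {"G": G, "H": H}


def split(left, right, lam, gamma):
    """Build an internal node and record the gain of splitting it."""
    G, H = left["G"] + right["G"], left["H"] + right["H"]
    s_left = left["G"] ** 2 / (left["H"] + lam)
    s_right = right["G"] ** 2 / (right["H"] + lam)
    gain = 0.5 * (s_left + s_right - G ** 2 / (H + lam)) - gamma
    return {"G": G, "H": H, "gain": gain, "left": left, "right": right}


def is_leaf(node):
    return "left" not in node


def prune(node):
    """Bottom-up: collapse a split when its gain is negative and both children are leaves."""
    if is_leaf(node):
        return node
    node = dict(node, left=prune(node["left"]), right=prune(node["right"]))
    if node["gain"] < 0 and is_leaf(node["left"]) and is_leaf(node["right"]):
        return leaf(node["G"], node["H"])
    return node


def count_leaves(node):
    if is_leaf(node):
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])


lam, gamma = 1.0, 0.2
# Root split has gain -0.2, but the split below it has a large positive gain.
tree_a = split(leaf(0.0, 2.0), split(leaf(-3.0, 1.0), leaf(3.0, 1.0), lam, gamma), lam, gamma)
# Root split is useful, but the split below it has negative gain.
tree_b = split(leaf(-4.0, 2.0), split(leaf(2.0, 1.0), leaf(2.0, 1.0), lam, gamma), lam, gamma)

print(tree_a["gain"], count_leaves(prune(tree_a)))
print(tree_b["right"]["gain"], count_leaves(prune(tree_b)))
```

Because pruning runs bottom-up, a negative-gain split survives whenever something useful remains below it, which is exactly the case pre-stopping cannot handle.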

XGBoost algorithm steps

  • In each round, create a new empty tree $f_t(x)$
  • Compute the first- and second-order gradient statistics of every sample
    $$g_{i}=\partial_{\hat{y}_{i}^{(t-1)}} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right), \quad h_{i}=\partial_{\hat{y}_{i}^{(t-1)}}^{2} l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right)$$
  • Compute the gain of every candidate split (each feature, each feature value)
    $$Gain=\frac{1}{2}\left[\frac{G_{L}^{2}}{H_{L}+\lambda}+\frac{G_{R}^{2}}{H_{R}+\lambda}-\frac{\left(G_{L}+G_{R}\right)^{2}}{H_{L}+H_{R}+\lambda}\right]-\gamma$$
  • Keep growing the tree until the split condition is no longer met
  • Add this round's tree $f_t(x)$ to the model
    $$\hat{y}_{i}^{(t)}=\hat{y}_{i}^{(t-1)}+\epsilon f_{t}\left(x_{i}\right)$$

$\epsilon$ is called the step size (shrinkage): each round deliberately stops short of the full optimization and leaves part of it to later rounds, which helps prevent overfitting.
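
Putting the steps together, here is a deliberately small end-to-end sketch. It is not the XGBoost library: it assumes squared-error loss, a single scalar feature, and trees of depth 1 (stumps), and every function name is hypothetical. It follows the loop above: compute $g_i, h_i$, pick the split with the largest gain, set leaf weights to $-G/(H+\lambda)$, and add the tree scaled by $\epsilon$.

```python
def fit_stump(x, g, h, lam, gamma):
    """Grow a depth-1 tree on one scalar feature from gradient statistics."""
    G, H = sum(g), sum(h)
    order = sorted(range(len(x)), key=lambda i: x[i])
    best_gain, best = 0.0, None          # split only if the gain is positive
    G_L = H_L = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += g[i]
        H_L += h[i]
        if x[i] == x[nxt]:
            continue
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            # threshold, left leaf weight, right leaf weight
            best = (0.5 * (x[i] + x[nxt]), -G_L / (H_L + lam), -G_R / (H_R + lam))
    if best is None:                     # no worthwhile split: a single leaf
        w = -G / (H + lam)
        return lambda v: w
    threshold, w_L, w_R = best
    return lambda v: w_L if v < threshold else w_R


def boost(x, y, rounds, eps, lam, gamma):
    y_hat = [0.0] * len(x)               # y_hat^(0) = 0
    trees = []
    for _ in range(rounds):
        # Squared error (y - y_hat)^2: g = 2 (y_hat - y), h = 2.
        g = [2.0 * (p - t) for p, t in zip(y_hat, y)]
        h = [2.0] * len(x)
        f_t = fit_stump(x, g, h, lam, gamma)
        trees.append(f_t)
        y_hat = [p + eps * f_t(v) for p, v in zip(y_hat, x)]   # shrinkage step
    return trees, y_hat


def predict(trees, eps, v):
    return sum(eps * f(v) for f in trees)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]
trees, y_hat = boost(x, y, rounds=50, eps=0.3, lam=1.0, gamma=0.0)
mse = sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)
print(mse, predict(trees, 0.3, 2.0), predict(trees, 0.3, 5.0))
```

With $\epsilon = 0.3$ each round removes only part of the remaining residual, so the fit approaches the targets gradually over many rounds instead of in a single step.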
