Online Learning 5

Online Subgradient Descent

在线梯度下降算法如下图所示

OGD

其中,每一步迭代的时候,根据梯度(或次梯度)更新后,将新得到的权重投影到权重的可行域上。

定理1:如果\(w'\in\mathcal{W}\)\(w=\prod_{\mathcal{W}}(w')\)\(w^*\in\mathcal{W}\),那么\((w'-w)(w^*-w)\leq0\)

证明:

假设存在\(\hat{w}\in\mathcal{W}\),使得\((w'-\hat{w})(w^*-\hat{w})>0\)。定义 \[ \begin{aligned} Z(\lambda) &=(1-\lambda)\hat{w}+\lambda w^* &\lambda\in[0,1]\\ &=\hat{w}+\lambda(w^*-\hat{w}) \end{aligned} \] 注意,因为\(\mathcal{W}\)是凸集,所以有\(Z(\lambda)\in\mathcal{W}\)\[ \begin{aligned} \|w'-Z(\lambda)\|^2 &=\|w'-\hat{w}-\lambda(w^*-\hat{w})\|^2\\ &=\|w'-\hat{w}\|^2+\lambda^2\|w^*-\hat{w}\|^2 -2\lambda(w'-\hat{w})(w^*-\hat{w}) \end{aligned} \]

\(h(\lambda)=\lambda^2\|w^*-\hat{w}\|-2\lambda(w'-\hat{w})(w^*-\hat{w})\),该函数的根为\(0\)\(\frac{(w'-\hat{w})(w^*-\hat{w})}{\|w^*-\hat{w}\|^2}\),当\(\lambda\in(0,\frac{(w'-\hat{w})(w^*-\hat{w})}{\|w^*-\hat{w}\|^2})\)时,\(h(\lambda)<0\),所以有 \[ \|w'-Z(\lambda)\|^2\leq\|w'-\hat{w}\|^2 \] 对于任何使得\(h(\lambda)<0\)\(\lambda\),都会产生\(Z(\lambda)\in\mathcal{W}\)相比于\(\hat{w}\)更接近\(w'\),原假设不成立,该定理得证。\(\Box\)

定理2:\(w'\notin\mathcal{W}\)\(w=\prod_{\mathcal{W}}(w')\)\(w^*\in\mathcal{W}\),那么\(\frac{1}{2}\|w'-w\|^2+\frac{1}{2}\|w-w^*\|^2\leq\frac{1}{2}\|w^*-w'\|^2\)

证明: \[ \begin{aligned} &\frac{1}{2}\|w'\|^2+\frac{1}{2}\|w\|^2-w'\cdot w +\frac{1}{2}\|w\|^2+\frac{1}{2}\|w^*\|^2-w\cdot w^* -\frac{1}{2}\|w^*\|^2-\frac{1}{2}\|w'\|^2+w^*\cdot w'\\ =&w\cdot w-w'\cdot w-w\cdot w^*+w^*\cdot w'\\ =&w'(w^*-w)-w(w^*-w)\\ =&(w'-w)(w^*-w)\\ \leq&0 \end{aligned} \] 证毕。\(\Box\)

接下来我们来证明一下潜在函数(potential function)为\(\Phi(w,w^*)=\frac{1}{2}\|w-w^*\|^2\),选择\(w^*\in\mathcal{W}\),对于梯度下降,我们有 \[ \begin{aligned} &\Phi(w_t,w^*)-\Phi(w'_{t+1},w^*)\\ =&\frac{1}{2}\|w_t-w^*\|^2-\frac{1}{2}\|w'_{t+1}-w^*\|^2\\ =&\frac{1}{2}\|w_t-w^*\|^2-\frac{1}{2}\|w_t-w^*-\eta_tg_t\|^2\\ =&\frac{1}{2}\|w_t-w^*\|^2-\frac{1}{2}\|w_t-w^*\|^2+\eta_t(w_t-w^*)\cdot g_t-\frac{1}{2}\eta_t^2\|g_t\|^2\\ =&-\frac{1}{2}\eta_t^2\|g_t\|^2+\eta_t(w_t-w^*)\cdot g_t \end{aligned} \] 其中后一项 \[ \begin{aligned} (w_t-w^*)\cdot g_t&=\frac{1}{\eta_t}(\Phi(w_t,w^*)-\Phi(w'_{t+1},w^*))+\frac{1}{2}\eta_t\|g_t\|^2\\ &\leq\frac{1}{\eta_t}( \Phi(w_t,w^*)-\Phi(w'_{t+1},w^*)+\Phi(w'_{t+1},w^*)-\Phi(w_{t+1},w^*))+\frac{1}{2}\eta_t\|g_t\|^2\\ \sum^T_{t=1}f_t(w_t)-f_t(w^*) &\leq \sum^T_{t=1}w_t\cdot g_t-w^*\cdot g_t\\ &\leq\sum^T_{t=1}\frac{1}{\eta_t}(\Phi(w_t,w^*)-\Phi(w_{t+1},w^*)) +\sum^T_{t=1}\frac{1}{2}\eta_t\|g_t\|^2\\ &=\frac{1}{\eta_1}\Phi(w_1,w^*)-\frac{1}{\eta_T}\Phi(w_{T+1},w^*) +\sum^T_{t=2}(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}})\Phi(w_t,w^*) +\sum^T_{t=1}\frac{1}{2}\eta_t\|g_t\|^2\\ \end{aligned} \]

此处我们假设\(f_t\)是G-Lipschitz的,则\(\|g_t\|^2\leq G^2\)。假设\(\|w^*\|\leq R,\|w_t\|\leq R, \forall t\),则\(\frac{1}{2}\|w_t-w^*\|^2\leq2R^2\),所以上式有 \[ \begin{aligned} \sum^T_{t=1}f_t(w_t)-f_t(w^*) &\leq\frac{2}{\eta_1}R^2-\frac{1}{\eta_T}\Phi(w_{T+1},w^*) +2\sum^T_{t=2}(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}})R^2 +\sum^T_{t=1}\frac{1}{2}\eta_t\|g_t\|^2\\ \end{aligned} \] 其中\(\frac{1}{\eta_T}\Phi(w_{T+1},w^*)\geq0\),所以在计算上界的时候可以消去这一项,有 \[ \begin{aligned} \sum^T_{t=1}f_t(w_t)-f_t(w^*) &\leq\frac{2}{\eta_1}R^2 +2\sum^T_{t=2}(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}})R^2 +\sum^T_{t=1}\frac{1}{2}\eta_tG^2\\ &=2R^2(\frac{1}{\eta_1}+\sum^T_{t=2}(\frac{1}{\eta_{t}}-\frac{1}{\eta_{t-1}})) +\sum^T_{t=1}\frac{1}{2}\eta_tG^2\\ &=2R^2\frac{1}{\eta_T}+\frac{G^2}{2}\sum^T_{t=1}\frac{1}{2}\eta_t \end{aligned} \]\(\eta_t=\eta\cdot\frac{1}{\sqrt{T}}\)\[ \begin{aligned} \sum^T_{t=1}f_t(w_t)-f_t(w^*) &\leq2R^2\frac{1}{\eta_T}+\frac{G^2}{2}\sum^T_{t=1}\frac{1}{2}\eta_t\\ &=\frac{2R^2\sqrt{T}}{\eta}+\frac{G^2}{2}T\eta\frac{1}{\sqrt{T}}\\ \end{aligned} \]\(\eta=\frac{2R}{G}\)\[ \begin{aligned} \sum^T_{t=1}f_t(w_t)-f_t(w^*) &\leq\frac{2R^2\sqrt{T}}{\eta}+\frac{G^2}{2}T\eta\frac{1}{\sqrt{T}}\\ &=2RG\sqrt{T} \end{aligned} \]\(T\)不可知的时候 \[ \begin{aligned} \text{Regret}&\leq\frac{2R^2}{\eta_T}+\frac{G^2}{2}\sum^T_{t=1}\eta_t\\ &=\frac{2R^2\sqrt{T}}{\eta}+\frac{G^2}{2}\eta\sum^T_{t=1}\frac{1}{\sqrt{t}}\\ &\leq\frac{2R^2\sqrt{T}}{\eta}+\frac{G^2}{2}\eta\sqrt{T} \end{aligned} \]

\(\eta=\frac{\sqrt{2}R}{G}\)\[ \begin{aligned} \text{Regret} &\leq\frac{2R^2\sqrt{T}}{\eta}+\frac{G^2}{2}\eta\sqrt{T}\\ &=2\sqrt{2}RG\sqrt{T} \end{aligned} \]