在机器学习领域很多优化问题都是黑箱优化问题，即目标函数 $f(x)$ 是一个黑箱函数（这里考虑最大化问题，Black box optimization also typically requires that all dimensions have bounds on the search space），我们可能并不知道函数的解析表达式或者不知道函数的导数信息，我们对函数值的估计只能是通过采样得到的含噪声干扰的函数值 $y_{sample}=f(x)+\epsilon$，$\epsilon$ 假设为高斯。（这里需要假设目标函数是Lipschitz-continuous），即存在 $C$，对于任意 $x_1,x_2$ 有

$$ ||f(x_1)-f(x_2)||\le C ||x_1-x_2|| $$

如果函数 $f$ 便于计算我们可以通过在许多点上进行采样，比如网格搜索，随机搜索或者数值梯度估计来寻找函数的最值。但是经常遇到的问题是函数的 evaluation 是很困难的，比如深度神经网络的超参数整定。这时候贝叶斯优化就派的上用场了。贝叶斯优化考虑对函数 $f$ 的先验，并根据从 $f$ 中进行采样来得到 $f$ 的后验。用于近似目标函数的模型称为 surrogate 模型，用于指导函数向最优解方向采样的函数称为 acquisition 函数。

Surrogate model

一个常见的 surrogate model 就是高斯过程。

$$ f(x) \sim GP(m(x),k(x,x’)) $$

A GP is an extension of the multivariate Gaussian distribution to an infinite- dimension stochastic process for which any finite combination of dimensions will be a Gaussian distribution.

均值函数常设为0，核函数可以用比如 squared exponential function

$$ k(x_i,x_j)=exp(-\frac{1}{2}||x_i-x_j||^2) $$

Two points that are close together can be expected to have a very large influence on each other, whereas distant points have almost none.

预测分布为：

$$ P(f_{t+1}|D_{1:t},x_{t+1})=\mathcal{N}(u_t(x_{t+1}),\sigma_t^2(x_{t+1})) $$

其中

$$ u_t(x_{t+1})=k^TK^{-1}f_{1:t}\\ \sigma_t^2(x_{t+1})=k(x_{t+1},x_{t+1})-k^TK^{-1}k $$

Acquisition functions

Acquisition 函数在 exploitation 和 exploration 之间进行权衡，Exploitation 意味着向 surrogate 模型预测值更优的点进行采样， exploration 意味着向预测不确定性更高的点进行采样。 A somewhat more satisfying alternative acquisition function would be one that takes into account not only the probability of improvement, but the magnitude of the improvement a point can potentially yield.

通过最大化 Acquisition 函数来确定下一个采样点。

比如目标函数 $f$ 的采样点为：

$$ x_t = \arg \max_x u(x|D_{1:t-1}) $$

其中 $u$ 是 acquisition 函数，$D_{1:t-1}=[(x_1,y_1),…,(x_{t-1},y_{t-1})]$ 为之前 $t-1$ 个采样值。

常见的 acquisition 函数有 maximum probability of improvement (MPI), expected improvement (EI) 和 upper confidence bound (UCB)。

优化算法描述

贝叶斯优化过程如下步骤，在时刻 $t=1,2,…$ 重复：

通过优化acquisition 函数得到下一个采样点：$x_t=\arg \max_x u(x|D_{1:t-1})$
从目标函数中获得一个可能的含噪采样 $y_t = f(x_t)+\epsilon_t$
将采样值加入之前的数据集 $D_{1:t}=[D_{1:t-1},(x_t,y_t)]$ 并更新高斯过程

Expected improvement

Expected improvement 定义为

$$ EI(x)=\mathbb{E} \max (f(x)-f(x^+),0) $$ 其中 $f(x^+)$ 是目前的最佳采样，$x^+$ 是最佳采样点，即 $x^+ = \arg \max_{x_i\in x_{1:t}} f(x_i)$。则 Expected improvement （通过积分）可以解析地表示为

$$ \begin{aligned} \mathbb{E}(I) &= \int_{I=0}^{I=\infty} I * \frac{1}{\sqrt{2\pi} \sigma(x)}exp(-\frac{(\mu(x)-f(x^+)-I)^2}{2\sigma^2(x)}) d I\\ &=\int_{z=-\frac{(\mu(x)-f(x^+) )}{\sigma(x)}}^{\infty} (\sigma(x) z + u(x)-f(x^+)) \frac{1}{\sqrt{2\pi} \sigma(x)} exp(-\frac{z^2}{2}) \sigma(x) d z\\ &= \sigma(x) \int_{z=-\frac{(\mu(x)-f(x^+) )}{\sigma(x)}}^{\infty} z \frac{1}{\sqrt{2\pi}} exp(-\frac{z^2}{2}) d z \\ &+ (u(x)-f(x^+)) \int_{Z=-\frac{(\mu(x)-f(x^+) )}{\sigma(x)}}^{\infty} \frac{1}{\sqrt{2\pi}} exp(-\frac{Z^2}{2}) d z\\ &= (u(x)-f(x^+)) \Phi(\frac{\mu(x)-f(x^+)}{\sigma(x)}) + \sigma(x) \phi(\frac{\mu(x)-f(x^+)}{\sigma(x)}) \end{aligned} $$

ps: 对于标准正态分布有 $\phi’(z)=-z\phi(z)$

引入 a trade-off parameter $\xi \ge 0$ 来控制全局搜索和局部优化，则有：

如果 $\sigma(x)>0$

$$ EI(x) = (\mu(x)-f(x^+)-\xi)\Phi(Z) + \sigma(x)\phi(Z); $$

如果 $\sigma(x)=0$ $$ EI(x)=0 $$ 其中 $Z = \frac{\mu(x)-f(x^+)-\xi}{\sigma(x)}$ (当 $\sigma(x)>0$) 或 $0$ (当 $\sigma(x)=0$)；$\mu(x),\sigma(x)$ 分别是高斯过程在 $x$ 处的后验均值的标准差，$\Phi,\phi$ 分别是标准正态分布的 CDF 和 PDF；EI 中的第一项求和项表示 exploitation 第二项求和项表示 exploration。

参数 $\xi$ 决定了在优化过程中 exploration 的程度，越大的 $\xi$ 意味着更多地进行 exploration，换言之，$\xi$ 越大，后验均值部分的贡献相对方差部分的贡献减小 ($\xi$ 的建议值为 $0.01$)。

「机器学习」贝叶斯优化

Surrogate model

Acquisition functions

优化算法描述

Expected improvement

参考文献