Dropout

10 March 2020

cs nlp dropout regularization

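Dropout was presented in a 2014 paper by Srivastava et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". The idea is simple: during training, randomly switch off some neurons so that they cannot rely on each other too heavily (this "prevents units from co-adapting too much"). At inference, all neurons are switched back on, which gives a cheap approximation to averaging the predictions of the many smaller networks sampled during training. Used this way, dropout can substantially reduce overfitting.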

How it works

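It sounds simple: each neuron is kept with probability $p$. Behind the scenes, though, this amounts to model combination with weight sharing. Suppose the network has $n$ neurons, each of which can be on or off. There are then $2^{n}$ possible sub-networks, each called a thinned network. Every training step samples one of them, so training effectively trains a collection of $2^{n}$ thinned networks. Because all thinned networks use the same underlying weights, this can be viewed as weight sharing.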
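To make the counting concrete, here is a minimal Python sketch. It is not taken from the paper, and the helper names `all_masks` and `sample_mask` are made up for illustration. It enumerates every on/off pattern for a tiny layer and then samples one pattern, roughly the way a single training step would.

```python
import itertools
import random


def all_masks(n):
    """Return every on/off pattern for n units: one tuple of 0/1 per thinned network."""
    return list(itertools.product([0, 1], repeat=n))


def sample_mask(n, p, rng):
    """Sample one thinned network: each unit is kept (1) independently with probability p."""
    return tuple(1 if rng.random() < p else 0 for _ in range(n))


n = 3
masks = all_masks(n)
print(len(masks))  # 8, i.e. 2**n patterns

rng = random.Random(0)  # fixed seed only so the run is repeatable
mask = sample_mask(n, p=0.5, rng=rng)
print(mask in masks)  # True: every sampled pattern is one of the enumerated ones
```

The enumeration is only practical for very small $n$. In a real network the patterns are never listed, only sampled one at a time.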

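At test time we would ideally average the predictions of all $2^{n}$ thinned networks, but that is infeasible. Instead we scale the weights down: each neuron's outgoing weights are multiplied by $p$, the probability that the neuron was kept during training. This lets a single network stand in for the $2^{n}$ thinned networks as an approximation of their average.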
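The following sketch checks the scaling rule on the simplest possible case, a single linear unit with no activation function, where the average over all thinned networks can be computed exactly. The weights, inputs, and function names are illustrative assumptions, not values from the paper.

```python
import itertools
import math


def mask_probability(mask, p):
    """Probability of one on/off pattern when each unit is kept independently with probability p."""
    kept = sum(mask)
    return p ** kept * (1 - p) ** (len(mask) - kept)


def averaged_output(w, y, p):
    """Exact average of a linear unit's output over all 2**n thinned networks."""
    total = 0.0
    for mask in itertools.product([0, 1], repeat=len(y)):
        thinned = sum(w_j * r_j * y_j for w_j, r_j, y_j in zip(w, mask, y))
        total += mask_probability(mask, p) * thinned
    return total


def scaled_output(w, y, p):
    """Output of the single network whose weights are multiplied by p."""
    return sum(p * w_j * y_j for w_j, y_j in zip(w, y))


w = [0.5, -1.0, 2.0]   # illustrative weights
y = [1.0, 3.0, -2.0]   # illustrative inputs
p = 0.8

print(math.isclose(averaged_output(w, y, p), scaled_output(w, y, p)))  # True
```

The two values agree here because the unit is linear, so the expectation passes straight through the sum. Once a nonlinearity sits between layers, weight scaling is in general only an approximation to the true average.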

Formulas

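The passage below is quoted directly from the original paper, because the meaning of each symbol is most precise in the paper's own definitions.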

Consider a neural network with $L$ hidden layers. Let $l \in \left \{ 1, \cdots , L \right \}$ index the hidden layers of the network. Let $z^{(l)}$ denote the vector of inputs into layer $l$, $y^{(l)}$ denote the vector of outputs from layer $l$ ($y^{(0)} = \mathbf{x}$ is the input). $W^{(l)}$ and $\mathbf{b}^{(l)}$ are the weights and biases at layer $l$. The feed-forward operation of a standard neural network can be described as (for $l \in \left \{ 0, \cdots , L-1 \right \}$ and any hidden unit $i$)

\[z_{i}^{(l+1)} = w_{i}^{(l+1)} y^{(l)} + b_{i}^{(l+1)},\] \[y_{i}^{(l+1)} = f(z_{i}^{(l+1)}),\]

where $f$ is any activation function, for example the sigmoid function $f(x) = 1/(1 + \exp(-x))$.

With dropout, the feed-forward operation becomes

\[r_{j}^{(l)} \sim \mathrm{Bernoulli}(p),\] \[\tilde{y}^{(l)} = r^{(l)} \ast y^{(l)},\] \[z_{i}^{(l+1)} = w_{i}^{(l+1)} \tilde{y}^{(l)} + b_{i}^{(l+1)},\] \[y_{i}^{(l+1)} = f(z_{i}^{(l+1)}).\]
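The equations can be read as a short program. The sketch below is a plain-Python illustration for one layer, assuming $f$ is the sigmoid and treating $\ast$ as an element-wise product. The function names (`affine`, `train_forward`, `inference_forward`) and the example numbers are hypothetical. `train_forward` follows the dropout equations step by step, and `inference_forward` applies the test-time rule of multiplying the weights by $p$.

```python
import math
import random


def sigmoid(x):
    """Activation f, assumed here to be the sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, y, b, scale=1.0):
    """Compute z_i = scale * (w_i . y) + b_i for every unit i in the layer."""
    return [scale * sum(w_ij * y_j for w_ij, y_j in zip(w_i, y)) + b_i
            for w_i, b_i in zip(W, b)]


def train_forward(y, W, b, p, rng):
    """Training-time pass: mask the layer's inputs, then apply weights, bias and f."""
    r = [1.0 if rng.random() < p else 0.0 for _ in y]   # r_j ~ Bernoulli(p)
    y_tilde = [r_j * y_j for r_j, y_j in zip(r, y)]     # element-wise product r * y
    return [sigmoid(z_i) for z_i in affine(W, y_tilde, b)]


def inference_forward(y, W, b, p):
    """Test-time pass: no mask, weights multiplied by the keep probability p."""
    return [sigmoid(z_i) for z_i in affine(W, y, b, scale=p)]


# Illustrative layer: 3 inputs, 2 units.
y = [1.0, -2.0, 0.5]
W = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
b = [0.1, -0.2]

rng = random.Random(0)  # fixed seed only so the run is repeatable
print(train_forward(y, W, b, p=0.5, rng=rng))   # changes with the sampled mask
print(inference_forward(y, W, b, p=0.5))        # deterministic
```

Note that $p$ here is the probability of keeping a unit, as in the quoted equations. Many libraries instead take the probability of dropping a unit and rescale activations during training rather than weights at test time, so the parameter should be checked before transferring values.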

