0. Symbols

$n$: dimension of $x$

$m$: number of samples

$i$: index of a sample

$x^{(i)}$: the $x$ of the $i^{\text{th}}$ sample

$\alpha$: learning rate

1. Binary Classification

a. Logistic Regression

Given an input $x$, we want to output a prediction $\hat{y} = P(y = 1 \mid x)$, where we want $0 \leq \hat{y} \leq 1$.

We can use the sigmoid function:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where

$$z = \omega^{T}x + b$$

To write $z$ in a single compact format, we set $\Theta$ as follows:

$$\Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$

where $\theta_0$ is $b$,

and the matrix

$$\begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

is $\omega$.

With $x_0 = 1$ prepended to $x$, we can write $z$ as $\Theta^{T}x$.

So, the final logistic regression can be written as:

$$\hat{y} = \sigma(\Theta^{T}x) = \frac{1}{1 + e^{-\Theta^{T}x}}$$
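A minimal NumPy sketch of this forward pass (the names `theta`, `X`, and `predict` are my own, and `X` is assumed to already have the leading column of ones):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """Logistic regression forward pass: y_hat = sigma(Theta^T x).

    theta : (n + 1,) parameter vector, theta[0] plays the role of b.
    X     : (m, n + 1) samples, each row prepended with x_0 = 1.
    """
    return sigmoid(X @ theta)

# Tiny usage example with made-up numbers.
theta = np.array([0.0, 1.0, -2.0])          # [b, w_1, w_2]
X = np.array([[1.0, 0.5, 0.1],              # x_0 = 1 prepended
              [1.0, -1.0, 0.3]])
print(predict(theta, X))                    # values in (0, 1)
```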

b. Cost function

The loss function for a single sample is:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1 - y)\log{(1 - \hat{y})}\right)$$

If $y = 1$, then $L(\hat{y}, y) = -\log{\hat{y}}$; since we want a lower loss, $\hat{y}$ should be close to 1.

If $y = 0$, then $L(\hat{y}, y) = -\log{(1 - \hat{y})}$; since we want a lower loss, $\hat{y}$ should be close to 0.

So, $L(\hat{y}, y)$ can be used as the loss function.

Then, we can get the cost function by averaging the loss function over all samples:

$$J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

c. Conclusion

Logistic Regression:

$$\hat{y} = \sigma(\omega^{T}x + b), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss function:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1 - y)\log{(1 - \hat{y})}\right)$$

Cost function:

$$J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)})\log{(1 - \hat{y}^{(i)})}\right]$$
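A short NumPy sketch of the cost function above (the `eps` clipping is my own addition to keep `log` away from 0; it is not part of the formula):

```python
import numpy as np

def cost(y_hat, y, eps=1e-12):
    """Cross-entropy cost J = (1/m) * sum of per-sample losses.

    y_hat : (m,) predicted probabilities in (0, 1)
    y     : (m,) true labels in {0, 1}
    eps   : small constant to avoid log(0) (my addition)
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(cost(y_hat, y))   # low when y_hat agrees with y
```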

2. Gradient Descent

Repeat {

$\omega = \omega - \alpha\frac{\partial J(\omega, b)}{\partial \omega}$;

$b = b - \alpha\frac{\partial J(\omega, b)}{\partial b}$;

}

When the values of $\omega$ and $b$ barely change between updates, we have reached the lowest point.

The $\omega$ and $b$ at that point are the optimal solution we need.
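To illustrate the update rule on its own, here is a 1-D sketch that minimizes $J(w) = (w - 3)^2$; the function, learning rate, and tolerance are arbitrary choices of mine:

```python
def gradient_descent_1d(grad, w0=0.0, alpha=0.1, steps=100, tol=1e-8):
    """Repeat w := w - alpha * dJ/dw until w barely changes."""
    w = w0
    for _ in range(steps):
        w_new = w - alpha * grad(w)
        if abs(w_new - w) < tol:      # "values barely change" -> converged
            break
        w = w_new
    return w

# J(w) = (w - 3)^2  =>  dJ/dw = 2 * (w - 3); minimum at w = 3.
print(gradient_descent_1d(lambda w: 2 * (w - 3)))   # ~3.0
```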

3. How to Calculate the Gradients

Calculate the derivative of $L(\hat{y}, y)$

First, we calculate $\frac{\partial L(\hat{y}, y)}{\partial \hat{y}}$:

$$\frac{\partial L(\hat{y}, y)}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

Then, we calculate $\frac{\partial \hat{y}}{\partial z}$:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$$

Finally, we can get $\frac{\partial L(\hat{y}, y)}{\partial z}$ by the chain rule:

$$\frac{\partial L(\hat{y}, y)}{\partial z} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y$$

Then, since $\frac{\partial z}{\partial w_i} = x_i$, we can get $\frac{\partial L(\hat{y}, y)}{\partial w_i}$:

$$\frac{\partial L(\hat{y}, y)}{\partial w_i} = \frac{\partial L(\hat{y}, y)}{\partial z} \cdot \frac{\partial z}{\partial w_i} = x_i(\hat{y} - y)$$
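As a sanity check on $\frac{\partial L}{\partial z} = \hat{y} - y$, a finite-difference comparison (entirely my own illustration; the point $z = 0.7$, $y = 1$ is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Per-sample loss as a function of z, with y_hat = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6
numeric  = (loss(z + h, y) - loss(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y          # the dL/dz = y_hat - y formula
print(numeric, analytic)           # the two should agree closely
```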

Calculate the derivative of $J(w, b)$

Since $J(w, b)$ is the average of the per-sample losses, its derivatives are the averages of the per-sample derivatives:

$$\frac{\partial J(w, b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial w}, \qquad \frac{\partial J(w, b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial b}$$

Algorithm to calculate it:

$J = 0;\ \partial w_1 = 0;\ \partial w_2 = 0;\ \dots;\ \partial w_n = 0;\ \partial b = 0$

$For\ i = 1\ to\ m:$

$\quad z^{(i)} = \omega^{T}x^{(i)} + b;\quad \hat{y}^{(i)} = \sigma(z^{(i)})$

$\quad J\ += -\left[y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)})\log{(1 - \hat{y}^{(i)})}\right]$

$\quad \partial z^{(i)} = \hat{y}^{(i)} - y^{(i)}$

$\quad \partial w_1\ += x_1^{(i)}\partial z^{(i)};\ \dots;\ \partial w_n\ += x_n^{(i)}\partial z^{(i)};\quad \partial b\ += \partial z^{(i)}$

$J /= m;\ \partial w_1 /= m;\ \partial w_2 /= m;\ \dots;\ \partial w_n /= m;\ \partial b /= m$
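The same computation, vectorized over all $m$ samples at once; a sketch with names of my own choosing (`X` is the $(m, n)$ sample matrix):

```python
import numpy as np

def compute_gradients(w, b, X, y):
    """One pass of the algorithm above, vectorized with NumPy.

    X : (m, n) samples, y : (m,) labels in {0, 1}
    Returns the cost J and the gradients dJ/dw (shape (n,)) and dJ/db.
    """
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward pass
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dz = y_hat - y                               # dL/dz for each sample
    dw = X.T @ dz / m                            # averages x_i * dz over samples
    db = dz.mean()
    return J, dw, db
```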

After computing $\frac{\partial}{\partial w_i} J(w, b)$ and $\frac{\partial}{\partial b} J(w, b)$, we can do the gradient descent:

Repeat {

$\omega = \omega - \alpha\frac{\partial J(\omega, b)}{\partial \omega}$;

$b = b - \alpha\frac{\partial J(\omega, b)}{\partial b}$;

}

In each iteration, we need to recompute $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$ using the current values of $\omega$ and $b$.
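Putting the pieces together, a minimal end-to-end training sketch; the synthetic data, learning rate, and iteration count are all arbitrary choices of mine, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic samples (m=200, n=2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic labels in {0, 1}

w, b, alpha, m = np.zeros(2), 0.0, 0.1, len(y)
for step in range(1000):                      # Repeat { ... }
    y_hat = sigmoid(X @ w + b)                # forward pass
    dz = y_hat - y                            # dL/dz for each sample
    dw, db = X.T @ dz / m, dz.mean()          # fresh gradients each iteration
    w -= alpha * dw                           # w := w - alpha * dJ/dw
    b -= alpha * db                           # b := b - alpha * dJ/db

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(w, b, accuracy)                         # accuracy near 1.0 on this toy data
```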