0. Symbols

$n$: dimension of $x$

$m$: number of samples

$i$: index of a sample

$x^{(i)}$: the $x$ of the $i^{\text{th}}$ sample

$\alpha$: learning rate

1. Binary Classification

a. Logistic Regression

Given an input $x$, we want to output a prediction $\hat{y} = P(y = 1 \mid x)$, where we want $0 \leq \hat{y} \leq 1$.

We can use the sigmoid function:

$$\hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}}$$

where

$$z = \omega^{T}x + b$$

To write $z$ in a single compact format, we set $\Theta$ as follows:

$$\Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}$$

where $\theta_0$ is $b$,

and the matrix

$$\begin{bmatrix} \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$$

is $\omega$.

With $x_0 = 1$ prepended to $x$, we can write $z$ as $\Theta^{T}x$.

So, the final logistic regression can be written as:

$$\hat{y} = \sigma(\Theta^{T}x) = \frac{1}{1 + e^{-\Theta^{T}x}}$$
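A minimal NumPy sketch of this forward pass (the names `theta`, `X`, and `predict` are my own, and `X` is assumed to already have the leading column of ones):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, X):
    """Logistic regression forward pass: y_hat = sigma(Theta^T x).

    theta : (n + 1,) parameter vector, theta[0] plays the role of b.
    X     : (m, n + 1) samples, each row prepended with x_0 = 1.
    """
    return sigmoid(X @ theta)

# Tiny usage example with made-up numbers.
theta = np.array([0.0, 1.0, -2.0])          # [b, w_1, w_2]
X = np.array([[1.0, 0.5, 0.1],              # x_0 = 1 prepended
              [1.0, -1.0, 0.3]])
print(predict(theta, X))                    # values in (0, 1)
```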

b. Cost function

The loss function for a single sample is:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1 - y)\log{(1 - \hat{y})}\right)$$

If $y = 1$, then $L(\hat{y}, y) = -\log{\hat{y}}$; since we want a lower loss, $\hat{y}$ should be close to 1.

If $y = 0$, then $L(\hat{y}, y) = -\log{(1 - \hat{y})}$; since we want a lower loss, $\hat{y}$ should be close to 0.

So, $L(\hat{y}, y)$ can be used as the loss function.

Then, we can get the cost function by averaging the loss function over all samples:

$$J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$

c. Conclusion

Logistic Regression:

$$\hat{y} = \sigma(\omega^{T}x + b), \quad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss function:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1 - y)\log{(1 - \hat{y})}\right)$$

Cost function:

$$J(\omega, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)})\log{(1 - \hat{y}^{(i)})}\right]$$
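A short NumPy sketch of the cost function above (the `eps` clipping is my own addition to keep `log` away from 0; it is not part of the formula):

```python
import numpy as np

def cost(y_hat, y, eps=1e-12):
    """Cross-entropy cost J = (1/m) * sum of per-sample losses.

    y_hat : (m,) predicted probabilities in (0, 1)
    y     : (m,) true labels in {0, 1}
    eps   : small constant to avoid log(0) (my addition)
    """
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return losses.mean()

y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.6])
print(cost(y_hat, y))   # low when y_hat agrees with y
```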

2. Gradient Descent

Repeat {

$\omega = \omega - \alpha\frac{\partial J(\omega, b)}{\partial \omega}$;

$b = b - \alpha\frac{\partial J(\omega, b)}{\partial b}$;

}

When the values of $\omega$ and $b$ barely change between updates, we have reached the lowest point.

The $\omega$ and $b$ at that point are the optimal solution we need.
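To illustrate the update rule on its own, here is a 1-D sketch that minimizes $J(w) = (w - 3)^2$; the function, learning rate, and tolerance are arbitrary choices of mine:

```python
def gradient_descent_1d(grad, w0=0.0, alpha=0.1, steps=100, tol=1e-8):
    """Repeat w := w - alpha * dJ/dw until w barely changes."""
    w = w0
    for _ in range(steps):
        w_new = w - alpha * grad(w)
        if abs(w_new - w) < tol:      # "values barely change" -> converged
            break
        w = w_new
    return w

# J(w) = (w - 3)^2  =>  dJ/dw = 2 * (w - 3); minimum at w = 3.
print(gradient_descent_1d(lambda w: 2 * (w - 3)))   # ~3.0
```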

3. How to Calculate the Gradients

Calculate the derivative of $L(\hat{y}, y)$

First, we calculate $\frac{\partial L(\hat{y}, y)}{\partial \hat{y}}$:

$$\frac{\partial L(\hat{y}, y)}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

Then, we calculate $\frac{\partial \hat{y}}{\partial z}$:

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y})$$

Finally, we can get $\frac{\partial L(\hat{y}, y)}{\partial z}$ by the chain rule:

$$\frac{\partial L(\hat{y}, y)}{\partial z} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y$$

Then, since $\frac{\partial z}{\partial w_i} = x_i$, we can get $\frac{\partial L(\hat{y}, y)}{\partial w_i}$:

$$\frac{\partial L(\hat{y}, y)}{\partial w_i} = \frac{\partial L(\hat{y}, y)}{\partial z} \cdot \frac{\partial z}{\partial w_i} = x_i(\hat{y} - y)$$
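As a sanity check on $\frac{\partial L}{\partial z} = \hat{y} - y$, a finite-difference comparison (entirely my own illustration; the point $z = 0.7$, $y = 1$ is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Per-sample loss as a function of z, with y_hat = sigmoid(z)."""
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z, y, h = 0.7, 1.0, 1e-6
numeric  = (loss(z + h, y) - loss(z - h, y)) / (2 * h)
analytic = sigmoid(z) - y          # the dL/dz = y_hat - y formula
print(numeric, analytic)           # the two should agree closely
```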

Calculate the derivative of $J(w, b)$

Since $J(w, b)$ is the average of the per-sample losses, its derivatives are the averages of the per-sample derivatives:

$$\frac{\partial J(w, b)}{\partial w} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial w}, \qquad \frac{\partial J(w, b)}{\partial b} = \frac{1}{m}\sum_{i=1}^{m} \frac{\partial L(\hat{y}^{(i)}, y^{(i)})}{\partial b}$$

Algorithm to calculate it:

$J = 0;\ \partial w_1 = 0;\ \partial w_2 = 0;\ \dots;\ \partial w_n = 0;\ \partial b = 0$

$For\ i = 1\ to\ m:$

$\quad z^{(i)} = \omega^{T}x^{(i)} + b;\quad \hat{y}^{(i)} = \sigma(z^{(i)})$

$\quad J\ += -\left[y^{(i)}\log{\hat{y}^{(i)}} + (1 - y^{(i)})\log{(1 - \hat{y}^{(i)})}\right]$

$\quad \partial z^{(i)} = \hat{y}^{(i)} - y^{(i)}$

$\quad \partial w_1\ += x_1^{(i)}\partial z^{(i)};\ \dots;\ \partial w_n\ += x_n^{(i)}\partial z^{(i)};\quad \partial b\ += \partial z^{(i)}$

$J /= m;\ \partial w_1 /= m;\ \partial w_2 /= m;\ \dots;\ \partial w_n /= m;\ \partial b /= m$
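The same computation, vectorized over all $m$ samples at once; a sketch with names of my own choosing (`X` is the $(m, n)$ sample matrix):

```python
import numpy as np

def compute_gradients(w, b, X, y):
    """One pass of the algorithm above, vectorized with NumPy.

    X : (m, n) samples, y : (m,) labels in {0, 1}
    Returns the cost J and the gradients dJ/dw (shape (n,)) and dJ/db.
    """
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # forward pass
    J = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    dz = y_hat - y                               # dL/dz for each sample
    dw = X.T @ dz / m                            # averages x_i * dz over samples
    db = dz.mean()
    return J, dw, db
```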

After computing $\frac{\partial}{\partial w_i} J(w, b)$ and $\frac{\partial}{\partial b} J(w, b)$, we can do the gradient descent:

Repeat {

$\omega = \omega - \alpha\frac{\partial J(\omega, b)}{\partial \omega}$;

$b = b - \alpha\frac{\partial J(\omega, b)}{\partial b}$;

}

In each iteration, we need to recompute $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$ using the current values of $\omega$ and $b$.
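Putting the pieces together, a minimal end-to-end training sketch; the synthetic data, learning rate, and iteration count are all arbitrary choices of mine, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # synthetic samples (m=200, n=2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic labels in {0, 1}

w, b, alpha, m = np.zeros(2), 0.0, 0.1, len(y)
for step in range(1000):                      # Repeat { ... }
    y_hat = sigmoid(X @ w + b)                # forward pass
    dz = y_hat - y                            # dL/dz for each sample
    dw, db = X.T @ dz / m, dz.mean()          # fresh gradients each iteration
    w -= alpha * dw                           # w := w - alpha * dJ/dw
    b -= alpha * db                           # b := b - alpha * dJ/db

accuracy = ((sigmoid(X @ w + b) > 0.5) == y).mean()
print(w, b, accuracy)                         # accuracy near 1.0 on this toy data
```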