0. Symbols
$n$: dimension of the input $x$
$m$: number of samples
$i$: index of a sample
$x^{(i)}$: the $x$ of the $i^{th}$ sample
$ \alpha $: learning rate
1. Binary Classification
a. Logistic Regression:
Given an input $x$, we want an output $\hat{y}$ that estimates $P(y = 1 \mid x)$, where we want $0 \leq \hat{y} \leq 1$.
We can use the sigmoid function:

$$\hat{y} = \sigma(w^{T}x + b)$$

where

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To write the input $z$ of the activation function in a common format, we set $\Theta$ as follows:

$$\Theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix},
\qquad
x = \begin{bmatrix} 1 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$$

where $\theta_0$ is $b$, and the vector $[\theta_1, \theta_2, \dots, \theta_n]^{T}$ is $w$.
We can then write $z$ as $\Theta^{T}x$.
So, the final logistic regression can be written as:

$$\hat{y} = \sigma(\Theta^{T}x)$$
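As a minimal sketch of this forward pass (assuming NumPy; the variable values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(w, b, x):
    """Logistic regression forward pass: y_hat = sigmoid(w^T x + b)."""
    z = np.dot(w, x) + b
    return sigmoid(z)

# Hypothetical values: n = 3 features.
w = np.array([0.5, -0.2, 0.1])
b = 0.3
x = np.array([1.0, 2.0, -1.0])
print(predict(w, b, x))  # a probability between 0 and 1
```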
b. Cost function
For a single sample, we use the loss function:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}\right)$$

If $y = 1$, $L(\hat{y}, y) = -\log{\hat{y}}$; since we want a lower loss, $\hat{y}$ should be close to 1.
If $y = 0$, $L(\hat{y}, y) = -\log{(1-\hat{y})}$; since we want a lower loss, $\hat{y}$ should be close to 0.
So, $L(\hat{y}, y)$ can be used as the loss function.
Then, we can get the cost function from the loss function by averaging over all $m$ samples:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)}) = -\frac{1}{m}\sum_{i=1}^{m}\left[ y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})} \right]$$
c. Conclusion
Logistic Regression:

$$\hat{y} = \sigma(w^{T}x + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

Loss function:

$$L(\hat{y}, y) = -\left(y\log{\hat{y}} + (1-y)\log{(1-\hat{y})}\right)$$

Cost function:

$$J(w, b) = \frac{1}{m}\sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$
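A short NumPy sketch of the loss and cost functions above (the data layout, one column per sample, is an assumption made for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(y_hat, y):
    """Per-sample loss L(y_hat, y) = -(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cost(w, b, X, Y):
    """Cost J(w, b): average loss over the m samples.
    X has shape (n, m), one column per sample; Y has shape (m,)."""
    Y_hat = sigmoid(np.dot(w, X) + b)   # predictions for all samples, shape (m,)
    return np.mean(loss(Y_hat, Y))

# Hypothetical toy data: n = 2 features, m = 4 samples.
X = np.array([[1.0, 2.0, -1.0, 0.5],
              [0.0, 1.0,  3.0, -2.0]])
Y = np.array([1, 1, 0, 0])
print(cost(np.array([0.1, -0.3]), 0.0, X, Y))
```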
2. Gradient Descent
Repeat {
$w = w - \alpha\frac{\partial J(w, b)}{\partial w}$;
$b = b - \alpha\frac{\partial J(w, b)}{\partial b}$;
}
When the values of $w$ and $b$ barely change between iterations, we have reached the minimum.
The $w$ and $b$ at that point are the optimal solution we are looking for.
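To illustrate the update rule and the stopping criterion, here is a toy sketch that minimizes a simple convex function instead of the logistic-regression cost (the function, learning rate, and threshold are all made up for illustration; the update rule is the same):

```python
# Toy example: J(w, b) = (w - 3)^2 + (b + 1)^2, minimum at w = 3, b = -1.
def grad_J(w, b):
    return 2 * (w - 3), 2 * (b + 1)

w, b = 0.0, 0.0
alpha = 0.1                      # learning rate
for _ in range(1000):
    dw, db = grad_J(w, b)
    w_new = w - alpha * dw       # w = w - alpha * dJ/dw
    b_new = b - alpha * db       # b = b - alpha * dJ/db
    # Stop when w and b barely change: we are (numerically) at the minimum.
    if abs(w_new - w) < 1e-8 and abs(b_new - b) < 1e-8:
        w, b = w_new, b_new
        break
    w, b = w_new, b_new

print(w, b)  # approximately 3.0 and -1.0
```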
3. How to calculate the gradients for gradient descent
Calculate the derivative of $L(\hat{y}, y)$
First, we calculate $\frac{\partial L(\hat{y}, y)}{\partial \hat{y}}$:

$$\frac{\partial L(\hat{y}, y)}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}$$

Then, we calculate $\frac{\partial \hat{y}}{\partial z}$ (recall that $\hat{y} = \sigma(z)$):

$$\frac{\partial \hat{y}}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right) = \hat{y}(1-\hat{y})$$

Finally, by the chain rule we can get $\frac{\partial L(\hat{y}, y)}{\partial z}$:

$$\frac{\partial L(\hat{y}, y)}{\partial z} = \frac{\partial L(\hat{y}, y)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right)\hat{y}(1-\hat{y}) = \hat{y} - y$$

Then, since $z = w^{T}x + b$ gives $\frac{\partial z}{\partial w_i} = x_i$ and $\frac{\partial z}{\partial b} = 1$, we can get $\frac{\partial L(\hat{y}, y)}{\partial w_i}$ and $\frac{\partial L(\hat{y}, y)}{\partial b}$:

$$\frac{\partial L(\hat{y}, y)}{\partial w_i} = x_i(\hat{y} - y), \qquad \frac{\partial L(\hat{y}, y)}{\partial b} = \hat{y} - y$$
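These derivatives can be sanity-checked numerically; the sketch below compares the analytic $\frac{\partial L}{\partial z} = \hat{y} - y$ with a finite-difference estimate (the test point is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_of_z(z, y):
    """L as a function of z, with y_hat = sigmoid(z)."""
    y_hat = sigmoid(z)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y = 0.7, 1.0                       # arbitrary test point
eps = 1e-6
numeric = (loss_of_z(z + eps, y) - loss_of_z(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y             # dL/dz = y_hat - y
print(numeric, analytic)              # the two values should agree closely
```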
Calculate the derivative of $J(w, b)$
Since $J(w, b)$ is the average of the per-sample losses, its derivatives are the averages of the per-sample derivatives computed above.
Algorithm to calculate it:
$J=0;\ \partial w_1 = 0;\ \partial w_2 = 0;\ \dots;\ \partial w_n = 0;\ \partial b = 0$
$For\ i = 1\ to\ m:$
$\quad z^{(i)} = w^{T}x^{(i)} + b$
$\quad \hat{y}^{(i)} = \sigma(z^{(i)})$
$\quad J\ +=\ -\left[y^{(i)}\log{\hat{y}^{(i)}} + (1-y^{(i)})\log{(1-\hat{y}^{(i)})}\right]$
$\quad \partial z^{(i)} = \hat{y}^{(i)} - y^{(i)}$
$\quad \partial w_1\ +=\ x_1^{(i)}\partial z^{(i)};\ \partial w_2\ +=\ x_2^{(i)}\partial z^{(i)};\ \dots;\ \partial w_n\ +=\ x_n^{(i)}\partial z^{(i)}$
$\quad \partial b\ +=\ \partial z^{(i)}$
$J\ /= m;\ \partial w_1\ /= m;\ \partial w_2\ /= m;\ \dots;\ \partial w_n\ /= m;\ \partial b\ /= m$
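One way to write this loop in NumPy (a sketch; the function name and data layout are illustrative, and in practice the inner loop is usually vectorized):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost_and_grads(w, b, X, Y):
    """Loop version of the algorithm above.
    X has shape (n, m), one column per sample; Y has shape (m,).
    Returns the cost J and the gradients dw (shape (n,)) and db."""
    n, m = X.shape
    J, dw, db = 0.0, np.zeros(n), 0.0
    for i in range(m):
        z_i = np.dot(w, X[:, i]) + b
        y_hat_i = sigmoid(z_i)
        J += -(Y[i] * np.log(y_hat_i) + (1 - Y[i]) * np.log(1 - y_hat_i))
        dz_i = y_hat_i - Y[i]          # dL/dz for sample i
        dw += X[:, i] * dz_i           # accumulate dL/dw_k for every feature k
        db += dz_i
    return J / m, dw / m, db / m
```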
After computing $\frac{\partial J(w,b)}{\partial w_i}$ and $\frac{\partial J(w,b)}{\partial b}$, we can do the gradient descent:
Repeat {
$w = w - \alpha\frac{\partial J(w, b)}{\partial w}$;
$b = b - \alpha\frac{\partial J(w, b)}{\partial b}$;
}
In each iteration, we need to recompute $\frac{\partial J}{\partial w}$ and $\frac{\partial J}{\partial b}$ with the algorithm above, because they change as $w$ and $b$ change.
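Putting the pieces together, a minimal vectorized training loop might look like the following; the data, learning rate, and iteration count are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, alpha=0.1, num_iters=1000):
    """Gradient descent for logistic regression.
    X: shape (n, m), one column per sample; Y: shape (m,) with 0/1 labels."""
    n, m = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(num_iters):
        Y_hat = sigmoid(np.dot(w, X) + b)   # forward pass, shape (m,)
        dz = Y_hat - Y                      # dL/dz for every sample
        dw = np.dot(X, dz) / m              # dJ/dw, shape (n,)
        db = np.sum(dz) / m                 # dJ/db
        w -= alpha * dw                     # gradient descent step
        b -= alpha * db
    return w, b

# Hypothetical toy data: 2 features, 4 samples.
X = np.array([[0.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 0.0]])
Y = np.array([0, 0, 1, 1])
w, b = train(X, Y)
print(sigmoid(np.dot(w, X) + b))  # predicted probabilities for the training samples
```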