diff --git a/.gitignore b/.gitignore
index 7270eda..4880327 100644
--- a/.gitignore
+++ b/.gitignore
@@ -10,3 +10,6 @@
# vim swap file
*.swp
+
+#include images:
+!images/*
diff --git a/images/2neurones.pdf b/images/2neurones.pdf
new file mode 100644
index 0000000..f2ac155
Binary files /dev/null and b/images/2neurones.pdf differ
diff --git a/images/2neurones.svg b/images/2neurones.svg
new file mode 100644
index 0000000..60df733
--- /dev/null
+++ b/images/2neurones.svg
@@ -0,0 +1,250 @@
+
+
+
+
diff --git a/images/logreg_illustration.pdf b/images/logreg_illustration.pdf
new file mode 100644
index 0000000..656bb05
Binary files /dev/null and b/images/logreg_illustration.pdf differ
diff --git a/images/logreg_illustration.svg b/images/logreg_illustration.svg
new file mode 100644
index 0000000..0c6b364
--- /dev/null
+++ b/images/logreg_illustration.svg
@@ -0,0 +1,151 @@
+
+
+
+
diff --git a/images/neuron_model.pdf b/images/neuron_model.pdf
new file mode 100644
index 0000000..60875a0
Binary files /dev/null and b/images/neuron_model.pdf differ
diff --git a/images/neuron_model.svg b/images/neuron_model.svg
new file mode 100644
index 0000000..eaa3212
--- /dev/null
+++ b/images/neuron_model.svg
@@ -0,0 +1,239 @@
+
+
+
+
diff --git a/images/percepton_problem.pdf b/images/percepton_problem.pdf
new file mode 100644
index 0000000..8775a20
Binary files /dev/null and b/images/percepton_problem.pdf differ
diff --git a/images/percepton_problem.svg b/images/percepton_problem.svg
new file mode 100644
index 0000000..a515726
--- /dev/null
+++ b/images/percepton_problem.svg
@@ -0,0 +1,392 @@
+
+
+
+
diff --git a/images/slr_illustration.pdf b/images/slr_illustration.pdf
new file mode 100644
index 0000000..9649976
Binary files /dev/null and b/images/slr_illustration.pdf differ
diff --git a/images/slr_illustration.svg b/images/slr_illustration.svg
new file mode 100644
index 0000000..a88d33d
--- /dev/null
+++ b/images/slr_illustration.svg
@@ -0,0 +1,133 @@
+
+
+
+
diff --git a/main.tex b/main.tex
index 387cc7c..bafe0ef 100644
--- a/main.tex
+++ b/main.tex
@@ -74,7 +74,14 @@ Descriptive Models provide new information to describe the data in a new way. Cl
The difference between classification and regression is that classification provides a categorical output (one element of a pre-determined finite set, for example a label) while regression provides a continuous output as a real number.
+\newpage
\section{Simple Linear Regression}
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=\textwidth]{images/slr_illustration.pdf}
+\end{figure}
+
We are interested in finding a function that best represents the non-functional relationship\footnote{Meaning that the output cannot be expressed as a mathematical function of the input.} between the input and the output. We suppose that we have a dataset of $n$ input/output points. We can first try to determine whether there is a statistical relationship between input and output. For that we look at the covariance and the correlation.
With $\bar{Y}$ the average of the outputs, $\bar{X}$ the average of the inputs, and $S$ the standard deviation:
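As a minimal illustration, the covariance and correlation described above can be sketched in a few lines of Python (assuming the usual sample formulas with an $n-1$ denominator; the function and variable names are ours, not from the notes):

```python
def mean(vs):
    return sum(vs) / len(vs)

def covariance(xs, ys):
    # sample covariance: sum of centered products over (n - 1)
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def std(vs):
    # sample standard deviation S
    m = mean(vs)
    return (sum((v - m) ** 2 for v in vs) / (len(vs) - 1)) ** 0.5

def correlation(xs, ys):
    # Pearson correlation: covariance normalized by both standard deviations
    return covariance(xs, ys) / (std(xs) * std(ys))
```

On perfectly linear data the correlation is exactly $1$, which is a quick sanity check for the implementation.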
@@ -94,9 +101,16 @@ This is equivalent to using the Maximum Likelihood Estimator (MLE) with the (rea
This works well for univariate problems. In the case of multivariate problems, you have to take each input individually and evaluate the linear regression from this input to the output, and the problem becomes considerably more complex (not detailed here).
+\newpage
\section{Logistic Regression}
-gistic regression is fundamentally different from Linear Regression in the sens that it returns the probability of an input belongign to one class. This probability can be compared to a threshold (usually $0.5$) to express the predicted class of the input. In this sense, the logistic regression does not return a real number but a categorical output. In this context, the "training" data consist of inputs that are real number and output that are binary (either 0 or 1). The logistic regression can be viewed as an extension of the linear regression where the affine curve is "bent" to be confined between 0 and 1. We start by expressing the logistic curve as the probability of the input $x$ belonging to the class 1.
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=\textwidth]{images/logreg_illustration.pdf}
+\end{figure}
+
+
+Logistic regression is fundamentally different from linear regression in the sense that it returns the probability of an input belonging to one of two classes. This probability can be compared to a threshold (usually $0.5$) to express the predicted class of the input. In this sense, logistic regression does not return a real number but a categorical output. In this context, the "training" data consist of inputs that are real numbers and outputs that are binary (either 0 or 1). Logistic regression can be viewed as an extension of linear regression where the affine curve is "bent" to be confined between 0 and 1. We start by expressing the logistic curve as the probability of the input $x$ belonging to class 1.
\begin{equation}
p(x) = \dfrac{1}{1+e^{-y}} = \dfrac{1}{1+e^{-(\beta_1x+\beta_0)}}
@@ -114,4 +128,112 @@ L_k =
$$
The issue now is that this function is non-linear, so we cannot simply differentiate it and solve for 0 to get the optimum. Numerical methods such as gradient descent or Newton–Raphson need to be leveraged to approximate the optimal $\beta$ values.
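As an illustration of such a numerical method, here is a minimal gradient-ascent sketch for fitting $\beta_1, \beta_0$ on the log-likelihood (the learning rate, iteration count, and function names are illustrative choices, not from the notes):

```python
import math

def sigmoid(z):
    # numerically stable logistic function 1 / (1 + e^{-z})
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logistic(xs, ts, eta=0.05, n_iter=200):
    """Fit p(x) = sigmoid(b1*x + b0) by gradient ascent on the log-likelihood.

    xs: real inputs, ts: 0/1 labels. eta and n_iter are illustrative.
    """
    b1, b0 = 0.0, 0.0
    for _ in range(n_iter):
        # gradient of the log-likelihood: sum_i (t_i - p(x_i)) * x_i (and * 1 for b0)
        g1 = sum((t - sigmoid(b1 * x + b0)) * x for x, t in zip(xs, ts))
        g0 = sum((t - sigmoid(b1 * x + b0)) for x, t in zip(xs, ts))
        b1 += eta * g1
        b0 += eta * g0
    return b1, b0
```

For linearly separable data the likelihood has no finite maximizer, so in practice one stops after a fixed number of iterations, as done here.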
+
+\section{Fundamentals of Artificial Neural Networks}
+
+
+\begin{definition}{Artificial Neural Network}{ann}
+ An Artificial Neural Network (ANN) is a system that can acquire, store, and utilize experimental knowledge. It is a set of parallelized and distributed computational elements characterized by their architecture, learning paradigm, and activation functions.
+\end{definition}
+
+ANNs are based on \textit{what and how we think} the brain's neurons look like and function. Each neuron is a computational element that receives impulses or signals from one or more other neurons and, based on a set of intrinsic parameters (the weights) and a threshold/bias (the activation function), transmits a signal to one or more subsequent neurons. ANNs can be categorized using multiple criteria:
+
+Topologies:
+\begin{itemize}
+ \item Feed Forward: The information flow goes from the first to the last layer in only one direction.
+ \item Recurrent: There is no single information flow direction; connections can form cycles, and at some point the information exits the network.
+\end{itemize}
+
+Activation functions:
+\begin{itemize}
+ \item Step.
+ \item Sigmoid.
+ \item Sigmoid derivative.
+ \item Linear.
+ \item $\dots$
+\end{itemize}
+
+Learning paradigms:
+\begin{itemize}
+ \item Supervised learning: The learning is guided by the error.
+ \item Unsupervised learning.
+ \item Reinforcement learning: Rewards or penalizes sequences of steps based on the result.
+ \item $\dots$
+\end{itemize}
+
+\newpage
+\subsection{McCulloch-Pitts Model}
+The first model of a neuron was proposed to represent the computing process of a biological neuron with mathematical elements.
+
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=0.7\textwidth]{images/neuron_model.pdf}
+\end{figure}
+
+This model does not include any learning mechanism. The weights and activation function parameters need to be set manually in order to produce a meaningful result. In itself, this model is elegant but not very useful. In order to make it learn, we turn it into a perceptron by updating the weights iteratively depending on the output generated by the neuron on our training set of data points.
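A McCulloch-Pitts style neuron with a step activation can be sketched as follows; since the model itself does not learn, the AND-gate weights and threshold are a hand-chosen illustration, not taken from the notes:

```python
def mcp_neuron(inputs, weights, threshold):
    """Fire (return 1) iff the weighted sum of inputs reaches the threshold."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# Hand-chosen parameters implementing a logical AND of two binary inputs:
# both inputs must be active for the weighted sum to reach the threshold.
AND_WEIGHTS, AND_THRESHOLD = [1, 1], 2
```

The same neuron implements OR with threshold 1, showing how the behaviour is entirely determined by the manually chosen parameters.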
+
+\subsection{Perceptron}
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=0.5\textwidth]{images/percepton_problem.pdf}
+ \caption{Example of a problem solvable with the perceptron.}
+ \label{fig:perceptron_problem}
+\end{figure}
+
+The perceptron can learn and adjust its parameters using training data. The aim of this problem is to find a hyperplane (a line in this case) that separates the $+1$ and $-1$ points. If each class is on one and only one side of the line, then the problem is solved. There is no notion of a \textit{best separation line} (as there would be for an SVM, for example). To solve this problem, we use a one-neuron model. This neuron has weights $\omega = [\omega_1,\omega_2]^T$ and a threshold $\theta$. The equation of the output $o_i$ for the input $x_i = [x_1,x_2]$ is given by:
+\begin{equation}
+ o_i = [\omega_1,\omega_2,\theta] \cdot [x_1,x_2,1]^T = \omega \cdot x_i + \theta
+\end{equation}
+
+Without any better idea of where to start, we initialize the parameters to random (but reasonable) values: $\omega_1=-1, \omega_2=1, \theta=-1$. This set of parameters gives the first line presented in figure \ref{fig:perceptron_problem}.
+
+The first line is not very good because the lower $-1$ point is misclassified. We can use this point to update the line. Let's define the update function using the learning rate $\eta$ (used to reduce the variation between two iterations in order to keep the updates stable).
+
+\begin{align*}
+ \Delta\omega &= 2\eta t x_i\\
+ \Delta\theta &= -2 \eta t \textrm{ if $t \neq o$, otherwise $\Delta\omega = \Delta\theta = 0$}
+\end{align*}
+
+With this update function and $\eta=0.1$, we can update the weights to $\omega = [-3,-1], \theta=0$. This produces the second line, which separates the two classes perfectly.
+
+The one-neuron perceptron can learn and solve linearly separable problems. There is one hyperparameter to tune, but other than that it should converge to a solution if the learning rate is not too big. The stopping criterion is either finding a solution or reaching a maximum number of iterations.
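The training loop above can be sketched in Python using the update rule from this subsection (the initial values, the learning rate, and the tie-breaking choice $o=+1$ when the weighted sum is exactly zero are illustrative assumptions):

```python
def predict(w, theta, x):
    # o = sign(w . x + theta), with sign(0) taken as +1
    s = w[0] * x[0] + w[1] * x[1] + theta
    return 1 if s >= 0 else -1

def train_perceptron(points, targets, w, theta, eta=0.1, max_iter=1000):
    """Apply the update rule from the notes until no point is misclassified.

    points: list of (x1, x2); targets: +1 / -1 labels.
    Returns (w, theta, converged).
    """
    for _ in range(max_iter):
        errors = 0
        for x, t in zip(points, targets):
            o = predict(w, theta, x)
            if o != t:  # update only on misclassified points
                w = [w[0] + 2 * eta * t * x[0], w[1] + 2 * eta * t * x[1]]
                theta = theta - 2 * eta * t
                errors += 1
        if errors == 0:  # stopping criterion: a full pass with no error
            return w, theta, True
    return w, theta, False  # stopping criterion: iteration budget exhausted
```

Note the two stopping criteria from the text: a clean pass over the data, or hitting the iteration limit.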
+
+\subsection{Multi-Layer Perceptron}
+One neuron is useful but very limited. For example, it is limited to linearly separable problems. The XOR problem, where the two classes are spread along the two diagonals of the plane, is an example of a very simple and yet non-linearly separable problem where a single-neuron perceptron will fail. We now need to connect multiple neurons together to tackle this kind of problem. We make a few changes to the neuron to prepare it for use in a network.
+
+\begin{itemize}
+ \item The neuron now has a trainable bias, which is just a third parameter along with the weights.
+ \item The update rule is now defined for each weight as $\Delta \omega_j = \eta \dfrac{\partial P}{\partial \omega_j}$
+ \item The performance function is defined as $P = -\dfrac{1}{2} (t-o)^2$. The square makes it smooth and differentiable, the half is there because we know we are going to differentiate it at some point, and the negative sign makes it a maximisation problem (no real consequence: maximizing performance is the same as minimizing loss or cost).
+\end{itemize}
+
+We can now connect the two neurons, and this minimal network will produce an output $O_i$ for each input $x_i$.
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=0.8\textwidth]{images/2neurones.pdf}
+ \caption{Minimal multi-layer perceptron with two neurons.}
+ \label{fig:2neurones}
+\end{figure}
+
+We want to maximise the performance value $P$ over the inputs $x_i$. This maximisation problem requires determining the derivative of $P$ with respect to each parameter and using it to update that parameter (gradient method). In order to obtain the derivative of $P$ with respect to each $\omega_i$, we leverage the chain rule to go from the end of the network back to the beginning, using the derivative at each step:
+
+
+\[\arraycolsep=4pt\def\arraystretch{2.2}
+\left\{
+ \begin{array}{ll}
+ \dfrac{\partial P}{\partial \omega_2} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial \omega_2}\\
+ \dfrac{\partial P}{\partial \omega_1} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial y} \dfrac{\partial y}{\partial P_1} \dfrac{\partial P_1}{\partial \omega_1}\\
+\end{array}
+\right.
+\]
+
+Each derivative is fairly easy to get. We can express our deltas as:
+\begin{align*}
+ \dfrac{\partial P}{\partial \omega_2} &= (t_i - O_i) \times (1-O_i)O_i \times y\\
+ \dfrac{\partial P}{\partial \omega_1} &= (t_i - O_i) \times (1-O_i)O_i \times \omega_2 \times(1-y)y\times x_i\\
+\end{align*}
+
+These derivatives can now be used to update the values of $\omega_1$ and $\omega_2$ according to the learning rate $\eta$. This chaining of derivatives from the output all the way back to the input of the network is the idea behind backpropagation.
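Assuming sigmoid activations on both neurons (consistent with the $(1-O_i)O_i$ and $(1-y)y$ factors in the derivatives above), one gradient step on the two-neuron chain can be sketched as follows (the variable names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, w2, x):
    y = sigmoid(w1 * x)   # hidden neuron: y = sigma(P1), with P1 = w1 * x
    o = sigmoid(w2 * y)   # output neuron: O = sigma(P2), with P2 = w2 * y
    return y, o

def backprop_step(w1, w2, x, t, eta=0.5):
    """One gradient-ascent step on P = -1/2 (t - O)^2 using the chain rule."""
    y, o = forward(w1, w2, x)
    # dP/dw2 = (t - O) * (1 - O) O * y
    dP_dw2 = (t - o) * (1 - o) * o * y
    # dP/dw1 = (t - O) * (1 - O) O * w2 * (1 - y) y * x
    dP_dw1 = (t - o) * (1 - o) * o * w2 * (1 - y) * y * x
    # ascend P (equivalently, descend the squared error)
    return w1 + eta * dP_dw1, w2 + eta * dP_dw2
```

A single step should move the output $O_i$ closer to the target $t_i$, which is easy to check numerically.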
\end{document}