\documentclass{article}

\usepackage[utf8]{inputenc}
\usepackage[left=2cm,right=2cm,bottom=2cm,top=2cm]{geometry}
\usepackage{amsmath}
\usepackage{tcolorbox}
\tcbuselibrary{theorems}
\newtcbtheorem{definition}{Definition}{colback=red!5!white,colframe=red!75!black,fonttitle=\bfseries\large}{def}
\usepackage{xcolor}
\usepackage{graphicx,import}
\usepackage{fancyhdr}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{
    linktoc=all, % set to all if you want both sections and subsections linked
    colorlinks,
    citecolor=black,
    filecolor=black,
    linkcolor=black,
    urlcolor=black
}

\pagestyle{fancy}
\fancyhf{}
\lhead{ECE-657 Tools of Intelligent System Design}
\rhead{Spring 2022}
\lfoot{Arthur Grisel-Davy}
\rfoot{Page \thepage}
\begin{document}

\thispagestyle{empty}
\begin{center}
\par\noindent\rule{\textwidth}{2px}\\
\vspace{0.5cm}
{\Huge Tools of Intelligent System Design}
\vspace{0.3cm}
\par\noindent\rule{\textwidth}{2px}
\end{center}

\tableofcontents

\newpage

\section{Introduction}
This first section is dedicated to defining the core concepts that will be developed in the course, most importantly what an intelligent system or machine is.
\begin{definition}{Intelligent System}{intelligent-system}
An algorithm enabled by constraints, exposed by representations that support models, and targeted at reflection, perception, and action.\\
Without loss of generality, an intelligent system is one that generates hypotheses and tests them.
\end{definition}

We say that the algorithm is \textit{enabled by constraints} because the constraints give it a direction to follow to solve the problem. Without constraints, the field of possible solutions is too broad and the algorithm cannot even begin to choose a direction. The constraints are necessary to enable the intelligence. The main feature of an intelligent system is the generation of outputs based on its inputs and the nature of the system. Common capabilities of intelligent systems include sensory perception, pattern recognition, learning and knowledge acquisition, and inference from incomplete information.
\begin{definition}{Intelligent Machine}{intelligent-machine}
An intelligent machine is one that can exhibit one or more intelligent characteristics of a human. An intelligent machine embodies machine intelligence. An intelligent machine may, however, take a broader meaning than an intelligent computer.
\end{definition}
\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\textwidth]{images/intelligent_machine_example.png}
\end{figure}

ML algorithms can be sorted into two categories: predictive and descriptive.
\begin{definition}{Predictive vs Descriptive}{desc-pred}
Predictive models return a prediction or outcome that is not known. Classification and regression are examples of predictive analysis.\\
Descriptive models provide new information that describes the data in a new way. Clustering and summarization are examples of descriptive analysis.
\end{definition}

The difference between classification and regression is that classification provides a categorical output (one element of a pre-determined finite set, for example a label) whereas regression provides a continuous output as a real number.
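To make the distinction concrete, here is a minimal Python sketch (both toy models are invented for illustration, not from the lecture): a classifier maps an input to a label from a finite set, while a regressor maps it to a real number.

```python
import math

def classify_temperature(t_celsius):
    """Categorical output: one label from a pre-determined finite set."""
    if t_celsius < 10:
        return "cold"
    elif t_celsius < 25:
        return "mild"
    return "hot"

def predict_temperature(day_of_year):
    """Continuous output: a real number (a toy seasonal model)."""
    return 15 - 10 * math.cos(2 * math.pi * day_of_year / 365)

print(classify_temperature(30))   # a label from {'cold', 'mild', 'hot'}
print(predict_temperature(180))   # a real number
```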
\newpage
\section{Simple Linear Regression}

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{images/slr_illustration.pdf}
\end{figure}

We are interested in finding the function that best represents the non-functional relationship\footnote{Meaning that the output cannot be expressed as a mathematical function of the input.} between the input and the output. We suppose that we have a dataset of $n$ input/output points. We can first try to determine whether there is a statistical relationship between input and output. For that we look at the covariance and the correlation.
With $\bar{Y}$ the average of the outputs, $\bar{X}$ the average of the inputs, and $S_x$ and $S_y$ the standard deviations:
\begin{align}
Covariance &= \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})\\
Correlation &= \dfrac{Cov(X,Y)}{S_xS_y}
\end{align}

The correlation is normalized between $-1$ and $1$. A correlation of $-1$ or $1$ indicates a perfect statistical relationship, ideal for applying SLR. On the other hand, the closer the correlation is to $0$, the less information we have. A low correlation does not indicate that there is no statistical relationship; it simply means that the relationship is not linear.
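The two formulas above translate directly into code. A minimal Python sketch (sample covariance with the $n-1$ denominator, correlation normalized by the standard deviations):

```python
import math

def covariance(xs, ys):
    """Sample covariance with the n-1 denominator, as in the formula above."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: Cov(X, Y) normalized by S_x * S_y."""
    s_x = math.sqrt(covariance(xs, xs))  # sample standard deviation of X
    s_y = math.sqrt(covariance(ys, ys))  # sample standard deviation of Y
    return covariance(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]  # roughly y = 2x
print(correlation(xs, ys))      # close to 1: strong linear relationship
```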

Now that we know that there is a linear relationship, we want to find the parameters of the affine function that best approximates this relationship. The function is of the form $y_i = f(x_i) = \beta_1 x_i +\beta_0$. The best approximation of $\beta$ is the one that minimizes the squared error. We can find $\beta$ by differentiating the squared error and solving for zero, where the squared error is defined by:
\begin{equation}
Q = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - \beta_1x_i-\beta_0)^2
\end{equation}

This is equivalent to using the Maximum Likelihood Estimator (MLE) under the (reasonable) assumption that the output is the result of Gaussian noise around a function of the input, i.e. a normal distribution.

This works well for monovariate problems. In the case of multivariate problems, you have to take each input individually and evaluate the linear regression from this input to the output, and the problem becomes much more complex (not detailed here).
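Setting the derivatives of $Q$ to zero yields the familiar closed-form estimates $\beta_1 = Cov(X,Y)/Var(X)$ and $\beta_0 = \bar{Y} - \beta_1\bar{X}$. A minimal Python sketch of the monovariate case:

```python
# Ordinary least squares for y = b1*x + b0, using the closed-form solution
# obtained by setting the derivatives of the squared error Q to zero:
#   b1 = Cov(X, Y) / Var(X),   b0 = mean(Y) - b1 * mean(X)

def fit_slr(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var = sum((x - x_bar) ** 2 for x in xs)
    b1 = cov / var
    b0 = y_bar - b1 * x_bar
    return b1, b0

b1, b0 = fit_slr([0, 1, 2, 3], [1, 3, 5, 7])  # data lies exactly on y = 2x + 1
print(b1, b0)  # 2.0 1.0
```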

\newpage
\section{Logistic Regression}

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{images/logreg_illustration.pdf}
\end{figure}

Logistic regression is fundamentally different from linear regression in the sense that it returns the probability of an input belonging to one of two classes. This probability can be compared to a threshold (usually $0.5$) to express the predicted class of the input. In this sense, logistic regression does not return a real number but a categorical output. In this context, the training data consist of inputs that are real numbers and outputs that are binary (either 0 or 1). Logistic regression can be viewed as an extension of linear regression where the affine curve is ``bent'' to be confined between 0 and 1. We start by expressing the logistic curve as the probability of the input $x$ belonging to class 1.

\begin{equation}
p(x) = \dfrac{1}{1+e^{-y}} = \dfrac{1}{1+e^{-(\beta_1x+\beta_0)}}
\end{equation}

We find again the linear regression function that is used to predict $y$ from $x$, but it is now integrated in the larger logistic function to constrain it between 0 and 1. We still have to find the best value for $\beta$; in this case, we can use the maximum likelihood estimator (or the negative log-likelihood for practicality) and optimize it. The loss for the $k^{th}$ point is expressed as:

$$
L_k =
\begin{cases}
-\ln(p_k) & \text{if } y_k = 1\\
-\ln(1-p_k) & \text{if } y_k = 0
\end{cases}
= -\ln\left[\left(\dfrac{1}{1+e^{-(\beta_1x_k+\beta_0)}}\right)^{y_k}\left(\dfrac{1}{1+e^{\beta_1x_k+\beta_0}}\right)^{1-y_k}\right]
$$

The issue now is that this function is non-linear in $\beta$, so we cannot simply differentiate it and solve for zero to get a closed-form optimum. Numerical methods such as gradient descent or Newton–Raphson need to be leveraged to approximate the optimal $\beta$ values.
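As an illustration (a sketch, not the lecture's reference implementation), gradient descent on the negative log-likelihood is short to write: for one point, the gradient of $L_k$ with respect to $(\beta_1, \beta_0)$ works out to $(p_k - y_k)x_k$ and $(p_k - y_k)$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(x) = sigmoid(b1*x + b0) by gradient descent on the
    negative log-likelihood; per-point gradients are (p - y)*x and (p - y)."""
    b1, b0 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        g1 = g0 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b1 * x + b0)
            g1 += (p - y) * x
            g0 += (p - y)
        b1 -= lr * g1 / n
        b0 -= lr * g0 / n
    return b1, b0

# Toy data: class 0 below x = 3, class 1 above.
xs = [1, 2, 2.5, 3.5, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
b1, b0 = fit_logistic(xs, ys)
print(sigmoid(b1 * 1 + b0) < 0.5, sigmoid(b1 * 5 + b0) > 0.5)  # True True
```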

\section{Fundamentals of Artificial Neural Networks}

\begin{definition}{Artificial Neural Network}{ann}
An Artificial Neural Network (ANN) is a system that can acquire, store, and utilize experimental knowledge. It is a set of parallelized and distributed computational elements characterized by their architecture, learning paradigm, and activation functions.
\end{definition}

ANNs are based on \textit{what and how we think} the brain's neurons look like and function. Each neuron is a computational element that receives impulses or signals from one or more other neurons and, based on a set of intrinsic parameters (the weights) and a threshold/bias (the activation function), transmits a signal to one or more subsequent neurons. ANNs can be categorized using multiple criteria:

Topologies:
\begin{itemize}
    \item Feed-forward: the information flows from the first to the last layer in only one direction.
    \item Recurrent: there is no single information-flow direction; at some point the information exits the network.
\end{itemize}

Activation functions:
\begin{itemize}
    \item Step.
    \item Sigmoid.
    \item Sigmoid derivative.
    \item Linear.
    \item $\dots$
\end{itemize}

Learning paradigms:
\begin{itemize}
    \item Supervised learning: the learning is guided by the error.
    \item Unsupervised learning.
    \item Reinforcement learning: rewards or penalizes sequences of steps based on the result.
    \item $\dots$
\end{itemize}
\newpage
\subsection{McCulloch-Pitts Model}
The first model of a neuron was proposed to represent the computing process of a biological neuron with mathematical elements.

\begin{figure}[h]
    \centering
    \includegraphics[width=0.7\textwidth]{images/neuron_model.pdf}
\end{figure}

This model does not include any learning mechanism. The weights and activation-function parameters need to be set manually in order to produce a meaningful result. By itself, this model is elegant but not very useful. In order to make it learn, we turn it into a perceptron by updating the weights iteratively depending on the output generated by the neuron on our training set of datapoints.
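A McCulloch-Pitts neuron is short enough to write out in full. In this sketch (the AND parameters are an illustrative choice, not from the lecture), note that nothing is learned: the weights and threshold are fixed by hand.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts neuron: fires (returns 1) iff the weighted sum
    of the inputs reaches the hand-set threshold. No learning happens."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

# Hand-set parameters implementing a logical AND of two binary inputs.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", mp_neuron([a, b], weights=[1, 1], threshold=2))
```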

\subsection{Perceptron}

\begin{figure}[h]
    \centering
    \includegraphics[width=0.5\textwidth]{images/percepton_problem.pdf}
    \caption{Example of a problem solvable with the perceptron.}
    \label{fig:perceptron_problem}
\end{figure}

The perceptron can learn and adjust its parameters using training data. The aim of this problem is to find a hyperplane (a line in this case) that separates the $+1$ and $-1$ points. If each class is on one and only one side of the line, then the problem is solved. There is no notion of \textit{best separation line} (as there would be for an SVM, for example). To solve this problem, we use a one-neuron model. This neuron has weights $\omega = [\omega_1,\omega_2]^T$ and a threshold $\theta$. The output $o_i$ for the input $x_i = [x_1,x_2]$ is given by:
\begin{equation}
o_i = [\omega_1,\omega_2,\theta] \cdot [x_1,x_2,1]^T = \omega \cdot x_i + \theta
\end{equation}

Without any better idea of where to start, we initialize the parameters to random (but reasonable) values: $\omega_1=-1, \omega_2=1, \theta=-1$. This set of parameters gives the first line presented in Figure~\ref{fig:perceptron_problem}.
The first line is not very good because the lower $-1$ point is misclassified. We can use this point to update the line. Let's define the update function using the learning rate $\eta$ (used to reduce the variation between two iterations in order to stay in a stable state):

\begin{align*}
\Delta\omega &= 2\eta t x_i\\
\Delta\theta &= -2 \eta t \textrm{ if $t \neq o$, otherwise $\Delta\theta = 0$}
\end{align*}

With this update function and $\eta=0.1$, we can update the weights to $\omega = [-3,-1], \theta=0$. This produces the second line, which separates the two classes perfectly.

The one-neuron perceptron can learn and solve linearly separable problems. There is one hyperparameter to tune, but other than that it should converge to a solution if the learning rate is not too big. The stopping criterion is either finding a solution or reaching a maximum number of iterations.
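The training loop above can be sketched in a few lines of Python. This follows the update rule as stated (updates applied only to misclassified points, stopping when a full pass makes no errors); the toy dataset is invented for illustration.

```python
def predict(w, th, x):
    """Classify x as +1 or -1 with output o = w . x + theta."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + th
    return 1 if s >= 0 else -1

def train_perceptron(points, eta=0.1, max_iter=100):
    """Perceptron training with the update rule from the text:
    when a point with target t in {-1, +1} is misclassified,
      w     <- w + 2 * eta * t * x_i
      theta <- theta - 2 * eta * t
    """
    w, th = [0.0, 0.0], 0.0
    for _ in range(max_iter):
        errors = 0
        for x, t in points:
            if predict(w, th, x) != t:
                w = [wi + 2 * eta * t * xi for wi, xi in zip(w, x)]
                th = th - 2 * eta * t
                errors += 1
        if errors == 0:  # stopping criterion: a full pass with no mistakes
            break
    return w, th

# Hypothetical linearly separable toy data: +1 on the right, -1 on the left.
points = [([2.0, 0.0], 1), ([3.0, 1.0], 1), ([-2.0, 0.0], -1), ([-3.0, -1.0], -1)]
w, th = train_perceptron(points)
print(all(predict(w, th, x) == t for x, t in points))  # True
```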

\subsection{Multi-Layer Perceptron}
One neuron is good, but it is very limited. For example, it is limited to linearly separable problems. The XOR problem, where two classes are spread along the two diagonals of the plane, is an example of a very simple and yet non-linearly separable problem where a single-neuron perceptron will fail. We need to connect multiple neurons together to tackle this kind of problem. We make a few changes to the neuron to prepare it for use in a network.

\begin{itemize}
    \item The neuron now has a trainable bias, which is just a third parameter along with the weights.
    \item The training function is now defined as $\Delta \omega = \eta \left( \dfrac{\partial P}{\partial \omega_1} + \dfrac{\partial P}{\partial \omega_2}\right)$
    \item The performance function is defined as $P = -\dfrac{1}{2} (t-o)^2$. The square makes it smooth and differentiable, the half is there because we know we are going to differentiate it at some point, and the negative sign turns it into a maximisation problem (of no real consequence: maximizing performance is the same as minimizing loss or cost).
\end{itemize}

We can now connect two neurons, and this minimal network will produce an output $O_i$ for each input $x_i$.
\begin{figure}[h]
    \centering
    \includegraphics[width=0.8\textwidth]{images/2neurones.pdf}
    \caption{Minimal multi-layer perceptron with two neurons.}
    \label{fig:2neurones}
\end{figure}

We want to maximise the performance value $P$ using the inputs $x_i$. This maximisation problem requires determining the derivative of $P$ with respect to each parameter and using it to update that parameter (the gradient method). In order to obtain the derivative of $P$ with respect to both $\omega_i$, we leverage the chain rule to get from the end of the network to the beginning, using the derivative at each step:

\[\arraycolsep=4pt\def\arraystretch{2.2}
\left\{
\begin{array}{ll}
\dfrac{\partial P}{\partial \omega_2} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial \omega_2}\\
\dfrac{\partial P}{\partial \omega_1} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial y} \dfrac{\partial y}{\partial P_1} \dfrac{\partial P_1}{\partial \omega_1}\\
\end{array}
\right.
\]

Each derivative is fairly easy to get. We can express our deltas as:
\begin{align*}
\dfrac{\partial P}{\partial \omega_2} &= (t_i - O_i) \times (1-O_i)O_i \times y\\
\dfrac{\partial P}{\partial \omega_1} &= (t_i - O_i) \times (1-O_i)O_i \times \omega_2 \times(1-y)y\times x_i
\end{align*}

These derivatives can now be used to update the values of $\omega_1$ and $\omega_2$ according to the learning rate $\eta$. This chaining of derivatives from the output all the way back to the input of the network is the idea behind backpropagation.
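The two-neuron chain and its update rule fit in a short script. This is a sketch under assumptions not stated in the notes (sigmoid activations on both neurons, a single invented training sample, $\eta=0.5$); the deltas are exactly the chain-rule expressions above, applied as gradient \textit{ascent} on $P$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, eta=0.5, epochs=2000, w1=0.5, w2=0.5):
    """Two-neuron chain x -> sigmoid(w1*x) = y -> sigmoid(w2*y) = O,
    trained by gradient ascent on P = -1/2 (t - O)^2."""
    for _ in range(epochs):
        for x, t in samples:
            y = sigmoid(w1 * x)
            o = sigmoid(w2 * y)
            # dP/dw2 = (t - O) * (1 - O) O * y
            dw2 = (t - o) * (1 - o) * o * y
            # dP/dw1 = (t - O) * (1 - O) O * w2 * (1 - y) y * x
            dw1 = (t - o) * (1 - o) * o * w2 * (1 - y) * y * x
            w1 += eta * dw1  # ascent: P is a (negative) performance measure
            w2 += eta * dw2
    return w1, w2

# Hypothetical toy task: map the input 2.0 to an output close to 0.9.
samples = [(2.0, 0.9)]
w1, w2 = train(samples)
o = sigmoid(w2 * sigmoid(w1 * 2.0))
print(o)  # approaches the target 0.9
```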
\end{document}