
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[left=2cm,right=2cm,bottom=2cm, top=2cm]{geometry}
\usepackage{amsmath}
\usepackage{tcolorbox}
\tcbuselibrary{theorems}
\newtcbtheorem{definition}{Definition}{colback=red!5!white,colframe=red!75!black,fonttitle=\bfseries\large}{def}
\usepackage{xcolor}
\usepackage{graphicx,import}
\usepackage{fancyhdr}
\usepackage{url}
\usepackage{hyperref}
\hypersetup{
linktoc=all, %set to all if you want both sections and subsections linked
colorlinks,
citecolor=black,
filecolor=black,
linkcolor=black,
urlcolor=black
}
\pagestyle{fancy}
\fancyhf{}
\lhead{ECE-657 Tools of Intelligent System Design}
\rhead{Spring 2022}
\lfoot{Arthur Grisel-Davy}
\rfoot{Page \thepage}
\begin{document}
\thispagestyle{empty}
\begin{center}
\par\noindent\rule{\textwidth}{2px}\\
\vspace{0.5cm}
{\Huge Tools of Intelligent System Design}
\vspace{0.3cm}
\par\noindent\rule{\textwidth}{2px}
\end{center}
\tableofcontents
\newpage
Final exam date: July 29, 9am.
\section{Introduction}
This first section is dedicated to defining the core concepts that will be developed in the course. Most importantly, it defines what an intelligent system/machine is.
\begin{definition}{Intelligent System}{intelligent-system}
An algorithm enabled by constraints, exposed by representations that support models, and targeted at reflection, perception, and action.\\
Without loss of generality, an intelligent system is one that generates hypotheses and tests them.
\end{definition}
We say that the algorithm is \textit{enabled by constraints} because the constraints give it a direction to follow to solve the problem. Without constraints, the field of possible solutions is too broad and the algorithm cannot possibly start choosing a direction. The constraints are necessary to enable the intelligence. An intelligent system has one main feature: the generation of outputs based on the inputs and the nature of the system. Common capabilities of intelligent systems include sensory perception, pattern recognition, learning and knowledge acquisition, inference from incomplete information, etc.
\begin{definition}{Intelligent Machine}{intelligent-machine}
An intelligent machine is one that can exhibit one or more intelligent characteristics of a human. An intelligent machine embodies machine intelligence. An intelligent machine, however, may take a broader meaning than an intelligent computer.
\end{definition}
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{images/intelligent_machine_example.png}
\end{figure}
ML algorithms can be sorted into two categories: predictive and descriptive.
\begin{definition}{Predictive vs Descriptive}{desc-pred}
Predictive models return a prediction or outcome that is not known. Classification and regression are examples of predictive analysis.\\
Descriptive models provide new information to describe the data in a new way. Clustering and summarization are examples of descriptive analysis.\\
\end{definition}
The difference between classification and regression is that classification provides a categorical output (one element of a pre-determined finite set, for example a label) whereas regression provides a continuous output as a real number.
\newpage
\section{Simple Linear Regression}
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{images/slr_illustration.pdf}
\end{figure}
We are interested in finding a function that best represents the non-functional relationship\footnote{Meaning that the output cannot be expressed as a mathematical function of the input.} between the input and the output. We suppose that we have a dataset of $n$ input/output points. We can first try to determine whether there is a statistical relationship between input and output. For that we look at the covariance and the correlation.
With $\bar{Y}$ the average of the outputs, $\bar{X}$ the average of the inputs, and $S$ the standard deviation:
\begin{align}
\text{Cov}(X,Y) &= \dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{X})(y_i-\bar{Y})\\
\text{Corr}(X,Y) &= \dfrac{\text{Cov}(X,Y)}{S_xS_y}
\end{align}
The correlation is normalized between $-1$ and $1$. A correlation of $-1$ or $1$ indicates a perfect statistical relationship, ideal for applying SLR. On the other hand, the closer the correlation is to $0$, the less information we have. A low correlation does not indicate that there is no statistical relationship; it simply means that the relationship is not linear.
Now that we know that there is a linear relationship, we want to find the parameters of the affine function that best approximates this relationship. The function is of the form $y_i = f(x_i) = \beta_1 x_i +\beta_0$. The best approximation of $\beta$ is the one that minimizes the squared error. We can find $\beta$ by differentiating the squared error and solving for zero, where the squared error is defined by:
\begin{equation}
Q = \sum_{i=1}^{n}(y_i - \hat{y_i})^2 = \sum_{i=1}^{n}(y_i - \beta_1x_i-\beta_0)^2
\end{equation}
This is equivalent to using the Maximum Likelihood Estimator (MLE) under the (reasonable) assumption that the output is the result of Gaussian noise around the function of the input, i.e. a Normal distribution.
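As an illustration, here is a minimal NumPy sketch of this closed-form solution (the function and variable names are my own, not from the course):
\begin{verbatim}
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares estimates of the slope beta_1 and intercept beta_0."""
    x_bar, y_bar = x.mean(), y.mean()
    # Setting dQ/dbeta to zero yields beta_1 = Cov(X, Y) / Var(X).
    beta_1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    beta_0 = y_bar - beta_1 * x_bar
    return beta_1, beta_0

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
print(simple_linear_regression(x, y))  # slope ~1.94, intercept ~0.15
\end{verbatim}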
This works well for univariate problems. In the case of multivariate problems, you have to take each input individually and evaluate the linear regression from this input to the output, and the problem becomes much more complex (not detailed here).
\newpage
\section{Logistic Regression}
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{images/logreg_illustration.pdf}
\end{figure}
Logistic regression is fundamentally different from linear regression in the sense that it returns the probability of an input belonging to one of two classes. This probability can be compared to a threshold (usually $0.5$) to express the predicted class of the input. In this sense, logistic regression does not return a real number but a categorical output. In this context, the ``training'' data consist of inputs that are real numbers and outputs that are binary (either 0 or 1). Logistic regression can be viewed as an extension of linear regression where the affine curve is ``bent'' to be confined between 0 and 1. We start by expressing the logistic curve as the probability of the input $x$ belonging to class 1.
\begin{equation}
p(x) = \dfrac{1}{1+e^{-y}} = \dfrac{1}{1+e^{-(\beta_1x+\beta_0)}}
\end{equation}
We find again the linear regression function that is used to predict $y$ from $x$, but it is now integrated into the larger logistic function to constrain it between 0 and 1. We still have to find the best value for $\beta$; in this case, we can use the likelihood (or the negative log-likelihood for practicality) and optimize it. The loss for the $k^{th}$ point is expressed as:
$$
L_k =
\begin{cases}
-\ln(p_k) & \textrm{if } y_k=1\\
-\ln(1-p_k) & \textrm{if } y_k=0\\
\end{cases}
= -\ln\left[\left(\dfrac{1}{1+e^{-(\beta_1x_k+\beta_0)}}\right)^{y_k}\left(\dfrac{1}{1+e^{\beta_1x_k+\beta_0}}\right)^{1-y_k}\right]
$$
The issue now is that this function is non-linear in $\beta$, so we cannot simply differentiate it and solve for zero to get the optimum. Numerical methods such as gradient descent or Newton-Raphson need to be leveraged to approximate the optimum $\beta$ value.
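A minimal gradient-descent sketch for fitting $\beta$ (assuming the sigmoid model above; the learning rate and epoch count are arbitrary illustrative choices):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, y, eta=0.01, epochs=1000):
    """Gradient descent on the negative log-likelihood of the model."""
    beta_1, beta_0 = 0.0, 0.0
    for _ in range(epochs):
        p = sigmoid(beta_1 * x + beta_0)  # predicted P(y = 1 | x)
        # Gradient of the negative log-likelihood w.r.t. each parameter.
        beta_1 -= eta * np.sum((p - y) * x)
        beta_0 -= eta * np.sum(p - y)
    return beta_1, beta_0
\end{verbatim}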
\section{Fundamentals of Artificial Neural Networks}
\begin{definition}{Artificial Neural Network}{ann}
An Artificial Neural Network (ANN) is a system that can acquire, store, and utilize experiential knowledge. It is a set of parallelized and distributed computational elements characterized by their architecture, learning paradigm, and activation functions.
\end{definition}
ANNs are based on \textit{what and how we think} the brain's neurons look like and function. Each neurone is a computational element that receives impulses or signals from one or more other neurons and, based on a set of intrinsic parameters (the weights) and a threshold/bias (the activation function), transmits a signal to one or more subsequent neurons. ANNs can be categorized using multiple criteria:
Topologies:
\begin{itemize}
\item Feed-forward: the information flows from the first to the last layer in only one direction.
\item Recurrent: there is no main information flow direction. At some point the information exits the network.
\end{itemize}
Activations functions:
\begin{itemize}
\item Step.
\item Sigmoid.
\item Sigmoid derivative.
\item Linear.
\item $\dots$
\end{itemize}
Learning paradigms:
\begin{itemize}
\item Supervised learning: the learning is guided by the error.
\item Unsupervised learning.
\item Reinforcement learning: rewards or penalizes sequences of steps based on the result.
\item $\dots$
\end{itemize}
\newpage
\subsection{McCulloch-Pitts Model}
The first model of a neurone was proposed to represent the computing process of a biological neurone with mathematical elements.
\begin{figure}[h]
\centering
\includegraphics[width=0.7\textwidth]{images/neuron_model.pdf}
\end{figure}
This model does not include any learning mechanism. The weights and activation function parameters need to be set manually in order to produce a meaningful result. In itself, this model is pretty but not very useful. In order to make it learn, we turn it into a perceptron by updating the weights iteratively depending on the output generated by the neurone on our training set of datapoints.
\subsection{Perceptron}
\begin{figure}[h]
\centering
\includegraphics[width=0.5\textwidth]{images/percepton_problem.pdf}
\caption{Example of a problem solvable with the perceptron.}
\label{fig:perceptron_problem}
\end{figure}
The perceptron can learn and adjust its parameters using training data. The aim of this problem is to find a hyperplane (a line in this case) that separates the $+1$ and $-1$ points. If each class is on one and only one side of the line, then the problem is solved. There is no notion of \textit{best separation line} (like there would be for an SVM, for example). To solve this problem, we use a one-neurone model. This neurone has weights $\omega = [\omega_1,\omega_2]^T$ and a threshold $\theta$. The output $o_i$ for the input $x_i = [x_1,x_2]$ is given by:
\begin{equation}
o_i = [\omega_1,\omega_2,\theta] \cdot [x_1,x_2,1]^T = \omega \cdot x_i + \theta
\end{equation}
Without any better idea of where to start, we initialize the parameters to random (but reasonable) values: $\omega_1=-1, \omega_2=1, \theta=-1$. This set of parameters gives the first line presented in figure \ref{fig:perceptron_problem}.
The first line is not very good because the lower $-1$ point is misclassified. We can use this point to update the line. Let's define the update function using the learning rate $\eta$ (used to reduce the variation between two iterations in order to stay stable).
\begin{align*}
\Delta\omega &= 2\eta t x_i\\
\Delta\theta &= -2 \eta t \textrm{ if $t \neq o$; otherwise $\Delta\omega = \Delta\theta = 0$}
\end{align*}
With this update function and $\eta=0.1$, we can update the weights to $\omega = [-3,-1], \theta=0$. This produces the second line, which separates the two classes perfectly.
The one-neurone perceptron can learn and solve linearly separable problems. There is one hyperparameter to tune, but other than that it should converge to a solution if the learning rate is not too big. The stopping criterion is either finding a solution or reaching a maximum number of iterations.
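A sketch of this training loop, using the update rule and initial values from the example above (the data layout is my own assumption):
\begin{verbatim}
import numpy as np

def train_perceptron(X, t, eta=0.1, max_iter=100):
    """X: (n, 2) array of inputs; t: (n,) array of targets in {-1, +1}."""
    w = np.array([-1.0, 1.0])  # initial weights from the example
    theta = -1.0               # initial threshold
    for _ in range(max_iter):
        errors = 0
        for x_i, t_i in zip(X, t):
            o_i = 1 if w @ x_i + theta > 0 else -1
            if o_i != t_i:              # update only on misclassification
                w += 2 * eta * t_i * x_i
                theta += -2 * eta * t_i
                errors += 1
        if errors == 0:                 # stopping criterion: solution found
            break
    return w, theta
\end{verbatim}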
\subsection{Multi-Layer Perceptron}
One neurone is good, but it is very limited. For example, it is limited to linearly separable problems. The XOR problem, where two classes are spread along the two diagonals of the plane, is an example of a very simple and yet non-linearly separable problem where a single-neurone perceptron will fail. We now need to connect multiple neurones together to tackle this kind of problem. We make a few changes to the neurone to prepare it for use in a network.
\begin{itemize}
\item The neurone now has a trainable bias, which is just a third parameter along with the weights.
\item The training function is now defined as $\Delta \omega_i = \eta \dfrac{\partial P}{\partial \omega_i}$ for each weight $\omega_i$.
\item The performance function is defined as $P = -\dfrac{1}{2} (t-o)^2$. The square makes it smooth and differentiable, the half is there because we know we are going to differentiate it at some point, and the negative sign makes it a maximisation problem (of no real consequence: maximizing performance is the same as minimizing loss or cost).
\end{itemize}
We can now connect the two neurones, and this minimal network will produce an output $O_i$ for each input $x_i$.
\begin{figure}[h]
\centering
\includegraphics[width=0.8\textwidth]{images/2neurones.pdf}
\caption{Minimal multi-layer perceptron with two neurones.}
\label{fig:2neurones}
\end{figure}
We want to maximise the performance value $P$ using the inputs $x_i$. This maximisation problem requires determining the derivative of $P$ with respect to each parameter and using it to update the parameter (the gradient method). In order to obtain the derivative of $P$ with respect to both $\omega_i$, we leverage the chain rule to get from the end of the network to the beginning, using the derivative at each step:
\[\arraycolsep=4pt\def\arraystretch{2.2}
\left\{
\begin{array}{ll}
\dfrac{\partial P}{\partial \omega_2} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial \omega_2}\\
\dfrac{\partial P}{\partial \omega_1} &= \dfrac{\partial P}{\partial O_i} \dfrac{\partial O_i}{\partial P_2} \dfrac{\partial P_2}{\partial y} \dfrac{\partial y}{\partial P_1} \dfrac{\partial P_1}{\partial \omega_1}\\
\end{array}
\right.
\]
Each derivative is fairly easy to get ($P_1$ and $P_2$ denote the pre-activation inputs of the two neurones, and $y$ the output of the first). With sigmoid activations, we can express our deltas as:
\begin{align*}
\dfrac{\partial P}{\partial \omega_2} &= (t_i - O_i) \times (1-O_i)O_i \times y\\
\dfrac{\partial P}{\partial \omega_1} &= (t_i - O_i) \times (1-O_i)O_i \times \omega_2 \times(1-y)y\times x_i\\
\end{align*}
These derivatives can now be used to update the values of $\omega_1$ and $\omega_2$ according to the learning rate $\eta$. This chaining of derivatives from the output all the way back to the input of the network is the idea behind backpropagation.
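As a sketch, the two derivatives above translate directly into code (a single training pair and sigmoid activations are assumed):
\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_two_neurone(x, t, w1, w2, eta=0.1):
    # Forward pass through the two-neurone chain.
    y = sigmoid(w1 * x)   # output of the first neurone
    O = sigmoid(w2 * y)   # output of the second neurone
    # Gradients of P = -(1/2)(t - O)^2; we ascend since P is maximized.
    dP_dw2 = (t - O) * (1 - O) * O * y
    dP_dw1 = (t - O) * (1 - O) * O * w2 * (1 - y) * y * x
    return w1 + eta * dP_dw1, w2 + eta * dP_dw2
\end{verbatim}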
\subsection{Backpropagation}
As we saw with the simple MLP example before, it is possible to update the parameters once the output is known. In a very simple case, we only need to know the gradient at the next neurone to compute the gradient at the previous neurone. We would now like to generalize and consider a general MLP consisting of multiple neurones per layer. By applying the same method as before, it is still possible to update all the parameters in this more general case.\\
\textbf{Notation:} We consider an MLP with multiple hidden layers. The first hidden layer, connected to the input, is the layer $l$. The last hidden layer, connected to the output, is the layer $k$. Between $l$ and $k$, any number of layers can be imagined. For this example, we will consider 2 non-specific layers $i$ and $j$. The network is then composed of 4 hidden layers, each containing any number of neurones, denoted $N_{layer}$. Each neurone is composed of two parts: an input part $a_i$ that computes $w^Tz_i$, with $z_i$ the input of layer $i$, and an activation part $f_i$ that represents the activation function. We consider here that all activation functions are sigmoids, represented by $\sigma$.
\begin{figure}[h]
\centering
\includegraphics[width=\textwidth]{images/backtrack.pdf}
\caption{General view of an MLP with multiple hidden layers.}
\label{fig:backtrack}
\end{figure}
We consider an error function of the form $E = \dfrac{1}{2}(O-t)^2$. We want to minimize this error function. At first, all parameters are considered initialised to random values. Even so, it is possible to present a training sample to the network and obtain a prediction. This is called the forward pass. After the forward pass, the parameters remain unchanged, but all the intermediate values inside and at the output of each neurone are populated. We also get an output vector $O$. We now want to minimize the error by adjusting the parameters. We have to find, for each parameter $w$, the direction of change that will minimize the error with regard to this parameter, i.e. the gradient of $E$ with respect to $w$. We consider a general case and suppose that we want to obtain the gradient for the parameter $w_{Li}$. It is not possible to directly differentiate $E$ with respect to $w_{Li}$, as there is no direct relationship between them. What we can do is differentiate $E$ with respect to the output value of the last layer $a_k$. Let's use the chain rule to express that:
\begin{equation}
\dfrac{\partial E}{\partial w_{Li}} = \dfrac{\partial E}{\partial a_k} \times \dfrac{\partial a_k}{\partial w_{Li}}
\end{equation}
When expressing the derivative of $E$ with respect to $a_k$, we run into the issue that $a_k$ is used as input for all neurones of the output layer. During the forward pass, each neurone takes the value of $a_k$ and uses it to compute its own output value. During the computation of the gradients, we therefore need to take into account the effect of $a_k$ through all the output neurones. The simplest way to do that is to add all the contributions:
\begin{equation}
\dfrac{\partial E}{\partial a_k} = \sum_{k}\left[(t-f_k)(1-f_k)f_k\right]
\end{equation}
Using the same principle, we can express all the links in the chain rule to connect the parameter we seek and the error function:
\begin{equation}
\dfrac{\partial E}{\partial w_{Li}} = \sum_j\left[ \sigma'(f_k)\sum_{k}\left[(t-f_k)(1-f_k)f_k\right]w_{jk}\right]w_{ij}\sigma'(a_i)f_l
\end{equation}
This looks terrifying, but it is composed of simple elements:
\begin{itemize}
\item The change of the output with respect to the input of a neurone (the passage through the activation function) is the derivative of the sigmoid function (not expanded because it is ugly): $\dfrac{\partial f_i}{\partial a_i} = \sigma'(a_i)$.
\item The change of the input of a neurone with respect to the output of the previous neurone is the weight: $\dfrac{\partial a_j}{\partial f_i} = \dfrac{\partial}{\partial f_i} f_i\times w_{ij} = w_{ij}$.
\item The change of the error function with respect to the input of the last layer: $\dfrac{\partial E}{\partial a_k} =(t-f_k)(1-f_k)f_k$.
\end{itemize}
With these elements, it is possible to express the impact of each weight on the error function and thus obtain the gradient that directs the change to apply to each weight in order to minimize the error. The weight update function is expressed as:
\begin{equation}
\Delta w(t+1) = -\eta \dfrac{\partial E(t)}{\partial w} + \gamma \Delta w(t)
\end{equation}
The first term is the learning step: the parameter $\eta$ controls the size of the parameter change at each iteration. The second term is the momentum: to avoid jumping around, we take into account the previous direction with a weight of $\gamma$. This helps smooth the trajectory and reach the minimum faster.
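In code, one such update could be sketched as follows (the gradient itself is assumed to come from the backpropagation above):
\begin{verbatim}
def momentum_step(w, grad, prev_delta, eta=0.01, gamma=0.9):
    """One update: a gradient step plus a fraction of the previous step."""
    delta = -eta * grad + gamma * prev_delta
    return w + delta, delta  # keep delta around for the next iteration
\end{verbatim}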
\section{Radial Basis Function (RBF) Network}
The idea is to transform our data to make them linearly separable. Once they are linearly separable, a perceptron (or any other classifier, like an SVM) can find the best separation. In order to make the data linearly separable, we would like to bring together the data of the same class. The intuition is that whatever the initial distribution, if you pull data of different classes toward different points hard enough, at some point they become linearly separable. To achieve that, we design a specific type of network where the first layer (after the input) has fixed weights (all ones) and special functions. These functions are responsible for the transformation of the input. The rest of the network is an MLP that takes care of finding the best separation. The problem is now twofold:
\begin{itemize}
\item Find the parameters of the transformation.
\item Find the values of the weights for the separation of the classes.
\end{itemize}
There can be different transformation functions. The most common is the Gaussian kernel function, given by $g_i(x) = \exp\left(-\dfrac{|x-v_i|^2}{2\sigma_i^2}\right)$. The learning of these parameters (one set of parameters per neurone of the transformation layer) and of the parameters of the MLP is done in steps:
\begin{enumerate}
\item First, an unsupervised clustering method creates the correct number of clusters among the input data. These clusters will be used as centers for pulling the data (see the sketch after this list).
\item ...
\end{enumerate}
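A sketch of this first (clustering plus transformation) stage, assuming k-means provides the centers $v_i$; scikit-learn's KMeans is used purely for illustration:
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def rbf_features(X, n_centers=5, sigma=1.0):
    """Map each input to its Gaussian-kernel activations at the centers."""
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_
    # g_i(x) = exp(-||x - v_i||^2 / (2 sigma^2)) for each center v_i.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * sigma ** 2))  # shape (n_samples, n_centers)
\end{verbatim}
The transformed features can then be fed to a perceptron or MLP for the separation step.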
\section{Kohonen's Self-Organizing Networks}
This NN does not use the regular configuration of other NNs. It is an unsupervised method that provides clustering and dimensionality reduction.
Let's begin with an idea. Say you have two sheets of paper. Each sheet has some nodes on it (the same number on each sheet), and each node is fully connected to all the nodes on the other sheet. All the connections to one node form a bundle of synaptic connections. One sheet is pre-synaptic and the other is post-synaptic. A pre-synaptic node can either excite or inhibit. The idea is that similar pre-synaptic nodes will excite post-synaptic nodes that are close to each other, because they have similar features (in the same region). We can then rearrange the pre-synaptic nodes based on the excitement regions of the post-synaptic nodes.
Fast forward ten years. Kohonen took up this idea and built on it. First, he supposed that the nodes are perceptrons. Second, the synaptic connections are weighted (like from one layer to the other). He also relaxed the constraint that the two sheets should have the same number of nodes, got rid of the lower sheet, and considered the perceptrons of the upper sheet to be fully connected to each input data point. For each input, we can compute the output of each perceptron and select one (and only one) to represent that input. For each input datum, the selected node is rewarded.
We represent this with an MLP whose input layer is the same size as the number of features of the examples, and with many neurones in the second layer.
\begin{itemize}
\item First, select how many groups our data should be segregated into (i.e. how many representatives).
\item Second, initialize the weights randomly.
\item Then, compute the value of each node for the first example. The best-scoring node is considered the winner. To compute the value, many similarity measures can be used, but in this case we will use the distance between the weight vector and the input example vector.
\item Reward the winner node. Again, there could be multiple ways to do it. We want to get closer to the center of the whole cluster, not to the specific example that elected this node as the winner. We define the variation of the weights as $\Delta w = \alpha(k)||x_i - w_i||$. The learning rate is defined as $\alpha(k) = \alpha(0)\left(1-\dfrac{k}{T}\right)$, where $k$ is the current epoch and $T$ is the total number of epochs. $\alpha(0)$ is a hyperparameter. The learning rate decreases with every iteration so that the weights settle at a centroid value.
\item Continue until the total number of epochs is reached (a sketch of this loop follows the list).
\end{itemize}
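A sketch of this winner-take-all loop, with the update written in the usual vector form $\Delta w = \alpha(k)(x_i - w)$ (the array shapes are my own choice):
\begin{verbatim}
import numpy as np

def kohonen(X, n_nodes, alpha0=0.5, T=100):
    """X: (n, d) data; one weight vector (candidate centroid) per node."""
    rng = np.random.default_rng(0)
    W = rng.normal(size=(n_nodes, X.shape[1]))
    for k in range(T):
        alpha = alpha0 * (1 - k / T)  # decaying learning rate
        for x in X:
            winner = np.argmin(np.linalg.norm(W - x, axis=1))
            W[winner] += alpha * (x - W[winner])  # pull winner toward x
    return W
\end{verbatim}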
This network is similar to the k-means clustering algorithm. The main difference is that the weights (centroids) are updated after each training example. The main problem with this method is that we need to know the number of clusters beforehand. This is annoying, and we would like to get rid of that constraint.
We now consider that the hidden layer has a lot of nodes. We imagine that these nodes are initially equally spaced across a plane. We can still select a winner for each training example. Where this differs is that now the entire neighbourhood of the winning node is rewarded, not only the winner. We design a new reward equation that takes into account the distance to the winning node: $\Delta w_i = \alpha(k) h(w_i) ||x-\hat{w}||$ with $h(w_i)=||\hat{w}-w_i||$, and $\hat{w}$ the winning node. Basically, the reward decreases as a Gaussian around the winning node, and the variance of the Gaussian decreases with the iterations.
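The neighbourhood reward could be sketched like this, assuming a Gaussian $h$ (consistent with the description above; the grid positions and the shrinking $\sigma$ are supplied by the caller):
\begin{verbatim}
import numpy as np

def neighbourhood_update(W, positions, x, alpha, sigma):
    """Reward every node, scaled by its grid distance to the winner."""
    winner = np.argmin(np.linalg.norm(W - x, axis=1))
    # h decreases as a Gaussian of the distance to the winning node.
    d2 = np.linalg.norm(positions - positions[winner], axis=1) ** 2
    h = np.exp(-d2 / (2 * sigma ** 2))
    W += alpha * h[:, None] * (x - W)
    return W
\end{verbatim}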
Now we don't need to know the number of clusters beforehand, because the clusters are no longer represented by unique nodes but by whole regions of the map. The dimensionality reduction happens because a point with 4 (or however many) features is represented by a node in a 2D (or $N$D) space.
In the end the algorithm has 3 main stages: initialization, competition, and adaptation (the reward step).
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{images/kohonen.pdf}
\caption{Kohonen self-organizing network}
\label{kohonen}
\end{figure}
\newpage
\section{Hopfield Network}
In the spirit of mimicking the way our brain works, we may want to mimic the memory component. We will now study the other big type of NN architecture: the recurrent NN. Hopfield came up with a way to restrict this kind of network to make it usable.
\begin{itemize}
\item Connections are now bidirectional with the same weight: $w_{ij} = w_{ji}$.
\item The neurones cannot excite themselves: $w_{ii} = 0$.
\item The neurones are not processing units but states of value $-1$ or $1$ (or 0 and 1).
\item There is no activation function.
\item The state of node $i$ becomes 1 if $\sum_j w_{ij}O_j > \theta$; otherwise the state is $-1$.
\end{itemize}
With these constraints, we build a stateful NN where the nodes have states. For the network to become stable, the energy of the network needs to reach a minimum.
[insert hopfield network figure]
\begin{definition}{Energy of the network}{energy-network}
If the threshold is 0, the energy of the network is defined as $E = -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j$.\\
If the threshold is not 0, the energy is defined as $E = -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j + \sum_iO_i\theta$.
\end{definition}
Let us now prove that the network reaches a stable state.
First, we look at the state of node $O_i$. The new value of the node depends on the value of $\sum_j w_{ij}O_j$ compared to $\theta$. If the sum is more than $\theta$, then the new state of the node is 1. Otherwise it is $-1$.
\begin{align*}
E &= -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j + \sum_iO_i\theta\\
\Delta E &= E_{new} - E_{old}\\
\Delta E &= -\dfrac{1}{2}\sum_j w_{ij}O_i^{new}O_j + O_i^{new}\theta - (-\dfrac{1}{2}\sum_j w_{ij}O_i^{old}O_j + O_i^{old}\theta)\\
\Delta E &= -\Delta O_i( \dfrac{1}{2}\sum_j w_{ij}O_j - \theta)\\
\end{align*}
The sign of $\Delta E$ seems to depend on the relative values of $\dfrac{1}{2}\sum_j w_{ij}O_j$ and $\theta$. However, regardless of the values, the delta is always negative, because the update rule ties the sign of $\Delta O_i$ to the sign of $\sum_j w_{ij}O_j - \theta$: the node only goes from $-1$ to $1$ when the sum exceeds $\theta$, and from $1$ to $-1$ when it does not. Whenever a change is allowed, the energy therefore goes down.
We now believe we have reached a stable point of the network (a minimum of the energy). The nodes of the network now represent coded information that the network has memorized. If we know the weights of the network, we can get to the lowest-energy configuration. The problem is that one set of weights can lead to multiple different configurations depending on the starting point. Also, the price in weights is very expensive compared to the information the network can \textit{remember}. The point of this network was to mimic memory. Not an efficient or useful memory, but memory nonetheless. So how can we make it more efficient? One first option is to choose the weights in order to reduce the occurrences where weights and states do not form a bijection. We can calculate the best value for the weights given the states (patterns) to store, by looking at the products between states:
\begin{equation*}
W = \dfrac{1}{n}[p_1,\dots,p_n]\cdot [p_1,\dots,p_n]^T-I_n
\end{equation*}
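Putting the pieces together, a sketch of storage (the outer-product rule above, with the diagonal zeroed to enforce $w_{ii}=0$) and recall (the threshold update with $\theta=0$):
\begin{verbatim}
import numpy as np

def hopfield_store(patterns):
    """patterns: (m, n) array of +/-1 states; Hebbian outer-product rule."""
    m, n = patterns.shape
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0)  # w_ii = 0: no self-excitation
    return W

def hopfield_recall(W, state, sweeps=20):
    """Update nodes until the network settles into a low-energy state."""
    state = state.copy()
    for _ in range(sweeps):
        for i in range(len(state)):
            state[i] = 1 if W[i] @ state > 0 else -1
    return state
\end{verbatim}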
\section{Deep Learning}
Deep learning is a subset of machine learning, not a different category. It emerges from the limitations of MLPs. MLPs transform inputs linearly (and through the activation function) into a new set of features at each layer. We could imagine doing that for 1000 layers to detect more and more complex features in our data. This would not work because of the vanishing gradient problem. The idea is that every new layer affects the error signal.
If the initial weights are big, each layer increases the error signal during backpropagation; if the network is too deep, the gradient explodes. If the initial weights are small, each layer decreases the error signal; if the network is too deep, the gradient vanishes. Each case is bad, but the vanishing gradient is worse because the network will not even learn. As a matter of fact, the gradient is more likely to vanish, because the derivative of the sigmoid (at most $1/4$) also multiplies the error signal at every layer.
If the weights cannot be changed because the error signal is too weak, the weights remain in their initial random state and the network is no good.
Another reason why a deep MLP is not a good idea is that the network is very prone to overfitting. One countermeasure is a lot of data, but data isn't free.
There are possible ways of solving this.
\begin{itemize}
\item Initializing the weights around 1 so that the gradient doesn't explode or vanish (at least not instantly).
\item Not using the sigmoid but a function whose derivative is not a function of the output, for example ReLU (or Leaky ReLU).
\end{itemize}
Neither solves the problem in the long run: the vanishing gradient remains a problem after some training time.
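The effect is easy to reproduce numerically: the derivative of the sigmoid is at most $0.25$, so even with weights close to 1 the error signal shrinks geometrically with depth (an illustrative computation, not course material):
\begin{verbatim}
# Each layer multiplies the error signal by roughly w * sigmoid'(a).
signal = 1.0
for layer in range(50):
    signal *= 0.9 * 0.25  # weight near 1, best-case sigmoid derivative
print(signal)             # ~4e-33: the gradient has effectively vanished
\end{verbatim}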
For simple applications, we don't need deep learning anyway. The problem mainly arises when considering images. A small image of $250\times250\times1$ is already an input vector of $\approx 62$k entries. Usually, the hidden layer is larger than the input layer, so the first weight matrix already has billions of terms.
Even if the problem of the vanishing gradient is solved (by skipping layers, for example), there are still too many parameters, and this requires a very large amount of training data.
\subsection{Convolutional Neural Networks}
One of the ideas to make deep NNs applicable is to reduce the number of parameters. The densely connected layers of an MLP create one trainable parameter per pair of previous/next nodes. To get around this issue, we can create \textit{sparsely} connected neural networks. We remove some of the connections, and this reduces the number of trainable parameters by as much. This is good, but we can go further.
The connections still require unique weights. We can actually get rid of this constraint and decide that groups of weights share the same value. This again reduces the number of parameters to train.
With these two ideas, we can reduce the number of parameters enough to actually use a lot of layers. This is what allowed AlexNet to do 1000-class classification on images. This network leverages convolutions: applying a convolution on an image with a kernel is the same as grouping weights together and considering only certain inputs to compute each output (between two layers).
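A sketch of a 2D convolution written as an explicit loop, to make both ideas visible: each output sees only a small window (sparse connectivity) and the same few kernel weights serve every position (weight sharing):
\begin{verbatim}
import numpy as np

def conv2d(image, kernel):
    """Valid convolution: one small kernel reused at every position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output depends only on a kh x kw window of the input.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out
\end{verbatim}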
$\dots$
\subsection{Recurrent Neural Networks}
The networks encountered so far assume that the data are independent and identically distributed. Recurrent neural networks, on the other hand, do not make this assumption and assume some relationship/dependency between the data points. This is especially applicable to speech and text analysis, where the context of a word can change its meaning. Common networks or models are not equipped to deal with sequential data. These data are specific and thus require specific models or assumptions to be processed. Let's take an example:
\begin{center}
{\Large The ball was blue when it came out.}\\
{\Large The ball came out blue.}
\end{center}
These two sentences are composed of more or less the same words but have vastly different meanings. Understanding the relationship between the words is fundamental to understanding the complete sentence. A feed-forward NN would consider each word as an independent input and would fail miserably at understanding the sentence. The same temporal/sequential aspect of data can be found in time series (stock prices, for example).
Recurrent NNs offer flexibility by not requiring fixed input and output sizes. The first notion proposed to describe a recurrent NN was to evaluate the connection between elements. If two elements always come together, there is a strong positive connection between them. If, on the other hand, two elements do not usually come together, then they have a negative connection. This idea allows finding connections between words that usually appear together in sentences. The words are represented by states. We can try to predict the next state (word) from the previous and current states.
[insert unravel RNN schema]
All of the weights for $x$ and $y$ are shared across time steps. The states $a$ can be computed by:
\begin{equation*}
a_1 = g_a(w_{ax}\cdot x_1 + w_{a}\cdot a_0 + b_a)
\end{equation*}
And similarly $y$ can also be computed as:
\begin{equation*}
y_i = g_y(w_{ay}\cdot a_i + b_y)
\end{equation*}
In words, the state is a function of the input $x$ and of the previous state $a_{i-1}$. In turn, the output $y$ is a function of the current state. Of course, a good matrix form is always welcome:
\begin{equation*}
a(t+1) = g_a\left([w_{a},w_{ax}]\cdot \left[\begin{array}{c}a(t)\\x(t+1)\end{array}\right]+b_a\right)
\end{equation*}
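A sketch of the forward pass implied by these equations ($g_a$ and $g_y$ are assumed to be tanh and sigmoid for illustration):
\begin{verbatim}
import numpy as np

def rnn_forward(xs, w_a, w_ax, w_ay, b_a, b_y):
    """xs: list of input vectors; returns one output per time step."""
    a = np.zeros(w_a.shape[0])  # initial state a_0
    ys = []
    for x in xs:                # the same weights are reused at each step
        a = np.tanh(w_a @ a + w_ax @ x + b_a)
        ys.append(1.0 / (1.0 + np.exp(-(w_ay @ a + b_y))))
    return ys
\end{verbatim}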
Now the big question: \textit{how the heck do you update the weights?} Backpropagation to the rescue. Each output can be used to compute one loss value. We combine these loss values with cross-entropy (entropy generalized to multiple classes).
We can now update the weights for the states with Backpropagation Through Time (BPTT), beginning from the last state down to the first one. The important parts of this network are not the inputs or outputs but the states of the network, which contain the influence of all previous states.
The architecture presented until now is called \textit{many-to-many}: there are as many outputs as inputs. A more common one would be \textit{many-to-one}, which gives one output for multiple inputs (synthesizing text, for example). All other combinations of \textit{many} and \textit{one} can be considered and are useful for other problems.
\subsection{Language Processing}
Now that the problem of sequential data is solved by recurrent NNs, we can analyse text. The problem with text is that computers, contrary to humans, are good with numbers but not so much with words. Each word has to be represented by values that carry its meaning. One interesting aspect of language is that the set of words is finite and sentences are finite: there are only a few hundred thousand words in the dictionary. We could assign an index to each word and call it a day; each word would then be represented by a vector. The problem with this method is that there is no preservation of similarity in this representation: two very different words can have similar representations, and the other way around too. To find a better encoding, we suppose that every word in the dictionary has the same number of features. These features describe some aspect of the word: for example gender, royalty, age, object, animal, etc. We can represent each word by some value associated with each feature.
Creating a good encoding begins with the concept of an affinity matrix. This matrix contains the number of times the word of the column comes directly after the word of the row. Knowing this affinity matrix, we can use an RNN with the columns of the affinity matrix as input and the probability of the next word as output; the learned weights then represent the features used to encode the words.
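A sketch of building such an affinity matrix from a toy corpus, counting how often each word directly follows another (the tokenization is deliberately simplistic):
\begin{verbatim}
from collections import defaultdict

def affinity_matrix(sentences):
    """counts[w1][w2]: number of times w2 comes directly after w1."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2 in zip(words, words[1:]):
            counts[w1][w2] += 1
    return counts

m = affinity_matrix(["the ball was blue", "the ball came out blue"])
print(m["the"]["ball"])  # 2
\end{verbatim}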
\end{document}