In the end, the algorithm has three main stages: initialization, competition and ???.
\begin{figure}
\centering
\includegraphics[width=0.7\textwidth]{images/kohonen.pdf}
\caption{Kohonen self-organizing network}
\label{kohonen}
\end{figure}
\newpage
In the spirit of mimicking the way our brain works, we may want to mimic the memory component. We will now study the other big type of NN architecture: the recurrent NN. Hopfield came up with a way to restrict this kind of network to make it usable.
\begin{itemize}
\item Connections are now bidirectional with the same weight: $w_{ij} = w_{ji}$.
\item The neurons cannot excite themselves: $w_{ii} = 0$.
\item The neurons are not processing units but states of value $-1$ or $1$ (or $0$ and $1$).
\item There is no activation function.
\item The state of neuron $i$ becomes $1$ if $\sum_j w_{ij}O_j > \theta$. Otherwise the state is $-1$.
\end{itemize}
With these constraints, we have built a stateful NN where the nodes have states. For the network to become stable, the energy of the network needs to reach a minimum.
[insert hopfield network figure]
\begin{definition}{Energy of the network}{energy-network}
If the threshold is 0, the energy of the network is defined as $E = -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j$.\\
If the threshold is not 0, the energy is defined as $E = -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j + \sum_iO_i\theta$.
\end{definition}
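As a quick sanity check (with made-up values), take two neurons with $w_{12}=w_{21}=1$, threshold $\theta=0$ and states $O_1=O_2=1$:
\begin{equation*}
E = -\dfrac{1}{2}\left(w_{12}O_1O_2 + w_{21}O_2O_1\right) = -\dfrac{1}{2}(1+1) = -1
\end{equation*}
Flipping either neuron to $-1$ changes the sign of both products and raises the energy to $+1$, so the agreeing configuration is a minimum.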
We now have to prove that the network reaches a minimum-energy state that is stable.
First, we look at the state of a node $O_i$. The new value of the node depends on the value of $\sum_j w_{ij}O_j$ compared to $\theta$. If the sum is more than $\theta$, then the new state of the node is $1$. Otherwise it is $-1$.
Only the terms involving node $i$ change, and since $w_{ij}=w_{ji}$, the state $O_i$ appears twice in the double sum, which cancels the factor $\dfrac{1}{2}$:
\begin{align*}
E &= -\dfrac{1}{2}\sum_i\sum_j w_{ij}O_iO_j + \sum_iO_i\theta\\
\Delta E &= E_{new} - E_{old}\\
\Delta E &= -O_i^{new}\sum_j w_{ij}O_j + O_i^{new}\theta - \left(-O_i^{old}\sum_j w_{ij}O_j + O_i^{old}\theta\right)\\
\Delta E &= -\Delta O_i\left(\sum_j w_{ij}O_j - \theta\right)
\end{align*}
The sign of $\Delta E$ seems to depend on the relative values of $\sum_j w_{ij}O_j$ and $\theta$. However, regardless of the values, the delta is never positive, because the same comparison determines $\Delta O_i$. If $\sum_j w_{ij}O_j > \theta$, the new state is $1$, so $\Delta O_i \geq 0$ and the parenthesis is positive. If the sum is below $\theta$, the new state is $-1$, so $\Delta O_i \leq 0$ and the parenthesis is negative. In both cases $\Delta E \leq 0$: whenever a node is allowed to change, the energy goes down, and since the energy is bounded below, the network reaches a stable minimum.
We believe now that we have reached a stable point for the network (a local energy minimum). The nodes of the network now represent coded information that the network has memorized. If we know the weights of the network, we can get to a lowest-energy configuration. The problem is that one set of weights can lead to multiple different configurations depending on the starting point. Also, the price in weights is very expensive compared to the information the network can \textit{remember}. The point of this network was to mimic memory. Not an efficient or useful memory, but memory nonetheless. So how can we make it more efficient? A first option is to choose the weights so as to reduce the occurrences where weights and states do not form a bijection. We can compute the best weights directly from the states (patterns) $p_1,\dots,p_n$ we want to store, by taking products between states:
\begin{equation*}
W = \dfrac{1}{n}[p_1,\dots,p_n]\cdot [p_1,\dots,p_n]^T-I_n
\end{equation*}
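A minimal sketch in Python ties these pieces together: Hebbian weights built from the formula above, threshold-0 updates, and the energy definition. The pattern values and function names are my own choices for illustration.

```python
# Illustrative Hopfield sketch (pattern values and names are my own).
# Hebbian rule from the text: W = (1/n) P P^T - I with n stored patterns;
# subtracting the identity zeroes the diagonal, enforcing w_ii = 0.

def train(patterns):
    size, n = len(patterns[0]), len(patterns)
    W = [[0.0] * size for _ in range(size)]
    for p in patterns:
        for i in range(size):
            for j in range(size):
                if i != j:
                    W[i][j] += p[i] * p[j] / n
    return W

def energy(W, s):
    # E = -1/2 sum_i sum_j w_ij O_i O_j  (threshold = 0)
    n = len(s)
    return -0.5 * sum(W[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(W, s, sweeps=10):
    # Threshold-0 update: state is 1 if sum_j w_ij O_j > 0, else -1.
    s = list(s)
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if sum(W[i][j] * s[j] for j in range(len(s))) > 0 else -1
    return s

stored = [1, 1, -1, -1, 1, -1]
W = train([stored])
noisy = [1, -1, -1, -1, 1, -1]               # one bit flipped
print(energy(W, noisy) > energy(W, stored))  # True: the memory lies lower
print(recall(W, noisy) == stored)            # True: updates descend to it
```

Running the updates from the corrupted pattern descends the energy landscape back to the stored memory, which is exactly the "remembering" behaviour described above.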
\section{Deep Learning}
Deep learning is a subset of machine learning, not a different category. It emerges from the limitations of MLPs. MLPs transform inputs linearly (and through the activation function) into a new set of features at each layer. We could imagine doing that for a thousand layers to detect more and more complex features in our data. This does not work because of the vanishing gradient problem: every new layer affects the error signal.
If the initial weights are big, each layer amplifies the error signal during backpropagation; if the network is too deep, the gradient explodes. If the initial weights are small, each layer attenuates the error signal; if the network is too deep, the gradient vanishes. Both cases are bad, but the vanishing gradient is worse because the network will not even learn. As a matter of fact, the gradient is more likely to vanish because the derivative of the sigmoid (at most $0.25$) also multiplies the error signal at every layer.
If the weights cannot be changed because the error signal is too weak, the weights remain in their initial random state and the network is no good.
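A quick numeric sketch makes the shrinkage concrete (the depth of 50 layers and the best-case evaluation at $z=0$ are assumptions for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Backpropagation multiplies the error signal by sigma'(z) at each layer.
# sigma'(z) = sigma(z) * (1 - sigma(z)) peaks at z = 0 with value 0.25,
# so even in the best case the signal shrinks geometrically with depth.
d = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
signal = 1.0
for _ in range(50):                       # 50-layer network (assumed)
    signal *= d
print(signal)   # about 8e-31: the gradient has effectively vanished
```

With any realistic depth the surviving signal is far below floating-point noise, which is why the early layers never learn.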
Another reason why a deep MLP is not a good idea is that the network will be very prone to overfitting. One countermeasure is a lot of data, but data isn't free.
There are possible ways of solving this.
\begin{itemize}
\item Initializing the weights around $1$ so that the gradient doesn't explode or vanish (at least not instantly).
\item Not using the sigmoid but a function whose derivative does not saturate: ReLU (or Leaky ReLU) for example, whose derivative is $1$ for positive inputs.
\end{itemize}
Neither of them solves the problem in the long run: the vanishing gradient remains a problem after some training time.
For simple applications, we don't need deep learning anyway. The problem mainly arises when considering images. A small image of $250\times250\times1$ is already an input vector of $\approx 62.5$k values. Usually, the hidden layer is larger than the input layer, so the first weight matrix already contains billions of terms.
Even if the vanishing gradient problem is solved (by skipping layers, for example), there are still too many parameters, and this requires a very large amount of training data.
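The back-of-the-envelope count behind this claim can be written out directly (the hidden-layer size of 65,000 is an assumed, plausible value):

```python
# Parameter count for a single dense layer on a 250x250x1 image
# (the hidden size of 65,000 is an assumption for illustration).
inputs = 250 * 250 * 1       # 62,500 input values
hidden = 65_000              # "usually larger than the input layer"
weights = inputs * hidden    # one trainable weight per input/hidden pair
print(inputs)                # 62500
print(weights)               # 4062500000 -- about 4 billion weights
```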
\subsection{Convolutional Neural Networks}
One of the ideas to make deep NNs applicable is to reduce the number of parameters. The densely connected layers of an MLP create one trainable parameter per pair of previous/next nodes. To get around this issue, we can create \textit{sparsely} connected neural networks: we remove some of the connections, which reduces the number of trainable parameters by as much. This is good, but we can go further.
The remaining connections still require unique weights. We can get rid of this constraint and decide that groups of connections share the same weight value. This reduces the number of parameters to train even further.
With these two ideas, we can reduce the number of parameters enough to actually use a lot of layers. This allowed AlexNet to do 1000-class classification on images. This network leverages convolutions: applying a convolution on an image with a kernel is the same as sharing weights between groups of connections and considering only certain inputs to compute each output (between two layers).
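The claim that a convolution is just a sparse, weight-shared dense layer can be checked directly; the kernel and input values below are my own toy example:

```python
# A 1-D convolution written two ways: as a sliding kernel, and as a
# sparse matrix whose nonzero entries reuse the same three weights.
kernel = [1.0, 0.0, -1.0]
x = [3.0, 1.0, 4.0, 1.0, 5.0]

# Sliding-window form: each output looks at 3 neighbouring inputs.
conv = [sum(kernel[k] * x[i + k] for k in range(3)) for i in range(len(x) - 2)]

# Equivalent dense-layer form: a weight matrix that is mostly zeros
# and shares the kernel values across rows.
W = [[0.0] * len(x) for _ in range(len(x) - 2)]
for i in range(len(W)):
    for k in range(3):
        W[i][i + k] = kernel[k]
matmul = [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

print(conv)            # [-1.0, 0.0, -1.0]
print(conv == matmul)  # True: convolution is a sparse, weight-shared layer
```

Each row of $W$ holds only three nonzero entries, and those entries are the same three numbers on every row: sparsity plus weight sharing.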
$\dots$
\subsection{Recurrent Neural Networks}
The networks encountered so far assume that the data are independent and identically distributed. Recurrent neural networks, on the other hand, do not make this assumption and instead assume some relationship/dependency between the data. This is especially applicable to speech and text, where the context of a word can change its meaning. Common networks or models are not equipped to deal with sequential data. These data are specific and thus require specific models or assumptions to be processed. Let's take an example:
\begin{center}
{\Large The ball was blue when it came out.}\\
{\Large The ball came out blue.}
\end{center}
These two sentences are composed of more or less the same words but have vastly different meanings. Understanding the relationships between the words is fundamental to understanding the complete sentence. A feed-forward NN would consider each word as an independent input and would fail miserably at understanding the sentence. The same temporal/sequential aspect of data can be found in time series (stock prices, for example).
Recurrent NNs offer flexibility by not requiring fixed input and output sizes. The first notion proposed to describe a recurrent NN was to evaluate the connection between elements. If two elements always come together, there is a strong positive connection between them. If, on the other hand, two elements do not usually come together, then they have a negative connection. This idea allows us to find connections between words that usually appear together in sentences. The words are represented by states, and we can try to predict the next state (word) from the previous and current states.
[insert unravel RNN schema]
All of the weights for $x$ and $y$ are the same at every time step. The state $a_i$ can be computed by:
\begin{equation*}
a_i = g_a(w_{ax}\cdot x_i + w_{a}\cdot a_{i-1} + b_a)
\end{equation*}
And similarly $y$ can also be computed as:
\begin{equation*}
y_i = g_y(w_{ay}\cdot a_i + b_y)
\end{equation*}
In words, the state is a function of the input $x$ and of the previous state $a_{i-1}$. In turn, the output $y$ is a function of the current state. Of course, a good matrix form is always welcome:
\begin{equation*}
a(t+1) = g_a\left([w_{a},w_{ax}]\cdot \left[\begin{array}{c}a(t)\\x(t+1)\end{array}\right]+b_a\right)
\end{equation*}
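A scalar sketch of this forward pass may help (the weight values and the choice of $\tanh$ for $g_a$ and identity for $g_y$ are my own assumptions):

```python
import math

w_ax, w_a, b_a = 0.5, 0.8, 0.0   # input->state, state->state, state bias
w_ay, b_y = 1.0, 0.0             # state->output (assumed values)

def step(a_prev, x_t):
    # a_t = g_a(w_ax * x_t + w_a * a_{t-1} + b_a), with g_a = tanh
    a_t = math.tanh(w_ax * x_t + w_a * a_prev + b_a)
    # y_t = g_y(w_ay * a_t + b_y), with g_y = identity
    return a_t, w_ay * a_t + b_y

a, outputs = 0.0, []             # initial state a_0 = 0
for x in [1.0, 0.0, -1.0]:       # a short input sequence
    a, y = step(a, x)
    outputs.append(y)
print(outputs)  # y_2 is nonzero even though x_2 = 0: the state remembers
```

The second output is driven entirely by the carried-over state, which is the "memory of all previous states" described above.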
Now the big question is: \textit{how the heck do you update the weights?} Backpropagation to the rescue. Each output can be used to compute one loss value, and these loss values are combined using the cross-entropy loss.
We can then update the weights with Backpropagation Through Time (BPTT), beginning from the last state and moving back to the first one. The important part of this network is not the inputs or outputs but the states, which contain the influence of all previous states.
The architecture presented until now is called \textit{many-to-many}: there are as many outputs as inputs. A more common one is \textit{many-to-one}, which gives one output for multiple inputs (summarizing a text, for example). All other combinations of \textit{many} and \textit{one} can be considered and are useful for other problems.
\subsection{Language Processing}
Now that the problem of sequential data is solved by recurrent NNs, we can analyse text. The problem with text is that computers, contrary to humans, are good with numbers but not so much with words. Each word has to be represented by values that carry its meaning. One interesting aspect of language is that the vocabulary is finite: there are only a few hundred thousand words in the dictionary. We could assign an index to each word and call it a day; each word would then be represented by a vector. The problem with this method is that there is no preservation of similarity in this representation: two very different words can have similar representations, and the other way around too. To find a better encoding, we suppose that every word in the dictionary has the same number of features. These features describe some aspect of the word: for example gender, royalty, age, object, animal, etc. We can then represent each word by some value associated with each feature.
Creating a good encoding begins with the concept of an affinity matrix. This matrix contains the number of times the word of the column comes after the word of the row. Knowing this affinity matrix, we can train an RNN with the columns of the affinity matrix as input and the probability of the next word as output, and the learned weights then represent the features extracted to represent the words.
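Building such an affinity matrix is straightforward; here is a toy sketch (the eight-word corpus is my own example):

```python
# counts[row][col] = number of times the column word follows the row word.
corpus = "the ball was blue the ball came out".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}
counts = [[0] * len(vocab) for _ in vocab]
for prev, nxt in zip(corpus, corpus[1:]):
    counts[index[prev]][index[nxt]] += 1
print(counts[index["the"]][index["ball"]])  # 2: "ball" follows "the" twice
```

Each column of this matrix then serves as the input vector for the word it indexes.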
\end{document}
