\chapter{Scalar, superscalar and VLIW processors}

\section{Branch prediction}

A branch has the following behaviour, where N means not taken, P means taken (\emph{pris}) and P*k denotes a run of k consecutive taken branches:\\
PPNP*kNP*kNP*kNP*kNP*k...

\subsection{Prediction after the initialization phase}

\subsubsection{a)} 

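A 1-bit predictor keeps its prediction as long as it is correct and inverts it after each misprediction; in other words, it always predicts the last outcome of the branch. In the tables, X (in blue) is the undetermined initial prediction, green marks a correct prediction and red a misprediction.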

- $k=1$

\begin{tabular}{l l}
    Branch &: P P N P N P N P N P \\
    Predictor &: \color{blue}{X} \color{green}{P} \color{red}{P N P N P N P N}
\end{tabular}

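In steady state, each period N P*k contains $k+1$ branches and the predictor mispredicts two of them (the N, and the first P after it), so the fraction of correct predictions is $\frac{k-1}{(k-1)+2} = \frac{k-1}{k+1}$. For $k=1$ this gives $0$, i.e.\ 0\%.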

- $k=3$

\begin{tabular}{l l}
    Branch &: P P N P P P N P P P N P P P N P P P\\
    Predictor &: \color{blue}{X} \color{green}{P} \color{red}{P N} \color{green}{P P}  \color{red}{P N} \color{green}{P P}  \color{red}{P N} \color{green}{P P}  \color{red}{P N} \color{green}{P P}
\end{tabular}

Correct predictions: $\frac{k-1}{(k-1)+2} = \frac{2}{4} = \frac{1}{2}$, i.e.\ 50\%.

- $k=5$

\begin{tabular}{l l}
    Branch &: P P N P P P P P N P P P P P N P P P P P N P P P P P\\
    Predictor &: \color{blue}{X} \color{green}{P} \color{red}{P N} \color{green}{P P P P}  \color{red}{P N} \color{green}{P P P P}  \color{red}{P N} \color{green}{P P P P}  \color{red}{P N} \color{green}{P P P P}
\end{tabular}

Correct predictions: $\frac{k-1}{(k-1)+2} = \frac{4}{6} = \frac{2}{3}$, i.e.\ about 66.7\%.

This confirms that the longer the loop, i.e.\ the longer the run of taken branches, the more effective the predictor: the accuracy $\frac{k-1}{k+1}$ tends to 1 as $k$ grows.
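
The steady-state rates above can be checked with a short simulation. The following Python sketch is only an illustration under stated assumptions: P is encoded as \texttt{True} and N as \texttt{False}, the undetermined initial prediction X is arbitrarily set to P, and the accuracy is measured after skipping the first two branches and two full periods (the start-up phase). The helper names \texttt{branch\_pattern} and \texttt{one\_bit\_accuracy} are hypothetical.

\begin{minted}[linenos, breaklines, frame = single]{python}
from fractions import Fraction

def branch_pattern(k, periods):
    """Outcomes P P N P*k N P*k ... (True = taken P, False = not taken N)."""
    outcomes = [True, True]
    for _ in range(periods):
        outcomes.append(False)        # the single not-taken branch N
        outcomes.extend([True] * k)   # followed by k taken branches
    return outcomes

def one_bit_accuracy(k, periods=100, warmup_periods=2):
    """Steady-state fraction of correct predictions of a 1-bit predictor."""
    prediction = True                 # assumed initial value (X in the tables)
    hits = []
    for outcome in branch_pattern(k, periods):
        hits.append(prediction == outcome)
        prediction = outcome          # a 1-bit predictor repeats the last outcome
    steady = hits[2 + warmup_periods * (k + 1):]   # skip the start-up phase
    return Fraction(sum(steady), len(steady))

for k in (1, 3, 5):
    # k, simulated rate, (k-1)/(k+1)
    print(k, one_bit_accuracy(k), Fraction(k - 1, k + 1))
\end{minted}

For every $k \geq 1$ the simulated rate equals $\frac{k-1}{k+1}$; the initial prediction only affects the skipped start-up branches.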

\subsubsection{b)}

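With a 2-bit predictor, each prediction X (P or N) has a strong state FX (uppercase) and a weak state fx (lowercase). A misprediction moves the counter one step towards the opposite prediction: from a strong state it only drops to the corresponding weak state and the prediction is kept, whereas from a weak state it switches to the weak state of the opposite prediction. Conversely, a correct prediction moves a weak state to the corresponding strong state.

FP $\longleftrightarrow$ fp $\longleftrightarrow$ fn $\longleftrightarrow$ FN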

- $k=1$

\begin{tabular}{l l}
    Branch &: P ~~~~~P ~~~~N ~~P ~N ~~P ~N ~~P ~N ~P \\
    Predictor &: \color{blue}{fx} \color{green}{FP|fp} \color{red}{FP} \color{green}{fp} \color{red}{FP} \color{green}{fp} \color{red}{FP} \color{green}{fp} \color{red}{FP} \color{green}{fp}
\end{tabular}

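50\% correct predictions, provided the counter does not start in state FN: from FN, the first two P only bring it to fp, after which it alternates between fn and fp and mispredicts every branch.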

- $k=3$

\begin{tabular}{l l}
    Branch &: P ~~~~~P ~~~~N ~~P ~P ~~P ~~N ~~P ~~P ~~P ~~N ~~P ~~P ~~P ~~N ~~P ~~P ~P\\
    Predictor &: \color{blue}{fx} \color{green}{FP|fp} \color{red}{FP} \color{green}{fp FP FP}  \color{red}{FP} \color{green}{fp FP FP}  \color{red}{FP} \color{green}{fp FP FP}  \color{red}{FP} \color{green}{fp FP FP}
\end{tabular}

75\% correct predictions. If the counter starts in state FN, only the first 4 branches are mispredicted before this steady state is reached.

- $k=5$

(In the table below, FX is abbreviated to X and fx to x.)

\begin{tabular}{l l}
    Branch &: P ~~~~~P ~~~N P P P P P N P P P P P N P P P P P N P P P P P\\
    Predictor &: \color{blue}{fx} \color{green}{FP|fp} \color{red}{P} \color{green}{p P P P P}  \color{red}{~P} \color{green}{p P P P P}  \color{red}{P} \color{green}{p P P P P}  \color{red}{~P} \color{green}{p P P P P}
\end{tabular}

$\frac{5}{6}$, i.e.\ about 83.3\% correct predictions, with the same start-up issue as for $k=3$ when the counter starts in state FN.

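This time, the fraction of correct predictions is $\frac{k}{(k-1)+2} = \frac{k}{k+1}$: in each period only the N is mispredicted, since it merely moves the counter from FP to fp without changing the prediction.
The predictor is therefore more robust for nested loops, where the branch of the inner loop is not taken only once, at each exit of the inner loop.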
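
A similar Python sketch models the 2-bit predictor as a saturating counter whose four states are numbered FN = 0, fn = 1, fp = 2, FP = 3, with fp and FP predicting P. This encoding, the default weak starting state fp (standing for fx), the measurement window (first two branches and two periods skipped) and the helper names are assumptions made for the illustration.

\begin{minted}[linenos, breaklines, frame = single]{python}
from fractions import Fraction

STATES = ["FN", "fn", "fp", "FP"]    # counter values 0..3

def branch_pattern(k, periods):
    """Outcomes P P N P*k N P*k ... (True = taken P, False = not taken N)."""
    outcomes = [True, True]
    for _ in range(periods):
        outcomes.append(False)
        outcomes.extend([True] * k)
    return outcomes

def two_bit_accuracy(k, start="fp", periods=100, warmup_periods=2):
    """Steady-state fraction of correct predictions of a 2-bit saturating counter."""
    state = STATES.index(start)
    hits = []
    for outcome in branch_pattern(k, periods):
        hits.append((state >= 2) == outcome)   # fp and FP predict taken
        # move one step towards the actual outcome, saturating at FN and FP
        state = min(state + 1, 3) if outcome else max(state - 1, 0)
    steady = hits[2 + warmup_periods * (k + 1):]   # skip the start-up phase
    return Fraction(sum(steady), len(steady))

for k in (1, 3, 5):
    print(k, two_bit_accuracy(k), two_bit_accuracy(k, start="FN"))
\end{minted}

Starting from FN with $k=1$ gives 0, which is the fn/fp oscillation described above; for $k=3$ and $k=5$, starting from FN only delays the steady state $\frac{k}{k+1}$.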

\subsection{1-bit predictor with one bit of history initialized to N}
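Two 1-bit predictors are used, and the history bit (the outcome of the previous branch) selects which one gives the prediction: predictor P after a taken branch, predictor N after a not-taken branch. Only the selected predictor is updated; like the 1-bit predictor above, it keeps its value while it is correct and flips after a misprediction.
In the first two tables the prediction actually used is boxed (green if correct, red if wrong), and gray entries belong to the predictor that is not selected.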

- $k=1$

\begin{tabular}{llllllllllll}
    Branch &: &P &P &N &P &N &P &N &P &N &P \\
    History &: &\color{blue}X &\color{blue}P &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}N \\
    Predictor P &: &\color{blue}X &\color{green}\boxed{P} &\color{red}\boxed{P} &\color{red}N &\color{green}\boxed{N} &\color{red}N &\color{green}\boxed{N} &\color{red}N &\color{green}\boxed{N} &\color{red}N \\
    Predictor N &: &\color{blue}X &\color{red}N &\color{green}N &\color{red}\boxed{N} &\color{red}P &\color{green}\boxed{P} &\color{red}P &\color{green}\boxed{P} &\color{red}P &\color{green}\boxed{P} 
\end{tabular}

100\% correct predictions in steady state: the pattern has period 2 (N P), so the previous outcome alone determines the next one, and a single history bit ($2^1$ possible histories) is enough to capture it.

- $k=3$

\begin{tabular}{llllllllllllllllllll}
    Branch &: &P &P &N &P &P &P &N &P &P &P &N &P &P &P &N &P &P &P\\
    History &: &\color{blue}X &\color{blue}P &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}P &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}P &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}P &\color{blue}P &\color{blue}N &\color{blue}P &\color{blue}P \\
    Predictor P &: &\color{blue}{X} &\color{green}\boxed{P} &\color{red}\boxed{P} &\color{gray}N &\color{red}\boxed{N} &\color{green}\boxed{P} &\color{red}{\boxed{P}}  &\color{gray}N &\color{red}\boxed{N} &\color{green}\boxed{P} &\color{red}{\boxed{P}}  &\color{gray}N &\color{red}\boxed{N} &\color{green}\boxed{P} &\color{red}{\boxed{P}}  &\color{gray}N &\color{red}\boxed{N} &\color{green}\boxed{P}\\
    Predictor N &: &\color{blue}{X} &\color{gray}{N} &\color{gray}{N} &\color{red}\boxed{N} &\color{gray}P &\color{gray}P  &\color{gray}{P} &\color{green}\boxed{P} &\color{gray}P &\color{gray}P  &\color{gray}{P} &\color{green}\boxed{P} &\color{gray}P &\color{gray}P  &\color{gray}{P} &\color{green}\boxed{P} &\color{gray}P &\color{gray}P
\end{tabular}

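50\% correct predictions, the same as the plain 1-bit predictor: predictor N always correctly predicts the P that follows the N, but a P can be followed either by a P or by the N, so predictor P mispredicts twice per period (the N and the second P).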

- $k=5$

\begin{tabular}{l l}
    Branch &: P P N P P P P P N P P P P P N P P P P P N P P P P P\\
    History &: \color{blue}X P P N P P P P P N P P P P P N P P P P P N P P P P \\
    Predictor P &: \color{blue}{X} \color{green}{P} \color{red}{P} \color{gray}{N} \color{red}{N} \color{green}{P P P} \color{red}{P} \color{gray}{N} \color{red}{N} \color{green}{P P P} \color{red}{P} \color{gray}{N} \color{red}{N} \color{green}{ P P P} \color{red}{P} \color{gray}{N} \color{red}{N} \color{green}{P P P}\\
    Predictor N &: \color{blue}{X} \color{gray}{N N} \color{red}{N} \color{gray}{ P P P P P} \color{green}{P} \color{gray}{P P P P P} \color{green}{P} \color{gray}{P P P P P} \color{green}{P}  \color{gray}{P P P P}
\end{tabular}

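Correct predictions: $\frac{2}{3}$, i.e.\ about 66.7\%, again the same as the plain 1-bit predictor: in this pattern the history bit only helps when the period is 2 ($k=1$).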
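
The correlating scheme can be simulated in Python by keeping the two 1-bit predictors in a dictionary indexed by the history bit, with P encoded as \texttt{True} and N as \texttt{False}. This is a sketch under assumptions: the initial values of the two predictors (P for predictor P, N for predictor N) and the helper names are hypothetical, and accuracy is measured after skipping the first two branches and two periods, so the initial values only affect the discarded start-up branches.

\begin{minted}[linenos, breaklines, frame = single]{python}
from fractions import Fraction

def branch_pattern(k, periods):
    """Outcomes P P N P*k N P*k ... (True = taken P, False = not taken N)."""
    outcomes = [True, True]
    for _ in range(periods):
        outcomes.append(False)
        outcomes.extend([True] * k)
    return outcomes

def correlating_accuracy(k, periods=100, warmup_periods=2):
    """Two 1-bit predictors selected by one bit of history."""
    history = False                          # history bit initialized to N
    predictor = {True: True, False: False}   # assumed: predictor P -> P, predictor N -> N
    hits = []
    for outcome in branch_pattern(k, periods):
        hits.append(predictor[history] == outcome)  # predictor selected by the history
        predictor[history] = outcome                # update only that predictor
        history = outcome                           # remember the latest outcome
    steady = hits[2 + warmup_periods * (k + 1):]    # skip the start-up phase
    return Fraction(sum(steady), len(steady))

for k in (1, 3, 5):
    print(k, correlating_accuracy(k))
\end{minted}

For $k \geq 2$ the result is $\frac{k-1}{k+1}$, exactly as for the plain 1-bit predictor, while $k=1$ gives 1.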

\section{Scalar and superscalar processors: loop execution}

\subsection{Scalar processor with perfect caches (no memory wait cycles)}

Loop body in assembly; it computes $y_i \leftarrow a\,x_i + y_i$, with $a$ held in register f0:

\begin{minted}[linenos, breaklines, frame = single]{asm}
boucle:
lf f1, (r1)     //f1 ← x[i]
lf f2, (r2)     //f2 ← y[i]
fmul f1, f1, f0 //f1 ← x[i]*a
NOP*3
fadd f2, f2, f1 //f2 ← y[i]+x[i]*a
NOP*3
sf f2, (r2)     //y[i] ← f2
addi r1, r1, 4  //x[i++]
addi r2, r2, 4  //y[i++]
addi r3, r3, -1 //i--
bne r3, boucle  //1000 iterations
\end{minted}

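The counter i (register r3) starts at 1000 and is decremented down to 0 (\texttt{while (i > 0)}), while the pointers r1 and r2 advance by 4 bytes, i.e.\ one float, per iteration.
Since one instruction is issued per cycle, the 9 useful instructions and the 6 NOPs take 15 cycles per iteration.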

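Part of the latency can be hidden by reordering: the three \texttt{addi} are moved into the slots that follow the \texttt{fmul}, replacing three NOPs. Since r2 has already been incremented when the result is stored, the store uses the offset \texttt{-4(r2)}.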

\begin{minted}[linenos, breaklines, frame = single]{asm}
boucle:
lf f1, (r1)     //f1 ← x[i]
lf f2, (r2)     //f2 ← y[i]
fmul f1, f1, f0 //f1 ← x[i]*a
addi r1, r1, 4  //x[i++]
addi r2, r2, 4  //y[i++]
addi r3, r3, -1 //i--
fadd f2, f2, f1 //f2 ← y[i]+x[i]*a
NOP*3
sf f2, -4(r2)   //y[i-1] ← f2
bne r3, boucle  //1000 iterations
\end{minted}
This saves 3 cycles per iteration, giving 12 cycles per iteration.
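
The effect of moving the \texttt{addi} before the store can be checked with a small functional emulation in Python. It is a sketch under assumptions not stated in the text: memory is a dictionary indexed by byte address, \texttt{x} starts at address 0 and \texttt{y} right after it, each float takes 4 bytes, and the hypothetical parameter \texttt{unroll} processes that many elements per iteration (with \texttt{unroll=4}, the stores that follow the pointer increment use the offsets \texttt{-16}, \texttt{-12}, \texttt{-8} and \texttt{-4}).

\begin{minted}[linenos, breaklines, frame = single]{python}
def run_loop(x, y, a, unroll=1):
    """Emulate the loop on a byte-addressed memory of 4-byte floats.

    Hypothetical layout: x starts at address 0 and y right after it.
    Returns the final content of y and the number of loop iterations."""
    n = len(x)
    if n == 0 or n != len(y) or n % unroll:
        raise ValueError("x and y must have the same non-zero length, a multiple of unroll")
    mem = {4 * i: v for i, v in enumerate(list(x) + list(y))}
    r1, r2, r3 = 0, 4 * n, n          # pointer to x, pointer to y, remaining elements
    iterations = 0
    while True:                                            # boucle:
        xs = [mem[r1 + 4 * j] for j in range(unroll)]      # lf from 0(r1), 4(r1), ...
        ys = [mem[r2 + 4 * j] for j in range(unroll)]      # lf from 0(r2), 4(r2), ...
        results = [yj + xj * a for xj, yj in zip(xs, ys)]  # fmul, then fadd
        r1 += 4 * unroll                                   # addi r1, r1, 4*unroll
        r2 += 4 * unroll                                   # addi r2, r2, 4*unroll
        r3 -= unroll                                       # addi r3, r3, -unroll
        for j, value in enumerate(results):
            # the store follows the increment of r2, hence the negative offset
            mem[r2 + 4 * (j - unroll)] = value             # sf at (4*j - 4*unroll)(r2)
        iterations += 1
        if r3 == 0:                                        # bne r3, boucle
            break
    return [mem[4 * n + 4 * i] for i in range(n)], iterations

x = [float(i) for i in range(1, 9)]
y = [10.0 * i for i in range(1, 9)]
print(run_loop(x, y, 2.0))            # one element per iteration
print(run_loop(x, y, 2.0, unroll=4))  # four elements per iteration
\end{minted}

Both calls leave $y_i + a\,x_i$ in \texttt{y}; with 1000 elements and \texttt{unroll=4}, the loop body runs 250 times.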

\subsection{Unrolled version}
\begin{minted}[linenos, breaklines, frame = single]{asm}
boucle:
lf f1, (r1)     //f1 ← x[i]
lf f3, 4(r1)    //f3 ← x[i+1]
lf f5, 8(r1)    //f5 ← x[i+2]
lf f7, 12(r1)   //f7 ← x[i+3]
lf f2, (r2)     //f2 ← y[i]
lf f4, 4(r2)    //f4 ← y[i+1]
lf f6, 8(r2)    //f6 ← y[i+2]
lf f8, 12(r2)   //f8 ← y[i+3]
fmul f1, f1, f0 //f1 ← x[i]*a
fmul f3, f3, f0 //f3 ← x[i+1]*a
fmul f5, f5, f0 //f5 ← x[i+2]*a
fmul f7, f7, f0 //f7 ← x[i+3]*a
fadd f2, f2, f1 //f2 ← y[i]+x[i]*a
fadd f4, f4, f3 //f4 ← y[i+1]+x[i+1]*a
fadd f6, f6, f5 //f6 ← y[i+2]+x[i+2]*a
fadd f8, f8, f7 //f8 ← y[i+3]+x[i+3]*a
addi r1, r1, 16 //x[i+=4]
addi r2, r2, 16 //y[i+=4]
addi r3, r3, -4 //i-=4
sf f2, -16(r2)  //y[i] ← f2
sf f4, -12(r2)  //y[i+1] ← f4
sf f6, -8(r2)   //y[i+2] ← f6
sf f8, -4(r2)   //y[i+3] ← f8
bne r3, boucle  //250 iterations (1000 elements)
\end{minted}

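No NOP is needed any more: the loads, multiplications and additions of the four unrolled iterations are independent and fill each other's latencies. The 24 instructions therefore take 24 cycles for 4 iterations, i.e.\ 6 cycles per iteration.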
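
The cycle counts of the three scalar versions can be reproduced with a small in-order timing model in Python. The latencies are assumptions inferred from the listings (a loaded value can be used 2 cycles after the load, a \texttt{fmul} or \texttt{fadd} result 4 cycles later, an \texttt{addi} result in the next cycle); the model issues one instruction per cycle in program order, stalls an instruction until its operands are ready (which plays the role of the NOPs) and ignores any branch penalty. Each instruction is written as a hypothetical (opcode, destination, sources) tuple.

\begin{minted}[linenos, breaklines, frame = single]{python}
# Assumed latencies (cycles between issuing an instruction and issuing a
# consumer of its result), inferred from the NOP counts; not given in the text.
LATENCY = {"lf": 2, "fmul": 4, "fadd": 4, "addi": 1}

def issue_cycles(body):
    """Issue cycle of the last instruction of one iteration.

    In-order scalar pipeline issuing at most one instruction per cycle; an
    instruction waits until its source registers are ready (like the NOPs)."""
    ready = {}        # register -> first cycle at which its new value can be read
    cycle = 0
    for op, dest, srcs in body:
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        if dest is not None:
            ready[dest] = cycle + LATENCY[op]
    return cycle

ORIGINAL = [("lf", "f1", ["r1"]), ("lf", "f2", ["r2"]), ("fmul", "f1", ["f1", "f0"]),
            ("fadd", "f2", ["f2", "f1"]), ("sf", None, ["f2", "r2"]),
            ("addi", "r1", ["r1"]), ("addi", "r2", ["r2"]), ("addi", "r3", ["r3"]),
            ("bne", None, ["r3"])]

REORDERED = [("lf", "f1", ["r1"]), ("lf", "f2", ["r2"]), ("fmul", "f1", ["f1", "f0"]),
             ("addi", "r1", ["r1"]), ("addi", "r2", ["r2"]), ("addi", "r3", ["r3"]),
             ("fadd", "f2", ["f2", "f1"]), ("sf", None, ["f2", "r2"]),
             ("bne", None, ["r3"])]

X_REGS, Y_REGS = ["f1", "f3", "f5", "f7"], ["f2", "f4", "f6", "f8"]
UNROLLED = ([("lf", f, ["r1"]) for f in X_REGS]
            + [("lf", g, ["r2"]) for g in Y_REGS]
            + [("fmul", f, [f, "f0"]) for f in X_REGS]
            + [("fadd", g, [g, f]) for f, g in zip(X_REGS, Y_REGS)]
            + [("addi", r, [r]) for r in ("r1", "r2", "r3")]
            + [("sf", None, [g, "r2"]) for g in Y_REGS]
            + [("bne", None, ["r3"])])

for name, body in (("original", ORIGINAL), ("reordered", REORDERED), ("unrolled", UNROLLED)):
    print(name, issue_cycles(body))
\end{minted}

With these assumptions the model gives 15, 12 and 24 cycles, and the unrolled body never stalls.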

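\subsection{Superscalar version}

In the tables below, each row is a cycle; E0 and E1 are the two units that execute loads, stores, integer instructions and branches, FM is the floating-point multiplier and FA the floating-point adder.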

\begin{tabular}{r|c|c|c|c}
    &E0 & E1 & FM & FA \\
    \hline
    1&lf f1,(r1) &  &  &\\
    2&lf f2,(r2) & & & \\
    3&addi r1,r1,4& addi r2,r2,4&fmul f1,f1,f0 & \\
    4&addi r3,r3,-1&&&\\
    5&&&&\\
    6&&&&\\
    7&& & & fadd f2,f2,f1\\
    8&& & & \\
    9&&&&\\
    10&&&&\\
    11& sf f2,-4(r2)&bne r3, boucle & & 
\end{tabular}

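This version saves only one cycle (11 cycles per iteration): the dependent chain \texttt{lf}, \texttt{fmul}, \texttt{fadd}, \texttt{sf} issues in cycles 1, 3, 7 and 11, and these latencies cannot be hidden within a single iteration.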

\subsection{Unrolled superscalar version}

\begin{tabular}{r|c|c|c|c}
    &E0 & E1 & FM & FA \\
    \hline
    1&lf f1,(r1) & lf f3,4(r1)  &  &\\
    2& lf f5,8(r1)& lf f7,12(r1) & &\\
    3& lf f2,(r2)&  lf f4,4(r2)&  fmul f1,f1,f0&\\
    4& lf f6,8(r2)&  lf f8,12(r2)&  fmul f3,f3,f0&\\
    5& addi r1,r1,16& addi r2,r2,16& fmul f5,f5,f0 & \\
    6& addi r3,r3,-4& & fmul f7,f7,f0& \\
    7&&&& fadd f2,f2,f1\\
    8&&&& fadd f4,f4,f3\\
    9&&&& fadd f6,f6,f5\\
    10&& & & fadd f8,f8,f7\\
    11&sf f2,-16(r2) & & & \\
    12&sf f4,-12(r2)&&&\\
    13&sf f6,-8(r2)&&&\\
    14&sf f8,-4(r2) &bne r3, boucle & & 
\end{tabular}

This time, 4 iterations take 14 cycles, i.e.\ 3.5 cycles per iteration.
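
The two superscalar schedules can be checked with a Python sketch that replays them in program order. It uses latencies inferred from the tables (2 cycles for \texttt{lf}, 4 for \texttt{fmul} and \texttt{fadd}, 1 for \texttt{addi}), assumes that E0 and E1 accept loads, stores, integer instructions and branches while FM and FA accept only \texttt{fmul} and \texttt{fadd}, and checks that no unit is used twice in a cycle, that no operand is read before it is ready and that no register is overwritten before an earlier instruction has read it. The tuple encoding (cycle, unit, opcode, destination, sources) and the function name are hypothetical.

\begin{minted}[linenos, breaklines, frame = single]{python}
# Assumed latencies and unit classes, inferred from the tables (not given in the text).
LATENCY = {"lf": 2, "fmul": 4, "fadd": 4, "addi": 1}
UNIT_PREFIX = {"lf": "E", "sf": "E", "addi": "E", "bne": "E", "fmul": "FM", "fadd": "FA"}

def check_schedule(schedule):
    """Check a schedule given in program order as (cycle, unit, op, dest, srcs).

    Raises ValueError on a wrong unit, a unit used twice in one cycle, an operand
    read before it is ready (RAW) or a register overwritten before an earlier
    instruction has read it (WAR). Returns the number of cycles used."""
    ready, last_read, busy = {}, {}, set()
    for cycle, unit, op, dest, srcs in schedule:
        if not unit.startswith(UNIT_PREFIX[op]):
            raise ValueError(f"{op} cannot execute on unit {unit}")
        if (cycle, unit) in busy:
            raise ValueError(f"unit {unit} is used twice in cycle {cycle}")
        busy.add((cycle, unit))
        for r in srcs:
            if ready.get(r, 0) > cycle:                 # RAW hazard
                raise ValueError(f"{op} reads {r} in cycle {cycle}, ready in cycle {ready[r]}")
            last_read[r] = max(last_read.get(r, 0), cycle)
        if dest is not None:
            if last_read.get(dest, 0) > cycle:          # WAR hazard
                raise ValueError(f"{op} overwrites {dest} before an earlier read")
            ready[dest] = cycle + LATENCY[op]
    return max(entry[0] for entry in schedule)

SCALAR_SCHEDULE = [
    (1, "E0", "lf", "f1", ["r1"]),
    (2, "E0", "lf", "f2", ["r2"]),
    (3, "FM", "fmul", "f1", ["f1", "f0"]),
    (3, "E0", "addi", "r1", ["r1"]),
    (3, "E1", "addi", "r2", ["r2"]),
    (4, "E0", "addi", "r3", ["r3"]),
    (7, "FA", "fadd", "f2", ["f2", "f1"]),
    (11, "E0", "sf", None, ["f2", "r2"]),
    (11, "E1", "bne", None, ["r3"]),
]

X_REGS, Y_REGS = ["f1", "f3", "f5", "f7"], ["f2", "f4", "f6", "f8"]
E_UNITS = ("E0", "E1")
UNROLLED_SCHEDULE = (
    [(1 + i // 2, E_UNITS[i % 2], "lf", f, ["r1"]) for i, f in enumerate(X_REGS)]
    + [(3 + i // 2, E_UNITS[i % 2], "lf", g, ["r2"]) for i, g in enumerate(Y_REGS)]
    + [(3 + i, "FM", "fmul", f, [f, "f0"]) for i, f in enumerate(X_REGS)]
    + [(7 + i, "FA", "fadd", g, [g, f]) for i, (f, g) in enumerate(zip(X_REGS, Y_REGS))]
    + [(5, "E0", "addi", "r1", ["r1"]), (5, "E1", "addi", "r2", ["r2"]),
       (6, "E0", "addi", "r3", ["r3"])]
    + [(11 + i, "E0", "sf", None, [g, "r2"]) for i, g in enumerate(Y_REGS)]
    + [(14, "E1", "bne", None, ["r3"])]
)

print(check_schedule(SCALAR_SCHEDULE))          # cycles per iteration
print(check_schedule(UNROLLED_SCHEDULE) / 4)    # cycles per iteration, unrolled by 4
\end{minted}

Under these assumptions both tables pass the checks and the sketch prints 11 and 3.5 cycles per iteration; moving, for instance, the first \texttt{fadd} one cycle earlier makes \texttt{check\_schedule} raise a \texttt{ValueError}.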