First two algorithms as a Jupyter Notebook
This commit is contained in:
parent
0dbe5cc391
commit
a207c71687
1 changed files with 153 additions and 0 deletions
153
exercices.ipynb
Normal file
@ -0,0 +1,153 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0f3d617c",
   "metadata": {},
   "source": [
    "# Word Embedding Lab\n",
    "\n",
    "## Bag of Words\n",
    "A bag of words (sometimes abbreviated *BoW*) is a representation of a set of words as a vector in which word order is ignored.\n",
    "\n",
    "### Term Frequency\n",
    "The idea behind Term Frequency is simply to count the number of occurrences (the frequency) of each word in the corpus.\n",
    "\n",
    "Let $V$ be the vocabulary of a corpus $C$ containing $D$ documents.\n",
    "Let $w$ be a word in a document $d \\in C$.\n",
    "\n",
    "Then $TF(C)$ is a matrix of size $|V| \\times D$ such that\n",
    "\n",
    "$$ TF(C)_{ij} = \\frac{\\text{number of occurrences of word $i$ in document $j$}}{|V|} $$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "a1445527",
   "metadata": {},
   "outputs": [],
   "source": [
    "document_1 = \"le chat mange la souris\"\n",
    "document_2 = \"le chien regarde le canard\"\n",
    "document_3 = \"le canard regarde le chat\"\n",
    "\n",
    "corpus = (document_1, document_2, document_3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "6c989264",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'le': 5, 'chat': 2, 'mange': 1, 'la': 1, 'souris': 1, 'chien': 1, 'regarde': 2, 'canard': 2}\n"
     ]
    }
   ],
   "source": [
    "# build the vocabulary (words in order of first appearance)\n",
    "vocabulary = []\n",
    "for d in corpus:\n",
    "    for w in d.split(\" \"):\n",
    "        if w not in vocabulary:\n",
    "            vocabulary.append(w)\n",
    "\n",
    "# compute a simple histogram over the whole corpus\n",
    "\n",
    "# initialize the dictionary\n",
    "freq = dict()\n",
    "for v in vocabulary:\n",
    "    freq[v] = 0\n",
    "\n",
    "# count the occurrences\n",
    "for d in corpus:\n",
    "    for w in d.split(\" \"):\n",
    "        freq[w] += 1\n",
    "\n",
    "print(freq)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "34fa1346",
   "metadata": {},
   "source": [
    "Problems with this approach:\n",
    "* the counts ignore document boundaries (frequent words such as \"le\" rise to the top even though they carry little semantic information)\n",
    "* letter case (uppercase / lowercase) is not taken into account\n",
    "* it is simplistic"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5fc408eb",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[1. 1. 1. 1. 1. 0. 0. 0.]\n",
      " [2. 0. 0. 0. 0. 1. 1. 1.]\n",
      " [2. 1. 0. 0. 0. 0. 1. 1.]]\n"
     ]
    }
   ],
   "source": [
    "# compute one histogram per document\n",
    "import numpy as np\n",
    "\n",
    "V = len(vocabulary)\n",
    "D = len(corpus)\n",
    "\n",
    "# despite its name, tf_idf holds raw term counts (rows: documents, columns: words)\n",
    "tf_idf = np.zeros([D, V])\n",
    "\n",
    "for i, d in enumerate(corpus):\n",
    "    for w in d.split(\" \"):\n",
    "        j = vocabulary.index(w)\n",
    "        tf_idf[i, j] += 1\n",
    "\n",
    "print(tf_idf)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "771b997f",
   "metadata": {},
   "source": [
    "Problems with this approach:\n",
    "* it only solves the first problem mentioned above\n",
    "* it should be implemented as a sparse matrix, because in practice it will contain mostly zeros when the vocabulary is large\n",
    "\n",
    "Fortunately, existing implementations such as the one in `scikit-learn` address these technical issues."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
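
The two technical issues listed in the notebook (letter case and dense storage) could be sketched with the standard library alone, roughly as follows. This is only an illustration under assumptions: `term_counts`, `sparse_tf` and `dense_tf` are hypothetical names that do not appear in the commit, and the `Counter`-based storage merely stands in for a real sparse matrix.

```python
from collections import Counter

corpus = (
    "le chat mange la souris",
    "le chien regarde le canard",
    "le canard regarde le chat",
)


def term_counts(document):
    """Return a sparse word -> count mapping for one document (hypothetical helper)."""
    # lowercase first so that "Le" and "le" are counted as the same word
    return Counter(document.lower().split())


# one Counter per document: only the words that actually occur are stored
sparse_tf = [term_counts(d) for d in corpus]

# vocabulary in order of first appearance, as in the notebook
vocabulary = list(dict.fromkeys(w for d in corpus for w in d.lower().split()))

# dense rows can still be rebuilt on demand; a Counter returns 0 for absent words
dense_tf = [[counts[w] for w in vocabulary] for counts in sparse_tf]
print(dense_tf)
```

Each `Counter` stores only the words present in its document, so memory grows with document length rather than with vocabulary size. A library implementation would likely be preferable for a real corpus.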