updated notebook

kermorvant 2023-01-04 10:06:35 +01:00
parent 8a7572760d
commit 87dd736d83
4 changed files with 1275 additions and 0 deletions


@@ -0,0 +1,521 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text classification on LeMonde2003 Dataset\n",
"\n",
"In this notebook, we \n",
"apply classification algorithms to newspaper articles published in 2003 in *Le Monde*. \n",
"\n",
"The data are here : https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
"\n",
"Download it into the data directory : \n",
"\n",
"```\n",
"wget https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
"```\n",
"\n",
"These articles concern different subjects but we will consider only articles related to the following subjects : entreprises (ENT), international (INT), arts (ART), société (SOC), France (FRA), sports (SPO), livres (LIV), télévision (TEL) and the font page articles (UNE).\n",
"\n",
"\n",
"> * Load the CSV file `data/LeMonde2003_9classes.csv.gz` containing the articles using pandas [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). How many articles do you have ? \n",
"> * Plot the frequency histogram of the categories using seaborn [countplot](https://seaborn.pydata.org/tutorial/categorical.html) : `sns.countplot(data=df,y='category')`\n",
"> * Display the text of some of the article with the corresponding class using pandas [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)\n",
"> * Using the [WordCloud library](https://amueller.github.io/word_cloud/index.html), display a word cloud for the most frequent classes. You can remove the stop words using the `stopwords` option, using the list of stop words in French in `data/stop_word_fr.txt`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# load dataframe from CSV file\n",
"# YOUR CODE HERE\n"
]
},
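{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch (not part of the original exercise): load the articles with pandas and count them. The column names `text` and `category` are taken from their use later in this notebook.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pandas reads gzip-compressed CSV files directly\n",
"df = pd.read_csv('data/LeMonde2003_9classes.csv.gz')\n",
"print(len(df), 'articles')\n",
"df.head()"
]
},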
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"# Plot the statistics of category\n",
"# YOUR CODE HERE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print examples of the articles\n",
"pd.set_option('display.max_colwidth', None)\n",
"# YOUR CODE HERE\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from wordcloud import WordCloud\n",
"# Display one wordcloud for each of the most frequent classes\n",
"\n",
"from wordcloud import WordCloud\n",
"STOPWORDS = [x.strip() for x in open('data/stop_word_fr.txt').readlines()]\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# plot a word cloud for each category\n",
"for cat in ['ENT', 'INT', 'ART', 'SOC', 'FRA']:\n",
" # YOUR CODE HERE"
]
},
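{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch: one word cloud per frequent class, reusing `df` and `STOPWORDS` from the cells above; `max_words` and the figure size are arbitrary choices.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: concatenate the articles of each class and draw a word cloud\n",
"for cat in ['ENT', 'INT', 'ART', 'SOC', 'FRA']:\n",
"    cat_text = ' '.join(df[df['category'] == cat]['text'].dropna())\n",
"    wc = WordCloud(stopwords=STOPWORDS, background_color='white', max_words=100)\n",
"    plt.figure(figsize=(8, 4))\n",
"    plt.imshow(wc.generate(cat_text), interpolation='bilinear')\n",
"    plt.axis('off')\n",
"    plt.title(cat)\n",
"    plt.show()"
]
},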
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bag-of-word representation\n",
"\n",
"In order to apply machine learning algorithms to text, documents must be transformed into vectors. The most simple and standard way to transform a document into a vector is the *bag-of-word* encoding.\n",
"\n",
"The idea is very simple : \n",
"\n",
"1. define the set of all the possible words that can appear in a document; denote its size by `max_features`.\n",
"2. for each document, encode it with a vector of size `max_features`, with the value of the ith component of the vector equal to the number of time the ith word appears in the document.\n",
"\n",
"See [the wikipedia article on Bag-of-word](https://en.wikipedia.org/wiki/Bag-of-words_model) for an example.\n",
"\n",
"Scikit-learn proposes different methods to encode text into vectors : [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).\n",
"\n",
"The encoder must first be trained on the train set and applied to the different sets, for example with the 200 words : \n",
"\n",
"\tfrom sklearn.feature_extraction.text import CountVectorizer\n",
"\tvectorizer = CountVectorizer(max_features=200)\n",
" vectorizer.fit(X_train)\n",
" X_train_counts = vectorizer.transform(X_train)\n",
" X_test_counts = vectorizer.transform(X_test)\n",
" \n",
"**Question**:\n",
"\n",
"> * Split the dataset LeMonde2003 into train set (80%), dev set (10%) and test set (10%) using scikit-learn [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n",
"> * For each set, transform the text of the articles into vectors using the `CountVectorizer`, considering the 1000 most frequent words. \n",
"> * Train a naive bayes classifier on the data. \n",
"> * Evaluate the classification accuracy on the train, dev and test sets using the [score](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.score) method. \n",
"\n",
"> ***Important*** : the test set must not be used during the training phase, and learning the vector representation of the words is part of the training. The dev set should be an evaluation of the test set.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"# Split the dataset, create X (features) and y (target), print the size\n",
"# YOUR CODE HERE\n"
]
},
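{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch of the 80% / 10% / 10% split, assuming `df` from above; the dev share is 1/9 of the remaining 90%, i.e. 10% of the whole dataset. The variable names are suggestions reused in the sketches below.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"X = df['text']\n",
"y = df['category']\n",
"# hold out 10% for test, then carve a dev set out of the remaining 90%\n",
"X_train_dev, X_test, y_train_dev, y_test = train_test_split(X, y, test_size=0.10, random_state=42)\n",
"X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, test_size=1/9, random_state=42)\n",
"print(len(X_train), len(X_dev), len(X_test))"
]
},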
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"# Create document vectors\n",
"# YOUR CODE HERE\n",
"# create the vectorizer object\n",
"\n",
"# fit on train data\n",
"\n",
"# apply it on train and dev data\n"
]
},
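{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch following the snippet above, with the 1000 most frequent words and the train/dev/test splits defined earlier.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"vectorizer = CountVectorizer(max_features=1000)\n",
"vectorizer.fit(X_train)\n",
"X_train_counts = vectorizer.transform(X_train)\n",
"X_dev_counts = vectorizer.transform(X_dev)\n",
"X_test_counts = vectorizer.transform(X_test)\n",
"print(X_train_counts.shape, X_dev_counts.shape, X_test_counts.shape)"
]
},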
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"# train a Naive Bayes classifier\n",
"# YOUR CODE HERE\n",
"# create the MultinomialNB\n",
"\n",
"# Train \n",
"\n",
"# Evaluate \n"
]
},
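{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch, assuming the count matrices and labels from the previous sketches.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"clf = MultinomialNB()\n",
"clf.fit(X_train_counts, y_train)\n",
"print('train accuracy:', clf.score(X_train_counts, y_train))\n",
"print('dev accuracy  :', clf.score(X_dev_counts, y_dev))\n",
"print('test accuracy :', clf.score(X_test_counts, y_test))"
]
},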
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF representation\n",
"\n",
"The `CountVectorizer` encodes the text using the raw frequencies of the words. However, words that are very frequent and appear in all the documents will have a strong weight whereas they are not discriminative. The *Term-Frequency Inverse-Document-Frequency* weighting scheme take into accound the number of documents in which a given word occurs. A word that appear in many document will have less weight. See [the wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for more details.\n",
"\n",
"With scikit-learn, the `TfidfTransformer` is applied after the `CountVectorizer` :\n",
"\n",
"\tfrom sklearn.feature_extraction.text import TfidfTransformer\n",
"\ttf_transformer = TfidfTransformer().fit(X_train_counts)\n",
" \tX_train_tf = tf_transformer.transform(X_train_counts)\n",
"\tX_test_tf = tf_transformer.transform(X_test_counts)\n",
"\t\n",
"**Question**:\n",
"\n",
"> * Use the TF-IDF representation to train a Multinomial Naive Bayes classifier. Report your best test error rate and the error rates for all the configurations tested."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"# YOUR CODE HERE\n"
]
},
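{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch: apply the TF-IDF transform to the count matrices above and retrain a Multinomial Naive Bayes classifier.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tf_transformer = TfidfTransformer().fit(X_train_counts)\n",
"X_train_tf = tf_transformer.transform(X_train_counts)\n",
"X_dev_tf = tf_transformer.transform(X_dev_counts)\n",
"X_test_tf = tf_transformer.transform(X_test_counts)\n",
"\n",
"clf_tf = MultinomialNB()\n",
"clf_tf.fit(X_train_tf, y_train)\n",
"print('train accuracy:', clf_tf.score(X_train_tf, y_train))\n",
"print('dev accuracy  :', clf_tf.score(X_dev_tf, y_dev))\n",
"print('test accuracy :', clf_tf.score(X_test_tf, y_test))"
]
},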
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Error analysis\n",
"\n",
"The classification error rate give an evaluation of the performance for all the classes. But since the classes are not equally distributed, they may not be equally well modelized. In order to get a better idea of the performance of the classifier, detailed metrics must be used : \n",
"\n",
"* [metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) provides a detailed analysis per class : the precision (amongst all the example classified as class X, how many are really from the classX) and the recall (amongst all the example that are from the class X, how many are classified as class X) and the F-Score which is as a weighted harmonic mean of the precision and recall.\n",
"* [metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) which give the confusions between the classes. It can be displayed in color with [plot_confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html#sklearn.metrics.plot_confusion_matrix).\n",
"\n",
"**Question**:\n",
"\n",
"> * Report the `classification_report` for your classifier. Which classes have the best scores ? Why ?\n",
"> * Report the `confusion_matrix` for your classifier. Which classes are the most confused ? Why ?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, ConfusionMatrixDisplay\n",
"\n",
"# YOUR CODE HERE\n",
"\n"
]
},
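{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch on the dev set, assuming the TF-IDF model `clf_tf` from the sketch above; `ConfusionMatrixDisplay.from_predictions` is available in scikit-learn 1.2, the version pinned in `requirements.txt`.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"y_dev_pred = clf_tf.predict(X_dev_tf)\n",
"print(classification_report(y_dev, y_dev_pred))\n",
"ConfusionMatrixDisplay.from_predictions(y_dev, y_dev_pred, xticks_rotation='vertical')"
]
},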
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data re-configuration\n",
"After the error analysis, we came to the conclusion that one of the class can not be distinguised from the others. There is no use trying to solve an impossible problem.\n",
"\n",
"**Questions**:\n",
"\n",
"> * Remove the class `ÙNE` from the original dataset using pandas [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)\n",
"> * Plot the class statitics with seaborn\n",
"> * Create new splits\n",
"> * Retrain a NaiveBayes classifier using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with the 1000 most frequent words."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# YOUR CODE HERE\n",
"\n",
"# Filter out the UNE class\n",
"\n",
"# Plot the statistics of classes\n",
"\n",
"# Make the splits and print the sizes for checking\n",
"\n",
"# Apply TfidfVectorizer\n",
"\n",
"# Train MultinomialNB\n",
"\n",
"# Print accuracy\n",
"\n",
"# Print confusion matric\n"
]
},
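{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch. It defines `df_filtered`, which the grid-search cell below expects, and reuses the naming conventions of the earlier splits.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# filter out the UNE class (boolean indexing; replace('UNE', np.nan) + dropna also works)\n",
"df_filtered = df[df['category'] != 'UNE']\n",
"sns.countplot(data=df_filtered, y='category')\n",
"\n",
"# new splits on the filtered data\n",
"X_train_dev, X_test, y_train_dev, y_test = train_test_split(\n",
"    df_filtered['text'], df_filtered['category'], test_size=0.10, random_state=42)\n",
"X_train, X_dev, y_train, y_dev = train_test_split(X_train_dev, y_train_dev, test_size=1/9, random_state=42)\n",
"print(len(X_train), len(X_dev), len(X_test))\n",
"\n",
"# TF-IDF vectors with the 1000 most frequent words\n",
"tfidf = TfidfVectorizer(max_features=1000)\n",
"X_train_tfidf = tfidf.fit_transform(X_train)\n",
"X_dev_tfidf = tfidf.transform(X_dev)\n",
"\n",
"clf_filtered = MultinomialNB().fit(X_train_tfidf, y_train)\n",
"print('dev accuracy:', clf_filtered.score(X_dev_tfidf, y_dev))\n",
"ConfusionMatrixDisplay.from_predictions(y_dev, clf_filtered.predict(X_dev_tfidf), xticks_rotation='vertical')"
]
},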
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hyperparameter optimization\n",
"\n",
"The classification process has many parameters : alpha for the classifier, max_features, max_df, min_df, using idf or not, ngram orders for the Count of TfIDF transformer. These parameters can be optimized by a grid search using GridSearchCV.\n",
"\n",
"**Question**:\n",
"\n",
"> * Using the template code below, find the best values for the parameter max_features, max_df, min_df, use_idf, ngram_range, alpha\n",
"> * Refit the best model on all the train+dev data and print accuracy on test set\n",
"\n",
"Note that for developping the code, the number of training samples is limited to 1000\n",
"\n",
"```\n",
"df_filtered_train_dev.iloc[:1000].text\n",
"```\n",
"\n",
"Once your code is correct, you can train on the full training set.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#### Hyperameters optimization with GridSearchCV = parallel processing\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"from pprint import pprint\n",
"from time import time\n",
"import logging\n",
"# Display progress logs on stdout\n",
"logging.basicConfig(level=logging.INFO,\n",
" format='%(asctime)s %(levelname)s %(message)s')\n",
"\n",
"# create train_dev and test set for using Cross-Validation\n",
"df_filtered_train_dev, df_filtered_test = train_test_split(df_filtered.dropna() ,test_size=0.10, random_state=42)\n",
"print ('train_dev size',df_filtered_train_dev.shape)\n",
"print ('test size',df_filtered_test.shape)\n",
"# keep only 1000 training data for debuging\n",
"X_train_dev, y_train_dev =df_filtered_train_dev.iloc[:1000].text, df_filtered_train.iloc[:1000].category\n",
"X_test, y_test =df_filtered_test.text, df_filtered_test.category\n",
"\n",
"\n",
"\n",
"pipeline = Pipeline([\n",
" ('tfidf', TfidfVectorizer()),\n",
" ('clf', MultinomialNB()),\n",
"])\n",
"\n",
"\n",
"parameters = {\n",
" 'tfidf__max_features': (500, 1000, 5000, 10000, None),\n",
" # YOUR CODE HERE\n",
"}\n",
"if __name__ == \"__main__\":\n",
" # multiprocessing requires the fork to happen in a __main__ protected\n",
" # block\n",
"\n",
" # find the best parameters for both the feature extraction and the\n",
" # classifier\n",
" grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2, cv=3)\n",
"\n",
" print(\"Performing grid search...\")\n",
" print(\"pipeline:\", [name for name, _ in pipeline.steps])\n",
" print(\"parameters:\")\n",
" pprint(parameters)\n",
" t0 = time()\n",
" grid_search.fit(X_train_dev, y_train_dev)\n",
" print(\"done in %0.3fs\" % (time() - t0))\n",
" print()\n",
"\n",
" print(\"Best score: %0.3f\" % grid_search.best_score_)\n",
" print(\"Best parameters set:\")\n",
" best_parameters = grid_search.best_estimator_.get_params()\n",
" for param_name in sorted(parameters.keys()):\n",
" print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n",
" df = pd.DataFrame(grid_search.cv_results_)\n",
" print (df[['rank_test_score','param_tfidf__max_features','mean_test_score']].sort_values('rank_test_score'))\n",
" \n",
" # use refit and print accuracy on test set\n",
" # YOUR CODE HERE"
]
},
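{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible parameter grid and refit step for the template above; the value lists are illustrative, not prescribed. The `tfidf__*` and `clf__alpha` names refer to the pipeline steps defined in the previous cell.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: a fuller grid for the pipeline defined above (plug it into the grid-search cell)\n",
"parameters = {\n",
"    'tfidf__max_features': (1000, 5000, 10000, None),\n",
"    'tfidf__max_df': (0.5, 0.75, 1.0),\n",
"    'tfidf__min_df': (1, 2, 5),\n",
"    'tfidf__use_idf': (True, False),\n",
"    'tfidf__ngram_range': ((1, 1), (1, 2)),\n",
"    'clf__alpha': (0.01, 0.1, 1.0),\n",
"}\n",
"\n",
"# after running the grid search above, GridSearchCV (refit=True by default) has already\n",
"# refitted the best pipeline on all of X_train_dev, so it can be scored on the test set\n",
"print('test accuracy:', grid_search.score(X_test, y_test))"
]
},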
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification with Neural networks\n",
"\n",
"Neural networks can be trained to learn both the vector representation of the words (instead of tf-idf) and how to classify the documents. The code below allows you to train a neural text classifier using word embeddings using Keras. Most of the code is written, you only have to define the architecture of the network with the correct parameters before training it : \n",
"\n",
"**Question**:\n",
"\n",
"> * Define a neural network in the function `get_model()` with the following parameters : \n",
"> * use only the 10 000 most frequent words in the documents\n",
"> * use 1024 as the maximal number of words in the articles\n",
"> * use an embedding size of 300: [embedding layer](https://keras.io/layers/embeddings/)\n",
"> * use a dropout of 0.5: [dropout layer](https://keras.io/layers/core/#dropout)\n",
"> * use 32 convolutional filters of size 2 x EMBED_SIZE: [1D convolutional layer](https://keras.io/layers/convolutional/#conv1d)\n",
"> * use a max pooling of size 2 : [1D Max Pooling](https://keras.io/layers/pooling/#maxpooling1d)\n",
"> * Train the model and compare its accuracy to the Naive Bayes models.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import os\n",
"from nn_utils import TrainingHistory\n",
"from keras.layers import Dense, Embedding, Input\n",
"from keras.layers import GRU, Dropout, MaxPooling1D, Conv1D, Flatten\n",
"from keras.models import Model\n",
"import numpy as np\n",
"import itertools\n",
"from keras.utils import np_utils\n",
"from sklearn.metrics import (classification_report, \n",
" precision_recall_fscore_support, \n",
" accuracy_score)\n",
"\n",
"from keras.preprocessing import text, sequence\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Model parameters\n",
"MAX_FEATURES = # YOUR CODE HERE\n",
"MAX_TEXT_LENGTH = # YOUR CODE HERE\n",
"EMBED_SIZE = # YOUR CODE HERE\n",
"BATCH_SIZE = 16\n",
"EPOCHS = 10\n",
"VALIDATION_SPLIT = 0.1"
]
},
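{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible filling of the placeholders above, using the values given in the question.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"MAX_FEATURES = 10000     # 10 000 most frequent words\n",
"MAX_TEXT_LENGTH = 1024   # maximal number of words per article\n",
"EMBED_SIZE = 300         # embedding size"
]
},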
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_train_test(train_raw_text, test_raw_text):\n",
" \n",
" tokenizer = text.Tokenizer(num_words=MAX_FEATURES)\n",
"\n",
" tokenizer.fit_on_texts(list(train_raw_text))\n",
" train_tokenized = tokenizer.texts_to_sequences(train_raw_text)\n",
" test_tokenized = tokenizer.texts_to_sequences(test_raw_text)\n",
" return sequence.pad_sequences(train_tokenized, maxlen=MAX_TEXT_LENGTH), \\\n",
" sequence.pad_sequences(test_tokenized, maxlen=MAX_TEXT_LENGTH)\n",
"\n",
"\n",
"\n",
"def get_model():\n",
"\n",
" inp = Input(shape=(# YOUR CODE HERE,))\n",
" model = Embedding(# YOUR CODE HERE, # YOUR CODE HERE)(inp)\n",
" model = Dropout(# YOUR CODE HERE)(model)\n",
" model = Conv1D(filters=# YOUR CODE HERE, kernel_size=# YOUR CODE HERE, padding='same', activation='relu')(model)\n",
" model = MaxPooling1D(pool_size=# YOUR CODE HERE)(model)\n",
" model = Flatten()(model)\n",
" model = Dense(7, activation=\"softmax\")(model)\n",
" model = Model(inputs=inp, outputs=model)\n",
" \n",
" model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
" model.summary()\n",
" return model\n",
"\n",
"\n",
"def train_fit_predict(model, x_train, x_test, y, history):\n",
" \n",
" model.fit(x_train, y,\n",
" batch_size=BATCH_SIZE,\n",
" epochs=EPOCHS, verbose=1,\n",
" validation_split=VALIDATION_SPLIT)\n",
"\n",
" return model.predict(x_test)\n",
"\n",
"\n",
"# Get the list of different classes\n",
"CLASSES_LIST = np.unique(y_train)\n",
"n_out = len(CLASSES_LIST)\n",
"print(CLASSES_LIST)\n",
"\n",
"# Convert clas string to index\n",
"from sklearn import preprocessing\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit(CLASSES_LIST)\n",
"y_train = le.transform(y_train) \n",
"y_test = le.transform(y_test) \n",
"train_y_cat = np_utils.to_categorical(y_train, n_out)\n",
"\n",
"# get the textual data in the correct format for NN\n",
"x_vec_train, x_vec_test = get_train_test(X_train, X_test)\n",
"print(len(x_vec_train), len(x_vec_test))\n",
"\n",
"# define the NN topology\n",
"model = get_model()\n",
"\n",
"# Define training procedure\n",
"history = TrainingHistory(x_vec_test, y_test, CLASSES_LIST)\n",
"\n",
"# Train and predict\n",
"y_predicted = train_fit_predict(model, x_vec_train, x_vec_test, train_y_cat, history).argmax(1)\n",
"\n",
"\n",
"print(\"Test Accuracy:\", accuracy_score(y_test, y_predicted))\n",
"\n",
"p, r, f1, s = precision_recall_fscore_support(y_test, y_predicted, \n",
" average='micro',\n",
" labels=[x for x in np.unique(y_train) ])\n",
"\n",
"print('p r f1 %.1f %.2f %.3f' % (np.average(p, weights=s)*100.0, \n",
" np.average(r, weights=s)*100.0, \n",
" np.average(f1, weights=s)*100.0))\n",
"\n",
"\n",
"print(classification_report(y_test, y_predicted, labels=[x for x in np.unique(y_train)]))"
]
},
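{
"cell_type": "markdown",
"metadata": {},
"source": [
"*A possible sketch of `get_model` (named `get_model_sketch` here so it does not override the exercise version), following the layer sizes given in the question; `n_out` is assumed to be the number of classes computed in the cell above, and `Embedding(MAX_FEATURES, EMBED_SIZE)` assumes the tokenizer indices stay below `MAX_FEATURES`.*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_model_sketch():\n",
"    # input: one padded sequence of MAX_TEXT_LENGTH word indices\n",
"    inp = Input(shape=(MAX_TEXT_LENGTH,))\n",
"    x = Embedding(MAX_FEATURES, EMBED_SIZE)(inp)\n",
"    x = Dropout(0.5)(x)\n",
"    # 32 filters of width 2, i.e. 2 x EMBED_SIZE once the embedding dimension is included\n",
"    x = Conv1D(filters=32, kernel_size=2, padding='same', activation='relu')(x)\n",
"    x = MaxPooling1D(pool_size=2)(x)\n",
"    x = Flatten()(x)\n",
"    out = Dense(n_out, activation='softmax')(x)  # one unit per class\n",
"    model = Model(inputs=inp, outputs=out)\n",
"    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"    model.summary()\n",
"    return model"
]
},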
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "nlp-class-env",
"language": "python",
"name": "nlp-class-env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

703
data/stop_word_fr.txt Normal file

@@ -0,0 +1,703 @@
s'est
ans
faire
avoir
an
d'une
d'un
c'est
qu'il
a
abord
absolument
afin
ah
ai
aie
aient
aies
ailleurs
ainsi
ait
allaient
allo
allons
allô
alors
anterieur
anterieure
anterieures
apres
après
as
assez
attendu
au
aucun
aucune
aucuns
aujourd
aujourd'hui
aupres
auquel
aura
aurai
auraient
aurais
aurait
auras
aurez
auriez
aurions
aurons
auront
aussi
autre
autrefois
autrement
autres
autrui
aux
auxquelles
auxquels
avaient
avais
avait
avant
avec
avez
aviez
avions
avoir
avons
ayant
ayez
ayons
b
bah
bas
basee
bat
beau
beaucoup
bien
bigre
bon
boum
bravo
brrr
c
car
ce
ceci
cela
celle
celle-ci
celle-là
celles
celles-ci
celles-là
celui
celui-ci
celui-là
celà
cent
cependant
certain
certaine
certaines
certains
certes
ces
cet
cette
ceux
ceux-ci
ceux-là
chacun
chacune
chaque
cher
chers
chez
chiche
chut
chère
chères
ci
cinq
cinquantaine
cinquante
cinquantième
cinquième
clac
clic
combien
comme
comment
comparable
comparables
compris
concernant
contre
couic
crac
d
da
dans
de
debout
dedans
dehors
deja
delà
depuis
dernier
derniere
derriere
derrière
des
desormais
desquelles
desquels
dessous
dessus
deux
deuxième
deuxièmement
devant
devers
devra
devrait
different
differentes
differents
différent
différente
différentes
différents
dire
directe
directement
dit
dite
dits
divers
diverse
diverses
dix
dix-huit
dix-neuf
dix-sept
dixième
doit
doivent
donc
dont
dos
douze
douzième
dring
droite
du
duquel
durant
dès
début
désormais
e
effet
egale
egalement
egales
eh
elle
elle-même
elles
elles-mêmes
en
encore
enfin
entre
envers
environ
es
essai
est
et
etant
etc
etre
eu
eue
eues
euh
eurent
eus
eusse
eussent
eusses
eussiez
eussions
eut
eux
eux-mêmes
exactement
excepté
extenso
exterieur
eûmes
eût
eûtes
f
fais
faisaient
faisant
fait
faites
façon
feront
fi
flac
floc
fois
font
force
furent
fus
fusse
fussent
fusses
fussiez
fussions
fut
fûmes
fût
fûtes
g
gens
h
ha
haut
hein
hem
hep
hi
ho
holà
hop
hormis
hors
hou
houp
hue
hui
huit
huitième
hum
hurrah
hélas
i
ici
il
ils
importe
j
je
jusqu
jusque
juste
k
l
la
laisser
laquelle
las
le
lequel
les
lesquelles
lesquels
leur
leurs
longtemps
lors
lorsque
lui
lui-meme
lui-même
lès
m
ma
maint
maintenant
mais
malgre
malgré
maximale
me
meme
memes
merci
mes
mien
mienne
miennes
miens
mille
mince
mine
minimale
moi
moi-meme
moi-même
moindres
moins
mon
mot
moyennant
multiple
multiples
même
mêmes
n
na
naturel
naturelle
naturelles
ne
neanmoins
necessaire
necessairement
neuf
neuvième
ni
nombreuses
nombreux
nommés
non
nos
notamment
notre
nous
nous-mêmes
nouveau
nouveaux
nul
néanmoins
nôtre
nôtres
o
oh
ohé
ollé
olé
on
ont
onze
onzième
ore
ou
ouf
ouias
oust
ouste
outre
ouvert
ouverte
ouverts
o|
p
paf
pan
par
parce
parfois
parle
parlent
parler
parmi
parole
parseme
partant
particulier
particulière
particulièrement
pas
passé
pendant
pense
permet
personne
personnes
peu
peut
peuvent
peux
pff
pfft
pfut
pif
pire
pièce
plein
plouf
plupart
plus
plusieurs
plutôt
possessif
possessifs
possible
possibles
pouah
pour
pourquoi
pourrais
pourrait
pouvait
prealable
precisement
premier
première
premièrement
pres
probable
probante
procedant
proche
près
psitt
pu
puis
puisque
pur
pure
q
qu
quand
quant
quant-à-soi
quanta
quarante
quatorze
quatre
quatre-vingt
quatrième
quatrièmement
que
quel
quelconque
quelle
quelles
quelqu'un
quelque
quelques
quels
qui
quiconque
quinze
quoi
quoique
r
rare
rarement
rares
relative
relativement
remarquable
rend
rendre
restant
reste
restent
restrictif
retour
revoici
revoilà
rien
s
sa
sacrebleu
sait
sans
sapristi
sauf
se
sein
seize
selon
semblable
semblaient
semble
semblent
sent
sept
septième
sera
serai
seraient
serais
serait
seras
serez
seriez
serions
serons
seront
ses
seul
seule
seulement
si
sien
sienne
siennes
siens
sinon
six
sixième
soi
soi-même
soia
soient
sois
soit
soixante
sommes
son
sont
sous
souvent
soyez
soyons
specifique
specifiques
speculatif
stop
strictement
subtiles
suffisant
suffisante
suffit
suis
suit
suivant
suivante
suivantes
suivants
suivre
sujet
superpose
sur
surtout
t
ta
tac
tandis
tant
tardive
te
tel
telle
tellement
telles
tels
tenant
tend
tenir
tente
tes
tic
tien
tienne
tiennes
tiens
toc
toi
toi-même
ton
touchant
toujours
tous
tout
toute
toutefois
toutes
treize
trente
tres
trois
troisième
troisièmement
trop
très
tsoin
tsouin
tu
u
un
une
unes
uniformement
unique
uniques
uns
v
va
vais
valeur
vas
vers
via
vif
vifs
vingt
vivat
vive
vives
vlan
voici
voie
voient
voilà
vont
vos
votre
vous
vous-mêmes
vu
vôtre
vôtres
w
x
y
z
zut
zutalors
à
â
ça
ès
étaient
étais
était
étant
état
étiez
étions
été
étée
étées
étés
êtes
être
êtreêtre
ô
ôau

46
nn_utils.py Normal file

@@ -0,0 +1,46 @@
import numpy as np
from sklearn.metrics import (classification_report,
                             precision_recall_fscore_support,
                             accuracy_score)
from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint


class TrainingHistory(Callback):
    def __init__(self, x_test, y_test, CLASSES_LIST):
        super(TrainingHistory, self).__init__()
        self.x_test = x_test
        self.y_test = y_test
        self.CLASSES_LIST = CLASSES_LIST

    def on_train_begin(self, logs={}):
        self.losses = []
        self.epoch_losses = []
        self.epoch_val_losses = []
        self.val_losses = []
        self.predictions = []
        self.epochs = []
        self.f1 = []
        self.i = 0
        self.save_every = 50

    def on_epoch_end(self, epoch, logs={}):
        y_predicted = self.model.predict(self.x_test).argmax(1)
        print(y_predicted.shape)
        print("Test Accuracy:", accuracy_score(self.y_test, y_predicted))
        p, r, f1, s = precision_recall_fscore_support(self.y_test, y_predicted,
                                                      average='micro',
                                                      labels=[x for x in self.CLASSES_LIST])
        print('p r f1 %.1f %.1f %.1f' % (np.average(p, weights=s) * 100.0,
                                         np.average(r, weights=s) * 100.0,
                                         np.average(f1, weights=s) * 100.0))
        try:
            print(classification_report(self.y_test, y_predicted,
                                        labels=[x for x in self.CLASSES_LIST]))
        except Exception:
            print('ZERO')

5
requirements.txt Normal file

@@ -0,0 +1,5 @@
wordcloud==1.8.2.2
ipykernel==6.19.4
pandas==1.5.2
seaborn==0.12.2
scikit-learn==1.2.0