updated notebook
This commit is contained in:
parent
8a7572760d
commit
87dd736d83
TextClassification_LeMonde.ipynb (new file)
@@ -0,0 +1,521 @@
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Text classification on LeMonde2003 Dataset\n",
|
||||
"\n",
|
||||
"In this notebook, we \n",
|
||||
"apply classification algorithms to newspaper articles published in 2003 in *Le Monde*. \n",
|
||||
"\n",
|
||||
"The data are here : https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
|
||||
"\n",
|
||||
"Download it into the data directory : \n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"wget https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"These articles concern different subjects but we will consider only articles related to the following subjects : entreprises (ENT), international (INT), arts (ART), société (SOC), France (FRA), sports (SPO), livres (LIV), télévision (TEL) and the font page articles (UNE).\n",
|
||||
"\n",
|
||||
"\n",
|
||||
 * Load the CSV">
"> * Load the CSV file `data/LeMonde2003_9classes.csv.gz` containing the articles using pandas [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). How many articles do you have?\n",
"> * Plot the frequency histogram of the categories using seaborn [countplot](https://seaborn.pydata.org/tutorial/categorical.html): `sns.countplot(data=df, y='category')`\n",
"> * Display the text of some of the articles with the corresponding class using pandas [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)\n",
"> * Using the [WordCloud library](https://amueller.github.io/word_cloud/index.html), display a word cloud for the most frequent classes. You can remove the stop words using the `stopwords` option, with the list of French stop words in `data/stop_word_fr.txt`.\n",
|
||||
"\n"
|
||||
]
|
||||
},
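{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: one way to load and explore the data.\n",
"# It assumes the CSV has been downloaded to data/LeMonde2003_9classes.csv.gz and has\n",
"# 'text' and 'category' columns, as used later in this notebook.\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"\n",
"df_example = pd.read_csv('data/LeMonde2003_9classes.csv.gz')\n",
"print(len(df_example), 'articles')\n",
"sns.countplot(data=df_example, y='category')\n",
"df_example.sample(3)"
]
},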
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"# load dataframe from CSV file\n",
|
||||
"# YOUR CODE HERE\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import seaborn as sns\n",
|
||||
"%matplotlib inline\n",
|
||||
"\n",
|
||||
"# Plot the statistics of category\n",
|
||||
"# YOUR CODE HERE"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Print examples of the articles\n",
|
||||
"pd.set_option('display.max_colwidth', None)\n",
|
||||
"# YOUR CODE HERE\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from wordcloud import WordCloud\n",
|
||||
"# Display one wordcloud for each of the most frequent classes\n",
|
||||
"\n",
|
||||
"from wordcloud import WordCloud\n",
|
||||
"STOPWORDS = [x.strip() for x in open('data/stop_word_fr.txt').readlines()]\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"# plot a word cloud for each category\n",
|
||||
"for cat in ['ENT', 'INT', 'ART', 'SOC', 'FRA']:\n",
|
||||
" # YOUR CODE HERE"
|
||||
]
|
||||
},
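{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: a word cloud for a single category,\n",
"# assuming the articles are in a dataframe named df with 'text' and 'category' columns.\n",
"wc = WordCloud(stopwords=set(STOPWORDS), background_color='white', max_words=100)\n",
"wc.generate(' '.join(df[df.category == 'SPO'].text.dropna()))\n",
"plt.imshow(wc, interpolation='bilinear')\n",
"plt.axis('off')\n",
"plt.show()"
]
},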
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Bag-of-word representation\n",
|
||||
"\n",
|
||||
"In order to apply machine learning algorithms to text, documents must be transformed into vectors. The most simple and standard way to transform a document into a vector is the *bag-of-word* encoding.\n",
|
||||
"\n",
|
||||
"The idea is very simple : \n",
|
||||
"\n",
|
||||
"1. define the set of all the possible words that can appear in a document; denote its size by `max_features`.\n",
|
||||
"2. for each document, encode it with a vector of size `max_features`, with the value of the ith component of the vector equal to the number of time the ith word appears in the document.\n",
|
||||
"\n",
|
||||
"See [the wikipedia article on Bag-of-word](https://en.wikipedia.org/wiki/Bag-of-words_model) for an example.\n",
|
||||
"\n",
|
||||
"Scikit-learn proposes different methods to encode text into vectors : [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).\n",
|
||||
"\n",
|
||||
"The encoder must first be trained on the train set and applied to the different sets, for example with the 200 words : \n",
|
||||
"\n",
|
||||
"\tfrom sklearn.feature_extraction.text import CountVectorizer\n",
|
||||
"\tvectorizer = CountVectorizer(max_features=200)\n",
|
||||
" vectorizer.fit(X_train)\n",
|
||||
" X_train_counts = vectorizer.transform(X_train)\n",
|
||||
" X_test_counts = vectorizer.transform(X_test)\n",
|
||||
" \n",
|
||||
"**Question**:\n",
|
||||
"\n",
|
||||
"> * Split the dataset LeMonde2003 into train set (80%), dev set (10%) and test set (10%) using scikit-learn [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n",
|
||||
"> * For each set, transform the text of the articles into vectors using the `CountVectorizer`, considering the 1000 most frequent words. \n",
|
||||
 * Train a naive">
"> * Train a Naive Bayes classifier on the data.\n",
|
||||
"> * Evaluate the classification accuracy on the train, dev and test sets using the [score](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.score) method. \n",
|
||||
"\n",
|
||||
 * ***Important***">
"> ***Important***: the test set must not be used during the training phase, and learning the vector representation of the words is part of training. The dev set should only be used to estimate the performance that can be expected on the test set.\n",
|
||||
"\n"
|
||||
]
|
||||
},
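{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: an 80/10/10 split can be obtained with two\n",
"# successive calls to train_test_split, assuming the dataframe loaded above is named df.\n",
"# The *_ex variable names are only placeholders.\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train_ex, X_tmp_ex, y_train_ex, y_tmp_ex = train_test_split(df.text, df.category, test_size=0.2, random_state=42)\n",
"X_dev_ex, X_test_ex, y_dev_ex, y_test_ex = train_test_split(X_tmp_ex, y_tmp_ex, test_size=0.5, random_state=42)\n",
"print(len(X_train_ex), len(X_dev_ex), len(X_test_ex))"
]
},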
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.model_selection import train_test_split\n",
|
||||
"# Split the dataset, create X (features) and y (target), print the size\n",
|
||||
"# YOUR CODE HERE\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import CountVectorizer\n",
|
||||
"# Create document vectors\n",
|
||||
"# YOUR CODE HERE\n",
|
||||
"# create the vectorizer object\n",
|
||||
"\n",
|
||||
"# fit on train data\n",
|
||||
"\n",
|
||||
"# apply it on train and dev data\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.naive_bayes import MultinomialNB\n",
|
||||
"# train a Naive Bayes classifier\n",
|
||||
"# YOUR CODE HERE\n",
|
||||
"# create the MultinomialNB\n",
|
||||
"\n",
|
||||
"# Train \n",
|
||||
"\n",
|
||||
"# Evaluate \n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## TF-IDF representation\n",
|
||||
"\n",
|
||||
"The `CountVectorizer` encodes the text using the raw frequencies of the words. However, words that are very frequent and appear in all the documents will have a strong weight whereas they are not discriminative. The *Term-Frequency Inverse-Document-Frequency* weighting scheme take into accound the number of documents in which a given word occurs. A word that appear in many document will have less weight. See [the wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for more details.\n",
|
||||
"\n",
|
||||
"With scikit-learn, the `TfidfTransformer` is applied after the `CountVectorizer` :\n",
|
||||
"\n",
|
||||
"\tfrom sklearn.feature_extraction.text import TfidfTransformer\n",
|
||||
"\ttf_transformer = TfidfTransformer().fit(X_train_counts)\n",
|
||||
" \tX_train_tf = tf_transformer.transform(X_train_counts)\n",
|
||||
"\tX_test_tf = tf_transformer.transform(X_test_counts)\n",
|
||||
"\t\n",
|
||||
"**Question**:\n",
|
||||
"\n",
|
||||
"> * Use the TF-IDF representation to train a Multinomial Naive Bayes classifier. Report your best test error rate and the error rates for all the configurations tested."
|
||||
]
|
||||
},
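{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: chaining TfidfTransformer with MultinomialNB.\n",
"# It assumes X_train_counts / X_dev_counts come from the CountVectorizer fitted above and that\n",
"# the labels are in y_train / y_dev; adapt the names to your own variables.\n",
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"from sklearn.naive_bayes import MultinomialNB\n",
"\n",
"tf_transformer_ex = TfidfTransformer().fit(X_train_counts)\n",
"X_train_tf_ex = tf_transformer_ex.transform(X_train_counts)\n",
"X_dev_tf_ex = tf_transformer_ex.transform(X_dev_counts)\n",
"clf_tf_ex = MultinomialNB().fit(X_train_tf_ex, y_train)\n",
"print('dev accuracy:', clf_tf_ex.score(X_dev_tf_ex, y_dev))"
]
},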
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import TfidfTransformer\n",
|
||||
"# YOUR CODE HERE\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Error analysis\n",
|
||||
"\n",
|
||||
"The classification error rate give an evaluation of the performance for all the classes. But since the classes are not equally distributed, they may not be equally well modelized. In order to get a better idea of the performance of the classifier, detailed metrics must be used : \n",
|
||||
"\n",
|
||||
"* [metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) provides a detailed analysis per class : the precision (amongst all the example classified as class X, how many are really from the classX) and the recall (amongst all the example that are from the class X, how many are classified as class X) and the F-Score which is as a weighted harmonic mean of the precision and recall.\n",
|
||||
"* [metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) which give the confusions between the classes. It can be displayed in color with [plot_confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.plot_confusion_matrix.html#sklearn.metrics.plot_confusion_matrix).\n",
|
||||
"\n",
|
||||
"**Question**:\n",
|
||||
"\n",
|
||||
 * Report the">
"> * Report the `classification_report` for your classifier. Which classes have the best scores? Why?\n",
"> * Report the `confusion_matrix` for your classifier. Which classes are the most confused? Why?\n"
|
||||
]
|
||||
},
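{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: detailed metrics for a fitted classifier.\n",
"# The names clf, X_dev_tf and y_dev are placeholders for your own classifier and dev data.\n",
"from sklearn.metrics import classification_report, ConfusionMatrixDisplay\n",
"\n",
"y_dev_pred_ex = clf.predict(X_dev_tf)\n",
"print(classification_report(y_dev, y_dev_pred_ex))\n",
"ConfusionMatrixDisplay.from_predictions(y_dev, y_dev_pred_ex, xticks_rotation='vertical')"
]
},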
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.metrics import classification_report, ConfusionMatrixDisplay\n",
|
||||
"\n",
|
||||
"# YOUR CODE HERE\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Data re-configuration\n",
|
||||
"After the error analysis, we came to the conclusion that one of the class can not be distinguised from the others. There is no use trying to solve an impossible problem.\n",
|
||||
"\n",
|
||||
"**Questions**:\n",
|
||||
"\n",
|
||||
 * Remove the class">
"> * Remove the class `UNE` from the original dataset using pandas [replace](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html)\n",
"> * Plot the class statistics with seaborn\n",
|
||||
"> * Create new splits\n",
|
||||
"> * Retrain a NaiveBayes classifier using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with the 1000 most frequent words."
|
||||
]
|
||||
},
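{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: instead of pandas replace, a simple boolean\n",
"# filter also removes the UNE articles. It assumes the dataframe is named df; the name\n",
"# df_filtered matches the variable used by the grid-search cell further below.\n",
"df_filtered = df[df.category != 'UNE']\n",
"print(df.shape, '->', df_filtered.shape)"
]
},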
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
|
||||
"# YOUR CODE HERE\n",
|
||||
"\n",
|
||||
"# Filter out the UNE class\n",
|
||||
"\n",
|
||||
"# Plot the statistics of classes\n",
|
||||
"\n",
|
||||
"# Make the splits and print the sizes for checking\n",
|
||||
"\n",
|
||||
"# Apply TfidfVectorizer\n",
|
||||
"\n",
|
||||
"# Train MultinomialNB\n",
|
||||
"\n",
|
||||
"# Print accuracy\n",
|
||||
"\n",
|
||||
"# Print confusion matric\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Hyperparameter optimization\n",
|
||||
"\n",
|
||||
"The classification process has many parameters : alpha for the classifier, max_features, max_df, min_df, using idf or not, ngram orders for the Count of TfIDF transformer. These parameters can be optimized by a grid search using GridSearchCV.\n",
|
||||
"\n",
|
||||
"**Question**:\n",
|
||||
"\n",
|
||||
 * Using the template">
"> * Using the template code below, find the best values for the parameters max_features, max_df, min_df, use_idf, ngram_range and alpha\n",
"> * Refit the best model on all the train+dev data and print the accuracy on the test set\n",
"\n",
"Note that for developing the code, the number of training samples is limited to 1000:\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"df_filtered_train_dev.iloc[:1000].text\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"Once your code is correct, you can train on the full training set.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#### Hyperameters optimization with GridSearchCV = parallel processing\n",
|
||||
"from sklearn.model_selection import GridSearchCV\n",
|
||||
"from sklearn.pipeline import Pipeline\n",
|
||||
"from pprint import pprint\n",
|
||||
"from time import time\n",
|
||||
"import logging\n",
|
||||
"# Display progress logs on stdout\n",
|
||||
"logging.basicConfig(level=logging.INFO,\n",
|
||||
" format='%(asctime)s %(levelname)s %(message)s')\n",
|
||||
"\n",
|
||||
"# create train_dev and test set for using Cross-Validation\n",
|
||||
"df_filtered_train_dev, df_filtered_test = train_test_split(df_filtered.dropna() ,test_size=0.10, random_state=42)\n",
|
||||
"print ('train_dev size',df_filtered_train_dev.shape)\n",
|
||||
"print ('test size',df_filtered_test.shape)\n",
|
||||
"# keep only 1000 training data for debuging\n",
|
||||
"X_train_dev, y_train_dev =df_filtered_train_dev.iloc[:1000].text, df_filtered_train.iloc[:1000].category\n",
|
||||
"X_test, y_test =df_filtered_test.text, df_filtered_test.category\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"pipeline = Pipeline([\n",
|
||||
" ('tfidf', TfidfVectorizer()),\n",
|
||||
" ('clf', MultinomialNB()),\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"parameters = {\n",
|
||||
" 'tfidf__max_features': (500, 1000, 5000, 10000, None),\n",
|
||||
" # YOUR CODE HERE\n",
|
||||
"}\n",
|
||||
"if __name__ == \"__main__\":\n",
|
||||
" # multiprocessing requires the fork to happen in a __main__ protected\n",
|
||||
" # block\n",
|
||||
"\n",
|
||||
" # find the best parameters for both the feature extraction and the\n",
|
||||
" # classifier\n",
|
||||
" grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2, cv=3)\n",
|
||||
"\n",
|
||||
" print(\"Performing grid search...\")\n",
|
||||
" print(\"pipeline:\", [name for name, _ in pipeline.steps])\n",
|
||||
" print(\"parameters:\")\n",
|
||||
" pprint(parameters)\n",
|
||||
" t0 = time()\n",
|
||||
" grid_search.fit(X_train_dev, y_train_dev)\n",
|
||||
" print(\"done in %0.3fs\" % (time() - t0))\n",
|
||||
" print()\n",
|
||||
"\n",
|
||||
" print(\"Best score: %0.3f\" % grid_search.best_score_)\n",
|
||||
" print(\"Best parameters set:\")\n",
|
||||
" best_parameters = grid_search.best_estimator_.get_params()\n",
|
||||
" for param_name in sorted(parameters.keys()):\n",
|
||||
" print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n",
|
||||
" df = pd.DataFrame(grid_search.cv_results_)\n",
|
||||
" print (df[['rank_test_score','param_tfidf__max_features','mean_test_score']].sort_values('rank_test_score'))\n",
|
||||
" \n",
|
||||
" # use refit and print accuracy on test set\n",
|
||||
" # YOUR CODE HERE"
|
||||
]
|
||||
},
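{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not the official solution: with refit=True (the GridSearchCV default),\n",
"# best_estimator_ is refit on all of the data passed to fit (here X_train_dev), so it can be\n",
"# scored on the held-out test set directly.\n",
"print('test accuracy:', grid_search.best_estimator_.score(X_test, y_test))"
]
},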
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Classification with Neural networks\n",
|
||||
"\n",
|
||||
"Neural networks can be trained to learn both the vector representation of the words (instead of tf-idf) and how to classify the documents. The code below allows you to train a neural text classifier using word embeddings using Keras. Most of the code is written, you only have to define the architecture of the network with the correct parameters before training it : \n",
|
||||
"\n",
|
||||
"**Question**:\n",
|
||||
"\n",
|
||||
"> * Define a neural network in the function `get_model()` with the following parameters : \n",
|
||||
"> * use only the 10 000 most frequent words in the documents\n",
|
||||
"> * use 1024 as the maximal number of words in the articles\n",
|
||||
"> * use an embedding size of 300: [embedding layer](https://keras.io/layers/embeddings/)\n",
|
||||
"> * use a dropout of 0.5: [dropout layer](https://keras.io/layers/core/#dropout)\n",
|
||||
"> * use 32 convolutional filters of size 2 x EMBED_SIZE: [1D convolutional layer](https://keras.io/layers/convolutional/#conv1d)\n",
|
||||
"> * use a max pooling of size 2 : [1D Max Pooling](https://keras.io/layers/pooling/#maxpooling1d)\n",
|
||||
"> * Train the model and compare its accuracy to the Naive Bayes models.\n",
|
||||
"\n"
|
||||
]
|
||||
},
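{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Illustrative sketch, not necessarily the intended solution: one possible architecture matching\n",
"# the description above, written with the Keras functional API and hard-coded sizes.\n",
"# With UNE removed, 8 classes remain, hence the size of the final softmax layer.\n",
"from keras.layers import Input, Embedding, Dropout, Conv1D, MaxPooling1D, Flatten, Dense\n",
"from keras.models import Model\n",
"\n",
"inp_ex = Input(shape=(1024,))      # MAX_TEXT_LENGTH\n",
"x = Embedding(10000, 300)(inp_ex)  # MAX_FEATURES words, EMBED_SIZE dimensions\n",
"x = Dropout(0.5)(x)\n",
"x = Conv1D(filters=32, kernel_size=2, padding='same', activation='relu')(x)\n",
"x = MaxPooling1D(pool_size=2)(x)\n",
"x = Flatten()(x)\n",
"out_ex = Dense(8, activation='softmax')(x)\n",
"model_ex = Model(inputs=inp_ex, outputs=out_ex)\n",
"model_ex.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
"model_ex.summary()"
]
},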
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import ast\n",
|
||||
"import os\n",
|
||||
"from nn_utils import TrainingHistory\n",
|
||||
"from keras.layers import Dense, Embedding, Input\n",
|
||||
"from keras.layers import GRU, Dropout, MaxPooling1D, Conv1D, Flatten\n",
|
||||
"from keras.models import Model\n",
|
||||
"import numpy as np\n",
|
||||
"import itertools\n",
|
||||
"from keras.utils import np_utils\n",
|
||||
"from sklearn.metrics import (classification_report, \n",
|
||||
" precision_recall_fscore_support, \n",
|
||||
" accuracy_score)\n",
|
||||
"\n",
|
||||
"from keras.preprocessing import text, sequence\n",
|
||||
"\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Model parameters\n",
|
||||
"MAX_FEATURES = # YOUR CODE HERE\n",
|
||||
"MAX_TEXT_LENGTH = # YOUR CODE HERE\n",
|
||||
"EMBED_SIZE = # YOUR CODE HERE\n",
|
||||
"BATCH_SIZE = 16\n",
|
||||
"EPOCHS = 10\n",
|
||||
"VALIDATION_SPLIT = 0.1"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"def get_train_test(train_raw_text, test_raw_text):\n",
|
||||
" \n",
|
||||
" tokenizer = text.Tokenizer(num_words=MAX_FEATURES)\n",
|
||||
"\n",
|
||||
" tokenizer.fit_on_texts(list(train_raw_text))\n",
|
||||
" train_tokenized = tokenizer.texts_to_sequences(train_raw_text)\n",
|
||||
" test_tokenized = tokenizer.texts_to_sequences(test_raw_text)\n",
|
||||
" return sequence.pad_sequences(train_tokenized, maxlen=MAX_TEXT_LENGTH), \\\n",
|
||||
" sequence.pad_sequences(test_tokenized, maxlen=MAX_TEXT_LENGTH)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def get_model():\n",
|
||||
"\n",
|
||||
" inp = Input(shape=(# YOUR CODE HERE,))\n",
|
||||
" model = Embedding(# YOUR CODE HERE, # YOUR CODE HERE)(inp)\n",
|
||||
" model = Dropout(# YOUR CODE HERE)(model)\n",
|
||||
" model = Conv1D(filters=# YOUR CODE HERE, kernel_size=# YOUR CODE HERE, padding='same', activation='relu')(model)\n",
|
||||
" model = MaxPooling1D(pool_size=# YOUR CODE HERE)(model)\n",
|
||||
" model = Flatten()(model)\n",
|
||||
" model = Dense(7, activation=\"softmax\")(model)\n",
|
||||
" model = Model(inputs=inp, outputs=model)\n",
|
||||
" \n",
|
||||
" model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
|
||||
" model.summary()\n",
|
||||
" return model\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"def train_fit_predict(model, x_train, x_test, y, history):\n",
|
||||
" \n",
|
||||
" model.fit(x_train, y,\n",
|
||||
" batch_size=BATCH_SIZE,\n",
|
||||
" epochs=EPOCHS, verbose=1,\n",
|
||||
" validation_split=VALIDATION_SPLIT)\n",
|
||||
"\n",
|
||||
" return model.predict(x_test)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"# Get the list of different classes\n",
|
||||
"CLASSES_LIST = np.unique(y_train)\n",
|
||||
"n_out = len(CLASSES_LIST)\n",
|
||||
"print(CLASSES_LIST)\n",
|
||||
"\n",
|
||||
"# Convert clas string to index\n",
|
||||
"from sklearn import preprocessing\n",
|
||||
"le = preprocessing.LabelEncoder()\n",
|
||||
"le.fit(CLASSES_LIST)\n",
|
||||
"y_train = le.transform(y_train) \n",
|
||||
"y_test = le.transform(y_test) \n",
|
||||
"train_y_cat = np_utils.to_categorical(y_train, n_out)\n",
|
||||
"\n",
|
||||
"# get the textual data in the correct format for NN\n",
|
||||
"x_vec_train, x_vec_test = get_train_test(X_train, X_test)\n",
|
||||
"print(len(x_vec_train), len(x_vec_test))\n",
|
||||
"\n",
|
||||
"# define the NN topology\n",
|
||||
"model = get_model()\n",
|
||||
"\n",
|
||||
"# Define training procedure\n",
|
||||
"history = TrainingHistory(x_vec_test, y_test, CLASSES_LIST)\n",
|
||||
"\n",
|
||||
"# Train and predict\n",
|
||||
"y_predicted = train_fit_predict(model, x_vec_train, x_vec_test, train_y_cat, history).argmax(1)\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"print(\"Test Accuracy:\", accuracy_score(y_test, y_predicted))\n",
|
||||
"\n",
|
||||
"p, r, f1, s = precision_recall_fscore_support(y_test, y_predicted, \n",
|
||||
" average='micro',\n",
|
||||
" labels=[x for x in np.unique(y_train) ])\n",
|
||||
"\n",
|
||||
"print('p r f1 %.1f %.2f %.3f' % (np.average(p, weights=s)*100.0, \n",
|
||||
" np.average(r, weights=s)*100.0, \n",
|
||||
" np.average(f1, weights=s)*100.0))\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"print(classification_report(y_test, y_predicted, labels=[x for x in np.unique(y_train)]))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "nlp-class-env",
|
||||
"language": "python",
|
||||
"name": "nlp-class-env"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.10.9"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 2
|
||||
}
data/stop_word_fr.txt (new file)
@@ -0,0 +1,703 @@
s'est
|
||||
ans
|
||||
faire
|
||||
avoir
|
||||
an
|
||||
d'une
|
||||
d'un
|
||||
c'est
|
||||
qu'il
|
||||
a
|
||||
abord
|
||||
absolument
|
||||
afin
|
||||
ah
|
||||
ai
|
||||
aie
|
||||
aient
|
||||
aies
|
||||
ailleurs
|
||||
ainsi
|
||||
ait
|
||||
allaient
|
||||
allo
|
||||
allons
|
||||
allô
|
||||
alors
|
||||
anterieur
|
||||
anterieure
|
||||
anterieures
|
||||
apres
|
||||
après
|
||||
as
|
||||
assez
|
||||
attendu
|
||||
au
|
||||
aucun
|
||||
aucune
|
||||
aucuns
|
||||
aujourd
|
||||
aujourd'hui
|
||||
aupres
|
||||
auquel
|
||||
aura
|
||||
aurai
|
||||
auraient
|
||||
aurais
|
||||
aurait
|
||||
auras
|
||||
aurez
|
||||
auriez
|
||||
aurions
|
||||
aurons
|
||||
auront
|
||||
aussi
|
||||
autre
|
||||
autrefois
|
||||
autrement
|
||||
autres
|
||||
autrui
|
||||
aux
|
||||
auxquelles
|
||||
auxquels
|
||||
avaient
|
||||
avais
|
||||
avait
|
||||
avant
|
||||
avec
|
||||
avez
|
||||
aviez
|
||||
avions
|
||||
avoir
|
||||
avons
|
||||
ayant
|
||||
ayez
|
||||
ayons
|
||||
b
|
||||
bah
|
||||
bas
|
||||
basee
|
||||
bat
|
||||
beau
|
||||
beaucoup
|
||||
bien
|
||||
bigre
|
||||
bon
|
||||
boum
|
||||
bravo
|
||||
brrr
|
||||
c
|
||||
car
|
||||
ce
|
||||
ceci
|
||||
cela
|
||||
celle
|
||||
celle-ci
|
||||
celle-là
|
||||
celles
|
||||
celles-ci
|
||||
celles-là
|
||||
celui
|
||||
celui-ci
|
||||
celui-là
|
||||
celà
|
||||
celà
|
||||
cent
|
||||
cependant
|
||||
certain
|
||||
certaine
|
||||
certaines
|
||||
certains
|
||||
certes
|
||||
ces
|
||||
cet
|
||||
cette
|
||||
ceux
|
||||
ceux-ci
|
||||
ceux-là
|
||||
chacun
|
||||
chacune
|
||||
chaque
|
||||
cher
|
||||
chers
|
||||
chez
|
||||
chiche
|
||||
chut
|
||||
chère
|
||||
chères
|
||||
ci
|
||||
cinq
|
||||
cinquantaine
|
||||
cinquante
|
||||
cinquantième
|
||||
cinquième
|
||||
clac
|
||||
clic
|
||||
combien
|
||||
comme
|
||||
comment
|
||||
comparable
|
||||
comparables
|
||||
compris
|
||||
concernant
|
||||
contre
|
||||
couic
|
||||
crac
|
||||
d
|
||||
da
|
||||
dans
|
||||
de
|
||||
debout
|
||||
dedans
|
||||
dehors
|
||||
deja
|
||||
delà
|
||||
depuis
|
||||
dernier
|
||||
derniere
|
||||
derriere
|
||||
derrière
|
||||
des
|
||||
desormais
|
||||
desquelles
|
||||
desquels
|
||||
dessous
|
||||
dessus
|
||||
deux
|
||||
deuxième
|
||||
deuxièmement
|
||||
devant
|
||||
devers
|
||||
devra
|
||||
devrait
|
||||
different
|
||||
differentes
|
||||
differents
|
||||
différent
|
||||
différente
|
||||
différentes
|
||||
différents
|
||||
dire
|
||||
directe
|
||||
directement
|
||||
dit
|
||||
dite
|
||||
dits
|
||||
divers
|
||||
diverse
|
||||
diverses
|
||||
dix
|
||||
dix-huit
|
||||
dix-neuf
|
||||
dix-sept
|
||||
dixième
|
||||
doit
|
||||
doivent
|
||||
donc
|
||||
dont
|
||||
dos
|
||||
douze
|
||||
douzième
|
||||
dring
|
||||
droite
|
||||
du
|
||||
duquel
|
||||
durant
|
||||
dès
|
||||
début
|
||||
désormais
|
||||
e
|
||||
effet
|
||||
egale
|
||||
egalement
|
||||
egales
|
||||
eh
|
||||
elle
|
||||
elle-même
|
||||
elles
|
||||
elles-mêmes
|
||||
en
|
||||
encore
|
||||
enfin
|
||||
entre
|
||||
envers
|
||||
environ
|
||||
es
|
||||
essai
|
||||
est
|
||||
et
|
||||
etant
|
||||
etc
|
||||
etre
|
||||
eu
|
||||
eue
|
||||
eues
|
||||
euh
|
||||
eurent
|
||||
eus
|
||||
eusse
|
||||
eussent
|
||||
eusses
|
||||
eussiez
|
||||
eussions
|
||||
eut
|
||||
eux
|
||||
eux-mêmes
|
||||
exactement
|
||||
excepté
|
||||
extenso
|
||||
exterieur
|
||||
eûmes
|
||||
eût
|
||||
eûtes
|
||||
f
|
||||
fais
|
||||
faisaient
|
||||
faisant
|
||||
fait
|
||||
faites
|
||||
façon
|
||||
feront
|
||||
fi
|
||||
flac
|
||||
floc
|
||||
fois
|
||||
font
|
||||
force
|
||||
furent
|
||||
fus
|
||||
fusse
|
||||
fussent
|
||||
fusses
|
||||
fussiez
|
||||
fussions
|
||||
fut
|
||||
fûmes
|
||||
fût
|
||||
fûtes
|
||||
g
|
||||
gens
|
||||
h
|
||||
ha
|
||||
haut
|
||||
hein
|
||||
hem
|
||||
hep
|
||||
hi
|
||||
ho
|
||||
holà
|
||||
hop
|
||||
hormis
|
||||
hors
|
||||
hou
|
||||
houp
|
||||
hue
|
||||
hui
|
||||
huit
|
||||
huitième
|
||||
hum
|
||||
hurrah
|
||||
hé
|
||||
hélas
|
||||
i
|
||||
ici
|
||||
il
|
||||
ils
|
||||
importe
|
||||
j
|
||||
je
|
||||
jusqu
|
||||
jusque
|
||||
juste
|
||||
k
|
||||
l
|
||||
la
|
||||
laisser
|
||||
laquelle
|
||||
las
|
||||
le
|
||||
lequel
|
||||
les
|
||||
lesquelles
|
||||
lesquels
|
||||
leur
|
||||
leurs
|
||||
longtemps
|
||||
lors
|
||||
lorsque
|
||||
lui
|
||||
lui-meme
|
||||
lui-même
|
||||
là
|
||||
lès
|
||||
m
|
||||
ma
|
||||
maint
|
||||
maintenant
|
||||
mais
|
||||
malgre
|
||||
malgré
|
||||
maximale
|
||||
me
|
||||
meme
|
||||
memes
|
||||
merci
|
||||
mes
|
||||
mien
|
||||
mienne
|
||||
miennes
|
||||
miens
|
||||
mille
|
||||
mince
|
||||
mine
|
||||
minimale
|
||||
moi
|
||||
moi-meme
|
||||
moi-même
|
||||
moindres
|
||||
moins
|
||||
mon
|
||||
mot
|
||||
moyennant
|
||||
multiple
|
||||
multiples
|
||||
même
|
||||
mêmes
|
||||
n
|
||||
na
|
||||
naturel
|
||||
naturelle
|
||||
naturelles
|
||||
ne
|
||||
neanmoins
|
||||
necessaire
|
||||
necessairement
|
||||
neuf
|
||||
neuvième
|
||||
ni
|
||||
nombreuses
|
||||
nombreux
|
||||
nommés
|
||||
non
|
||||
nos
|
||||
notamment
|
||||
notre
|
||||
nous
|
||||
nous-mêmes
|
||||
nouveau
|
||||
nouveaux
|
||||
nul
|
||||
néanmoins
|
||||
nôtre
|
||||
nôtres
|
||||
o
|
||||
oh
|
||||
ohé
|
||||
ollé
|
||||
olé
|
||||
on
|
||||
ont
|
||||
onze
|
||||
onzième
|
||||
ore
|
||||
ou
|
||||
ouf
|
||||
ouias
|
||||
oust
|
||||
ouste
|
||||
outre
|
||||
ouvert
|
||||
ouverte
|
||||
ouverts
|
||||
o|
|
||||
où
|
||||
p
|
||||
paf
|
||||
pan
|
||||
par
|
||||
parce
|
||||
parfois
|
||||
parle
|
||||
parlent
|
||||
parler
|
||||
parmi
|
||||
parole
|
||||
parseme
|
||||
partant
|
||||
particulier
|
||||
particulière
|
||||
particulièrement
|
||||
pas
|
||||
passé
|
||||
pendant
|
||||
pense
|
||||
permet
|
||||
personne
|
||||
personnes
|
||||
peu
|
||||
peut
|
||||
peuvent
|
||||
peux
|
||||
pff
|
||||
pfft
|
||||
pfut
|
||||
pif
|
||||
pire
|
||||
pièce
|
||||
plein
|
||||
plouf
|
||||
plupart
|
||||
plus
|
||||
plusieurs
|
||||
plutôt
|
||||
possessif
|
||||
possessifs
|
||||
possible
|
||||
possibles
|
||||
pouah
|
||||
pour
|
||||
pourquoi
|
||||
pourrais
|
||||
pourrait
|
||||
pouvait
|
||||
prealable
|
||||
precisement
|
||||
premier
|
||||
première
|
||||
premièrement
|
||||
pres
|
||||
probable
|
||||
probante
|
||||
procedant
|
||||
proche
|
||||
près
|
||||
psitt
|
||||
pu
|
||||
puis
|
||||
puisque
|
||||
pur
|
||||
pure
|
||||
q
|
||||
qu
|
||||
quand
|
||||
quant
|
||||
quant-à-soi
|
||||
quanta
|
||||
quarante
|
||||
quatorze
|
||||
quatre
|
||||
quatre-vingt
|
||||
quatrième
|
||||
quatrièmement
|
||||
que
|
||||
quel
|
||||
quelconque
|
||||
quelle
|
||||
quelles
|
||||
quelqu'un
|
||||
quelque
|
||||
quelques
|
||||
quels
|
||||
qui
|
||||
quiconque
|
||||
quinze
|
||||
quoi
|
||||
quoique
|
||||
r
|
||||
rare
|
||||
rarement
|
||||
rares
|
||||
relative
|
||||
relativement
|
||||
remarquable
|
||||
rend
|
||||
rendre
|
||||
restant
|
||||
reste
|
||||
restent
|
||||
restrictif
|
||||
retour
|
||||
revoici
|
||||
revoilà
|
||||
rien
|
||||
s
|
||||
sa
|
||||
sacrebleu
|
||||
sait
|
||||
sans
|
||||
sapristi
|
||||
sauf
|
||||
se
|
||||
sein
|
||||
seize
|
||||
selon
|
||||
semblable
|
||||
semblaient
|
||||
semble
|
||||
semblent
|
||||
sent
|
||||
sept
|
||||
septième
|
||||
sera
|
||||
serai
|
||||
seraient
|
||||
serais
|
||||
serait
|
||||
seras
|
||||
serez
|
||||
seriez
|
||||
serions
|
||||
serons
|
||||
seront
|
||||
ses
|
||||
seul
|
||||
seule
|
||||
seulement
|
||||
si
|
||||
sien
|
||||
sienne
|
||||
siennes
|
||||
siens
|
||||
sinon
|
||||
six
|
||||
sixième
|
||||
soi
|
||||
soi-même
|
||||
soia
|
||||
soient
|
||||
sois
|
||||
soit
|
||||
soixante
|
||||
sommes
|
||||
son
|
||||
sont
|
||||
sous
|
||||
souvent
|
||||
soyez
|
||||
soyons
|
||||
specifique
|
||||
specifiques
|
||||
speculatif
|
||||
stop
|
||||
strictement
|
||||
subtiles
|
||||
suffisant
|
||||
suffisante
|
||||
suffit
|
||||
suis
|
||||
suit
|
||||
suivant
|
||||
suivante
|
||||
suivantes
|
||||
suivants
|
||||
suivre
|
||||
sujet
|
||||
superpose
|
||||
sur
|
||||
surtout
|
||||
t
|
||||
ta
|
||||
tac
|
||||
tandis
|
||||
tant
|
||||
tardive
|
||||
te
|
||||
tel
|
||||
telle
|
||||
tellement
|
||||
telles
|
||||
tels
|
||||
tenant
|
||||
tend
|
||||
tenir
|
||||
tente
|
||||
tes
|
||||
tic
|
||||
tien
|
||||
tienne
|
||||
tiennes
|
||||
tiens
|
||||
toc
|
||||
toi
|
||||
toi-même
|
||||
ton
|
||||
touchant
|
||||
toujours
|
||||
tous
|
||||
tout
|
||||
toute
|
||||
toutefois
|
||||
toutes
|
||||
treize
|
||||
trente
|
||||
tres
|
||||
trois
|
||||
troisième
|
||||
troisièmement
|
||||
trop
|
||||
très
|
||||
tsoin
|
||||
tsouin
|
||||
tu
|
||||
té
|
||||
u
|
||||
un
|
||||
une
|
||||
unes
|
||||
uniformement
|
||||
unique
|
||||
uniques
|
||||
uns
|
||||
v
|
||||
va
|
||||
vais
|
||||
valeur
|
||||
vas
|
||||
vers
|
||||
via
|
||||
vif
|
||||
vifs
|
||||
vingt
|
||||
vivat
|
||||
vive
|
||||
vives
|
||||
vlan
|
||||
voici
|
||||
voie
|
||||
voient
|
||||
voilà
|
||||
vont
|
||||
vos
|
||||
votre
|
||||
vous
|
||||
vous-mêmes
|
||||
vu
|
||||
vé
|
||||
vôtre
|
||||
vôtres
|
||||
w
|
||||
x
|
||||
y
|
||||
z
|
||||
zut
|
||||
zutalors
|
||||
à
|
||||
â
|
||||
ça
|
||||
ès
|
||||
étaient
|
||||
étais
|
||||
était
|
||||
étant
|
||||
état
|
||||
étiez
|
||||
étions
|
||||
été
|
||||
étée
|
||||
étées
|
||||
étés
|
||||
êtes
|
||||
être
|
||||
êtreêtre
|
||||
ô
|
||||
ôau
nn_utils.py (new file)
@@ -0,0 +1,46 @@
from keras.callbacks import Callback, EarlyStopping, ModelCheckpoint
import numpy as np
from sklearn.metrics import (classification_report,
                             precision_recall_fscore_support,
                             accuracy_score)


class TrainingHistory(Callback):
    """Keras callback that evaluates the model on the test set after every epoch."""

    def __init__(self, x_test, y_test, CLASSES_LIST):
        super().__init__()
        self.x_test = x_test
        self.y_test = y_test
        self.CLASSES_LIST = CLASSES_LIST

    def on_train_begin(self, logs={}):
        self.losses = []
        self.epoch_losses = []
        self.epoch_val_losses = []
        self.val_losses = []
        self.predictions = []
        self.epochs = []
        self.f1 = []
        self.i = 0
        self.save_every = 50

    def on_epoch_end(self, epoch, logs={}):
        # predict the most probable class for each test example
        y_predicted = self.model.predict(self.x_test).argmax(1)
        print(y_predicted.shape)

        print("Test Accuracy:", accuracy_score(self.y_test, y_predicted))

        p, r, f1, s = precision_recall_fscore_support(self.y_test, y_predicted,
                                                      average='micro',
                                                      labels=[x for x in
                                                              self.CLASSES_LIST])

        print('p r f1 %.1f %.1f %.1f' % (np.average(p, weights=s)*100.0,
                                         np.average(r, weights=s)*100.0,
                                         np.average(f1, weights=s)*100.0))

        try:
            print(classification_report(self.y_test, y_predicted, labels=[x for x in
                                                                          self.CLASSES_LIST]))
        except Exception:
            print('ZERO')
requirements.txt (new file)
@@ -0,0 +1,5 @@
wordcloud==1.8.2.2
ipykernel==6.19.4
pandas==1.5.2
seaborn==0.12.2
scikit-learn==1.2.0