{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Text classification on LeMonde2003 Dataset\n",
"\n",
"In this notebook, we apply classification algorithms to newspaper articles published in 2003 in *Le Monde*.\n",
"\n",
"The data are available here: https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
"\n",
"Download it into the `data` directory:\n",
"\n",
"```\n",
"wget https://cloud.teklia.com/index.php/s/X9BWJTP2PoSRQBm/download/LeMonde2003_9classes.csv.gz\n",
"```\n",
"\n",
"The articles cover many different subjects, but we will only consider those belonging to the following categories: entreprises (ENT), international (INT), arts (ART), société (SOC), France (FRA), sports (SPO), livres (LIV), télévision (TEL) and the front page articles (UNE).\n",
"\n",
"\n",
"> * Load the CSV file `data/LeMonde2003_9classes.csv.gz` containing the articles using pandas [pd.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html). How many articles do you have?\n",
"> * Plot the frequency histogram of the categories using seaborn [countplot](https://seaborn.pydata.org/tutorial/categorical.html): `sns.countplot(data=df,y='category')`\n",
"> * Display the text of some of the articles with their corresponding class using pandas [sample](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html)\n",
"> * Using the [WordCloud library](https://amueller.github.io/word_cloud/index.html), display a word cloud for the most frequent classes. You can remove the stop words using the `stopwords` option, with the list of French stop words provided in `data/stop_word_fr.txt`.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# load the dataframe from the CSV file\n",
"# YOUR CODE HERE\n"
]
},
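{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch (not the reference solution): one way to load the data, assuming the CSV provides the article body in a `text` column and the label in a `category` column.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: pandas infers the gzip compression from the .gz extension.\n",
"# The column names 'text' and 'category' are assumptions about the CSV layout.\n",
"df = pd.read_csv('data/LeMonde2003_9classes.csv.gz')\n",
"print(f'{len(df)} articles loaded')\n",
"df.head()\n"
]
},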
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"%matplotlib inline\n",
"\n",
"# Plot the category statistics\n",
"# YOUR CODE HERE"
]
},
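{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: the countplot call suggested above, applied to the loaded dataframe.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: horizontal count plot of the category distribution\n",
"sns.countplot(data=df, y='category')\n"
]
},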
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print examples of the articles\n",
"pd.set_option('display.max_colwidth', None)\n",
"# YOUR CODE HERE\n"
]
},
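{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: display a few random articles with their class (the `text` column name is an assumption).*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: show 3 random articles with their category\n",
"df.sample(n=3)[['category', 'text']]\n"
]
},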
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from wordcloud import WordCloud\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# French stop word list used to filter the word clouds\n",
"STOPWORDS = [x.strip() for x in open('data/stop_word_fr.txt').readlines()]\n",
"\n",
"# Display one word cloud for each of the most frequent classes\n",
"for cat in ['ENT', 'INT', 'ART', 'SOC', 'FRA']:\n",
"    # YOUR CODE HERE"
]
},
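{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch for the loop body: build a word cloud from the concatenated articles of each class (the `text` column name is an assumption).*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: one word cloud per frequent class, stop words removed\n",
"for cat in ['ENT', 'INT', 'ART', 'SOC', 'FRA']:\n",
"    text = ' '.join(df[df['category'] == cat]['text'])\n",
"    wc = WordCloud(stopwords=STOPWORDS, background_color='white', max_words=50).generate(text)\n",
"    plt.figure(figsize=(8, 4))\n",
"    plt.imshow(wc, interpolation='bilinear')\n",
"    plt.axis('off')\n",
"    plt.title(cat)\n",
"    plt.show()\n"
]
},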
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Bag-of-words representation\n",
"\n",
"In order to apply machine learning algorithms to text, documents must be transformed into vectors. The simplest and most standard way to transform a document into a vector is the *bag-of-words* encoding.\n",
"\n",
"The idea is very simple:\n",
"\n",
"1. define the set of all the possible words that can appear in a document; denote its size by `max_features`.\n",
"2. encode each document as a vector of size `max_features`, where the value of the i-th component is the number of times the i-th word appears in the document.\n",
"\n",
"See [the wikipedia article on bag-of-words](https://en.wikipedia.org/wiki/Bag-of-words_model) for an example.\n",
"\n",
"Scikit-learn provides different methods to encode text into vectors: [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html).\n",
"\n",
"The encoder must first be trained on the train set and then applied to the different sets, for example with the 200 most frequent words:\n",
"\n",
"\tfrom sklearn.feature_extraction.text import CountVectorizer\n",
"\tvectorizer = CountVectorizer(max_features=200)\n",
"\tvectorizer.fit(X_train)\n",
"\tX_train_counts = vectorizer.transform(X_train)\n",
"\tX_test_counts = vectorizer.transform(X_test)\n",
"\n",
"**Question**:\n",
"\n",
"> * Split the LeMonde2003 dataset into a train set (80%), a dev set (10%) and a test set (10%) using scikit-learn [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)\n",
"> * For each set, transform the text of the articles into vectors using the `CountVectorizer`, considering the 1000 most frequent words. \n",
"> * Train a Naive Bayes classifier on the data. \n",
"> * Evaluate the classification accuracy on the train, dev and test sets using the [score](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB.score) method. \n",
"\n",
"> ***Important***: the test set must not be used during the training phase, and learning the vector representation of the words is part of the training. The dev set should serve as a stand-in for the test set during development; keep the test set for the final evaluation only.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"# Split the dataset, create X (features) and y (target), print the sizes\n",
"# YOUR CODE HERE\n"
]
},
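{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: an 80/10/10 split obtained with two calls to `train_test_split` (the `text` column name and the fixed `random_state` are assumptions).*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: first carve out 20% for dev+test, then split that part half/half\n",
"X = df['text']\n",
"y = df['category']\n",
"X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=42)\n",
"X_dev, X_test, y_dev, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)\n",
"print(len(X_train), len(X_dev), len(X_test))\n"
]
},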
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"# Create document vectors\n",
"# YOUR CODE HERE\n",
"# create the vectorizer object\n",
"\n",
"# fit on train data\n",
"\n",
"# apply it on train and dev data\n"
]
},
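{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: the pattern from the introduction above, with the 1000 most frequent words and the splits produced earlier.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: fit the vocabulary on the train set only, then transform every set\n",
"vectorizer = CountVectorizer(max_features=1000)\n",
"vectorizer.fit(X_train)\n",
"X_train_counts = vectorizer.transform(X_train)\n",
"X_dev_counts = vectorizer.transform(X_dev)\n",
"X_test_counts = vectorizer.transform(X_test)\n"
]
},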
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"# train a Naive Bayes classifier\n",
"# YOUR CODE HERE\n",
"# create the MultinomialNB\n",
"\n",
"# Train\n",
"\n",
"# Evaluate\n"
]
},
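{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: fit `MultinomialNB` on the count vectors and report the accuracy on each set.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: Multinomial Naive Bayes on raw word counts\n",
"nb = MultinomialNB()\n",
"nb.fit(X_train_counts, y_train)\n",
"print('train accuracy:', nb.score(X_train_counts, y_train))\n",
"print('dev accuracy:  ', nb.score(X_dev_counts, y_dev))\n",
"print('test accuracy: ', nb.score(X_test_counts, y_test))\n"
]
},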
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## TF-IDF representation\n",
"\n",
"The `CountVectorizer` encodes the text using the raw frequencies of the words. However, very frequent words that appear in all the documents get a large weight even though they are not discriminative. The *Term-Frequency Inverse-Document-Frequency* weighting scheme takes into account the number of documents in which a given word occurs: a word that appears in many documents gets a lower weight. See [the wikipedia page](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) for more details.\n",
"\n",
"With scikit-learn, the `TfidfTransformer` is applied after the `CountVectorizer`:\n",
"\n",
"\tfrom sklearn.feature_extraction.text import TfidfTransformer\n",
"\ttf_transformer = TfidfTransformer().fit(X_train_counts)\n",
"\tX_train_tf = tf_transformer.transform(X_train_counts)\n",
"\tX_test_tf = tf_transformer.transform(X_test_counts)\n",
"\n",
"**Question**:\n",
"\n",
"> * Use the TF-IDF representation to train a Multinomial Naive Bayes classifier. Report your best test error rate and the error rates for all the configurations tested."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"# YOUR CODE HERE\n"
]
},
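{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: apply the TF-IDF weighting on top of the count vectors from above and retrain the Naive Bayes classifier.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: TF-IDF weighting fitted on the train counts only\n",
"tf_transformer = TfidfTransformer().fit(X_train_counts)\n",
"X_train_tf = tf_transformer.transform(X_train_counts)\n",
"X_dev_tf = tf_transformer.transform(X_dev_counts)\n",
"X_test_tf = tf_transformer.transform(X_test_counts)\n",
"\n",
"nb_tf = MultinomialNB().fit(X_train_tf, y_train)\n",
"print('dev accuracy: ', nb_tf.score(X_dev_tf, y_dev))\n",
"print('test accuracy:', nb_tf.score(X_test_tf, y_test))\n"
]
},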
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Error analysis\n",
"\n",
"The classification error rate gives an overall evaluation of the performance across all the classes. But since the classes are not equally distributed, they may not be equally well modeled. In order to get a better idea of the performance of the classifier, detailed metrics must be used: \n",
"\n",
"* [metrics.classification_report](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html) provides a detailed analysis per class: the precision (among all the examples classified as class X, how many really belong to class X), the recall (among all the examples that belong to class X, how many are classified as class X) and the F-score, the harmonic mean of the precision and the recall.\n",
"* [metrics.confusion_matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) gives the confusions between the classes. It can be displayed in color with [ConfusionMatrixDisplay](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.ConfusionMatrixDisplay.html), which is imported in the cell below.\n",
"\n",
"**Question**:\n",
"\n",
"> * Report the `classification_report` for your classifier. Which classes have the best scores? Why?\n",
"> * Report the `confusion_matrix` for your classifier. Which classes are the most confused? Why?\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import classification_report, ConfusionMatrixDisplay\n",
"\n",
"# YOUR CODE HERE\n",
"\n"
]
},
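{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: per-class report and confusion matrix computed on the dev predictions of the TF-IDF model above.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: detailed per-class metrics and a colored confusion matrix\n",
"y_dev_pred = nb_tf.predict(X_dev_tf)\n",
"print(classification_report(y_dev, y_dev_pred))\n",
"ConfusionMatrixDisplay.from_predictions(y_dev, y_dev_pred)\n"
]
},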
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data re-configuration\n",
"After the error analysis, we came to the conclusion that one of the classes cannot be distinguished from the others. There is no use trying to solve an impossible problem.\n",
"\n",
"**Questions**:\n",
"\n",
"> * Remove the class `UNE` from the original dataset and merge the semantically close classes `FRA` (France) and `SOC` (société)\n",
"> * Plot the class statistics with seaborn\n",
"> * Create new splits\n",
"> * Retrain a Naive Bayes classifier using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with the 1000 most frequent words."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"# YOUR CODE HERE\n",
"\n",
"# Filter out the UNE class\n",
"\n",
"# Plot the class statistics\n",
"\n",
"# Make the splits and print the sizes for checking\n",
"\n",
"# Apply TfidfVectorizer\n",
"\n",
"# Train MultinomialNB\n",
"\n",
"# Print accuracy\n",
"\n",
"# Print confusion matrix\n"
]
},
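{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: filter and merge the classes, re-split, then use `TfidfVectorizer` (which combines counting and TF-IDF weighting) before retraining. The `text` column name and the choice of `SOC` as the merged label are assumptions.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: drop UNE and merge FRA into SOC (the merged label choice is arbitrary)\n",
"df2 = df[df['category'] != 'UNE'].copy()\n",
"df2['category'] = df2['category'].replace({'FRA': 'SOC'})\n",
"sns.countplot(data=df2, y='category')\n",
"\n",
"# new 80/10/10 splits\n",
"X_train2, X_tmp2, y_train2, y_tmp2 = train_test_split(df2['text'], df2['category'], test_size=0.2, random_state=42)\n",
"X_dev2, X_test2, y_dev2, y_test2 = train_test_split(X_tmp2, y_tmp2, test_size=0.5, random_state=42)\n",
"\n",
"# TF-IDF vectors limited to the 1000 most frequent words\n",
"tfidf = TfidfVectorizer(max_features=1000)\n",
"X_train2_tf = tfidf.fit_transform(X_train2)\n",
"X_dev2_tf = tfidf.transform(X_dev2)\n",
"\n",
"nb2 = MultinomialNB().fit(X_train2_tf, y_train2)\n",
"print('dev accuracy:', nb2.score(X_dev2_tf, y_dev2))\n",
"ConfusionMatrixDisplay.from_predictions(y_dev2, nb2.predict(X_dev2_tf))\n"
]
},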
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## What about the LLMs?\n",
"\n",
"You must write the answer to this question in your notebook on the notes.teklia.com website.\n",
"\n",
"LLMs are reputed to have revolutionised automatic language processing. Since the introduction of BERT-type models, all language processing applications have been based on LLMs of varying degrees of sophistication and size. These models are trained on multiple tasks and are therefore capable of performing new tasks without additional training, simply from a prompt. This is known as \"zero-shot learning\" because there is no task-specific learning phase as such. We are going to test these models on our classification task.\n",
"\n",
"HuggingFace is a Franco-American company that develops tools for building applications based on Deep Learning. In particular, it hosts the huggingface.co portal, which contains numerous Deep Learning models. These models can be used very easily thanks to the [Transformers library](https://huggingface.co/docs/transformers/quicktour) developed by HuggingFace.\n",
"\n",
"Using a transformer model for zero-shot classification with HuggingFace is very simple: [see the documentation](https://huggingface.co/tasks/zero-shot-classification)\n",
"\n",
"However, you need to choose a suitable model from the list of models compatible with zero-shot classification. HuggingFace offers [numerous models](https://huggingface.co/models?pipeline_tag=zero-shot-classification). \n",
"\n",
"The classes proposed to the model must also provide sufficient semantic information for the model to understand them.\n",
"\n",
"**Question**:\n",
"\n",
"* Write code to classify the text of an example article from Le Monde using a transformer model with zero-shot learning and the HuggingFace library.\n",
"* Choose a model and explain your choice.\n",
"* Choose a formulation for the classes to be predicted.\n",
"* Show that the model predicts a class for the text of the article (correct or incorrect; analyse the results).\n",
"* Evaluate the performance of your model on 100 articles (a test set).\n",
"* Note the model sizes, processing times and classification results.\n",
"\n",
"\n",
"Notes:\n",
"* Make sure that you use the correct tokenizer for the model you choose.\n",
"* Start testing with a small number of articles and only the first few hundred characters of each, for faster experiments."
]
},
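{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Example sketch: a zero-shot classification call with the `transformers` pipeline. The model name is only one possible multilingual candidate, the French label formulation is an assumption, and the sketch reuses the `df2` dataframe from the previous section; all of these are meant to be replaced by your own choices.*\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Sketch: zero-shot classification of one article with a multilingual NLI model.\n",
"# The model 'MoritzLaurer/mDeBERTa-v3-base-mnli-xnli' is an assumption; any\n",
"# zero-shot-classification model from the hub can be used instead.\n",
"from transformers import pipeline\n",
"\n",
"classifier = pipeline('zero-shot-classification', model='MoritzLaurer/mDeBERTa-v3-base-mnli-xnli')\n",
"\n",
"candidate_labels = ['entreprises', 'international', 'arts', 'société', 'sports', 'livres', 'télévision']\n",
"article = df2['text'].iloc[0][:500]  # first characters only, for speed\n",
"\n",
"result = classifier(article, candidate_labels=candidate_labels)\n",
"print(result['labels'][0], result['scores'][0])\n"
]
},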
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "nlp-class-env",
"language": "python",
"name": "nlp-class-env"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}