new version for 2024

This commit is contained in:
Christopher Kermorvant 2024-01-11 09:25:17 +00:00
parent 123a6612a0
commit 6405f9dcce
2 changed files with 23 additions and 220 deletions

View File

@ -228,7 +228,7 @@
"\n",
"**Questions**:\n",
"\n",
"> * Remove the class `ÙNE` from the original dataset\n",
"> * Remove the class `UNE` from the original dataset and merge the semantically close classes 'FRANCE' and 'SOCIETE'\n",
"> * Plot the class statitics with seaborn\n",
"> * Create new splits\n",
"> * Retrain a NaiveBayes classifier using [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) with the 1000 most frequent words."
@ -262,231 +262,33 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Hyperparameter optimization\n",
"## What about the LLMs?\n",
"\n",
"The classification process has many parameters : alpha for the classifier, max_features, max_df, min_df, using idf or not, ngram orders for the Count of TfIDF transformer. These parameters can be optimized by a grid search using GridSearchCV.\n",
"You must write the answer to this question in your notebook on the notes.teklia.com website.\n",
"\n",
"LLMs are reputed to have revolutionised automatic language processing. Since the introduction of BERT-type models, all language processing applications have been based on LLMs, of varying degrees of sophistication and size. These models are trained on multiple tasks and are therefore capable of performing new tasks without learning, simply from a prompt. This is known as \"zero-shot learning\" because there is no learning phase as such. We are going to test these models on our classification task.\n",
"\n",
"Huggingface is a Franco-American company that develops tools for building applications based on Deep Learning. In particular, it hosts the huggingface.co portal, which contains numerous Deep Learning models. These models can be used very easily thanks to the [Transformer] library (https://huggingface.co/docs/transformers/quicktour) developed by HuggingFace.\n",
"\n",
"Using a transform model in zero-shot learning with HuggingFace is very simple: [see documentation](https://huggingface.co/tasks/zero-shot-classification)\n",
"\n",
"However, you need to choose a suitable model from the list of models compatible with Zero-Shot classification. HuggingFace offers [numerous models](https://huggingface.co/models?pipeline_tag=zero-shot-classification). \n",
"\n",
"The classes proposed to the model must also provide sufficient semantic information for the model to understand them.\n",
"\n",
"**Question**:\n",
"\n",
"> * Using the template code below, find the best values for the parameter max_features, max_df, min_df, use_idf, ngram_range, alpha\n",
"> * Refit the best model on all the train+dev data and print accuracy on test set\n",
"\n",
"Note that for developping the code, the number of training samples is limited to 1000\n",
"\n",
"```\n",
"df_filtered_train_dev.iloc[:1000].text\n",
"```\n",
"\n",
"Once your code is correct, you can train on the full training set.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#### Hyperameters optimization with GridSearchCV = parallel processing\n",
"from sklearn.model_selection import GridSearchCV\n",
"from sklearn.pipeline import Pipeline\n",
"from pprint import pprint\n",
"from time import time\n",
"import logging\n",
"# Display progress logs on stdout\n",
"logging.basicConfig(level=logging.INFO,\n",
" format='%(asctime)s %(levelname)s %(message)s')\n",
"\n",
"# create train_dev and test set for using Cross-Validation\n",
"df_filtered_train_dev, df_filtered_test = train_test_split(df_filtered.dropna() ,test_size=0.10, random_state=42)\n",
"print ('train_dev size',df_filtered_train_dev.shape)\n",
"print ('test size',df_filtered_test.shape)\n",
"# keep only 1000 training data for debuging\n",
"X_train_dev, y_train_dev =df_filtered_train_dev.iloc[:1000].text, df_filtered_train.iloc[:1000].category\n",
"X_test, y_test =df_filtered_test.text, df_filtered_test.category\n",
"* Write a code to classify an example of text from an article in Le Monde using a model transformed using zero-sot learning with the HuggingFace library.\n",
"* choose a model and explain your choice\n",
"* choose a formulation for the classes to be predicted\n",
"* show that the model predicts a class for the text of the article (correct or incorrect, analyse the results)\n",
"* evaluate the performance of your model on 100 articles (a test set).\n",
"* note model sizes, processing times and classification results\n",
"\n",
"\n",
"\n",
"pipeline = Pipeline([\n",
" ('tfidf', TfidfVectorizer()),\n",
" ('clf', MultinomialNB()),\n",
"])\n",
"\n",
"\n",
"parameters = {\n",
" 'tfidf__max_features': (500, 1000, 5000, 10000, None),\n",
" # YOUR CODE HERE\n",
"}\n",
"if __name__ == \"__main__\":\n",
" # multiprocessing requires the fork to happen in a __main__ protected\n",
" # block\n",
"\n",
" # find the best parameters for both the feature extraction and the\n",
" # classifier\n",
" grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=2, cv=3)\n",
"\n",
" print(\"Performing grid search...\")\n",
" print(\"pipeline:\", [name for name, _ in pipeline.steps])\n",
" print(\"parameters:\")\n",
" pprint(parameters)\n",
" t0 = time()\n",
" grid_search.fit(X_train_dev, y_train_dev)\n",
" print(\"done in %0.3fs\" % (time() - t0))\n",
" print()\n",
"\n",
" print(\"Best score: %0.3f\" % grid_search.best_score_)\n",
" print(\"Best parameters set:\")\n",
" best_parameters = grid_search.best_estimator_.get_params()\n",
" for param_name in sorted(parameters.keys()):\n",
" print(\"\\t%s: %r\" % (param_name, best_parameters[param_name]))\n",
" df = pd.DataFrame(grid_search.cv_results_)\n",
" print (df[['rank_test_score','param_tfidf__max_features','mean_test_score']].sort_values('rank_test_score'))\n",
" \n",
" # use refit and print accuracy on test set\n",
" # YOUR CODE HERE"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Classification with Neural networks\n",
"\n",
"Neural networks can be trained to learn both the vector representation of the words (instead of tf-idf) and how to classify the documents. The code below allows you to train a neural text classifier using word embeddings using Keras. Most of the code is written, you only have to define the architecture of the network with the correct parameters before training it : \n",
"\n",
"**Question**:\n",
"\n",
"> * Define a neural network in the function `get_model()` with the following parameters : \n",
"> * use only the 10 000 most frequent words in the documents\n",
"> * use 1024 as the maximal number of words in the articles\n",
"> * use an embedding size of 300: [embedding layer](https://keras.io/layers/embeddings/)\n",
"> * use a dropout of 0.5: [dropout layer](https://keras.io/layers/core/#dropout)\n",
"> * use 32 convolutional filters of size 2 x EMBED_SIZE: [1D convolutional layer](https://keras.io/layers/convolutional/#conv1d)\n",
"> * use a max pooling of size 2 : [1D Max Pooling](https://keras.io/layers/pooling/#maxpooling1d)\n",
"> * Train the model and compare its accuracy to the Naive Bayes models.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import os\n",
"from nn_utils import TrainingHistory\n",
"from keras.layers import Dense, Embedding, Input\n",
"from keras.layers import GRU, Dropout, MaxPooling1D, Conv1D, Flatten\n",
"from keras.models import Model\n",
"import numpy as np\n",
"import itertools\n",
"from keras.utils import np_utils\n",
"from sklearn.metrics import (classification_report, \n",
" precision_recall_fscore_support, \n",
" accuracy_score)\n",
"\n",
"from keras.preprocessing import text, sequence\n",
"from keras.utils import pad_sequences\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Model parameters\n",
"MAX_FEATURES = # YOUR CODE HERE\n",
"MAX_TEXT_LENGTH = # YOUR CODE HERE\n",
"EMBED_SIZE = # YOUR CODE HERE\n",
"BATCH_SIZE = 16\n",
"EPOCHS = 10\n",
"VALIDATION_SPLIT = 0.1"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def get_train_test(train_raw_text, test_raw_text):\n",
" \n",
" tokenizer = text.Tokenizer(num_words=MAX_FEATURES)\n",
"\n",
" tokenizer.fit_on_texts(list(train_raw_text))\n",
" train_tokenized = tokenizer.texts_to_sequences(train_raw_text)\n",
" test_tokenized = tokenizer.texts_to_sequences(test_raw_text)\n",
" return pad_sequences(train_tokenized, maxlen=MAX_TEXT_LENGTH), \\\n",
" pad_sequences(test_tokenized, maxlen=MAX_TEXT_LENGTH)\n",
"\n",
"\n",
"\n",
"def get_model():\n",
"\n",
" inp = Input(shape=(# YOUR CODE HERE,))\n",
" model = Embedding(# YOUR CODE HERE, # YOUR CODE HERE)(inp)\n",
" model = Dropout(# YOUR CODE HERE)(model)\n",
" model = Conv1D(filters=# YOUR CODE HERE, kernel_size=# YOUR CODE HERE, padding='same', activation='relu')(model)\n",
" model = MaxPooling1D(pool_size=# YOUR CODE HERE)(model)\n",
" model = Flatten()(model)\n",
" model = Dense(7, activation=\"softmax\")(model)\n",
" model = Model(inputs=inp, outputs=model)\n",
" \n",
" model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])\n",
" model.summary()\n",
" return model\n",
"\n",
"\n",
"def train_fit_predict(model, x_train, x_test, y, history):\n",
" \n",
" model.fit(x_train, y,\n",
" batch_size=BATCH_SIZE,\n",
" epochs=EPOCHS, verbose=1,\n",
" validation_split=VALIDATION_SPLIT)\n",
"\n",
" return model.predict(x_test)\n",
"\n",
"\n",
"# Get the list of different classes\n",
"CLASSES_LIST = np.unique(y_train)\n",
"n_out = len(CLASSES_LIST)\n",
"print(CLASSES_LIST)\n",
"\n",
"# Convert clas string to index\n",
"from sklearn import preprocessing\n",
"le = preprocessing.LabelEncoder()\n",
"le.fit(CLASSES_LIST)\n",
"y_train = le.transform(y_train) \n",
"y_test = le.transform(y_test) \n",
"train_y_cat = np_utils.to_categorical(y_train, n_out)\n",
"\n",
"# get the textual data in the correct format for NN\n",
"x_vec_train, x_vec_test = get_train_test(X_train, X_test)\n",
"print(len(x_vec_train), len(x_vec_test))\n",
"\n",
"# define the NN topology\n",
"model = get_model()\n",
"\n",
"# Define training procedure\n",
"history = TrainingHistory(x_vec_test, y_test, CLASSES_LIST)\n",
"\n",
"# Train and predict\n",
"y_predicted = train_fit_predict(model, x_vec_train, x_vec_test, train_y_cat, history).argmax(1)\n",
"\n",
"\n",
"print(\"Test Accuracy:\", accuracy_score(y_test, y_predicted))\n",
"\n",
"p, r, f1, s = precision_recall_fscore_support(y_test, y_predicted, \n",
" average='micro',\n",
" labels=[x for x in np.unique(y_train) ])\n",
"\n",
"print('p r f1 %.1f %.2f %.3f' % (np.average(p, weights=s)*100.0, \n",
" np.average(r, weights=s)*100.0, \n",
" np.average(f1, weights=s)*100.0))\n",
"\n",
"\n",
"print(classification_report(y_test, y_predicted, labels=[x for x in np.unique(y_train)]))"
"Notes :\n",
"* make sure that you use the correct Tokenizer when using a model \n",
"* start testing with a small number of articles and the first 100's of characters for faster experiments."
]
},
{

View File

@ -2,3 +2,4 @@ ipykernel==6.28.0
pandas==2.1.4
scikit-learn==1.3.2
wordcloud==1.9.3
seaborn==0.13.1