completed readme
parent 15f950d87f · commit a3caa64c95 · README.md
# Business data challenge 2023-2024 | ENSAE Paris <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/LOGO-ENSAE.png/900px-LOGO-ENSAE.png" width="30">

# Arenametrix: customer segmentation

<p align="center">
<img src="https://dev.arenametrix.fr/assets/logo_ax-806e8204f49bcc2c5e8cd34e9748d16a6038404e37fdb2dc9d61455bb06c6461.png" width=300>
</p>

## Team 1

As the segments have to be linked to a probability of future purchase, we direct…

The scripts have to be run in the order given by their numbers. Each of them is described below:

- 1_Input_cleaning.py \
Clean the raw data and generate dataframes that will be used to build datasets with insightful variables. The datasets are exported to 0_Input/.
- 2_Datasets_generation.py \
Use the dataframes previously created and aggregate them to build a test and a train set for each company. The databases are exported to 1_Temp/1_0_Modelling_Datasets/, in a folder containing all 5 databases for a given type of activity.
- 3_Modelling_datasets.py \
For each type of activity, the test and train sets of the 5 tenants are concatenated. The databases are exported to 1_Temp/1_0_Modelling_Datasets/.
- 4_Descriptive_statistics.py \
Generate graphics providing descriptive statistics about the data at the activity level. All graphics are exported to 2_Output/2_0_Descriptive_Statistics/.
- 5_Modelling.py \
3 ML models will be fitted on the data, and results will be exported for all 3 types of activities. \
3 pipelines are built, one per type of model (Naive Bayes, Random Forest, Logistic Regression). For the two latter methods, cross-validation is performed to ensure generalization. Graphics displaying the quality of the training are provided. The optimal parameters found are saved in a pickle file, which is used in the 6th step to add propensity scores to the test set and then determine the customers' segments. All these files are exported to 2_Output/2_1_Modeling_results/ (a minimal sketch of such a pipeline is given after this list).
- 6_Segmentation_and_Marketing_Personae.py \
The model with the optimal parameters computed previously is applied to the test set, and a propensity score (probability of a future purchase) is assigned to each customer of this dataset. Segmentation is performed according to these scores (see the sketch after this list). Graphics describing the marketing personae associated with the segments, as well as their business value, are exported to 2_Output/2_2_Segmentation_and_Marketing_Personae/.
- 7_Sales_Forecast.py \
To ensure a decent recall, and because of the imbalance of the target variable y (the overall probability of purchase is between 4 and 14 %), the probabilities of purchasing are overestimated. The scores are therefore adjusted so that their mean approximates the overall probability of a purchase (a sketch of this adjustment is given after this list). This adjusted score is used to estimate, for each customer, the number of tickets sold and the revenue generated over the coming year. Results are aggregated at segment level. A histogram displaying the adjusted propensity scores and 2 tables summarizing the forecast outcome are exported to 2_Output/2_3_Sales_Forecast/.
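
As a rough illustration of step 5, here is a minimal sketch of one cross-validated pipeline whose best parameters are pickled for later reuse. It assumes scikit-learn; the synthetic data, the parameter grid and the file name optimal_params.pkl are placeholders for illustration, not the project's actual choices.

```python
# Minimal sketch of one of the three pipelines (here a Random Forest with
# cross-validated grid search). Data, grid values and file name are illustrative.
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for a train set loaded from 1_Temp/1_0_Modelling_Datasets/
# (imbalanced target, as described for the purchase variable).
X_train, y_train = make_classification(
    n_samples=2_000, n_features=8, weights=[0.92], random_state=0
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Cross-validation over a small grid to limit overfitting.
param_grid = {"clf__n_estimators": [100, 300], "clf__max_depth": [5, 10, None]}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="recall")
search.fit(X_train, y_train)

# Save the optimal parameters for reuse in step 6 (hypothetical file name).
with open("optimal_params.pkl", "wb") as f:
    pickle.dump(search.best_params_, f)
```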
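
The segmentation of step 6 can be pictured as below. This is only a sketch: in the project the scores would come from the fitted model on the test set, whereas here they are simulated, and the cut points and segment names are assumptions.

```python
# Sketch of propensity-based segmentation. Scores are simulated; thresholds
# and segment labels are assumed for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
propensity = pd.Series(rng.beta(2, 20, size=1_000), name="propensity")

bins = [0.0, 0.05, 0.15, 0.40, 1.0]           # assumed score thresholds
labels = ["cold", "warm", "hot", "very hot"]  # assumed segment names
segments = pd.cut(propensity, bins=bins, labels=labels, include_lowest=True)

print(segments.value_counts().sort_index())
```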
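
Finally, the score adjustment and segment-level forecast of step 7 might look roughly like this. The overall purchase rate, the per-customer averages and the segments are made-up values used only to show the rescaling and aggregation.

```python
# Sketch of the propensity-score adjustment and the segment-level forecast.
# All numbers and segment names below are made up.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
customers = pd.DataFrame({
    "propensity": rng.beta(2, 10, size=1_000),                      # overestimated scores
    "segment": rng.choice(["cold", "warm", "hot", "very hot"], 1_000),
    "avg_tickets": rng.poisson(2, size=1_000) + 1,                  # tickets per purchase
    "avg_basket": rng.gamma(3.0, 15.0, size=1_000),                 # revenue per purchase
})

# Rescale the scores so that their mean matches an assumed overall purchase rate.
overall_purchase_rate = 0.08
factor = overall_purchase_rate / customers["propensity"].mean()
customers["adjusted_score"] = (customers["propensity"] * factor).clip(upper=1.0)

# Expected tickets and revenue for the coming year, aggregated by segment.
customers["expected_tickets"] = customers["adjusted_score"] * customers["avg_tickets"]
customers["expected_revenue"] = customers["adjusted_score"] * customers["avg_basket"]
print(customers.groupby("segment")[["expected_tickets", "expected_revenue"]].sum())
```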