Business data challenge 2023-2024 | ENSAE Paris

Arenametrix : customer segmentation

Team 1

Antoine JOUBREL
Alexis REVELLE
Fanta RODRIGUE
Thomas PIQUÉ

Coaches

Elia LAPENTA
Michael VISSER

Support team

Patrice MICHEL (Datastorm)
Hassan MAISSORO (Datastorm)
Alexandre PRINC (Arenametrix)

Microeconomics coordinator

Yuanzhe TANG

Problem description

The goal of this project is to segment the customers of 15 companies belonging to 3 types of activity (sports companies, museums, and music companies).

More detailed instructions provided by Arenametrix

Definition of “marketing personae” that can be matched with a probability of buying a future event
Matching between future events and people in the database (for instance, with a probability of buying a future event)
And thus, a forecast of the number of tickets sold per event, by “marketing persona” or by segment of the database
BONUS: What is the best time to send a communication to each contact in the database and to each “marketing persona”?
BONUS: What should we tell each contact in the database and each “marketing persona” to make them come back?

Our approach

We opted for a sector-based approach, which means that 3 segmentations were performed (one for each type of activity). As the segments have to be linked to a probability of future purchase, we built them directly from the probability of purchase during the coming year. The first modelling step is a pipeline that fits 3 ML models (naive Bayes, random forest, and logistic regression) on the data to predict whether a customer will purchase during the year. We then use the estimated probability of purchase to split the customers into 4 segments. For each segment, we can estimate the potential number of tickets and the revenue for the coming year.
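The step from estimated probabilities to 4 segments can be sketched as follows. This is a minimal illustration, not the project's code: the function name `assign_segments`, the example scores, and the choice of quartile cut points are all assumptions, since the actual thresholds are not stated here.

```python
from bisect import bisect_right
from statistics import quantiles


def assign_segments(scores, n_segments=4):
    """Map each propensity score to a segment index (0 = least likely to buy)."""
    # Assumption: cut points are the empirical quantiles of the scores;
    # the project may use different (for example hand-picked) thresholds.
    cuts = quantiles(scores, n=n_segments)  # returns n_segments - 1 cut points
    return [bisect_right(cuts, s) for s in scores]


# Hypothetical purchase probabilities for eight customers.
scores = [0.02, 0.05, 0.10, 0.20, 0.35, 0.50, 0.70, 0.90]
segments = assign_segments(scores)
```

With quantile cut points, each segment holds roughly the same number of customers; fixed probability thresholds would instead give segments of unequal size but with a stable meaning across sectors.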

How to run the code

The scripts must be run in the order given by their numbers. Each of them is described below:

1_Input_cleaning.py
Cleans the raw data and generates the dataframes that will be used to build datasets with insightful variables. Datasets are exported to 0_Input/.
2_Datasets_generation.py
Aggregates the previously created dataframes to build a train set and a test set for each company. Databases are exported to 1_Temp/1_0_Modelling_Datasets/, in one folder per type of activity containing the databases of its 5 companies.
3_Modelling_datasets.py
For each type of activity, the train and test sets of the 5 companies are concatenated. Databases are exported to 1_Temp/1_0_Modelling_Datasets/.
4_Descriptive_statistics.py
Generates graphics providing descriptive statistics about the data at the activity level. All graphics are exported to 2_Output/2_0_Descriptive_Statistics/.
5_Modelling.py
Three ML models are fitted on the data, and results are exported for all 3 types of activity.
Three pipelines are built, one per type of model (Naive Bayes, Random Forest, Logistic Regression). For the latter two methods, cross-validation was performed to ensure generalization. Graphics displaying the quality of the training are provided. The optimal parameters found are saved in a pickle file (which is used in step 6 to add propensity scores to the test set and then determine the customer segments). All these files are exported to 2_Output/2_1_Modeling_results/.
6_Segmentation_and_Marketing_Personae.py
The model with the optimal parameters computed previously is applied to the test set, and a propensity score (probability of a future purchase) is assigned to each customer in this dataset. Segmentation is performed according to these scores. Graphics describing the marketing personae associated with the segments, as well as their business value, are exported to 2_Output/2_2_Segmentation_and_Marketing_Personae/.
7_Sales_Forecast.py
To ensure a decent recall, and because the target variable y is imbalanced (the overall probability of purchase is between 4 and 14 %), the predicted purchase probabilities are overestimated. The scores are therefore adjusted so that their mean approximates the overall probability of a purchase. The adjusted score is used to estimate, for each customer, the number of tickets sold and the revenue generated during the coming year. Results are aggregated at segment level. A histogram of the adjusted propensity scores and 2 tables summarizing the forecast are exported to 2_Output/2_3_Sales_Forecast/.
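The score adjustment described for 7_Sales_Forecast.py can be illustrated with a short sketch. It assumes a simple proportional rescaling, which may not be the method actually used in the script; the function `adjust_scores`, the example scores, the 10 % purchase rate, and the figure of 2 tickets per buyer are all hypothetical.

```python
from statistics import mean


def adjust_scores(scores, base_rate):
    """Rescale scores so that their mean matches the observed purchase rate."""
    # Assumption: a proportional rescaling; the project's actual adjustment
    # may differ. Scores are capped at 1 to remain valid probabilities, which
    # can leave the mean slightly below base_rate when capping occurs.
    factor = base_rate / mean(scores)
    return [min(1.0, s * factor) for s in scores]


# Hypothetical raw scores and an overall purchase rate of 10 %.
raw_scores = [0.10, 0.30, 0.50, 0.70]
adjusted = adjust_scores(raw_scores, base_rate=0.10)

# Expected tickets per customer, assuming (hypothetically) that a buying
# customer purchases 2 tickets on average during the year.
tickets_per_buyer = 2.0
expected_tickets = [p * tickets_per_buyer for p in adjusted]
```

Summing `expected_tickets` over the customers of a segment gives a forecast at segment level; the same pattern applies to revenue with an average basket in place of `tickets_per_buyer`.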
