
# Business data challenge 2023-2024 | ENSAE Paris <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/LOGO-ENSAE.png/900px-LOGO-ENSAE.png" width="100">

# Arenametrix: customer segmentation

<img src="https://dev.arenametrix.fr/assets/logo_ax-806e8204f49bcc2c5e8cd34e9748d16a6038404e37fdb2dc9d61455bb06c6461.png" width=300>

## Team 1 

* Antoine JOUBREL
* Alexis REVELLE
* Fanta RODRIGUE
* Thomas PIQUÉ


## Coaches 

* Elia LAPENTA 
* Michael VISSER

## Datastorm support team

* Patrice MICHEL
* Hassan MAISSORO

## Microeconomics coordinator

* Yuanzhe TANG


### Problem description
The goal of this project is to segment the customers of 15 companies belonging to 3 different types of activities (sports companies, museums, and music companies).

### Our approach
We opted for a sector-based approach: 3 segmentations were performed, one for each type of activity.
Because the segments have to be linked to a probability of future purchase, we built them directly from the probability of purchase during the coming year. The first modelling step is a pipeline that fits 3 ML models (naive Bayes, random forest, and logistic regression) on the data to predict whether a customer will purchase during the year. We then use the estimated purchase probability to split the customers into 4 segments. For each segment, we can estimate the potential number of tickets and the revenue for the coming year.
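
As an illustration of the last two steps, here is a minimal sketch of how purchase probabilities could be mapped to 4 segments and how a probability-weighted revenue could be summed per segment. The cut points in `THRESHOLDS`, the function names, and the per-customer revenue figures are assumptions made for this example; the thresholds and revenue estimates actually used are defined in the project scripts.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical cut points: the README does not state the thresholds actually used.
THRESHOLDS = [0.25, 0.50, 0.75]


def assign_segment(score, thresholds=THRESHOLDS):
    """Map a purchase probability to a segment from 1 (lowest) to 4 (highest)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    return bisect_right(thresholds, score) + 1


def expected_revenue_by_segment(customers):
    """Sum probability-weighted revenue per segment.

    `customers` is a list of (score, revenue_if_purchase) pairs. The second
    value is an illustrative amount, not a field taken from the project data.
    """
    totals = defaultdict(float)
    for score, revenue_if_purchase in customers:
        totals[assign_segment(score)] += score * revenue_if_purchase
    return dict(totals)


# Toy data: (estimated purchase probability, revenue if the customer purchases)
customers = [(0.05, 40.0), (0.30, 40.0), (0.62, 50.0), (0.91, 100.0)]
segments = [assign_segment(score) for score, _ in customers]
revenue = expected_revenue_by_segment(customers)
```

With fixed cut points, a score that falls exactly on a threshold goes to the higher segment. Quantile-based cut points would be an equally plausible choice if segments of similar size were preferred.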

### How to run the code
The scripts must be run in the order given by their numeric prefixes. Each one is described below:

- 1_Input_cleaning.py: cleans the raw data and generates the dataframes used to build datasets with informative variables.
- 2_Dataset_generation.py
- 3_Modelling_datasets.py: generates the train and test sets for the 3 types of activities.
- 4_Descriptive_statistics.py: generates graphics describing the data.
- 5_Modelling.py: fits 3 ML models on the data and exports the results for all 3 types of activities.
- 6_Segmentation_and_Marketing_Personae.py: scores the test set using the optimal parameters computed previously, which yields a propensity score (probability of a future purchase). Customers are segmented according to these scores. This script exports graphics describing the marketing personae associated with the segments as well as their business value.
- 7_Sales_Forecast.py: adjusts the scores to better match the overall probability of a purchase. The adjusted score is used to estimate, for each customer, the number of tickets sold and the revenue generated during the coming year. Results are aggregated at the segment level.
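
The run order described above can be sketched with a small helper that sorts the scripts by their numeric prefix and runs them one after another. This is only an illustrative sketch: `ordered_scripts` and `run_pipeline` are hypothetical names that do not exist in the repository, and the code assumes that all scripts sit in the same directory and take no command-line arguments.

```python
import re
import subprocess
import sys
from pathlib import Path

# Matches file names such as "1_Input_cleaning.py" and captures the number.
SCRIPT_PATTERN = re.compile(r"^(\d+)_.+\.py$")


def ordered_scripts(names):
    """Return the script names that start with a number, sorted by that number."""
    numbered = []
    for name in names:
        match = SCRIPT_PATTERN.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def run_pipeline(directory="."):
    """Run each numbered script in order, stopping at the first failure."""
    names = [path.name for path in Path(directory).iterdir() if path.is_file()]
    for name in ordered_scripts(names):
        print(f"Running {name}")
        # check=True raises CalledProcessError if a script exits with an error.
        subprocess.run([sys.executable, name], cwd=directory, check=True)
```

Calling `run_pipeline()` from the folder that contains the scripts would run them from 1 to 7. Sorting on the integer prefix rather than on the file name keeps the order correct even if a script numbered 10 or above is added later.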