Compare commits
11 Commits
main
...
branche_ma
| Author | SHA1 | Date | |
|---|---|---|---|
| 9f65727338 | |||
| d6f3a09c53 | |||
| f648560774 | |||
| bc94a3a472 | |||
| 416546f034 | |||
| c42adce1f7 | |||
| 350d911fcb | |||
| 55acc63ff5 | |||
| 89c5d190e0 | |||
| 2c25a7ec1c | |||
| 07ecb353ea |
3
.gitignore
vendored
3
.gitignore
vendored
|
|
@ -1,3 +0,0 @@
|
|||
data/
|
||||
data_exploration/
|
||||
*.csv
|
||||
3194
.ipynb_checkpoints/dataloader-checkpoint.ipynb
Normal file
3194
.ipynb_checkpoints/dataloader-checkpoint.ipynb
Normal file
File diff suppressed because one or more lines are too long
6631
ClusteringV4_400plus_Ancien_finMars.ipynb
Normal file
6631
ClusteringV4_400plus_Ancien_finMars.ipynb
Normal file
File diff suppressed because one or more lines are too long
3782
Clustering_2Feb (1).ipynb
Normal file
3782
Clustering_2Feb (1).ipynb
Normal file
File diff suppressed because one or more lines are too long
5188
Clustering_Fund_400+_corrected.ipynb
Normal file
5188
Clustering_Fund_400+_corrected.ipynb
Normal file
File diff suppressed because one or more lines are too long
3994
Clustering_Fund_Family_+k3_k10.ipynb
Normal file
3994
Clustering_Fund_Family_+k3_k10.ipynb
Normal file
File diff suppressed because one or more lines are too long
8083
Clustering_Fund_Family_V6 (1).ipynb
Normal file
8083
Clustering_Fund_Family_V6 (1).ipynb
Normal file
File diff suppressed because one or more lines are too long
8083
Clustering_Fund_Family_V6_.ipynb
Normal file
8083
Clustering_Fund_Family_V6_.ipynb
Normal file
File diff suppressed because one or more lines are too long
89
README.md
89
README.md
|
|
@ -1,89 +0,0 @@
|
|||
# Carmignac × ENSAE Data Challenge 2025.
|
||||
|
||||
## 1 - Data Repair Pipeline
|
||||
|
||||
Registrar Account ID repair for AUM and Flows data.
|
||||
|
||||
---
|
||||
|
||||
### 1.1 - Problem
|
||||
|
||||
AUM and Flows tables must satisfy the stock-flow identity for every `(account, ISIN, month)`:
|
||||
|
||||
```
|
||||
Q(t-1) + F(t-1→t) = Q(t)
|
||||
```
|
||||
|
||||
In practice this identity is violated because Registrar Account IDs change over time — a single distributor may appear under different codes at different dates. The pipeline detects these ruptures, scores data quality, and resolves identity changes through targeted *surgeries*.
|
||||
|
||||
---
|
||||
|
||||
### 1.2 - Files
|
||||
|
||||
| File | Role |
|
||||
|------|------|
|
||||
| `carmignac_diagnostics.py` | Step 0 — market-level audit, broken months, error account |
|
||||
| `carmignac_repair.py` | Steps 1–3 — universe, score propagation, surgery |
|
||||
| `carmignac_branch.py` | Apply the repaired mapping to the raw AUM file |
|
||||
| `carmignac_analysis.py` | HTML report from pipeline outputs |
|
||||
|
||||
---
|
||||
|
||||
### 1.3 - Usage
|
||||
|
||||
Run in order:
|
||||
|
||||
```bash
|
||||
# 1. Diagnostics (produces broken_months CSV fed into repair)
|
||||
python carmignac_diagnostics.py \
|
||||
--aum raw_AUM.csv \
|
||||
--flows raw_flows.csv
|
||||
|
||||
# 2. Repair
|
||||
python carmignac_repair.py \
|
||||
--aum raw_AUM.csv \
|
||||
--flows raw_flows.csv \
|
||||
--broken-months carmignac_broken_months.csv
|
||||
|
||||
# 3. Apply mapping to raw AUM
|
||||
python carmignac_branch.py \
|
||||
--aum raw_AUM.csv \
|
||||
--mapping carmignac_mapping.csv \
|
||||
--surgery carmignac_surgery_log.csv
|
||||
|
||||
# 4. Analysis report
|
||||
python carmignac_analysis.py \
|
||||
--error-account-isin carmignac_error_account.csv \
|
||||
--error-account-agg carmignac_error_account_agg.csv
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### 1.4 - Outputs
|
||||
|
||||
| File | Content |
|
||||
|------|---------|
|
||||
| `carmignac_broken_months.csv` | (ISIN, month) pairs where the aggregate stock-flow equation is broken |
|
||||
| `carmignac_error_account.csv` | Cumulative unresolved residuals per ISIN |
|
||||
| `carmignac_error_account_agg.csv` | Same, aggregated over all ISINs |
|
||||
| `carmignac_scores.csv` | Data quality score σ_r(t) for every (account, month) |
|
||||
| `carmignac_mapping.csv` | Canonical identity mapping (date, reg_orig, reg_used) |
|
||||
| `carmignac_surgery_log.csv` | All surgery operations with Jaccard, score gain, lookback |
|
||||
| `AUM_repaired.csv` | Full AUM with corrected Registrar Account IDs |
|
||||
| `AUM_paths.csv` | Universe accounts with their identity path over time |
|
||||
| `carmignac_diagnostics.html` | Interactive diagnostic report |
|
||||
| `carmignac_report.html` | Interactive repair & surgery report |
|
||||
|
||||
---
|
||||
|
||||
### 1.5 - Key parameters
|
||||
|
||||
| Parameter | Value | Role |
|
||||
|-----------|-------|------|
|
||||
| `MIN_AUM_EUR` | 5 M€ | Universe threshold at t* |
|
||||
| `ALPHA` | 5% | Reconciliation tolerance |
|
||||
| `SCORE_DROP_THRESHOLD` | 0.5 | Surgery trigger |
|
||||
| `MIN_JACCARD` | 0.3 | ISIN portfolio pre-filter |
|
||||
| `MAX_SURGERY_LOOKBACK` | 6 months | Max search window for predecessor code |
|
||||
| `SYMMETRY_ATTENUATION` | 0.05 | Error discount for symmetric transfers |
|
||||
| `BROKEN_MONTH_ATTENUATION` | 0.20 | Error discount for broken market months |
|
||||
572
Stat_Desc1.ipynb
Normal file
572
Stat_Desc1.ipynb
Normal file
|
|
@ -0,0 +1,572 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "e637deae-9168-4fb2-b95f-4e42d8d72d9e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# DATA COLLECTION "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 1,
|
||||
"id": "9f99615b-5a9d-434a-baa0-dca55edf7699",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f8508d94-74a7-4bb0-8b81-c2e06850c25f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd \n",
|
||||
"chemin_fichier = \"s3://projet-bdc-data/carmignac/AUM ENSAE V2 -20251105.csv\"\n",
|
||||
"df_aum2 = pd.read_csv(chemin_fichier, sep=';', engine='python')\n",
|
||||
"df_aum2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "4644da13-5aea-4ca0-9fcf-947324766292",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"chemin_fichier = \"s3://projet-bdc-data/carmignac/Flows ENSAE V2 -20251105.csv\"\n",
|
||||
"df_flows2 = pd.read_csv(chemin_fichier, sep=';', engine='python')\n",
|
||||
"df_flows2"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "59d31eaf-c06c-4ebe-9f8c-cb9158a50976",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## DATA ANALYSIS"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5773b911-6b84-448d-962f-8228eeac0250",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_aum2.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6f571810-c373-4d30-8ca5-c3a074b95b08",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_aum2.columns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "af25fd07-a613-4adc-b88b-93a8d300379c",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_flows2.shape"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c6d0fe83-2957-430b-89cf-cd30833b7cab",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_flows2.columns"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5fac74b0-662f-48d0-a234-7edc3c3e86ad",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#dict avec valeurs unique de chaque col \n",
|
||||
"rows = []\n",
|
||||
"\n",
|
||||
"for col in df_aum2.columns:\n",
|
||||
" uniques = df_aum2[col].unique()\n",
|
||||
" rows.append({\n",
|
||||
" \"Colonne\": col,\n",
|
||||
" \"Nbr Lignes\": df_aum2.shape[0], #4.8millions\n",
|
||||
" \"Nb valeurs uniques\": len(uniques),\n",
|
||||
" \"Exemples de valeurs\": uniques,\n",
|
||||
" \"Nan Values\" : df_aum2[col].isna().sum()\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"df_uniques = pd.DataFrame(rows)\n",
|
||||
"df_uniques"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "5de53ba3-b3db-4935-acac-435b05b909e2",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#dict avec valeurs unique de chaque col \n",
|
||||
"rowsf = []\n",
|
||||
"\n",
|
||||
"for col in df_flows2.columns:\n",
|
||||
" uniques = df_flows2[col].unique()\n",
|
||||
" rowsf.append({\n",
|
||||
" \"Colonne\": col,\n",
|
||||
" \"Nbr Lignes\": df_flows2.shape[0], #4.8millions\n",
|
||||
" \"Nb valeurs uniques\": len(uniques),\n",
|
||||
" \"Exemples de valeurs\": uniques,\n",
|
||||
" \"Nan Values\" : df_flows2[col].isna().sum()\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"df_unique_flows = pd.DataFrame(rowsf)\n",
|
||||
"df_unique_flows"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e07a2b28-13f7-49f6-a55b-7d09a506407b",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df1_aum = df_uniques[['Colonne', 'Nbr Lignes', 'Nb valeurs uniques']]\n",
|
||||
"df2_flows = df_unique_flows[['Colonne', 'Nbr Lignes', 'Nb valeurs uniques']]\n",
|
||||
"\n",
|
||||
"df_merged = df1_aum.merge(df2_flows, on='Colonne', suffixes=('_aum', '_flows'))\n",
|
||||
"df_merged"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4ce2ad22-08e6-4e63-96b2-c2301172516e",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# ETUDE ET ANALYSE DES ANOMALIES"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "c883dfc2-b9b9-4d3e-80d3-0140cd222492",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_aum2['Centralisation Date'] = pd.to_datetime(df_aum2['Centralisation Date'])\n",
|
||||
"df_flows2['Centralisation Date'] = pd.to_datetime(df_flows2['Centralisation Date'])\n",
|
||||
"key_cols = ['Registrar Account - ID', 'Product - Isin']"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f47b276d-cce6-433c-87c5-860810d71d34",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"cols= ['Company - Id', 'Company - Ultimate Parent Id',\n",
|
||||
" 'Registrar Account - ID', 'Registrar Account - Region','Product - Isin']\n",
|
||||
"\n",
|
||||
"doublons_aum2 = df_aum2[df_aum2.duplicated(subset=cols + ['Centralisation Date'], keep=False)]\n",
|
||||
"doublons_flows2 = df_flows2[df_flows2.duplicated(subset=cols + ['Centralisation Date'], keep=False)]\n",
|
||||
"\n",
|
||||
"print(\" Cols: \", cols)\n",
|
||||
"print(\"Doublons AUM:\", doublons_aum2.shape[0])\n",
|
||||
"print(\"Doublons Flows:\", doublons_flows2.shape[0])"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d2f355e6-30c5-420e-a3db-9095dd5e0147",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# # Comptes avec same flux et same product ISIN mais IDs différents ---> candidats pseudo-client (FAUX)\n",
|
||||
"# #plusieurs comptes diff réalisent EXACTEMENT le même flux sur le même produit le même jour.\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "4b9173e1-1c01-4ef3-adcd-9c587a97dd5d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Stat Descrp FLOWS "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "8df98c34-b1f7-4fb9-bbc7-c9bbe762022a",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df = df_flows2.copy()\n",
|
||||
"df[\"Date\"] = pd.to_datetime(df[\"Centralisation Date\"])\n",
|
||||
"\n",
|
||||
"# Groupby par ISIN et Date\n",
|
||||
"grouped = df.groupby([\"Product - Isin\", \"Date\"])\n",
|
||||
"\n",
|
||||
"transfers = []\n",
|
||||
"\n",
|
||||
"for (isin, date), group in grouped:\n",
|
||||
" # Sépare flux positifs et négatifs\n",
|
||||
" entrants = group[group[\"Value € - NetFlows\"] > 0][[\"Registrar Account - ID\", \"Value € - NetFlows\"]]\n",
|
||||
" sortants = group[group[\"Value € - NetFlows\"] < 0][[\"Registrar Account - ID\", \"Value € - NetFlows\"]]\n",
|
||||
"\n",
|
||||
" # On cherche des paires +M / -M\n",
|
||||
" for _, row_sortie in sortants.iterrows():\n",
|
||||
" montant_sortie = row_sortie[\"Value € - NetFlows\"]\n",
|
||||
" compte_sortant = row_sortie[\"Registrar Account - ID\"]\n",
|
||||
"\n",
|
||||
" # Chercher un +M qui matche exactement le -M\n",
|
||||
" match = entrants[entrants[\"Value € - NetFlows\"] == -montant_sortie]\n",
|
||||
"\n",
|
||||
" if len(match) > 0:\n",
|
||||
" for _, row_entree in match.iterrows():\n",
|
||||
" transfers.append({\n",
|
||||
" \"ISIN\": isin,\n",
|
||||
" \"Date\": date,\n",
|
||||
" \"Compte sortant\": compte_sortant,\n",
|
||||
" \"Montant sortie\": montant_sortie,\n",
|
||||
" \"Compte entrant\": row_entree[\"Registrar Account - ID\"],\n",
|
||||
" \"Montant entrée\": row_entree[\"Value € - NetFlows\"]\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"transf_compte = pd.DataFrame(transfers)\n",
|
||||
"transf_compte\n",
|
||||
"\n",
|
||||
"#df initiale : 2 574 461 \n",
|
||||
"# 27 880 rows "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "df0c0bbb-4cff-4205-86e0-0393f83a4cc7",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Extraire tous les comptes sortants et entrants\n",
|
||||
"all_accounts = pd.concat([\n",
|
||||
" transf_compte[\"Compte sortant\"],\n",
|
||||
" transf_compte[\"Compte entrant\"]\n",
|
||||
"])\n",
|
||||
"\n",
|
||||
"# Comptes uniques impliqués dans au moins un transfert\n",
|
||||
"unique_accounts = all_accounts.unique()\n",
|
||||
"print(f\"Nombre de comptes uniques impliqués dans les transferts : {len(unique_accounts)}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "6dacddcb-74f1-441f-adbe-83275c8f9216",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"transf_compte[\"Date\"] = pd.to_datetime(transf_compte[\"Date\"])\n",
|
||||
"\n",
|
||||
"# Nombre de transferts par jour\n",
|
||||
"transfers_per_day = transf_compte.groupby(\"Date\").size()\n",
|
||||
"\n",
|
||||
"plt.figure(figsize=(14,5))\n",
|
||||
"transfers_per_day.plot(kind=\"line\")\n",
|
||||
"plt.title(\"Nombre de transferts détectés par jour\")\n",
|
||||
"plt.xlabel(\"Date\")\n",
|
||||
"plt.ylabel(\"Nombre de transferts\")\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "be0708c4-95df-4915-82f9-a6347187bd70",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# Détection de jours anormaux (ex: > 95e percentile)\n",
|
||||
"threshold = transfers_per_day.quantile(0.95)\n",
|
||||
"anomalous_days = transfers_per_day[transfers_per_day > threshold]\n",
|
||||
"print(\"Jours anormaux (avec beaucoup de transferts) :\")\n",
|
||||
"display(anomalous_days)"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "9416dd81-8f73-4e87-a5e8-b640882fbba4",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Etude de la saisonalite"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f7da5d09-7c97-4fa2-921d-886928fdf80f",
|
||||
"metadata": {
|
||||
"scrolled": true
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"\n",
|
||||
"transf_compte[\"Date\"] = pd.to_datetime(transf_compte[\"Date\"])\n",
|
||||
"transf_comptee = transf_compte[transf_compte[\"Date\"].dt.year >= 2021]\n",
|
||||
"# Nombre de transferts par jour\n",
|
||||
"transfers_per_day = transf_comptee.groupby(\"Date\").size().rename(\"n_transfers\")\n",
|
||||
"\n",
|
||||
"# Détection des jours anormaux (au-dessus du 95e percentile)\n",
|
||||
"threshold = transfers_per_day.quantile(0.95)\n",
|
||||
"anomalous_days = transfers_per_day[transfers_per_day > threshold]\n",
|
||||
"\n",
|
||||
"# Ajouter weekday et month\n",
|
||||
"anomalous_table = anomalous_days.reset_index()\n",
|
||||
"anomalous_table[\"weekday\"] = anomalous_table[\"Date\"].dt.day_name()\n",
|
||||
"anomalous_table[\"month\"] = anomalous_table[\"Date\"].dt.month_name()\n",
|
||||
"\n",
|
||||
"pd.set_option('display.max_rows', None)\n",
|
||||
"print(\"Jours anormaux (weekday + month) :\")\n",
|
||||
"display(anomalous_table.sort_values(\"n_transfers\").tail(20))"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "1fa564f5-5844-4a1b-bdca-b392b829d734",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"#Nombre total de comptes impliqués \n",
|
||||
"\n",
|
||||
"all_accounts = pd.concat([\n",
|
||||
" transf_compte[\"Compte sortant\"],\n",
|
||||
" transf_compte[\"Compte entrant\"]\n",
|
||||
"]).unique()\n",
|
||||
"\n",
|
||||
"print(f\"Nombre total de comptes impliqués dans au moins un transfert : {len(all_accounts)}\")\n",
|
||||
"\n",
|
||||
"# Nombre de comptes impliqués par jour \n",
|
||||
"\n",
|
||||
"accounts_per_day = transf_compte.groupby(\"Date\").agg(\n",
|
||||
" comptes_uniques=(\"Compte sortant\", lambda x: set(x)) # temp\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# On ajoute aussi les comptes entrants\n",
|
||||
"accounts_per_day[\"comptes_uniques\"] = accounts_per_day.index.map(\n",
|
||||
" lambda d: set(transf_compte[transf_compte[\"Date\"] == d][\"Compte sortant\"]) |\n",
|
||||
" set(transf_compte[transf_compte[\"Date\"] == d][\"Compte entrant\"])\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"accounts_per_day[\"n_comptes\"] = accounts_per_day[\"comptes_uniques\"].apply(len)\n",
|
||||
"\n",
|
||||
"# Plot\n",
|
||||
"plt.figure(figsize=(14,5))\n",
|
||||
"plt.plot(accounts_per_day.index, accounts_per_day[\"n_comptes\"], marker=\"o\")\n",
|
||||
"plt.title(\"Nombre de comptes impliqués dans des transferts par jour\")\n",
|
||||
"plt.xlabel(\"Date\")\n",
|
||||
"plt.ylabel(\"Nombre de comptes uniques\")\n",
|
||||
"plt.grid(True)\n",
|
||||
"plt.show()\n",
|
||||
"\n",
|
||||
"print(\"Aperçu :\")\n",
|
||||
"accounts_per_day.head()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "c898b0c5-0a8e-4640-bc52-9490ee80e53d",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# MERGE AUM & FLOWS "
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ce33dbf8-1c59-416a-adc4-6eb7c1ea9d8e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import pandas as pd\n",
|
||||
"import numpy as np\n",
|
||||
"import matplotlib.pyplot as plt\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"df_aum2 = df_aum2.rename(columns={\n",
|
||||
" \"Registrar Account - ID\": \"Account_ID\",\n",
|
||||
" \"Product - Isin\": \"ISIN\",\n",
|
||||
" \"Centralisation Date\": \"Date\",\n",
|
||||
" \"Value - AUM €\": \"AUM_EUR\"\n",
|
||||
"})\n",
|
||||
"\n",
|
||||
"df_flows2 = df_flows2.rename(columns={\n",
|
||||
" \"Registrar Account - ID\": \"Account_ID\",\n",
|
||||
" \"Product - Isin\": \"ISIN\",\n",
|
||||
" \"Centralisation Date\": \"Date\",\n",
|
||||
" \"Value € - NetFlows\": \"Flow_EUR\"\n",
|
||||
"})\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"df_aum2[\"Date\"] = pd.to_datetime(df_aum2[\"Date\"])\n",
|
||||
"df_flows2[\"Date\"] = pd.to_datetime(df_flows2[\"Date\"])\n",
|
||||
"\n",
|
||||
"df_aum2[\"Account_ID\"] = df_aum2[\"Account_ID\"].astype(str)\n",
|
||||
"df_flows2[\"Account_ID\"] = df_flows2[\"Account_ID\"].astype(str)\n",
|
||||
"\n",
|
||||
"df_aum2[\"ISIN\"] = df_aum2[\"ISIN\"].str.upper()\n",
|
||||
"df_flows2[\"ISIN\"] = df_flows2[\"ISIN\"].str.upper()\n",
|
||||
"\n",
|
||||
"\n",
|
||||
"df_merged = pd.merge(\n",
|
||||
" df_aum2[[\"Account_ID\", \"ISIN\", \"Date\", \"AUM_EUR\"]],\n",
|
||||
" df_flows2[[\"Account_ID\", \"ISIN\", \"Date\", \"Flow_EUR\"]],\n",
|
||||
" on=[\"Account_ID\", \"ISIN\", \"Date\"],\n",
|
||||
" how=\"outer\"\n",
|
||||
").sort_values([\"Account_ID\", \"ISIN\", \"Date\"])\n",
|
||||
"\n",
|
||||
"print(\"Merged dataset:\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "7e5d642e-5c16-4c78-8d83-075094902670",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"df_merged"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ea14866a-1ce6-4b19-9225-d725304af8ec",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"# 2. HISTOGRAMME DES AUM (SANS ISIN)\n",
|
||||
"\n",
|
||||
"# We keep the mean AUM per Account \n",
|
||||
"aum_by_account = (\n",
|
||||
" df_merged.groupby(\"Account_ID\")[\"AUM_EUR\"]\n",
|
||||
" .mean()\n",
|
||||
" .dropna()\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"plt.figure(figsize=(10,6))\n",
|
||||
"plt.hist(aum_by_account, bins=50)\n",
|
||||
"plt.xlabel(\"Mean AUM value (€)\")\n",
|
||||
"plt.ylabel(\"Number of client accounts\")\n",
|
||||
"plt.title(\"Distribution of client average AUM (one value per Account_ID)\")\n",
|
||||
"plt.grid(True)\n",
|
||||
"plt.show()"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "f13c213e-7f72-494a-bcf0-b4cd9feee55d",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# 3. ANALYSE DES FLOWS POUR UN COMPTE DANS UN FONDS\n",
|
||||
"\n",
|
||||
"account_to_plot = \"YOUR_ACCOUNT_ID_HERE\"\n",
|
||||
"isin_to_plot = \"YOUR_ISIN_HERE\"\n",
|
||||
"\n",
|
||||
"client_flows = df_merged[\n",
|
||||
" (df_merged[\"Account_ID\"] == account_to_plot) &\n",
|
||||
" (df_merged[\"ISIN\"] == isin_to_plot)\n",
|
||||
"].sort_values(\"Date\")\n",
|
||||
"\n",
|
||||
"plt.figure(figsize=(12,5))\n",
|
||||
"plt.plot(client_flows[\"Date\"], client_flows[\"Flow_EUR\"], marker=\"o\")\n",
|
||||
"plt.axhline(0, color=\"black\", linewidth=1)\n",
|
||||
"plt.xlabel(\"Date\")\n",
|
||||
"plt.ylabel(\"Flow (€)\")\n",
|
||||
"plt.title(f\"Flow movements for Account {account_to_plot}, ISIN {isin_to_plot}\")\n",
|
||||
"plt.grid(True)\n",
|
||||
"plt.show()\n",
|
||||
"\n",
|
||||
"###############################################################################\n",
|
||||
"# 4. ANALYSE MENSUELLE DES FLOWS (ENTRANTS / SORTANTS)\n",
|
||||
"###############################################################################\n",
|
||||
"\n",
|
||||
"df_merged[\"YearMonth\"] = df_merged[\"Date\"].dt.to_period(\"M\")\n",
|
||||
"\n",
|
||||
"flows_monthly = df_merged.groupby(\"YearMonth\").agg(\n",
|
||||
" n_positive_flows=(\"Flow_EUR\", lambda x: (x > 0).sum()),\n",
|
||||
" n_negative_flows=(\"Flow_EUR\", lambda x: (x < 0).sum()),\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"print(\"Monthly flow summary:\")\n",
|
||||
"print(flows_monthly.head())\n",
|
||||
"\n",
|
||||
"# ---- Plot bar chart ----\n",
|
||||
"flows_monthly.index = flows_monthly.index.astype(str)\n",
|
||||
"\n",
|
||||
"plt.figure(figsize=(14,6))\n",
|
||||
"plt.bar(flows_monthly.index, flows_monthly[\"n_positive_flows\"], label=\"Positive flows (inflows)\", alpha=0.7)\n",
|
||||
"plt.bar(flows_monthly.index, flows_monthly[\"n_negative_flows\"], label=\"Negative flows (outflows)\", alpha=0.7)\n",
|
||||
"plt.xticks(rotation=90)\n",
|
||||
"plt.xlabel(\"Year-Month\")\n",
|
||||
"plt.ylabel(\"Number of accounts with flows\")\n",
|
||||
"plt.title(\"Monthly number of accounts with inflows vs outflows\")\n",
|
||||
"plt.legend()\n",
|
||||
"plt.tight_layout()\n",
|
||||
"plt.show()\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "ded9b4f6-df92-479e-bc7b-aaa489dad228",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": []
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3 (ipykernel)",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.11"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
756
Stat_Desc2.ipynb
Normal file
756
Stat_Desc2.ipynb
Normal file
File diff suppressed because one or more lines are too long
Binary file not shown.
Binary file not shown.
Binary file not shown.
6776
clus11mars-Copy1 (5).ipynb
Normal file
6776
clus11mars-Copy1 (5).ipynb
Normal file
File diff suppressed because one or more lines are too long
1724
data_1-Copy1.ipynb
Normal file
1724
data_1-Copy1.ipynb
Normal file
File diff suppressed because one or more lines are too long
3658
data_1.ipynb
Normal file
3658
data_1.ipynb
Normal file
File diff suppressed because one or more lines are too long
3156
dataloader.ipynb
Normal file
3156
dataloader.ipynb
Normal file
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
|
|
@ -1,70 +0,0 @@
|
|||
{
|
||||
"cells": [
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "d2701d07",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"# Helper notebook to allow pushing data on S3"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": 4,
|
||||
"id": "5c8fc6c5",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"import os\n",
|
||||
"import s3fs\n",
|
||||
"\n",
|
||||
"def push_file(local_path, s3_path):\n",
|
||||
" fs = s3fs.S3FileSystem(\n",
|
||||
" client_kwargs={'endpoint_url': 'https://' + 'minio-simple.lab.groupe-genes.fr'},\n",
|
||||
" key=os.environ[\"AWS_ACCESS_KEY_ID\"],\n",
|
||||
" secret=os.environ[\"AWS_SECRET_ACCESS_KEY\"],\n",
|
||||
" token=os.environ[\"AWS_SESSION_TOKEN\"]\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" with open(local_path, 'rb') as local_f, fs.open(s3_path, 'wb') as s3_f:\n",
|
||||
" s3_f.write(local_f.read())"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "d43b725e",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"push_file('repair_challenge/alpha_5%/carmignac_broken_months.csv', 'projet-bdc-carmignac-g3//paco/carmignac_broken_months.csv')\n",
|
||||
"push_file('repair_challenge/alpha_5%/carmignac_error_account_agg.csv', 'projet-bdc-carmignac-g3//paco/carmignac_error_account_agg.csv')\n",
|
||||
"push_file('repair_challenge/alpha_5%/carmignac_error_account.csv', 'projet-bdc-carmignac-g3//paco/carmignac_error_account.csv')\n",
|
||||
"push_file('AUM_repaired.csv', 'projet-bdc-carmignac-g3//paco/AUM_repaired.csv')\n",
|
||||
"push_file('AUM_paths.csv', 'projet-bdc-carmignac-g3//paco/AUM_paths.csv')\n",
|
||||
"push_file('AUM_repair_audit.csv', 'projet-bdc-carmignac-g3//paco/AUM_repair_audit.csv')"
|
||||
]
|
||||
}
|
||||
],
|
||||
"metadata": {
|
||||
"kernelspec": {
|
||||
"display_name": "Python 3",
|
||||
"language": "python",
|
||||
"name": "python3"
|
||||
},
|
||||
"language_info": {
|
||||
"codemirror_mode": {
|
||||
"name": "ipython",
|
||||
"version": 3
|
||||
},
|
||||
"file_extension": ".py",
|
||||
"mimetype": "text/x-python",
|
||||
"name": "python",
|
||||
"nbconvert_exporter": "python",
|
||||
"pygments_lexer": "ipython3",
|
||||
"version": "3.13.12"
|
||||
}
|
||||
},
|
||||
"nbformat": 4,
|
||||
"nbformat_minor": 5
|
||||
}
|
||||
File diff suppressed because one or more lines are too long
|
|
@ -1,28 +0,0 @@
|
|||
strategy,n_carmignac_sc,n_competitors,n_index_funds,ms_categories,broad_category
|
||||
CAD,2,27,2,"EAA Fund Asia ex-Japan Equity, EAA Fund Asia ex-Japan Small/Mid-Cap Equity, EAA Fund Asia-Pacific Equity, EAA Fund Asia-Pacific ex-Japan Equity, EAA Fund Global Emerging Markets ex-China Equity",Equity
|
||||
CARE,2,22,0,"EAA Fund Equity Market Neutral EUR, EAA Fund Long/Short Equity - Global, EAA Fund Long/Short Equity - Europe, EAA Fund Macro Trading EUR",Alternative
|
||||
CCNE,1,28,0,"EAA Fund Greater China Equity, EAA Fund China Equity, EAA Fund China Equity - A Shares",Equity
|
||||
CCR,3,36,1,"EAA Fund EUR Corporate Bond, EAA Fund EUR Flexible Bond, EAA Fund Global Flexible Bond - EUR Hedged, EAA Fund EUR High Yield Bond, EAA Fund Global Corporate Bond - EUR Hedged",Fixed Income
|
||||
CEMD,1,34,0,"EAA Fund Global Emerging Markets Bond, EAA Fund Global Emerging Markets Bond - EUR Hedged, EAA Fund Other Bond, EAA Fund Global Emerging Markets Bond - Local Currency, Global Emerging Markets Bond, Global Emerging Markets Bond - EUR Hedged, Global Emerging Markets Bond - Local Currency",Fixed Income
|
||||
CEMP,2,11,0,"EAA Fund Global Emerging Markets Allocation, EAA Fund Other Allocation, EAA Fund Asia Allocation, EAA Fund Greater China Allocation, Global Emerging Markets Allocation",Allocation
|
||||
CE,3,40,1,"EAA Fund Global Emerging Markets Equity, Global Emerging Markets Equity",Equity
|
||||
CFB,2,20,1,"EAA Fund EUR Flexible Bond, EAA Fund EUR Diversified Bond, EAA Fund Global Flexible Bond - EUR Hedged, EAA Fund Global Diversified Bond - EUR Hedged, EUR Flexible Bond",Fixed Income
|
||||
CFG,1,10,0,"EAA Fund Europe ex-UK Small/Mid-Cap Equity, EAA Fund Europe Flex-Cap Equity, EAA Fund Europe Mid-Cap Equity, EAA Fund Europe Small-Cap Equity, EAA Fund Eurozone Large-Cap Equity, EAA Fund Eurozone Mid-Cap Equity, EAA Fund Global Flex-Cap Equity, EAA Fund Global Large-Cap Growth Equity",Equity
|
||||
CGB,2,35,2,"EAA Fund Global Diversified Bond, EAA Fund Global Flexible Bond - EUR Hedged, Global Diversified Bond, EAA Fund Global Flexible Bond, EAA Fund Other Bond, EAA Fund EUR Diversified Bond - Short Term, EAA Fund EUR Flexible Bond, EAA Fund Global Government Bond, EAA Fund Global Corporate Bond - EUR Hedged, EAA Fund Global Diversified Bond - EUR Hedged, EAA Fund Global Government Bond - EUR Hedged",Fixed Income
|
||||
CGC,1,22,0,"EAA Fund Global Large-Cap Growth Equity, EAA Fund Other Equity, EAA Fund Global Large-Cap Blend Equity",Equity
|
||||
CGE,2,52,0,"EAA Fund Europe Large-Cap Blend Equity, EAA Fund Europe Large-Cap Growth Equity, EAA Fund Europe Large-Cap Value Equity, EAA Fund Eurozone Large-Cap Equity, EAA Fund Europe Flex-Cap Equity, EAA Fund Europe Equity Income, Europe Large-Cap Growth Equity",Equity
|
||||
CHX,1,10,0,"EAA Fund Europe Large-Cap Blend Equity, EAA Fund Europe Mid-Cap Equity, EAA Fund Eurozone Flex-Cap Equity, EAA Fund Eurozone Large-Cap Equity, EAA Fund Global Large-Cap Blend Equity, EAA Fund Global Large-Cap Growth Equity, EAA Fund Other Equity, EAA Fund Sector Equity Consumer Goods & Services, EAA Fund Sector Equity Ecology",Equity
|
||||
CIL,2,12,0,"EAA Fund EUR Flexible Allocation - Global, EAA Fund EUR Flexible Allocation, EAA Fund EUR Moderate Allocation - Global, EAA Fund EUR Cautious Allocation - Global, EUR Flexible Allocation - Global",Allocation
|
||||
CI,3,28,0,"EAA Fund Global Large-Cap Growth Equity, EAA Fund Global Large-Cap Value Equity, EAA Fund Global Large-Cap Blend Equity, EAA Fund Other Equity, EAA Fund Global Equity Income, EAA Fund Global Flex-Cap Equity, EAA Fund Europe Flex-Cap Equity",Equity
|
||||
CMAP,1,21,0,"EAA Fund Event Driven, EAA Fund Relative Value Arbitrage",Alternative
|
||||
CMA,1,4,0,EAA Fund Event Driven,Alternative
|
||||
CPE,2,19,0,"EAA Fund EUR Moderate Allocation, EAA Fund EUR Cautious Allocation, EAA Fund EUR Flexible Allocation, EAA Fund EUR Aggressive Allocation, EAA Fund EUR Moderate Allocation - Global, EUR Moderate Allocation",Allocation
|
||||
CPI,2,18,0,"EAA Fund EUR Flexible Allocation - Global, EAA Fund EUR Moderate Allocation - Global, EAA Fund EUR Flexible Allocation, EAA Fund EUR Cautious Allocation - Global, EAA Fund Other Allocation, EAA Fund USD Moderate Allocation, EAA Fund EUR Cautious Allocation, EAA Fund Macro Trading EUR, EAA Fund GBP Flexible Allocation, EAA Fund Global Inflation-Linked Bond - EUR Hedged, EAA Fund Commodities - Broad Basket",Allocation
|
||||
CP,2,34,0,"EAA Fund EUR Moderate Allocation - Global, EAA Fund USD Moderate Allocation, EAA Fund EUR Flexible Allocation - Global, EAA Fund EUR Cautious Allocation - Global, EAA Fund EUR Aggressive Allocation - Global, EAA Fund EUR Cautious Allocation, EAA Fund EUR Flexible Allocation, EAA Fund EUR Diversified Bond, EAA Fund EUR Moderate Allocation, EUR Moderate Allocation - Global",Allocation
|
||||
CS,2,27,2,"EAA Fund EUR Diversified Bond - Short Term, EAA Fund EUR Government Bond - Short Term, EAA Fund Global Flexible Bond - EUR Hedged, EAA Fund EUR Ultra Short-Term Bond, EAA Fund EUR Flexible Bond, EAA Fund EUR Corporate Bond - Short Term, EAA Fund EUR Diversified Bond, EAA Fund EUR Corporate Bond",Fixed Income
|
||||
CTS,2,24,0,"EAA Fund Sector Equity Technology, EAA Fund US Flex-Cap Equity, Sector Equity Technology",Equity
|
||||
PLSEE,2,27,0,"EAA Fund Long/Short Equity - Global, EAA Fund Equity Market Neutral EUR, EAA Fund Long/Short Equity - Europe, EAA Fund Long/Short Equity - Other, EAA Fund Europe Large-Cap Blend Equity",Equity
|
||||
UKCEL,2,27,0,"EAA Fund Europe ex-UK Equity, EAA Fund Europe ex-UK Small/Mid-Cap Equity, EAA Fund Other Equity, EAA Fund Europe Large-Cap Blend Equity",Equity
|
||||
UKCE,2,21,0,EAA Fund Global Emerging Markets Equity,Equity
|
||||
UKCGB,5,26,0,"EAA Fund Global Flexible Bond - GBP Hedged, EAA Fund Global Flexible Bond, EAA Fund Global Diversified Bond, EAA Fund Global Diversified Bond - GBP Hedged, EAA Fund GBP Allocation 0-20% Equity",Fixed Income
|
||||
UKCGEC,3,17,0,"EAA Fund Global Large-Cap Growth Equity, EAA Fund Global Large-Cap Blend Equity",Equity
|
||||
|
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
|
|
@ -1,328 +0,0 @@
|
|||
"""
|
||||
Pipeline Results Analysis
|
||||
=====================================================
|
||||
Analyses the CSV outputs produced by carmignac_repair.py:
|
||||
- carmignac_scores.csv (post-surgery score history)
|
||||
- carmignac_mapping.csv (reg_id mapping history)
|
||||
- carmignac_surgery_log.csv (surgery operations)
|
||||
|
||||
Produces a self-contained HTML report with interactive charts.
|
||||
|
||||
Usage:
|
||||
python carmignac_analysis.py
|
||||
python carmignac_analysis.py --scores path/to/scores.csv \
|
||||
--mapping path/to/mapping.csv \
|
||||
--surgery path/to/surgery_log.csv \
|
||||
--out report.html
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
from helpers import build_html_repair
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# 1. LOAD & VALIDATE
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def load_outputs(
|
||||
scores_path, mapping_path, surgery_path, err_isin_path=None, err_agg_path=None
|
||||
):
|
||||
scores = pd.read_csv(scores_path, parse_dates=["date"])
|
||||
mapping = pd.read_csv(mapping_path, parse_dates=["date"])
|
||||
surgery = pd.read_csv(surgery_path, parse_dates=["date"])
|
||||
|
||||
# Normalise dtypes
|
||||
scores["reg_id"] = scores["reg_id"].astype(str)
|
||||
mapping["reg_orig"] = mapping["reg_orig"].astype(str)
|
||||
mapping["reg_used"] = mapping["reg_used"].astype(str)
|
||||
mapping["changed"] = mapping["changed"].astype(bool)
|
||||
surgery["reg_orig"] = surgery["reg_orig"].astype(str)
|
||||
surgery["reg_from"] = surgery["reg_from"].astype(str)
|
||||
surgery["reg_to"] = surgery["reg_to"].astype(str)
|
||||
if "lookback_months" not in surgery.columns:
|
||||
surgery["lookback_months"] = 1 # backwards compat
|
||||
|
||||
# Error account (optional)
|
||||
err_isin = None
|
||||
err_agg = None
|
||||
if err_isin_path and os.path.exists(err_isin_path):
|
||||
err_isin = pd.read_csv(err_isin_path, parse_dates=["date"])
|
||||
err_isin["isin"] = err_isin["isin"].astype(str)
|
||||
if err_agg_path and os.path.exists(err_agg_path):
|
||||
err_agg = pd.read_csv(err_agg_path, parse_dates=["date"])
|
||||
|
||||
return scores, mapping, surgery, err_isin, err_agg
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# LOAD ERROR ACCOUNT (optional)
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def load_error_account(isin_path, agg_path):
|
||||
"""
|
||||
Loads the error account CSVs produced by carmignac_diagnostics.py.
|
||||
Returns (df_err_isin, df_err_agg) or (None, None) if files not found.
|
||||
"""
|
||||
if not isin_path or not agg_path:
|
||||
return None, None
|
||||
try:
|
||||
ei = pd.read_csv(isin_path, parse_dates=["date"])
|
||||
ea = pd.read_csv(agg_path, parse_dates=["date"])
|
||||
ei["isin"] = ei["isin"].astype(str)
|
||||
print(
|
||||
f"[Load] error account (ISIN) : {len(ei)} rows, "
|
||||
f"{ei['isin'].nunique()} ISINs"
|
||||
)
|
||||
print(f"[Load] error account (agg) : {len(ea)} rows")
|
||||
return ei, ea
|
||||
except Exception as e:
|
||||
print(f"[WARN] Could not load error account: {e}")
|
||||
return None, None
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# 2. COMPUTE ANALYTICS
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def compute_analytics(scores, mapping, surgery):
|
||||
dates = sorted(scores["date"].unique())
|
||||
|
||||
# ── 2.1 Sum of scores per date (post-surgery) ──────────────
|
||||
sum_post = scores.groupby("date")["score"].sum().reindex(dates).rename("sum_post")
|
||||
|
||||
# ── 2.2 Reconstruct pre-surgery (counterfactual) ───────────
|
||||
# Without surgery, every reg_id that had a hard break would score 0
|
||||
# from that date backwards. We propagate the surgery "gain" as a
|
||||
# cumulative deficit going back in time.
|
||||
gain_by_date = surgery.groupby("date")["gain_vs_no_surgery"].sum()
|
||||
# cumulative deficit = sum of gains for all surgeries at or after date t
|
||||
cumulative_deficit = pd.Series(0.0, index=dates)
|
||||
for d in dates:
|
||||
cumulative_deficit[d] = gain_by_date[gain_by_date.index >= d].sum()
|
||||
sum_pre = (sum_post - cumulative_deficit).clip(lower=0).rename("sum_pre")
|
||||
|
||||
timeline = pd.DataFrame({"sum_post": sum_post, "sum_pre": sum_pre})
|
||||
timeline.index = pd.to_datetime(timeline.index)
|
||||
timeline["recovery_pct"] = np.where(
|
||||
sum_pre < sum_post,
|
||||
(sum_post - sum_pre) / sum_post.clip(lower=1e-9) * 100,
|
||||
0.0,
|
||||
)
|
||||
|
||||
# ── 2.3 Per-date surgery stats ─────────────────────────────
|
||||
surgery_stats = (
|
||||
surgery.groupby("date")
|
||||
.agg(
|
||||
n_surgeries=("reg_orig", "count"),
|
||||
total_gain=("gain_vs_no_surgery", "sum"),
|
||||
avg_gain=("gain_vs_no_surgery", "mean"),
|
||||
avg_jaccard=("jaccard_composite", "mean"),
|
||||
avg_score_before=("score_before", "mean"),
|
||||
avg_score_after=("score_after", "mean"),
|
||||
)
|
||||
.reindex(dates, fill_value=0)
|
||||
)
|
||||
|
||||
# ── 2.4 Score distribution over time ───────────────────────
|
||||
# Wide format: rows=dates, cols=reg_ids
|
||||
pivot = scores.pivot_table(
|
||||
index="date", columns="reg_id", values="score", aggfunc="last"
|
||||
)
|
||||
pivot = pivot.reindex(dates)
|
||||
pivot.index = pd.to_datetime(pivot.index)
|
||||
|
||||
# ── 2.5 Mapping churn ──────────────────────────────────────
|
||||
# For each date, how many reg_ids are remapped (not using their original code)?
|
||||
churn = (
|
||||
mapping.groupby("date")["changed"]
|
||||
.sum()
|
||||
.reindex(dates, fill_value=0)
|
||||
.rename("n_remapped")
|
||||
)
|
||||
|
||||
# ── 2.6 Score entropy (distribution spread) ────────────────
|
||||
def entropy(row):
|
||||
p = row.dropna()
|
||||
p = p[p > 0]
|
||||
if len(p) == 0:
|
||||
return np.nan
|
||||
p = p / p.sum()
|
||||
return -(p * np.log(p)).sum()
|
||||
|
||||
timeline["entropy"] = pivot.apply(entropy, axis=1).values
|
||||
|
||||
# ── 2.7 Individual score trajectories ──────────────────────
|
||||
# Identify which reg_ids were ever remapped
|
||||
ever_remapped = set(mapping.loc[mapping["changed"], "reg_orig"].unique())
|
||||
|
||||
# ── 2.8 Surgery detail table ───────────────────────────────
|
||||
surgery_detail = surgery.copy()
|
||||
surgery_detail["gain_pct_of_score"] = (
|
||||
surgery_detail["gain_vs_no_surgery"]
|
||||
/ surgery_detail["score_before"].clip(lower=1e-9)
|
||||
* 100
|
||||
).round(2)
|
||||
|
||||
return {
|
||||
"timeline": timeline,
|
||||
"surgery_stats": surgery_stats,
|
||||
"pivot": pivot,
|
||||
"churn": churn,
|
||||
"ever_remapped": ever_remapped,
|
||||
"surgery_detail": surgery_detail,
|
||||
"dates": [d.strftime("%Y-%m-%d") for d in dates],
|
||||
}
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# 3. PRINT CONSOLE SUMMARY
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def print_summary(analytics, surgery):
|
||||
tl = analytics["timeline"]
|
||||
ss = analytics["surgery_stats"]
|
||||
|
||||
print("\n" + "=" * 65)
|
||||
print(" CARMIGNAC PIPELINE — RESULTS SUMMARY")
|
||||
print("=" * 65)
|
||||
|
||||
print(f"\n Date range : {tl.index.min().date()} → {tl.index.max().date()}")
|
||||
print(f" Total months : {len(tl)}")
|
||||
print(f" Reg IDs : {analytics['pivot'].shape[1]}")
|
||||
|
||||
print("\n ── Score (Σ) ──────────────────────────────────────────")
|
||||
print(f" At t_ref (latest) : {tl['sum_post'].iloc[-1]:.6f}")
|
||||
print(f" At t_min (earliest): {tl['sum_post'].iloc[0]:.6f}")
|
||||
print(
|
||||
f" Min (post-surgery) : {tl['sum_post'].min():.6f} "
|
||||
f"({tl['sum_post'].idxmin().date()})"
|
||||
)
|
||||
print(
|
||||
f" Min (pre-surgery) : {tl['sum_pre'].min():.6f} "
|
||||
f"({tl['sum_pre'].idxmin().date()})"
|
||||
)
|
||||
print(f" Max recovery (pct) : {tl['recovery_pct'].max():.2f}%")
|
||||
|
||||
print("\n ── Surgeries ─────────────────────────────────────────")
|
||||
if len(surgery) == 0:
|
||||
print(" No surgeries performed.")
|
||||
else:
|
||||
print(f" Total operations : {len(surgery)}")
|
||||
print(f" Total score gained : {surgery['gain_vs_no_surgery'].sum():.6f}")
|
||||
print(f" Avg Jaccard : {surgery['jaccard_composite'].mean():.4f}")
|
||||
print(f" Avg gain / surgery : {surgery['gain_vs_no_surgery'].mean():.6f}")
|
||||
print()
|
||||
print(
|
||||
f" {'Date':12s} {'Reg orig':12s} {'From':15s} {'To':15s} "
|
||||
f"{'Jaccard':>8s} {'Gain':>10s}"
|
||||
)
|
||||
print(" " + "-" * 78)
|
||||
for _, row in surgery.sort_values("date").iterrows():
|
||||
print(
|
||||
f" {str(row['date'].date()):12s} {row['reg_orig']:12s} "
|
||||
f"{row['reg_from']:15s} {row['reg_to']:15s} "
|
||||
f"{row['jaccard_composite']:8.4f} {row['gain_vs_no_surgery']:10.6f}"
|
||||
)
|
||||
|
||||
print("\n ── Mapping churn ─────────────────────────────────────")
|
||||
ch = analytics["churn"]
|
||||
print(
|
||||
f" Max remapped at one date : {int(ch.max())} ({ch.idxmax().date() if ch.max() > 0 else 'N/A'})"
|
||||
)
|
||||
print(f" Reg IDs ever remapped : {len(analytics['ever_remapped'])}")
|
||||
|
||||
print("\n ── Score entropy (distribution spread) ───────────────")
|
||||
ent = analytics["timeline"]["entropy"]
|
||||
print(f" Mean entropy : {ent.mean():.4f}")
|
||||
print(f" Std entropy : {ent.std():.4f}")
|
||||
print()
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# MAIN
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(description="Carmignac pipeline results analyser")
|
||||
parser.add_argument("--scores", default="repair_results/carmignac_scores.csv")
|
||||
parser.add_argument("--mapping", default="repair_results/carmignac_mapping.csv")
|
||||
parser.add_argument("--surgery", default="repair_results/carmignac_surgery_log.csv")
|
||||
parser.add_argument("--out", default="repair_results/carmignac_report.html")
|
||||
parser.add_argument(
|
||||
"--error-account-isin",
|
||||
default=None,
|
||||
dest="error_isin",
|
||||
help="Path to carmignac_error_account.csv (optional)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--error-account-agg",
|
||||
default=None,
|
||||
dest="error_agg",
|
||||
help="Path to carmignac_error_account_agg.csv (optional)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
# Resolve paths relative to this script's directory if files not found
|
||||
base = os.path.dirname(os.path.abspath(__file__))
|
||||
|
||||
def resolve(p, required=True):
|
||||
if p is None:
|
||||
return None
|
||||
if os.path.exists(p):
|
||||
return p
|
||||
alt = os.path.join(base, p)
|
||||
if os.path.exists(alt):
|
||||
return alt
|
||||
if required:
|
||||
sys.exit(f"[ERROR] File not found: {p}")
|
||||
print(f"[WARN] Optional file not found: {p}")
|
||||
return None
|
||||
|
||||
scores_path = resolve(args.scores)
|
||||
mapping_path = resolve(args.mapping)
|
||||
surgery_path = resolve(args.surgery)
|
||||
error_isin_path = resolve(args.error_isin, required=False)
|
||||
error_agg_path = resolve(args.error_agg, required=False)
|
||||
|
||||
print(f"[Load] scores : {scores_path}")
|
||||
print(f"[Load] mapping : {mapping_path}")
|
||||
print(f"[Load] surgery : {surgery_path}")
|
||||
|
||||
scores, mapping, surgery, df_err_isin, df_err_agg = load_outputs(
|
||||
scores_path,
|
||||
mapping_path,
|
||||
surgery_path,
|
||||
err_isin_path=error_isin_path,
|
||||
err_agg_path=error_agg_path,
|
||||
)
|
||||
analytics = compute_analytics(scores, mapping, surgery)
|
||||
|
||||
print_summary(analytics, surgery)
|
||||
|
||||
html = build_html_repair(
|
||||
analytics,
|
||||
surgery,
|
||||
scores,
|
||||
mapping,
|
||||
df_err_isin=df_err_isin,
|
||||
df_err_agg=df_err_agg,
|
||||
)
|
||||
|
||||
out_path = "../" + args.out
|
||||
os.makedirs(os.path.dirname(out_path), exist_ok=True)
|
||||
with open(out_path, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
print(f"\n[Report] Written to → {out_path}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,410 +0,0 @@
|
|||
"""
|
||||
AUM Branching / Repair
|
||||
==================================================
|
||||
Takes as input:
|
||||
- The original AUM file (pre-repair)
|
||||
- The mapping CSV produced by carmignac_repair.py
|
||||
- (Optionally) the surgery log, for audit annotation
|
||||
|
||||
Produces:
|
||||
- A repaired AUM file where every Registrar Account ID is replaced
|
||||
by its canonical identity (reg_orig) as determined by the pipeline.
|
||||
|
||||
Core logic
|
||||
----------
|
||||
The mapping table encodes, for every (date, reg_orig) pair, which
|
||||
physical code (reg_used) was actually present in the data at that date.
|
||||
|
||||
reg_orig = the stable canonical identity (output label)
|
||||
reg_used = the code that appeared in the raw data at that date
|
||||
|
||||
For rows where reg_used != reg_orig (changed=True), the raw code is a
|
||||
historical alias that the surgery pass identified as belonging to
|
||||
reg_orig. The repair simply relabels those rows.
|
||||
|
||||
For accounts not in the repair universe (below the AUM threshold, or
|
||||
excluded categories), rows are passed through unchanged.
|
||||
|
||||
Self-mapped surgeries (reg_from == reg_to in the surgery log) do not
|
||||
require any relabelling — they signal a data quality issue on that
|
||||
month, not a code change.
|
||||
|
||||
Usage
|
||||
-----
|
||||
python carmignac_branch.py # default paths
|
||||
python carmignac_branch.py \\
|
||||
--aum raw_AUM.csv \\
|
||||
--mapping carmignac_mapping.csv \\
|
||||
--surgery carmignac_surgery_log.csv \\
|
||||
--out AUM_repaired.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import pandas as pd
|
||||
|
||||
from helpers import load_inputs_branch
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# BUILD RENAME LOOKUP
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def build_rename_lookup(mapping):
|
||||
"""
|
||||
Returns a dict {(date, reg_used) -> reg_orig}
|
||||
restricted to rows where reg_used != reg_orig (actual changes).
|
||||
|
||||
For self-mapped surgeries or stable accounts, no entry is needed.
|
||||
"""
|
||||
changed = mapping[mapping["changed"] & (mapping["reg_orig"] != mapping["reg_used"])]
|
||||
|
||||
lookup = {}
|
||||
for _, row in changed.iterrows():
|
||||
key = (row["date"], row["reg_used"])
|
||||
if key in lookup and lookup[key] != row["reg_orig"]:
|
||||
print(
|
||||
f" [WARN] Conflicting mapping at {row['date'].date()} "
|
||||
f"reg_used={row['reg_used']}: "
|
||||
f"{lookup[key]} vs {row['reg_orig']} — keeping first"
|
||||
)
|
||||
else:
|
||||
lookup[key] = row["reg_orig"]
|
||||
|
||||
return lookup
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# BRANCHING
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def apply_branching(aum, lookup):
|
||||
"""
|
||||
Renames Registrar Account - ID in the AUM dataframe according to
|
||||
the lookup {(date, reg_used) -> reg_orig}.
|
||||
|
||||
Rows not in the lookup are left untouched.
|
||||
|
||||
Returns:
|
||||
- repaired : the full AUM dataframe with corrected IDs
|
||||
- audit : a subset showing only the renamed rows, with both
|
||||
the original and canonical IDs for verification
|
||||
"""
|
||||
aum = aum.copy()
|
||||
aum["Centralisation Date"] = pd.to_datetime(aum["Centralisation Date"])
|
||||
aum["_date_key"] = aum["Centralisation Date"]
|
||||
aum["_reg_key"] = aum["Registrar Account - ID"].astype(str)
|
||||
|
||||
# Vectorised lookup via merge
|
||||
lookup_df = pd.DataFrame(
|
||||
[(d, reg_used, reg_orig) for (d, reg_used), reg_orig in lookup.items()],
|
||||
columns=["_date_key", "_reg_key", "_canonical_id"],
|
||||
)
|
||||
|
||||
merged = aum.merge(lookup_df, on=["_date_key", "_reg_key"], how="left")
|
||||
|
||||
# Audit: rows that were actually renamed
|
||||
renamed_mask = merged["_canonical_id"].notna()
|
||||
audit = merged[renamed_mask].copy()
|
||||
audit["original_reg_id"] = audit["_reg_key"]
|
||||
audit["canonical_reg_id"] = audit["_canonical_id"]
|
||||
audit = audit[
|
||||
[
|
||||
"Centralisation Date",
|
||||
"original_reg_id",
|
||||
"canonical_reg_id",
|
||||
"Product - Isin",
|
||||
"Quantity - AUM",
|
||||
"Value - AUM €",
|
||||
]
|
||||
]
|
||||
|
||||
# Rename
|
||||
merged.loc[renamed_mask, "Registrar Account - ID"] = merged.loc[
|
||||
renamed_mask, "_canonical_id"
|
||||
]
|
||||
|
||||
# Drop helper columns
|
||||
repaired = merged.drop(columns=["_date_key", "_reg_key", "_canonical_id"])
|
||||
|
||||
return repaired, audit
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# CONSISTENCY CHECK
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def consistency_check(original, repaired, mapping, surgery):
|
||||
"""
|
||||
Sanity checks after branching:
|
||||
|
||||
1. Row count preserved
|
||||
2. No reg_used alias remains in the repaired file (for changed entries)
|
||||
3. For each (reg_orig, isin, date) there is at most one row
|
||||
(branching should not create duplicates)
|
||||
4. Summary of surgery operations applied
|
||||
"""
|
||||
print("\n[Consistency checks]")
|
||||
|
||||
# Row count
|
||||
if len(original) == len(repaired):
|
||||
print(f" ✓ Row count preserved : {len(repaired)}")
|
||||
else:
|
||||
print(f" ✗ Row count changed : {len(original)} → {len(repaired)}")
|
||||
|
||||
# Aliases eliminated
|
||||
changed = mapping[mapping["changed"] & (mapping["reg_orig"] != mapping["reg_used"])]
|
||||
aliases = set(changed["reg_used"].unique())
|
||||
still_present = set(repaired["Registrar Account - ID"].astype(str)) & aliases
|
||||
if not still_present:
|
||||
print(f" ✓ All {len(aliases)} aliased code(s) successfully relabelled")
|
||||
else:
|
||||
print(
|
||||
f" ✗ {len(still_present)} aliased code(s) still present: {still_present}"
|
||||
)
|
||||
|
||||
# Duplicates
|
||||
key_cols = ["Registrar Account - ID", "Product - Isin", "Centralisation Date"]
|
||||
dup_count = repaired.duplicated(subset=key_cols).sum()
|
||||
if dup_count == 0:
|
||||
print(" ✓ No duplicate (reg_id, isin, date) keys")
|
||||
else:
|
||||
print(
|
||||
f" ✗ {dup_count} duplicate (reg_id, isin, date) rows found — inspect manually"
|
||||
)
|
||||
print(
|
||||
repaired[repaired.duplicated(subset=key_cols, keep=False)][
|
||||
key_cols + ["Quantity - AUM"]
|
||||
]
|
||||
.head(10)
|
||||
.to_string(index=False)
|
||||
)
|
||||
|
||||
# Surgery summary
|
||||
if not surgery.empty:
|
||||
print("\n[Surgery operations applied]")
|
||||
for _, op in surgery.sort_values("date").iterrows():
|
||||
self_map = (
|
||||
" [self-map — data quality flag, no rename]"
|
||||
if op["reg_from"] == op["reg_to"]
|
||||
else ""
|
||||
)
|
||||
print(
|
||||
f" {op['date'].date()} | {op['reg_orig']} : "
|
||||
f"{op['reg_from']} → {op['reg_to']}"
|
||||
f" (Jaccard={op['jaccard_composite']:.4f}, "
|
||||
f"gain={op['gain_vs_no_surgery']:.6f}){self_map}"
|
||||
)
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# EXPORT PATHS (branched accounts only)
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def export_paths(aum, mapping, surgery, repaired):
|
||||
"""
|
||||
Builds a condensed AUM file for ALL accounts in the repair universe
|
||||
(i.e. every reg_orig present in the mapping).
|
||||
|
||||
- Stable accounts (no surgery): single leg where reg_used == reg_orig
|
||||
throughout, pulled directly from the repaired AUM.
|
||||
- Branched accounts (at least one genuine surgery): multiple legs,
|
||||
reg_used shows which physical code was active at each date.
|
||||
|
||||
The output makes every account's full path explicit:
|
||||
|
||||
reg_orig | reg_used | date | isin | qty_aum | ...
|
||||
─────────┼───────────────┼────────────┼──────┼─────────┼───
|
||||
REG_001 | REG_001 | 2020-01-31 | ... | ... | <- stable
|
||||
REG_002 | REG_002_OLD | 2020-01-31 | ... | ... | <- leg 1
|
||||
REG_002 | REG_002 | 2022-07-31 | ... | ... | <- leg 2
|
||||
|
||||
Self-mapped surgeries (reg_from == reg_to) are noted in the summary
|
||||
but do not add extra legs — the account kept its code.
|
||||
|
||||
Returns the paths DataFrame (never None if mapping is non-empty).
|
||||
"""
|
||||
# All canonical accounts in the universe
|
||||
all_accounts = sorted(mapping["reg_orig"].astype(str).unique())
|
||||
|
||||
# Branched accounts (genuine code changes only)
|
||||
branched_accounts = set()
|
||||
if not surgery.empty:
|
||||
genuine = surgery[surgery["reg_from"] != surgery["reg_to"]]
|
||||
branched_accounts = set(genuine["reg_orig"].astype(str).unique())
|
||||
|
||||
print(
|
||||
f"\n[Paths] {len(all_accounts)} account(s) in universe, "
|
||||
f"{len(branched_accounts)} branched: "
|
||||
f"{sorted(branched_accounts) or 'none'}"
|
||||
)
|
||||
|
||||
# Build (date, reg_orig) → reg_used lookup from mapping
|
||||
map_df = mapping[["date", "reg_orig", "reg_used"]].copy()
|
||||
map_df["date"] = pd.to_datetime(map_df["date"])
|
||||
map_df["reg_orig"] = map_df["reg_orig"].astype(str)
|
||||
map_df["reg_used"] = map_df["reg_used"].astype(str)
|
||||
map_df = map_df.rename(columns={"date": "_date_key", "reg_orig": "_reg_key"})
|
||||
|
||||
# Pull all universe rows from the repaired AUM
|
||||
aum_universe = repaired[
|
||||
repaired["Registrar Account - ID"].astype(str).isin(all_accounts)
|
||||
].copy()
|
||||
aum_universe["Centralisation Date"] = pd.to_datetime(
|
||||
aum_universe["Centralisation Date"]
|
||||
)
|
||||
aum_universe["_date_key"] = aum_universe["Centralisation Date"]
|
||||
aum_universe["_reg_key"] = aum_universe["Registrar Account - ID"].astype(str)
|
||||
|
||||
# Join reg_used from mapping
|
||||
paths = aum_universe.merge(
|
||||
map_df[["_date_key", "_reg_key", "reg_used"]],
|
||||
on=["_date_key", "_reg_key"],
|
||||
how="left",
|
||||
).drop(columns=["_date_key", "_reg_key"])
|
||||
|
||||
# For stable accounts, mapping may not cover every AUM date (e.g. sparse
|
||||
# months) — fall back to reg_orig (= Registrar Account - ID) for those.
|
||||
paths["reg_used"] = paths["reg_used"].fillna(
|
||||
paths["Registrar Account - ID"].astype(str)
|
||||
)
|
||||
|
||||
# Rename canonical column
|
||||
paths = paths.rename(columns={"Registrar Account - ID": "reg_orig"})
|
||||
|
||||
# Column order
|
||||
other_cols = [c for c in paths.columns if c not in ("reg_orig", "reg_used")]
|
||||
paths = paths[["reg_orig", "reg_used"] + other_cols]
|
||||
paths = paths.sort_values(["reg_orig", "Centralisation Date", "Product - Isin"])
|
||||
paths = paths.reset_index(drop=True)
|
||||
|
||||
# Summary
|
||||
for acc in all_accounts:
|
||||
sub = paths[paths["reg_orig"] == acc]
|
||||
legs = list(sub["reg_used"].unique())
|
||||
tag = " [branched]" if acc in branched_accounts else " [stable]"
|
||||
print(f" {acc}: {len(sub)} rows, legs = {legs}{tag}")
|
||||
|
||||
return paths
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# MAIN
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Apply Carmignac repair mapping to the raw AUM file"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--mapping",
|
||||
default="repair_results/carmignac_mapping.csv",
|
||||
help="Path to mapping CSV from carmignac_repair.py",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--surgery",
|
||||
default="repair_results/carmignac_surgery_log.csv",
|
||||
help="Path to surgery log CSV (optional, for audit)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--out", default="AUM_repaired.csv", help="Output path for repaired AUM CSV"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--audit",
|
||||
default="AUM_repair_audit.csv",
|
||||
help="Output path for audit CSV (renamed rows only)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--paths",
|
||||
default="AUM_paths.csv",
|
||||
help="Output path for condensed paths CSV (branched accounts only)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
def resolve(p, required=True):
|
||||
if os.path.exists(p):
|
||||
return p
|
||||
alt = os.path.join(os.path.dirname(os.path.abspath(__file__)), p)
|
||||
if os.path.exists(alt):
|
||||
return alt
|
||||
if required:
|
||||
sys.exit(f"[ERROR] File not found: {p}")
|
||||
return None
|
||||
|
||||
mapping_path = resolve(args.mapping)
|
||||
surgery_path = resolve(args.surgery, required=False)
|
||||
|
||||
print("=" * 60)
|
||||
print("CARMIGNAC — AUM Branching / Repair")
|
||||
print("=" * 60)
|
||||
print(f" Mapping : {mapping_path}")
|
||||
print(f" Surgery : {surgery_path or '(not provided)'}")
|
||||
|
||||
# Load
|
||||
aum, mapping, surgery = load_inputs_branch(mapping_path, surgery_path)
|
||||
print(f"\n Raw AUM rows : {len(aum)}")
|
||||
print(f" Mapping rows : {len(mapping)}")
|
||||
print(f" Mapping changed rows : {mapping['changed'].sum()}")
|
||||
print(f" Surgery operations : {len(surgery)}")
|
||||
|
||||
# Build lookup
|
||||
lookup = build_rename_lookup(mapping)
|
||||
print(f"\n Rename operations : {len(lookup)}")
|
||||
if lookup:
|
||||
sample = list(lookup.items())[:3]
|
||||
for (d, used), orig in sample:
|
||||
print(f" ({d.date()}, {used}) → {orig}")
|
||||
if len(lookup) > 3:
|
||||
print(f" ... and {len(lookup) - 3} more")
|
||||
|
||||
# Apply
|
||||
repaired, audit = apply_branching(aum, lookup)
|
||||
print(f"\n Rows renamed : {len(audit)}")
|
||||
|
||||
# Checks
|
||||
consistency_check(aum, repaired, mapping, surgery)
|
||||
|
||||
# Save
|
||||
out_dir = os.path.dirname(os.path.abspath(args.out))
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
repaired.to_csv(args.out, index=False)
|
||||
print(f"\n ✓ Repaired AUM → {args.out}")
|
||||
|
||||
if len(audit) > 0:
|
||||
audit.to_csv(args.audit, index=False)
|
||||
print(f" ✓ Audit log → {args.audit}")
|
||||
else:
|
||||
print("(No rows renamed — audit log not written)")
|
||||
|
||||
# Paths: condensed AUM for branched accounts
|
||||
df_paths = export_paths(aum, mapping, surgery, repaired)
|
||||
if df_paths is not None:
|
||||
df_paths.to_csv(args.paths, index=False)
|
||||
print(f" ✓ Paths file → {args.paths}")
|
||||
|
||||
# Print renamed reg_ids summary
|
||||
if len(audit) > 0:
|
||||
print("\n[Renamed identifiers]")
|
||||
summary = (
|
||||
audit.groupby(["original_reg_id", "canonical_reg_id"])
|
||||
.size()
|
||||
.reset_index(name="n_rows")
|
||||
)
|
||||
for _, row in summary.iterrows():
|
||||
print(
|
||||
f" {row['original_reg_id']:20s} → {row['canonical_reg_id']:20s} "
|
||||
f"({row['n_rows']} rows)"
|
||||
)
|
||||
|
||||
print("\nDone.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,632 +0,0 @@
|
|||
"""
|
||||
Broken Months Diagnostics
|
||||
=====================================================
|
||||
Detects months where the aggregate stock-flow equation is violated at the ISIN level (across all accounts)
|
||||
The residual is the "missing flow":
|
||||
missing_{s}(t) = [Q_agg(t) - Q_agg(t-1)] - F_agg(t)
|
||||
|
||||
This is a market-level check, independent of individual account identity.
|
||||
It captures:
|
||||
- Genuinely missing flow records
|
||||
- End-of-month accounting lags (transactions dated at boundary)
|
||||
- Corporate actions (dividends, splits) not reflected in flows
|
||||
|
||||
Outputs
|
||||
-------
|
||||
carmignac_broken_months.csv — machine-readable, loaded by carmignac_repair.py
|
||||
carmignac_diagnostics.html — interactive HTML report
|
||||
|
||||
Usage
|
||||
-----
|
||||
python carmignac_diagnostics.py
|
||||
python carmignac_diagnostics.py \\
|
||||
--aum raw_AUM.csv \\
|
||||
--flows raw_flows.csv \\
|
||||
--out carmignac_broken_months.csv \\
|
||||
--html carmignac_diagnostics.html \\
|
||||
--alpha 0.02
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
from helpers import build_html_diagnostics, load_data_diagnostics
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# AGGREGATE AND DETECT BROKEN MONTHS
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def detect_broken_months(aum, flows, alpha=0.02, lag_days=3):
|
||||
"""
|
||||
For each (isin, month-end t), compute:
|
||||
- Q_agg(t) : total shares held across all accounts
|
||||
- Q_agg(t-1) : idem previous month (forward-filled)
|
||||
- F_agg(t) : total net flows recorded in ]EOM(t-1), EOM(t)]
|
||||
- missing(t) : [Q_agg(t) - Q_agg(t-1)] - F_agg(t)
|
||||
- missing_pct : |missing| / max(Q_agg(t), Q_agg(t-1))
|
||||
|
||||
A month is flagged as "broken" when missing_pct > alpha.
|
||||
|
||||
Additionally, a month is flagged as a potential "lag" when:
|
||||
- It is broken with the standard window
|
||||
- But would NOT be broken if flows dated within lag_days of EOM
|
||||
are shifted to the adjacent month
|
||||
|
||||
Parameters :
|
||||
alpha : tolerance threshold (same as ALPHA in carmignac_repair.py)
|
||||
lag_days : number of boundary days to test for accounting lag
|
||||
|
||||
Returns :
|
||||
df_broken : DataFrame with all (isin, date) pairs where missing_pct > alpha
|
||||
df_all : Full DataFrame including non-broken months (for plotting)
|
||||
"""
|
||||
# Monthly calendar
|
||||
t_min = aum["Centralisation Date"].min()
|
||||
t_max = aum["Centralisation Date"].max()
|
||||
all_months = pd.date_range(t_min, t_max, freq="ME")
|
||||
|
||||
# ── Aggregate AUM per (isin, month-end) ──────────────────────
|
||||
aum_agg = (
|
||||
aum.groupby(["Product - Isin", "Centralisation Date"])["Quantity - AUM"]
|
||||
.sum()
|
||||
.reset_index()
|
||||
.rename(
|
||||
columns={
|
||||
"Product - Isin": "isin",
|
||||
"Centralisation Date": "date",
|
||||
"Quantity - AUM": "qty_agg",
|
||||
}
|
||||
)
|
||||
)
|
||||
# Forward-fill sparse panel
|
||||
aum_pivot = aum_agg.pivot(index="date", columns="isin", values="qty_agg")
|
||||
aum_pivot = aum_pivot.reindex(all_months).ffill()
|
||||
|
||||
# ── Aggregate flows per (isin, month-end) — standard window ──
|
||||
def bucket_flows(flows_df, months, lower_offset=0, upper_offset=0):
|
||||
"""Aggregate flows with optional boundary extension (in days)."""
|
||||
fc = flows_df.copy()
|
||||
|
||||
def assign_month(d):
|
||||
# Extended window: ]EOM(t-1) - lower_offset, EOM(t) + upper_offset]
|
||||
for m in months:
|
||||
eom_prev = m - pd.offsets.MonthEnd(1)
|
||||
lo = eom_prev - pd.Timedelta(days=lower_offset)
|
||||
hi = m + pd.Timedelta(days=upper_offset)
|
||||
if lo < d <= hi:
|
||||
return m
|
||||
return pd.NaT
|
||||
|
||||
fc["month_end"] = fc["Centralisation Date"].apply(assign_month)
|
||||
fc = fc.dropna(subset=["month_end"])
|
||||
agg = (
|
||||
fc.groupby(["Product - Isin", "month_end"])["Quantity - NetFlows"]
|
||||
.sum()
|
||||
.reset_index()
|
||||
.rename(
|
||||
columns={
|
||||
"Product - Isin": "isin",
|
||||
"month_end": "date",
|
||||
"Quantity - NetFlows": "flow_agg",
|
||||
}
|
||||
)
|
||||
)
|
||||
return agg
|
||||
|
||||
flows_std = bucket_flows(flows, all_months)
|
||||
flows_lag = bucket_flows(
|
||||
flows, all_months, lower_offset=lag_days, upper_offset=lag_days
|
||||
)
|
||||
|
||||
def flows_to_pivot(df, months):
|
||||
piv = df.pivot(index="date", columns="isin", values="flow_agg")
|
||||
return piv.reindex(months).fillna(0.0)
|
||||
|
||||
fpiv_std = flows_to_pivot(flows_std, all_months)
|
||||
fpiv_lag = flows_to_pivot(flows_lag, all_months)
|
||||
|
||||
# ── Compute residuals ─────────────────────────────────────────
|
||||
rows = []
|
||||
isins = aum_pivot.columns.tolist()
|
||||
|
||||
for i in range(1, len(all_months)):
|
||||
t_curr = all_months[i]
|
||||
t_prev = all_months[i - 1]
|
||||
|
||||
for isin in isins:
|
||||
q_curr = (
|
||||
aum_pivot[isin].get(t_curr, np.nan)
|
||||
if isin in aum_pivot.columns
|
||||
else np.nan
|
||||
)
|
||||
q_prev = (
|
||||
aum_pivot[isin].get(t_prev, np.nan)
|
||||
if isin in aum_pivot.columns
|
||||
else np.nan
|
||||
)
|
||||
|
||||
if pd.isna(q_curr) or pd.isna(q_prev):
|
||||
continue
|
||||
|
||||
delta = q_curr - q_prev
|
||||
|
||||
# Standard window
|
||||
f_std = fpiv_std[isin].get(t_curr, 0.0) if isin in fpiv_std.columns else 0.0
|
||||
missing_std = delta - f_std
|
||||
|
||||
# Extended lag window
|
||||
f_lag = fpiv_lag[isin].get(t_curr, 0.0) if isin in fpiv_lag.columns else 0.0
|
||||
missing_lag = delta - f_lag
|
||||
|
||||
# ── Denominator choice ────────────────────────────────
|
||||
# Normalise by the size of the *movement* (max of delta_AUM
|
||||
# and recorded flow), not by the stock level. This avoids
|
||||
# astronomically large percentages when a position is tiny
|
||||
# but the missing flow is a normal-sized number.
|
||||
#
|
||||
# Interpretation: "what fraction of the expected movement
|
||||
# is unaccounted for?"
|
||||
#
|
||||
# A minimum absolute threshold (min_abs_shares) suppresses
|
||||
# noise from residual micro-positions (rounding artefacts).
|
||||
min_abs_shares = 1.0 # ignore positions smaller than 1 share
|
||||
movement = max(abs(delta), abs(f_std), min_abs_shares)
|
||||
denom_std = movement
|
||||
|
||||
movement_lag = max(abs(delta), abs(f_lag), min_abs_shares)
|
||||
denom_lag = movement_lag
|
||||
|
||||
pct_std = abs(missing_std) / denom_std
|
||||
pct_lag = abs(missing_lag) / denom_lag
|
||||
|
||||
broken_std = pct_std > alpha
|
||||
broken_lag = pct_lag > alpha
|
||||
|
||||
# A "lag" month: broken with standard, NOT broken with extended window
|
||||
is_lag = broken_std and (not broken_lag)
|
||||
|
||||
rows.append(
|
||||
{
|
||||
"date": t_curr,
|
||||
"isin": isin,
|
||||
"q_agg_prev": round(q_prev, 3),
|
||||
"q_agg_curr": round(q_curr, 3),
|
||||
"delta_aum": round(delta, 3),
|
||||
"flow_agg": round(f_std, 3),
|
||||
"missing_flow": round(missing_std, 3),
|
||||
"missing_pct": round(pct_std, 6),
|
||||
"broken": broken_std,
|
||||
"is_lag": is_lag,
|
||||
}
|
||||
)
|
||||
|
||||
df_all = pd.DataFrame(rows)
|
||||
df_broken = df_all[df_all["broken"]].sort_values("missing_pct", ascending=False)
|
||||
return df_broken, df_all
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# AGGREGATE (CROSS-ISIN) BROKEN MONTHS
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def detect_aggregate_broken_months(aum, flows, alpha=0.02, lag_days=3):
|
||||
"""
|
||||
Same stock-flow check as detect_broken_months, but aggregated
|
||||
across ALL ISINs for each month:
|
||||
|
||||
Q_total(t) - Q_total(t-1) != F_total(t)
|
||||
|
||||
where Q_total(t) = sum over all (reg_id, isin) of Q_{r,s}(t).
|
||||
|
||||
This catches months where the global portfolio is incoherent even
|
||||
if every individual ISIN is fine (e.g. cross-ISIN netting errors),
|
||||
and provides a cleaner high-level view.
|
||||
|
||||
Returns :
|
||||
df_agg : DataFrame indexed by month with columns:
|
||||
q_total_prev, q_total_curr, delta_aum,
|
||||
flow_total, missing_flow, missing_pct, broken, is_lag
|
||||
"""
|
||||
t_min = aum["Centralisation Date"].min()
|
||||
t_max = aum["Centralisation Date"].max()
|
||||
all_months = pd.date_range(t_min, t_max, freq="ME")
|
||||
|
||||
# ── Total AUM per month (all ISIN, all accounts) ─────────────
|
||||
aum_monthly = (
|
||||
aum.groupby("Centralisation Date")["Quantity - AUM"]
|
||||
.sum()
|
||||
.reindex(all_months)
|
||||
.ffill()
|
||||
.rename("q_total")
|
||||
)
|
||||
|
||||
# ── Bucket flows helper (reuse same window logic) ─────────────
|
||||
def bucket_total_flows(flows_df, months, lower_offset=0, upper_offset=0):
|
||||
fc = flows_df.copy()
|
||||
|
||||
def assign_month(d):
|
||||
for m in months:
|
||||
eom_prev = m - pd.offsets.MonthEnd(1)
|
||||
lo = eom_prev - pd.Timedelta(days=lower_offset)
|
||||
hi = m + pd.Timedelta(days=upper_offset)
|
||||
if lo < d <= hi:
|
||||
return m
|
||||
return pd.NaT
|
||||
|
||||
fc["month_end"] = fc["Centralisation Date"].apply(assign_month)
|
||||
fc = fc.dropna(subset=["month_end"])
|
||||
return (
|
||||
fc.groupby("month_end")["Quantity - NetFlows"]
|
||||
.sum()
|
||||
.reindex(months)
|
||||
.fillna(0.0)
|
||||
)
|
||||
|
||||
flow_std = bucket_total_flows(flows, all_months)
|
||||
flow_lag = bucket_total_flows(
|
||||
flows, all_months, lower_offset=lag_days, upper_offset=lag_days
|
||||
)
|
||||
|
||||
# ── Compute residuals ─────────────────────────────────────────
|
||||
rows = []
|
||||
min_abs_shares = 1.0
|
||||
|
||||
for i in range(1, len(all_months)):
|
||||
t_curr = all_months[i]
|
||||
t_prev = all_months[i - 1]
|
||||
|
||||
q_curr = aum_monthly.get(t_curr, np.nan)
|
||||
q_prev = aum_monthly.get(t_prev, np.nan)
|
||||
if pd.isna(q_curr) or pd.isna(q_prev):
|
||||
continue
|
||||
|
||||
delta = q_curr - q_prev
|
||||
|
||||
f_std = flow_std.get(t_curr, 0.0)
|
||||
f_lag = flow_lag.get(t_curr, 0.0)
|
||||
miss_std = delta - f_std
|
||||
miss_lag = delta - f_lag
|
||||
|
||||
movement_std = max(abs(delta), abs(f_std), min_abs_shares)
|
||||
movement_lag = max(abs(delta), abs(f_lag), min_abs_shares)
|
||||
pct_std = abs(miss_std) / movement_std
|
||||
pct_lag = abs(miss_lag) / movement_lag
|
||||
|
||||
broken_std = pct_std > alpha
|
||||
broken_lag = pct_lag > alpha
|
||||
is_lag = broken_std and (not broken_lag)
|
||||
|
||||
rows.append(
|
||||
{
|
||||
"date": t_curr,
|
||||
"q_total_prev": round(q_prev, 3),
|
||||
"q_total_curr": round(q_curr, 3),
|
||||
"delta_aum": round(delta, 3),
|
||||
"flow_total": round(f_std, 3),
|
||||
"missing_flow": round(miss_std, 3),
|
||||
"missing_pct": round(pct_std, 6),
|
||||
"broken": broken_std,
|
||||
"is_lag": is_lag,
|
||||
}
|
||||
)
|
||||
|
||||
df_agg = pd.DataFrame(rows)
|
||||
return df_agg
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# ERROR ACCOUNT
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def build_error_account(aum, flows, lag_days=3):
|
||||
"""
|
||||
Builds a synthetic "error account" that absorbs the stock-flow
|
||||
residuals that cannot be explained by recorded flows.
|
||||
|
||||
Construction (backwards from t_ref):
|
||||
Stock_error(t_ref) = 0 (by definition)
|
||||
Stock_error(t-1) = Stock_error(t) - Residual(t)
|
||||
|
||||
where Residual(t) = [Σ_r Q_{r,s}(t) - Σ_r Q_{r,s}(t-1)] - Σ_r F_{r,s}(t)
|
||||
for each ISIN s independently.
|
||||
|
||||
By construction, adding this error account to the AUM restores the
|
||||
stock-flow equality at every (isin, month).
|
||||
|
||||
Also computes an aggregated error account (summed over all ISINs).
|
||||
|
||||
Returns
|
||||
-------
|
||||
df_err_isin : DataFrame with columns
|
||||
(date, isin, residual, stock_error, stock_error_pct)
|
||||
where stock_error_pct = stock_error / max(total_isin_aum, 1)
|
||||
|
||||
df_err_agg : DataFrame with columns
|
||||
(date, residual_agg, stock_error_agg, stock_error_agg_pct)
|
||||
"""
|
||||
t_min = aum["Centralisation Date"].min()
|
||||
t_max = aum["Centralisation Date"].max()
|
||||
all_months = pd.date_range(t_min, t_max, freq="ME")
|
||||
|
||||
# ── ISIN-level AUM panel (forward-filled) ────────────────────
|
||||
aum_agg = (
|
||||
aum.groupby(["Product - Isin", "Centralisation Date"])["Quantity - AUM"]
|
||||
.sum()
|
||||
.reset_index()
|
||||
.rename(
|
||||
columns={
|
||||
"Product - Isin": "isin",
|
||||
"Centralisation Date": "date",
|
||||
"Quantity - AUM": "qty",
|
||||
}
|
||||
)
|
||||
)
|
||||
aum_pivot = aum_agg.pivot(index="date", columns="isin", values="qty")
|
||||
aum_pivot = aum_pivot.reindex(all_months).ffill()
|
||||
|
||||
# ── ISIN-level flow aggregation (standard window) ─────────────
|
||||
def bucket_isin_flows(flows_df, months):
|
||||
fc = flows_df.copy()
|
||||
|
||||
def assign_month(d):
|
||||
for m in months:
|
||||
eom_prev = m - pd.offsets.MonthEnd(1)
|
||||
if eom_prev < d <= m:
|
||||
return m
|
||||
return pd.NaT
|
||||
|
||||
fc["month_end"] = fc["Centralisation Date"].apply(assign_month)
|
||||
fc = fc.dropna(subset=["month_end"])
|
||||
return (
|
||||
fc.groupby(["Product - Isin", "month_end"])["Quantity - NetFlows"]
|
||||
.sum()
|
||||
.unstack("Product - Isin")
|
||||
.reindex(months)
|
||||
.fillna(0.0)
|
||||
)
|
||||
|
||||
flow_pivot = bucket_isin_flows(flows, all_months)
|
||||
|
||||
# ── Compute residuals per (isin, month) ───────────────────────
|
||||
isins = aum_pivot.columns.tolist()
|
||||
# residual[t] = delta_AUM[t] - flow[t]
|
||||
residuals = {} # {isin: Series indexed by month}
|
||||
|
||||
for isin in isins:
|
||||
res_series = {}
|
||||
for i in range(1, len(all_months)):
|
||||
t_curr = all_months[i]
|
||||
t_prev = all_months[i - 1]
|
||||
q_curr = aum_pivot[isin].get(t_curr, np.nan)
|
||||
q_prev = aum_pivot[isin].get(t_prev, np.nan)
|
||||
if pd.isna(q_curr) or pd.isna(q_prev):
|
||||
continue
|
||||
delta = q_curr - q_prev
|
||||
f = flow_pivot[isin].get(t_curr, 0.0) if isin in flow_pivot.columns else 0.0
|
||||
res_series[t_curr] = delta - f
|
||||
residuals[isin] = pd.Series(res_series)
|
||||
|
||||
# ── Build error stock backwards from t_ref ────────────────────
|
||||
t_ref = all_months[-1]
|
||||
rows_isin = []
|
||||
|
||||
for isin in isins:
|
||||
res = residuals[isin]
|
||||
# Maximum AUM for this ISIN (for normalisation)
|
||||
max_aum = aum_pivot[isin].max()
|
||||
if pd.isna(max_aum) or max_aum < 1:
|
||||
max_aum = 1.0
|
||||
|
||||
# Propagate backwards: stock(t_ref) = 0
|
||||
stock = 0.0
|
||||
# Build dict keyed by date
|
||||
stock_by_date = {t_ref: 0.0}
|
||||
for i in range(len(all_months) - 2, -1, -1):
|
||||
t_curr = all_months[i + 1]
|
||||
t_prev = all_months[i]
|
||||
r = res.get(t_curr, 0.0)
|
||||
stock = stock - r
|
||||
stock_by_date[t_prev] = stock
|
||||
|
||||
for t in all_months:
|
||||
s = stock_by_date.get(t, np.nan)
|
||||
r = res.get(t, 0.0)
|
||||
rows_isin.append(
|
||||
{
|
||||
"date": t,
|
||||
"isin": isin,
|
||||
"residual": round(r, 4),
|
||||
"stock_error": round(s, 4) if not pd.isna(s) else np.nan,
|
||||
"stock_error_pct": round(abs(s) / max_aum * 100, 4)
|
||||
if not pd.isna(s)
|
||||
else np.nan,
|
||||
}
|
||||
)
|
||||
|
||||
df_err_isin = pd.DataFrame(rows_isin).sort_values(["date", "isin"])
|
||||
|
||||
# ── Aggregated error account ──────────────────────────────────
|
||||
# Total AUM across all ISINs at each month
|
||||
total_aum_by_month = aum_pivot.sum(axis=1)
|
||||
max_total_aum = total_aum_by_month.max()
|
||||
if pd.isna(max_total_aum) or max_total_aum < 1:
|
||||
max_total_aum = 1.0
|
||||
|
||||
# Aggregate residual = sum of ISIN residuals
|
||||
agg_res = {}
|
||||
for i in range(1, len(all_months)):
|
||||
t_curr = all_months[i]
|
||||
total_r = sum(residuals[isin].get(t_curr, 0.0) for isin in isins)
|
||||
agg_res[t_curr] = total_r
|
||||
|
||||
agg_stock = 0.0
|
||||
agg_stock_by_date = {t_ref: 0.0}
|
||||
for i in range(len(all_months) - 2, -1, -1):
|
||||
t_curr = all_months[i + 1]
|
||||
t_prev = all_months[i]
|
||||
agg_stock = agg_stock - agg_res.get(t_curr, 0.0)
|
||||
agg_stock_by_date[t_prev] = agg_stock
|
||||
|
||||
rows_agg = []
|
||||
for t in all_months:
|
||||
s = agg_stock_by_date.get(t, np.nan)
|
||||
r = agg_res.get(t, 0.0)
|
||||
rows_agg.append(
|
||||
{
|
||||
"date": t,
|
||||
"residual_agg": round(r, 4),
|
||||
"stock_error_agg": round(s, 4) if not pd.isna(s) else np.nan,
|
||||
"stock_error_agg_pct": round(abs(s) / max_total_aum * 100, 4)
|
||||
if not pd.isna(s)
|
||||
else np.nan,
|
||||
}
|
||||
)
|
||||
|
||||
df_err_agg = pd.DataFrame(rows_agg).sort_values("date")
|
||||
return df_err_isin, df_err_agg
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# PRINT SUMMARY
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def print_summary(df_broken, df_all, alpha):
|
||||
total = len(df_all)
|
||||
n_broken = len(df_broken)
|
||||
n_lag = df_broken["is_lag"].sum()
|
||||
|
||||
print("\n" + "=" * 60)
|
||||
print(" CARMIGNAC — Broken Months Diagnostics")
|
||||
print("=" * 60)
|
||||
print(f" (isin, month) pairs examined : {total}")
|
||||
print(
|
||||
f" Broken (missing_pct > {alpha:.0%}) : {n_broken} "
|
||||
f"({n_broken / total * 100:.1f}%)"
|
||||
)
|
||||
print(f" Of which likely lag : {n_lag}")
|
||||
print(f" Of which genuine gap : {n_broken - n_lag}")
|
||||
|
||||
if n_broken:
|
||||
print("\n Top 10 by missing_pct:")
|
||||
cols = ["date", "isin", "missing_flow", "missing_pct", "is_lag"]
|
||||
print(df_broken[cols].head(10).to_string(index=False))
|
||||
|
||||
# Monthly breakdown
|
||||
by_month = (
|
||||
df_broken.groupby("date")
|
||||
.agg(
|
||||
n_broken=("isin", "count"),
|
||||
total_missing=("missing_flow", lambda x: x.abs().sum()),
|
||||
)
|
||||
.sort_values("n_broken", ascending=False)
|
||||
.head(5)
|
||||
)
|
||||
if len(by_month):
|
||||
print("\n Most affected months:")
|
||||
print(by_month.to_string())
|
||||
print()
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
# MAIN
|
||||
# ─────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def main():
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Detect broken months in Carmignac AUM/Flows data"
|
||||
)
|
||||
parser.add_argument(
|
||||
"--out",
|
||||
default="carmignac_broken_months.csv",
|
||||
help="Machine-readable output (loaded by carmignac_repair.py)",
|
||||
)
|
||||
parser.add_argument("--html", default="carmignac_diagnostics.html")
|
||||
parser.add_argument(
|
||||
"--alpha",
|
||||
type=float,
|
||||
default=0.05,
|
||||
help="Tolerance threshold (default 0.05 = 5%%)",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--lag",
|
||||
type=int,
|
||||
default=3,
|
||||
help="Boundary days to test for accounting lag (default 3)",
|
||||
)
|
||||
args = parser.parse_args()
|
||||
|
||||
def resolve(p):
|
||||
if os.path.exists(p):
|
||||
return p
|
||||
alt = os.path.join(os.path.dirname(os.path.abspath(__file__)), p)
|
||||
if os.path.exists(alt):
|
||||
return alt
|
||||
sys.exit(f"[ERROR] File not found: {p}")
|
||||
|
||||
print("[Load] AUM")
|
||||
print("[Load] Flows")
|
||||
aum, flows = load_data_diagnostics()
|
||||
|
||||
print(
|
||||
f"\n[Detect] Running broken-month detection (α={args.alpha:.1%}, lag=±{args.lag}d)..."
|
||||
)
|
||||
df_broken, df_all = detect_broken_months(
|
||||
aum, flows, alpha=args.alpha, lag_days=args.lag
|
||||
)
|
||||
df_agg = detect_aggregate_broken_months(
|
||||
aum, flows, alpha=args.alpha, lag_days=args.lag
|
||||
)
|
||||
|
||||
print("\n[Error account] Building error account...")
|
||||
df_err_isin, df_err_agg = build_error_account(aum, flows, lag_days=args.lag)
|
||||
|
||||
print_summary(df_broken, df_all, args.alpha)
|
||||
|
||||
n_agg_broken = int(df_agg["broken"].sum())
|
||||
print(
|
||||
f" Aggregate broken months : {n_agg_broken} "
|
||||
f"(of which lags: {int(df_agg['is_lag'].sum())})"
|
||||
)
|
||||
max_err = float(df_err_agg["stock_error_agg"].abs().max())
|
||||
print(
|
||||
f" Max aggregate error stock : {max_err:,.1f} shares "
|
||||
f"({float(df_err_agg['stock_error_agg_pct'].max()):.3f}% of total AUM)"
|
||||
)
|
||||
|
||||
# CSV output — this is what carmignac_repair.py loads
|
||||
if len(df_broken):
|
||||
df_broken.to_csv(args.out, index=False)
|
||||
print(f"[Export] Broken months CSV → {args.out}")
|
||||
else:
|
||||
pd.DataFrame(columns=["date", "isin", "missing_pct", "is_lag"]).to_csv(
|
||||
args.out, index=False
|
||||
)
|
||||
print(f"[Export] No broken months — empty CSV → {args.out}")
|
||||
|
||||
# Error account CSV
|
||||
err_out = args.out.replace("broken_months", "error_account")
|
||||
df_err_isin.to_csv(err_out, index=False)
|
||||
err_agg_out = err_out.replace("error_account", "error_account_agg")
|
||||
df_err_agg.to_csv(err_agg_out, index=False)
|
||||
print(f"[Export] Error account (ISIN) → {err_out}")
|
||||
print(f"[Export] Error account (agg) → {err_agg_out}")
|
||||
|
||||
html = build_html_diagnostics(
|
||||
df_broken, df_all, df_agg, df_err_isin, df_err_agg, args.alpha
|
||||
)
|
||||
with open(args.html, "w", encoding="utf-8") as f:
|
||||
f.write(html)
|
||||
print(f"[Export] HTML report → {args.html}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
|
|
@ -1,995 +0,0 @@
|
|||
"""
|
||||
Registrar ID Repair Pipeline
|
||||
=========================================================
|
||||
Étape 1 : Filtrage & univers de référence à t=31/10/2025
|
||||
Étape 2 : Score de cohérence temporelle (propagation vers le passé)
|
||||
Étape 3 : Chirurgie de code (matching 1-to-1)
|
||||
|
||||
À appliquer après le diagnostic de broken months
|
||||
"""
|
||||
|
||||
import os
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
|
||||
from helpers import load_data_repair
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# PARAMÈTRES
|
||||
# ─────────────────────────────────────────────
|
||||
ALPHA = 0.05 # tolérance réconciliation : 5% du stock à t
|
||||
MIN_AUM_EUR = 5e6 # seuil filtrage étape 1
|
||||
MIN_JACCARD = 0.3 # seuil minimal similarité portefeuille pour chirurgie
|
||||
SCORE_DROP_THRESHOLD = 0.15 # si score chute de >15% → candidat chirurgie
|
||||
MAX_SURGERY_LOOKBACK = 6 # remonter jusqu'à 6 mois en arrière pour trouver un candidat
|
||||
SYMMETRY_ATTENUATION = (
|
||||
0.05 # facteur d'atténuation si rupture symétrique détectée (cas 1/3)
|
||||
)
|
||||
|
||||
# ── Broken months ──────────────────────────────────────────────
|
||||
# Attenuation factor applied to reconciliation errors on months flagged
|
||||
# as "broken" by carmignac_diagnostics.py. On a broken month the error
|
||||
# is multiplied by this factor before degrading the score, so a genuine
|
||||
# data-quality problem at market level does not unfairly penalise an
|
||||
# account. Set to 1.0 to disable attenuation.
|
||||
BROKEN_MONTH_ATTENUATION = 0.2 # reduce error to 20% on broken months
|
||||
|
||||
# ── Accounting lag window ──────────────────────────────────────
|
||||
# Transactions dated within this many days of a month-end boundary are
|
||||
# considered "boundary" flows. When the standard-window reconciliation
|
||||
# fails but the lag-adjusted reconciliation passes, the error is
|
||||
# attenuated (same factor as broken months).
|
||||
LAG_ATTENUATION = 0.1 # reduce error to 10% on likely lag months
|
||||
|
||||
# ── Fenêtre de chirurgie étendue ───────────────────────────────
|
||||
# Quand aucun bon candidat n'est trouvé à t-1, la chirurgie remonte
|
||||
# jusqu'à MAX_SURGERY_LOOKBACK mois en arrière. Pour chaque mois k
|
||||
# supplémentaire, le score composite est multiplié par un facteur de
|
||||
# confiance décroissant : confidence(k) = 1 - (k-1)/MAX_SURGERY_LOOKBACK.
|
||||
# Carmignac suggère 6 mois (délai maximal de résolution des transferts
|
||||
# asymétriques, lié au cycle de paiement des rétrocessions trimestrielles).
|
||||
MAX_SURGERY_LOOKBACK = 6
|
||||
|
||||
EXCLUDE_REGISTRAR = ["Off Distribution", "Private Clients"]
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# CHARGEMENT
|
||||
# ─────────────────────────────────────────────
|
||||
def load_broken_months(broken_months_path):
|
||||
"""
|
||||
Loads the broken-months CSV produced by carmignac_diagnostics.py.
|
||||
Returns a set of (date, isin) tuples flagged as broken, and a
|
||||
separate set flagged as likely accounting lags.
|
||||
"""
|
||||
|
||||
if not broken_months_path or not os.path.exists(broken_months_path):
|
||||
print("Could not find the path")
|
||||
return set(), set()
|
||||
try:
|
||||
df = pd.read_csv(broken_months_path, parse_dates=["date"])
|
||||
broken = set(zip(pd.to_datetime(df["date"]), df["isin"].astype(str)))
|
||||
lags = set(
|
||||
zip(
|
||||
pd.to_datetime(df.loc[df["is_lag"], "date"]),
|
||||
df.loc[df["is_lag"], "isin"].astype(str),
|
||||
)
|
||||
)
|
||||
print(
|
||||
f"[Broken months] Loaded {len(broken)} flagged (isin, month) pairs "
|
||||
f"({len(lags)} likely lags)"
|
||||
)
|
||||
return broken, lags
|
||||
except Exception as e:
|
||||
print(f"[Broken months] Could not load '{broken_months_path}': {e}")
|
||||
return set(), set()
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# ÉTAPE 1 — Univers de référence à T_REF
|
||||
# ─────────────────────────────────────────────
|
||||
def build_reference_universe(aum, t_ref=None):
|
||||
"""
|
||||
Construit l'univers de référence à t_ref (dernière date par défaut).
|
||||
Retourne :
|
||||
- aum_ref : AUM à t_ref pour chaque (reg_id, isin)
|
||||
- weights : poids normalisé par reg_id
|
||||
- universe : ensemble des reg_id retenus (>= MIN_AUM_EUR)
|
||||
"""
|
||||
if t_ref is None:
|
||||
t_ref = aum["date"].max()
|
||||
|
||||
print(f"\n[Étape 1] Date de référence : {t_ref.date()}")
|
||||
|
||||
# Exclure Off Distribution / Private Clients (sur région ou nom)
|
||||
mask_excl = aum["reg_id"].isin(EXCLUDE_REGISTRAR)
|
||||
if "region" in aum.columns:
|
||||
mask_excl |= aum["region"].isin(EXCLUDE_REGISTRAR)
|
||||
aum_clean = aum[~mask_excl].copy()
|
||||
|
||||
# AUM à t_ref
|
||||
aum_ref = aum_clean[aum_clean["date"] == t_ref][
|
||||
["reg_id", "isin", "qty_aum", "val_eur"]
|
||||
].copy()
|
||||
|
||||
# AUM total par reg_id à t_ref
|
||||
aum_by_reg = aum_ref.groupby("reg_id")["val_eur"].sum().rename("total_eur")
|
||||
|
||||
# Filtrage >= MIN_AUM_EUR
|
||||
universe = set(aum_by_reg[aum_by_reg >= MIN_AUM_EUR].index)
|
||||
|
||||
total_eur_universe = aum_by_reg[aum_by_reg.index.isin(universe)].sum()
|
||||
total_eur_all = aum_by_reg.sum()
|
||||
coverage = total_eur_universe / total_eur_all if total_eur_all > 0 else 0
|
||||
|
||||
print(f" Registrar IDs à t_ref : {len(aum_by_reg)}")
|
||||
print(f" Dont >= {MIN_AUM_EUR / 1e6:.0f}M€ : {len(universe)}")
|
||||
print(f" Couverture encours : {coverage:.1%}")
|
||||
|
||||
# Poids initiaux (scores à t_ref)
|
||||
weights = (
|
||||
aum_by_reg[aum_by_reg.index.isin(universe)] / total_eur_universe
|
||||
).to_dict()
|
||||
|
||||
return aum_ref, weights, universe, t_ref
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# 3. PANEL AUM MENSUEL (forward-fill)
|
||||
# ─────────────────────────────────────────────
|
||||
def build_monthly_panel(aum, universe, t_ref):
|
||||
"""
|
||||
Construit un panel mensuel complet (forward-fill des quantités AUM)
|
||||
pour TOUS les reg_ids présents dans l'historique AUM — y compris les codes
|
||||
historiques hors univers de référence, nécessaires pour la chirurgie.
|
||||
"""
|
||||
# Toutes les fin de mois entre la première date et t_ref
|
||||
date_min = aum["date"].min()
|
||||
all_months = pd.date_range(start=date_min, end=t_ref, freq="ME")
|
||||
|
||||
# Pivot : (reg_id, isin) → série temporelle de qty_aum
|
||||
aum_sorted = aum.sort_values(["reg_id", "isin", "date"])
|
||||
|
||||
# On ne garde que les lignes jusqu'à t_ref
|
||||
aum_sorted = aum_sorted[aum_sorted["date"] <= t_ref]
|
||||
|
||||
# Multi-index pivot
|
||||
panel = aum_sorted.pivot_table(
|
||||
index="date", columns=["reg_id", "isin"], values="qty_aum", aggfunc="last"
|
||||
)
|
||||
|
||||
# Réindexer sur toutes les fins de mois
|
||||
panel = panel.reindex(all_months)
|
||||
|
||||
# Forward-fill : si pas de mouvement, la quantité reste la même
|
||||
panel = panel.ffill()
|
||||
|
||||
# Backward-fill initial pour les comptes qui démarrent après la première date
|
||||
# (on ne remonte pas avant leur première apparition → on garde NaN)
|
||||
|
||||
print(
|
||||
f"\n[Panel mensuel] {len(all_months)} mois, {panel.shape[1]} (reg_id, isin) paires"
|
||||
)
|
||||
|
||||
return panel, all_months
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# 4. FLOWS AGRÉGÉS PAR MOIS
|
||||
# ─────────────────────────────────────────────
|
||||
def aggregate_flows_monthly(flows, all_months, lag_days=3):
|
||||
"""
|
||||
Agrège les flows infra-mensuels sur chaque fenêtre ]fin_mois(t-1), fin_mois(t)].
|
||||
Retourne deux DataFrames indexés par (fin_mois, reg_id, isin) :
|
||||
- monthly_flows : agrégation standard (fenêtre exacte)
|
||||
- monthly_flows_lag : agrégation avec fenêtre élargie de ±lag_days jours
|
||||
autour de chaque fin de mois. Utilisé pour détecter
|
||||
les ruptures dues à un décalage comptable de fin de mois.
|
||||
"""
|
||||
flows_f = flows[flows["date"] <= all_months[-1]].copy()
|
||||
|
||||
def assign_month(d, lower_offset=0, upper_offset=0):
|
||||
for m in all_months:
|
||||
eom_prev = m - pd.offsets.MonthEnd(1)
|
||||
lo = eom_prev - pd.Timedelta(days=lower_offset)
|
||||
hi = m + pd.Timedelta(days=upper_offset)
|
||||
if lo < d <= hi:
|
||||
return m
|
||||
return pd.NaT
|
||||
|
||||
# Standard window
|
||||
flows_f["month_end"] = flows_f["date"].apply(lambda d: assign_month(d))
|
||||
flows_std = flows_f.dropna(subset=["month_end"])
|
||||
monthly_flows = (
|
||||
flows_std.groupby(["month_end", "reg_id", "isin"])["qty_net"]
|
||||
.sum()
|
||||
.reset_index()
|
||||
)
|
||||
monthly_flows.columns = ["date", "reg_id", "isin", "qty_net_month"]
|
||||
|
||||
# Lag window (±lag_days around each EOM)
|
||||
flows_f2 = flows[flows["date"] <= all_months[-1]].copy()
|
||||
flows_f2["month_end"] = flows_f2["date"].apply(
|
||||
lambda d: assign_month(d, lower_offset=lag_days, upper_offset=lag_days)
|
||||
)
|
||||
flows_lag = flows_f2.dropna(subset=["month_end"])
|
||||
monthly_flows_lag = (
|
||||
flows_lag.groupby(["month_end", "reg_id", "isin"])["qty_net"]
|
||||
.sum()
|
||||
.reset_index()
|
||||
)
|
||||
monthly_flows_lag.columns = ["date", "reg_id", "isin", "qty_net_month"]
|
||||
|
||||
print(
|
||||
f"\n[Flows mensuels] {len(monthly_flows)} enregistrements (standard) | "
|
||||
f"{len(monthly_flows_lag)} (lag window ±{lag_days}d)"
|
||||
)
|
||||
|
||||
return monthly_flows, monthly_flows_lag
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# ÉTAPE 2 — Score de cohérence temporelle
|
||||
# ─────────────────────────────────────────────
|
||||
def compute_reconciliation_error(qty_t_minus1, qty_t, net_flow, alpha=ALPHA):
|
||||
"""
|
||||
Calcule l'erreur de réconciliation normalisée pour un (reg_id, isin, mois).
|
||||
|
||||
Attendu : qty_t_minus1 + net_flow ≈ qty_t
|
||||
Erreur : |qty_t_minus1 + net_flow - qty_t| / max(|qty_t|, |qty_t_minus1|)
|
||||
|
||||
Retourne :
|
||||
- err_ratio : erreur relative (0 = parfait)
|
||||
- is_break : True si err_ratio > alpha
|
||||
"""
|
||||
denom = max(abs(qty_t), abs(qty_t_minus1), 1e-9)
|
||||
err = abs(qty_t_minus1 + net_flow - qty_t)
|
||||
err_ratio = err / denom
|
||||
return err_ratio, err_ratio > alpha
|
||||
|
||||
|
||||
def score_propagation(
|
||||
panel,
|
||||
monthly_flows,
|
||||
monthly_flows_lag,
|
||||
weights,
|
||||
universe,
|
||||
all_months,
|
||||
broken_months=None,
|
||||
lag_months=None,
|
||||
):
|
||||
"""
|
||||
Propage les scores de t_ref vers t=0 (passé).
|
||||
|
||||
À chaque mois t (en remontant), pour chaque reg_id dans l'univers courant :
|
||||
- Calculer l'erreur de réconciliation pondérée par ISIN
|
||||
- Dégrader le score proportionnellement
|
||||
- Atténuer l'erreur si le mois est flagué comme "broken" ou "lag"
|
||||
|
||||
broken_months : set of (date, isin) pairs flagged as broken by diagnostics
|
||||
lag_months : subset of broken_months identified as likely accounting lags
|
||||
|
||||
Retourne :
|
||||
- scores_history : dict {date → {reg_id → score}}
|
||||
- errors_history : dict {date → {reg_id → err_pondérée}}
|
||||
- mapping : dict {reg_id_original → reg_id_courant} (après chirurgie)
|
||||
"""
|
||||
broken_months = broken_months or set()
|
||||
lag_months = lag_months or set()
|
||||
|
||||
# Initialisation
|
||||
scores = dict(weights) # scores à t_ref
|
||||
scores_history = {all_months[-1]: dict(scores)}
|
||||
errors_history = {}
|
||||
|
||||
# Mapping actuel (identité au départ)
|
||||
mapping = {r: r for r in universe}
|
||||
|
||||
# Flows indexés pour accès rapide
|
||||
flows_idx = monthly_flows.set_index(["date", "reg_id", "isin"])["qty_net_month"]
|
||||
flows_idx_lag = monthly_flows_lag.set_index(["date", "reg_id", "isin"])[
|
||||
"qty_net_month"
|
||||
]
|
||||
|
||||
# ── Pré-calcul des AUM agrégés par (isin, mois) pour détection de symétrie ──
|
||||
# Pour chaque (isin, t), on calcule la somme des variations de stock par compte.
|
||||
# Une rupture symétrique = un compte perd X parts sur un ISIN, un autre en gagne X.
|
||||
# On détecte cela via le résidu net agrégé : si faible → symétrie probable.
|
||||
# Structure : {(t_curr, isin) → {reg_id → delta_qty}}
|
||||
# Calculé à la volée dans la boucle, pas en pré-calcul (trop mémoire pour 400 comptes).
|
||||
|
||||
# Remonter dans le temps
|
||||
for i in range(len(all_months) - 2, -1, -1):
|
||||
t_prev = all_months[i]
|
||||
t_curr = all_months[i + 1]
|
||||
|
||||
# ── Détection de ruptures symétriques à ce pas de temps ──────
|
||||
# Pour chaque ISIN, calculer la variation de stock par compte.
|
||||
# Si la somme des variations positives ≈ somme des variations négatives
|
||||
# → il y a probablement compensation (cas 1 ou 3, pas de perte nette).
|
||||
# On stocke pour chaque (reg_id, isin) si sa rupture est symétrique.
|
||||
symmetric_breaks = set() # ensemble de (reg_id, isin) à atténuer
|
||||
|
||||
for reg in panel.columns.get_level_values(0):
|
||||
for isin in panel[reg].columns:
|
||||
q_t = panel[reg][isin].get(t_curr, np.nan)
|
||||
q_prev = panel[reg][isin].get(t_prev, np.nan)
|
||||
if pd.isna(q_t) or pd.isna(q_prev):
|
||||
continue
|
||||
try:
|
||||
f = flows_idx.loc[(t_curr, reg, isin)]
|
||||
except KeyError:
|
||||
f = 0.0
|
||||
residual = (q_t - q_prev) - f
|
||||
if abs(residual) < ALPHA * max(abs(q_t), abs(q_prev), 1e-9):
|
||||
continue # pas de rupture sur ce compte/ISIN
|
||||
|
||||
# Agrégation par ISIN : si le résidu net agrégé est petit,
|
||||
# les ruptures individuelles se compensent → symétrie.
|
||||
isin_residuals = {}
|
||||
isin_total_abs = {}
|
||||
for reg in panel.columns.get_level_values(0):
|
||||
for isin in panel[reg].columns:
|
||||
q_t = panel[reg][isin].get(t_curr, np.nan)
|
||||
q_prev = panel[reg][isin].get(t_prev, np.nan)
|
||||
if pd.isna(q_t) or pd.isna(q_prev):
|
||||
continue
|
||||
try:
|
||||
f = flows_idx.loc[(t_curr, reg, isin)]
|
||||
except KeyError:
|
||||
f = 0.0
|
||||
residual = (q_t - q_prev) - f
|
||||
denom = max(abs(q_t), abs(q_prev), 1e-9)
|
||||
err = abs(residual) / denom
|
||||
if err < ALPHA:
|
||||
continue
|
||||
isin_residuals[isin] = isin_residuals.get(isin, 0.0) + residual
|
||||
isin_total_abs[isin] = isin_total_abs.get(isin, 0.0) + abs(residual)
|
||||
|
||||
# Un ISIN est "symétrique" si le résidu net < 20% du résidu brut total
|
||||
# (les erreurs individuelles s'annulent en grande partie)
|
||||
symmetric_isins = set()
|
||||
for isin, net in isin_residuals.items():
|
||||
total = isin_total_abs.get(isin, 0.0)
|
||||
if total > 0 and abs(net) / total < 0.20:
|
||||
symmetric_isins.add(isin)
|
||||
|
||||
errors_at_t = {}
|
||||
new_scores = {}
|
||||
|
||||
for reg_orig, reg_curr in mapping.items():
|
||||
score_curr = scores.get(reg_orig, 0)
|
||||
if score_curr == 0:
|
||||
new_scores[reg_orig] = 0
|
||||
continue
|
||||
|
||||
# ISIN détenus par ce reg à t_curr (après mapping)
|
||||
if reg_curr in panel.columns.get_level_values(0):
|
||||
isin_list = panel[reg_curr].columns.tolist()
|
||||
else:
|
||||
# reg_curr n'existe pas du tout dans le panel → rupture totale
|
||||
new_scores[reg_orig] = 0
|
||||
errors_at_t[reg_orig] = 1.0
|
||||
continue
|
||||
|
||||
total_aum_t = 0
|
||||
weighted_err = 0
|
||||
valid_isin_count = 0
|
||||
all_nan_at_prev = True # détecte si le compte n'existait pas à t_prev
|
||||
|
||||
for isin in isin_list:
|
||||
qty_t = panel[reg_curr][isin].get(t_curr, np.nan)
|
||||
qty_t_prev = panel[reg_curr][isin].get(t_prev, np.nan)
|
||||
|
||||
if pd.isna(qty_t):
|
||||
continue
|
||||
|
||||
if not pd.isna(qty_t_prev):
|
||||
all_nan_at_prev = False
|
||||
|
||||
if pd.isna(qty_t_prev):
|
||||
# ISIN existait à t_curr mais pas à t_prev → rupture sur cet ISIN
|
||||
# On le traite comme une erreur maximale pondérée par son AUM
|
||||
weight_isin = abs(qty_t)
|
||||
weighted_err += 1.0 * weight_isin
|
||||
total_aum_t += weight_isin
|
||||
valid_isin_count += 1
|
||||
continue
|
||||
|
||||
if qty_t == 0 and qty_t_prev == 0:
|
||||
continue
|
||||
# Flow agrégé sur ]t_prev, t_curr]
|
||||
try:
|
||||
net_flow = flows_idx.loc[(t_curr, reg_curr, isin)]
|
||||
except KeyError:
|
||||
net_flow = 0.0
|
||||
|
||||
err_ratio, is_break = compute_reconciliation_error(
|
||||
qty_t_prev, qty_t, net_flow, alpha=ALPHA
|
||||
)
|
||||
|
||||
# ── Attenuation on broken / lag / symmetric months ───
|
||||
# Priority: symmetric > broken > lag
|
||||
if err_ratio > 0:
|
||||
key = (t_curr, isin)
|
||||
if isin in symmetric_isins:
|
||||
# Rupture compensée à l'agrégé → cas 1 ou 3,
|
||||
# pas de perte nette de données → atténuation forte
|
||||
err_ratio = err_ratio * SYMMETRY_ATTENUATION
|
||||
elif key in broken_months or key in lag_months:
|
||||
# Try lag-window flow to distinguish lag vs genuine gap
|
||||
try:
|
||||
net_flow_lag = flows_idx_lag.loc[(t_curr, reg_curr, isin)]
|
||||
except KeyError:
|
||||
net_flow_lag = net_flow
|
||||
err_lag, _ = compute_reconciliation_error(
|
||||
qty_t_prev, qty_t, net_flow_lag, alpha=ALPHA
|
||||
)
|
||||
# Use whichever flow window gives the smaller error,
|
||||
# then attenuate the result
|
||||
best_err = min(err_ratio, err_lag)
|
||||
attenuation = (
|
||||
BROKEN_MONTH_ATTENUATION
|
||||
if key in broken_months
|
||||
else LAG_ATTENUATION
|
||||
)
|
||||
err_ratio = best_err * attenuation
|
||||
|
||||
# Pondération par AUM à t_curr
|
||||
weight_isin = abs(qty_t)
|
||||
weighted_err += err_ratio * weight_isin
|
||||
total_aum_t += weight_isin
|
||||
valid_isin_count += 1
|
||||
|
||||
if total_aum_t > 0 and valid_isin_count > 0:
|
||||
avg_err = weighted_err / total_aum_t
|
||||
else:
|
||||
avg_err = 0.0
|
||||
|
||||
errors_at_t[reg_orig] = avg_err
|
||||
|
||||
# Dégradation du score : score(t-1) = score(t) * (1 - err_pondérée)
|
||||
# Clippée entre 0 et score_curr
|
||||
degradation = min(avg_err, 1.0)
|
||||
new_scores[reg_orig] = score_curr * (1.0 - degradation)
|
||||
|
||||
scores = new_scores
|
||||
scores_history[t_prev] = dict(scores)
|
||||
errors_history[t_prev] = dict(errors_at_t)
|
||||
|
||||
total_score = sum(scores.values())
|
||||
print(
|
||||
f" {t_prev.date()} | Σ scores = {total_score:.4f} | "
|
||||
f"Comptes actifs = {sum(1 for v in scores.values() if v > 0)}"
|
||||
)
|
||||
|
||||
return scores_history, errors_history, mapping
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# ÉTAPE 3 — Chirurgie de code
|
||||
# ─────────────────────────────────────────────
|
||||
def jaccard_isin(set_a, set_b):
|
||||
"""Coefficient de Jaccard entre deux ensembles d'ISIN."""
|
||||
if not set_a or not set_b:
|
||||
return 0.0
|
||||
inter = len(set_a & set_b)
|
||||
union = len(set_a | set_b)
|
||||
return inter / union if union > 0 else 0.0
|
||||
|
||||
|
||||
def find_best_candidate(
|
||||
reg_orig,
|
||||
reg_curr,
|
||||
t_prev,
|
||||
t_curr,
|
||||
panel,
|
||||
flows_idx,
|
||||
all_regs_at_t_prev,
|
||||
mapping_inv,
|
||||
):
|
||||
"""
|
||||
Pour un reg_id dont le score a fortement chuté, cherche le meilleur
|
||||
candidat j à t_prev tel que :
|
||||
- j n'est pas déjà mappé à un autre compte original
|
||||
- Le portefeuille ISIN de j à t_prev est similaire à celui de reg_curr à t_curr
|
||||
- La réconciliation est bonne
|
||||
|
||||
Retourne (best_candidate, best_score_composite) ou (None, 0)
|
||||
"""
|
||||
# ISIN du compte cible à t_curr
|
||||
if reg_curr not in panel.columns.get_level_values(0):
|
||||
return None, 0.0
|
||||
|
||||
isin_curr = set(
|
||||
panel[reg_curr]
|
||||
.columns[
|
||||
panel[reg_curr].loc[t_curr].notna() & (panel[reg_curr].loc[t_curr] != 0)
|
||||
]
|
||||
.tolist()
|
||||
)
|
||||
|
||||
if not isin_curr:
|
||||
return None, 0.0
|
||||
|
||||
best_candidate = None
|
||||
best_composite = 0.0
|
||||
|
||||
for j in all_regs_at_t_prev:
|
||||
# Ne pas réutiliser un code déjà mappé
|
||||
if j in mapping_inv:
|
||||
continue
|
||||
# Ne pas mapper sur soi-même si déjà présent
|
||||
if j == reg_curr:
|
||||
continue
|
||||
|
||||
if j not in panel.columns.get_level_values(0):
|
||||
continue
|
||||
|
||||
# ISIN de j à t_prev
|
||||
col_j = panel[j]
|
||||
isin_j = (
|
||||
set(
|
||||
col_j.columns[
|
||||
col_j.loc[t_prev].notna() & (col_j.loc[t_prev] != 0)
|
||||
].tolist()
|
||||
)
|
||||
if t_prev in col_j.index
|
||||
else set()
|
||||
)
|
||||
|
||||
if not isin_j:
|
||||
continue
|
||||
|
||||
jac = jaccard_isin(isin_curr, isin_j)
|
||||
if jac < MIN_JACCARD:
|
||||
continue
|
||||
|
||||
# Erreur de réconciliation pour les ISIN communs
|
||||
common_isin = isin_curr & isin_j
|
||||
total_aum = 0
|
||||
weighted_err = 0
|
||||
|
||||
for isin in common_isin:
|
||||
qty_t = (
|
||||
panel[reg_curr][isin].get(t_curr, np.nan)
|
||||
if isin in panel[reg_curr].columns
|
||||
else np.nan
|
||||
)
|
||||
qty_t_prev = (
|
||||
panel[j][isin].get(t_prev, np.nan)
|
||||
if isin in panel[j].columns
|
||||
else np.nan
|
||||
)
|
||||
|
||||
if pd.isna(qty_t) or pd.isna(qty_t_prev):
|
||||
continue
|
||||
|
||||
try:
|
||||
net_flow = flows_idx.loc[(t_curr, j, isin)]
|
||||
except KeyError:
|
||||
net_flow = 0.0
|
||||
|
||||
err_ratio, _ = compute_reconciliation_error(qty_t_prev, qty_t, net_flow)
|
||||
weight_isin = abs(qty_t)
|
||||
weighted_err += err_ratio * weight_isin
|
||||
total_aum += weight_isin
|
||||
|
||||
avg_err = weighted_err / total_aum if total_aum > 0 else 1.0
|
||||
|
||||
composite = jac * (1.0 - min(avg_err, 1.0))
|
||||
|
||||
if composite > best_composite:
|
||||
best_composite = composite
|
||||
best_candidate = j
|
||||
|
||||
return best_candidate, best_composite
|
||||
|
||||
|
||||
def _recompute_score_with_candidate(
|
||||
reg_orig, candidate, t_prev, t_curr, panel, flows_idx, score_curr
|
||||
):
|
||||
"""
|
||||
Recalcule proprement l'erreur de réconciliation pour un candidat donné,
|
||||
et retourne le score après chirurgie.
|
||||
"""
|
||||
if candidate not in panel.columns.get_level_values(0):
|
||||
return score_curr * 0 # candidat inexistant
|
||||
|
||||
isin_list_cand = panel[candidate].columns.tolist()
|
||||
isin_list_curr = (
|
||||
panel[reg_orig].columns.tolist()
|
||||
if reg_orig in panel.columns.get_level_values(0)
|
||||
else []
|
||||
)
|
||||
|
||||
total_aum = 0
|
||||
weighted_err = 0
|
||||
|
||||
for isin in isin_list_curr:
|
||||
qty_t = (
|
||||
panel[reg_orig][isin].get(t_curr, np.nan)
|
||||
if isin in panel[reg_orig].columns
|
||||
else np.nan
|
||||
)
|
||||
if pd.isna(qty_t) or qty_t == 0:
|
||||
continue
|
||||
|
||||
qty_t_prev = (
|
||||
panel[candidate][isin].get(t_prev, np.nan)
|
||||
if isin in panel[candidate].columns
|
||||
else np.nan
|
||||
)
|
||||
|
||||
try:
|
||||
net_flow = flows_idx.loc[(t_curr, candidate, isin)]
|
||||
except KeyError:
|
||||
net_flow = 0.0
|
||||
|
||||
if pd.isna(qty_t_prev):
|
||||
err_ratio = 1.0
|
||||
else:
|
||||
err_ratio, _ = compute_reconciliation_error(qty_t_prev, qty_t, net_flow)
|
||||
|
||||
weight_isin = abs(qty_t)
|
||||
weighted_err += err_ratio * weight_isin
|
||||
total_aum += weight_isin
|
||||
|
||||
avg_err = weighted_err / total_aum if total_aum > 0 else 1.0
|
||||
return score_curr * (1.0 - min(avg_err, 1.0))
|
||||
|
||||
|
||||
def run_surgery_pass(
|
||||
scores_history,
|
||||
errors_history,
|
||||
panel,
|
||||
monthly_flows,
|
||||
monthly_flows_lag,
|
||||
weights,
|
||||
universe,
|
||||
all_months,
|
||||
broken_months=None,
|
||||
lag_months=None,
|
||||
):
|
||||
"""
|
||||
Deuxième passe : pour chaque mois avec des ruptures fortes,
|
||||
tente une chirurgie de code et recalcule les scores.
|
||||
|
||||
Corrections par rapport à la passe naïve :
|
||||
- Après chirurgie, le score est recalculé proprement (pas juste composite)
|
||||
- Le mapping propagé en arrière utilise le bon code à chaque étape
|
||||
- Pré-filtre ISIN pour performance sur grand dataset
|
||||
|
||||
Retourne :
|
||||
- mapping_history : {date → {reg_orig → reg_used}}
|
||||
- surgery_log : liste des opérations effectuées
|
||||
- scores_final : scores au dernier mois
|
||||
"""
|
||||
flows_idx = monthly_flows.set_index(["date", "reg_id", "isin"])["qty_net_month"]
|
||||
flows_idx_lag = monthly_flows_lag.set_index(["date", "reg_id", "isin"])[
|
||||
"qty_net_month"
|
||||
]
|
||||
|
||||
# Tous les reg_ids présents dans le panel (univers + codes historiques)
|
||||
all_regs_in_panel = set(panel.columns.get_level_values(0))
|
||||
|
||||
# Pré-calcul : ensemble d'ISIN par reg_id à chaque date (pour pré-filtre rapide)
|
||||
# {reg_id → {date → set(isin)}}
|
||||
reg_isin_at_date = {}
|
||||
for reg in all_regs_in_panel:
|
||||
reg_isin_at_date[reg] = {}
|
||||
col = panel[reg]
|
||||
for date in col.index:
|
||||
active = set(
|
||||
col.columns[(col.loc[date].notna()) & (col.loc[date] != 0)].tolist()
|
||||
)
|
||||
if active:
|
||||
reg_isin_at_date[reg][date] = active
|
||||
|
||||
# Mapping courant : reg_orig → reg_used
|
||||
mapping = {r: r for r in universe}
|
||||
mapping_inv = {r: r for r in universe}
|
||||
|
||||
surgery_log = []
|
||||
mapping_history = {all_months[-1]: dict(mapping)}
|
||||
scores_history_corrected = {all_months[-1]: dict(weights)}
|
||||
|
||||
# Scores courants (initialisés à t_ref)
|
||||
scores = dict(weights)
|
||||
|
||||
for i in range(len(all_months) - 2, -1, -1):
|
||||
t_prev = all_months[i]
|
||||
t_curr = all_months[i + 1]
|
||||
|
||||
new_scores = {}
|
||||
new_mapping = {}
|
||||
|
||||
for reg_orig in list(mapping.keys()):
|
||||
reg_curr = mapping[reg_orig]
|
||||
score_curr = scores.get(reg_orig, 0)
|
||||
|
||||
if score_curr == 0:
|
||||
new_scores[reg_orig] = 0
|
||||
new_mapping[reg_orig] = reg_curr
|
||||
continue
|
||||
|
||||
# Erreur sans chirurgie (depuis étape 2)
|
||||
err = errors_history.get(t_prev, {}).get(reg_orig, 0.0)
|
||||
score_prev_no_surgery = score_curr * (1.0 - min(err, 1.0))
|
||||
drop_ratio = (
|
||||
1.0 - (score_prev_no_surgery / score_curr) if score_curr > 0 else 0
|
||||
)
|
||||
|
||||
if drop_ratio > SCORE_DROP_THRESHOLD:
|
||||
# ── ISIN du compte courant à t_curr (pour pré-filtre) ──
|
||||
isin_curr = reg_isin_at_date.get(reg_curr, {}).get(t_curr, set())
|
||||
|
||||
# ── Candidats disponibles ──
|
||||
# On exclut les codes déjà mappés à un autre compte,
|
||||
# mais reg_curr lui-même est un candidat valide (self-mapping).
|
||||
available = (all_regs_in_panel - set(mapping_inv.keys())) | {reg_curr}
|
||||
|
||||
best_candidate = None
|
||||
best_score_after = score_prev_no_surgery # baseline = pas de chirurgie
|
||||
best_composite = 0.0
|
||||
best_lookback = 0 # nombre de mois remontés pour trouver ce candidat
|
||||
|
||||
# ── Fenêtre de recherche étendue : jusqu'à MAX_SURGERY_LOOKBACK mois ──
|
||||
# On cherche d'abord à t-1 (k=1), puis t-2 … t-MAX si rien trouvé.
|
||||
# La confiance décroît avec la distance : confidence(k) = 1 - (k-1)/MAX
|
||||
for k in range(1, MAX_SURGERY_LOOKBACK + 1):
|
||||
if i - (k - 1) < 0:
|
||||
break # on a atteint le début de l'historique
|
||||
t_lookup = all_months[
|
||||
i - (k - 1)
|
||||
] # date candidate = t_prev - (k-1)
|
||||
confidence = 1.0 - (k - 1) / MAX_SURGERY_LOOKBACK
|
||||
|
||||
for j in available:
|
||||
# Pré-filtre rapide : overlap ISIN minimal
|
||||
isin_j = reg_isin_at_date.get(j, {}).get(t_lookup, set())
|
||||
if not isin_curr or not isin_j:
|
||||
continue
|
||||
inter = len(isin_curr & isin_j)
|
||||
if inter == 0:
|
||||
continue
|
||||
jac = inter / len(isin_curr | isin_j)
|
||||
if jac < MIN_JACCARD:
|
||||
continue
|
||||
|
||||
# Score après chirurgie avec ce candidat à t_lookup
|
||||
# (on utilise t_curr comme référence de stock, t_lookup comme prior)
|
||||
score_after_raw = _recompute_score_with_candidate(
|
||||
reg_curr, j, t_lookup, t_curr, panel, flows_idx, score_curr
|
||||
)
|
||||
# Appliquer le facteur de confiance lié à la distance temporelle
|
||||
score_after = (
|
||||
score_curr * confidence * (score_after_raw / score_curr)
|
||||
if score_curr > 0
|
||||
else score_after_raw
|
||||
)
|
||||
composite = (
|
||||
jac * confidence * (score_after_raw / score_curr)
|
||||
if score_curr > 0
|
||||
else 0
|
||||
)
|
||||
|
||||
if score_after > best_score_after:
|
||||
best_score_after = score_after
|
||||
best_candidate = j
|
||||
best_composite = composite
|
||||
best_lookback = k
|
||||
|
||||
# Si on a trouvé un bon candidat à cette distance, on s'arrête
|
||||
if best_candidate is not None:
|
||||
break
|
||||
|
||||
if best_candidate:
|
||||
lookback_note = (
|
||||
f", lookback={best_lookback}m" if best_lookback > 1 else ""
|
||||
)
|
||||
surgery_log.append(
|
||||
{
|
||||
"date": t_prev,
|
||||
"reg_orig": reg_orig,
|
||||
"reg_from": reg_curr,
|
||||
"reg_to": best_candidate,
|
||||
"jaccard_composite": round(best_composite, 4),
|
||||
"score_before": round(score_curr, 6),
|
||||
"score_after": round(best_score_after, 6),
|
||||
"drop_without_surgery": round(drop_ratio, 4),
|
||||
"gain_vs_no_surgery": round(
|
||||
best_score_after - score_prev_no_surgery, 6
|
||||
),
|
||||
"lookback_months": best_lookback,
|
||||
}
|
||||
)
|
||||
print(
|
||||
f" 🔧 CHIRURGIE {t_prev.date()} | {reg_orig} : "
|
||||
f"{reg_curr} → {best_candidate} "
|
||||
f"(composite={best_composite:.3f}, "
|
||||
f"score {score_curr:.4f}→{best_score_after:.4f}"
|
||||
f"{lookback_note})"
|
||||
)
|
||||
|
||||
# Mise à jour mapping
|
||||
if best_candidate != reg_curr:
|
||||
if reg_curr in mapping_inv:
|
||||
del mapping_inv[reg_curr]
|
||||
mapping_inv[best_candidate] = reg_orig
|
||||
new_mapping[reg_orig] = best_candidate
|
||||
new_scores[reg_orig] = best_score_after
|
||||
else:
|
||||
new_mapping[reg_orig] = reg_curr
|
||||
new_scores[reg_orig] = score_prev_no_surgery
|
||||
else:
|
||||
new_mapping[reg_orig] = reg_curr
|
||||
new_scores[reg_orig] = score_prev_no_surgery
|
||||
|
||||
mapping = new_mapping
|
||||
mapping_inv = {v: k for k, v in mapping.items()}
|
||||
scores = new_scores
|
||||
mapping_history[t_prev] = dict(mapping)
|
||||
scores_history_corrected[t_prev] = dict(scores)
|
||||
|
||||
total_score = sum(s for s in scores.values() if not np.isnan(s))
|
||||
n_surgeries = sum(1 for op in surgery_log if op["date"] == t_prev)
|
||||
print(
|
||||
f" {t_prev.date()} | Σ scores = {total_score:.4f} | "
|
||||
f"Chirurgies = {n_surgeries}"
|
||||
)
|
||||
|
||||
return mapping_history, surgery_log, scores, scores_history_corrected
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# EXPORT RÉSULTATS
|
||||
# ─────────────────────────────────────────────
|
||||
def export_results(
|
||||
scores_history, mapping_history, surgery_log, all_months, out_prefix="carmignac"
|
||||
):
|
||||
"""Exporte les résultats clés en CSV."""
|
||||
|
||||
# Score history
|
||||
rows = []
|
||||
for date, sc in scores_history.items():
|
||||
for reg, score in sc.items():
|
||||
rows.append({"date": date, "reg_id": reg, "score": score})
|
||||
df_scores = (
|
||||
pd.DataFrame(rows)
|
||||
if rows
|
||||
else pd.DataFrame(columns=["date", "reg_id", "score"])
|
||||
)
|
||||
if not df_scores.empty:
|
||||
df_scores = df_scores.sort_values(["date", "score"], ascending=[True, False])
|
||||
df_scores.to_csv(
|
||||
f"repair_challenge/repair_results/{out_prefix}_scores.csv", index=False
|
||||
)
|
||||
|
||||
# Mapping history
|
||||
rows_m = []
|
||||
for date, mp in mapping_history.items():
|
||||
for reg_orig, reg_used in mp.items():
|
||||
rows_m.append(
|
||||
{
|
||||
"date": date,
|
||||
"reg_orig": reg_orig,
|
||||
"reg_used": reg_used,
|
||||
"changed": reg_orig != reg_used,
|
||||
}
|
||||
)
|
||||
df_mapping = (
|
||||
pd.DataFrame(rows_m)
|
||||
if rows_m
|
||||
else pd.DataFrame(columns=["date", "reg_orig", "reg_used", "changed"])
|
||||
)
|
||||
if not df_mapping.empty:
|
||||
df_mapping = df_mapping.sort_values(["date", "reg_orig"])
|
||||
df_mapping.to_csv(
|
||||
f"repair_challenge/repair_results/{out_prefix}_mapping.csv", index=False
|
||||
)
|
||||
|
||||
# Surgery log
|
||||
if surgery_log:
|
||||
df_surgery = pd.DataFrame(surgery_log).sort_values("date")
|
||||
df_surgery.to_csv(
|
||||
f"repair_challenge/repair_results/{out_prefix}_surgery_log.csv", index=False
|
||||
)
|
||||
print(f"\n[Export] {len(surgery_log)} opérations de chirurgie sauvegardées.")
|
||||
else:
|
||||
print("\n[Export] Aucune chirurgie effectuée sur ce subset.")
|
||||
|
||||
print(f"[Export] Scores → {out_prefix}_scores.csv")
|
||||
print(f"[Export] Mapping → {out_prefix}_mapping.csv")
|
||||
|
||||
return df_scores, df_mapping
|
||||
|
||||
|
||||
# ─────────────────────────────────────────────
|
||||
# PIPELINE PRINCIPAL
|
||||
# ─────────────────────────────────────────────
|
||||
def run_pipeline(
|
||||
broken_months_path="repair_challenge/alpha_5%/carmignac_broken_months.csv",
|
||||
):
|
||||
print("=" * 60)
|
||||
print("CARMIGNAC — Pipeline de réparation des Registrar IDs")
|
||||
print("=" * 60)
|
||||
|
||||
# Chargement
|
||||
aum, flows = load_data_repair()
|
||||
|
||||
# Broken months (optional — produced by carmignac_diagnostics.py)
|
||||
broken_months, lag_months = load_broken_months(broken_months_path)
|
||||
|
||||
# Étape 1 — Univers de référence
|
||||
aum_ref, weights, universe, t_ref = build_reference_universe(aum)
|
||||
|
||||
print("\n Top 5 comptes par poids :")
|
||||
for reg, w in sorted(weights.items(), key=lambda x: -x[1])[:5]:
|
||||
print(f" {reg} : {w:.4f} ({w * 100:.2f}%)")
|
||||
|
||||
# Panel mensuel
|
||||
panel, all_months = build_monthly_panel(aum, universe, t_ref)
|
||||
|
||||
# Flows mensuels agrégés (standard + lag window)
|
||||
monthly_flows, monthly_flows_lag = aggregate_flows_monthly(flows, all_months)
|
||||
|
||||
# Étape 2 — Score de cohérence (sans chirurgie)
|
||||
print("\n[Étape 2] Propagation des scores (sans chirurgie)...")
|
||||
scores_history, errors_history, _ = score_propagation(
|
||||
panel,
|
||||
monthly_flows,
|
||||
monthly_flows_lag,
|
||||
weights,
|
||||
universe,
|
||||
all_months,
|
||||
broken_months=broken_months,
|
||||
lag_months=lag_months,
|
||||
)
|
||||
|
||||
# Étape 3 — Chirurgie
|
||||
print("\n[Étape 3] Passe de chirurgie...")
|
||||
mapping_history, surgery_log, final_scores, scores_history_corrected = (
|
||||
run_surgery_pass(
|
||||
scores_history,
|
||||
errors_history,
|
||||
panel,
|
||||
monthly_flows,
|
||||
monthly_flows_lag,
|
||||
weights,
|
||||
universe,
|
||||
all_months,
|
||||
broken_months=broken_months,
|
||||
lag_months=lag_months,
|
||||
)
|
||||
)
|
||||
|
||||
# Export — on utilise les scores corrigés (post-chirurgie) comme référence
|
||||
print("\n[Export des résultats...]")
|
||||
df_scores, df_mapping = export_results(
|
||||
scores_history_corrected, mapping_history, surgery_log, all_months
|
||||
)
|
||||
|
||||
# Résumé final
|
||||
print("\n" + "=" * 60)
|
||||
print("RÉSUMÉ FINAL")
|
||||
print("=" * 60)
|
||||
print(
|
||||
f" Dates couvertes : {all_months[0].date()} → {all_months[-1].date()}"
|
||||
)
|
||||
print(f" Comptes dans l'univers : {len(universe)}")
|
||||
print(f" Chirurgies effectuées : {len(surgery_log)}")
|
||||
score_by_date = {
|
||||
d: sum(s for s in sc.values() if s == s)
|
||||
for d, sc in scores_history_corrected.items()
|
||||
}
|
||||
print(f" Σ scores à t_ref : {score_by_date[t_ref]:.4f}")
|
||||
print(f" Σ scores à t_min : {score_by_date[all_months[0]]:.4f}")
|
||||
|
||||
return df_scores, df_mapping, surgery_log, scores_history_corrected, mapping_history
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
df_scores, df_mapping, surgery_log, scores_history, mapping_history = run_pipeline(
|
||||
broken_months_path="repair_challenge/alpha_5%/carmignac_broken_months.csv" # optional
|
||||
)
|
||||
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
File diff suppressed because it is too large
Load Diff
File diff suppressed because one or more lines are too long
Loading…
Reference in New Issue
Block a user