2026-03-10 18:45:51 +01:00
{
2026-04-05 17:49:37 +02:00
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.13.0"
}
},
2026-03-10 18:45:51 +01:00
"cells": [
2026-04-05 17:49:37 +02:00
{
"cell_type": "markdown",
"metadata": {},
"source": "# Level Shift Repair — Improved Version\n\n**Three improvements over first attempt:**\n- Gap stability tolerance is now **relative** (1%) instead of absolute (1e-6)\n- Detection is now **iterative**: multiple level shifts per trajectory are corrected\n- Minimum relative gap threshold lowered to **2%** to capture smaller persistent anomalies\n\n**Sections:**\n1. Imports & Data Loading\n2. Build Panel\n3. Improved Repair Algorithm\n4. Rebuild Stocks File\n5. Validation & Diagnostics\n6. Figure"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 0. Imports & Data Loading"
},
2026-03-10 18:45:51 +01:00
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
"outputs": [],
2026-04-05 17:49:37 +02:00
"source": "import pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport plotly.graph_objects as go\n\nstocks = pd.read_csv(\"stocks.csv\", low_memory=False)\nflows = pd.read_csv(\"flows.csv\", low_memory=False)\n\nstocks[\"Centralisation Date\"] = pd.to_datetime(stocks[\"Centralisation Date\"])\nflows[\"Centralisation Date\"] = pd.to_datetime(flows[\"Centralisation Date\"])\n\nprint(f\"Stocks: {stocks.shape}\")\nprint(f\"Flows: {flows.shape}\")"
2026-03-10 18:45:51 +01:00
},
{
2026-04-05 17:49:37 +02:00
"cell_type": "markdown",
2026-03-10 18:45:51 +01:00
"metadata": {},
2026-04-05 17:49:37 +02:00
"source": "## 1. Build Panel (Account x ISIN x Date)"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
"outputs": [],
2026-04-05 17:49:37 +02:00
"source": "KEY = [\"Registrar Account - ID\", \"Product - Isin\", \"Centralisation Date\"]\nGROUP = [\"Registrar Account - ID\", \"Product - Isin\"]\n\nstocks_panel = stocks[KEY + [\"Quantity - AUM\"]].copy()\n\nflows_panel = (\n flows[KEY + [\"Quantity - NetFlows\"]]\n .groupby(KEY, as_index=False)[\"Quantity - NetFlows\"]\n .sum()\n)\n\ndf = stocks_panel.merge(flows_panel, on=KEY, how=\"left\")\ndf[\"Quantity - NetFlows\"] = df[\"Quantity - NetFlows\"].fillna(0)\ndf = df.sort_values(KEY).reset_index(drop=True)\n\n# Remove negative AUM (data quality filter)\ndf = df[df[\"Quantity - AUM\"] >= 0].copy()\n\nprint(f\"Panel size: {df.shape}\")\nprint(f\"Accounts: {df['Registrar Account - ID'].nunique():,}\")\nprint(f\"ISINs: {df['Product - Isin'].nunique():,}\")"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 2. Improved Repair Algorithm\n\n**Fix 1 — Relative gap stability tolerance**\nThe original algorithm used `GAP_TOL = 1e-6` (absolute), requiring the gap to be perfectly constant.\nIn practice gaps fluctuate slightly due to valuation effects and rounding.\nWe now use a relative tolerance of 1%:\n`|gap(t+k) - gap(t)| / |gap(t)| < REL_STABILITY_TOL`\n\n**Fix 2 — Iterative detection**\nThe original algorithm stopped after the first detected shift.\nWe now loop until no more shifts are found, correcting multiple successive migrations.\n\n**Fix 3 — Lower detection threshold**\nLowered from 5% to 2% to capture smaller but clearly persistent anomalies."
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
"outputs": [],
2026-04-05 17:49:37 +02:00
"source": "# ============================================================\n# PARAMETERS\n# ============================================================\n\nREL_GAP_THR = 0.02 # minimum relative gap to trigger detection (2%, was 5%)\nREL_STABILITY_TOL = 0.01 # relative tolerance on gap stability (1%, was absolute 1e-6)\nMIN_PERSISTENCE = 3 # minimum consecutive stable periods\nMAX_ITERATIONS = 10 # maximum repair iterations per trajectory\n\n# ============================================================\n# REPAIR FUNCTION (IMPROVED)\n# ============================================================\n\ndef repair_group(g):\n \"\"\"\n Iteratively detect and correct persistent level shifts.\n Improvements: relative stability tolerance, iterative detection, lower threshold.\n \"\"\"\n g = g.copy()\n obs = g[\"Quantity - AUM\"].values.copy()\n flows_ = g[\"Quantity - NetFlows\"].values\n n = len(obs)\n\n corrected = obs.copy()\n repair_flag = np.zeros(n, dtype=bool)\n n_repairs = 0\n\n for _ in range(MAX_ITERATIONS):\n\n # Build expected path from current corrected series\n expected = np.empty(n)\n expected[0] = np.nan\n for t in range(1, n):\n expected[t] = corrected[t-1] + flows_[t-1]\n\n gap = corrected - expected\n rel_gap = np.abs(gap) / np.maximum(np.abs(expected), 1.0)\n\n # Search for first persistent level shift\n idx = None\n shift = None\n\n for i in range(1, n - MIN_PERSISTENCE):\n if rel_gap[i] <= REL_GAP_THR:\n continue\n\n # Check gap stability with RELATIVE tolerance (Fix 1)\n window = gap[i:i + MIN_PERSISTENCE]\n ref = gap[i]\n\n if abs(ref) < 1.0:\n stable = np.all(np.abs(window - ref) < 1.0)\n else:\n stable = np.all(np.abs(window - ref) / np.abs(ref) < REL_STABILITY_TOL)\n\n if not stable:\n continue\n\n idx = i\n shift = ref\n break\n\n # No more shifts found: stop (Fix 2 — iterative)\n if idx is None:\n break\n\n # Safety: do not create new negative AUM\n candidate = corrected[idx:] - shift\n if ((candidate < 0) & (corrected[idx:] >= 0)).any():\n break\n\n # Safety: avoid extreme corrections\n if abs(shift) > 2 * np.nanmax(np.abs(corrected)):\n break\n\n # Apply correction\n corrected[idx:] = candidate\n repair_flag[idx:] = True\n n_repairs += 1\n\n if n_repairs == 0:\n return g\n\n # Rebuild expected path after all corrections\n expected_corr = np.empty(n)\n expected_corr[0] = np.nan\n for t in range(1, n):\n expected_corr[t] = corrected[t-1] + flows_[t-1]\n\n g[\"corrected_aum\"] = corrected\n g[\"expected_stock_corr\"] = expected_corr\n g[\"repair_flag\"] = repair_flag\n g[\"n_repairs\"] = n_repairs\n\n return g"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
"outputs": [],
2026-04-05 17:49:37 +02:00
"source": "# Apply repair\ndf_repair = df.copy()\ndf_repair[\"corrected_aum\"] = df_repair[\"Quantity - AUM\"]\ndf_repair[\"expected_stock_corr\"] = np.nan\ndf_repair[\"repair_flag\"] = False\ndf_repair[\"n_repairs\"] = 0\n\ndf_repair = (\n df_repair\n .groupby(GROUP, group_keys=False)\n .apply(repair_group)\n .reset_index(drop=True)\n)\n\n# Rebuild expected stock before repair\ndf_repair = df_repair.sort_values(KEY).reset_index(drop=True)\ndf_repair[\"prev_aum\"] = df_repair.groupby(GROUP)[\"Quantity - AUM\"].shift(1)\ndf_repair[\"prev_flow\"] = df_repair.groupby(GROUP)[\"Quantity - NetFlows\"].shift(1).fillna(0)\ndf_repair[\"expected_stock\"] = df_repair[\"prev_aum\"] + df_repair[\"prev_flow\"]\n\ndf_repair[\"gap_before\"] = df_repair[\"Quantity - AUM\"] - df_repair[\"expected_stock\"]\ndf_repair[\"gap_after\"] = df_repair[\"corrected_aum\"] - df_repair[\"expected_stock_corr\"]\ndf_repair[\"rupture_before\"] = df_repair[\"gap_before\"].abs() > 10\ndf_repair[\"rupture_after\"] = df_repair[\"gap_after\"].abs() > 10\n\nmulti_shift = (df_repair.groupby(GROUP)[\"n_repairs\"].max() > 1).sum()\n\nprint(\"=== REPAIR SUMMARY ===\")\nprint(pd.DataFrame({\n \"Before repair\": [int(df_repair[\"rupture_before\"].sum())],\n \"After repair\": [int(df_repair[\"rupture_after\"].sum())],\n \"Repaired points\": [int(df_repair[\"repair_flag\"].sum())],\n \"Multi-shift trajectories\": [int(multi_shift)]\n}))\n\ndf_repair[[\n \"Registrar Account - ID\", \"Product - Isin\", \"Centralisation Date\",\n \"Quantity - AUM\", \"corrected_aum\", \"Quantity - NetFlows\",\n \"expected_stock\", \"expected_stock_corr\",\n \"gap_before\", \"gap_after\", \"repair_flag\", \"n_repairs\"\n]].rename(columns={\n \"Quantity - AUM\": \"aum_raw\",\n \"corrected_aum\": \"aum_repaired\",\n \"Quantity - NetFlows\": \"flows\",\n \"expected_stock\": \"expected_aum_raw\",\n \"expected_stock_corr\": \"expected_aum_repaired\"\n}).to_csv(\"df_repaired.csv\", index=False)"
2026-03-10 18:45:51 +01:00
},
{
2026-04-05 17:49:37 +02:00
"cell_type": "markdown",
2026-03-10 18:45:51 +01:00
"metadata": {},
2026-04-05 17:49:37 +02:00
"source": "## 3. Rebuild Stocks File"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
2026-04-05 17:49:37 +02:00
"outputs": [],
"source": "stocks_repaired = stocks.copy()\nstocks_repaired[\"Centralisation Date\"] = pd.to_datetime(stocks_repaired[\"Centralisation Date\"])\n\nrepair_map = df_repair[KEY + [\"corrected_aum\", \"repair_flag\", \"n_repairs\"]].rename(\n columns={\"corrected_aum\": \"Quantity - AUM repaired\"}\n)\n\nstocks_repaired = stocks_repaired.merge(repair_map, on=KEY, how=\"left\")\nstocks_repaired[\"Quantity - AUM original\"] = stocks_repaired[\"Quantity - AUM\"]\nstocks_repaired[\"Quantity - AUM\"] = np.where(\n stocks_repaired[\"repair_flag\"] == True,\n stocks_repaired[\"Quantity - AUM repaired\"],\n stocks_repaired[\"Quantity - AUM\"]\n)\n\n# Recompute monetary values (NAV per share unchanged)\nstocks_repaired[\"nav_ccy\"] = stocks_repaired[\"Value - AUM CCY\"] / stocks_repaired[\"Quantity - AUM original\"]\nstocks_repaired[\"nav_eur\"] = stocks_repaired[\"Value - AUM €\"] / stocks_repaired[\"Quantity - AUM original\"]\nstocks_repaired[\"Value - AUM CCY\"] = stocks_repaired[\"Quantity - AUM\"] * stocks_repaired[\"nav_ccy\"]\nstocks_repaired[\"Value - AUM €\"] = stocks_repaired[\"Quantity - AUM\"] * stocks_repaired[\"nav_eur\"]\nstocks_repaired = stocks_repaired.drop(columns=[\n \"Quantity - AUM repaired\", \"Quantity - AUM original\", \"nav_ccy\", \"nav_eur\"\n])\n\nprint(f\"Share of repaired observations: {stocks_repaired['repair_flag'].mean():.4%}\")\nprint(f\"Number of repaired rows: {stocks_repaired['repair_flag'].sum():,}\")\nprint(f\"Trajectories with 2+ repairs: {(stocks_repaired.groupby(GROUP)['n_repairs'].max() >= 2).sum():,}\")\n\nstocks_repaired.to_csv(\"stock_repaired.csv\", index=False)"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "## 4. Validation & Diagnostics"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
2026-04-05 17:49:37 +02:00
"outputs": [],
"source": "# Safety check: no new negative AUM\ndf_compare = stocks.merge(\n stocks_repaired[KEY + [\"Quantity - AUM\"]],\n on=KEY, how=\"inner\", suffixes=(\"_raw\", \"_repaired\")\n)\n\nneg_raw = (df_compare[\"Quantity - AUM_raw\"] < 0).sum()\nneg_rep = (df_compare[\"Quantity - AUM_repaired\"] < 0).sum()\ncreated_neg = df_compare[\n (df_compare[\"Quantity - AUM_raw\"] >= 0) &\n (df_compare[\"Quantity - AUM_repaired\"] < 0)\n].groupby([\"Registrar Account - ID\", \"Product - Isin\"]).size()\n\nprint(f\"Negative AUM before repair: {neg_raw:,}\")\nprint(f\"Negative AUM after repair: {neg_rep:,}\")\nprint(f\"Series where repair created negatives: {len(created_neg):,}\")\n\nn_before = int(df_repair[\"rupture_before\"].sum())\nn_after = int(df_repair[\"rupture_after\"].sum())\nprint(f\"\\nRuptures before: {n_before:,}\")\nprint(f\"Ruptures after: {n_after:,}\")\nprint(f\"Reduction rate: {1 - n_after/n_before:.1%}\")"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "# Accounting scatter: Flow vs ΔAUM (before / after)\n\ndef build_accounting_df(stocks_df):\n flows_agg = flows[KEY + [\"Quantity - NetFlows\"]].groupby(KEY, as_index=False).sum()\n d = stocks_df.merge(flows_agg, on=KEY, how=\"left\")\n d[\"Quantity - NetFlows\"] = d[\"Quantity - NetFlows\"].fillna(0)\n d = d.sort_values(KEY)\n d[\"prev_aum\"] = d.groupby(GROUP)[\"Quantity - AUM\"].shift(1)\n d[\"flow_lag\"] = d.groupby(GROUP)[\"Quantity - NetFlows\"].shift(1).fillna(0)\n d[\"delta_aum\"] = d[\"Quantity - AUM\"] - d[\"prev_aum\"]\n return d[d[\"prev_aum\"].notna()]\n\nfig, axes = plt.subplots(1, 2, figsize=(12, 5))\nfor ax, stocks_df, title in zip(axes, [stocks, stocks_repaired], [\"Raw\", \"Repaired\"]):\n s = build_accounting_df(stocks_df).sample(min(20000, len(stocks_df)), random_state=1)\n ax.scatter(s[\"flow_lag\"], s[\"delta_aum\"], alpha=0.2, s=4)\n lim = s[\"flow_lag\"].abs().quantile(0.99)\n x = np.linspace(-lim, lim, 100)\n ax.plot(x, x, color=\"red\", linewidth=1.5, label=\"Perfect identity\")\n ax.set_xlim(-lim, lim); ax.set_ylim(-lim, lim)\n ax.set_xlabel(\"Flow (t-1)\"); ax.set_ylabel(\"Δ AUM\")\n ax.set_title(title); ax.legend()\nplt.tight_layout()\nplt.show()"
2026-03-10 18:45:51 +01:00
},
{
"cell_type": "code",
2026-04-05 17:49:37 +02:00
"execution_count": null,
2026-03-10 18:45:51 +01:00
"metadata": {},
2026-04-05 17:49:37 +02:00
"outputs": [],
"source": "# Gap persistence: raw vs repaired\n\ndef build_gap_df(stocks_df):\n flows_agg = flows[KEY + [\"Quantity - NetFlows\"]].groupby(KEY, as_index=False).sum()\n d = stocks_df.merge(flows_agg, on=KEY, how=\"left\")\n d[\"Quantity - NetFlows\"] = d[\"Quantity - NetFlows\"].fillna(0)\n d = d.sort_values(KEY)\n d[\"prev_aum\"] = d.groupby(GROUP)[\"Quantity - AUM\"].shift(1)\n d[\"flow_lag\"] = d.groupby(GROUP)[\"Quantity - NetFlows\"].shift(1).fillna(0)\n d[\"gap\"] = d[\"Quantity - AUM\"] - (d[\"prev_aum\"] + d[\"flow_lag\"])\n return d\n\ndef compute_gap_sequences(df, tol=1.0):\n lengths = []\n for _, g in df.groupby(GROUP):\n gaps = g[\"gap\"].values\n run = 1\n for i in range(1, len(gaps)):\n if np.isfinite(gaps[i]) and np.isfinite(gaps[i-1]) and abs(gaps[i]-gaps[i-1]) < tol:\n run += 1\n else:\n lengths.append(run); run = 1\n lengths.append(run)\n return np.array(lengths)\n\nraw_lengths = compute_gap_sequences(build_gap_df(stocks))\nclean_lengths = compute_gap_sequences(build_gap_df(stocks_repaired))\n\nmax_len = 20\nraw_h = np.bincount(np.minimum(raw_lengths, max_len), minlength=max_len+1) / len(raw_lengths)\nclean_h = np.bincount(np.minimum(clean_lengths, max_len), minlength=max_len+1) / len(clean_lengths)\nx = np.arange(max_len + 1)\n\nplt.figure(figsize=(9, 4))\nplt.plot(x, raw_h, marker=\"o\", label=\"Raw\")\nplt.plot(x, clean_h, marker=\"o\", label=\"Repaired\")\nplt.xlabel(\"Gap persistence length (20 = 20+)\")\nplt.ylabel(\"Share of sequences\")\nplt.title(\"Persistence of accounting gaps — Raw vs Repaired\")\nplt.legend()\nplt.tight_layout()\nplt.show()"
2026-03-10 18:45:51 +01:00
},
2026-04-05 17:49:37 +02:00
{
"cell_type": "markdown",
"metadata": {},
"source": "## 5. Figure — Rupture intensity distribution (Before vs After)"
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": "def rupture_distribution(df):\n series = (\n df.groupby(GROUP)\n .agg(n_ruptures=(\"rupture\", \"sum\"), total_obs=(\"rupture\", \"count\"))\n .reset_index()\n )\n series[\"rupture_rate\"] = series[\"n_ruptures\"] / series[\"total_obs\"]\n bins = [0, 0.01, 0.10, 0.30, 1.01]\n labels = [\"Clean (<=1%)\", \"Moderate (1-10%)\", \"High (10-30%)\", \"Severe (>30%)\"]\n series[\"class\"] = pd.cut(series[\"rupture_rate\"], bins=bins, labels=labels, include_lowest=True)\n return (series[\"class\"].value_counts(normalize=True).sort_index() * 100).round(1)\n\ndf_raw_gap = build_gap_df(stocks); df_raw_gap[\"rupture\"] = df_raw_gap[\"gap\"].abs() > 10\ndf_clean_gap = build_gap_df(stocks_repaired); df_clean_gap[\"rupture\"] = df_clean_gap[\"gap\"].abs() > 10\n\ndist_before = rupture_distribution(df_raw_gap)\ndist_after = rupture_distribution(df_clean_gap)\n\nfig = go.Figure()\nfig.add_trace(go.Pie(\n labels=dist_before.index, values=dist_before.values,\n hole=0.45, name=\"Before repair\", domain=dict(x=[0, 0.48]), textinfo=\"percent\"\n))\nfig.add_trace(go.Pie(\n labels=dist_after.index, values=dist_after.values,\n hole=0.45, name=\"After repair\", domain=dict(x=[0.52, 1.0]), textinfo=\"percent\"\n))\nfig.update_layout(\n title=\"Rupture intensity distribution — Before vs After repair\",\n annotations=[\n dict(text=\"Before repair\", x=0.22, y=0.5, showarrow=False),\n dict(text=\"After repair\", x=0.78, y=0.5, showarrow=False)\n ]\n)\nfig.show()"
2026-03-10 18:45:51 +01:00
}
2026-04-05 17:49:37 +02:00
]
}