PK k{X6_python_dowhy_notebook.ipynb{ "cells": [ { "cell_type": "markdown", "id": "969d8f2c-bfac-4c4b-9838-e4d9cc24f25b", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "# Case Study: Effect of Sodium Intake on Blood Pressure\n", "\n", "**Motivation**: \n", "\n", "- [63% of Americans aged over 60 have high blood pressure (>=140mmHg)](https://www.cdc.gov/nchs/products/databriefs/db289.htm)\n", "- [85% of Americans aged over 50 consume more than 2.3g sodium/day](https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6452a1.htm)\n", "- [federal recommendation: less than 2.3g sodium/day](https://www.cdc.gov/salt/about/index.html)\n", "\n", "**Data**:\n", "\n", "- (simulated) epidemiological example taken from [Luque-Fernandez et al. (2018)](https://academic.oup.com/ije/article/48/2/640/5248195)\n", " - we corrected the real-world numbers\n", "- Outcome Y: (systolic) blood pressure\n", "- Treatment T: sodium intake\n", "- Covariates\n", " - W age\n", " - Z amount of protein excreted in urine" ] }, { "attachments": {}, "cell_type": "markdown", "id": "383d188e-30eb-437b-aa1f-d5f84faa3066", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Variables\n", "\n", "|var|type|desc|\n", "|---|---|---|\n", "| w_age | covariate | Age (years) |\n", "| z_prot | covariate | 24-hour excretion of urinary protein (proteinuria) (mg) (🇩🇪 Proteinurie)|\n", "| t_sod | treatment | 24-hour dietary sodium intake (g) |\n", "| y_sbp | outcome | Systolic blood pressure (mmHg) |\n", "\n", "## Causal Mechanisms\n", "\n", "![img](graph.png)" ] }, { "cell_type": "code", "execution_count": null, "id": "2f02d8a9-6dae-407c-aa0e-d9c0adbb7293", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "def generate_data(n, seed):\n", " rng = np.random.default_rng(seed)\n", " # structural equations\n", " age = rng.normal(loc=65, scale=5, size=n) # [years]\n", " sodium = age / 18 + rng.normal(0, 1, n) # [gramm]\n", " sbp = 1.05 * sodium + 2.15 * age + rng.normal(0, 1, n) # [mmHg] # fixed: 2.0 -> 2.15\n", " prot = 2.00 * sbp + 2.80 * sodium + rng.normal(0, 1, n) # [mg]\n", " hypertension = np.where(sbp > 140, 1, 0) # 1 where sbp > 140 mmHg\n", " sodium_upperlimit = np.where(sodium >= 2.3, 1, 0) # 1 where sodium intake >= 2.3g\n", " data = pd.DataFrame({\n", " 'y_sbp': sbp,\n", " 't_sod': sodium,\n", " 'w_age': age,\n", " 'z_prot': prot,\n", " 'hypertension': hypertension,\n", " 'sodium_upperlimit': sodium_upperlimit\n", " })\n", " return data\n", "\n", "sbp_sod_age_prot = [\"y_sbp\", \"t_sod\", \"w_age\", \"z_prot\"]\n", "data = generate_data(n=1000, seed=111)\n", "data" ] }, { "cell_type": "code", "execution_count": null, "id": "e5a6fd31-1914-4cbf-8a27-6c9b534db6f9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "data.describe()" ] }, { "cell_type": "code", "execution_count": null, "id": "894de434-a8ab-476a-a86f-491cf8a976f9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "axes = pd.plotting.scatter_matrix(data[sbp_sod_age_prot], figsize=(10, 10), c='#ff0d57', alpha=0.2, hist_kwds={'color':['#1E88E5']});\n", "for ax in axes.flatten():\n", " ax.xaxis.label.set_rotation(90)\n", " ax.yaxis.label.set_rotation(0)\n", " ax.yaxis.label.set_ha('right')" ] }, { "attachments": {}, "cell_type": "markdown", "id": "34d8b272-43f4-4623-8dbd-5c43e321d6ad", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## Linear Regression\n", "\n", "Model 0: Systolic Blood Pressure in mmHg = $\\beta_0$ + $\\beta_1\\times$ Sodium in g + ε\n", "\n", "Model 1: Systolic Blood Pressure in mmHg = $\\beta_0$ + $\\beta_1\\times$ Sodium in g + $\\beta_2\\times$Age + ε\n", "\n", "Model 2: Systolic Blood Pressure in mmHg = $\\beta_0$ + $\\beta_1\\times$ Sodium in g + $\\beta_2\\times$Age + $\\beta_3\\times$Proteinuria in mg + ε" ] }, { "cell_type": "code", "execution_count": null, "id": "bcf45f31-192f-4f4b-95da-b0d5466a803e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# https://www.statsmodels.org/devel/examples/notebooks/generated/ols.html\n", "import statsmodels.api as sm\n", "from statsmodels.formula.api import ols\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": null, "id": "bd1d1b8e-459c-442e-9c93-2f1b6ace0d9a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Fit the linear regression model\n", "fit0 = ols(\"y_sbp ~ t_sod\", data).fit()\n", "fit1 = ols(\"y_sbp ~ t_sod + w_age\", data).fit()\n", "fit2 = ols(\"y_sbp ~ t_sod + w_age + z_prot\", data).fit()" ] }, { "cell_type": "code", "execution_count": null, "id": "f8bbca21-b8d4-418e-a985-c02a5090b219", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "fit0.params" ] }, { "cell_type": "code", "execution_count": null, "id": "fb42fda0-b435-40f5-97f1-e98f6cfb6da0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "fit1.params" ] }, { "cell_type": "code", "execution_count": null, "id": "ca42b6e8-1eec-487b-859e-ac17e5af68a2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "fit2.params" ] }, { "cell_type": "code", "execution_count": null, "id": "5d48fbae-ce39-4ab4-86db-6fe2c223c76a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "xvalues = data.t_sod\n", "x_line = np.linspace(xvalues.min(), xvalues.max(), 100)\n", "y = data.y_sbp\n", "\n", "# Create a line with the regression coefficients\n", "coeff = fit0.params\n", "y_line0 = coeff.Intercept + coeff.t_sod * x_line\n", "\n", "coeff = fit1.params\n", "y_line1 = coeff.Intercept + coeff.t_sod * x_line + coeff.w_age * data.w_age.mean()\n", "\n", "coeff = fit2.params\n", "y_line2 = coeff.Intercept + coeff.t_sod * x_line + coeff.w_age * data.w_age.mean() + coeff.z_prot * data.z_prot.mean()" ] }, { "cell_type": "code", "execution_count": null, "id": "f17157dc-2e7e-43ca-9497-b16c93afc953", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "fig, ax = plt.subplots(figsize=(8, 6))\n", "# Create a scatter plot of the actual data\n", "ax.scatter(xvalues, y, label='Actual Data', alpha=0.25, s=30, edgecolors='k')\n", "# Plot the regression lines\n", "ax.plot(x_line, y_line0, color='red', label='y_sbp ~ t_sod')\n", "ax.plot(x_line, y_line1, color='purple', label='y_sbp ~ t_sod + w_age')\n", "ax.plot(x_line, y_line2, color='blue', label='y_sbp ~ t_sod + w_age + z_prot')\n", "\n", "# Get the confidence interval\n", "conf_int = fit0.get_prediction(pd.DataFrame({'t_sod': x_line, 'const': 1})).conf_int(alpha=0.05)\n", "ax.fill_between(x_line, conf_int[:, 0], conf_int[:, 1], color='red', alpha=0.2, label=\"95% Confidence interval\")\n", "\n", "# Add labels and title\n", "ax.set_xlabel('t_sod [g]')\n", "ax.set_ylabel('y_sbp [mmHg]')\n", "ax.set_title('Linear Regressions on Systolic Blood Pressure Variables')\n", "ax.legend()\n", "ax.grid(1, ls=':')\n", "\n", "# Show the plot\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "aca3a1f2-cb37-4d55-b324-1091b0560b20", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "fig = sm.graphics.plot_partregress_grid(fit2)\n", "fig.tight_layout(pad=1.0)" ] }, { "cell_type": "markdown", "id": "e3f3de53-2d8f-467e-8756-16ee5053509c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## DoWhy - Graphs" ] }, { "cell_type": "code", "execution_count": null, "id": "6cc43217-a7f6-4a65-8808-791a8534c02a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "import dowhy\n", "from dowhy import CausalModel\n", "\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": null, "id": "dcad8467-8ebb-4122-8801-4fd807bfbe88", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "dowhy.__version__" ] }, { "cell_type": "code", "execution_count": null, "id": "37cb3cce-f0d1-48a1-a43d-46663ec36e52", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# some data, ignore it\n", "w=[i for i in range(10)]\n", "np.random.shuffle(w)\n", "df = pd.DataFrame(data = {'W': w, 'X': range(0,10), 'Y': range(0,100,10), 'A': range(0, 20, 2)})" ] }, { "cell_type": "code", "execution_count": null, "id": "52758e53-f441-4d8d-a335-0832893c018f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# https://www.pywhy.org/dowhy/main/example_notebooks/load_graph_example.html\n", "# With DOT string\n", "model=CausalModel(\n", " data = df,\n", " treatment='X',\n", " outcome='Y',\n", " graph=\"digraph {W -> X;X -> A;A -> Y;W -> Y;}\",\n", " #graph=\"digraph {W -> X;X -> A;Y -> A;W -> Y;}\" # no directed path\n", " #graph=\"digraph {W -> X;X -> A;Y -> A;W -> Y; X->Y}\"\n", ")\n", "model.view_model()" ] }, { "cell_type": "code", "execution_count": null, "id": "8919dc14-79d8-4082-afd0-15042d4c9590", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "identified_estimand = model.identify_effect()\n", "print(identified_estimand)" ] }, { "cell_type": "markdown", "id": "251bff0d-6e54-42ca-af58-637ee2b9e54c", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## DoWhy Case Study" ] }, { "cell_type": "code", "execution_count": null, "id": "e5602137-cfe2-4f4e-9107-12625820dc98", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Define causal model\n", "model = CausalModel(\n", " data = data,\n", " graph = \"\"\"\n", " digraph {\n", " t_sod -> y_sbp;\n", " w_age -> t_sod;\n", " w_age -> y_sbp;\n", " y_sbp -> z_prot;\n", " t_sod -> z_prot;\n", " }\"\"\",\n", " treatment= \"t_sod\",\n", " outcome= \"y_sbp\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "ae850568-f3b4-4b09-9c90-cd8836f4920f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "model.view_model()" ] }, { "cell_type": "code", "execution_count": null, "id": "fd7a778a-e2f2-4c77-9fdb-64c8f7597eeb", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "model.summary()" ] }, { "cell_type": "markdown", "id": "08b634dd-680f-4b03-86a9-776f80908a72", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Identification\n", "\n", " Identification of causal effect is the process of determining whether the effect can be estimated using the available variables’ data.\n", "\n", "[Formally](https://www.pywhy.org/dowhy/v0.10.1/user_guide/causal_tasks/estimating_causal_effects/identifying_causal_effect/index.html): convert target causal effect expression (e.g. $E[Y|do(A)]$) to a form that can be estimated using observed data distribution, i.e., without the do-operator.\n", "\n", "DoWhy provides `identify_effect()` method with optional parameters for estimand types and [identification methods](https://www.pywhy.org/dowhy/v0.10.1/user_guide/causal_tasks/estimating_causal_effects/identifying_causal_effect/index.html). DoWhy complains if there are issues which prevent effect estimation (e.g. cycles in the graph). DoWhy supports the following identification algorithms:\n", "\n", "- Backdoor\n", "- Frontdoor\n", "- Instrumental variable\n", "- ID algorithm" ] }, { "cell_type": "code", "execution_count": null, "id": "76b18b3d-8096-47b2-a384-5e91dffbff8f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Identify the causal effect\n", "identified_estimand = model.identify_effect(method_name=\"id-algorithm\")\n", "print(identified_estimand)" ] }, { "cell_type": "code", "execution_count": null, "id": "e2398bc5-eebb-44ca-8cc3-38fc641680d4", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Identify the causal effect\n", "identified_estimand = model.identify_effect()\n", "print(identified_estimand)" ] }, { "cell_type": "markdown", "id": "1eebaa29-2102-4cf3-abf7-9379c738b456", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "estimand expression:\n", "\n", "- `w_age` as confounder (recap: conditioning on common causes required to avoid unadjusted confounding)\n", "- `t_sod` as treatment\n", "- `y_sbp` as target variable\n", "- `z_prot` is not involved (recap: not conditioning on common effects to avoid collider bias)" ] }, { "cell_type": "markdown", "id": "d56d8c0f-1c34-4797-ae0b-38ffec50a2f2", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Estimation\n", "\n", "Estimation is the \"process of quantifying the target effect using the available data\".\n", "DoWhy has a [number of methods (causal inference)](https://www.pywhy.org/dowhy/v0.11/example_notebooks/dowhy_estimation_methods.html) for estimation (regression, matching, stratification, and weighting estimators). Methods like [inverse probability weighting](https://en.wikipedia.org/wiki/Inverse_probability_weighting) are not restricted to linear relationships.\n", "\n", "\n", "- [Effect estimation with **backdoor**](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_backdoor/index.html) amounts to estimating a conditional probability distribution. Given an action A, an outcome Y and set of backdoor variables W, the causal effect is identified as $\\sum_wE[Y|A,W=w]P(W=w)$.\n", " - Regression-based methods (DoWhy supports generalized linear models, e.g. to fit logistic regression models)\n", " - Distance-based matching (applicable only for binary treatments)\n", " - Propensity-based methods (applicable only for binary treatments)\n", " - Do-sampler / Pearlian inference / Pearlian interventions ([demo](https://www.pywhy.org/dowhy/main/example_notebooks/do_sampler_demo.html))\n", "- [Estimating average causal effect with natural experiments (instrumental variables)](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_natural_experiments/index.html)\n", "- [Estimating conditional average causal effect (with EconML package)](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/estimating_causal_effects/conditional_effect_estimation/index.html)\n", " - another example: [Conditional Average Treatment Effects (CATE)](https://www.pywhy.org/dowhy/main/example_notebooks/dowhy-conditional-treatment-effects.html)\n", "- [Estimating average causal effect using GCM (intervention)](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html)\n", " - GCM: \"graphical causal models\" extension of DoWhy\n", " - estimate such differences in a target node: $\\mathbb{E}[Y | \\text{do}(T:=A)] - \\mathbb{E}[Y | \\text{do}(T:=B)]$" ] }, { "cell_type": "code", "execution_count": null, "id": "e9d686d5-61d1-4967-a039-2def4d670c0e", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# Estimate the causal effect and compare it with Average Treatment Effect\n", "estimate = model.estimate_effect(identified_estimand, \n", " method_name=\"backdoor.linear_regression\", \n", " test_significance=True,\n", " #target_units=\"att\"\n", " )\n", "print(estimate)\n", "print(\"Causal Estimate is \" + str(estimate.value))" ] }, { "cell_type": "code", "execution_count": null, "id": "25f0f431-c470-4259-92f5-4c3ebdba728f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# compare with our linear regression fit from above\n", "fit1.params" ] }, { "cell_type": "code", "execution_count": null, "id": "49605c1f-efb4-4bb5-b709-e46d9525a4b5", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# compare with fit including collider bias\n", "fit2.params" ] }, { "cell_type": "markdown", "id": "e7cef4f1-d40e-499b-93c9-1caf6b637afe", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "The [source](https://academic.oup.com/ije/article/48/2/640/5248195#247003325) performs a Monte-Carlo simulation to estimate the relative collider bias.\n", "$$\\frac{\\mid \\mu_{SOD,true}\\mid - \\mid \\mu_{SOD,bias}\\mid}{\\mid \\mu_{SOD,true}\\mid}$$" ] }, { "cell_type": "code", "execution_count": null, "id": "10dded3c-bc13-4096-977c-bf94a0a41aa9", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# just for comparison the binarisation method as naive way to compute ACE:\n", "# average causal effect (t_sod):\n", "print(f'ACE = {data.query(\"sodium_upperlimit==1\").y_sbp.mean()-data.query(\"sodium_upperlimit==0\").y_sbp.mean()}')" ] }, { "cell_type": "markdown", "id": "2a1cee08-8dde-4f4b-9193-65dbf46a27df", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Refutation / Validation\n", "\n", " Let us now look at ways of refuting the estimate obtained. Refutation methods provide tests that every correct estimator should pass. So if an estimator fails the refutation test (p-value is <0.05), then it means that there is some problem with the estimator.\n", "[Source](https://www.pywhy.org/dowhy/v0.11/example_notebooks/dowhy_simple_example.html)\n", "\n", "Here are the refutation methods from [**refute()** documentation](https://www.pywhy.org/dowhy/v0.8/user_guide/effect_inference/refute.html) as table:\n", "\n", "|||\n", "|---|---|\n", "| **Add Random Common Cause**: | Does the estimation method change its estimate after we add an independent random variable as a common cause to the dataset? (Hint: It should not) |\n", "| **Placebo Treatment**: | What happens to the estimated causal effect when we replace the true treatment variable with an independent random variable? (Hint: the effect should go to zero) |\n", "| Dummy Outcome: | What happens to the estimated causal effect when we replace the true outcome variable with an independent random variable? (Hint: The effect should go to zero) |\n", "| Simulated Outcome: | What happens to the estimated causal effect when we replace the dataset with a simulated dataset based on a known data-generating process closest to the given dataset? (Hint: It should match the effect parameter from the data-generating process) |\n", "| Add Unobserved Common Causes: | How sensitive is the effect estimate when we add an additional common cause (confounder) to the dataset that is correlated with the treatment and the outcome? (Hint: It should not be too sensitive) |\n", "| Data Subsets Validation: | Does the estimated effect change significantly when we replace the given dataset with a randomly selected subset? (Hint: It should not) |\n", "| Bootstrap Validation: | Does the estimated effect change significantly when we replace the given dataset with bootstrapped samples from the same dataset? (Hint: It should not) |\n", "\n", "We only test the first two, but you should generally check for every method. Keep these methods as part of your pipeline. It raises an alert, when an update on the graph or change in the algorithm introduced any issue.\n", "\n", "- **Add Random Common Cause**: Does the estimation method change its estimate after we add an independent random variable as a common cause to the dataset? (Hint: It should not)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "658a118d-84c1-4f7e-800a-3ce4bb232a57", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "refute_results=model.refute_estimate(\n", " identified_estimand, \n", " estimate,\n", " method_name=\"random_common_cause\")\n", "print(refute_results)" ] }, { "cell_type": "markdown", "id": "87145479-9ead-4a06-ba1c-05d5a8704d2a", "metadata": {}, "source": [ "- **Placebo Treatment**: What happens to the estimated causal effect when we replace the true treatment variable with an independent random variable? (Hint: the effect should go to zero)" ] }, { "cell_type": "code", "execution_count": null, "id": "7d57a446-4f9b-4d05-a828-514f43db51a3", "metadata": {}, "outputs": [], "source": [ "refute_results = model.refute_estimate(\n", " identified_estimand, \n", " estimate,\n", " method_name='placebo_treatment_refuter',\n", " placebo_type='permute', \n", " num_simulations=20)\n", "print(refute_results)" ] }, { "attachments": {}, "cell_type": "markdown", "id": "5aa61e3c-eaf1-4aac-98be-047686f8d051", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "## DoWhy GCM - Answering What-If Questions\n", "\n", "[Documentation](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/what_if/index.html)\n", "\n", "### Intervention\n", "\n", "DoWhy Graphical Causal Models (GCM) extension offers methods to answer what-if questions in our graphical causal model. \n", "\n", "[Recap](https://www.pywhy.org/dowhy/v0.9.1/user_guide/gcm_based_inference/answering_causal_questions/computing_counterfactuals.html#understanding-the-method):\n", "\n", " [...] when performing interventions, we look into the future, for counterfactuals we look into an alternative past. To reflect this in the computation, when performing interventions, we generate all noise using our causal models. For counterfactuals, we use the noise from actual observed data.\n", "\n", "Here is an example for [interventions](https://www.pywhy.org/dowhy/main/user_guide/causal_tasks/estimating_causal_effects/effect_estimation_with_gcm.html):\n", "We want to compute the average causal effect of changing the age from 65 to 70.\n", "```\n", "ACE = E[y_sbp | do(w_age := 70), t_sod] - E[y_sbp | do(w_age := 65), t_sod]\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "71bb2e87-a6f6-4884-8b47-9f354295736d", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "data" ] }, { "cell_type": "code", "execution_count": null, "id": "a892af37-9482-42f7-89f9-8a71465841aa", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "model.view_model()" ] }, { "cell_type": "code", "execution_count": null, "id": "dd8b2fb4-6025-43e6-b3c4-d4f76e148f5a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "from dowhy import gcm\n", "from scipy.stats import norm" ] }, { "cell_type": "code", "execution_count": null, "id": "d259b4d3-2512-4e78-99fe-6b5c26f1fc1f", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "causal_model = gcm.ProbabilisticCausalModel(model) # model._graph._graph\n", "gcm.auto.assign_causal_mechanisms(causal_model, data)\n", "gcm.fit(causal_model, data)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb6560c4-f5c9-4ac0-97d0-2dc8ba5fa06a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# intervention\n", "gcm.average_causal_effect(causal_model,\n", " target_node='y_sbp',\n", " interventions_alternative={'w_age': lambda x: 70},\n", " interventions_reference={'w_age': lambda x: 65},\n", " num_samples_to_draw=1000,\n", " # observed_data=... # factual data. If not provided new data is sampled using the causal model\n", " )" ] }, { "cell_type": "markdown", "id": "2742d402-f199-4ddd-8360-2603a85f3b8a", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "source": [ "### Computing Counterfactuals\n", "\n", " I observed a certain outcome z for a variable Z where variable X was set to a value x. What would have happened to the value of Z, had I intervened on X to assign it a different value x’?\n", "\n", "[Recap](https://www.pywhy.org/dowhy/v0.9.1/user_guide/gcm_based_inference/answering_causal_questions/computing_counterfactuals.html#understanding-the-method):\n", "\n", " [...] when performing interventions, we look into the future, for counterfactuals we look into an alternative past. To reflect this in the computation, when performing interventions, we generate all noise using our causal models. For counterfactuals, we use the noise from actual observed data.\n", "\n", "Example: My (observed) values are:\n", "\n", "- Age: 65\n", "- Sodium intake: 3.9g/day\n", "- Systolic blood pressure: 150 mmHg\n", "- Proteinurie: 300mg\n", "\n", "**What would my values be if I only consumed 1.5 g of sodium per day?**" ] }, { "cell_type": "code", "execution_count": null, "id": "54e90747-f5d0-4f80-a247-949cfe75bbd1", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "observed = dict(w_age=[65], t_sod=[3.9], y_sbp=[150], z_prot=[300])" ] }, { "cell_type": "code", "execution_count": null, "id": "48162ec7-c3f8-4459-ad56-e2f057b984ce", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# This will not work if causal_model is not type of InvertibleStructuralCausalModel\n", "# (causal_model is defined as ProbabilisticCausalModel in code above, so this must fail)\n", "gcm.counterfactual_samples(causal_model,\n", " interventions={'t_sod': lambda x: 1.5},\n", " observed_data=pd.DataFrame(data=observed)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "1c6e458e-5b2c-4b0b-aca7-5a2af4d168ed", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "# so redefine our model as InvertibleStructuralCausalModel\n", "causal_model = gcm.InvertibleStructuralCausalModel(graph=model)\n", "# DoWhy does causal mechanism and data fitting \n", "training_data = data[sbp_sod_age_prot]\n", "gcm.auto.assign_causal_mechanisms(causal_model, training_data)\n", "gcm.fit(causal_model, training_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "786f0229-35f4-418e-9630-4c923ee39fd0", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "gcm.counterfactual_samples(causal_model,\n", " interventions={'t_sod': lambda x: 1.5},\n", " observed_data=pd.DataFrame(data=observed)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "id": "01f46f92-b06e-46aa-9de4-8c481f4b6dc7", "metadata": { "editable": true, "slideshow": { "slide_type": "" }, "tags": [] }, "outputs": [], "source": [ "observed" ] }, { "cell_type": "code", "execution_count": null, "id": "21c3375d-66c0-42ee-b51d-26bc6c56086d", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.8" } }, "nbformat": 4, "nbformat_minor": 5 } PKYf3"}"}PK yX6_python_dowhy_notebook.html dowhy

Թ

PK ~PK zX6_python_slides.html slides slides

Թ

PK~j}}PK- k{XYf3"}"}6_python_dowhy_notebook.ipynbPK- yX ~m}6_python_dowhy_notebook.htmlPK- zX~j}}s6_python_slides.htmlPKp%