{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Income vs Safety Scatter Plot\n", "\n", "Explores the correlation between median household income and safety score across Toronto neighbourhoods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Data Reference\n", "\n", "### Source Tables\n", "\n", "| Table | Grain | Key Columns |\n", "|-------|-------|-------------|\n", "| `mart_neighbourhood_overview` | neighbourhood × year | neighbourhood_name, median_household_income, safety_score, population |\n", "\n", "### SQL Query" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import pandas as pd\n", "from sqlalchemy import create_engine\n", "\n", "# Fall back to the local development database when DATABASE_URL is not set\n", "engine = create_engine(os.environ.get('DATABASE_URL', 'postgresql://portfolio:portfolio@localhost:5432/portfolio'))\n", "\n", "# Latest available year only; rows missing income or safety score are excluded\n", "query = \"\"\"\n", "SELECT\n", " neighbourhood_name,\n", " median_household_income,\n", " safety_score,\n", " population,\n", " livability_score,\n", " crime_rate_per_100k\n", "FROM mart_neighbourhood_overview\n", "WHERE year = (SELECT MAX(year) FROM mart_neighbourhood_overview)\n", " AND median_household_income IS NOT NULL\n", " AND safety_score IS NOT NULL\n", "ORDER BY median_household_income DESC\n", "\"\"\"\n", "\n", "df = pd.read_sql(query, engine)\n", "print(f\"Loaded {len(df)} neighbourhoods with income and safety data\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Transformation Steps\n", "\n", "1. Keep only the most recent year and drop rows with null income or safety score (done in the SQL query)\n", "2. Scale income to thousands of dollars for axis readability\n", "3. 
Pass to scatter figure factory with optional trendline" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Scale income to thousands for better axis readability\n", "df['income_thousands'] = df['median_household_income'] / 1000\n", "\n", "# Convert rows to a list of dicts (one per neighbourhood) for the figure factory\n", "data = df.to_dict('records')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sample Output" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "df[['neighbourhood_name', 'median_household_income', 'safety_score', 'crime_rate_per_100k']].head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Data Visualization\n", "\n", "### Figure Factory\n", "\n", "Uses `create_scatter_figure` from `portfolio_app.figures.scatter`.\n", "\n", "**Key Parameters:**\n", "\n", "- `x_column`: 'income_thousands' (median household income in $K)\n", "- `y_column`: 'safety_score' (0-100 percentile rank)\n", "- `name_column`: 'neighbourhood_name' (hover label)\n", "- `size_column`: 'population' (optional, bubble size)\n", "- `trendline`: True (adds OLS regression line)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# Add the directory two levels up to the import path so portfolio_app resolves\n", "sys.path.insert(0, '../..')\n", "\n", "from portfolio_app.figures.scatter import create_scatter_figure\n", "\n", "fig = create_scatter_figure(\n", " data=data,\n", " x_column='income_thousands',\n", " y_column='safety_score',\n", " name_column='neighbourhood_name',\n", " size_column='population',\n", " title='Income vs Safety by Neighbourhood',\n", " x_title='Median Household Income ($K)',\n", " y_title='Safety Score (0-100)',\n", " trendline=True,\n", ")\n", "\n", "fig.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpretation\n", "\n", "This scatter plot shows how safety score varies with income:\n", "\n", "- **Positive correlation**: Higher-income neighbourhoods tend to have 
higher safety scores\n", "- **Bubble size**: Represents population (larger = more people)\n", "- **Trendline**: Orange dashed line shows the overall linear trend\n", "- **Outliers**: Neighbourhoods far from the trendline deviate most from what income alone would predict\n", "  - Above line: Safer than income would predict\n", "  - Below line: Less safe than income would predict" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate the Pearson correlation coefficient (the default method of Series.corr)\n", "correlation = df['median_household_income'].corr(df['safety_score'])\n", "print(f\"Correlation coefficient (Income vs Safety): {correlation:.3f}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 4 }