Portfolio Project Reference

Project: Analytics Portfolio | Owner: Leo | Status: Ready for Sprint 1


Project Overview

Two-project analytics portfolio demonstrating end-to-end data engineering, visualization, and ML capabilities.

| Project | Domain | Key Skills | Phase |
| --- | --- | --- | --- |
| Toronto Housing Dashboard | Real estate | ETL, dimensional modeling, geospatial, choropleth | Phase 1 (Active) |
| Energy Pricing Analysis | Utility markets | Time series, ML prediction, API integration | Phase 3 (Future) |

Platform: Monolithic Dash application on self-hosted VPS (bio landing page + dashboards).


Branching Strategy

| Branch | Purpose | Deploys To |
| --- | --- | --- |
| main | Production releases only | VPS (production) |
| staging | Pre-production testing | VPS (staging) |
| development | Active development | Local only |

Rules:

  • All feature branches created FROM development
  • All feature branches merge INTO development
  • development → staging for testing
  • staging → main for release
  • Direct commits to main or staging are forbidden
  • Branch naming: feature/{sprint}-{description} or fix/{issue-id}
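
The branching rules above could be checked mechanically, for example in a pre-push hook. The following is a minimal sketch, not part of the project: the exact shape of {sprint}, {description}, and {issue-id} is not specified here, so the patterns below (a "sprintN" prefix, kebab-case descriptions, numeric issue IDs) are assumptions.

```python
import re

# ASSUMPTION: token shapes are not defined in this document.
#   {sprint}      -> "sprint" followed by digits, e.g. "sprint1"
#   {description} -> lowercase kebab-case
#   {issue-id}    -> digits only
FEATURE_RE = re.compile(r"^feature/sprint\d+-[a-z0-9]+(?:-[a-z0-9]+)*$")
FIX_RE = re.compile(r"^fix/\d+$")
LONG_LIVED = frozenset({"main", "staging", "development"})


def is_valid_branch_name(name: str) -> bool:
    """Return True for long-lived branches and well-formed feature/fix branches."""
    if name in LONG_LIVED:
        return True
    return bool(FEATURE_RE.match(name) or FIX_RE.match(name))


def is_allowed_merge(source: str, target: str) -> bool:
    """Return True if merging source into target follows the branching rules."""
    if source == "development":
        return target == "staging"
    if source == "staging":
        return target == "main"
    if FEATURE_RE.match(source) or FIX_RE.match(source):
        return target == "development"
    return False
```

For instance, is_allowed_merge("feature/sprint1-docker-compose", "main") returns False because feature branches may only merge into development.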

Tech Stack (Locked)

| Layer | Technology | Version |
| --- | --- | --- |
| Database | PostgreSQL + PostGIS | 16.x |
| Validation | Pydantic | ≥2.0 |
| ORM | SQLAlchemy | ≥2.0 (2.0-style API only) |
| Transformation | dbt-postgres | ≥1.7 |
| Data Processing | Pandas | ≥2.1 |
| Geospatial | GeoPandas + Shapely | ≥0.14 |
| Visualization | Dash + Plotly | ≥2.14 |
| UI Components | dash-mantine-components | Latest stable |
| Testing | pytest | ≥7.0 |
| Python | 3.11+ | Via pyenv |

Compatibility Notes:

  • SQLAlchemy 2.0 + Pydantic 2.0 integrate well—never mix 1.x APIs
  • PostGIS extension required—enable during db init
  • Docker Compose V2 (no version field in compose files)

Code Conventions

Import Style

| Context | Style | Example |
| --- | --- | --- |
| Same directory | Single dot | from .trreb import TRREBParser |
| Sibling directory | Double dot | from ..schemas.trreb import TRREBRecord |
| External packages | Absolute | import pandas as pd |

Module Separation

| Directory | Contains | Purpose |
| --- | --- | --- |
| schemas/ | Pydantic models | Data validation |
| models/ | SQLAlchemy ORM | Database persistence |
| parsers/ | PDF/CSV extraction | Raw data ingestion |
| loaders/ | Database operations | Data loading |
| figures/ | Chart factories | Plotly figure generation |
| callbacks/ | Dash callbacks | Per-dashboard, in pages/{dashboard}/callbacks/ |
| errors/ | Exceptions + handlers | Error handling |

Code Standards

  • Type hints: Mandatory, Python 3.10+ style (list[str], dict[str, int], X | None)
  • Functions: Single responsibility, verb naming, early returns over nesting
  • Docstrings: Google style, minimal—only for non-obvious behavior
  • Constants: Module-level for magic values; BaseSettings for runtime config (in Pydantic 2.x, BaseSettings is provided by the separate pydantic-settings package)

Error Handling

# errors/exceptions.py
class PortfolioError(Exception):
    """Base exception."""

class ParseError(PortfolioError):
    """PDF/CSV parsing failed."""

class ValidationError(PortfolioError):
    """Pydantic or business rule validation failed."""

class LoadError(PortfolioError):
    """Database load operation failed."""

  • Decorators for infrastructure concerns (logging, retry, transactions)
  • Explicit handling for domain logic (business rules, recovery strategies)

Application Architecture

Dash Pages Structure

portfolio_app/
├── app.py                    # Dash app factory with Pages routing
├── config.py                 # Pydantic BaseSettings
├── assets/                   # CSS, images (auto-served by Dash)
├── pages/
│   ├── home.py              # Bio landing page → /
│   ├── toronto/
│   │   ├── dashboard.py     # Layout only → /toronto
│   │   └── callbacks/       # Interaction logic
│   └── energy/              # Phase 3
├── components/              # Shared UI (navbar, footer, cards)
├── figures/                 # Shared chart factories
├── toronto/                 # Toronto data logic
│   ├── parsers/
│   ├── loaders/
│   ├── schemas/             # Pydantic
│   └── models/              # SQLAlchemy
└── errors/

URL Routing (Automatic)

| URL | Page | Status |
| --- | --- | --- |
| / | Bio landing page | Sprint 2 |
| /toronto | Toronto Housing Dashboard | Sprint 6 |
| /energy | Energy Pricing Dashboard | Phase 3 |

Phase 1: Toronto Housing Dashboard

Data Sources

| Track | Source | Format | Geography | Frequency |
| --- | --- | --- | --- | --- |
| Purchases | TRREB Monthly Reports | PDF | ~35 Districts | Monthly |
| Rentals | CMHC Rental Market Survey | CSV | ~20 Zones | Annual |
| Enrichment | City of Toronto Open Data | GeoJSON/CSV | 158 Neighbourhoods | Census |
| Policy Events | Curated list | CSV | N/A | Event-based |

Geographic Reality

┌─────────────────────────────────────────────────────────────────┐
│ City of Toronto Neighbourhoods (158)                            │ ← Enrichment only
├─────────────────────────────────────────────────────────────────┤
│ TRREB Districts (~35) — W01, C01, E01, etc.                     │ ← Purchase data
├─────────────────────────────────────────────────────────────────┤
│ CMHC Zones (~20) — Census Tract aligned                         │ ← Rental data
└─────────────────────────────────────────────────────────────────┘

Critical: These geographies do NOT align. Display as separate layers with toggle—do not force crosswalks.
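
One way to keep the geographies separate is to treat each as its own map layer and select layers by mode, rather than joining them. This is a minimal sketch under that assumption; the Mode enum, the layer identifiers, and visible_layers are hypothetical names, not part of the project.

```python
from enum import Enum


class Mode(Enum):
    PURCHASE = "purchase"
    RENTAL = "rental"


# Hypothetical layer identifiers: exactly one base layer per data track,
# so purchase and rental geographies are never combined.
BASE_LAYER: dict[Mode, str] = {
    Mode.PURCHASE: "trreb_districts",
    Mode.RENTAL: "cmhc_zones",
}
NEIGHBOURHOOD_OVERLAY = "neighbourhoods"


def visible_layers(mode: Mode, show_neighbourhoods: bool = False) -> list[str]:
    """Return the layers to draw, base layer first, optional overlay last."""
    layers = [BASE_LAYER[mode]]
    if show_neighbourhoods:
        layers.append(NEIGHBOURHOOD_OVERLAY)
    return layers
```

Because the overlay is only ever appended on top of a base layer, no crosswalk between the three geographies is needed.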

Data Model (Star Schema)

| Table | Type | Keys |
| --- | --- | --- |
| fact_purchases | Fact | → dim_time, dim_trreb_district |
| fact_rentals | Fact | → dim_time, dim_cmhc_zone |
| dim_time | Dimension | date_key (PK) |
| dim_trreb_district | Dimension | district_key (PK), geometry |
| dim_cmhc_zone | Dimension | zone_key (PK), geometry |
| dim_neighbourhood | Dimension | neighbourhood_id (PK), geometry |
| dim_policy_event | Dimension | event_id (PK) |

V1 Rule: dim_neighbourhood has NO FK to fact tables—reference overlay only.
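
The key relationships above can be sketched with an in-memory database. Note the assumptions: SQLite from the standard library stands in for PostgreSQL + PostGIS, geometry columns are plain TEXT placeholders, and measure columns are omitted because they are not specified here. Only the table and key names come from the table above.

```python
import sqlite3

# Keys only. Measures and real PostGIS geometry types are intentionally omitted.
DDL = """
CREATE TABLE dim_time (date_key INTEGER PRIMARY KEY);
CREATE TABLE dim_trreb_district (district_key INTEGER PRIMARY KEY, geometry TEXT);
CREATE TABLE dim_cmhc_zone (zone_key INTEGER PRIMARY KEY, geometry TEXT);
CREATE TABLE dim_neighbourhood (neighbourhood_id INTEGER PRIMARY KEY, geometry TEXT);
CREATE TABLE dim_policy_event (event_id INTEGER PRIMARY KEY);
CREATE TABLE fact_purchases (
    date_key INTEGER NOT NULL REFERENCES dim_time (date_key),
    district_key INTEGER NOT NULL REFERENCES dim_trreb_district (district_key)
);
CREATE TABLE fact_rentals (
    date_key INTEGER NOT NULL REFERENCES dim_time (date_key),
    zone_key INTEGER NOT NULL REFERENCES dim_cmhc_zone (zone_key)
);
"""


def build_schema() -> sqlite3.Connection:
    """Create the star schema in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    conn.executescript(DDL)
    return conn


def referenced_tables(conn: sqlite3.Connection, table: str) -> set[str]:
    """Return the names of the tables that the given table has foreign keys to."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {row[2] for row in rows}  # column 2 is the referenced table name
```

Calling referenced_tables on either fact table never yields dim_neighbourhood, which is the V1 rule expressed as a checkable property.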

dbt Layer Structure

| Layer | Naming | Purpose |
| --- | --- | --- |
| Staging | stg_{source}__{entity} | 1:1 source, cleaned, typed |
| Intermediate | int_{domain}__{transform} | Business logic, filtering |
| Marts | mart_{domain} | Final analytical tables |

Sprint Overview

| Sprint | Focus | Milestone |
| --- | --- | --- |
| 1 | Project bootstrap, start TRREB digitization | |
| 2 | Bio page, data acquisition | Launch 1: Bio Live |
| 3 | Parsers, schemas, models | |
| 4 | Loaders, dbt | |
| 5 | Visualization | |
| 6 | Polish, deploy dashboard | Launch 2: Dashboard Live |
| 7 | Buffer | |

Sprint 1 Deliverables

| Category | Tasks |
| --- | --- |
| Bootstrap | Git init, pyproject.toml, .env.example, Makefile, CLAUDE.md |
| Infrastructure | Docker Compose (PostgreSQL + PostGIS), scripts/ directory |
| App Foundation | portfolio_app/ structure, config.py, error handling |
| Tests | tests/ directory, conftest.py, pytest config |
| Data Acquisition | Download TRREB PDFs, START boundary digitization (HUMAN task) |

Human Tasks (Cannot Automate)

| Task | Tool | Effort |
| --- | --- | --- |
| Digitize TRREB district boundaries | QGIS | 3-4 hours |
| Research policy events (10-20) | Manual research | 2-3 hours |
| Replace social link placeholders | Manual | 5 minutes |

Scope Boundaries

Phase 1 — Build These

  • Bio landing page with content from bio_content_v2.md
  • TRREB PDF parser
  • CMHC CSV processor
  • PostgreSQL + PostGIS database layer
  • Star schema (facts + dimensions)
  • dbt models with tests
  • Choropleth visualization (Dash)
  • Policy event annotation layer
  • Neighbourhood overlay (toggle-able)

Phase 1 — Do NOT Build

| Feature | Reason | When |
| --- | --- | --- |
| bridge_district_neighbourhood table | Area-weighted aggregation is Phase 4 | After Energy project |
| Crime data integration | Deferred scope | Phase 4 |
| Historical boundary reconciliation (140→158) | 2021+ data only for V1 | Phase 4 |
| ML prediction models | Energy project scope | Phase 3 |
| Multi-project shared infrastructure | Build first, abstract second | Phase 2 |

If a task seems to require Phase 3/4 features, stop and flag it.


File Structure

Root-Level Files (Allowed)

| File | Purpose |
| --- | --- |
| README.md | Project overview |
| CLAUDE.md | AI assistant context |
| pyproject.toml | Python packaging |
| .gitignore | Git ignore rules |
| .env.example | Environment template |
| .python-version | pyenv version |
| .pre-commit-config.yaml | Pre-commit hooks |
| docker-compose.yml | Container orchestration |
| Makefile | Task automation |

Directory Structure

portfolio/
├── portfolio_app/           # Monolithic Dash application
│   ├── app.py
│   ├── config.py
│   ├── assets/
│   ├── pages/
│   ├── components/
│   ├── figures/
│   ├── toronto/
│   └── errors/
├── tests/
├── dbt/
├── data/
│   └── toronto/
│       ├── raw/
│       ├── processed/       # gitignored
│       └── reference/
├── scripts/
│   ├── db/
│   ├── docker/
│   ├── deploy/
│   ├── dbt/
│   └── dev/
├── docs/
├── notebooks/
├── backups/                 # gitignored
└── reports/                 # gitignored

Gitignored Directories

  • data/*/processed/
  • reports/
  • backups/
  • notebooks/*.html
  • .env
  • __pycache__/
  • .venv/

Makefile Targets

| Target | Purpose |
| --- | --- |
| setup | Install deps, create .env, init pre-commit |
| docker-up | Start PostgreSQL + PostGIS |
| docker-down | Stop containers |
| db-init | Initialize database schema |
| run | Start Dash dev server |
| test | Run pytest |
| dbt-run | Run dbt models |
| dbt-test | Run dbt tests |
| lint | Run ruff linter |
| format | Run ruff formatter |
| ci | Run all checks |
| deploy | Deploy to production |

Script Standards

All scripts in scripts/:

  • Include usage comments at top
  • Idempotent where possible
  • Exit codes: 0 = success, 1 = error
  • Use set -euo pipefail for bash
  • Log to stdout, errors to stderr
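
For Python scripts, the same standards could be met along the lines of the sketch below. The script name scripts/dev/ensure_dirs.py and its behaviour are hypothetical, chosen only to show a usage comment, idempotency, exit codes, and the stdout/stderr split; the directories it creates are the gitignored ones listed earlier.

```python
"""Usage: python scripts/dev/ensure_dirs.py [ROOT]

Create the gitignored working directories under ROOT (default: current directory).
Safe to re-run: existing directories are left untouched.
"""
import sys
from pathlib import Path

WORK_DIRS = ("data/toronto/processed", "reports", "backups")


def ensure_dirs(root: Path) -> list[Path]:
    """Create missing working directories and return the ones created."""
    created: list[Path] = []
    for rel in WORK_DIRS:
        path = root / rel
        if path.is_dir():
            continue  # idempotent: nothing to do for existing directories
        path.mkdir(parents=True)
        created.append(path)
    return created


def main(argv: list[str]) -> int:
    """Return 0 on success and 1 on error, per the script standards."""
    root = Path(argv[0]) if argv else Path.cwd()
    try:
        created = ensure_dirs(root)
    except OSError as exc:
        print(f"error: {exc}", file=sys.stderr)  # errors go to stderr
        return 1
    for path in created:
        print(f"created {path}")  # normal progress goes to stdout
    print(f"ok: {len(WORK_DIRS)} directories present")
    return 0
```

When saved as a real script, the file would end by passing main's return value to the interpreter as the exit status, for example with raise SystemExit(main(sys.argv[1:])) under the usual __main__ guard.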

Environment Variables

Required in .env:

DATABASE_URL=postgresql://user:pass@localhost:5432/portfolio
POSTGRES_USER=portfolio
POSTGRES_PASSWORD=<secure>
POSTGRES_DB=portfolio
DASH_DEBUG=true
SECRET_KEY=<random>
LOG_LEVEL=INFO

Success Criteria

Launch 1 (Sprint 2)

  • Bio page accessible via HTTPS
  • All bio content rendered (from bio_content_v2.md)
  • No placeholder text visible
  • Mobile responsive
  • Social links functional

Launch 2 (Sprint 6)

  • Choropleth renders TRREB districts and CMHC zones
  • Purchase/rental mode toggle works
  • Time navigation works
  • Policy event markers visible
  • Neighbourhood overlay toggleable
  • Methodology documentation published
  • Data sources cited

Reference Documents

For detailed specifications, see:

| Document | Location | Use When |
| --- | --- | --- |
| Data schemas | docs/toronto_housing_spec.md | Parser/model tasks |
| WBS details | docs/wbs.md | Sprint planning |
| Bio content | docs/bio_content_v2.md | Building home.py |

Reference Version: 1.0 | Created: January 2026