From 8ed220e0141a1ad5fabb072af06456cdf5370881 Mon Sep 17 00:00:00 2001
From: Leo Miranda
Date: Sun, 11 Jan 2026 18:39:55 +0000
Subject: [PATCH] uploaded initial documentation.

---
 docs/PROJECT_REFERENCE.md | 396 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 396 insertions(+)
 create mode 100644 docs/PROJECT_REFERENCE.md

diff --git a/docs/PROJECT_REFERENCE.md b/docs/PROJECT_REFERENCE.md
new file mode 100644
index 0000000..50b29f5
--- /dev/null
+++ b/docs/PROJECT_REFERENCE.md
@@ -0,0 +1,396 @@
+# Portfolio Project Reference
+
+**Project**: Analytics Portfolio
+**Owner**: Leo
+**Status**: Ready for Sprint 1
+
+---
+
+## Project Overview
+
+Two-project analytics portfolio demonstrating end-to-end data engineering, visualization, and ML capabilities.
+
+| Project | Domain | Key Skills | Phase |
+|---------|--------|------------|-------|
+| **Toronto Housing Dashboard** | Real estate | ETL, dimensional modeling, geospatial, choropleth | Phase 1 (Active) |
+| **Energy Pricing Analysis** | Utility markets | Time series, ML prediction, API integration | Phase 3 (Future) |
+
+**Platform**: Monolithic Dash application on self-hosted VPS (bio landing page + dashboards).
+
+---
+
+## Branching Strategy
+
+| Branch | Purpose | Deploys To |
+|--------|---------|------------|
+| `main` | Production releases only | VPS (production) |
+| `staging` | Pre-production testing | VPS (staging) |
+| `development` | Active development | Local only |
+
+**Rules**:
+- All feature branches created FROM `development`
+- All feature branches merge INTO `development`
+- `development` → `staging` for testing
+- `staging` → `main` for release
+- Direct commits to `main` or `staging` are forbidden
+- Branch naming: `feature/{sprint}-{description}` or `fix/{issue-id}`
+
+---
+
+## Tech Stack (Locked)
+
+| Layer | Technology | Version |
+|-------|------------|---------|
+| Database | PostgreSQL + PostGIS | 16.x |
+| Validation | Pydantic | ≥2.0 |
+| ORM | SQLAlchemy | ≥2.0 (2.0-style API only) |
+| Transformation | dbt-postgres | ≥1.7 |
+| Data Processing | Pandas | ≥2.1 |
+| Geospatial | GeoPandas + Shapely | ≥0.14 |
+| Visualization | Dash + Plotly | ≥2.14 |
+| UI Components | dash-mantine-components | Latest stable |
+| Testing | pytest | ≥7.0 |
+| Python | 3.11+ | Via pyenv |
+
+**Compatibility Notes**:
+- SQLAlchemy 2.0 + Pydantic 2.0 integrate well—never mix 1.x APIs
+- PostGIS extension required—enable during db init
+- Docker Compose V2 (no `version` field in compose files)
+
+---
+
+## Code Conventions
+
+### Import Style
+
+| Context | Style | Example |
+|---------|-------|---------|
+| Same directory | Single dot | `from .trreb import TRREBParser` |
+| Sibling directory | Double dot | `from ..schemas.trreb import TRREBRecord` |
+| External packages | Absolute | `import pandas as pd` |
+
+### Module Separation
+
+| Directory | Contains | Purpose |
+|-----------|----------|---------|
+| `schemas/` | Pydantic models | Data validation |
+| `models/` | SQLAlchemy ORM | Database persistence |
+| `parsers/` | PDF/CSV extraction | Raw data ingestion |
+| `loaders/` | Database operations | Data loading |
+| `figures/` | Chart factories | Plotly figure generation |
+| `callbacks/` | Dash callbacks | Per-dashboard, in `pages/{dashboard}/callbacks/` |
+| `errors/` | Exceptions + handlers | Error handling |
+
+### Code Standards
+
+- **Type hints**: Mandatory, Python 3.10+ style (`list[str]`, `dict[str, int]`, `X | None`)
+- **Functions**: Single responsibility, verb naming, early returns over nesting
+- **Docstrings**: Google style, minimal—only for non-obvious behavior
+- **Constants**: Module-level for magic values, Pydantic BaseSettings for runtime config
+
+### Error Handling
+
+```python
+# errors/exceptions.py
+class PortfolioError(Exception):
+    """Base exception."""
+
+class ParseError(PortfolioError):
+    """PDF/CSV parsing failed."""
+
+class ValidationError(PortfolioError):
+    """Pydantic or business rule validation failed."""
+
+class LoadError(PortfolioError):
+    """Database load operation failed."""
+```
+
+- Decorators for infrastructure concerns (logging, retry, transactions)
+- Explicit handling for domain logic (business rules, recovery strategies)
+
+---
+
+## Application Architecture
+
+### Dash Pages Structure
+
+```
+portfolio_app/
+├── app.py                  # Dash app factory with Pages routing
+├── config.py               # Pydantic BaseSettings
+├── assets/                 # CSS, images (auto-served by Dash)
+├── pages/
+│   ├── home.py             # Bio landing page → /
+│   ├── toronto/
+│   │   ├── dashboard.py    # Layout only → /toronto
+│   │   └── callbacks/      # Interaction logic
+│   └── energy/             # Phase 3
+├── components/             # Shared UI (navbar, footer, cards)
+├── figures/                # Shared chart factories
+├── toronto/                # Toronto data logic
+│   ├── parsers/
+│   ├── loaders/
+│   ├── schemas/            # Pydantic
+│   └── models/             # SQLAlchemy
+└── errors/
+```
+
+### URL Routing (Automatic)
+
+| URL | Page | Status |
+|-----|------|--------|
+| `/` | Bio landing page | Sprint 2 |
+| `/toronto` | Toronto Housing Dashboard | Sprint 6 |
+| `/energy` | Energy Pricing Dashboard | Phase 3 |
+
+---
+
+## Phase 1: Toronto Housing Dashboard
+
+### Data Sources
+
+| Track | Source | Format | Geography | Frequency |
+|-------|--------|--------|-----------|-----------|
+| Purchases | TRREB Monthly Reports | PDF | ~35 Districts | Monthly |
+| Rentals | CMHC Rental Market Survey | CSV | ~20 Zones | Annual |
+| Enrichment | City of Toronto Open Data | GeoJSON/CSV | 158 Neighbourhoods | Census |
+| Policy Events | Curated list | CSV | N/A | Event-based |
+
+### Geographic Reality
+
+```
+┌─────────────────────────────────────────────────────────────────┐
+│ City of Toronto Neighbourhoods (158)                            │ ← Enrichment only
+├─────────────────────────────────────────────────────────────────┤
+│ TRREB Districts (~35) — W01, C01, E01, etc.                     │ ← Purchase data
+├─────────────────────────────────────────────────────────────────┤
+│ CMHC Zones (~20) — Census Tract aligned                         │ ← Rental data
+└─────────────────────────────────────────────────────────────────┘
+```
+
+**Critical**: These geographies do NOT align. Display as separate layers with toggle—do not force crosswalks.
+
+### Data Model (Star Schema)
+
+| Table | Type | Keys |
+|-------|------|------|
+| `fact_purchases` | Fact | → dim_time, dim_trreb_district |
+| `fact_rentals` | Fact | → dim_time, dim_cmhc_zone |
+| `dim_time` | Dimension | date_key (PK) |
+| `dim_trreb_district` | Dimension | district_key (PK), geometry |
+| `dim_cmhc_zone` | Dimension | zone_key (PK), geometry |
+| `dim_neighbourhood` | Dimension | neighbourhood_id (PK), geometry |
+| `dim_policy_event` | Dimension | event_id (PK) |
+
+**V1 Rule**: `dim_neighbourhood` has NO FK to fact tables—reference overlay only.
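
The star schema and the V1 rule above can be sketched as plain DDL. This is a minimal illustration and not part of the committed file: it assumes SQLite (Python's standard library) as a stand-in for PostgreSQL + PostGIS, omits the geometry columns, and invents the non-key columns (`year`, `month`, `district_code`, `zone_name`, `name`, `title`, `avg_price`, `avg_rent`) and both helper functions purely for demonstration. Only the table names and key columns come from the table above.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL + PostGIS; geometry columns are omitted.
# Assumption: every non-key column below is hypothetical and only for illustration.
DDL = """
CREATE TABLE dim_time (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_trreb_district (district_key INTEGER PRIMARY KEY, district_code TEXT);
CREATE TABLE dim_cmhc_zone (zone_key INTEGER PRIMARY KEY, zone_name TEXT);
CREATE TABLE dim_neighbourhood (neighbourhood_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_policy_event (event_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE fact_purchases (
    date_key INTEGER NOT NULL REFERENCES dim_time(date_key),
    district_key INTEGER NOT NULL REFERENCES dim_trreb_district(district_key),
    avg_price REAL
);
CREATE TABLE fact_rentals (
    date_key INTEGER NOT NULL REFERENCES dim_time(date_key),
    zone_key INTEGER NOT NULL REFERENCES dim_cmhc_zone(zone_key),
    avg_rent REAL
);
"""


def build_schema() -> sqlite3.Connection:
    """Create the sketch schema in an in-memory database (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
    conn.executescript(DDL)
    return conn


def referenced_tables(conn: sqlite3.Connection, table: str) -> set[str]:
    """Return the tables that `table` points at through foreign keys."""
    rows = conn.execute(f"PRAGMA foreign_key_list({table})").fetchall()
    return {row[2] for row in rows}  # column 2 holds the referenced table name


if __name__ == "__main__":
    conn = build_schema()
    for fact in ("fact_purchases", "fact_rentals"):
        print(fact, "->", sorted(referenced_tables(conn, fact)))
```

Neither fact table references `dim_neighbourhood`, which is how the sketch expresses the V1 rule: the neighbourhood dimension exists only as an overlay and is never joined to purchase or rental facts.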
+
+### dbt Layer Structure
+
+| Layer | Naming | Purpose |
+|-------|--------|---------|
+| Staging | `stg_{source}__{entity}` | 1:1 source, cleaned, typed |
+| Intermediate | `int_{domain}__{transform}` | Business logic, filtering |
+| Marts | `mart_{domain}` | Final analytical tables |
+
+---
+
+## Sprint Overview
+
+| Sprint | Focus | Milestone |
+|--------|-------|-----------|
+| 1 | Project bootstrap, start TRREB digitization | — |
+| 2 | Bio page, data acquisition | **Launch 1: Bio Live** |
+| 3 | Parsers, schemas, models | — |
+| 4 | Loaders, dbt | — |
+| 5 | Visualization | — |
+| 6 | Polish, deploy dashboard | **Launch 2: Dashboard Live** |
+| 7 | Buffer | — |
+
+### Sprint 1 Deliverables
+
+| Category | Tasks |
+|----------|-------|
+| **Bootstrap** | Git init, pyproject.toml, .env.example, Makefile, CLAUDE.md |
+| **Infrastructure** | Docker Compose (PostgreSQL + PostGIS), scripts/ directory |
+| **App Foundation** | portfolio_app/ structure, config.py, error handling |
+| **Tests** | tests/ directory, conftest.py, pytest config |
+| **Data Acquisition** | Download TRREB PDFs, START boundary digitization (HUMAN task) |
+
+### Human Tasks (Cannot Automate)
+
+| Task | Tool | Effort |
+|------|------|--------|
+| Digitize TRREB district boundaries | QGIS | 3-4 hours |
+| Research policy events (10-20) | Manual research | 2-3 hours |
+| Replace social link placeholders | Manual | 5 minutes |
+
+---
+
+## Scope Boundaries
+
+### Phase 1 — Build These
+
+- Bio landing page with content from bio_content_v2.md
+- TRREB PDF parser
+- CMHC CSV processor
+- PostgreSQL + PostGIS database layer
+- Star schema (facts + dimensions)
+- dbt models with tests
+- Choropleth visualization (Dash)
+- Policy event annotation layer
+- Neighbourhood overlay (toggle-able)
+
+### Phase 1 — Do NOT Build
+
+| Feature | Reason | When |
+|---------|--------|------|
+| `bridge_district_neighbourhood` table | Area-weighted aggregation is Phase 4 | After Energy project |
+| Crime data integration | Deferred scope | Phase 4 |
+| Historical boundary reconciliation (140→158) | 2021+ data only for V1 | Phase 4 |
+| ML prediction models | Energy project scope | Phase 3 |
+| Multi-project shared infrastructure | Build first, abstract second | Phase 2 |
+
+If a task seems to require Phase 3/4 features, **stop and flag it**.
+
+---
+
+## File Structure
+
+### Root-Level Files (Allowed)
+
+| File | Purpose |
+|------|---------|
+| `README.md` | Project overview |
+| `CLAUDE.md` | AI assistant context |
+| `pyproject.toml` | Python packaging |
+| `.gitignore` | Git ignore rules |
+| `.env.example` | Environment template |
+| `.python-version` | pyenv version |
+| `.pre-commit-config.yaml` | Pre-commit hooks |
+| `docker-compose.yml` | Container orchestration |
+| `Makefile` | Task automation |
+
+### Directory Structure
+
+```
+portfolio/
+├── portfolio_app/          # Monolithic Dash application
+│   ├── app.py
+│   ├── config.py
+│   ├── assets/
+│   ├── pages/
+│   ├── components/
+│   ├── figures/
+│   ├── toronto/
+│   └── errors/
+├── tests/
+├── dbt/
+├── data/
+│   └── toronto/
+│       ├── raw/
+│       ├── processed/      # gitignored
+│       └── reference/
+├── scripts/
+│   ├── db/
+│   ├── docker/
+│   ├── deploy/
+│   ├── dbt/
+│   └── dev/
+├── docs/
+├── notebooks/
+├── backups/                # gitignored
+└── reports/                # gitignored
+```
+
+### Gitignored Directories
+
+- `data/*/processed/`
+- `reports/`
+- `backups/`
+- `notebooks/*.html`
+- `.env`
+- `__pycache__/`
+- `.venv/`
+
+---
+
+## Makefile Targets
+
+| Target | Purpose |
+|--------|---------|
+| `setup` | Install deps, create .env, init pre-commit |
+| `docker-up` | Start PostgreSQL + PostGIS |
+| `docker-down` | Stop containers |
+| `db-init` | Initialize database schema |
+| `run` | Start Dash dev server |
+| `test` | Run pytest |
+| `dbt-run` | Run dbt models |
+| `dbt-test` | Run dbt tests |
+| `lint` | Run ruff linter |
+| `format` | Run ruff formatter |
+| `ci` | Run all checks |
+| `deploy` | Deploy to production |
+
+---
+
+## Script Standards
+
+All scripts in `scripts/`:
+- Include usage comments at top
+- Idempotent where possible
+- Exit codes: 0 = success, 1 = error
+- Use `set -euo pipefail` for bash
+- Log to stdout, errors to stderr
+
+---
+
+## Environment Variables
+
+Required in `.env`:
+
+```bash
+DATABASE_URL=postgresql://user:pass@localhost:5432/portfolio
+POSTGRES_USER=portfolio
+POSTGRES_PASSWORD=
+POSTGRES_DB=portfolio
+DASH_DEBUG=true
+SECRET_KEY=
+LOG_LEVEL=INFO
+```
+
+---
+
+## Success Criteria
+
+### Launch 1 (Sprint 2)
+- [ ] Bio page accessible via HTTPS
+- [ ] All bio content rendered (from bio_content_v2.md)
+- [ ] No placeholder text visible
+- [ ] Mobile responsive
+- [ ] Social links functional
+
+### Launch 2 (Sprint 6)
+- [ ] Choropleth renders TRREB districts and CMHC zones
+- [ ] Purchase/rental mode toggle works
+- [ ] Time navigation works
+- [ ] Policy event markers visible
+- [ ] Neighbourhood overlay toggleable
+- [ ] Methodology documentation published
+- [ ] Data sources cited
+
+---
+
+## Reference Documents
+
+For detailed specifications, see:
+
+| Document | Location | Use When |
+|----------|----------|----------|
+| Data schemas | `docs/toronto_housing_spec.md` | Parser/model tasks |
+| WBS details | `docs/wbs.md` | Sprint planning |
+| Bio content | `docs/bio_content.md` | Building home.py |
+
+---
+
+*Reference Version: 1.0*
+*Created: January 2026*
--
2.49.1