chore(dbt): address remaining data integrity audit findings #103

Open
opened 2026-02-02 21:02:33 +00:00 by lmiranda · 0 comments
Owner

Overview

Data integrity audit (/data-review ./dbt/) identified several non-blocking issues that should be addressed for improved data quality and maintainability.

Findings

WARN - Should Address

1. Empty fact_amenities table (0 rows)

  • Location: raw_toronto.fact_amenities
  • Impact: mart_neighbourhood_amenities returns NULL values; amenity_score in mart_neighbourhood_overview uses hardcoded placeholder (50.0)
  • Action: Load amenity data from Toronto Open Data (parks, schools, transit stops)

2. Missing relationship tests for foreign keys

  • Location: All staging models with foreign keys
  • Impact: No automated validation of referential integrity
  • Action: Add relationship tests for:
    • stg_cmhc__rentals.zone_keystg_dimensions__cmhc_zones.zone_key
    • stg_cmhc__rentals.date_keystg_dimensions__time.date_key
    • stg_toronto__census.neighbourhood_idstg_toronto__neighbourhoods.neighbourhood_id
    • stg_toronto__crime.neighbourhood_idstg_toronto__neighbourhoods.neighbourhood_id
    • stg_cmhc__zone_crosswalk.neighbourhood_idstg_toronto__neighbourhoods.neighbourhood_id

INFO - Consider Addressing

3. Missing test for is_month_start column

  • Location: dbt/models/shared/_shared.yml
  • Action: Add not_null test to is_month_start column in stg_dimensions__time for consistency

4. Legacy schema remnants

  • Location: Database schemas
  • Schemas: public_intermediate, public_marts, public_shared, public_staging
  • Action: Drop legacy schemas from before domain-scoped migration (verify not in use first)

5. Hardcoded year range in int_year_spine

  • Location: dbt/models/intermediate/toronto/int_year_spine.sql:8
  • Current: generate_series(2014, 2025)
  • Action: Document when to update, or make dynamic using current_date

Acceptance Criteria

  • Amenity data loader implemented and fact_amenities populated
  • Relationship tests added to _staging.yml
  • not_null test added for is_month_start
  • Legacy schemas dropped (after verification)
  • Year spine documented or made dynamic

Technical Notes

  • Amenity data sources: Toronto Open Data Portal (parks, libraries, community centres, transit)
  • Relationship test syntax:
    - relationships:
        to: ref('stg_toronto__neighbourhoods')
        field: neighbourhood_id
    

Labels

  • Type: Chore
  • Priority: Low
  • Complexity: Medium
  • Effort: Large
## Overview Data integrity audit (`/data-review ./dbt/`) identified several non-blocking issues that should be addressed for improved data quality and maintainability. ## Findings ### WARN - Should Address #### 1. Empty `fact_amenities` table (0 rows) - **Location**: `raw_toronto.fact_amenities` - **Impact**: `mart_neighbourhood_amenities` returns NULL values; `amenity_score` in `mart_neighbourhood_overview` uses hardcoded placeholder (50.0) - **Action**: Load amenity data from Toronto Open Data (parks, schools, transit stops) #### 2. Missing relationship tests for foreign keys - **Location**: All staging models with foreign keys - **Impact**: No automated validation of referential integrity - **Action**: Add relationship tests for: - `stg_cmhc__rentals.zone_key` → `stg_dimensions__cmhc_zones.zone_key` - `stg_cmhc__rentals.date_key` → `stg_dimensions__time.date_key` - `stg_toronto__census.neighbourhood_id` → `stg_toronto__neighbourhoods.neighbourhood_id` - `stg_toronto__crime.neighbourhood_id` → `stg_toronto__neighbourhoods.neighbourhood_id` - `stg_cmhc__zone_crosswalk.neighbourhood_id` → `stg_toronto__neighbourhoods.neighbourhood_id` ### INFO - Consider Addressing #### 3. Missing test for `is_month_start` column - **Location**: `dbt/models/shared/_shared.yml` - **Action**: Add `not_null` test to `is_month_start` column in `stg_dimensions__time` for consistency #### 4. Legacy schema remnants - **Location**: Database schemas - **Schemas**: `public_intermediate`, `public_marts`, `public_shared`, `public_staging` - **Action**: Drop legacy schemas from before domain-scoped migration (verify not in use first) #### 5. Hardcoded year range in `int_year_spine` - **Location**: `dbt/models/intermediate/toronto/int_year_spine.sql:8` - **Current**: `generate_series(2014, 2025)` - **Action**: Document when to update, or make dynamic using `current_date` ## Acceptance Criteria - [ ] Amenity data loader implemented and `fact_amenities` populated - [ ] Relationship tests added to `_staging.yml` - [ ] `not_null` test added for `is_month_start` - [ ] Legacy schemas dropped (after verification) - [ ] Year spine documented or made dynamic ## Technical Notes - Amenity data sources: Toronto Open Data Portal (parks, libraries, community centres, transit) - Relationship test syntax: ```yaml - relationships: to: ref('stg_toronto__neighbourhoods') field: neighbourhood_id ``` ## Labels - Type: Chore - Priority: Low - Complexity: Medium - Effort: Large
Sign in to join this conversation.