---
name: data-integrity-audit
description: Rules and patterns for auditing data integrity, schema validity, and dbt compliance
---
# Data Integrity Audit

## Purpose

Defines what "data valid" means for the data-platform domain. This skill is loaded by the data-advisor agent for both review and gate modes, during sprint execution and standalone audits.
## What to Check

| Check Category | What It Validates | MCP Tools Used |
|---|---|---|
| Schema Validity | Tables exist, columns have correct types, constraints present, no orphaned columns | `pg_tables`, `pg_columns`, `pg_schemas` |
| dbt Project Health | Project parses without errors, models compile, tests defined for critical models | `dbt_parse`, `dbt_compile`, `dbt_test`, `dbt_ls` |
| Lineage Integrity | No orphaned models (referenced but missing), no circular dependencies, upstream sources exist | `dbt_lineage`, `dbt_ls` |
| Data Type Consistency | DataFrame dtypes match expected schema, no silent type coercion, date formats consistent | `describe`, `head`, `pg_columns` |
| PostGIS Compliance | Spatial tables have correct SRID, geometry types match expectations, extent is reasonable | `st_tables`, `st_geometry_type`, `st_srid`, `st_extent` |
| Query Safety | SELECT queries used for reads (not raw SQL for mutations), parameterized patterns | Code review - manual pattern check |
## Common Violations

### FAIL-Level Violations (Block Gate)

| Violation | Detection Method | Example |
|---|---|---|
| dbt parse failure | `dbt_parse` returns error | Project YAML invalid, missing ref targets |
| dbt compilation error | `dbt_compile` fails | SQL syntax error, undefined column reference |
| Missing table/column | `pg_tables`, `pg_columns` lookup | Code references `census_raw` but the table doesn't exist |
| Type mismatch | Compare `pg_columns` vs dbt schema | Column is `varchar` in DB but model expects `integer` |
| Broken lineage | `dbt_lineage` shows orphaned refs | Model references `stg_old_format`, which doesn't exist |
| PostGIS SRID mismatch | `st_srid` returns unexpected value | Geometry column has SRID 0 instead of 4326 |
| Unreasonable spatial extent | `st_extent` returns global bbox | Toronto data shows coordinates in China |
### WARN-Level Violations (Report, Don't Block)

| Violation | Detection Method | Example |
|---|---|---|
| Missing dbt tests | `dbt_ls` shows model without tests | `dim_customers` has no `unique` test on `customer_id` |
| Undocumented columns | dbt `schema.yml` missing descriptions | Model columns have no documentation |
| Schema drift | `pg_columns` vs dbt `schema.yml` | Column exists in DB but not in dbt YAML |
| Hardcoded SQL | Scan Python for string concatenation | `f"SELECT * FROM {table}"` without parameterization |
| Orphaned model | `dbt_lineage` shows no downstream | `stg_legacy` has no consumers and no exposure |
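The "Hardcoded SQL" pattern and its fix can be sketched as follows. This is a minimal illustration, using the stdlib `sqlite3` driver so it is runnable anywhere; PostgreSQL drivers such as psycopg2 use `%s` placeholders instead of `?`, but the principle is the same.

```python
import sqlite3

# Illustrative in-memory table so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

customer_id = 1

# Flagged pattern (WARN): the value is spliced into the SQL string,
# which is injection-prone and defeats plan caching.
unsafe = f"SELECT name FROM customers WHERE customer_id = {customer_id}"

# Preferred pattern: placeholder plus parameter tuple; the driver escapes it.
row = conn.execute(
    "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
).fetchone()
print(row[0])  # Ada
```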
### INFO-Level Violations (Suggestions Only)

| Violation | Detection Method | Example |
|---|---|---|
| Missing indexes | Query pattern suggests need | Frequent filter on a non-indexed column |
| Documentation gaps | dbt docs incomplete | Missing model description |
| Unused models | `dbt_ls` vs actual queries | Model exists but is never selected |
| Optimization opportunity | `describe` shows data patterns | Column has low cardinality, could be an enum |
## Severity Classification
| Severity | When to Apply | Gate Behavior |
|---|---|---|
| FAIL | Broken lineage, models that won't compile, missing tables/columns, data type mismatches that cause runtime errors, invalid SRID | Blocks issue completion |
| WARN | Missing dbt tests, undocumented columns, schema drift, hardcoded SQL, orphaned models | Does NOT block gate, included in review report |
| INFO | Optimization opportunities, documentation gaps, unused models | Review report only |
### Severity Decision Tree

```text
Is the dbt project broken (parse/compile fails)?
  YES -> FAIL
  NO  -> Does code reference non-existent tables/columns?
    YES -> FAIL
    NO  -> Would this cause a runtime error?
      YES -> FAIL
      NO  -> Does it violate data quality standards?
        YES -> WARN
        NO  -> Is it an optimization/documentation suggestion?
          YES -> INFO
          NO  -> Not a violation
```
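The decision tree above can be sketched as a small classifier. The flag names are illustrative only, not part of any defined finding schema:

```python
# Hypothetical finding flags; each branch mirrors one question in the tree.
def classify(finding):
    if finding.get("parse_or_compile_broken"):
        return "FAIL"
    if finding.get("missing_table_or_column"):
        return "FAIL"
    if finding.get("causes_runtime_error"):
        return "FAIL"
    if finding.get("violates_quality_standard"):
        return "WARN"
    if finding.get("optimization_or_docs"):
        return "INFO"
    return None  # not a violation

print(classify({"missing_table_or_column": True}))  # FAIL
print(classify({"optimization_or_docs": True}))     # INFO
```

Order matters: FAIL conditions are checked first, so a finding that is both a quality violation and a compile breaker is classified at the stricter level.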
## Scanning Strategy

### For dbt Projects

1. **Parse validation (ALWAYS FIRST)**
   `dbt_parse` → if it fails, immediate FAIL (the project is broken)
2. **Catalog resources**
   `dbt_ls` → list all models, tests, sources, exposures
3. **Lineage check**
   `dbt_lineage` on changed models → check upstream/downstream integrity
4. **Compilation check**
   `dbt_compile` on changed models → verify SQL renders correctly
5. **Test execution**
   `dbt_test --select <changed_models>` → verify tests pass
6. **Test coverage audit**
   Cross-reference `dbt_ls` tests against the model list → flag models without tests (WARN)
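The scan order above could be scripted roughly as follows, using the dbt CLI as a stand-in for the MCP tools (`dbt_parse` → `dbt parse`, and so on). The injectable `run` hook is an assumption added so the sketch can be exercised without dbt installed:

```python
import subprocess

def scan_dbt(changed_models, run=lambda args: subprocess.run(args, capture_output=True).returncode):
    """Sketch of the dbt scan order; returns a list of FAIL findings."""
    findings = []
    # 1. Parse validation, always first: a failure is an immediate FAIL.
    if run(["dbt", "parse"]) != 0:
        return ["FAIL: dbt parse failed - project is broken"]
    select = ["--select", *changed_models]
    # 4. Compile only the changed models.
    if run(["dbt", "compile", *select]) != 0:
        findings.append("FAIL: compilation error in changed models")
    # 5. Run tests scoped to the changed models.
    if run(["dbt", "test", *select]) != 0:
        findings.append("FAIL: dbt tests failed")
    return findings

# With a stubbed runner that reports a parse failure:
print(scan_dbt(["dim_customers"], run=lambda args: 1))
# ['FAIL: dbt parse failed - project is broken']
```

Steps 2, 3, and 6 (cataloging, lineage, coverage) are omitted here because they produce WARN-level findings rather than gate blockers.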
### For PostgreSQL Schema Changes

1. **Table verification**
   `pg_tables` → verify expected tables exist
2. **Column validation**
   `pg_columns` on affected tables → verify types match expectations
3. **Schema comparison**
   Compare `pg_columns` output against dbt `schema.yml` → flag drift
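The schema comparison in step 3 reduces to two set differences. The input shapes here are illustrative, standing in for parsed `pg_columns` output and a parsed `schema.yml`:

```python
# Drift in one direction is a WARN (column undocumented in dbt);
# drift in the other is a FAIL (model references a missing column).
def schema_drift(db_columns, yml_columns):
    return {
        "in_db_not_in_yml": sorted(set(db_columns) - set(yml_columns)),  # WARN: drift
        "in_yml_not_in_db": sorted(set(yml_columns) - set(db_columns)),  # FAIL: missing column
    }

drift = schema_drift({"id", "name", "created_at"}, {"id", "name", "email"})
print(drift)
# {'in_db_not_in_yml': ['created_at'], 'in_yml_not_in_db': ['email']}
```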
### For PostGIS/Spatial Data

1. **Spatial table scan**
   `st_tables` → list tables with geometry columns
2. **SRID validation**
   `st_srid` → verify the SRID is correct for the expected region. Expected: 4326 (WGS84) for GPS data, local projections for regional data
3. **Geometry type check**
   `st_geometry_type` → verify expected types (Point, Polygon, etc.)
4. **Extent sanity check**
   `st_extent` → verify the bounding box is reasonable for the expected region. Toronto data should be roughly (-79.6 to -79.1, 43.6 to 43.9)
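The extent sanity check in step 4 is a pure bounds comparison once `st_extent` has returned a bounding box. The Toronto box comes from the step above; the 0.5-degree slack is an assumed tolerance, not a specified value:

```python
# Expected region as (xmin, ymin, xmax, ymax) in SRID 4326 (lon/lat).
TORONTO = (-79.6, 43.6, -79.1, 43.9)

def extent_ok(extent, expected=TORONTO, slack=0.5):
    """True if the reported bbox falls inside the expected region (plus slack)."""
    xmin, ymin, xmax, ymax = extent
    exmin, eymin, exmax, eymax = expected
    return (exmin - slack <= xmin and xmax <= exmax + slack
            and eymin - slack <= ymin and ymax <= eymax + slack)

print(extent_ok((-79.5, 43.65, -79.2, 43.85)))  # True
print(extent_ok((113.0, 22.0, 114.5, 23.5)))    # False - coordinates in China
```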
### For DataFrame/pandas Operations

1. **Data quality check**
   `describe` → check for unexpected nulls, type issues, outliers
2. **Structure verification**
   `head` → verify the data structure matches expectations
3. **Memory management**
   `list_data` → verify no stale DataFrames remain from previous failed runs
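The dtype and null checks above can be sketched against an expected schema. This assumes pandas is available; the expected-schema dict shape is illustrative:

```python
import pandas as pd

def audit_frame(df, expected):
    """Compare a DataFrame against {column: dtype_string}; return findings."""
    findings = []
    for col, dtype in expected.items():
        if col not in df.columns:
            findings.append(f"FAIL: missing column {col}")
        elif str(df[col].dtype) != dtype:
            findings.append(f"FAIL: {col} is {df[col].dtype}, expected {dtype}")
        elif df[col].isna().any():
            findings.append(f"WARN: {col} contains nulls")
    return findings

df = pd.DataFrame({"customer_id": [1, 2, None], "name": ["a", "b", "c"]})
print(audit_frame(df, {"customer_id": "int64", "name": "object"}))
# ['FAIL: customer_id is float64, expected int64']
```

Note the silent coercion in the example: one null turned an integer column into `float64`, which is exactly the kind of drift the "Data Type Consistency" check exists to catch.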
### For Python Code (Manual Scan)

1. **SQL injection patterns**
   - Scan for f-strings with table/column names
   - Check for string concatenation in queries
   - Look for `.format()` calls with SQL
2. **Mutation safety**
   - `pg_execute` usage should be intentional, not accidental
   - Verify DELETE/UPDATE statements have WHERE clauses
3. **Credential exposure**
   - No hardcoded connection strings
   - No credentials in code (check for `.env` usage)
## Report Templates

### Gate Mode (Compact)

```text
DATA GATE: PASS

No blocking data integrity violations found.
```

or

```text
DATA GATE: FAIL

Blocking Issues (N):
1. <location> - <violation description>
   Fix: <actionable fix>
2. <location> - <violation description>
   Fix: <actionable fix>

Run /data-review for full audit report.
```
### Review Mode (Detailed)

```text
+----------------------------------------------------------------------+
| DATA-PLATFORM - Data Integrity Audit                                 |
| [Target Path]                                                        |
+----------------------------------------------------------------------+

Target: <scanned path or project>
Scope: N files scanned, N models checked, N tables verified

FINDINGS

FAIL (N)
1. [location] violation description
   Fix: actionable fix
2. [location] violation description
   Fix: actionable fix

WARN (N)
1. [location] warning description
   Suggestion: improvement suggestion
2. [location] warning description
   Suggestion: improvement suggestion

INFO (N)
1. [location] info description
   Note: context

SUMMARY
Schema:  Valid | N issues
Lineage: Intact | N orphaned
dbt:     Passes | N failures
PostGIS: Valid | N issues | Not applicable

VERDICT: PASS | FAIL (N blocking issues)
```
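A sketch of how the compact gate template might be rendered from classified findings; the finding dict shape (`location`/`description`/`fix`) is an assumption, not a defined schema:

```python
def gate_report(failures):
    """Render the gate-mode template from a list of FAIL findings."""
    if not failures:
        return "DATA GATE: PASS\n\nNo blocking data integrity violations found."
    lines = ["DATA GATE: FAIL", "", f"Blocking Issues ({len(failures)}):"]
    for i, f in enumerate(failures, 1):
        lines.append(f"{i}. {f['location']} - {f['description']}")
        lines.append(f"   Fix: {f['fix']}")
    lines.append("")
    lines.append("Run /data-review for full audit report.")
    return "\n".join(lines)

print(gate_report([]))
```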
## Skip Patterns

Do not flag violations in:

- `**/tests/**` - test files may have intentional violations
- `**/__pycache__/**` - compiled files
- `**/fixtures/**` - test fixtures
- `**/.scratch/**` - temporary working files
- Files with a `# noqa: data-audit` comment
- Migration files marked as historical
## Error Handling

| Scenario | Behavior |
|---|---|
| Database not reachable (`pg_connect` fails) | WARN, skip PostgreSQL checks, continue with file-based checks |
| dbt not configured (no `dbt_project.yml`) | Skip dbt checks entirely; not an error |
| No PostGIS tables found | Skip PostGIS checks; not an error |
| MCP tool call fails | Report as WARN with the tool name, continue with remaining checks |
| No data files in scanned path | Report "No data artifacts found" - PASS (nothing to fail) |
| Empty directory | Report "No files found in path" - PASS |
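The "MCP tool call fails" behavior can be implemented by wrapping every tool invocation so a failure becomes a WARN finding instead of aborting the audit. `call` here is a stand-in for whatever actually invokes the MCP tool:

```python
def safe_check(name, call, findings):
    """Run one check; on failure, record a WARN and let the audit continue."""
    try:
        return call()
    except Exception as exc:
        findings.append(f"WARN: tool {name} failed ({exc}); check skipped")
        return None

def failing_call():
    raise ConnectionError("db down")

findings = []
safe_check("pg_tables", failing_call, findings)
print(findings)  # ['WARN: tool pg_tables failed (db down); check skipped']
```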
## Integration Notes

### projman Orchestrator

When called as a domain gate:

1. Orchestrator detects the `Domain/Data` label on the issue
2. Orchestrator identifies changed files
3. Orchestrator invokes `/data-gate <path>`
4. Agent runs a gate mode scan
5. Agent returns PASS/FAIL to the orchestrator
6. Orchestrator decides whether to complete the issue

### Standalone Usage

For manual audits:

1. User runs `/data-review <path>`
2. Agent runs a full review mode scan
3. Agent returns a detailed report with all severity levels
4. User decides on actions