---
name: data-integrity-audit
description: Rules and patterns for auditing data integrity, schema validity, and dbt compliance
---
# Data Integrity Audit

## Purpose

Defines what "data valid" means for the data-platform domain. This skill is loaded by the data-advisor agent for both review and gate modes, during sprint execution and standalone audits.
## What to Check

| Check Category | What It Validates | MCP Tools Used |
|---|---|---|
| Schema Validity | Tables exist, columns have correct types, constraints present, no orphaned columns | `pg_tables`, `pg_columns`, `pg_schemas` |
| dbt Project Health | Project parses without errors, models compile, tests defined for critical models | `dbt_parse`, `dbt_compile`, `dbt_test`, `dbt_ls` |
| Lineage Integrity | No orphaned models (referenced but missing), no circular dependencies, upstream sources exist | `dbt_lineage`, `dbt_ls` |
| Data Type Consistency | DataFrame dtypes match expected schema, no silent type coercion, date formats consistent | `describe`, `head`, `pg_columns` |
| PostGIS Compliance | Spatial tables have correct SRID, geometry types match expectations, extent is reasonable | `st_tables`, `st_geometry_type`, `st_srid`, `st_extent` |
| Query Safety | SELECT queries used for reads (not raw SQL for mutations), parameterized patterns | Code review - manual pattern check |
## Common Violations

### FAIL-Level Violations (Block Gate)

| Violation | Detection Method | Example |
|---|---|---|
| dbt parse failure | `dbt_parse` returns error | Project YAML invalid, missing ref targets |
| dbt compilation error | `dbt_compile` fails | SQL syntax error, undefined column reference |
| Missing table/column | `pg_tables`, `pg_columns` lookup | Code references `census_raw` but the table doesn't exist |
| Type mismatch | Compare `pg_columns` vs dbt schema | Column is `varchar` in DB but model expects `integer` |
| Broken lineage | `dbt_lineage` shows orphaned refs | Model references `stg_old_format`, which doesn't exist |
| PostGIS SRID mismatch | `st_srid` returns unexpected value | Geometry column has SRID 0 instead of 4326 |
| Unreasonable spatial extent | `st_extent` returns global bbox | Toronto data shows coordinates in China |
### WARN-Level Violations (Report, Don't Block)

| Violation | Detection Method | Example |
|---|---|---|
| Missing dbt tests | `dbt_ls` shows model without tests | `dim_customers` has no `unique` test on `customer_id` |
| Undocumented columns | dbt `schema.yml` missing descriptions | Model columns have no documentation |
| Schema drift | `pg_columns` vs dbt `schema.yml` | Column exists in DB but not in dbt YAML |
| Hardcoded SQL | Scan Python for string concatenation | `f"SELECT * FROM {table}"` without parameterization |
| Orphaned model | `dbt_lineage` shows no downstream | `stg_legacy` has no consumers and no exposure |
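The "Hardcoded SQL" pattern and its fix can be sketched as follows. This is a minimal illustration, using the stdlib `sqlite3` driver so it is runnable anywhere; PostgreSQL drivers such as psycopg2 use `%s` placeholders instead of `?`, but the principle is the same.

```python
import sqlite3

# Illustrative in-memory table so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

customer_id = 1

# Flagged pattern (WARN): the value is spliced into the SQL string,
# which is injection-prone and defeats plan caching.
unsafe = f"SELECT name FROM customers WHERE customer_id = {customer_id}"

# Preferred pattern: placeholder plus parameter tuple; the driver escapes it.
row = conn.execute(
    "SELECT name FROM customers WHERE customer_id = ?", (customer_id,)
).fetchone()
print(row[0])  # Ada
```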
### INFO-Level Violations (Suggestions Only)

| Violation | Detection Method | Example |
|---|---|---|
| Missing indexes | Query pattern suggests need | Frequent filter on a non-indexed column |
| Documentation gaps | dbt docs incomplete | Missing model description |
| Unused models | `dbt_ls` vs actual queries | Model exists but is never selected |
| Optimization opportunity | `describe` shows data patterns | Column has low cardinality, could be an enum |
## Severity Classification
| Severity | When to Apply | Gate Behavior |
|---|---|---|
| FAIL | Broken lineage, models that won't compile, missing tables/columns, data type mismatches that cause runtime errors, invalid SRID | Blocks issue completion |
| WARN | Missing dbt tests, undocumented columns, schema drift, hardcoded SQL, orphaned models | Does NOT block gate, included in review report |
| INFO | Optimization opportunities, documentation gaps, unused models | Review report only |
### Severity Decision Tree

```text
Is the dbt project broken (parse/compile fails)?
  YES -> FAIL
  NO  -> Does code reference non-existent tables/columns?
    YES -> FAIL
    NO  -> Would this cause a runtime error?
      YES -> FAIL
      NO  -> Does it violate data quality standards?
        YES -> WARN
        NO  -> Is it an optimization/documentation suggestion?
          YES -> INFO
          NO  -> Not a violation
```
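The decision tree above can be sketched as a small classifier. The flag names are illustrative only, not part of any defined finding schema:

```python
# Hypothetical finding flags; each branch mirrors one question in the tree.
def classify(finding):
    if finding.get("parse_or_compile_broken"):
        return "FAIL"
    if finding.get("missing_table_or_column"):
        return "FAIL"
    if finding.get("causes_runtime_error"):
        return "FAIL"
    if finding.get("violates_quality_standard"):
        return "WARN"
    if finding.get("optimization_or_docs"):
        return "INFO"
    return None  # not a violation

print(classify({"missing_table_or_column": True}))  # FAIL
print(classify({"optimization_or_docs": True}))     # INFO
```

Order matters: FAIL conditions are checked first, so a finding that is both a quality violation and a compile breaker is classified at the stricter level.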
## Scanning Strategy

### For dbt Projects

1. **Parse validation (ALWAYS FIRST)**
   `dbt_parse` → if it fails, immediate FAIL (the project is broken)
2. **Catalog resources**
   `dbt_ls` → list all models, tests, sources, exposures
3. **Lineage check**
   `dbt_lineage` on changed models → check upstream/downstream integrity
4. **Compilation check**
   `dbt_compile` on changed models → verify SQL renders correctly
5. **Test execution**
   `dbt_test --select <changed_models>` → verify tests pass
6. **Test coverage audit**
   Cross-reference `dbt_ls` tests against the model list → flag models without tests (WARN)
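The scan order above could be scripted roughly as follows, using the dbt CLI as a stand-in for the MCP tools (`dbt_parse` → `dbt parse`, and so on). The injectable `run` hook is an assumption added so the sketch can be exercised without dbt installed:

```python
import subprocess

def scan_dbt(changed_models, run=lambda args: subprocess.run(args, capture_output=True).returncode):
    """Sketch of the dbt scan order; returns a list of FAIL findings."""
    findings = []
    # 1. Parse validation, always first: a failure is an immediate FAIL.
    if run(["dbt", "parse"]) != 0:
        return ["FAIL: dbt parse failed - project is broken"]
    select = ["--select", *changed_models]
    # 4. Compile only the changed models.
    if run(["dbt", "compile", *select]) != 0:
        findings.append("FAIL: compilation error in changed models")
    # 5. Run tests scoped to the changed models.
    if run(["dbt", "test", *select]) != 0:
        findings.append("FAIL: dbt tests failed")
    return findings

# With a stubbed runner that reports a parse failure:
print(scan_dbt(["dim_customers"], run=lambda args: 1))
# ['FAIL: dbt parse failed - project is broken']
```

Steps 2, 3, and 6 (cataloging, lineage, coverage) are omitted here because they produce WARN-level findings rather than gate blockers.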
### For PostgreSQL Schema Changes

1. **Table verification**
   `pg_tables` → verify expected tables exist
2. **Column validation**
   `pg_columns` on affected tables → verify types match expectations
3. **Schema comparison**
   Compare `pg_columns` output against dbt `schema.yml` → flag drift
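The schema comparison in step 3 reduces to two set differences. The input shapes here are illustrative, standing in for parsed `pg_columns` output and a parsed `schema.yml`:

```python
# Drift in one direction is a WARN (column undocumented in dbt);
# drift in the other is a FAIL (model references a missing column).
def schema_drift(db_columns, yml_columns):
    return {
        "in_db_not_in_yml": sorted(set(db_columns) - set(yml_columns)),  # WARN: drift
        "in_yml_not_in_db": sorted(set(yml_columns) - set(db_columns)),  # FAIL: missing column
    }

drift = schema_drift({"id", "name", "created_at"}, {"id", "name", "email"})
print(drift)
# {'in_db_not_in_yml': ['created_at'], 'in_yml_not_in_db': ['email']}
```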
### For PostGIS/Spatial Data

1. **Spatial table scan**
   `st_tables` → list tables with geometry columns
2. **SRID validation**
   `st_srid` → verify the SRID is correct for the expected region. Expected: 4326 (WGS84) for GPS data, local projections for regional data
3. **Geometry type check**
   `st_geometry_type` → verify expected types (Point, Polygon, etc.)
4. **Extent sanity check**
   `st_extent` → verify the bounding box is reasonable for the expected region. Toronto data should be roughly (-79.6 to -79.1, 43.6 to 43.9)
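The extent sanity check in step 4 is a pure bounds comparison once `st_extent` has returned a bounding box. The Toronto box comes from the step above; the 0.5-degree slack is an assumed tolerance, not a specified value:

```python
# Expected region as (xmin, ymin, xmax, ymax) in SRID 4326 (lon/lat).
TORONTO = (-79.6, 43.6, -79.1, 43.9)

def extent_ok(extent, expected=TORONTO, slack=0.5):
    """True if the reported bbox falls inside the expected region (plus slack)."""
    xmin, ymin, xmax, ymax = extent
    exmin, eymin, exmax, eymax = expected
    return (exmin - slack <= xmin and xmax <= exmax + slack
            and eymin - slack <= ymin and ymax <= eymax + slack)

print(extent_ok((-79.5, 43.65, -79.2, 43.85)))  # True
print(extent_ok((113.0, 22.0, 114.5, 23.5)))    # False - coordinates in China
```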
### For DataFrame/pandas Operations

1. **Data quality check**
   `describe` → check for unexpected nulls, type issues, outliers
2. **Structure verification**
   `head` → verify the data structure matches expectations
3. **Memory management**
   `list_data` → verify no stale DataFrames remain from previous failed runs
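The dtype and null checks above can be sketched against an expected schema. This assumes pandas is available; the expected-schema dict shape is illustrative:

```python
import pandas as pd

def audit_frame(df, expected):
    """Compare a DataFrame against {column: dtype_string}; return findings."""
    findings = []
    for col, dtype in expected.items():
        if col not in df.columns:
            findings.append(f"FAIL: missing column {col}")
        elif str(df[col].dtype) != dtype:
            findings.append(f"FAIL: {col} is {df[col].dtype}, expected {dtype}")
        elif df[col].isna().any():
            findings.append(f"WARN: {col} contains nulls")
    return findings

df = pd.DataFrame({"customer_id": [1, 2, None], "name": ["a", "b", "c"]})
print(audit_frame(df, {"customer_id": "int64", "name": "object"}))
# ['FAIL: customer_id is float64, expected int64']
```

Note the silent coercion in the example: one null turned an integer column into `float64`, which is exactly the kind of drift the "Data Type Consistency" check exists to catch.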
### For Python Code (Manual Scan)

1. **SQL injection patterns**
   - Scan for f-strings with table/column names
   - Check for string concatenation in queries
   - Look for `.format()` calls with SQL
2. **Mutation safety**
   - `pg_execute` usage should be intentional, not accidental
   - Verify DELETE/UPDATE statements have WHERE clauses
3. **Credential exposure**
   - No hardcoded connection strings
   - No credentials in code (check for `.env` usage)
## Report Templates

### Gate Mode (Compact)

```text
DATA GATE: PASS

No blocking data integrity violations found.
```

or

```text
DATA GATE: FAIL

Blocking Issues (N):
1. <location> - <violation description>
   Fix: <actionable fix>
2. <location> - <violation description>
   Fix: <actionable fix>

Run /data-review for full audit report.
```
### Review Mode (Detailed)

```text
+----------------------------------------------------------------------+
| DATA-PLATFORM - Data Integrity Audit                                 |
| [Target Path]                                                        |
+----------------------------------------------------------------------+

Target: <scanned path or project>
Scope: N files scanned, N models checked, N tables verified

FINDINGS

FAIL (N)
1. [location] violation description
   Fix: actionable fix
2. [location] violation description
   Fix: actionable fix

WARN (N)
1. [location] warning description
   Suggestion: improvement suggestion
2. [location] warning description
   Suggestion: improvement suggestion

INFO (N)
1. [location] info description
   Note: context

SUMMARY
Schema:  Valid | N issues
Lineage: Intact | N orphaned
dbt:     Passes | N failures
PostGIS: Valid | N issues | Not applicable

VERDICT: PASS | FAIL (N blocking issues)
```
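A sketch of how the compact gate template might be rendered from classified findings; the finding dict shape (`location`/`description`/`fix`) is an assumption, not a defined schema:

```python
def gate_report(failures):
    """Render the gate-mode template from a list of FAIL findings."""
    if not failures:
        return "DATA GATE: PASS\n\nNo blocking data integrity violations found."
    lines = ["DATA GATE: FAIL", "", f"Blocking Issues ({len(failures)}):"]
    for i, f in enumerate(failures, 1):
        lines.append(f"{i}. {f['location']} - {f['description']}")
        lines.append(f"   Fix: {f['fix']}")
    lines.append("")
    lines.append("Run /data-review for full audit report.")
    return "\n".join(lines)

print(gate_report([]))
```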
## Skip Patterns

Do not flag violations in:

- `**/tests/**` - test files may have intentional violations
- `**/__pycache__/**` - compiled files
- `**/fixtures/**` - test fixtures
- `**/.scratch/**` - temporary working files
- Files with a `# noqa: data-audit` comment
- Migration files marked as historical
## Error Handling

| Scenario | Behavior |
|---|---|
| Database not reachable (`pg_connect` fails) | WARN, skip PostgreSQL checks, continue with file-based checks |
| dbt not configured (no `dbt_project.yml`) | Skip dbt checks entirely; not an error |
| No PostGIS tables found | Skip PostGIS checks; not an error |
| MCP tool call fails | Report as WARN with the tool name, continue with remaining checks |
| No data files in scanned path | Report "No data artifacts found" - PASS (nothing to fail) |
| Empty directory | Report "No files found in path" - PASS |
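The "MCP tool call fails" behavior can be implemented by wrapping every tool invocation so a failure becomes a WARN finding instead of aborting the audit. `call` here is a stand-in for whatever actually invokes the MCP tool:

```python
def safe_check(name, call, findings):
    """Run one check; on failure, record a WARN and let the audit continue."""
    try:
        return call()
    except Exception as exc:
        findings.append(f"WARN: tool {name} failed ({exc}); check skipped")
        return None

def failing_call():
    raise ConnectionError("db down")

findings = []
safe_check("pg_tables", failing_call, findings)
print(findings)  # ['WARN: tool pg_tables failed (db down); check skipped']
```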
## Integration Notes

### projman Orchestrator

When called as a domain gate:

1. Orchestrator detects the `Domain/Data` label on the issue
2. Orchestrator identifies changed files
3. Orchestrator invokes `/data-gate <path>`
4. Agent runs a gate mode scan
5. Agent returns PASS/FAIL to the orchestrator
6. Orchestrator decides whether to complete the issue

### Standalone Usage

For manual audits:

1. User runs `/data-review <path>`
2. Agent runs a full review mode scan
3. Agent returns a detailed report with all severity levels
4. User decides on actions