Merge feat/373-data-integrity-audit-skill
This commit is contained in:
307
plugins/data-platform/skills/data-integrity-audit.md
Normal file
307
plugins/data-platform/skills/data-integrity-audit.md
Normal file
@@ -0,0 +1,307 @@
|
|||||||
|
---
|
||||||
|
name: data-integrity-audit
|
||||||
|
description: Rules and patterns for auditing data integrity, schema validity, and dbt compliance
|
||||||
|
---
|
||||||
|
|
||||||
|
# Data Integrity Audit
|
||||||
|
|
||||||
|
## Purpose
|
||||||
|
|
||||||
|
Defines what "data valid" means for the data-platform domain. This skill is loaded by the `data-advisor` agent for both review and gate modes during sprint execution and standalone audits.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## What to Check
|
||||||
|
|
||||||
|
| Check Category | What It Validates | MCP Tools Used |
|
||||||
|
|----------------|-------------------|----------------|
|
||||||
|
| **Schema Validity** | Tables exist, columns have correct types, constraints present, no orphaned columns | `pg_tables`, `pg_columns`, `pg_schemas` |
|
||||||
|
| **dbt Project Health** | Project parses without errors, models compile, tests defined for critical models | `dbt_parse`, `dbt_compile`, `dbt_test`, `dbt_ls` |
|
||||||
|
| **Lineage Integrity** | No orphaned models (referenced but missing), no circular dependencies, upstream sources exist | `dbt_lineage`, `dbt_ls` |
|
||||||
|
| **Data Type Consistency** | DataFrame dtypes match expected schema, no silent type coercion, date formats consistent | `describe`, `head`, `pg_columns` |
|
||||||
|
| **PostGIS Compliance** | Spatial tables have correct SRID, geometry types match expectations, extent is reasonable | `st_tables`, `st_geometry_type`, `st_srid`, `st_extent` |
|
||||||
|
| **Query Safety** | SELECT queries used for reads (not raw SQL for mutations), parameterized patterns | Code review - manual pattern check |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Common Violations
|
||||||
|
|
||||||
|
### FAIL-Level Violations (Block Gate)
|
||||||
|
|
||||||
|
| Violation | Detection Method | Example |
|
||||||
|
|-----------|-----------------|---------|
|
||||||
|
| dbt parse failure | `dbt_parse` returns error | Project YAML invalid, missing ref targets |
|
||||||
|
| dbt compilation error | `dbt_compile` fails | SQL syntax error, undefined column reference |
|
||||||
|
| Missing table/column | `pg_tables`, `pg_columns` lookup | Code references `census_raw` but table doesn't exist |
|
||||||
|
| Type mismatch | Compare `pg_columns` vs dbt schema | Column is `varchar` in DB but model expects `integer` |
|
||||||
|
| Broken lineage | `dbt_lineage` shows orphaned refs | Model references `stg_old_format` which doesn't exist |
|
||||||
|
| PostGIS SRID mismatch | `st_srid` returns unexpected value | Geometry column has SRID 0 instead of 4326 |
|
||||||
|
| Unreasonable spatial extent | `st_extent` returns global bbox | Toronto data shows coordinates in China |
|
||||||
|
|
||||||
|
### WARN-Level Violations (Report, Don't Block)
|
||||||
|
|
||||||
|
| Violation | Detection Method | Example |
|
||||||
|
|-----------|-----------------|---------|
|
||||||
|
| Missing dbt tests | `dbt_ls` shows model without test | `dim_customers` has no `unique` test on `customer_id` |
|
||||||
|
| Undocumented columns | dbt schema.yml missing descriptions | Model columns have no documentation |
|
||||||
|
| Schema drift | `pg_columns` vs dbt schema.yml | Column exists in DB but not in dbt YAML |
|
||||||
|
| Hardcoded SQL | Scan Python for string concatenation | `f"SELECT * FROM {table}"` without parameterization |
|
||||||
|
| Orphaned model | `dbt_lineage` shows no downstream | `stg_legacy` has no consumers and no exposure |
|
||||||
|
|
||||||
|
### INFO-Level Violations (Suggestions Only)
|
||||||
|
|
||||||
|
| Violation | Detection Method | Example |
|
||||||
|
|-----------|-----------------|---------|
|
||||||
|
| Missing indexes | Query pattern suggests need | Frequent filter on non-indexed column |
|
||||||
|
| Documentation gaps | dbt docs incomplete | Missing model description |
|
||||||
|
| Unused models | `dbt_ls` vs actual queries | Model exists but never selected |
|
||||||
|
| Optimization opportunity | `describe` shows data patterns | Column has low cardinality, could be enum |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Severity Classification
|
||||||
|
|
||||||
|
| Severity | When to Apply | Gate Behavior |
|
||||||
|
|----------|--------------|---------------|
|
||||||
|
| **FAIL** | Broken lineage, models that won't compile, missing tables/columns, data type mismatches that cause runtime errors, invalid SRID | Blocks issue completion |
|
||||||
|
| **WARN** | Missing dbt tests, undocumented columns, schema drift, hardcoded SQL, orphaned models | Does NOT block gate, included in review report |
|
||||||
|
| **INFO** | Optimization opportunities, documentation gaps, unused models | Review report only |
|
||||||
|
|
||||||
|
### Severity Decision Tree
|
||||||
|
|
||||||
|
```
|
||||||
|
Is the dbt project broken (parse/compile fails)?
|
||||||
|
YES -> FAIL
|
||||||
|
NO -> Does code reference non-existent tables/columns?
|
||||||
|
YES -> FAIL
|
||||||
|
NO -> Would this cause a runtime error?
|
||||||
|
YES -> FAIL
|
||||||
|
NO -> Does it violate data quality standards?
|
||||||
|
YES -> WARN
|
||||||
|
NO -> Is it an optimization/documentation suggestion?
|
||||||
|
YES -> INFO
|
||||||
|
NO -> Not a violation
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scanning Strategy
|
||||||
|
|
||||||
|
### For dbt Projects
|
||||||
|
|
||||||
|
1. **Parse validation** (ALWAYS FIRST)
|
||||||
|
```
|
||||||
|
dbt_parse → if fails, immediate FAIL (project is broken)
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Catalog resources**
|
||||||
|
```
|
||||||
|
dbt_ls → list all models, tests, sources, exposures
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Lineage check**
|
||||||
|
```
|
||||||
|
dbt_lineage on changed models → check upstream/downstream integrity
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Compilation check**
|
||||||
|
```
|
||||||
|
dbt_compile on changed models → verify SQL renders correctly
|
||||||
|
```
|
||||||
|
|
||||||
|
5. **Test execution**
|
||||||
|
```
|
||||||
|
dbt_test --select <changed_models> → verify tests pass
|
||||||
|
```
|
||||||
|
|
||||||
|
6. **Test coverage audit**
|
||||||
|
```
|
||||||
|
Cross-reference dbt_ls tests against model list → flag models without tests (WARN)
|
||||||
|
```
|
||||||
|
|
||||||
|
### For PostgreSQL Schema Changes
|
||||||
|
|
||||||
|
1. **Table verification**
|
||||||
|
```
|
||||||
|
pg_tables → verify expected tables exist
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Column validation**
|
||||||
|
```
|
||||||
|
pg_columns on affected tables → verify types match expectations
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Schema comparison**
|
||||||
|
```
|
||||||
|
Compare pg_columns output against dbt schema.yml → flag drift
|
||||||
|
```
|
||||||
|
|
||||||
|
### For PostGIS/Spatial Data
|
||||||
|
|
||||||
|
1. **Spatial table scan**
|
||||||
|
```
|
||||||
|
st_tables → list tables with geometry columns
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **SRID validation**
|
||||||
|
```
|
||||||
|
st_srid → verify SRID is correct for expected region
|
||||||
|
Expected: 4326 (WGS84) for GPS data, local projections for regional data
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Geometry type check**
|
||||||
|
```
|
||||||
|
st_geometry_type → verify expected types (Point, Polygon, etc.)
|
||||||
|
```
|
||||||
|
|
||||||
|
4. **Extent sanity check**
|
||||||
|
```
|
||||||
|
st_extent → verify bounding box is reasonable for expected region
|
||||||
|
Toronto data should be ~(-79.6 to -79.1, 43.6 to 43.9)
|
||||||
|
```
|
||||||
|
|
||||||
|
### For DataFrame/pandas Operations
|
||||||
|
|
||||||
|
1. **Data quality check**
|
||||||
|
```
|
||||||
|
describe → check for unexpected nulls, type issues, outliers
|
||||||
|
```
|
||||||
|
|
||||||
|
2. **Structure verification**
|
||||||
|
```
|
||||||
|
head → verify data structure matches expectations
|
||||||
|
```
|
||||||
|
|
||||||
|
3. **Memory management**
|
||||||
|
```
|
||||||
|
list_data → verify no stale DataFrames from previous failed runs
|
||||||
|
```
|
||||||
|
|
||||||
|
### For Python Code (Manual Scan)
|
||||||
|
|
||||||
|
1. **SQL injection patterns**
|
||||||
|
- Scan for f-strings with table/column names
|
||||||
|
- Check for string concatenation in queries
|
||||||
|
- Look for `.format()` calls with SQL
|
||||||
|
|
||||||
|
2. **Mutation safety**
|
||||||
|
- `pg_execute` usage should be intentional, not accidental
|
||||||
|
- Verify DELETE/UPDATE have WHERE clauses
|
||||||
|
|
||||||
|
3. **Credential exposure**
|
||||||
|
- No hardcoded connection strings
|
||||||
|
- No credentials in code (check for `.env` usage)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Report Templates
|
||||||
|
|
||||||
|
### Gate Mode (Compact)
|
||||||
|
|
||||||
|
```
|
||||||
|
DATA GATE: PASS
|
||||||
|
No blocking data integrity violations found.
|
||||||
|
```
|
||||||
|
|
||||||
|
or
|
||||||
|
|
||||||
|
```
|
||||||
|
DATA GATE: FAIL
|
||||||
|
|
||||||
|
Blocking Issues (N):
|
||||||
|
1. <location> - <violation description>
|
||||||
|
Fix: <actionable fix>
|
||||||
|
|
||||||
|
2. <location> - <violation description>
|
||||||
|
Fix: <actionable fix>
|
||||||
|
|
||||||
|
Run /data-review for full audit report.
|
||||||
|
```
|
||||||
|
|
||||||
|
### Review Mode (Detailed)
|
||||||
|
|
||||||
|
```
|
||||||
|
+----------------------------------------------------------------------+
|
||||||
|
| DATA-PLATFORM - Data Integrity Audit |
|
||||||
|
| [Target Path] |
|
||||||
|
+----------------------------------------------------------------------+
|
||||||
|
|
||||||
|
Target: <scanned path or project>
|
||||||
|
Scope: N files scanned, N models checked, N tables verified
|
||||||
|
|
||||||
|
FINDINGS
|
||||||
|
|
||||||
|
FAIL (N)
|
||||||
|
1. [location] violation description
|
||||||
|
Fix: actionable fix
|
||||||
|
|
||||||
|
2. [location] violation description
|
||||||
|
Fix: actionable fix
|
||||||
|
|
||||||
|
WARN (N)
|
||||||
|
1. [location] warning description
|
||||||
|
Suggestion: improvement suggestion
|
||||||
|
|
||||||
|
2. [location] warning description
|
||||||
|
Suggestion: improvement suggestion
|
||||||
|
|
||||||
|
INFO (N)
|
||||||
|
1. [location] info description
|
||||||
|
Note: context
|
||||||
|
|
||||||
|
SUMMARY
|
||||||
|
Schema: Valid | N issues
|
||||||
|
Lineage: Intact | N orphaned
|
||||||
|
dbt: Passes | N failures
|
||||||
|
PostGIS: Valid | N issues | Not applicable
|
||||||
|
|
||||||
|
VERDICT: PASS | FAIL (N blocking issues)
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Skip Patterns
|
||||||
|
|
||||||
|
Do not flag violations in:
|
||||||
|
|
||||||
|
- `**/tests/**` - Test files may have intentional violations
|
||||||
|
- `**/__pycache__/**` - Compiled files
|
||||||
|
- `**/fixtures/**` - Test fixtures
|
||||||
|
- `**/.scratch/**` - Temporary working files
|
||||||
|
- Files with `# noqa: data-audit` comment
|
||||||
|
- Migration files marked as historical
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Error Handling
|
||||||
|
|
||||||
|
| Scenario | Behavior |
|
||||||
|
|----------|----------|
|
||||||
|
| Database not reachable (`pg_connect` fails) | WARN, skip PostgreSQL checks, continue with file-based |
|
||||||
|
| dbt not configured (no `dbt_project.yml`) | Skip dbt checks entirely, not an error |
|
||||||
|
| No PostGIS tables found | Skip PostGIS checks, not an error |
|
||||||
|
| MCP tool call fails | Report as WARN with tool name, continue with remaining checks |
|
||||||
|
| No data files in scanned path | Report "No data artifacts found" - PASS (nothing to fail) |
|
||||||
|
| Empty directory | Report "No files found in path" - PASS |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Integration Notes
|
||||||
|
|
||||||
|
### projman Orchestrator
|
||||||
|
|
||||||
|
When called as a domain gate:
|
||||||
|
1. Orchestrator detects `Domain/Data` label on issue
|
||||||
|
2. Orchestrator identifies changed files
|
||||||
|
3. Orchestrator invokes `/data-gate <path>`
|
||||||
|
4. Agent runs gate mode scan
|
||||||
|
5. Returns PASS/FAIL to orchestrator
|
||||||
|
6. Orchestrator decides whether to complete issue
|
||||||
|
|
||||||
|
### Standalone Usage
|
||||||
|
|
||||||
|
For manual audits:
|
||||||
|
1. User runs `/data-review <path>`
|
||||||
|
2. Agent runs full review mode scan
|
||||||
|
3. Returns detailed report with all severity levels
|
||||||
|
4. User decides on actions
|
||||||
Reference in New Issue
Block a user