feat: add data-platform plugin (v4.0.0)

Add new data-platform plugin for data engineering workflows with:

MCP Server (32 tools):
- pandas operations (14 tools): read_csv, read_parquet, read_json,
  to_csv, to_parquet, describe, head, tail, filter, select, groupby,
  join, list_data, drop_data
- PostgreSQL/PostGIS (10 tools): pg_connect, pg_query, pg_execute,
  pg_tables, pg_columns, pg_schemas, st_tables, st_geometry_type,
  st_srid, st_extent
- dbt integration (8 tools): dbt_parse, dbt_run, dbt_test, dbt_build,
  dbt_compile, dbt_ls, dbt_docs_generate, dbt_lineage

Plugin Features:
- Arrow IPC data_ref system for DataFrame persistence across tool calls
- Pre-execution validation for dbt with `dbt parse`
- SessionStart hook for PostgreSQL connectivity check (non-blocking)
- Hybrid configuration (system ~/.config/claude/postgres.env + project .env)
- Memory management with 100k row limit and chunking support

Commands: /initial-setup, /ingest, /profile, /schema, /explain, /lineage, /run
Agents: data-ingestion, data-analysis

Test suite: 71 tests covering config, data store, pandas, postgres, dbt tools

Addresses data workflow issues from personal-portfolio project:
- Lost data after multiple interactions (solved by Arrow IPC data_ref)
- dbt 1.9+ syntax deprecation (solved by pre-execution validation)
- Ungraceful PostgreSQL error handling (solved by SessionStart hook)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-25 14:24:03 -05:00
parent 6a267d074b
commit 89f0354ccc
39 changed files with 5413 additions and 6 deletions

@@ -0,0 +1,44 @@
# /explain - dbt Model Explanation
Explain a dbt model's purpose, dependencies, and SQL logic.
## Usage
```
/explain <model_name>
```
## Workflow
1. **Get model info**:
   - Use `dbt_lineage` to get model metadata
   - Extract description, tags, materialization
2. **Analyze dependencies**:
   - Show upstream models (what this depends on)
   - Show downstream models (what depends on this)
   - Visualize as dependency tree
3. **Compile SQL**:
   - Use `dbt_compile` to get rendered SQL
   - Explain key transformations
4. **Report**:
   - Model purpose (from description)
   - Materialization strategy
   - Dependency graph
   - Key SQL logic explained
## Examples
```
/explain dim_customers
/explain fct_orders
```
## Available Tools
Use these MCP tools:
- `dbt_lineage` - Get model dependencies
- `dbt_compile` - Get compiled SQL
- `dbt_ls` - List related resources

@@ -0,0 +1,44 @@
# /ingest - Data Ingestion
Load data from files or database into the data platform.
## Usage
```
/ingest [source]
```
## Workflow
1. **Identify data source**:
   - If source is a file path, determine format (CSV, Parquet, JSON)
   - If source is "db" or a table name, query PostgreSQL
2. **Load data**:
   - For files: Use `read_csv`, `read_parquet`, or `read_json`
   - For database: Use `pg_query` with appropriate SELECT
3. **Validate**:
   - Check row count against limits
   - If exceeds 100k rows, suggest chunking or filtering
4. **Report**:
   - Show data_ref, row count, columns, and memory usage
   - Preview first few rows
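
The source-identification step can be sketched as a small dispatcher. This is a minimal illustration, not the plugin's implementation: `choose_tool` and the `READERS` table are hypothetical names, and treating every non-file argument as a database request is an assumption based on the workflow above.

```python
from pathlib import PurePath

# File suffix -> MCP tool name, as listed under "Available Tools".
READERS = {
    ".csv": "read_csv",
    ".parquet": "read_parquet",
    ".json": "read_json",
    ".jsonl": "read_json",
}


def choose_tool(source):
    """Map an /ingest argument to the tool that would load it."""
    suffix = PurePath(source).suffix.lower()
    if suffix in READERS:
        return READERS[suffix]
    # Anything else is treated as "db", a table name, or a SQL statement.
    return "pg_query"
```

Matching on the suffix keeps the decision cheap and predictable; a real implementation would probably also confirm that the file exists before choosing a reader.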
## Examples
```
/ingest data/sales.csv
/ingest data/customers.parquet
/ingest "SELECT * FROM orders WHERE created_at > '2024-01-01'"
```
## Available Tools
Use these MCP tools:
- `read_csv` - Load CSV files
- `read_parquet` - Load Parquet files
- `read_json` - Load JSON/JSONL files
- `pg_query` - Query PostgreSQL database
- `list_data` - List loaded DataFrames

@@ -0,0 +1,231 @@
---
description: Interactive setup wizard for data-platform plugin - configures MCP server and optional PostgreSQL/dbt
---
# Data Platform Setup Wizard
This command sets up the data-platform plugin with pandas, PostgreSQL, and dbt integration.
## Important Context
- **This command uses Bash, Read, Write, and AskUserQuestion tools** - NOT MCP tools
- **MCP tools won't work until after setup + session restart**
- **PostgreSQL and dbt are optional** - pandas tools work without them
---
## Phase 1: Environment Validation
### Step 1.1: Check Python Version
```bash
python3 --version
```
Requires Python 3.10+. If below, stop setup and inform user.
### Step 1.2: Check for Required Libraries
```bash
python3 -c "import sys; print(f'Python {sys.version_info.major}.{sys.version_info.minor}')"
```
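The command above reports only the interpreter version. As a minimal sketch, assuming the required libraries are `pandas`, `pyarrow`, and `asyncpg` (names mentioned elsewhere in this wizard; `requirements.txt` remains the authoritative list), a check that covers both the version floor and importability might look like this. `missing_modules` and `check_environment` are hypothetical helpers, not part of the plugin.

```python
import importlib.util
import sys

# Assumed module names; the real list lives in requirements.txt.
REQUIRED_MODULES = ("pandas", "pyarrow", "asyncpg")
MIN_PYTHON = (3, 10)


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def check_environment(names=REQUIRED_MODULES, version=None):
    """Return a list of human-readable problems; empty means the check passed."""
    version = sys.version_info[:2] if version is None else version
    problems = []
    if version < MIN_PYTHON:
        problems.append(f"Python {version[0]}.{version[1]} is below 3.10")
    problems.extend(f"missing module: {name}" for name in missing_modules(names))
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

Because the libraries are installed into the virtual environment in Phase 2, a check like this is only meaningful when run with that environment's interpreter.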
---
## Phase 2: MCP Server Setup
### Step 2.1: Locate Data Platform MCP Server
The MCP server should be at the marketplace root:
```bash
# If running from installed marketplace
ls -la ~/.claude/plugins/marketplaces/leo-claude-mktplace/mcp-servers/data-platform/ 2>/dev/null || echo "NOT_FOUND_INSTALLED"
# If running from source
ls -la ~/claude-plugins-work/mcp-servers/data-platform/ 2>/dev/null || echo "NOT_FOUND_SOURCE"
```
Determine the correct path based on which exists.
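A minimal sketch of that decision, assuming only the two candidate locations shown in the commands above. `locate_server` is a hypothetical helper name.

```python
from pathlib import Path

# Candidate locations taken from the two `ls` commands above.
CANDIDATES = (
    "~/.claude/plugins/marketplaces/leo-claude-mktplace/mcp-servers/data-platform",
    "~/claude-plugins-work/mcp-servers/data-platform",
)


def locate_server(candidates=CANDIDATES):
    """Return the first candidate that is an existing directory, else None."""
    for raw in candidates:
        path = Path(raw).expanduser()
        if path.is_dir():
            return path
    return None
```

The installed marketplace path is listed first, so it wins when both exist. That ordering is an assumption; swap the entries if the source checkout should take priority.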
### Step 2.2: Check Virtual Environment
```bash
ls -la /path/to/mcp-servers/data-platform/.venv/bin/python 2>/dev/null && echo "VENV_EXISTS" || echo "VENV_MISSING"
```
### Step 2.3: Create Virtual Environment (if missing)
```bash
cd /path/to/mcp-servers/data-platform && python3 -m venv .venv && source .venv/bin/activate && pip install --upgrade pip && pip install -r requirements.txt && deactivate
```
**Note:** This may take a few minutes due to pandas, pyarrow, and dbt dependencies.
---
## Phase 3: PostgreSQL Configuration (Optional)
### Step 3.1: Ask About PostgreSQL
Use AskUserQuestion:
- Question: "Do you want to configure PostgreSQL database access?"
- Header: "PostgreSQL"
- Options:
  - "Yes, I have a PostgreSQL database"
  - "No, I'll only use pandas/dbt tools"
**If user chooses "No":** Skip to Phase 4.
### Step 3.2: Create Config Directory
```bash
mkdir -p ~/.config/claude
```
### Step 3.3: Check PostgreSQL Configuration
```bash
cat ~/.config/claude/postgres.env 2>/dev/null || echo "FILE_NOT_FOUND"
```
**If file exists with valid URL:** Skip to Step 3.6.
**If missing or has placeholders:** Continue.
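What counts as a "valid URL" is not spelled out here. One possible heuristic, offered as a sketch under assumptions: the file uses plain `KEY=VALUE` lines, a valid URL has a `postgresql` or `postgres` scheme and a host, and placeholders are written in angle brackets like the `<USER_PROVIDED_URL>` template in Step 3.5. Both function names are hypothetical.

```python
from urllib.parse import urlsplit


def read_env_value(text, key):
    """Return the value of KEY=VALUE from env-file text, or None if absent."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            return value.strip().strip("'\"")
    return None


def looks_like_valid_url(url):
    """Heuristic: PostgreSQL scheme, a host, and no <PLACEHOLDER> markers."""
    if not url or "<" in url or ">" in url:
        return False
    parts = urlsplit(url)
    return parts.scheme in ("postgresql", "postgres") and bool(parts.hostname)
```

The check is deliberately conservative: it would reject a URL whose password contains an unescaped `<` or `>`, and it does not prove the server is reachable. The connection test in Step 3.6 covers that.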
### Step 3.4: Gather PostgreSQL Information
Use AskUserQuestion:
- Question: "What is your PostgreSQL connection URL format?"
- Header: "DB Format"
- Options:
  - "Standard: postgresql://user:pass@host:5432/db"
  - "PostGIS: postgresql://user:pass@host:5432/db (with PostGIS extension)"
  - "Other (I'll provide the full URL)"
Ask user to provide the connection URL.
### Step 3.5: Create Configuration File
```bash
cat > ~/.config/claude/postgres.env << 'EOF'
# PostgreSQL Configuration
# Generated by data-platform /initial-setup
POSTGRES_URL=<USER_PROVIDED_URL>
EOF
chmod 600 ~/.config/claude/postgres.env
```
### Step 3.6: Test PostgreSQL Connection (if configured)
```bash
set -a && source ~/.config/claude/postgres.env && set +a && python3 -c "
import asyncio
import os

import asyncpg


async def test():
    # Read the URL from the environment so that quotes in the password
    # cannot break the Python string literal.
    try:
        conn = await asyncpg.connect(os.environ['POSTGRES_URL'], timeout=5)
        ver = await conn.fetchval('SELECT version()')
        await conn.close()
        print(f'SUCCESS: {ver.split(\",\")[0]}')
    except Exception as e:
        print(f'FAILED: {e}')


asyncio.run(test())
"
```
Report result:
- SUCCESS: Connection works
- FAILED: Show error and suggest fixes
---
## Phase 4: dbt Configuration (Optional)
### Step 4.1: Ask About dbt
Use AskUserQuestion:
- Question: "Do you use dbt for data transformations in your projects?"
- Header: "dbt"
- Options:
  - "Yes, I have dbt projects"
  - "No, I don't use dbt"
**If user chooses "No":** Skip to Phase 5.
### Step 4.2: dbt Discovery
dbt configuration is **project-level** (not system-level). The plugin auto-detects dbt projects by looking for `dbt_project.yml`.
Inform user:
```
dbt projects are detected automatically when you work in a directory
containing dbt_project.yml.
If your dbt project is in a subdirectory, you can set DBT_PROJECT_DIR
in your project's .env file:
DBT_PROJECT_DIR=./transform
DBT_PROFILES_DIR=~/.dbt
```
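How the plugin performs this detection is not shown in this file. As a rough sketch only, the lookup could honor `DBT_PROJECT_DIR` first and otherwise search for `dbt_project.yml`. Walking up through parent directories is an assumption made for illustration, and `find_dbt_project` is a hypothetical name.

```python
import os
from pathlib import Path


def find_dbt_project(start, env=None):
    """Return the directory containing dbt_project.yml, or None."""
    env = os.environ if env is None else env
    start = Path(start).resolve()

    # An explicit DBT_PROJECT_DIR is resolved relative to the start directory.
    override = env.get("DBT_PROJECT_DIR")
    if override:
        candidate = (start / Path(override).expanduser()).resolve()
        return candidate if (candidate / "dbt_project.yml").is_file() else None

    # Otherwise look in the start directory and then in each parent (assumed).
    for directory in (start, *start.parents):
        if (directory / "dbt_project.yml").is_file():
            return directory
    return None
```

Passing the environment as a parameter keeps the function easy to test without touching the real process environment.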
### Step 4.3: Check dbt Installation
```bash
dbt --version 2>/dev/null || echo "DBT_NOT_FOUND"
```
**If not found:** Inform user that dbt CLI tools require dbt-core to be installed globally or in the project.
---
## Phase 5: Validation
### Step 5.1: Verify MCP Server
```bash
cd /path/to/mcp-servers/data-platform && .venv/bin/python -c "from mcp_server.server import DataPlatformMCPServer; print('MCP Server OK')"
```
### Step 5.2: Summary
```
╔════════════════════════════════════════════════════════════╗
║                DATA-PLATFORM SETUP COMPLETE                ║
╠════════════════════════════════════════════════════════════╣
║ MCP Server:       ✓ Ready                                  ║
║ pandas Tools:     ✓ Available (14 tools)                   ║
║ PostgreSQL Tools: [✓/✗] [Status based on config]           ║
║ PostGIS Tools:    [✓/✗] [Status based on PostGIS]          ║
║ dbt Tools:        [✓/✗] [Status based on discovery]        ║
╚════════════════════════════════════════════════════════════╝
```
### Step 5.3: Session Restart Notice
---
**⚠️ Session Restart Required**
Restart your Claude Code session for MCP tools to become available.
**After restart, you can:**
- Run `/ingest` to load data from files or database
- Run `/profile` to analyze DataFrame statistics
- Run `/schema` to explore database/DataFrame schema
- Run `/run` to execute dbt models (if configured)
- Run `/lineage` to view dbt model dependencies
---
## Memory Limits
The data-platform plugin has a default row limit of 100,000 rows per DataFrame. For larger datasets:
- Use chunked processing (`chunk_size` parameter)
- Filter data before loading
- Store to Parquet for efficient re-loading
You can override the limit by setting in your project `.env`:
```
DATA_PLATFORM_MAX_ROWS=500000
```
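The limit and chunking behaviour described above can be illustrated as follows. This is a sketch under assumptions: falling back to the default when the variable is missing or invalid is a guess at sensible behaviour, and `max_rows` and `chunk_ranges` are hypothetical helpers rather than plugin functions.

```python
import os

DEFAULT_MAX_ROWS = 100_000


def max_rows(env=None):
    """Read DATA_PLATFORM_MAX_ROWS, falling back to the documented default."""
    env = os.environ if env is None else env
    raw = env.get("DATA_PLATFORM_MAX_ROWS", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_ROWS
    return value if value > 0 else DEFAULT_MAX_ROWS


def chunk_ranges(total_rows, chunk_size):
    """Yield (start, stop) row ranges that cover total_rows in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total_rows, chunk_size):
        yield start, min(start + chunk_size, total_rows)
```

For example, a 250,000-row source split with a chunk size of 100,000 yields three ranges, the last one shorter than the others.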

@@ -0,0 +1,60 @@
# /lineage - Data Lineage Visualization
Show data lineage for dbt models or database tables.
## Usage
```
/lineage <model_name> [--depth N]
```
## Workflow
1. **Get lineage data**:
   - Use `dbt_lineage` for dbt models
   - For database tables, trace through dbt manifest
2. **Build lineage graph**:
   - Identify all upstream sources
   - Identify all downstream consumers
   - Note materialization at each node
3. **Visualize**:
   - ASCII art dependency tree
   - List format with indentation
   - Show depth levels
4. **Report**:
   - Full dependency chain
   - Critical path identification
   - Refresh implications
## Examples
```
/lineage dim_customers
/lineage fct_orders --depth 3
```
## Output Format
```
Sources:
  ├── raw_customers (source)
  └── raw_orders (source)

dim_customers (table)
  ├── upstream:
  │   └── stg_customers (view)
  │       └── raw_customers (source)
  └── downstream:
      └── fct_orders (incremental)
          └── rpt_customer_lifetime (table)
```
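A tree of this kind, with a `--depth` style limit, can be produced by a short recursive walk. The sketch below is illustrative only: the shape of the `upstream` mapping is hypothetical and is not the format returned by `dbt_lineage`.

```python
def render_tree(node, children, depth=None, _level=0):
    """Return indented lines for node and its descendants in `children`.

    `children` maps a name to the list of names one hop away; `depth`
    limits how many hops are followed (None means no limit).
    """
    lines = ["    " * _level + ("└── " if _level else "") + node]
    if depth is not None and _level >= depth:
        return lines
    for child in children.get(node, []):
        lines.extend(render_tree(child, children, depth, _level + 1))
    return lines


# Hypothetical upstream map mirroring the example output above.
upstream = {
    "dim_customers": ["stg_customers"],
    "stg_customers": ["raw_customers"],
}
print("\n".join(render_tree("dim_customers", upstream, depth=3)))
```

The same function renders the downstream side when given a mapping from each model to its consumers. It relies on the graph being acyclic, which holds for a dbt DAG.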
## Available Tools
Use these MCP tools:
- `dbt_lineage` - Get model dependencies
- `dbt_ls` - List dbt resources
- `dbt_docs_generate` - Generate full manifest

@@ -0,0 +1,44 @@
# /profile - Data Profiling
Generate statistical profile and quality report for a DataFrame.
## Usage
```
/profile <data_ref>
```
## Workflow
1. **Get data reference**:
   - If no data_ref provided, use `list_data` to show available options
   - Validate the data_ref exists
2. **Generate profile**:
   - Use `describe` for statistical summary
   - Analyze null counts, unique values, data types
3. **Quality assessment**:
   - Identify columns with high null percentage
   - Flag potential data quality issues
   - Suggest cleaning operations if needed
4. **Report**:
   - Summary statistics per column
   - Data type distribution
   - Memory usage
   - Quality score
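
The null-percentage part of the quality assessment can be illustrated in plain Python so that it runs without pandas. This is a sketch under assumptions: the 50% threshold for "high" is an arbitrary choice, since no cutoff is stated here, and `null_report` is a hypothetical helper.

```python
def null_report(columns, threshold=0.5):
    """Return {column: null_fraction} plus the columns above `threshold`.

    `columns` maps a column name to its list of values; None marks a null.
    """
    fractions = {}
    for name, values in columns.items():
        nulls = sum(1 for value in values if value is None)
        fractions[name] = nulls / len(values) if values else 0.0
    flagged = sorted(name for name, frac in fractions.items() if frac > threshold)
    return fractions, flagged
```

With a pandas DataFrame `df`, `df.isna().mean()` gives the same per-column fraction directly.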
## Examples
```
/profile sales_data
/profile df_a1b2c3d4
```
## Available Tools
Use these MCP tools:
- `describe` - Get statistical summary
- `head` - Preview first rows
- `list_data` - List available DataFrames

@@ -0,0 +1,55 @@
# /run - Execute dbt Models
Run dbt models with automatic pre-validation.
## Usage
```
/run [model_selection] [--full-refresh]
```
## Workflow
1. **Pre-validation** (MANDATORY):
   - Use `dbt_parse` to validate project
   - Check for deprecated syntax (dbt 1.9+)
   - If validation fails, show errors and STOP
2. **Execute models**:
   - Use `dbt_run` with provided selection
   - Monitor progress and capture output
3. **Report results**:
   - Success/failure status per model
   - Execution time
   - Row counts where available
   - Any warnings or errors
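
The parse-then-run gate can be expressed as a small wrapper. In this sketch `parse` and `run` are injected callables standing in for the `dbt_parse` and `dbt_run` tools, and the dict shape with a `success` key is an assumption, not the tools' documented return format.

```python
def run_with_prevalidation(parse, run, selection=None, full_refresh=False):
    """Call `parse` first and invoke `run` only if validation succeeded.

    `parse` and `run` are stand-ins for the dbt_parse and dbt_run tools;
    each is assumed to return a dict with a boolean "success" key.
    """
    validation = parse()
    if not validation.get("success"):
        # Stop here: the models are never executed on a failed parse.
        return {"stage": "parse", "success": False,
                "errors": validation.get("errors", [])}
    result = run(selection=selection, full_refresh=full_refresh)
    return {"stage": "run", "success": bool(result.get("success")),
            "result": result}
```

Injecting the two callables makes the STOP behaviour easy to verify with fakes, without a dbt project on disk.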
## Examples
```
/run # Run all models
/run dim_customers # Run specific model
/run +fct_orders # Run model and its upstream
/run tag:daily # Run models with tag
/run --full-refresh # Rebuild incremental models
```
## Selection Syntax
| Pattern | Meaning |
|---------|---------|
| `model_name` | Run single model |
| `+model_name` | Run model and upstream |
| `model_name+` | Run model and downstream |
| `+model_name+` | Run model with upstream and downstream |
| `tag:name` | Run by tag |
| `path:models/staging` | Run by path |
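
To make the patterns in the table concrete, the sketch below splits a selector into its parts. It covers only the six patterns listed here, not dbt's full selection grammar, and `parse_selector` is a hypothetical helper. In practice the selector string is passed to dbt unchanged.

```python
def parse_selector(selector):
    """Split a selector from the table above into its parts."""
    for method in ("tag", "path"):
        prefix = method + ":"
        if selector.startswith(prefix):
            return {"method": method, "value": selector[len(prefix):],
                    "upstream": False, "downstream": False}
    # A leading "+" pulls in upstream models, a trailing "+" downstream ones.
    upstream = selector.startswith("+")
    downstream = selector.endswith("+")
    return {"method": "model", "value": selector.strip("+"),
            "upstream": upstream, "downstream": downstream}
```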
## Available Tools
Use these MCP tools:
- `dbt_parse` - Pre-validation (ALWAYS RUN FIRST)
- `dbt_run` - Execute models
- `dbt_build` - Run + test
- `dbt_test` - Run tests only

@@ -0,0 +1,48 @@
# /schema - Schema Exploration
Display schema information for database tables or DataFrames.
## Usage
```
/schema [table_name | data_ref]
```
## Workflow
1. **Determine target**:
   - If argument is a loaded data_ref, show DataFrame schema
   - If argument is a table name, query database schema
   - If no argument, list all available tables and DataFrames
2. **For DataFrames**:
   - Use `describe` to get column info
   - Show dtypes, null counts, sample values
3. **For database tables**:
   - Use `pg_columns` for column details
   - Use `st_tables` to check for PostGIS columns
   - Show constraints and indexes if available
4. **Report**:
   - Column name, type, nullable, default
   - For PostGIS: geometry type, SRID
   - For DataFrames: pandas dtype, null percentage
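
The target-resolution step can be sketched as a three-way decision. This is a hypothetical helper for illustration: `data_refs` and `tables` stand for the names that `list_data` and `pg_tables` would report, and checking data_refs before tables simply follows the order of the bullets above.

```python
def resolve_target(argument, data_refs, tables):
    """Decide which branch of the workflow an argument falls into."""
    if not argument:
        return ("list_all", None)
    if argument in data_refs:
        return ("dataframe", argument)
    if argument in tables:
        return ("table", argument)
    return ("unknown", argument)
```

If a DataFrame and a table share a name, this ordering silently prefers the DataFrame; a real implementation might ask the user instead.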
## Examples
```
/schema # List all tables and DataFrames
/schema customers # Show table schema
/schema sales_data # Show DataFrame schema
```
## Available Tools
Use these MCP tools:
- `pg_tables` - List database tables
- `pg_columns` - Get column info
- `pg_schemas` - List schemas
- `st_tables` - List PostGIS tables
- `describe` - Get DataFrame info
- `list_data` - List DataFrames