Add "lessons/patterns/reset-pandas-index-after-filtering-to-prevent-column-pollution"

2026-01-26 22:28:13 +00:00
parent f37b5dc340
commit faebbe8e92

@@ -0,0 +1,48 @@
## Context
When building MCP tools that manipulate pandas DataFrames, operations that subset or filter rows keep the original index labels, and that leftover index can later surface as an unexpected extra column in the result.
**Issue:** #203 - filter tool adds unexpected `__index_level_0__` column
## Problem
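The `filter` tool used `df.query(condition)` to filter rows. This preserves the original DataFrame's row labels, so the result no longer has a default `0..n-1` index. When the filtered result was later serialized or stored, the preserved index was written out as a column named `__index_level_0__` (the name pyarrow gives an unnamed index that it has to store as a column).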
```python
# Before (problematic)
filtered = df.query(condition)
# Result has original index preserved, becomes column on storage
```
**Symptom:** Users reported filtered DataFrames having 5 columns when the source had 4.
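The effect can be reproduced with pandas alone. The following is a minimal sketch, not the tool's actual code: the frame, its column names, and the condition are invented for illustration. It uses a plain `reset_index()` to stand in for a storage layer that materializes the index, so the extra column is named `index` here rather than `__index_level_0__`.

```python
import pandas as pd

# Hypothetical 4-column source frame with the default index (labels 0..4).
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["a", "b", "c", "d", "e"],
        "score": [10, 55, 70, 20, 90],
        "region": ["eu", "us", "eu", "us", "eu"],
    }
)

filtered = df.query("score > 50")

# The surviving rows keep their original labels instead of being renumbered.
index_labels = filtered.index.tolist()

# Turning that leftover index into a column (here done explicitly) is what
# makes a 4-column frame come back with 5 columns.
materialized = filtered.reset_index()
column_counts = (len(df.columns), len(materialized.columns))
```

After this runs, `index_labels` holds the original labels of the matching rows and `column_counts` shows the 4 versus 5 mismatch described in the symptom.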
## Solution
Always reset the index after filtering operations that subset rows:
```python
# After (correct)
filtered = df.query(condition).reset_index(drop=True)
```
The `drop=True` parameter discards the old index entirely rather than converting it to a column.
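As a sketch of how the fix might be wrapped, the helper below filters and resets in one place. The name `filter_rows` and the sample data are hypothetical and are not taken from `pandas_tools.py`. The second call shows what happens when `drop=True` is omitted.

```python
import pandas as pd


def filter_rows(df: pd.DataFrame, condition: str) -> pd.DataFrame:
    """Hypothetical helper: filter rows and return a fresh 0..n-1 index."""
    return df.query(condition).reset_index(drop=True)


df = pd.DataFrame({"a": [1, 2, 3, 4, 5], "b": [5, 6, 7, 8, 9]})

# Old labels (1, 3, 4) are discarded; the column set is unchanged.
clean = filter_rows(df, "a != 3 and a > 1")

# Without drop=True the old index is kept as a new leading column.
leaky = df.query("a != 3 and a > 1").reset_index()
```

Keeping the reset inside a single helper means every caller gets the same behavior, which is one way to avoid forgetting it in individual tools.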
## Prevention
When implementing pandas operations in MCP tools:
1. **Filter/query operations** - Always add `.reset_index(drop=True)`
2. **Groupby operations** - Use `.reset_index()` (already correct in our impl)
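3. **Merge/join operations** - `pd.merge()` on columns returns a fresh default index; index-based merges (`left_index`/`right_index`) and `DataFrame.join` carry index values through, so check those cases
4. **Slicing operations** (head, tail) - These keep the original row labels (`tail` starts at a non-zero label), so consider whether an index reset is needed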
**Rule of thumb:** If an operation changes which rows are in the DataFrame, reset the index.
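The checklist above can be spot-checked with a small guard. This is only a sketch: `index_is_positional` is a hypothetical helper, the sample frames are invented, and the check compares index labels only, not the index type.

```python
import pandas as pd


def index_is_positional(df: pd.DataFrame) -> bool:
    """Hypothetical guard: True when the index labels are exactly 0..n-1."""
    return df.index.equals(pd.RangeIndex(len(df)))


df = pd.DataFrame({"k": ["x", "y", "x", "y", "x"], "v": [1, 2, 3, 4, 5]})
lookup = pd.DataFrame({"k": ["x", "y"], "w": [10, 20]})

checks = {
    # 1. Filtering keeps the old labels (2, 3, 4) until the index is reset.
    "query": index_is_positional(df.query("v != 2 and v > 1")),
    "query_reset": index_is_positional(
        df.query("v != 2 and v > 1").reset_index(drop=True)
    ),
    # 2. Groupby followed by reset_index() moves the key back to a column.
    "groupby_reset": index_is_positional(df.groupby("k")["v"].sum().reset_index()),
    # 3. A column-on-column merge produces new 0..n-1 labels.
    "merge_on_column": index_is_positional(pd.merge(df, lookup, on="k")),
    # 4. head() of a default-indexed frame stays positional, tail() does not.
    "head": index_is_positional(df.head(2)),
    "tail": index_is_positional(df.tail(2)),
}
```

A guard like this could run in a test or as an assertion just before a tool returns its result, so a missing reset is caught before the frame is stored.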
## Related
- File: `mcp-servers/data-platform/mcp_server/pandas_tools.py`
- Fix commit: `4ed3ed7`
---
**Tags:** pandas, data-platform, mcp-server, python, dataframe