diff --git a/lessons%2Fpatterns%2Freset-pandas-index-after-filtering-to-prevent-column-pollution.-.md b/lessons%2Fpatterns%2Freset-pandas-index-after-filtering-to-prevent-column-pollution.-.md
new file mode 100644
index 0000000..c432b7d
--- /dev/null
+++ b/lessons%2Fpatterns%2Freset-pandas-index-after-filtering-to-prevent-column-pollution.-.md
@@ -0,0 +1,48 @@
+## Context
+
+When building MCP tools that manipulate pandas DataFrames, operations that subset or filter rows can unexpectedly add a column to the result.
+
+**Issue:** #203 - filter tool adds unexpected `__index_level_0__` column
+
+## Problem
+
+The `filter` tool used `df.query(condition)` to filter rows. `query` keeps the original row labels, so the filtered result carries a non-default index with gaps. When that result was later serialized or stored, the preserved index was written out as an extra column named `__index_level_0__`.
+
+```python
+# Before (problematic)
+filtered = df.query(condition)
+# The result keeps the original index, which becomes a column on storage
+```
+
+**Symptom:** Users reported filtered DataFrames having 5 columns when the source had 4.
+
+## Solution
+
+Always reset the index after filtering operations that subset rows:
+
+```python
+# After (correct)
+filtered = df.query(condition).reset_index(drop=True)
+```
+
+The `drop=True` parameter discards the old index entirely rather than converting it to a column.
+
+## Prevention
+
+When implementing pandas operations in MCP tools:
+
+1. **Filter/query operations** - Always add `.reset_index(drop=True)`
+2. **Groupby operations** - Use `.reset_index()` so the group keys become regular columns (already correct in our implementation)
+3. **Merge/join operations** - `pd.merge()` on columns returns a fresh default index; joins on the index (`left_index`/`right_index`, `DataFrame.join`) keep index labels, so check those
+4. **Slicing operations** (`head`, `tail`) - These keep the original labels; consider whether an index reset is needed
+
+**Rule of thumb:** If an operation changes which rows are in the DataFrame, reset the index.
+
+## Related
+
+- File: `mcp-servers/data-platform/mcp_server/pandas_tools.py`
+- Fix commit: `4ed3ed7`
+
+
+---
+**Tags:** pandas, data-platform, mcp-server, python, dataframe
\ No newline at end of file
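
The rule of thumb in the Prevention section can be checked with a short, self-contained sketch. Everything here is an assumption made for illustration: the sample data, the column names, and the `condition` string are hypothetical and not taken from `pandas_tools.py`. A plain `reset_index()` stands in for the storage step that turns the index into a column, so the extra column is named `index` here rather than `__index_level_0__`.

```python
import pandas as pd

# Hypothetical four-column source frame (names are illustrative only).
df = pd.DataFrame(
    {
        "id": [10, 11, 12, 13],
        "name": ["a", "b", "c", "d"],
        "score": [40, 75, 90, 55],
        "group": ["x", "y", "x", "y"],
    }
)
condition = "score > 50"  # stand-in for the condition passed to the tool

# query() keeps the labels of the surviving rows, so the index has a gap.
filtered = df.query(condition)
print(list(filtered.index))  # [1, 2, 3]

# Any step that materialises the index as data adds a fifth column.
# reset_index() without drop=True is used here as a stand-in for storage.
polluted = filtered.reset_index()
print(list(polluted.columns))  # ['index', 'id', 'name', 'score', 'group']

# The fix: drop the old labels and start again from 0.
clean = df.query(condition).reset_index(drop=True)
print(list(clean.index), clean.shape[1])  # [0, 1, 2] 4

# Slicing keeps the original labels as well.
tail_labels = list(df.tail(2).index)
print(tail_labels)  # [2, 3]

# A column-based merge ignores the input indexes and returns 0..n-1.
lookup = pd.DataFrame({"group": ["x", "y"], "label": ["first", "second"]})
merged = filtered.merge(lookup, on="group")
print(list(merged.index))  # [0, 1, 2]
```

The same check (compare `result.shape[1]` with the source column count after a round trip through storage) could serve as a regression test for issue #203, though the exact test layout in this repository is not shown in the diff.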