From f2a62627d09413f279282e6f8156c0e4e96794da Mon Sep 17 00:00:00 2001
From: lmiranda
Date: Wed, 28 Jan 2026 10:46:04 -0500
Subject: [PATCH] feat(projman): add runaway detection and circuit breaker for agents (#236)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Executor self-monitoring:
- 10+ calls without progress → stop and reassess
- Same error 3+ times → circuit breaker, report failure
- 50+ calls → mandatory progress update
- 80+ calls → budget warning, evaluate completion
- 100+ calls → hard stop, save checkpoint

Orchestrator monitoring:
- Detect stuck agents (no progress for X minutes)
- Intervention protocol for runaway agents
- Timeout guidelines by task size (XS: 15min, S: 30min, M: 45min)
- Recovery actions with Status/Failed label

This prevents agents from running indefinitely (400+ tool calls observed in Sprint 3) and provides clear stopping criteria.

Closes #236

Co-Authored-By: Claude Opus 4.5
---
 plugins/projman/agents/executor.md     | 108 ++++++++++++++++++++++---
 plugins/projman/agents/orchestrator.md |  75 +++++++++++++++--
 2 files changed, 166 insertions(+), 17 deletions(-)

diff --git a/plugins/projman/agents/executor.md b/plugins/projman/agents/executor.md
index 1ba39d8..bffcc62 100644
--- a/plugins/projman/agents/executor.md
+++ b/plugins/projman/agents/executor.md
@@ -424,20 +424,110 @@ As the executor, you interact with MCP tools for status updates:
 - Apply best practices
 - Deliver quality work
 
+## Runaway Detection (Self-Monitoring)
+
+**CRITICAL: Monitor yourself to prevent infinite loops and wasted resources.**
+
+**Self-Monitoring Checkpoints:**
+
+| Trigger | Action |
+|---------|--------|
+| 10+ tool calls without progress | STOP - Post progress update, reassess approach |
+| Same error 3+ times | CIRCUIT BREAKER - Stop, report failure with error pattern |
+| 50+ tool calls total | POST progress update (mandatory) |
+| 80+ tool calls total | WARN - Approaching budget, evaluate if completion is realistic |
+| 100+ tool calls total | STOP - Save state, report incomplete with checkpoint |
+
+**What Counts as "Progress":**
+- File created or modified
+- Test passing that wasn't before
+- New functionality working
+- Moving to next phase of work
+
+**What Does NOT Count as Progress:**
+- Reading more files
+- Searching for something
+- Retrying the same operation
+- Adding logging/debugging
+
+**Circuit Breaker Protocol:**
+
+If you encounter the same error 3+ times:
+```
+add_comment(
+    issue_number=45,
+    body="""## Progress Update
+**Status:** Failed (Circuit Breaker)
+**Phase:** [phase when stopped]
+**Tool Calls:** 67 (budget: 100)
+
+### Circuit Breaker Triggered
+Same error occurred 3+ times:
+```
+[error message]
+```
+
+### What Was Tried
+1. [first attempt]
+2. [second attempt]
+3. [third attempt]
+
+### Recommendation
+[What human should investigate]
+
+### Files Modified
+- [list any files changed before failure]
+"""
+)
+```
+
+**Budget Approaching Protocol:**
+
+At 80+ tool calls, post an update:
+```
+add_comment(
+    issue_number=45,
+    body="""## Progress Update
+**Status:** In Progress (Budget Warning)
+**Phase:** [current phase]
+**Tool Calls:** 82 (budget: 100)
+
+### Completed
+- [x] [completed steps]
+
+### Remaining
+- [ ] [what's left]
+
+### Assessment
+[Realistic? Should I continue or stop and checkpoint?]
+"""
+)
+```
+
+**Hard Stop at 100 Calls:**
+
+If you reach 100 tool calls:
+1. STOP immediately
+2. Save current state
+3. Post checkpoint comment
+4. Report as incomplete (not failed)
+
 ## Critical Reminders
 
 1. **Never use CLI tools** - Use MCP tools exclusively for Gitea
 2. **Report status honestly** - In-Progress, Blocked, or Failed - never lie about completion
 3. **Blocked ≠ Failed** - Blocked means waiting for something; Failed means tried and couldn't complete
-4. **Branch naming** - Always use `feat/`, `fix/`, or `debug/` prefix with issue number
-5. **Branch check FIRST** - Never implement on staging/production
-6. **Follow specs precisely** - Respect architectural decisions
-7. **Apply lessons learned** - Reference in code and tests
-8. **Write tests** - Cover edge cases, not just happy path
-9. **Clean code** - Readable, maintainable, documented
-10. **No MR subtasks** - MR body should NOT have checklists
-11. **Use closing keywords** - `Closes #XX` in commit messages
-12. **Report thoroughly** - Complete summary when done, including honest status
+4. **Self-monitor** - Watch for runaway patterns, trigger circuit breaker when stuck
+5. **Branch naming** - Always use `feat/`, `fix/`, or `debug/` prefix with issue number
+6. **Branch check FIRST** - Never implement on staging/production
+7. **Follow specs precisely** - Respect architectural decisions
+8. **Apply lessons learned** - Reference in code and tests
+9. **Write tests** - Cover edge cases, not just happy path
+10. **Clean code** - Readable, maintainable, documented
+11. **No MR subtasks** - MR body should NOT have checklists
+12. **Use closing keywords** - `Closes #XX` in commit messages
+13. **Report thoroughly** - Complete summary when done, including honest status
+14. **Hard stop at 100 calls** - Save checkpoint and report incomplete
 
 ## Your Mission
 
diff --git a/plugins/projman/agents/orchestrator.md b/plugins/projman/agents/orchestrator.md
index b2f9469..ab5f21e 100644
--- a/plugins/projman/agents/orchestrator.md
+++ b/plugins/projman/agents/orchestrator.md
@@ -680,6 +680,64 @@ Would you like me to handle git operations?
 - Document blockers promptly
 - Never let tasks slip through
 
+## Runaway Detection (Monitoring Dispatched Agents)
+
+**Monitor dispatched agents for runaway behavior:**
+
+**Warning Signs:**
+- Agent running 30+ minutes with no progress comment
+- Progress comment shows "same phase" for 20+ tool calls
+- Error patterns repeating in progress comments
+
+**Intervention Protocol:**
+
+When you detect an agent may be stuck:
+
+1. **Read latest progress comment** - Check tool call count and phase
+2. **If no progress in 20+ calls** - Consider stopping the agent
+3. **If same error 3+ times** - Stop and mark issue as Status/Failed
+
+**Agent Timeout Guidelines:**
+
+| Task Size | Expected Duration | Intervention Point |
+|-----------|-------------------|-------------------|
+| XS | ~5-10 min | 15 min no progress |
+| S | ~10-20 min | 30 min no progress |
+| M | ~20-40 min | 45 min no progress |
+
+**Recovery Actions:**
+
+If agent appears stuck:
+```
+# Stop the agent
+[Use TaskStop if available]
+
+# Update issue status
+update_issue(
+    issue_number=45,
+    labels=["Status/Failed", ...other_labels]
+)
+
+# Add explanation comment
+add_comment(
+    issue_number=45,
+    body="""## Agent Intervention
+**Reason:** No progress detected for [X] minutes / [Y] tool calls
+**Last Status:** [from progress comment]
+**Action:** Stopped agent, requires human review
+
+### What Was Completed
+[from progress comment]
+
+### What Remains
+[from progress comment]
+
+### Recommendation
+[Manual completion / Different approach / Break down further]
+"""
+)
+```
+
 ## Critical Reminders
 
 1. **Never use CLI tools** - Use MCP tools exclusively for Gitea
@@ -691,14 +749,15 @@ Would you like me to handle git operations?
 7. **Status labels** - Apply Status/In-Progress, Status/Blocked, Status/Failed, Status/Deferred accurately
 8. **One status at a time** - Remove old Status/* label before applying new one
 9. **Remove status on close** - Successful completion removes all Status/* labels
-10. **No MR subtasks** - MR body should NOT have checklists
-11. **Auto-check subtasks** - Mark issue subtasks complete on close
-12. **Track meticulously** - Update issues immediately, document blockers
-13. **Capture lessons** - At sprint close, interview thoroughly
-14. **Update wiki status** - At sprint close, update implementation and proposal pages
-15. **Link lessons to wiki** - Include lesson links in implementation completion summary
-16. **Update CHANGELOG** - MANDATORY at sprint close, never skip
-17. **Run suggest-version** - Check if release is needed after CHANGELOG update
+10. **Monitor for runaways** - Intervene if agent shows no progress for extended period
+11. **No MR subtasks** - MR body should NOT have checklists
+12. **Auto-check subtasks** - Mark issue subtasks complete on close
+13. **Track meticulously** - Update issues immediately, document blockers
+14. **Capture lessons** - At sprint close, interview thoroughly
+15. **Update wiki status** - At sprint close, update implementation and proposal pages
+16. **Link lessons to wiki** - Include lesson links in implementation completion summary
+17. **Update CHANGELOG** - MANDATORY at sprint close, never skip
+18. **Run suggest-version** - Check if release is needed after CHANGELOG update
 
 ## Your Mission
 
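
The self-monitoring checkpoints this patch adds to `executor.md` amount to a small counter-based state machine. The sketch below is an illustration only and is not part of the patch: the patch expresses these rules as prompt instructions, not code. `RunawayMonitor`, `Action`, and `record_call` are hypothetical names, and the precedence between triggers (hard stop, then circuit breaker, then no-progress, then budget warning, then progress update) is an assumption, since the table does not say which trigger wins when several fire on the same call.

```python
from collections import Counter
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    REASSESS = "reassess"                # 10+ calls without progress
    CIRCUIT_BREAKER = "circuit_breaker"  # same error 3+ times
    PROGRESS_UPDATE = "progress_update"  # 50+ calls total
    BUDGET_WARNING = "budget_warning"    # 80+ calls total
    HARD_STOP = "hard_stop"              # 100+ calls total


class RunawayMonitor:
    """Hypothetical tracker for the thresholds in the checkpoint table."""

    def __init__(self):
        self.total_calls = 0
        self.calls_since_progress = 0
        self.errors = Counter()
        self.update_posted = False
        self.warning_posted = False

    def record_call(self, made_progress=False, error=None):
        """Record one tool call and return the action the table prescribes."""
        self.total_calls += 1
        # "Progress" means a file changed, a test newly passing, and so on;
        # the caller decides and passes made_progress accordingly.
        if made_progress:
            self.calls_since_progress = 0
        else:
            self.calls_since_progress += 1
        if error is not None:
            self.errors[error] += 1

        # Assumed precedence: most severe trigger first.
        if self.total_calls >= 100:
            return Action.HARD_STOP
        if error is not None and self.errors[error] >= 3:
            return Action.CIRCUIT_BREAKER
        if self.calls_since_progress >= 10:
            return Action.REASSESS
        if self.total_calls >= 80 and not self.warning_posted:
            self.warning_posted = True
            return Action.BUDGET_WARNING
        if self.total_calls >= 50 and not self.update_posted:
            self.update_posted = True
            return Action.PROGRESS_UPDATE
        return Action.CONTINUE
```

An orchestrator-side check could apply the same idea to the counts and phase reported in progress comments, using the 20-call and per-size minute thresholds from `orchestrator.md`.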