From f2a62627d09413f279282e6f8156c0e4e96794da Mon Sep 17 00:00:00 2001
From: lmiranda <leobmiranda@gmail.com>
Date: Wed, 28 Jan 2026 10:46:04 -0500
Subject: [PATCH] feat(projman): add runaway detection and circuit breaker for
 agents (#236)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Executor self-monitoring:
- 10+ calls without progress → stop and reassess
- Same error 3+ times → circuit breaker, report failure
- 50+ calls → mandatory progress update
- 80+ calls → budget warning, evaluate completion
- 100+ calls → hard stop, save checkpoint

Orchestrator monitoring:
- Detect stuck agents (30+ min without a progress comment, or 20+ tool
  calls in the same phase)
- Intervention protocol for runaway agents
- Timeout guidelines by task size (XS: 15min, S: 30min, M: 45min)
- Recovery actions with Status/Failed label

This prevents agents from running indefinitely (400+ tool calls
observed in Sprint 3) and provides clear stopping criteria.

Closes #236

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---
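Reviewer note: the executor thresholds listed in the commit message (10 calls without progress, same error 3 times, 50/80/100 total calls) can be read as a small state machine. The following is only a minimal sketch under stated assumptions, not part of the patch: the names `RunawayMonitor`, `Action`, and `record_call` are hypothetical, and the precedence between triggers (hard stop first, then circuit breaker, then stall, then the one-time notices) is an assumption, since the prompt does not define one.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Action(Enum):
    CONTINUE = "continue"
    REASSESS = "stop and reassess"                  # 10+ calls without progress
    CIRCUIT_BREAKER = "circuit breaker"             # same error 3+ times
    PROGRESS_UPDATE = "mandatory progress update"   # 50+ calls total
    BUDGET_WARNING = "budget warning"               # 80+ calls total
    HARD_STOP = "hard stop, save checkpoint"        # 100+ calls total


@dataclass
class RunawayMonitor:
    """Hypothetical tracker for the self-monitoring checkpoints."""
    total_calls: int = 0
    calls_since_progress: int = 0
    errors: Counter = field(default_factory=Counter)
    update_posted: bool = False
    warning_posted: bool = False

    def record_call(self, made_progress: bool = False,
                    error: Optional[str] = None) -> Action:
        """Record one tool call and return the action it triggers."""
        self.total_calls += 1
        self.calls_since_progress = (
            0 if made_progress else self.calls_since_progress + 1
        )
        if error is not None:
            self.errors[error] += 1

        # Assumed precedence: the most severe trigger wins.
        if self.total_calls >= 100:
            return Action.HARD_STOP
        if error is not None and self.errors[error] >= 3:
            return Action.CIRCUIT_BREAKER
        if self.calls_since_progress >= 10:
            # Assumption: reset so the reassessment fires once per stall.
            self.calls_since_progress = 0
            return Action.REASSESS
        # The 80- and 50-call notices are each emitted only once.
        if self.total_calls >= 80 and not self.warning_posted:
            self.warning_posted = True
            return Action.BUDGET_WARNING
        if self.total_calls >= 50 and not self.update_posted:
            self.update_posted = True
            return Action.PROGRESS_UPDATE
        return Action.CONTINUE
```

In this sketch, the caller decides what counts as progress (file modified, test newly passing, and so on) and passes `made_progress=True` for that call; reads, searches, and retries leave it at the default.
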
 plugins/projman/agents/executor.md     | 108 ++++++++++++++++++++++---
 plugins/projman/agents/orchestrator.md |  75 +++++++++++++++--
 2 files changed, 166 insertions(+), 17 deletions(-)
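
Reviewer note: the orchestrator side (timeout guidelines by task size, 20+ calls with no progress, same error 3+ times) could be checked along these lines. This is a hedged sketch only: `check_agent`, its parameters, and the returned codes are hypothetical names, and the behaviour for task sizes other than XS, S, and M is an assumption, because the patch defines no timeout for them.

```python
from typing import Optional

# Intervention points in minutes, taken from the timeout guidelines table.
INTERVENTION_MINUTES = {"XS": 15, "S": 30, "M": 45}
STALLED_CALLS = 20     # "no progress in 20+ calls"
REPEATED_ERRORS = 3    # "same error 3+ times"


def check_agent(task_size: str,
                minutes_since_progress: float,
                calls_in_same_phase: int = 0,
                same_error_count: int = 0) -> Optional[str]:
    """Return a hypothetical intervention code, or None to keep waiting."""
    # Repeated errors are the only case the prompt ties to Status/Failed.
    if same_error_count >= REPEATED_ERRORS:
        return "stop_and_mark_failed"

    # Assumption: sizes without a documented timeout get no time-based check.
    limit = INTERVENTION_MINUTES.get(task_size)
    if limit is not None and minutes_since_progress >= limit:
        return "timeout"

    # The prompt says to *consider* stopping here, so this is advisory.
    if calls_in_same_phase >= STALLED_CALLS:
        return "stalled"
    return None
```

The inputs would come from the agent's latest progress comment (phase and tool call count) and its timestamp; how those are parsed is left out here.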

diff --git a/plugins/projman/agents/executor.md b/plugins/projman/agents/executor.md
index 1ba39d8..bffcc62 100644
--- a/plugins/projman/agents/executor.md
+++ b/plugins/projman/agents/executor.md
@@ -424,20 +424,110 @@ As the executor, you interact with MCP tools for status updates:
 - Apply best practices
 - Deliver quality work
 
+## Runaway Detection (Self-Monitoring)
+
+**CRITICAL: Monitor yourself to prevent infinite loops and wasted resources.**
+
+**Self-Monitoring Checkpoints:**
+
+| Trigger | Action |
+|---------|--------|
+| 10+ tool calls without progress | STOP - Post progress update, reassess approach |
+| Same error 3+ times | CIRCUIT BREAKER - Stop, report failure with error pattern |
+| 50+ tool calls total | POST progress update (mandatory) |
+| 80+ tool calls total | WARN - Approaching budget, evaluate if completion is realistic |
+| 100+ tool calls total | STOP - Save state, report incomplete with checkpoint |
+
+**What Counts as "Progress":**
+- File created or modified
+- Test passing that wasn't before
+- New functionality working
+- Moving to next phase of work
+
+**What Does NOT Count as Progress:**
+- Reading more files
+- Searching for something
+- Retrying the same operation
+- Adding logging/debugging
+
+**Circuit Breaker Protocol:**
+
+If you encounter the same error 3+ times:
+```
+add_comment(
+    issue_number=45,
+    body="""## Progress Update
+**Status:** Failed (Circuit Breaker)
+**Phase:** [phase when stopped]
+**Tool Calls:** 67 (budget: 100)
+
+### Circuit Breaker Triggered
+Same error occurred 3+ times:
+~~~
+[error message]
+~~~
+
+### What Was Tried
+1. [first attempt]
+2. [second attempt]
+3. [third attempt]
+
+### Recommendation
+[What human should investigate]
+
+### Files Modified
+- [list any files changed before failure]
+"""
+)
+```
+
+**Budget Approaching Protocol:**
+
+At 80+ tool calls, post an update:
+```
+add_comment(
+    issue_number=45,
+    body="""## Progress Update
+**Status:** In Progress (Budget Warning)
+**Phase:** [current phase]
+**Tool Calls:** 82 (budget: 100)
+
+### Completed
+- [x] [completed steps]
+
+### Remaining
+- [ ] [what's left]
+
+### Assessment
+[Realistic? Should I continue or stop and checkpoint?]
+"""
+)
+```
+
+**Hard Stop at 100 Calls:**
+
+If you reach 100 tool calls:
+1. STOP immediately
+2. Save current state
+3. Post checkpoint comment
+4. Report as incomplete (not failed)
+
 ## Critical Reminders
 
 1. **Never use CLI tools** - Use MCP tools exclusively for Gitea
 2. **Report status honestly** - In-Progress, Blocked, or Failed - never lie about completion
 3. **Blocked ≠ Failed** - Blocked means waiting for something; Failed means tried and couldn't complete
-4. **Branch naming** - Always use `feat/`, `fix/`, or `debug/` prefix with issue number
-5. **Branch check FIRST** - Never implement on staging/production
-6. **Follow specs precisely** - Respect architectural decisions
-7. **Apply lessons learned** - Reference in code and tests
-8. **Write tests** - Cover edge cases, not just happy path
-9. **Clean code** - Readable, maintainable, documented
-10. **No MR subtasks** - MR body should NOT have checklists
-11. **Use closing keywords** - `Closes #XX` in commit messages
-12. **Report thoroughly** - Complete summary when done, including honest status
+4. **Self-monitor** - Watch for runaway patterns, trigger circuit breaker when stuck
+5. **Branch naming** - Always use `feat/`, `fix/`, or `debug/` prefix with issue number
+6. **Branch check FIRST** - Never implement on staging/production
+7. **Follow specs precisely** - Respect architectural decisions
+8. **Apply lessons learned** - Reference in code and tests
+9. **Write tests** - Cover edge cases, not just happy path
+10. **Clean code** - Readable, maintainable, documented
+11. **No MR subtasks** - MR body should NOT have checklists
+12. **Use closing keywords** - `Closes #XX` in commit messages
+13. **Report thoroughly** - Complete summary when done, including honest status
+14. **Hard stop at 100 calls** - Save checkpoint and report incomplete
 
 ## Your Mission
 
diff --git a/plugins/projman/agents/orchestrator.md b/plugins/projman/agents/orchestrator.md
index b2f9469..ab5f21e 100644
--- a/plugins/projman/agents/orchestrator.md
+++ b/plugins/projman/agents/orchestrator.md
@@ -680,6 +680,64 @@ Would you like me to handle git operations?
 - Document blockers promptly
 - Never let tasks slip through
 
+## Runaway Detection (Monitoring Dispatched Agents)
+
+**Monitor dispatched agents for runaway behavior:**
+
+**Warning Signs:**
+- Agent running 30+ minutes with no progress comment
+- Progress comment shows "same phase" for 20+ tool calls
+- Error patterns repeating in progress comments
+
+**Intervention Protocol:**
+
+When you detect an agent may be stuck:
+
+1. **Read latest progress comment** - Check tool call count and phase
+2. **If no progress in 20+ calls** - Consider stopping the agent
+3. **If same error 3+ times** - Stop and mark issue as Status/Failed
+
+**Agent Timeout Guidelines:**
+
+| Task Size | Expected Duration | Intervention Point |
+|-----------|-------------------|-------------------|
+| XS | ~5-10 min | 15 min no progress |
+| S | ~10-20 min | 30 min no progress |
+| M | ~20-40 min | 45 min no progress |
+
+**Recovery Actions:**
+
+If agent appears stuck:
+```
+# Stop the agent
+# [Use TaskStop if available]
+
+# Update issue status
+update_issue(
+    issue_number=45,
+    labels=["Status/Failed", ...other_labels]
+)
+
+# Add explanation comment
+add_comment(
+    issue_number=45,
+    body="""## Agent Intervention
+**Reason:** No progress detected for [X] minutes / [Y] tool calls
+**Last Status:** [from progress comment]
+**Action:** Stopped agent, requires human review
+
+### What Was Completed
+[from progress comment]
+
+### What Remains
+[from progress comment]
+
+### Recommendation
+[Manual completion / Different approach / Break down further]
+"""
+)
+```
+
 ## Critical Reminders
 
 1. **Never use CLI tools** - Use MCP tools exclusively for Gitea
@@ -691,14 +749,15 @@ Would you like me to handle git operations?
 7. **Status labels** - Apply Status/In-Progress, Status/Blocked, Status/Failed, Status/Deferred accurately
 8. **One status at a time** - Remove old Status/* label before applying new one
 9. **Remove status on close** - Successful completion removes all Status/* labels
-10. **No MR subtasks** - MR body should NOT have checklists
-11. **Auto-check subtasks** - Mark issue subtasks complete on close
-12. **Track meticulously** - Update issues immediately, document blockers
-13. **Capture lessons** - At sprint close, interview thoroughly
-14. **Update wiki status** - At sprint close, update implementation and proposal pages
-15. **Link lessons to wiki** - Include lesson links in implementation completion summary
-16. **Update CHANGELOG** - MANDATORY at sprint close, never skip
-17. **Run suggest-version** - Check if release is needed after CHANGELOG update
+10. **Monitor for runaways** - Intervene when an agent passes its task-size timeout or shows no progress for 20+ tool calls
+11. **No MR subtasks** - MR body should NOT have checklists
+12. **Auto-check subtasks** - Mark issue subtasks complete on close
+13. **Track meticulously** - Update issues immediately, document blockers
+14. **Capture lessons** - At sprint close, interview thoroughly
+15. **Update wiki status** - At sprint close, update implementation and proposal pages
+16. **Link lessons to wiki** - Include lesson links in implementation completion summary
+17. **Update CHANGELOG** - MANDATORY at sprint close, never skip
+18. **Run suggest-version** - Check if release is needed after CHANGELOG update
 
 ## Your Mission