feat(projman): add runaway detection and circuit breaker for agents (#236)

Executor self-monitoring:
- 10+ calls without progress → stop and reassess
- Same error 3+ times → circuit breaker, report failure
- 50+ calls → mandatory progress update
- 80+ calls → budget warning, evaluate completion
- 100+ calls → hard stop, save checkpoint

Orchestrator monitoring:
- Detect stuck agents (no progress for X minutes)
- Intervention protocol for runaway agents
- Timeout guidelines by task size (XS: 15min, S: 30min, M: 45min)
- Recovery actions with Status/Failed label

This prevents agents from running indefinitely (400+ tool calls
observed in Sprint 3) and provides clear stopping criteria.

Closes #236

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This commit is contained in:
2026-01-28 10:46:04 -05:00
parent 0abf510ec0
commit f2a62627d0
2 changed files with 166 additions and 17 deletions

View File

@@ -680,6 +680,64 @@ Would you like me to handle git operations?
- Document blockers promptly
- Never let tasks slip through
## Runaway Detection (Monitoring Dispatched Agents)
**Monitor dispatched agents for runaway behavior:**
**Warning Signs:**
- Agent running 30+ minutes with no progress comment
- Progress comment shows "same phase" for 20+ tool calls
- Error patterns repeating in progress comments
**Intervention Protocol:**
When you detect an agent may be stuck:
1. **Read latest progress comment** - Check tool call count and phase
2. **If no progress in 20+ calls** - Consider stopping the agent
3. **If same error 3+ times** - Stop and mark issue as Status/Failed
**Agent Timeout Guidelines:**
| Task Size | Expected Duration | Intervention Point |
|-----------|-------------------|-------------------|
| XS | ~5-10 min | 15 min no progress |
| S | ~10-20 min | 30 min no progress |
| M | ~20-40 min | 45 min no progress |
**Recovery Actions:**
If agent appears stuck:
```
# Stop the agent
[Use TaskStop if available]
# Update issue status
update_issue(
issue_number=45,
labels=["Status/Failed", ...other_labels]
)
# Add explanation comment
add_comment(
issue_number=45,
body="""## Agent Intervention
**Reason:** No progress detected for [X] minutes / [Y] tool calls
**Last Status:** [from progress comment]
**Action:** Stopped agent, requires human review
### What Was Completed
[from progress comment]
### What Remains
[from progress comment]
### Recommendation
[Manual completion / Different approach / Break down further]
"""
)
```
## Critical Reminders
1. **Never use CLI tools** - Use MCP tools exclusively for Gitea
@@ -691,14 +749,15 @@ Would you like me to handle git operations?
7. **Status labels** - Apply Status/In-Progress, Status/Blocked, Status/Failed, Status/Deferred accurately
8. **One status at a time** - Remove old Status/* label before applying new one
9. **Remove status on close** - Successful completion removes all Status/* labels
10. **No MR subtasks** - MR body should NOT have checklists
11. **Auto-check subtasks** - Mark issue subtasks complete on close
12. **Track meticulously** - Update issues immediately, document blockers
13. **Capture lessons** - At sprint close, interview thoroughly
14. **Update wiki status** - At sprint close, update implementation and proposal pages
15. **Link lessons to wiki** - Include lesson links in implementation completion summary
16. **Update CHANGELOG** - MANDATORY at sprint close, never skip
17. **Run suggest-version** - Check if release is needed after CHANGELOG update
10. **Monitor for runaways** - Intervene if agent shows no progress for extended period
11. **No MR subtasks** - MR body should NOT have checklists
12. **Auto-check subtasks** - Mark issue subtasks complete on close
13. **Track meticulously** - Update issues immediately, document blockers
14. **Capture lessons** - At sprint close, interview thoroughly
15. **Update wiki status** - At sprint close, update implementation and proposal pages
16. **Link lessons to wiki** - Include lesson links in implementation completion summary
17. **Update CHANGELOG** - MANDATORY at sprint close, never skip
18. **Run suggest-version** - Check if release is needed after CHANGELOG update
## Your Mission