--- name: runaway-detection description: Detecting and handling stuck agents --- # Runaway Detection ## Purpose Defines how to detect stuck agents and intervention protocols. ## When to Use - **Orchestrator agent**: When monitoring dispatched agents - **Executor agent**: Self-monitoring during execution --- ## Warning Signs | Sign | Threshold | Action | |------|-----------|--------| | No progress comment | 30+ minutes | Investigate | | Same phase repeated | 20+ tool calls | Consider stopping | | Same error 3+ times | Immediately | Stop agent | | Approaching budget | 80% of limit | Post checkpoint | --- ## Agent Timeout Guidelines | Task Size | Expected Duration | Intervention Point | |-----------|-------------------|-------------------| | XS | ~5-10 min | 15 min no progress | | S | ~10-20 min | 30 min no progress | | M | ~20-40 min | 45 min no progress | --- ## Detection Protocol 1. **Read latest progress comment** - Check tool call count and phase 2. **Compare to previous** - Is progress happening? 3. **Check for error patterns** - Same error repeating? 4. **Evaluate time elapsed** - Beyond expected duration? --- ## Intervention Protocol When you detect an agent may be stuck: ### Step 1: Assess ``` Agent Status Check for #45: - Last progress: 25 minutes ago - Phase: "Testing" (same as 20 tool calls ago) - Errors: "ModuleNotFoundError" (3 times) - Assessment: LIKELY STUCK ``` ### Step 2: Stop Agent ```python # If TaskStop available TaskStop(task_id="agent-id") ``` ### Step 3: Update Issue Status ```python update_issue( repo="org/repo", issue_number=45, labels=["Status/Failed", ...other_labels] ) ``` ### Step 4: Add Explanation Comment ```python add_comment( repo="org/repo", number=45, body="""## Agent Intervention **Reason:** No progress detected for 25 minutes / repeated errors **Last Status:** Testing phase, ModuleNotFoundError x3 **Action:** Stopped agent, requires human review ### What Was Completed - [x] Created auth/jwt_service.py - [x] Implemented generate_token() ### What Remains - [ ] Fix import issue - [ ] Write tests - [ ] Commit ### Recommendation - Check for missing dependency in requirements.txt - May need manual intervention to resolve import """ ) ``` --- ## Self-Monitoring (Executor) Executors should self-monitor: ### Circuit Breakers - **Same error 3 times**: Stop and report - **80% of tool call budget**: Post checkpoint - **File not found 3 times**: Stop and ask for help - **Test failing same way 5 times**: Stop and report ### Self-Check Template ``` Self-check at tool call 45/100: - Progress: 4/7 steps completed - Current phase: Testing - Errors encountered: 1 (resolved) - Remaining budget: 55 calls - Status: ON TRACK ``` --- ## Recovery Actions After stopping a stuck agent: 1. **Preserve work** - Branch and commits remain 2. **Document state** - Checkpoint in issue comment 3. **Identify cause** - What caused the loop? 4. **Plan recovery**: - Manual completion - Different approach - Break down further - Assign to human --- ## Common Stuck Patterns | Pattern | Cause | Solution | |---------|-------|----------| | Import loop | Missing dependency | Add to requirements | | Test loop | Non-deterministic test | Fix test isolation | | Validation loop | Error message not changing | Improve error specificity | | File not found | Wrong path | Verify path exists | | Permission denied | File ownership | Check permissions |