---
name: runaway-detection
description: Detecting and handling stuck agents
---

# Runaway Detection

## Purpose

Defines how to detect stuck agents and the intervention protocols to follow once one is found.

## When to Use

- **Orchestrator agent**: When monitoring dispatched agents
- **Executor agent**: When self-monitoring during execution

---

## Warning Signs

| Sign | Threshold | Action |
|------|-----------|--------|
| No progress comment | 30+ minutes | Investigate |
| Same phase repeated | 20+ tool calls | Consider stopping |
| Same error repeated | 3+ times | Stop agent immediately |
| Approaching tool call budget | 80% of limit | Post checkpoint |

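
As a minimal sketch, the thresholds in the Warning Signs table can be checked in one pass. `AgentSnapshot`, its fields, and `warning_actions` are hypothetical names introduced here for illustration; they are not part of projman or its MCP tools.

```python
from dataclasses import dataclass


@dataclass
class AgentSnapshot:
    """Hypothetical summary of an agent's state (field names are assumptions)."""
    minutes_since_progress: float
    calls_in_current_phase: int
    repeated_error_count: int
    tool_calls_used: int
    tool_call_budget: int


def warning_actions(snap: AgentSnapshot) -> list[str]:
    """Return the actions from the Warning Signs table that currently apply."""
    actions = []
    if snap.minutes_since_progress >= 30:
        actions.append("Investigate")
    if snap.calls_in_current_phase >= 20:
        actions.append("Consider stopping")
    if snap.repeated_error_count >= 3:
        actions.append("Stop agent")
    # Integer comparison avoids float rounding at exactly 80% of the budget.
    if snap.tool_calls_used * 100 >= 80 * snap.tool_call_budget:
        actions.append("Post checkpoint")
    return actions


snapshot = AgentSnapshot(
    minutes_since_progress=25,
    calls_in_current_phase=20,
    repeated_error_count=3,
    tool_calls_used=45,
    tool_call_budget=100,
)
print(warning_actions(snapshot))  # ['Consider stopping', 'Stop agent']
```
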
---

## Agent Timeout Guidelines

| Task Size | Expected Duration | Intervention Point |
|-----------|-------------------|-------------------|
| XS | ~5-10 min | 15 min no progress |
| S | ~10-20 min | 30 min no progress |
| M | ~20-40 min | 45 min no progress |

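
The intervention points above reduce to a simple lookup. The sketch below assumes task sizes arrive as the strings `XS`, `S`, and `M`; `needs_intervention` is a hypothetical helper, and treating "exactly at the limit" as overdue is an assumption.

```python
# Minutes without progress before intervening, per the Agent Timeout Guidelines table.
INTERVENTION_MINUTES = {"XS": 15, "S": 30, "M": 45}


def needs_intervention(task_size: str, minutes_without_progress: float) -> bool:
    """True once an agent has gone without progress for its size's limit."""
    try:
        limit = INTERVENTION_MINUTES[task_size.upper()]
    except KeyError:
        # Sizes outside the table have no guideline, so fail loudly.
        raise ValueError(f"no timeout guideline for task size {task_size!r}") from None
    return minutes_without_progress >= limit
```
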
---

## Detection Protocol

1. **Read latest progress comment** - Check tool call count and phase
2. **Compare to previous** - Is progress happening?
3. **Check for error patterns** - Same error repeating?
4. **Evaluate time elapsed** - Beyond expected duration?

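
The four steps can be sketched as a single check over two consecutive progress comments. The `ProgressComment` structure is an assumption (this skill does not define how progress comments are parsed), and `detect_stuck` is a hypothetical helper; the thresholds of 20 tool calls and 3 repeats come from the Warning Signs table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProgressComment:
    """Hypothetical parsed progress comment (field names are assumptions)."""
    phase: str
    tool_calls: int
    last_error: Optional[str] = None
    error_repeats: int = 0


def detect_stuck(previous: ProgressComment, latest: ProgressComment,
                 minutes_elapsed: float, intervention_minutes: float) -> list[str]:
    """Walk the four protocol steps and return the reasons for concern, if any."""
    findings = []
    # Steps 1-2: read the latest comment and compare it to the previous one.
    calls_since = latest.tool_calls - previous.tool_calls
    if latest.phase == previous.phase and calls_since >= 20:
        findings.append(f"phase '{latest.phase}' unchanged for {calls_since} tool calls")
    # Step 3: look for the same error repeating.
    if latest.last_error and latest.error_repeats >= 3:
        findings.append(f"{latest.last_error} repeated {latest.error_repeats} times")
    # Step 4: compare elapsed time with the intervention point for the task size.
    if minutes_elapsed >= intervention_minutes:
        findings.append(f"{minutes_elapsed:g} minutes elapsed without progress")
    return findings
```
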
---

## Intervention Protocol

When you detect an agent may be stuck:

### Step 1: Assess

```
Agent Status Check for #45:
- Last progress: 25 minutes ago
- Phase: "Testing" (same as 20 tool calls ago)
- Errors: "ModuleNotFoundError" (3 times)
- Assessment: LIKELY STUCK
```

### Step 2: Stop Agent

```python
# Only applies if the TaskStop tool is available in this environment
TaskStop(task_id="agent-id")
```

### Step 3: Update Issue Status

```python
# other_labels: the labels already on the issue that should be kept
update_issue(
    repo="org/repo",
    issue_number=45,
    labels=["Status/Failed", *other_labels]
)
```

### Step 4: Add Explanation Comment

```python
add_comment(
    repo="org/repo",
    number=45,
    body="""## Agent Intervention
**Reason:** No progress detected for 25 minutes / repeated errors
**Last Status:** Testing phase, ModuleNotFoundError x3
**Action:** Stopped agent, requires human review

### What Was Completed
- [x] Created auth/jwt_service.py
- [x] Implemented generate_token()

### What Remains
- [ ] Fix import issue
- [ ] Write tests
- [ ] Commit

### Recommendation
- Check for missing dependency in requirements.txt
- May need manual intervention to resolve import
"""
)
```

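
To keep intervention comments consistent, the body from Step 4 can be assembled from lists rather than written by hand. This is a sketch only: `checklist` and `intervention_comment` are hypothetical helpers, and the returned string is meant to be passed as `body` to `add_comment`.

```python
def checklist(items: list[str], done: bool) -> list[str]:
    """Render items as Markdown task-list lines, checked or unchecked."""
    box = "[x]" if done else "[ ]"
    return [f"- {box} {item}" for item in items]


def intervention_comment(reason: str, last_status: str, completed: list[str],
                         remaining: list[str], recommendations: list[str]) -> str:
    """Assemble a comment body with the same sections as the Step 4 example."""
    lines = [
        "## Agent Intervention",
        f"**Reason:** {reason}",
        f"**Last Status:** {last_status}",
        "**Action:** Stopped agent, requires human review",
        "",
        "### What Was Completed",
        *checklist(completed, done=True),
        "",
        "### What Remains",
        *checklist(remaining, done=False),
        "",
        "### Recommendation",
        *[f"- {item}" for item in recommendations],
    ]
    return "\n".join(lines) + "\n"
```
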
---

## Self-Monitoring (Executor)

Executors should self-monitor:

### Circuit Breakers

- **Same error 3 times**: Stop and report
- **80% of tool call budget**: Post checkpoint
- **File not found 3 times**: Stop and ask for help
- **Test failing same way 5 times**: Stop and report

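
One way an executor might track these breakers is with a small counter object. The sketch below is illustrative only: the `CircuitBreaker` class, the category names (`error`, `file_not_found`, `test_failure`), and the `"checkpoint"`/`"stop"` return values are assumptions, not an existing projman API.

```python
from collections import Counter
from typing import Optional

# Repeat limits from the Circuit Breakers list; category names are made up for this sketch.
LIMITS = {"error": 3, "file_not_found": 3, "test_failure": 5}


class CircuitBreaker:
    """Counts tool calls and identical failures for one executor run."""

    def __init__(self, tool_call_budget: int):
        self.budget = tool_call_budget
        self.calls = 0
        self.checkpoint_posted = False
        self.seen = Counter()

    def record_call(self) -> Optional[str]:
        """Count one tool call; return 'checkpoint' once, at 80% of the budget."""
        self.calls += 1
        if not self.checkpoint_posted and self.calls * 100 >= 80 * self.budget:
            self.checkpoint_posted = True
            return "checkpoint"
        return None

    def record_failure(self, category: str, signature: str) -> Optional[str]:
        """Count an identical failure; return 'stop' once its limit is reached."""
        key = (category, signature)
        self.seen[key] += 1
        if self.seen[key] >= LIMITS[category]:
            return "stop"
        return None


breaker = CircuitBreaker(tool_call_budget=100)
for _ in range(3):
    verdict = breaker.record_failure("error", "ModuleNotFoundError")
print(verdict)  # stop
```
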
### Self-Check Template

```
Self-check at tool call 45/100:
- Progress: 4/7 steps completed
- Current phase: Testing
- Errors encountered: 1 (resolved)
- Remaining budget: 55 calls
- Status: ON TRACK
```

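
The template can be filled in mechanically so the remaining budget always matches the call count. `render_self_check` is a hypothetical helper; the status string is passed in by the caller because this skill does not define the full set of status values.

```python
def render_self_check(calls_used: int, budget: int, steps_done: int,
                      steps_total: int, phase: str, errors: str, status: str) -> str:
    """Fill in the self-check template; remaining budget is derived, not supplied."""
    remaining = budget - calls_used
    return "\n".join([
        f"Self-check at tool call {calls_used}/{budget}:",
        f"- Progress: {steps_done}/{steps_total} steps completed",
        f"- Current phase: {phase}",
        f"- Errors encountered: {errors}",
        f"- Remaining budget: {remaining} calls",
        f"- Status: {status}",
    ])


print(render_self_check(45, 100, 4, 7, "Testing", "1 (resolved)", "ON TRACK"))
```
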
---

## Recovery Actions

After stopping a stuck agent:

1. **Preserve work** - Branch and commits remain
2. **Document state** - Checkpoint in issue comment
3. **Identify cause** - What caused the loop?
4. **Plan recovery**:
   - Manual completion
   - Different approach
   - Break down further
   - Assign to human

---

## Common Stuck Patterns

| Pattern | Cause | Solution |
|---------|-------|----------|
| Import loop | Missing dependency | Add to requirements |
| Test loop | Non-deterministic test | Fix test isolation |
| Validation loop | Error message not changing | Improve error specificity |
| File not found | Wrong path | Verify path exists |
| Permission denied | File ownership | Check permissions |

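
Some of these patterns can be recognized from the error text alone. The sketch below maps Python's built-in exception names to rows of the table; `suggest_solution` is a hypothetical helper, and the matching by substring is an assumption that only suits Python tracebacks.

```python
from typing import Optional

# (exception names to look for, pattern, solution) taken from the table above.
PATTERNS = [
    (("ModuleNotFoundError", "ImportError"), "Import loop", "Add to requirements"),
    (("FileNotFoundError",), "File not found", "Verify path exists"),
    (("PermissionError",), "Permission denied", "Check permissions"),
]


def suggest_solution(error_text: str) -> Optional[tuple[str, str]]:
    """Return (pattern, solution) for a recognized error, else None."""
    for markers, pattern, solution in PATTERNS:
        if any(marker in error_text for marker in markers):
            return pattern, solution
    return None


print(suggest_solution("ModuleNotFoundError: No module named 'jwt'"))
# ('Import loop', 'Add to requirements')
```

Test loops and validation loops are left out because they show up as repetition across attempts rather than as a distinctive error string.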