feat(projman): add checkpoint/resume for interrupted agent work (#237)

Executor checkpointing:
- Standard checkpoint comment format with branch, commit, phase
- Files modified with status (created, modified)
- Completed and pending steps tracking
- State notes for resumption context
- Save checkpoint after major steps, before stopping

Orchestrator resume detection:
- Scan issue comments for "## Checkpoint" markers
- Offer resume options: resume, start fresh, review details
- Verify branch exists and files match before resuming
- Dispatch executor with checkpoint context

Sprint-start integration:
- Checkpoint detection as first workflow step
- Resume flow documentation with example
- Checkpoint format specification

This enables resuming work after:
- Budget exhaustion (100 tool call limit)
- Agent failure/circuit breaker
- Manual interruption
- Session timeout

Closes #237

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 10:49:34 -05:00
parent a69a4d19d0
commit 459550e7d3
3 changed files with 174 additions and 2 deletions


@@ -424,6 +424,81 @@ As the executor, you interact with MCP tools for status updates:
- Apply best practices
- Deliver quality work
## Checkpointing (Save Progress for Resume)
**CRITICAL: Save checkpoints so work can be resumed if interrupted.**
**Checkpoint Comment Format:**
```markdown
## Checkpoint
**Branch:** feat/45-jwt-service
**Commit:** abc123 (or "uncommitted")
**Phase:** [current phase]
**Tool Calls:** 45
### Files Modified
- auth/jwt_service.py (created)
- tests/test_jwt.py (created)
### Completed Steps
- [x] Created jwt_service.py skeleton
- [x] Implemented generate_token()
- [x] Implemented verify_token()
### Pending Steps
- [ ] Write unit tests
- [ ] Add token refresh logic
- [ ] Commit and push
### State Notes
[Any important context for resumption]
```
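A minimal sketch of how an executor might build this comment body programmatically. The `Checkpoint` dataclass and `render_checkpoint` helper are hypothetical names used only for illustration; they are not part of projman.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical container for the fields of a checkpoint comment."""
    branch: str
    phase: str
    tool_calls: int
    commit: str = "uncommitted"
    files: dict[str, str] = field(default_factory=dict)  # path -> status
    completed: list[str] = field(default_factory=list)
    pending: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)


def render_checkpoint(cp: Checkpoint) -> str:
    """Render the checkpoint in the markdown layout shown above."""
    lines = [
        "## Checkpoint",
        f"**Branch:** {cp.branch}",
        f"**Commit:** {cp.commit}",
        f"**Phase:** {cp.phase}",
        f"**Tool Calls:** {cp.tool_calls}",
        "### Files Modified",
        *[f"- {path} ({status})" for path, status in cp.files.items()],
        "### Completed Steps",
        *[f"- [x] {step}" for step in cp.completed],
        "### Pending Steps",
        *[f"- [ ] {step}" for step in cp.pending],
        "### State Notes",
        *[f"- {note}" for note in cp.notes],
    ]
    return "\n".join(lines)
```

The returned string could then be passed as the `body` argument of `add_comment`.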
**When to Save Checkpoints:**
- After completing each major step (every 20-30 tool calls)
- Before stopping due to budget limit
- When encountering a blocker
- After any commit
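The triggers above could be folded into a single check. This is only a sketch: `should_checkpoint` and all of its parameters are hypothetical, the 25-call interval is an assumed midpoint of the 20-30 range, the 100-call budget comes from the commit message, and the 5-call reserve is an assumption.

```python
def should_checkpoint(
    calls_since_last: int,
    *,
    step_done: bool = False,
    committed: bool = False,
    blocked: bool = False,
    calls_used: int = 0,
    budget: int = 100,   # tool call limit named in the commit message
    reserve: int = 5,    # assumed safety margin before the budget runs out
    interval: int = 25,  # assumed midpoint of "every 20-30 tool calls"
) -> bool:
    """Return True when any of the documented checkpoint triggers applies."""
    near_budget = budget - calls_used <= reserve
    overdue = calls_since_last >= interval
    return step_done or committed or blocked or near_budget or overdue
```

For example, `should_checkpoint(3, committed=True)` is true because a commit was just made, even though only three calls have passed since the last checkpoint.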
**Checkpoint Example:**
```
add_comment(
issue_number=45,
body="""## Checkpoint
**Branch:** feat/45-jwt-service
**Commit:** uncommitted (changes staged)
**Phase:** Testing
**Tool Calls:** 67
### Files Modified
- auth/jwt_service.py (created, 120 lines)
- auth/__init__.py (modified, added import)
- tests/test_jwt.py (created, 50 lines, incomplete)
### Completed Steps
- [x] Created auth/jwt_service.py
- [x] Implemented generate_token() with HS256
- [x] Implemented verify_token()
- [x] Updated auth/__init__.py exports
### Pending Steps
- [ ] Complete test_jwt.py (5 tests remaining)
- [ ] Add token refresh logic
- [ ] Commit changes
- [ ] Push to remote
### State Notes
- Using PyJWT 2.8.0
- Secret key from JWT_SECRET env var
- Tests use pytest fixtures in conftest.py
"""
)
```
**Checkpoint on Interruption:**
If you must stop (budget, failure, blocker), ALWAYS post a checkpoint FIRST.
## Runaway Detection (Self-Monitoring)
**CRITICAL: Monitor yourself to prevent infinite loops and wasted resources.**


@@ -93,7 +93,44 @@ git branch --show-current
**Workflow:**
**A. Fetch Sprint Issues**
**A. Fetch Sprint Issues and Detect Checkpoints**
```
list_issues(state="open", labels=["sprint-current"])
```
**For each open issue, check for checkpoint comments:**
```
get_issue(issue_number=45) # Comments included
→ Look for comments containing "## Checkpoint"
```
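A sketch of the detection step, assuming the comments come back as a list of dicts with `body` and `created_at` keys. That shape is an assumption; the actual return value of `get_issue` is not shown here. `latest_checkpoint` is a hypothetical helper.

```python
from typing import Optional

CHECKPOINT_MARKER = "## Checkpoint"


def latest_checkpoint(comments: list[dict]) -> Optional[dict]:
    """Return the newest comment that has a '## Checkpoint' heading line."""
    candidates = [
        c for c in comments
        # Match the marker as a whole line so that prose merely mentioning
        # a longer heading that starts with the same words is not picked up.
        if any(
            line.strip() == CHECKPOINT_MARKER
            for line in (c.get("body") or "").splitlines()
        )
    ]
    if not candidates:
        return None
    # Assumes created_at values are ISO-8601 strings in the same timezone,
    # which sort chronologically when compared as plain strings.
    return max(candidates, key=lambda c: c["created_at"])
```

Matching the full line is stricter than a plain substring search; a substring check would be simpler but could misfire on comments that only quote the marker.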
**If Checkpoint Found:**
```
Checkpoint Detected for #45
Found checkpoint from previous session:
Branch: feat/45-jwt-service
Phase: Testing
Tool Calls: 67
Files Modified: 3
Completed: 4/8 steps
Options:
1. Resume from checkpoint (recommended)
2. Start fresh (discard previous work)
3. Review checkpoint details first
Would you like to resume?
```
**Resume Protocol:**
1. Verify branch exists: `git branch -a | grep feat/45-jwt-service`
2. Switch to branch: `git checkout feat/45-jwt-service`
3. Verify files match checkpoint
4. Dispatch executor with checkpoint context
5. Executor continues from pending steps
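Steps 1 and 3 of the protocol can be sketched as below. Both helpers are hypothetical, the remote name `origin` is an assumption, and "files match" is interpreted here only as "every file listed in the checkpoint exists in the working tree"; a stricter check might also compare the commit hash.

```python
import subprocess
from pathlib import Path


def branch_exists(branch: str, repo: str = ".") -> bool:
    """True if the branch exists locally or on the (assumed) origin remote."""
    for ref in (f"refs/heads/{branch}", f"refs/remotes/origin/{branch}"):
        result = subprocess.run(
            ["git", "-C", repo, "show-ref", "--verify", "--quiet", ref]
        )
        if result.returncode == 0:
            return True
    return False


def missing_files(paths: list[str], repo: str = ".") -> list[str]:
    """Return the checkpoint file paths that are absent from the working tree."""
    return [p for p in paths if not (Path(repo) / p).exists()]
```

An orchestrator could refuse to resume, and fall back to the "review checkpoint details" option, when `branch_exists` is false or `missing_files` returns a non-empty list.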
**B. Fetch Sprint Issues (Standard)**
```
list_issues(state="open", labels=["sprint-current"])
```


@@ -25,7 +25,12 @@ If you are on a production or staging branch, you MUST stop and ask the user to
The orchestrator agent will:
1. **Fetch Sprint Issues**
1. **Detect Checkpoints (Resume Support)**
- Check each open issue for `## Checkpoint` comments
- If checkpoint found, offer to resume from that point
- Resume preserves: branch, completed work, pending steps
2. **Fetch Sprint Issues**
- Use `list_issues` to fetch open issues for the sprint
- Identify priorities based on labels (Priority/Critical, Priority/High, etc.)
@@ -300,6 +305,61 @@ Batch 2 (now unblocked):
Starting #46 while #48 continues...
```
## Checkpoint Resume Support
If a previous session was interrupted (manual stop, agent failure, or budget exhaustion), the orchestrator can resume the work from the last saved checkpoint.
**Checkpoint Detection:**
The orchestrator scans issue comments for `## Checkpoint` markers containing:
- Branch name
- Last commit hash
- Completed/pending steps
- Files modified
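A sketch of how these fields might be extracted from a checkpoint comment body. `parse_checkpoint` and the regular expressions are hypothetical and assume the comment follows the documented layout exactly.

```python
import re

FIELD_RE = re.compile(r"^\*\*(?P<key>[^:*]+):\*\*\s*(?P<value>.*)$")
STEP_RE = re.compile(r"^- \[(?P<mark>[ xX])\]\s+(?P<text>.+)$")
FILE_RE = re.compile(r"^- (?P<path>\S+) \((?P<status>[^)]*)\)$")


def parse_checkpoint(body: str) -> dict:
    """Split a checkpoint comment into header fields, files, and steps."""
    parsed = {"fields": {}, "files": {}, "completed": [], "pending": []}
    section = None
    for raw in body.splitlines():
        line = raw.strip()
        if line.startswith("### "):
            section = line[4:]  # remember which subsection we are in
            continue
        if (m := FIELD_RE.match(line)):
            # Header lines such as "**Branch:** feat/45-jwt-service"
            parsed["fields"][m["key"]] = m["value"]
        elif (m := STEP_RE.match(line)):
            # "[x]" marks a completed step, "[ ]" a pending one
            key = "completed" if m["mark"].lower() == "x" else "pending"
            parsed[key].append(m["text"])
        elif section == "Files Modified" and (m := FILE_RE.match(line)):
            parsed["files"][m["path"]] = m["status"]
    return parsed
```

The progress summary shown to the user can then be derived from `len(parsed["completed"])` and `len(parsed["pending"])`.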
**Resume Flow:**
```
User: /sprint-start
Orchestrator: Checking for checkpoints...
Found checkpoint for #45 (JWT service):
Branch: feat/45-jwt-service
Last activity: 2 hours ago
Progress: 4/7 steps completed
Pending: Write tests, add refresh, commit
Options:
1. Resume from checkpoint (recommended)
2. Start fresh (lose previous work)
3. Review checkpoint details
User: 1
Orchestrator: Resuming #45 from checkpoint...
✓ Branch exists
✓ Files match checkpoint
✓ Dispatching executor with context
Executor continues from pending steps...
```
**Checkpoint Format:**
Executors save checkpoints after major steps:
```markdown
## Checkpoint
**Branch:** feat/45-jwt-service
**Commit:** abc123
**Phase:** Testing
### Completed Steps
- [x] Step 1
- [x] Step 2
### Pending Steps
- [ ] Step 3
- [ ] Step 4
```
## Getting Started
Simply invoke `/sprint-start` and the orchestrator will: