Multi-Agent Resilience
State database and core types for multi-agent task tracking, event persistence, and session recovery.
The resilience module tracks agent tasks, events, traces, and subagent sessions in SQLite, so that state can be recovered after a restart and inspected for observability.
Design Philosophy
- Persistent state — All agent state survives restarts via SQLite
- Structured traces — Task execution traces enable shadow replay for debugging
- Session lifecycle — Subagent sessions track creation, idle, and swap states
- Tiered events — Skeleton events for structure, Pulse events for detail
Core Types
TaskStatus
Tasks progress through a state machine:
pub enum TaskStatus {
    Pending,     // Waiting to execute
    Running,     // Currently executing
    Completed,   // Success
    Failed,      // Error occurred
    Interrupted, // System restart
    Idle,        // Paused (Session-as-a-Service)
    Swapped,     // Context swapped to disk
}
AgentTask
A task with recovery checkpoints:
pub struct AgentTask {
    pub task_id: String,
    pub status: TaskStatus,
    pub lane: Lane,                      // Execution lane (Sequential/Parallel)
    pub risk_level: RiskLevel,           // Low/Medium/High/Critical
    pub checkpoint_data: Option<String>, // Serialized state for recovery
}
TaskTrace
Structured execution traces for shadow replay:
pub struct TaskTrace {
    pub trace_id: String,
    pub task_id: String,
    pub events: Vec<TaskTraceInfo>,
}
SubagentSession
Long-lived subagent session management:
pub struct SubagentSession {
    pub session_id: String,
    pub status: SessionStatus,
    pub idle_since: Option<DateTime<Utc>>,
    pub swapped_at: Option<DateTime<Utc>>,
}
StateDatabase
SQLite database providing CRUD operations for:
| Table | Purpose |
|---|---|
| events | Agent events (skeleton + pulse tiers) |
| tasks | Agent tasks with status and checkpoints |
| traces | Task execution traces |
| sessions | Subagent sessions |
| memory_events | Memory-backed event indexing |
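To make the role of the tasks table in recovery concrete, here is a minimal sketch that uses an in-memory map as a stand-in for the SQLite table. It assumes that a startup pass marks every task still recorded as Running as Interrupted and then reads its checkpoint_data to resume. That pass, the `TaskTable` type, and its method names are hypothetical illustrations, not the module's actual API, and the `lane` and `risk_level` fields are omitted for brevity.

```rust
use std::collections::HashMap;

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskStatus {
    Pending,
    Running,
    Completed,
    Failed,
    Interrupted,
    Idle,
    Swapped,
}

/// Trimmed-down AgentTask: only the fields this sketch needs.
#[derive(Debug, Clone)]
struct AgentTask {
    task_id: String,
    status: TaskStatus,
    checkpoint_data: Option<String>,
}

/// Hypothetical in-memory stand-in for the `tasks` table.
#[derive(Default)]
struct TaskTable {
    rows: HashMap<String, AgentTask>,
}

impl TaskTable {
    /// Insert a task, or overwrite the row with the same task_id.
    fn upsert(&mut self, task: AgentTask) {
        self.rows.insert(task.task_id.clone(), task);
    }

    /// Look a task up by id.
    fn get(&self, task_id: &str) -> Option<&AgentTask> {
        self.rows.get(task_id)
    }

    /// Assumed recovery pass: a task still marked Running at startup cannot
    /// really be running, so flag it Interrupted. Returns the affected ids, sorted.
    fn mark_interrupted_on_startup(&mut self) -> Vec<String> {
        let mut ids = Vec::new();
        for task in self.rows.values_mut() {
            if task.status == TaskStatus::Running {
                task.status = TaskStatus::Interrupted;
                ids.push(task.task_id.clone());
            }
        }
        ids.sort();
        ids
    }
}

fn main() {
    let mut table = TaskTable::default();
    table.upsert(AgentTask {
        task_id: "t1".to_string(),
        status: TaskStatus::Running,
        checkpoint_data: Some("step=3".to_string()),
    });
    table.upsert(AgentTask {
        task_id: "t2".to_string(),
        status: TaskStatus::Completed,
        checkpoint_data: None,
    });

    // Simulated restart: only the task that was mid-flight is flagged.
    let interrupted = table.mark_interrupted_on_startup();
    assert_eq!(interrupted, vec!["t1".to_string()]);

    // Its checkpoint is still available for resuming the work.
    let task = table.get("t1").expect("t1 was inserted above");
    assert_eq!(task.status, TaskStatus::Interrupted);
    assert_eq!(task.checkpoint_data.as_deref(), Some("step=3"));
}
```

In the real module the same pass would presumably be an UPDATE against the tasks table, which is why a task lookup by status is worth indexing.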
Schema
The schema is versioned with migration utilities in migration.rs. Key indexes:
- `tasks(task_id, status)` — Task lookup by status
- `events(event_id, created_at)` — Event time-range queries
- `traces(trace_id, task_id)` — Trace-to-task mapping
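As a rough illustration, the three indexes could be declared with DDL along these lines. The index names are hypothetical, and the statements are kept as static strings the way a migration step might hold them; the actual contents of migration.rs are not shown in this page.

```rust
/// Hypothetical DDL for the key indexes listed above.
/// Table and column names come from the list; the index names are made up.
const CREATE_KEY_INDEXES: [&str; 3] = [
    "CREATE INDEX IF NOT EXISTS idx_tasks_task_id_status ON tasks (task_id, status)",
    "CREATE INDEX IF NOT EXISTS idx_events_event_id_created_at ON events (event_id, created_at)",
    "CREATE INDEX IF NOT EXISTS idx_traces_trace_id_task_id ON traces (trace_id, task_id)",
];

fn main() {
    // A migration step would execute each statement once; here we only print them.
    for statement in CREATE_KEY_INDEXES {
        println!("{statement};");
    }
}
```

`IF NOT EXISTS` makes each statement safe to re-run, which suits a versioned migration that may be applied to a database that already has some of the indexes.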
Safety
- Integer overflow prevention — `usize` to `i64` conversions use `try_from` with an `i64::MAX` fallback
- Lock safety — All `lock()` calls use `unwrap_or_else(|e| e.into_inner())`
- Parameterized queries — All SQL uses `params![]`, no string interpolation
- No static mut — `AtomicBool` for flags
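Three of these patterns can be shown with the standard library alone. The sketch below is illustrative: the function and static names (`to_sql_int`, `push_event`, `SHUTTING_DOWN`) are hypothetical and not taken from the module. The parameterized-query rule is left out because it needs the SQLite binding.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// A process-wide flag held in an AtomicBool instead of a `static mut`.
static SHUTTING_DOWN: AtomicBool = AtomicBool::new(false);

/// Convert a count to the i64 that SQLite stores, falling back to i64::MAX
/// if the value does not fit.
fn to_sql_int(n: usize) -> i64 {
    i64::try_from(n).unwrap_or(i64::MAX)
}

/// Append to a shared log, recovering the guard even if an earlier holder
/// panicked and left the mutex poisoned. Returns the new log length.
fn push_event(log: &Mutex<Vec<String>>, event: &str) -> usize {
    let mut guard = log.lock().unwrap_or_else(|e| e.into_inner());
    guard.push(event.to_string());
    guard.len()
}

fn main() {
    // Small values pass through unchanged.
    assert_eq!(to_sql_int(42), 42);
    // On 64-bit targets usize::MAX exceeds i64::MAX, so the fallback is used.
    #[cfg(target_pointer_width = "64")]
    assert_eq!(to_sql_int(usize::MAX), i64::MAX);

    // Poison the mutex by panicking in a thread that holds the lock.
    let log: Arc<Mutex<Vec<String>>> = Arc::new(Mutex::new(Vec::new()));
    let poisoner = Arc::clone(&log);
    let result = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
    assert!(result.is_err());
    assert!(log.is_poisoned());

    // A plain `lock().unwrap()` would panic here; the recovery pattern does not.
    assert_eq!(push_event(&log, "recovered"), 1);

    // Flags are flipped through atomic operations, with no unsafe code.
    SHUTTING_DOWN.store(true, Ordering::SeqCst);
    assert!(SHUTTING_DOWN.load(Ordering::SeqCst));
}
```

Recovering a poisoned guard with `into_inner()` trades strictness for availability: the data may reflect a half-finished update, which is usually acceptable for a state store that must keep serving after one agent thread fails.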
Key Source Files
- `src/resilience/mod.rs` — Module overview
- `src/resilience/types.rs` — Core types (AgentTask, TaskTrace, etc.)
- `src/resilience/database/state_database.rs` — SQLite CRUD operations
- `src/resilience/database/migration.rs` — Schema versioning
See Also
- Agent Runtime — Agent execution loop
- Task Scheduling — Task queue and scheduling
- Event System — Event bus architecture