Loopback Logging — MCAP Session Archive¶

Loopback logging captures a continuous stream of cycle records (observations, actions, guard results) in MCAP format for post-mortem analysis and live playback.

Overview¶

When a guard rejects or faults, the LoopbackWriter automatically captures a window of sensor observations (including camera frames) around the event, alongside the decision context. The window is set by window_sec and pre_event_sec (10 seconds each by default). All cycles (pass, clamp, reject, or fault) are recorded in a single time-indexed archive per runtime session.

Key features¶

Non-blocking: Records are written to an async queue; the main control loop adds < 20 µs overhead
Structured format: MCAP with JSON schemas for each channel type
Violation context: Ring buffer of observations around any rejection or fault (window set by window_sec and pre_event_sec)
Image capture: Optional on clamps; always on violations (if camera present)
Rotation: Automatic file rotation every 500 MB or 60 minutes
Compression: MCAP chunk-level zstd compression (~70% reduction)
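
The non-blocking behaviour described above can be pictured with a bounded queue that the control loop writes to without ever waiting. This is a minimal sketch, not the LoopbackWriter implementation: the class name `NonBlockingWriter` and its methods are hypothetical, and a plain list stands in for the MCAP file.

```python
import queue
import threading


class NonBlockingWriter:
    """Illustrative stand-in for a loopback writer (hypothetical class)."""

    def __init__(self, max_queue_depth: int = 256) -> None:
        self._queue: queue.Queue = queue.Queue(maxsize=max_queue_depth)
        self.dropped = 0
        self.written: list[dict] = []

    def submit(self, record: dict) -> bool:
        """Called from the control loop: never blocks, drops when full."""
        try:
            self._queue.put_nowait(record)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def close(self) -> None:
        """Ask the writer thread to stop after flushing queued records."""
        self._queue.put(None)

    def drain(self) -> None:
        """Writer-thread body: consume records until the None sentinel."""
        while True:
            record = self._queue.get()
            if record is None:
                break
            self.written.append(record)  # a real writer would serialise to MCAP


writer = NonBlockingWriter(max_queue_depth=2)
# The queue holds two records, so the third submission is dropped.
accepted = [writer.submit({"cycle_id": i}) for i in range(3)]

thread = threading.Thread(target=writer.drain)
thread.start()
writer.close()
thread.join()
```

The control loop only pays for `put_nowait`; all serialisation cost lands on the writer thread, at the price of dropped records when the queue is full.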

Configuration¶

Enable in the Stackfile under loopback:

loopback:
  backend: mcap                    # "mcap" (recommended) or "pickle"
  output_dir: /data/robot/sessions # Directory for session files
  window_sec: 10.0                 # Total duration (pre + post) around an event to keep images
  pre_event_sec: 10.0             # How many seconds of history to capture before a violation
  rotate_mb: 500.0                 # Rotate file every 500 MB
  rotate_minutes: 60.0             # Or rotate every 60 minutes
  max_queue_depth: 256             # Records queued before dropping
  capture_images_on_clamp: false   # Also capture images on CLAMP?
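
The two rotation settings can be read as "rotate on whichever limit is hit first". The sketch below illustrates that reading; the function name `should_rotate` is hypothetical, and it assumes 1 MB means 1024 × 1024 bytes, which the configuration does not specify.

```python
def should_rotate(
    size_bytes: int,
    age_sec: float,
    rotate_mb: float = 500.0,
    rotate_minutes: float = 60.0,
) -> bool:
    """Return True once either the size limit or the age limit is reached."""
    size_limit_bytes = rotate_mb * 1024 * 1024  # assumption: binary megabytes
    age_limit_sec = rotate_minutes * 60.0
    return size_bytes >= size_limit_bytes or age_sec >= age_limit_sec


small_and_young = should_rotate(size_bytes=100 * 1024 * 1024, age_sec=600.0)
large_file = should_rotate(size_bytes=500 * 1024 * 1024, age_sec=600.0)
old_file = should_rotate(size_bytes=100 * 1024 * 1024, age_sec=3600.0)
```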

Tuning¶

Parameter	Default	Guidance
`window_sec`	10.0	Total image sequence duration captured around an event.
`pre_event_sec`	10.0	Seconds of image history pulled from the ring buffer before the trigger.
`rotate_mb`	500.0	Reduce to 100–200 MB if disk space is tight.
`rotate_minutes`	60.0	Rotation period; 60 min is standard for debugging.
`capture_images_on_clamp`	false	Enable to debug motion limit triggers; can be disk-intensive.
`max_queue_depth`	256	Increase if you see queue-full warnings on a slow storage backend.
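
To see how window_sec and pre_event_sec interact, it helps to size a ring buffer by hand. This is a sketch under assumptions: a 50 Hz control rate, one buffered entry per cycle, and a post-event duration taken as window_sec minus pre_event_sec. The helper names are hypothetical.

```python
from collections import deque


def ring_capacity(pre_event_sec: float, control_hz: float) -> int:
    """Number of per-cycle entries needed to hold pre_event_sec of history."""
    return max(1, round(pre_event_sec * control_hz))


def post_event_sec(window_sec: float, pre_event_sec: float) -> float:
    """Remaining capture time after the trigger (assumed: window minus pre)."""
    return max(0.0, window_sec - pre_event_sec)


control_hz = 50.0  # assumed control rate
ring = deque(maxlen=ring_capacity(10.0, control_hz))

# Simulate 1200 cycles; the deque silently discards the oldest entries.
for cycle_id in range(1200):
    ring.append({"cycle_id": cycle_id})

pre_event = list(ring)  # snapshot taken when a violation fires at cycle 1199
remaining = post_event_sec(window_sec=10.0, pre_event_sec=10.0)
```

Under this reading, the defaults leave no post-event time, so raise window_sec above pre_event_sec if frames after the trigger matter.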

MCAP Channels & Schema¶

Each session is written as a single .mcap file with the following channels:

`/dam/cycle` — Control loop summary¶

Written every cycle. Summarises the decision and latency snapshot.

{
  "cycle_id": 42,
  "trace_id": "3fa80...",
  "timestamp": 1700000000.123,
  "active_task": "move_tcp",
  "active_boundaries": ["workspace_check", "speed_limit"],
  "active_cameras": ["top", "wrist"],
  "active_context": "normal",
  "context_severity": 0,
  "context_event": null,
  "has_violation": false,
  "has_clamp": false,
  "violated_layer_mask": 0,
  "clamped_layer_mask": 0,
  "failure_type": null,
  "failure_guard_names": [],
  "failure_tuple": null,
  "source_ms": 0.8,
  "policy_ms": 2.1,
  "guards_ms": 5.4,
  "sink_ms": 0.4,
  "total_ms": 8.7
}

Fields:

- has_violation: true if any guard rejected or faulted this cycle
- violated_layer_mask: bitmask (bit i set means Layer i had a violation); useful for quickly filtering the MCAP
- has_clamp: true if any guard clamped (and capture_images_on_clamp=true)
- clamped_layer_mask: bitmask for clamps, with the same bit layout
- failure_type: failure harvesting class: ood_only, guard_triggered, hardware_triggered, or null
- failure_tuple: structured evidence object used for paper/export analysis
- active_context: current runtime Context (normal, slow_down, emergency_stop, ...)
- context_event: transition payload on cycles that enter/exit/preempt/escalate a Context; null otherwise
- source_ms, policy_ms, guards_ms, sink_ms, total_ms: pipeline timings (source, policy, guards, sink, total) from MetricBus
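
Since violated_layer_mask and clamped_layer_mask use bit i for Layer i, a small helper can expand a mask into layer indices. The function name `layers_from_mask` is illustrative only.

```python
def layers_from_mask(mask: int) -> list[int]:
    """Expand a layer bitmask into the sorted list of layer indices it sets."""
    return [i for i in range(mask.bit_length()) if mask & (1 << i)]


no_layers = layers_from_mask(0)
l0_and_l3 = layers_from_mask(0b1001)  # bits 0 and 3 set
```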

`/dam/obs` — Sensor observation¶

Raw joint state, EE pose, force/torque. One message per cycle.

{
  "cycle_id": 42,
  "timestamp": 1700000000.123,
  "joint_positions": [0.0, 1.57, -1.57, 0.0, 0.0, 0.0],
  "joint_velocities": [0.01, 0.02, -0.01, 0.0, 0.0, 0.0],
  "end_effector_pose": [0.5, 0.3, 0.2, 1.0, 0.0, 0.0, 0.0],
  "force_torque": [5.0, -2.0, 20.0, 0.1, 0.05, 0.02]
}

`/dam/action` — Proposed and validated action¶

Command trajectory before and after guard processing.

{
  "cycle_id": 42,
  "timestamp": 1700000000.123,
  "proposal_positions": [0.0, 1.6, -1.5, 0.0, 0.0, 0.0],
  "proposal_velocities": [0.01, 0.02, -0.01, 0.0, 0.0, 0.0],
  "validated_positions": [0.0, 1.57, -1.57, 0.0, 0.0, 0.0],
  "validated_velocities": [0.005, 0.015, -0.01, 0.0, 0.0, 0.0],
  "was_clamped": false,
  "was_rejected": false,
  "fallback_triggered": null
}

Note: If was_rejected=true, then validated_* are null (action did not execute).
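
One way to use this channel is to measure how far the guards moved each joint command. The sketch below works on a decoded /dam/action payload and returns None for rejected actions, following the note above; the helper name `position_deltas` is hypothetical.

```python
from typing import Optional


def position_deltas(action: dict) -> Optional[list[float]]:
    """Per-joint difference (validated minus proposed), or None if rejected."""
    validated = action["validated_positions"]
    if action["was_rejected"] or validated is None:
        return None
    proposed = action["proposal_positions"]
    # Round to suppress floating-point noise in the subtraction.
    return [round(v - p, 6) for p, v in zip(proposed, validated)]


sample_action = {
    "proposal_positions": [0.0, 1.6, -1.5, 0.0, 0.0, 0.0],
    "validated_positions": [0.0, 1.57, -1.57, 0.0, 0.0, 0.0],
    "was_rejected": False,
}
rejected_action = {
    "proposal_positions": [0.0, 1.6, -1.5, 0.0, 0.0, 0.0],
    "validated_positions": None,
    "was_rejected": True,
}

deltas = position_deltas(sample_action)
no_deltas = position_deltas(rejected_action)
```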

`/dam/L0` … `/dam/L3` — Per-layer guard results¶

One message per guard per cycle (only if guard is active).

{
  "cycle_id": 42,
  "timestamp": 1700000000.123,
  "guard_name": "OODGuard",
  "event_class": "perception",
  "layer": 0,
  "decision": "PASS",
  "is_violation": false,
  "is_clamp": false,
  "reason": "",
  "latency_ms": 2.1
}

Decision values: PASS | CLAMP | REJECT | FAULT

Use is_violation=true to filter rejection-only analysis; use is_clamp=true to filter clamp-only.

Guard messages are the same GuardResult stream used by aggregation, risk logs, replay, and the console. Layer defaults map event_class as L0 perception, L1 motion, L2 task, and L3 hardware.
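
For a quick overview of which guards fire most often, decoded guard messages can be tallied by guard name and decision. This sketch assumes the payloads have already been parsed from JSON; the function name and the second and third sample records are made up for illustration.

```python
from collections import Counter


def tally_decisions(results: list[dict]) -> Counter:
    """Count guard results keyed by (guard_name, decision)."""
    counts: Counter = Counter()
    for result in results:
        counts[(result["guard_name"], result["decision"])] += 1
    return counts


# Illustrative records; only the fields used by the tally are included.
sample_results = [
    {"guard_name": "OODGuard", "layer": 0, "decision": "PASS"},
    {"guard_name": "OODGuard", "layer": 0, "decision": "REJECT"},
    {"guard_name": "OODGuard", "layer": 0, "decision": "PASS"},
]
counts = tally_decisions(sample_results)
```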

`/dam/context_events` — Runtime Context transitions¶

Sparse channel written only when the active fallback Context changes.

{
  "cycle_id": 1260,
  "event": "preempt",
  "ctx_name": "emergency_stop",
  "ctx_severity": 100,
  "from_ctx_name": "slow_down",
  "from_ctx_severity": 20,
  "trigger_guard": "motor_3",
  "trigger_reason": "temp 82°C > 80°C",
  "extra": {}
}

`/dam/images/{cam_name}` — Camera frame¶

JPEG-encoded image from sensor, captured only when has_violation=true (or on clamps if capture_images_on_clamp=true).

{
  "cycle_id": 42,
  "timestamp": 1700000000.123,
  "jpeg_base64": "..."
}

Frequency: May be sparse if violations are rare. Use cycle_id to correlate with /dam/obs.
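
Recovering an image from this channel amounts to base64-decoding the jpeg_base64 field. The sketch below does that and checks for the JPEG start-of-image marker; the helper name `decode_frame` is hypothetical and the sample payload is a stub, not a real frame.

```python
import base64


def decode_frame(message: dict) -> bytes:
    """Return the raw JPEG bytes stored in a /dam/images/* payload."""
    jpeg = base64.b64decode(message["jpeg_base64"], validate=True)
    if not jpeg.startswith(b"\xff\xd8"):
        raise ValueError("payload does not start with a JPEG SOI marker")
    return jpeg


# Stub payload: a JPEG-like header followed by filler bytes.
stub_bytes = b"\xff\xd8\xff\xe0demo"
sample_image = {
    "cycle_id": 42,
    "jpeg_base64": base64.b64encode(stub_bytes).decode("ascii"),
}
frame = decode_frame(sample_image)
```

The returned bytes can be written straight to a .jpg file or passed to an image library of your choice.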

`/dam/latency` — Per-layer latency aggregates¶

Aggregate latency per layer, written every cycle (requires MetricBus).

{
  "cycle_id": 42,
  "timestamp": 1700000000.123,
  "L0_ms": 2.1,
  "L1_ms": 0.5,
  "L2_ms": 1.9,
  "L3_ms": 0.3
}
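
Given a decoded /dam/latency payload, picking the slowest layer is a short reduction over the keys ending in _ms. The helper name `slowest_layer` is illustrative.

```python
def slowest_layer(latency: dict) -> tuple[str, float]:
    """Return the (field name, milliseconds) pair with the highest latency."""
    layers = {key: value for key, value in latency.items() if key.endswith("_ms")}
    name = max(layers, key=layers.get)
    return name, layers[name]


sample_latency = {
    "cycle_id": 42,
    "timestamp": 1700000000.123,
    "L0_ms": 2.1,
    "L1_ms": 0.5,
    "L2_ms": 1.9,
    "L3_ms": 0.3,
}
worst = slowest_layer(sample_latency)
```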

Session Metadata¶

Each .mcap file contains session-level metadata (written once at start):

{
  "session": {
    "session_id": "sess_20241210_143022_abc123",
    "dam_version": "1.5.0",
    "control_frequency_hz": 30.0,
    "python_version": "3.12.0",
    "stackfile_path": "/config/robot.yaml",
    "stackfile_hash": "sha256:abc123...",
    "timestamp": 1700000000.123
  }
}
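
The stackfile_hash field makes it possible to check whether a session was recorded with the Stackfile you have on disk. The sketch below assumes the value is the hex SHA-256 digest of the raw file bytes prefixed with "sha256:"; the exact hashing scheme is not specified here, so treat that as an assumption. The function name is hypothetical.

```python
import hashlib


def stackfile_matches(stackfile_bytes: bytes, recorded_hash: str) -> bool:
    """Compare file contents against a 'sha256:<hex>' value from metadata."""
    algorithm, _, expected = recorded_hash.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    actual = hashlib.sha256(stackfile_bytes).hexdigest()
    return actual == expected.lower()


# SHA-256 of empty input, used here only as a known test vector.
empty_digest = "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
same = stackfile_matches(b"", empty_digest)
different = stackfile_matches(b"loopback:\n", empty_digest)
```

In practice the bytes would come from something like `Path(stackfile_path).read_bytes()`.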

Reading & Playback¶

Python: mcap-reader¶

For physical-robot incident triage, start with the read-only project helper instead of launching a new control run:

.venv/bin/python scripts/mcap_triage.py --json
.venv/bin/python scripts/mcap_triage.py \
  --compare data/robot/sessions/session_known_good.mcap --json

The report selects the latest session by default, counts clamp/reject guard outcomes, identifies joints that received validated commands without observable response, and can compare initial poses against a known-good session. It performs no control action; its optional backend request is a read-only GET /api/control/status. Compare sessions only when robot, task, and calibration match.

pip install mcap

import json

from mcap.reader import make_reader

with open("session_20241210_143022.mcap", "rb") as f:
    reader = make_reader(f)

    # List all channels (the summary is None if the file has no summary section)
    summary = reader.get_summary()
    if summary is not None:
        for channel in summary.channels.values():
            print(f"{channel.topic}: {channel.message_encoding}")

    # Read violation cycles; iter_messages yields (schema, channel, message)
    for schema, channel, message in reader.iter_messages(topics=["/dam/cycle"]):
        cycle = json.loads(message.data)
        if cycle["has_violation"]:
            print(f"Violation at cycle {cycle['cycle_id']}: {cycle['violated_layer_mask']}")

Web Console¶

(Planned) Open any .mcap session in the MCAP Viewer page:

Navigate to Console → MCAP Sessions
Choose a session file
View:
Timeline of all cycles (pass / clamp / reject / fault)
Images side-by-side with guard decisions
Latency graph per layer
Export filtered subset as CSV / JSON

API Endpoints¶

`GET /mcap/sessions`¶

List all session files in output_dir.

curl http://localhost:8080/mcap/sessions

Response:

{
  "sessions": [
    {
      "session_id": "sess_20241210_143022_abc123",
      "filename": "session_20241210_143022_abc123.mcap",
      "size_mb": 123.4,
      "created_at": 1700000000.123,
      "rotated_at": 1700003600.456,
      "file_count": 3,
      "violation_count": 5,
      "clamp_count": 12
    }
  ]
}
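
Once the listing has been fetched and parsed, a common next step is ranking sessions by how many violations they contain. This sketch works on the decoded response body; the function name is hypothetical and the second session in the sample is invented for illustration.

```python
def sessions_by_violations(listing: dict) -> list[tuple[str, int]]:
    """Return (session_id, violation_count) pairs, most violations first."""
    sessions = listing.get("sessions", [])
    ranked = sorted(sessions, key=lambda s: s["violation_count"], reverse=True)
    return [(s["session_id"], s["violation_count"]) for s in ranked]


sample_listing = {
    "sessions": [
        {"session_id": "sess_20241210_143022_abc123", "violation_count": 5},
        {"session_id": "sess_example_other", "violation_count": 9},  # made up
    ]
}
ranked = sessions_by_violations(sample_listing)
```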

`GET /mcap/sessions/{session_id}`¶

Metadata for a specific session, parsed from the file headers without reading the full file.

curl http://localhost:8080/mcap/sessions/sess_20241210_143022_abc123

Response:

{
  "session_id": "sess_20241210_143022_abc123",
  "start_time": 1700000000.123,
  "end_time": 1700003600.789,
  "total_cycles": 180000,
  "violation_cycles": 5,
  "clamp_cycles": 12,
  "has_images": true,
  "compression": "zstd",
  "channels": ["/dam/cycle", "/dam/obs", "/dam/action", "/dam/L0", "/dam/L2", "/dam/images/camera0"]
}

`GET /mcap/sessions/{session_id}/download?start_cycle=0&end_cycle=1000&topics=/dam/cycle,/dam/obs`¶

Download a filtered subset of the session (useful for sharing specific incidents).

# Download cycles 0 through 100, only the /dam/cycle and /dam/obs channels
curl 'http://localhost:8080/mcap/sessions/sess_20241210_143022_abc123/download?start_cycle=0&end_cycle=100&topics=/dam/cycle,/dam/obs' \
  -o incident_subset.mcap
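
When scripting downloads, building the query string programmatically avoids quoting mistakes. The sketch below reproduces the URL shape from the curl example; the function name `download_url` is hypothetical, and the base URL is just the one used in the examples on this page.

```python
from urllib.parse import quote, urlencode


def download_url(
    base: str,
    session_id: str,
    start_cycle: int,
    end_cycle: int,
    topics: list[str],
) -> str:
    """Build a filtered-download URL with comma-separated topics."""
    query = urlencode(
        {
            "start_cycle": start_cycle,
            "end_cycle": end_cycle,
            "topics": ",".join(topics),
        },
        safe="/,",  # keep topic slashes and commas readable, as in the curl call
    )
    return f"{base}/mcap/sessions/{quote(session_id)}/download?{query}"


url = download_url(
    "http://localhost:8080",
    "sess_20241210_143022_abc123",
    start_cycle=0,
    end_cycle=100,
    topics=["/dam/cycle", "/dam/obs"],
)
```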

Troubleshooting¶

Queue full warnings¶

[WARNING] LoopbackWriter: queue full (256 slots), dropping cycle 742

Cause: The writer thread cannot keep up with the record rate. This occurs with:

- Slow storage (rotating disk, network mount)
- Large images (high resolution, many cameras)
- High guard count (many channels to write per cycle)

Fixes:

1. Increase max_queue_depth to 512 or 1024
2. Reduce window_sec (fewer images per violation)
3. Set capture_images_on_clamp: false
4. Confirm MCAP compression is active (it is automatic) and check rotate_mb
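
To judge how much a larger max_queue_depth buys, estimate how long the writer can stall before records start dropping. The sketch assumes one queued record per control cycle (the warning above reports dropped cycles) and a 50 Hz control rate; both are assumptions, and the function name is illustrative.

```python
def stall_budget_sec(max_queue_depth: int, control_hz: float) -> float:
    """Seconds of writer stall an empty queue can absorb before dropping."""
    return max_queue_depth / control_hz


default_budget = stall_budget_sec(256, control_hz=50.0)
raised_budget = stall_budget_sec(1024, control_hz=50.0)
```

At 50 Hz the default depth covers about 5 seconds of stalled storage and a depth of 1024 covers about 20 seconds. A writer that is slower than the record rate on average will still fill any queue eventually.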

High latency spikes¶

Check /dam/latency channel for which layer is slow. If it's a guard:

# `reader` is an open reader created with mcap.reader.make_reader
for schema, channel, message in reader.iter_messages(topics=["/dam/L2"]):
    guard = json.loads(message.data)
    if guard["latency_ms"] > 10:
        print(f"{guard['guard_name']}: {guard['latency_ms']:.2f} ms")

Common culprits:

- L0 (OOD): Model inference slow → reduce model size or batch smaller
- L1 (Preflight): Physics sim slow → increase time budget in stackfile
- L2 (Motion): Large numpy operations → profile with cProfile

Writer thread crashed¶

If the writer crashes, records stop being written but the main loop continues (graceful degradation). Check the logs:

grep "LoopbackWriter.*ERROR" /var/log/dam.log

Common causes:

- Permission denied on output_dir
- Disk full
- Corrupted MCAP schema cache

Recovery: Restart the runtime; a new session file will be created.

Best Practices¶

Size for your storage: Estimate 10–50 MB/hour at 50 Hz with 1–2 cameras (depends on image resolution).
For 8 hours: ~80–400 MB
For 24 hours: ~240–1200 MB
Adjust rotate_mb accordingly
Separate violation & operation logs:
Violations → low max_queue_depth (64–128), capture_images_on_clamp: true to catch context
Long-running ops → capture_on_violation=false + periodic manual snapshots

Offline analysis: Download sessions to a local machine and use mcap-cli or Python reader:

mcap cat session_20241210.mcap --topics /dam/cycle --json | jq 'select(.data.has_violation)'

Correlate with logs: Use trace_id (same in MCAP and risk-log API) to match console events with file records.