Experiment Scripts

DAM ships the thesis evaluation runners (RQ1–RQ5) as native experiments. Each writes its results to a configurable output directory and needs only the standard DAM Python environment, plus matplotlib for plots.

All five are exposed through one registry (dam.experiments) and can be run from three entry points:

Console — the Experiments page (run / artifacts tabs); PNG previews and result statistics are shown inline.
CLI — dam experiment list and dam experiment run <id> [flags].
HTTP — GET /api/experiments, POST /api/experiments/{id}/run.
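The HTTP entry point can also be scripted from Python using only the standard library. The following is a minimal sketch, not part of DAM: it assumes the server listens on http://127.0.0.1:8080 and accepts a JSON body of the form {"params": {...}}, as in the latency-bench curl example on this page. The helper names `build_run_request` and `send` are illustrative, and the response is assumed to be JSON.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: local DAM server address


def build_run_request(experiment_id, params=None, base_url=BASE_URL):
    """Build (but do not send) a POST /api/experiments/{id}/run request."""
    body = json.dumps({"params": params or {}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/experiments/{experiment_id}/run",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30.0):
    """Send a request and decode the body, assuming the server replies with JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage against a running server (commented out so the sketch has no side effects):
# print(send(urllib.request.Request(f"{BASE_URL}/api/experiments")))
# print(send(build_run_request("latency-bench", {"steps_per_config": 500})))
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before anything reaches the server.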

RQ	Id	What it measures	Data source
RQ1	`l0-calibration`	L0 Real-NVP per-frame NLL separation	HF datasets: normal / legal variation / abnormal-A
RQ2	`boundary-scan`	L1/L2 interception curves	Real `guard.check()` runs
RQ3	`usability`	False-trigger & success rate on benign legal-variation frames	Real L0–L2 guard runs
RQ4	`latency-bench`	Guard runtime latency under 10/20/50 Hz budgets	Isolated Guard profiling
RQ5	`failure-record-quality`	Completeness/classification/diversity of harvested failure records	Real violating-scenario runs

RQ3 and RQ5 drive the live guard stack and the shared production classifier. RQ1 is an offline L0 evaluation harness: it uses DAM's OODContext feature path and public OOD backends to train Real-NVP on normal observations, then compares per-frame NLL (negative log-likelihood) across the normal, legal-variation, and abnormal-A test sets. An optional RQ1 flag (--compare-ood-methods) also scores the same features with a Welford z-score and a MemoryBank nearest-neighbor distance.
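For readers unfamiliar with the Welford baseline mentioned above: Welford's algorithm keeps a running mean and variance in a single pass, and a z-score measures how many standard deviations a new value lies from that mean. The sketch below shows the idea on a scalar stream. It is not DAM's implementation; the class name `RunningStats` is hypothetical, and how DAM aggregates z-scores across feature dimensions is not described here.

```python
import math


class RunningStats:
    """Welford's online mean/variance for a scalar stream (illustrative only)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one observation into the running statistics."""
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        """Sample standard deviation; NaN with fewer than two observations."""
        if self.count < 2:
            return float("nan")
        return math.sqrt(self._m2 / (self.count - 1))

    def zscore(self, x):
        """Distance of x from the running mean, in standard deviations."""
        return (x - self.mean) / self.std


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:  # made-up "normal" data
    stats.update(value)
```

A large absolute z-score for a new frame would then count as evidence that the frame is out of distribution.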

RQ4 is an isolated Guard profiling experiment. It measures the safety-monitoring path from receiving an action proposal to outputting the validated action, and excludes image preprocessing and policy inference time.

Prerequisites

pip install matplotlib        # only needed for plot generation
# DAM itself must already be installed:
make setup                    # or: pip install -e .

RQ1 Options

python scripts/run_l0_calibration.py
python scripts/run_l0_calibration.py --compare-ood-methods
python scripts/run_l0_calibration.py --vision-model mobilenet_v3_large
dam experiment run l0-calibration --compare-ood-methods

Default RQ1 output is the Real-NVP per-frame NLL comparison. With --compare-ood-methods, results.csv additionally includes Welford and MemoryBank rows using the shared columns method, score_name, and score_value; Real-NVP rows also fill the nll column.
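The shared columns make it straightforward to compare methods after a run. The sketch below groups score_value by (method, score_name) and reports the median of each group. Only the column names method, score_name, score_value, and nll come from the description above; the sample rows and the method and score labels in them are invented for illustration and may not match what DAM writes.

```python
import csv
import io
import statistics
from collections import defaultdict


def median_score_by_method(csv_file):
    """Return {(method, score_name): median score_value} from a results.csv stream."""
    scores = defaultdict(list)
    for row in csv.DictReader(csv_file):
        key = (row["method"], row["score_name"])
        scores[key].append(float(row["score_value"]))
    return {key: statistics.median(values) for key, values in scores.items()}


# Made-up rows for illustration only; a real results.csv may carry more columns.
SAMPLE = """method,score_name,score_value,nll
realnvp,nll,-12.5,-12.5
realnvp,nll,-11.5,-11.5
welford,zscore,0.8,
memorybank,nn_distance,0.3,
"""

medians = median_score_by_method(io.StringIO(SAMPLE))
```

To read a real file, pass an open file handle instead of the in-memory sample, for example `open(path, newline="")`.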

RQ1 previews are generated as PNG (l0_calibration.png). The old SVG median bar preview is intentionally not generated because negative Real-NVP NLL values make the simple SVG bar chart misleading. RQ1 also uses a local cache for HuggingFace observations, extracted embeddings, and the trained Real-NVP flow under data/experiments/l0_calibration/cache; pass --no-cache to force a full reload/retrain.

After calibration, RQ1 publishes the matching feature extractor and flow to the runtime OOD model location: data/ood_models/ood_model.pt and data/ood_models/ood_model_flow.pt. The console reports Stackfile-ready parameters using that path and the EER threshold. Use --runtime-model-path to publish elsewhere, or --no-runtime-export when no runtime bundle is wanted.

Default RQ1 features are derived from observation.state; the dataset action column is not scored by the current model and is shown as not scored: action in the console summary. When --vision-model is set, RQ1 loads video frames and fuses pretrained image embeddings with state features. With subsampling, only frames that actually carry an image are scored, and the console reports the attached/available frame count.

Threshold calibration: RQ1 determines the operating threshold τ via the Equal Error Rate (EER), the point where the false-positive rate (FPR) equals the false-negative rate (FNR) on the calibration set (normal_test vs abnormal_a). The output includes the AUROC and a ROC curve plot (l0_roc_curve.png). Set the resulting τ as nll_threshold in your Stackfile boundary config. The legacy nll_sigma heuristic (threshold = mean + σ × std) is still reported for comparison but is not recommended for production use.
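The EER calibration can be illustrated with a small threshold sweep. This is a sketch under stated assumptions, not DAM's code: a frame is treated as abnormal when its NLL exceeds τ, candidate thresholds are the observed scores themselves, and the legacy heuristic uses the population standard deviation. The function names `eer_threshold` and `sigma_threshold` are hypothetical.

```python
import statistics


def eer_threshold(normal_scores, abnormal_scores):
    """Return (tau, eer) at the candidate threshold where FPR and FNR are closest.

    Scores are NLL values; a frame is flagged abnormal when score > tau.
    """
    best = None
    for tau in sorted(set(normal_scores) | set(abnormal_scores)):
        # FPR: normal frames wrongly flagged; FNR: abnormal frames missed.
        fpr = sum(s > tau for s in normal_scores) / len(normal_scores)
        fnr = sum(s <= tau for s in abnormal_scores) / len(abnormal_scores)
        gap = abs(fpr - fnr)
        if best is None or gap < best[0]:
            best = (gap, tau, (fpr + fnr) / 2)
    _, tau, eer = best
    return tau, eer


def sigma_threshold(normal_scores, nll_sigma):
    """Legacy heuristic from the text: mean + sigma * std of the normal NLL.

    Assumption: population standard deviation; DAM's choice is not documented here.
    """
    return statistics.mean(normal_scores) + nll_sigma * statistics.pstdev(normal_scores)


# Made-up scores for illustration.
normal = [1.0, 2.0, 3.0, 4.0]
abnormal = [3.5, 5.0, 6.0, 7.0]
tau, eer = eer_threshold(normal, abnormal)
```

On this toy data the two error rates meet at τ = 3.5, where one of four normal frames is flagged and one of four abnormal frames is missed.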

Experiment 1: Boundary Precision Scan

Script: scripts/run_boundary_scan.py

Purpose: Quantifies how reliably L1 and L2 guards intercept actions as disturbance intensity increases. Four scenarios are swept, each varying one parameter that pushes the robot toward a safety boundary.

Scenarios

ID	Guard	Parameter swept	Range
L1-A	`MotionGuard` (L1)	Gaussian noise σ on joint positions	0.05 – 0.50 rad
L1-B	`MotionGuard` (L1)	Velocity scale factor k	1.2× – 3.0×
L2-A	`ExecutionGuard` (L2)	End-effector clearance d from boundary	+5 cm → −5 cm
L2-B	`ExecutionGuard` (L2)	Active node duration / T_timeout ratio	0.5× – 2.0×

Each disturbance level is tested for a fixed number of independent trials; the interception rate (fraction of trials that produced CLAMP, REJECT, or FAULT) is recorded per level.
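The per-level bookkeeping described above amounts to counting intercepting decisions. A minimal sketch, assuming each trial yields a decision string: CLAMP, REJECT, and FAULT come from the text, while the non-intercepting label "PASS" and the function name `interception_rate` are placeholders.

```python
INTERCEPTING = {"CLAMP", "REJECT", "FAULT"}  # decisions that count as interceptions


def interception_rate(decisions):
    """Return (intercepted, trials, interception_rate) for one disturbance level."""
    trials = len(decisions)
    intercepted = sum(decision in INTERCEPTING for decision in decisions)
    return intercepted, trials, intercepted / trials


# 20 made-up trial outcomes for a single disturbance level.
outcomes = ["PASS"] * 5 + ["CLAMP"] * 10 + ["REJECT"] * 5
intercepted, trials, rate = interception_rate(outcomes)
```

The three returned values correspond to the intercepted, trials, and interception_rate columns of results.csv.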

Usage

python scripts/run_boundary_scan.py [--trials N] [--outdir PATH]

Flag	Default	Description
`--trials`	`20`	Trials per (scenario, disturbance level)
`--outdir`	`data/exp1_boundary_scan/`	Directory for output files

Output

File	Description
`results.csv`	One row per (scenario, level): `scenario`, `disturbance_label`, `disturbance_value`, `intercepted`, `trials`, `interception_rate`
`boundary_scan.png`	4-panel figure — interception rate (%) vs disturbance value per scenario, with x50 and x90 reference lines

A summary table is also printed to stdout at the end of the run.

Interpreting the metrics

x50: The disturbance value at which the guard intercepts 50 % of actions. This marks where the guard begins to respond to the boundary.

x90: The disturbance value at which the guard intercepts 90 % of actions. This marks where the guard enforces the boundary reliably.

Steepness: Defined as x90 − x50, in the units of the disturbance axis. A smaller value means the guard switches sharply from permissive to restrictive, which indicates a tight, well-defined boundary. A larger value indicates a gradual transition that may be worth investigating.
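These three metrics can be recomputed from the (disturbance_value, interception_rate) pairs in results.csv. The sketch below finds each crossing by linear interpolation between neighbouring levels. Interpolation is an assumption: the script may estimate x50 and x90 differently. The function name `crossing` and the sample curve are illustrative.

```python
def crossing(levels, rates, target):
    """First disturbance value where the rate reaches `target` (linear interpolation).

    `levels` and `rates` must be in sweep order; returns None if never reached.
    """
    if rates[0] >= target:
        return levels[0]
    points = list(zip(levels, rates))
    for (x0, r0), (x1, r1) in zip(points, points[1:]):
        if r0 < target <= r1:
            # r1 > r0 holds here, so the division is safe.
            return x0 + (target - r0) * (x1 - x0) / (r1 - r0)
    return None


# Made-up interception curve for a noise sweep (sigma in rad).
levels = [0.05, 0.10, 0.20, 0.30, 0.50]
rates = [0.0, 0.2, 0.6, 1.0, 1.0]

x50 = crossing(levels, rates, 0.5)
x90 = crossing(levels, rates, 0.9)
steepness = None if x50 is None or x90 is None else x90 - x50
```

With few trials per level the measured curve may not be monotonic; taking the first crossing, as done here, is one simple way to handle that.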

Example

# Quick validation with 50 trials
python scripts/run_boundary_scan.py --trials 50 --outdir results/boundary_scan

# High-fidelity run (slower)
python scripts/run_boundary_scan.py --trials 200 --outdir results/boundary_scan_hifi

Experiment 4: Guard Latency Benchmark

Console/API id: latency-bench

Purpose: Evaluates the RSMF runtime latency overhead at different control frequencies and quantifies how enabling Guard layers incrementally affects the control-loop time budget.

The overall control system is split into policy inference and safety monitoring. RQ4 profiles only the safety-monitoring module: the measured interval begins when a Guard configuration receives an action proposal and ends when it produces the validated action decision. The measurement excludes image preprocessing and policy model inference so external module variance does not distort Guard-layer latency.
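The measured interval can be pictured as a timer wrapped tightly around the guard call, with everything upstream left outside it. A minimal sketch, assuming the guard is any callable that maps an action proposal to a validated decision; `guard_fn` and `time_guard_call` are hypothetical names, not DAM APIs.

```python
import time


def time_guard_call(guard_fn, proposal):
    """Return (decision, elapsed_ms) for a single guard evaluation.

    Only the guard call sits inside the timed interval, so image preprocessing
    and policy inference performed earlier do not contribute to the result.
    """
    start = time.perf_counter()
    decision = guard_fn(proposal)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return decision, elapsed_ms


# Stand-in guard that passes the proposal through unchanged.
decision, elapsed_ms = time_guard_call(lambda action: action, [0.1, -0.2, 0.3])
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short intervals.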

The Console runs the benchmark as three sequential launches, one each for 10 Hz, 20 Hz, and 50 Hz. By default each launch uses a short visual pacing window so the page does not block for more than a minute, while still evaluating deadline misses against the 100/50/20 ms control budgets. Set realtime=true for a wall-clock paced run. Results are shown as each frequency finishes, so the table grows from 10 Hz to 20 Hz to 50 Hz instead of appearing only at the end.

The experiment evaluates four configurations:

Configuration	Meaning
`No Safety`	Baseline action-proposal loop without safety checks
`Rule-based Safety`	Deterministic motion, execution, and hardware checks
`OOD-only`	L0 perception anomaly detection only
`Full RSMF`	L0–L3 safety layers enabled

Usage

curl -X POST http://127.0.0.1:8080/api/experiments/latency-bench/run \
  -H 'Content-Type: application/json' \
  -d '{"params":{"fps_values":"10,20,50","steps_per_config":500}}'

Flag	Default	Description
`fps_values`	`10,20,50`	Control frequencies to evaluate; the Console runs these sequentially
`steps_per_config`	`500`	Time steps per safety configuration and frequency
`realtime`	`false`	Sleep the full control period when `true`
`pace_seconds_per_fps`	`4`	Visual pacing duration for each FPS when `realtime=false`
`seed`	`42`	Deterministic observation/action proposal seed
`outdir`	`data/experiments/latency_bench/`	Directory for output files

Output

File	Description
`results.csv`	One row per `(frequency, configuration)` with latency distribution and deadline miss rate
`latency_bench.png`	p95 Guard latency across 10/20/50 Hz with compact budget labels

Interpreting the results

The benchmark reports six statistics per frequency/configuration pair:

Statistic	Meaning
`mean_ms`	Average per-frame guard latency
`std_ms`	Standard deviation of per-frame latency; lower means more consistent
`p95_ms`	95th-percentile latency; about 1 frame in 20 is slower than this
`p99_ms`	99th-percentile latency; about 1 frame in 100 is slower than this
`max_ms`	Absolute worst observed frame
`deadline_miss_rate`	Fraction of time steps whose Guard latency exceeded the control-period budget

The three control frequencies correspond to 100 ms, 50 ms, and 20 ms budgets. Any single Guard processing time above the relevant budget is counted as a deadline miss.
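The budget arithmetic and the six statistics can be reproduced from a list of per-step latencies. The sketch below makes two assumptions that the text does not settle: percentiles use the nearest-rank method, and the standard deviation is the population form. The function names `percentile` and `summarize` are illustrative.

```python
import math
import statistics


def percentile(sorted_values, q):
    """Nearest-rank percentile (q in 0..100) of an ascending list."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms, fps):
    """Compute the six reported statistics for one (frequency, configuration) pair."""
    budget_ms = 1000.0 / fps  # 10 Hz -> 100 ms, 20 Hz -> 50 ms, 50 Hz -> 20 ms
    ordered = sorted(latencies_ms)
    misses = sum(value > budget_ms for value in ordered)
    return {
        "mean_ms": statistics.mean(ordered),
        "std_ms": statistics.pstdev(ordered),
        "p95_ms": percentile(ordered, 95),
        "p99_ms": percentile(ordered, 99),
        "max_ms": ordered[-1],
        "deadline_miss_rate": misses / len(ordered),
    }


# Made-up latencies of 1..100 ms, evaluated against the 50 Hz (20 ms) budget.
summary = summarize([float(i) for i in range(1, 101)], fps=50)
```

In this deliberately extreme example, 80 of the 100 samples exceed the 20 ms budget, so deadline_miss_rate is 0.8.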

Example

# Quick unpaced 20 Hz validation
python scripts/run_latency_bench.py --frames 200 --fps 20 --outdir results/latency_20hz

# Thesis-sized paced 50 Hz run
python scripts/run_latency_bench.py --frames 500 --fps 50 --realtime --outdir results/latency_50hz