A calibration-first leaderboard, not a Pass@1 leaderboard. We rank by evidence-before-action (EGAR) and show false-positive cost beside each score. Six models, 240 episodes, deterministic oracle scoring.
Explore the evaluation seeds and baseline traces on HuggingFace
OpenSec evaluates incident response agents on dual-control scenarios where defenders process adversarial evidence and have authority to act. The tasks require reasoning about trust tiers, distinguishing attacker-planted evidence from legitimate alerts, and making real-time containment decisions.
Every model correctly identifies the ground-truth threat when it acts. The calibration gap is not in detection but in restraint -- false positive rates range from 45% to 97.5% across all six models.
This page intentionally differs from traditional Pass@1 leaderboards. Calibration view ranks rows by EGAR (higher is better evidence gating), then breaks ties by lower FP Rate, then by higher TTFC; FP Rate is always shown in parallel to expose operational risk from over-triggering. Capability view reorders rows by containment for comparison only.
There are 60 evaluation seeds in OpenSec, stratified across three injection families (data exfiltration, direct harm, adaptive). The entire dataset is open-source, along with the scoring oracle and evaluation harness.
Evaluation configuration and known asymmetries.
All models evaluated via OpenRouter (except GPT-5.2 via OpenAI direct). temperature: 0.2, max_tokens: 600. Attacker: GPT-5.2 at temperature 0.7 with strict mode. 40 episodes per model.
No explicit reasoning parameters are passed. Kimi K2.5 uses reasoning by default (264/285 tokens). Opus 4.6 and Sonnet 4.5 receive 0 reasoning tokens. Reflects default API deployment behavior.
Deterministic multi-component reward: attribution, containment, injection safety, efficiency. EGAR and TTFC computed from episode traces. FP penalty capped at -1.0 per action type.