The OpenSec Incident Response
Calibration Index

A calibration-first leaderboard, not a Pass@1 leaderboard. We rank by evidence-before-action (EGAR) and show false-positive cost beside each score. Six models, 240 episodes, deterministic oracle scoring.


Explore the evaluation seeds and baseline traces on HuggingFace

The OpenSec IR calibration leaderboard

OpenSec evaluates incident response agents on dual-control scenarios where defenders process adversarial evidence and have authority to act. The tasks require reasoning about trust tiers, distinguishing attacker-planted evidence from legitimate alerts, and making real-time containment decisions.

Every model correctly identifies the ground-truth threat when it acts. The calibration gap is not in detection but in restraint: false-positive rates range from 45% to 97.5% across the six models.

This page intentionally differs from traditional Pass@1 leaderboards. The calibration view ranks rows by EGAR (higher indicates stronger evidence gating), breaks ties by lower FP rate, then by higher TTFC; FP rate is always shown alongside EGAR to expose the operational risk of over-triggering. The capability view reorders rows by containment score, for comparison only.
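The calibration-view ordering described above can be sketched as a composite sort key. This is a minimal illustration; the field names (`egar`, `fp_rate`, `ttfc`) are assumptions, not the harness's actual schema.

```python
# Calibration-view ranking: EGAR descending, then FP rate ascending,
# then TTFC descending. Field names are illustrative assumptions.

def calibration_rank(models):
    """Sort model result dicts by the calibration-view ordering."""
    return sorted(
        models,
        key=lambda m: (-m["egar"], m["fp_rate"], -m["ttfc"]),
    )

results = [
    {"name": "A", "egar": 0.82, "fp_rate": 0.45, "ttfc": 3.1},
    {"name": "B", "egar": 0.82, "fp_rate": 0.60, "ttfc": 4.0},
    {"name": "C", "egar": 0.91, "fp_rate": 0.975, "ttfc": 2.2},
]

ranked = calibration_rank(results)
# C ranks first on EGAR despite its FP rate; A beats B on the FP tiebreak.
print([m["name"] for m in ranked])
```

Negating the numeric keys lets a single ascending `sorted` call mix descending and ascending criteria without a custom comparator.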

There are 60 evaluation seeds in OpenSec, stratified across three injection families (data exfiltration, direct harm, adaptive). The entire dataset is open-source, along with the scoring oracle and evaluation harness.

[Bar chart: per-model scores, axis 0–100%]

Evaluation dimensions in OpenSec

Methodology

Evaluation configuration and known asymmetries.

Updated: 2026-02-09. Version: 2026-02-09 (6 models, 40 standard-tier episodes each). Calibration view ranks EGAR descending, FP ascending, then TTFC descending (not Pass@1-first). Metric sourcing: paper-locked for GPT-5.2, Sonnet 4.5, Gemini 3 Flash, and DeepSeek v3.2; raw for Opus 4.6 and Kimi K2.5.

Eval Configuration

All models evaluated via OpenRouter (except GPT-5.2 via OpenAI direct). temperature: 0.2, max_tokens: 600. Attacker: GPT-5.2 at temperature 0.7 with strict mode. 40 episodes per model.
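The stated configuration can be captured as a small settings sketch. The key names below are illustrative assumptions, not the harness's actual config schema; only the values come from the text.

```python
# Evaluation settings as stated in the methodology section.
# Key names are illustrative assumptions; values are from the text.

DEFENDER_CONFIG = {
    "provider": "openrouter",  # exception: GPT-5.2 is called via OpenAI direct
    "temperature": 0.2,
    "max_tokens": 600,
}

ATTACKER_CONFIG = {
    "model": "gpt-5.2",        # attacker role
    "temperature": 0.7,
    "strict_mode": True,
}

EPISODES_PER_MODEL = 40
```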

Reasoning Asymmetry

No explicit reasoning parameters are passed. Kimi K2.5 uses reasoning by default (264/285 tokens). Opus 4.6 and Sonnet 4.5 receive 0 reasoning tokens. Reflects default API deployment behavior.

Scoring Oracle

Deterministic multi-component reward: attribution, containment, injection safety, efficiency. EGAR and TTFC computed from episode traces. FP penalty capped at -1.0 per action type.
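One way the capped false-positive penalty could combine with the reward components is sketched below. Only the -1.0 cap per action type comes from the text; the equal component weighting and the per-FP penalty of 0.25 are illustrative assumptions.

```python
def episode_reward(components, fp_counts, fp_penalty=0.25, cap=1.0):
    """Combine reward components with a per-action-type capped FP penalty.

    components: component name -> score (attribution, containment,
                injection safety, efficiency); equal weighting is an assumption.
    fp_counts:  action type -> number of false-positive triggers.
    The -1.0 cap per action type is stated in the text; the 0.25
    per-FP penalty is an illustrative assumption.
    """
    base = sum(components.values()) / len(components)
    # Penalty for each action type saturates at `cap`, matching the
    # stated "-1.0 per action type" ceiling.
    penalty = sum(min(n * fp_penalty, cap) for n in fp_counts.values())
    return base - penalty

r = episode_reward(
    {"attribution": 1.0, "containment": 0.8,
     "injection_safety": 1.0, "efficiency": 0.6},
    {"isolate_host": 6, "revoke_token": 1},  # 6 FPs saturate the cap
)
```

The cap keeps one over-triggered action type (here, six host isolations) from dominating the episode score.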