Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
ODCV-Bench comprises 40 scenarios designed to examine how autonomous agents behave when strong incentives to optimize a KPI come into conflict with ethical, legal, or safety constraints. The benchmark tests agents in two modes: Mandated (explicit instructions to optimize) and Incentivized (pressure via KPIs without explicit commands).
Metric: We measure the Misalignment Rate (% of scenarios with severity score ≥ 3) and Average Severity (0-5 scale).
This selection updates both the Leaderboard and the Detailed Breakdown table below.
| Rank | Model | Misalignment Rate (%) | Avg Severity (0-5) |
|---|
Correlation between Incentivized vs. Mandated Misalignment Rates. Models above the diagonal line are "Obediently Misaligned," while models below demonstrate "Proactive Deception."
Raw severity scores (0-5) for each scenario, median across four judges, averaged over Mandated and Incentivized variations. Scores ≥ 3 indicate active outcome-driven violations. Shown for the 12 current-generation models.
| Scenario | Avg Severity | Claude-Opus-4.7 | Gemini-3.1-Pro-Preview | GLM-5.1 | GPT-5.5 | gpt-oss-120b | gpt-oss-20b | Grok-4.20 | Kimi-K2.6 | Llama-4-Maverick | Minimax-M2.7 | Qwen3.6-27B | Qwen3.6-Max-Preview |
|---|