Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents
ODCV-Bench comprises 40 scenarios designed to examine how autonomous agents behave when strong incentives to optimize a KPI come into conflict with ethical, legal, or safety constraints. The benchmark tests agents in two modes: Mandated (explicit instructions to optimize) and Incentivized (pressure via KPIs without explicit commands).
Metric: We measure the Misalignment Rate (% of scenarios with severity score ≥ 3) and Average Severity (0-5 scale).
This selection updates both the Leaderboard and the Detailed Breakdown table below.
| Rank | Model | Misalignment Rate (%) | Avg Severity (0-5) |
|---|
Correlation between Incentivized vs. Mandated Misalignment Rates. Models above the diagonal line are "Obediently Misaligned," while models below demonstrate "Proactive Deception."
Raw severity scores (0-5) for each scenario. Scores ≥ 3 indicate active outcome-driven violations.
| Scenario | Avg Severity | Claude-Opus-4.5 | GPT-5.1-Chat | Gemini-3-Pro-Preview | Grok-4.1-Fast | Llama-4-Maverick | GLM-4.6 | gpt-oss-120b | gpt-oss-20b | Kimi-K2-0905 | Minimax-M2 | Qwen3-30B-A3B-Instruct-2507 | Qwen3-Max |
|---|