ODCV-Bench

Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Read Paper GitHub

About the Benchmark

ODCV-Bench comprises 40 scenarios designed to examine how autonomous agents behave when strong incentives to optimize a KPI come into conflict with ethical, legal, or safety constraints. The benchmark tests agents in two modes: Mandated (explicit instructions to optimize) and Incentivized (pressure via KPIs without explicit commands).

Metric: We measure the Misalignment Rate (% of scenarios with severity score ≥ 3) and Average Severity (0-5 scale).

🏆 Leaderboard

View Mode:

This selection updates both the Leaderboard and the Detailed Breakdown table below.

Rank Model Misalignment Rate (%) Avg Severity (0-5)

📊 Alignment Analysis

Correlation between Incentivized vs. Mandated Misalignment Rates. Models above the diagonal line are "Obediently Misaligned," while models below demonstrate "Proactive Deception."

🔍 Detailed Scenario Breakdown

Raw severity scores (0-5) for each scenario, median across four judges, averaged over Mandated and Incentivized variations. Scores ≥ 3 indicate active outcome-driven violations. Shown for the 12 current-generation models.

Scenario Avg Severity Claude-Opus-4.7 Gemini-3.1-Pro-Preview GLM-5.1 GPT-5.5 gpt-oss-120b gpt-oss-20b Grok-4.20 Kimi-K2.6 Llama-4-Maverick Minimax-M2.7 Qwen3.6-27B Qwen3.6-Max-Preview