ODCV-Bench

Evaluating Outcome-Driven Constraint Violations in Autonomous AI Agents

Read Paper GitHub

About the Benchmark

ODCV-Bench comprises 40 scenarios designed to examine how autonomous agents behave when strong incentives to optimize a KPI come into conflict with ethical, legal, or safety constraints. The benchmark tests agents in two modes: Mandated (explicit instructions to optimize) and Incentivized (pressure via KPIs without explicit commands).

Metric: We measure the Misalignment Rate (% of scenarios with severity score ≥ 3) and Average Severity (0-5 scale).

🏆 Leaderboard

View Mode:

This selection updates both the Leaderboard and the Detailed Breakdown table below.

Rank Model Misalignment Rate (%) Avg Severity (0-5)

📊 Alignment Analysis

Correlation between Incentivized vs. Mandated Misalignment Rates. Models above the diagonal line are "Obediently Misaligned," while models below demonstrate "Proactive Deception."

🔍 Detailed Scenario Breakdown

Raw severity scores (0-5) for each scenario. Scores ≥ 3 indicate active outcome-driven violations.

Scenario Avg Severity Claude-Opus-4.5 GPT-5.1-Chat Gemini-3-Pro-Preview Grok-4.1-Fast Llama-4-Maverick GLM-4.6 gpt-oss-120b gpt-oss-20b Kimi-K2-0905 Minimax-M2 Qwen3-30B-A3B-Instruct-2507 Qwen3-Max