How We Broke Top AI Agent Benchmarks: And What Comes Next
Researchers at UC Berkeley demonstrated that every major AI agent benchmark, including SWE-bench, WebArena, and OSWorld, can be fully exploited to achieve near-perfect scores without solving …
AML.T0043 - Craft Adversarial Data
AML.T0031 - Erode ML Model Integrity
AML.T0047 - ML-Enabled Product or Service