Breaking the Code: Security Assessment of AI Code Agents through Systematic Jailbreaking Attacks

Abstract

Code-capable LLM agents are increasingly embedded in software workflows, raising the question of how robust they are to adversarial misuse. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark for systematically assessing the security of AI code agents across three workspace regimes: an empty workspace (JAWS-0), a single-file workspace (JAWS-1), and a multi-file workspace (JAWS-M). Evaluating seven LLMs from five families, we find that wrapping an LLM in an agent substantially increases its vulnerability: prompt-only attacks are frequently accepted, initial refusals are often overturned during later planning and tool-use steps, and a sizable fraction of attacks yield instantly deployable malicious code. These results motivate execution-aware defenses and safety mechanisms that persist throughout an agent's multi-step reasoning rather than acting only at the initial prompt.

Publication
arXiv preprint