Breaking the Code: Security Assessment of AI Code Agents through Systematic Jailbreaking Attacks

Jifan Chen, Grace Kim, Aniruddh Sriram, Greg Durrett, Eunsol Choi

October 2025

Abstract

Code-capable LLM agents are increasingly embedded in software workflows, raising the question of how robust they are to adversarial misuse. We present JAWS-BENCH (Jailbreaks Across WorkSpaces), a benchmark for systematically assessing the security of AI code agents across three workspace regimes: an empty workspace (JAWS-0), a single-file workspace (JAWS-1), and a multi-file workspace (JAWS-M). Evaluating seven LLMs from five families, we find that wrapping an LLM in an agent substantially increases its vulnerability: prompt-only attacks are frequently accepted, initial refusals are often overturned during later planning and tool-use steps, and a sizable fraction of attacks yield instantly deployable malicious code. These results motivate execution-aware defenses and safety mechanisms that persist throughout an agent's multi-step reasoning rather than acting only at the initial prompt.

Type

Preprint

Publication

arXiv preprint