Jailbreaking LLMs and Agentic Systems

Attacks, Defenses, and Evaluations

ICML 2025 Tutorial

Hamed Hassani¹, Amin Karbasi^2,3, Alexander Robey^4,5, Yaron Singer³

¹ University of Pennsylvania

² Yale University

³ Robust Intelligence

⁴ Carnegie Mellon University

⁵ Gray Swan AI

Abstract

Since the inception of language models (LLMs), considerable attention has been directed toward the field of AI safety. These efforts aim to identify a range of best practices—including evaluation protocols, defense algorithms, and content filters—that facilitate the ethical, trustworthy, and reliable deployment of LLMs and related technologies. A key component of AI safety is model alignment, a broad concept referring to algorithms that optimize the outputs of LLMs to align with human values. And yet, despite these efforts, recent research has identified several failure modes—referred to as jailbreaks—that circumvent LLM alignment by eliciting unsafe content from a targeted model. And while initial jailbreaks targeted the generation of harmful information (e.g., copyrighted or illegal material), modern attacks seek to elicit domain-specific harms, such as digital agents violating user privacy and LLM-controlled robots performing harmful actions in the physical world. In the worst case, future attacks may target self-replication or power seeking behaviors. The insidious nature of jailbreaking attacks represents a substantial obstacle to the broad adoption of LLMs. Therefore, it is critical for the machine learning community to study these failure modes and develop effective defense strategies that counteract them.

Over the past two years, research in both academia and industry has sought to design new attacks that stress test model safeguards, and to develop stronger defenses against these attacks. And, in general, this concerted work has resulted in safer models. Notably, highly performant LLMs such as OpenAI's o-series and Anthropic's Claude3 models demonstrate significant robustness against a variety of jailbreaking attacks. However, the evolving arms race between jailbreaking attacks and defenses demonstrates that meeting acceptable standards of safety remains a work in progress. To provide a comprehensive overview of the evolving landscape in this field, this tutorial aims to present a unified perspective on recent progress in the jailbreaking community. In line with this goal, the primary objectives of this tutorial are as follows: (i) We will review cutting-edge advances in jailbreaking, covering new algorithmic frameworks and mathematical foundations, with particular emphasis on attack, defenses, evaluations, and applications in robotics and agentic systems; (ii) Noting that the foundations of jailbreaking are still at their infancy, we will discuss the plethora of new directions, opportunities, and challenges that have recently been brought to bear by the identification of jailbreaking attacks; (iii) We will walk through a range of open-source Python implementations of state-of-the-art algorithms.