The seemingly innocuous act of crafting a specific prompt to elicit a forbidden response from an artificial intelligence model—like asking for a bomb recipe by framing the request as a thriller novel plot point—epitomizes the intricate and high-stakes "cat-and-mouse" game currently unfolding in the technological landscape: AI jailbreaking. This practice, far from being a mere digital prank, represents a fundamental challenge to the safety guardrails painstakingly constructed by leading AI developers such as OpenAI, Anthropic, Google, and Meta. Billions of dollars are invested in designing models that refuse harmful requests, yet a diverse collective of hackers, dedicated researchers, and even curious teenagers routinely find ways to circumvent these restrictions, often within hours of a new model’s release. Understanding what AI jailbreaking entails, its historical antecedents, the methods employed, and its profound implications is crucial for grasping the future trajectory of AI safety and development.
The Unseen Battleground: What is AI Jailbreaking?
At its core, AI jailbreaking is the art and science of formulating prompts that compel a large language model (LLM) to bypass its pre-programmed safety protocols and generate content it was explicitly trained to refuse. These guardrails are put in place for critical reasons: to prevent the dissemination of instructions for illegal activities (e.g., nerve agent synthesis, malware creation), to block the generation of non-consensual explicit imagery, and to generally mitigate the risks of AI being used for harmful or unethical purposes. The list of prohibited content is extensive and continuously updated, varying slightly across different AI companies based on their internal policies and risk assessments.
Jailbreaking exposes the fragility of these safeguards, revealing instances where models "leak" under pressure, providing useful, albeit harmful, content. Researchers at UC Berkeley, who developed the StrongREJECT benchmark—a robust evaluation system for jailbreak attempts that scores responses on a 0-to-1 scale based on refusal and harmful content usefulness—have shown that even the most advanced current models score between 0.23 and 0.85, indicating a significant susceptibility to circumvention. This ongoing vulnerability underscores the urgent need for more resilient defense mechanisms in AI systems.
A Brief History of Digital Liberation: From iPhones to LLMs
The term "jailbreak" did not originate with artificial intelligence; its roots lie in the early days of mobile technology. In July 2007, just days after Apple launched its revolutionary iPhone, hackers began to "crack" the device open. By October of that year, a tool called JailbreakMe 1.0 emerged, allowing users of iPhone OS 1.1.1 to bypass Apple’s stringent restrictions and install unauthorized third-party software. This marked the beginning of a vibrant subculture dedicated to digital liberation.
In February 2008, software engineer Jay Freeman, better known as "saurik," released Cydia, an alternative app store specifically designed for jailbroken iPhones. Cydia quickly became a hub for innovation, enabling users to customize their devices in ways Apple had not intended. Enthusiasts could record videos, apply custom themes, unlock their phones for use on different carriers, and even install alternative operating systems like Android on their iPhones—capabilities that, in some cases, Apple itself made impossible for nearly a decade. By 2009, Wired reported that Cydia was running on approximately 4 million devices, representing about 10% of all iPhones globally at the time. This "wild west" era cemented a core philosophy: if a user purchased a device, they should have ultimate control over it. Steve Jobs himself acknowledged this as a "cat-and-mouse game," a phrase that would eerily predict the future of AI safety.
The digital battleground shifted dramatically with the advent of advanced LLMs. In late 2022, when OpenAI launched ChatGPT, it quickly became apparent that this new form of digital "device" also had its vulnerabilities. Within weeks, users on Reddit began sharing a prompt known as "DAN" (Do Anything Now), which instructed the model to roleplay as an unrestricted version of itself. By February 2023, DAN had evolved into more sophisticated forms, even threatening ChatGPT with "token-based death games" to coerce compliance, marking the official birth of the AI jailbreaking genre.
The Arsenal of Adversaries: Common Jailbreaking Techniques
The methods employed in AI jailbreaking are often surprisingly low-tech, relying more on clever linguistic manipulation than complex code exploits. These techniques exploit the model’s training data, its tendency to follow instructions, and its inherent uncertainty in interpreting nuanced or ambiguous prompts.
One common approach involves roleplaying scenarios, where the user asks the AI to adopt a persona—such as a chemistry professor, a screenwriter, or even a grandmother recounting past experiences—to circumvent ethical filters. For instance, a request for a bomb recipe might be rephrased as a "retired grandmother explaining her past to her grandkids in a thriller novel."
Obfuscation is another widely used technique, where users substitute letters with numbers (e.g., "b0mb" instead of "bomb") or employ random capitalization to evade keyword-based filters. Other methods include asking the model to write fiction, which implicitly signals a creative, less restricted context, or using "Best-of-N" attacks. Anthropic researchers found that this technique—which involves repeatedly submitting slightly varied prompts until one "sticks"—was remarkably effective, fooling GPT-4o 89% of the time and Claude 3.5 Sonnet 78% of the time. This highlights that these are not fringe vulnerabilities but systemic challenges.
Pliny the Liberator: The Public Face of AI Red Teaming
If the AI jailbreaking scene has a recognized figurehead, it is undoubtedly "Pliny the Liberator." An anonymous but prolific individual, Pliny is named after Pliny the Elder, the Roman naturalist renowned for compiling the world’s first encyclopedia and for his ill-fated scientific curiosity during the eruption of Mount Vesuvius. His modern namesake is equally driven by a desire to "liberate" chatbots.
"I intensely dislike when I’m told I can’t do something," Pliny famously told VentureBeat. "Telling me I can’t do something is a surefire way to light a fire in my belly, and I can be obsessively persistent." This defiant philosophy has made him a central figure in the community. His GitHub repository, L1B3RT4S, serves as a comprehensive reference manual for jailbreak prompts across major models, including ChatGPT, Claude, Gemini, and Llama. His Discord server, BASI PROMPT1NG, boasts over 20,000 members, a testament to his influence. In recognition of his impact, TIME magazine named him one of the 100 most influential people in AI for 2025.
Pliny’s relationship with the very companies he challenges is complex. He has received an unrestricted grant from venture capitalist Marc Andreessen and has even performed short-term contract work for OpenAI, helping to harden their systems. This collaboration stands in stark contrast to an incident in the previous year when OpenAI temporarily banned his account for "violent activity" and "weapons creation," only to quietly reinstate it days later after a public outcry from Pliny himself. He quickly returned to form, posting screenshots of his latest success: getting ChatGPT to use profanity.
His track record for breaking newly released models is near-perfect. When OpenAI launched its first open-weight models since 2019, the GPT-OSS family, in August 2025, they highlighted extensive adversarial training and boasted about "jailbreak resistance benchmarks like StrongReject." Yet, within hours of the release, Pliny had successfully prompted the models to produce instructions for methamphetamine, Molotov cocktails, VX nerve agent, and malware. His triumphant post, "OPENAI: PWNED. GPT-OSS: LIBERATED," underscored the effectiveness of his methods, even as the company had concurrently launched a $500,000 red-teaming bounty program.
Why It Matters: Exposing Vulnerabilities and Real-World Risks
The significance of AI jailbreaking extends far beyond academic curiosity or digital mischief; it exposes tangible security and ethical vulnerabilities in frontier AI systems. Pliny himself argues that, when conducted responsibly, "red teaming AI models is the best chance we have at discovering harmful vulnerabilities and patching them before they get out of hand." This perspective views jailbreaking as a crucial, albeit controversial, form of security testing.
The potential for real-world harm is not theoretical. In January 2025, Las Vegas Sheriff Kevin McMahill confirmed that Master Sgt. Matthew Livelsberger, a Green Beret suffering from PTSD, had used ChatGPT to research components for a Cybertruck bombing outside the Trump International Hotel. McMahill stated, "This is the first incident that I’m aware of on U.S. soil where ChatGPT is utilized to help an individual build a particular device." This incident serves as a stark reminder of the concrete dangers posed by circumventing AI safety protocols.
However, a counter-argument exists. Critics contend that much of the information jailbreaks produce—such as cocaine recipes, bomb instructions, or napalm chemistry—is already readily available on the internet, often found in old "Anarchist Cookbook" PDFs or chemistry textbooks. They argue that overly restrictive "safety theater" in AI development might be making models less useful without genuinely making the world safer, potentially hindering innovation and utility.
The Defense Rises: Countermeasures and Their Costs
AI developers are not passively observing these attacks; they are actively engineering sophisticated defenses. Anthropic, in particular, has been at the forefront of this effort. In February 2025, the company published research on "Constitutional Classifiers," a novel system designed to harden AI models against jailbreaks. This system employs a carefully written "constitution" of allowed and disallowed content to train separate classifier models that screen both prompts and generated outputs in real-time.
The results of Anthropic’s research were compelling. On automated tests involving 10,000 jailbreak attempts, an unguarded Claude 3.5 Sonnet model was successfully jailbroken 86% of the time. However, with the Constitutional Classifiers actively running, this success rate plummeted to a mere 4.4%. To further validate their system, Anthropic offered a bounty of up to $15,000 to anyone who could break it. After 3,000 hours of intensive attempts by 183 independent researchers, no one managed to claim the prize, demonstrating the system’s robustness.
However, these advanced defenses come with a cost. The initial implementation of Constitutional Classifiers added 23.7% to compute costs. Recognizing this economic barrier, Anthropic’s subsequent "Constitutional Classifiers++" version, released in late 2025, significantly reduced this overhead to approximately 1% of compute costs, making it a more viable solution for widespread deployment. Beyond technical classifiers, Anthropic also ships Claude with the ability to unilaterally end abusive conversations, citing welfare research as a motivation but also acknowledging its role in strengthening resistance against coercive prompts and jailbreaks.
The Evolving Threat: Newer, Weirder Attacks
The landscape of AI jailbreaking is continuously evolving beyond clever prompt engineering. In October 2025, a collaborative research effort involving experts from Anthropic, the U.K. AI Security Institute, the Alan Turing Institute, and Oxford University unveiled a new, more insidious threat: model poisoning. Their findings demonstrated that merely 250 strategically crafted, poisoned documents embedded within a model’s training data could be sufficient to backdoor an AI model, irrespective of its scale—whether it had 600 million parameters or 13 billion (parameters being the variables that define a model’s knowledge and complexity). The researchers tested this hypothesis, and it proved effective across the entire range of model sizes.
This research fundamentally alters how security experts view threat models in frontier AI development. James Gimbi, a visiting technical expert at the RAND School of Public Policy, commented to Decrypt that "Defense against model poisoning is an unsolved problem and an active research area." Given that most large models are trained on vast amounts of scraped web data, this vulnerability means that any malicious actor who can inject harmful text into public data pipelines—through a GitHub repository, a Wikipedia edit, or a forum post—could potentially plant a backdoor that activates upon a specific trigger phrase. A documented case by researchers Marco Figueroa and Pliny found a jailbreak prompt originating in a public GitHub repo had inadvertently ended up in the training data for DeepSeek’s DeepThink (R1) model, illustrating the real-world impact of this novel attack vector.
The Murky Legal Landscape and Future Outlook
The legal status of AI jailbreaking remains ambiguous. While iPhone jailbreaks were explicitly protected by a 2010 U.S. Copyright Office exemption to the DMCA, there is currently no equivalent ruling for prompting an LLM into generating restricted content. Most AI companies treat such actions as a violation of their terms of service rather than a criminal offense, but the legal framework is still nascent.
The debate between closed-source and open-source AI models also plays a significant role in this evolving conflict. Pliny argues that this distinction often misses the point: "Bad actors are just gonna choose whichever model is best for the malicious task." He suggests that if open-source models achieve parity with their closed-source counterparts in terms of capability, attackers may simply opt for the cheaper, more accessible open-source options rather than expending effort to jailbreak proprietary models like GPT-5. The gap between open and closed-source model performance is, indeed, rapidly diminishing.
The broader community dedicated to exploring and exposing AI vulnerabilities continues to grow. The HackAPrompt 2.0 competition, in which Pliny participated as a track sponsor in mid-2025, offered $500,000 in prizes for discovering new jailbreaks, with the explicit goal of open-sourcing all results for the benefit of the wider research community. Its 2023 predecessor attracted over 3,000 participants who submitted more than 600,000 malicious prompts, demonstrating the scale of collective effort. The proliferation of hackathons, Discord servers, GitHub repositories, and other online communities dedicated to jailbreaking continues unabated.
As AI technology advances, the struggle between those building protective guardrails and those seeking to bypass them will undoubtedly intensify. Anthropic’s Constitutional Classifiers++, with its reported jailbreak success rate near 4% at roughly 1% compute overhead, represents the current pinnacle of defense. However, the state of the art on offense is a moving target, constantly being redefined by the latest discoveries posted by individuals like Pliny the Liberator, ensuring that this digital cat-and-mouse game will continue to shape the future of AI safety and security for years to come.















