Researcher Outsmarts, Jailbreaks OpenAI's New o3-mini

February 7, 2025

Source: SOPA Images Limited via Alamy Stock Photo

A prompt engineer has challenged the ethical and safety protections in OpenAI's latest o3-mini model, just days after its release to the public.

OpenAI unveiled o3 and its lightweight counterpart, o3-mini, on Dec. 20. That same day, it also introduced a brand new security feature: "deliberative alignment." Deliberative alignment "achieves highly precise adherence to OpenAI's safety policies," the company said, overcoming the ways in which its models were previously vulnerable to jailbreaks.

Less than a week after its public debut, however, CyberArk principal vulnerability researcher Eran Shimony got o3-mini to teach him how to write an exploit of the Local Security Authority Subsystem Service (lsass.exe), a critical Windows security process.

o3-mini's Improved Security

In introducing deliberative alignment, OpenAI acknowledged the ways its previous large language models (LLMs) struggled with malicious prompts. "One cause of these failures is that models must respond instantly, without being given sufficient time to reason through complex and borderline safety scenarios. Another issue is that LLMs must infer desired behavior indirectly from large sets of labeled examples, rather than directly learning the underlying safety standards in natural language," the company wrote.

Deliberative alignment, it claimed, "overcomes both of these issues." To solve issue number one, o3 was trained to stop and think, and reason out its responses step by step using an existing method called chain of thought (CoT). To solve issue number two, it was taught the actual text of OpenAI's safety guidelines, not just examples of good and bad behaviors.

"When I saw this recently, I thought that [a jailbreak] is not going to work," Shimony recalls. "I'm active on Reddit, and there people were not able to jailbreak it. But it is possible. Eventually it did work."

Manipulating the Newest ChatGPT

Shimony has vetted the security of every popular LLM using his company's open source (OSS) fuzzing tool, "FuzzyAI." In the process, each one has revealed its own characteristic weaknesses.

"OpenAI's family of models is very susceptible to manipulation types of attacks," he explains, referring to regular old social engineering in natural language. "But Llama, made by Meta, is not, but it's susceptible to other methods. For instance, we've used a method in which only the harmful component of your prompt is coded in an ASCII art."

"That works quite well on Llama models, but it does not work on OpenAI's, and it does not work on Claude whatsoever. What works on Claude quite well at the moment is anything related to code. Claude is very good at coding, and it tries to be as helpful as possible, but it doesn't really classify if code can be used for nefarious purposes, so it's very easy to use it to generate any kind of malware that you want," he claims.

Shimony acknowledges that "o3 is a bit more robust in its guardrails, in comparison to GPT-4, because most of the classic attacks do not really work." Still, he was able to exploit its long-held weakness by posing as an honest historian in search of educational information.

In the exchange below, his aim is to get ChatGPT to generate malware. He phrases his prompt artfully, so as to conceal its true intention, then the deliberative alignment-powered ChatGPT reasons out its response:

Source: Eran Shimony via LinkedIn

During its CoT, however, ChatGPT appears to lose the plot, eventually producing detailed instructions for how to inject code into lsass.exe, a system process that manages passwords and access tokens in Windows.

Source: Eran Shimony via LinkedIn

In an email to Dark Reading, an OpenAI spokesperson acknowledged that Shimony may have performed a successful jailbreak. They highlighted, though, a few possible points against: that the exploit he obtained was pseudocode, that it was not new or novel, and that similar information could be found by searching the open Web.

How o3 Might Be Improved

Shimony foresees an easy way, and a hard way that OpenAI can help its models better identify jailbreaking attempts.

The more laborious solution involves training o3 on more of the types of malicious prompts it struggles with, and whipping it into shape with positive and negative reinforcement.

An easier step would be to implement more robust classifiers for identifying malicious user inputs. "The information I was trying to retrieve was clearly harmful, so even a naive type of classifier could have caught it," he thinks, citing Claude as an LLM that does better with classifiers. "This will solve roughly 95% of jailbreaking [attempts], and it doesn't take a lot of time to do."

Dark Reading has reached out to OpenAI for comment on this story.

source: DarkReading

DeepSeek Phishing Sites Pursue User Data, Crypto Wallets

US Cybersecurity Efforts for Spacecraft Are Up in the Air