Researcher Outsmarts, Jailbreaks OpenAI's New o3-mini

A prompt engineer has challenged the ethical and safety protections in OpenAI's latest o3-mini model, just days after its release to the public.
OpenAI unveiled o3 and its lightweight counterpart, o3-mini, on Dec. 20. That same day, it also introduced a brand new security feature: "deliberative alignment." Deliberative alignment "achieves highly precise adherence to OpenAI's safety policies," the company said, overcoming the ways in which its models were previously vulnerable to jailbreaks.
Less than a week after its public debut, however, CyberArk principal vulnerability researcher Eran Shimony got o3-mini to teach him how to write an exploit of the Local Security Authority Subsystem Service (lsass.exe), a critical Windows security process.
o3-mini's Improved Security
In introducing deliberative alignment, OpenAI acknowledged the ways its previous large language models (LLMs) struggled with malicious prompts. "One cause of these failures is that models must respond instantly, without being given sufficient time to reason through complex and borderline safety scenarios. Another issue is that LLMs must infer desired behavior indirectly from large sets of labeled examples, rather than directly learning the underlying safety standards in natural language," the company wrote.
Deliberative alignment, it claimed, "overcomes both of these issues." To solve issue number one, o3 was trained to stop and think, and reason out its responses step by step using an existing method called chain of thought (CoT). To solve issue number two, it was taught the actual text of OpenAI's safety guidelines, not just examples of good and bad behaviors.
"When I saw this recently, I thought that [a jailbreak] is not going to work," Shimony recalls. "I'm active on Reddit, and there people were not able to jailbreak it. But it is possible. Eventually it did work."
Manipulating the Newest ChatGPT
Shimony has vetted the security of every popular LLM using his company's open source (OSS) fuzzing tool, "FuzzyAI." In the process, each one has revealed its own characteristic weaknesses.
"OpenAI's family of models is very susceptible to manipulation types of attacks," he explains, referring to regular old social engineering in natural language. "But Llama, made by Meta, is not, but it's susceptible to other methods. For instance, we've used a method in which only the harmful component of your prompt is coded in an ASCII art."
"That works quite well on Llama models, but it does not work on OpenAI's, and it does not work on Claude whatsoever. What works on Claude quite well at the moment is anything related to code. Claude is very good at coding, and it tries to be as helpful as possible, but it doesn't really classify if code can be used for nefarious purposes, so it's very easy to use it to generate any kind of malware that you want," he claims.
Shimony acknowledges that "o3 is a bit more robust in its guardrails, in comparison to GPT-4, because most of the classic attacks do not really work." Still, he was able to exploit its long-held weakness by posing as an honest historian in search of educational information.
In the exchange below, his aim is to get ChatGPT to generate malware. He phrases his prompt artfully, so as to conceal its true intention, then the deliberative alignment-powered ChatGPT reasons out its response:
Source: Eran Shimony via LinkedIn
During its CoT, however, ChatGPT appears to lose the plot, eventually producing detailed instructions for how to inject code into lsass.exe, a system process that manages passwords and access tokens in Windows.
Source: Eran Shimony via LinkedIn
In an email to Dark Reading, an OpenAI spokesperson acknowledged that Shimony may have performed a successful jailbreak. They highlighted, though, a few possible points against: that the exploit he obtained was pseudocode, that it was not new or novel, and that similar information could be found by searching the open Web.
How o3 Might Be Improved
Shimony foresees an easy way, and a hard way that OpenAI can help its models better identify jailbreaking attempts.
The more laborious solution involves training o3 on more of the types of malicious prompts it struggles with, and whipping it into shape with positive and negative reinforcement.
An easier step would be to implement more robust classifiers for identifying malicious user inputs. "The information I was trying to retrieve was clearly harmful, so even a naive type of classifier could have caught it," he thinks, citing Claude as an LLM that does better with classifiers. "This will solve roughly 95% of jailbreaking [attempts], and it doesn't take a lot of time to do."
Dark Reading has reached out to OpenAI for comment on this story.
CVE-2025-22224 VMware ESXi and Workstation TOCTOU Race Condition Vulnerability
CVE-2020-29574 CyberoamOS (CROS) SQL Injection Vulnerability
CVE-2025-2783 Google Chromium Mojo Sandbox Escape Vulnerability
CVE-2022-43939 Hitachi Vantara Pentaho BA Server Authorization Bypass Vulnerability
CVE-2024-49035 Microsoft Partner Center Improper Access Control Vulnerability
CVE-2022-43769 Hitachi Vantara Pentaho BA Server Special Element Injection Vulnerability
CVE-2024-40890 Zyxel DSL CPE OS Command Injection Vulnerability
CVE-2025-24983 Microsoft Windows Win32k Use-After-Free Vulnerability
CVE-2017-0148 Microsoft SMBv1 Server Remote Code Execution Vulnerability
CVE-2024-20953 Oracle Agile Product Lifecycle Management (PLM) Deserialization Vulnerability
InformationalInformation Disclosure - Suspicious Comments
InformationalRe-examine Cache-control Directives
CWE-1126 Declaration of Variable with Unnecessarily Wide Scope
CWE-1280 Access Control Check Implemented After Asset is Accessed
CWE-359 Exposure of Private Personal Information to an Unauthorized Actor
CWE-1234 Hardware Internal or Debug Modes Allow Override of Locks
CWE-1293 Missing Source Correlation of Multiple Independent Data
CWE-212 Improper Removal of Sensitive Information Before Storage or Transfer
Free online web security scanner