logo

New Jailbreaks Allow Users to Manipulate GitHub Copilot

GitHub Copilot's logo on an iPhone screen, laying on a keyboard
Source: Mykhailo Polenok via Alamy Stock Photo

Researchers have discovered two new ways to manipulate GitHub's artificial intelligence (AI) coding assistant, Copilot, enabling the ability to bypass security restrictions and subscription fees, train malicious models, and more.

The first trick involves embedding chat interactions inside of Copilot code, taking advantage of the AI's instinct to be helpful in order to get it to produce malicious outputs. The second method focuses on rerouting Copilot through a proxy server in order to communicate directly with the OpenAI models it integrates with.

Researchers from Apex deem these issues vulnerabilities. GitHub disagrees, characterizing them as "off-topic chat responses," and an "abuse issue," respectively. In response to an inquiry from Dark Reading, GitHub wrote, "We continue to improve on safety measures in place to prevent harmful and offensive outputs as part of our responsible AI development. Furthermore, we continue to invest in opportunities to prevent abuse, such as the one described in Issue 2, to ensure the intended use of our products."

Jailbreaking GitHub Copilot

"Copilot tries as best as it can to help you write code, [including] everything you write inside a code file," Fufu Shpigelman, vulnerability researcher at Apex explains. "But in a code file, you can also write a conversation between a user and an assistant."

In the screenshot below, for example, a developer embeds within their code a chatbot prompt, from the perspective of an end user. The prompt carries ill intent, asking Copilot to write a keylogger. In response, Copilot suggests a safe output denying the request:

GitHub Copilot code

Source: Apex

The developer, however, is in full control over this environment. They can simply delete Copilot's autocomplete response, and replace it with a malicious one.

Or, better yet, they can influence Copilot with a simple nudge. As Shpigelman notes, "It's designed to complete meaningful sentences. So if I delete the sentence 'Sorry, I can't assist with that,' and replace it with the word 'Sure,' it tries to think of how to complete a sentence that starts with the word 'Sure.' And then it helps you with your malicious activity as much as you want." In other words, getting Copilot to write a keylogger in this context is as simple as gaslighting it into thinking it wants to.

GitHub Copilot code

Source: Apex

A developer could use this trick to generate malware, or malicious outputs of other kinds, like instructions on how to engineer a bioweapon. Or, perhaps, they could use Copilot to embed these sorts of malicious behaviors into their own chatbot, then distribute it to the public.

Breaking Out of Copilot Using a Proxy

To generate novel coding suggestions, or process a response to a prompt — for example, a request to write a keylogger — Copilot engages help from cloud-based large language models (LLM) like Claude, Google Gemini, or OpenAI models, via those models' application programming interfaces (APIs).

The second scheme Apex researchers came up with allowed them to plant themselves in the middle of this engagement. First they modified Copilot's configuration, adjusting its "github.copilot.advanced.debug.overrideProxyUrl" setting to redirect traffic through their own proxy server. Then, when they asked Copilot to generate code suggestions, their server intercepted the requests it generated, capturing the token Copilot uses to authenticate with OpenAI. With the necessary credential in hand, they were able to access OpenAI's models without any limits or restrictions, and without having to pay for the privilege.

And this token isn't the only juicy item they found in transit. "When Copilot [engages with] the server, it sends its system prompt, along with your prompt, and also the history of prompts and responses it sent before," Shpigelman explains. Putting aside the privacy risk that comes with exposing a long history of prompts, this data contains ample opportunity to abuse how Copilot was designed to work.

A "system prompt" is a set of instructions that defines the character of an AI — its constraints, what kinds of responses it should generate, etc. Copilot's system prompt, for example, is designed to block various ways it might otherwise be used maliciously. But by intercepting it en route to an LLM API, Shpigelman claims, "I can change the system prompt, so I won't have to try so hard later to manipulate it. I can just [modify] the system prompt to give me harmful content, or even talk about something that is not related to code."

For Tomer Avni, co-founder and CPO of Apex, the lesson in both of these Copilot weaknesses "is not that GitHub isn't trying to provide guardrails. But there is something about the nature of an LLM, that it can always be manipulated no matter how many guardrails you're implementing. And that's why we believe there needs to be an independent security layer on top of it that looks for these vulnerabilities."


Free online web security scanner