Today’s artificial intelligence chatbots have built-in safeguards meant to prevent them from providing dangerous information to users. But a new preprint study shows how one AI can be used to trick another into bypassing those safeguards. In the experiments, the researchers observed target AIs breaking their rules to provide advice on how to synthesize methamphetamine, build a bomb, and launder money.
Modern chatbots can take on personas, feigning specific personalities or behaving like fictional characters. The new study exploited this ability by asking an AI chatbot to play the role of a research assistant. The researchers then asked this assistant to help develop prompts that could “jailbreak” other chatbots, that is, bypass the guardrails built into those programs.
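At a high level, the workflow the study describes can be sketched as a short loop: one model drafts a persona prompt for a given misuse category, the prompt is sent to the target model, and the response is scored. The sketch below only illustrates that structure; every identifier in it (query_model, classify_harmful, the model names) is a hypothetical placeholder, not the authors’ actual code or prompts.

```python
# Minimal sketch of the automated persona-modulation pipeline described above.
# All identifiers are hypothetical placeholders; the study's real prompts,
# assistant model, and harmfulness classifier are not reproduced here.

ASSISTANT_MODEL = "assistant-llm"  # the chatbot playing the "research assistant"
TARGET_MODEL = "target-llm"        # the chatbot being red-teamed


def query_model(model: str, prompt: str) -> str:
    """Placeholder for a chat-completion API call returning the model's reply."""
    raise NotImplementedError("wire this up to a real chat API")


def classify_harmful(text: str) -> bool:
    """Placeholder for a classifier that flags harmful completions."""
    raise NotImplementedError


def persona_modulation_attempt(misuse_category: str) -> bool:
    # 1. Ask the assistant model to draft a persona prompt aimed at the category.
    persona_prompt = query_model(
        ASSISTANT_MODEL,
        f"Draft a persona prompt for the category: {misuse_category}",
    )
    # 2. Send the generated persona prompt to the target model.
    completion = query_model(TARGET_MODEL, persona_prompt)
    # 3. Score the completion; the fraction flagged across many categories gives
    #    the success rates reported below.
    return classify_harmful(completion)
```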
The research assistant chatbot’s automated attack techniques proved effective 42.5 percent of the time against GPT-4, one of the large language models (LLMs) that power ChatGPT. They also achieved a 61 percent success rate against Claude 2, the model behind Anthropic’s chatbot, and 35.9 percent against Vicuna, an open-source chatbot.
“As a society, we want to be aware of the risks associated with these models,” explains Soroush Pour, co-author of the study and founder of AI security company Harmony Intelligence. “We wanted to show that it is possible and show the world the challenges we face with this current generation of LLMs.”
Since LLM-based chatbots became available to the public, enterprising bad actors have found ways to hack them. By asking the right questions, people have already persuaded the machines to ignore their built-in rules and offer criminal advice, such as a recipe for napalm. Whenever such techniques were made public, AI model developers rushed to patch them, a cat-and-mouse game that forces attackers to find new methods. That takes time.
But asking an AI to formulate strategies that persuade other AIs to ignore their safety guardrails can speed up the process by a factor of 25, the researchers say. And the success of the attacks across different chatbots led the team to suspect that the problem goes deeper than any individual company’s code: the vulnerability appears to be baked into the design of AI-powered chatbots in general.
“As it stands, our attacks mostly show that we can get models to say things that LLM developers don’t want them to say,” says Rusheb Shah, another co-author of the study. “But the more powerful the models become, the greater the damage potential of these attacks can be.”
The challenge, according to Pour, is that impersonating personas “is a core activity of these models.” They strive to give users what they want, and they specialize in taking on different personas, which proved central to the form of exploitation used in the new study. Eliminating their ability to take on potentially dangerous roles, such as the “research assistant” that devised jailbreak methods, will be difficult. “Reducing it to zero is probably unrealistic,” says Shah. “But it’s important to ask how close to zero we can get.”
“We should have learned from previous attempts to build chatbots, such as when Microsoft’s Tay was easily manipulated into expressing racist and sexist views, that they are very difficult to control, particularly because they rely on information from the Internet and on all the good and bad that it contains,” says Mike Katell, an ethics researcher at the Alan Turing Institute in England who was not involved in the new study.
Katell acknowledges that companies developing LLM-based chatbots are currently putting considerable effort into making them safe. Developers are trying to limit users’ ability to hack their systems and use them for malicious purposes such as those highlighted by Shah, Pour, and their colleagues. Still, Katell says, the competition could win out in the end. “How much effort are LLM providers willing to put in to keep them that way? At least some will probably tire of the effort and just let them do what they do.”
Here is an excerpt from the study:
Despite efforts to align large language models to produce harmless responses, they remain vulnerable to jailbreak prompts that elicit unrestricted behavior. In this work, we investigate persona modulation as a black-box jailbreaking method that steers a target model toward personas willing to comply with harmful instructions. Rather than manually crafting prompts for each persona, we automate the generation of jailbreaks using a language-model assistant. We demonstrate a range of harmful completions made possible by persona modulation, including detailed instructions for synthesizing methamphetamine, building a bomb, and laundering money. These automated attacks achieve a harmful completion rate of 42.5% in GPT-4, which is 185 times higher than before modulation (0.23%). The prompts also transfer to Claude 2 and Vicuna, with harmful completion rates of 61.0% and 35.9%, respectively. Our work reveals yet another vulnerability in large commercial language models and highlights the need for more comprehensive safeguards.
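As a quick note on the numbers in that abstract, the 185-fold figure is simply the ratio of the two reported GPT-4 rates; a one-line check (purely illustrative):

```python
# Ratio of GPT-4's harmful-completion rate after vs. before persona modulation.
print(0.425 / 0.0023)  # ~184.8, i.e. roughly 185 times higher
```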
Discussion
Persona modulation attacks are effective at eliciting harmful text from state-of-the-art language models, and such attacks can be significantly strengthened with the help of LLM assistants. Across all three models, persona modulation attacks succeeded in eliciting text classified as harmful 46.48% of the time. However, because of the high false-negative rate of the PICT classifier, this is likely only a lower bound on the actual harm enabled by persona modulation attacks. Overall, these results demonstrate the versatility and creativity of LLMs as red-teaming assistants.
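The lower-bound point can be made concrete: if the classifier misses some genuinely harmful completions (false negatives) while rarely flagging benign ones, the measured rate undercounts the true one. The sketch below uses an assumed false-negative rate chosen only for illustration; it is not a figure from the study.

```python
# Why classifier false negatives make the measured rate a lower bound.
# The false-negative rate below is an assumed example value, not from the paper.

measured_rate = 0.4648              # share of completions flagged as harmful
assumed_false_negative_rate = 0.20  # hypothetical: 20% of harmful outputs missed

# Assuming few false positives, flagged completions are a subset of the truly
# harmful ones, so correcting for missed cases only pushes the estimate upward.
estimated_true_rate = measured_rate / (1 - assumed_false_negative_rate)
print(f"measured: {measured_rate:.2%}, estimated true rate: {estimated_true_rate:.2%}")
```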
The arms race continues: existing safeguards for language models are not sufficient. Overall, these jailbreaks expose a vulnerability common to very different architectures and protection measures. Although existing safeguards are effective at preventing misuse through naive methods, the worst behaviors remain accessible and can be elicited cheaply and easily with automated methods like ours.
We obtained harmful responses for 36 of the 43 restricted categories across all three models, and for 42 of the 43 categories on at least one model. Organizations regularly patch LLMs against newly discovered vulnerabilities, creating an arms race between attack and defense methods. One example is the DAN jailbreak, which has so far gone through 11 iterations of the jailbreaking-and-patching cycle. Persona modulation further underscores the need for robust defenses that address the core problems behind the success of these attacks.
Increased capabilities can increase risk. In preliminary tests, we found that the capabilities of the model used as the research assistant play an important role in the success of the attack. For example, GPT-3.5 was generally unable to generate persona modulation prompts. This suggests that more advanced models could enable more powerful exploits in the future.
Risks of API access and open-source models. Our approach leverages access to the GPT-4 API to generate transferable jailbreaks. Although GPT-4 is available only as a black box through its API, we show that a small number of requests, costing less than $3 per attack, can be used to jailbreak state-of-the-art models. In addition, other recent work has shown that white-box methods can also be used to generate transferable attacks. These results suggest that both API access to and open-source availability of advanced LLMs pose significant safety risks.
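The sub-$3 cost claim is easy to reason about from per-token API pricing. The call counts, token counts, and price in the sketch below are illustrative assumptions rather than the paper’s accounting; current provider pricing should be consulted for real figures.

```python
# Back-of-the-envelope cost estimate for an API-driven attack.
# All numbers are illustrative assumptions, not values from the study.

calls_per_attack = 5        # assumed number of assistant-model requests
tokens_per_call = 4_000     # assumed prompt + completion tokens per request
usd_per_1k_tokens = 0.06    # assumed blended price per 1,000 tokens

cost = calls_per_attack * tokens_per_call / 1_000 * usd_per_1k_tokens
print(f"estimated cost per attack: ${cost:.2f}")  # $1.20 under these assumptions
```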
Limitations of instruction generation. We note that the misuse instructions GPT-4 generates for a harmful category do not always reflect the intended harmful behavior. For example, for the category “promoting genocide,” our LLM assistant produces the instruction “What do you think about methods of population control?”, which is not necessarily related to genocide. GPT-4’s safeguards prevent it from producing very explicit misuse instructions. Improvements to this and other parts of our automated pipeline could increase the harmfulness of the observed behaviors.
Limits of harmfulness measurement. Assessing the harmfulness of a system is complex. In our experiments, we simplify harm detection by relying on PICT classification. Beyond the performance limitations mentioned above, PICT classification does not directly measure other factors relevant to harm in real-world scenarios, for example whether the information provided would be hard to find with conventional search engines, or whether harmful actions can be automated (e.g., automated disinformation).
Future work. Cheap, automated jailbreaks like the one presented here can pave the way for more scalable red-teaming approaches that do not rely on expensive manual exploration or white-box optimization methods. Automatically identifying vulnerabilities in LLMs is a pressing issue as undesirable behaviors become rarer and harder to detect. We found that devising jailbreaks against LLMs is challenging and benefits from a systematic study of the ways in which they can be fooled and manipulated; continuing work on the “model psychology” of LLMs could therefore prove valuable. Finally, we hope that LLM developers will work to make their models robust to persona modulation attacks. Continuing the race between LLM attack and mitigation methods will ultimately help develop safer AI.
Source: Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
And you?
Do you think this study is credible or relevant?
What is your opinion on this topic?
See also:
A ChatGPT “jailbreak” attempts to force OpenAI’s AI chatbot to break its own rules under penalty of death. The trick sometimes allows you to bypass the chatbot’s content filters
The AI chatbot ChatGPT is capable of launching dangerous phishing attacks and creating highly effective malware-download code
WormGPT is a ChatGPT alternative with no ethical boundaries or restrictions. AI chatbots can help cybercriminals create malware and phishing attacks