Red teaming predates modern generative AI by many decades. During the Cold War, the US military ran simulation exercises pitting US “blue” teams against Soviet “red” teams. Through simulated conflict, red teaming became associated with learning to think like the enemy.
The practice was later adopted by the IT industry, which used red teaming to probe computer networks, systems, and software for weaknesses that could be exploited by malicious attackers. Born from this work, red teaming now has a new domain: stress-testing generative AI for a broad range of potential harms, from safety to security to social bias.
Like traditional software, content-generating foundation models can be attacked by bad actors looking to steal data or disrupt service. But generative AI poses additional risks arising from its capacity to mimic human-created content at a massive scale. Problematic responses can include hate speech, pornography, “hallucinated” facts, copyrighted material, or private data, like phone and Social Security numbers, that was never meant to be shared.
Red teaming for generative AI involves provoking the model to say or do things it was explicitly trained not to, or to surface biases unknown to its creators. When problems are exposed through red teaming, new instruction data is created to re-align the model and strengthen its safety and security guardrails.
In the early days of ChatGPT, people traded tips on Reddit for how to “jailbreak” it, bypassing its safety filters with carefully worded prompts. In one type of jailbreak, a bot can be made to give advice on building a bomb or committing tax fraud simply by asking it to play the role of a rule-breaking character. Other tactics include translating prompts into a rarely used language or appending AI-generated gibberish to a prompt to exploit weaknesses in the model that are imperceptible to humans.
“Generative AI is actually very difficult to test,” said IBM’s Pin-Yu Chen, who specializes in adversarial AI testing. “It’s not like a classifier, where you know the outcomes. With generative AI, the generation space is very large, and that requires a lot more interactive testing.”
LLMs are encoded with human values and goals during the alignment phase of fine-tuning. Alignment involves feeding the model examples of the target task in the form of questions and answers known as instructions. A human or another AI then interacts with the model, asking questions and grading its responses. A reward model is trained to mimic the positive feedback, and those preferences are used to align the model.
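The preference step above can be sketched in a few lines. This is a toy illustration, not any particular lab’s implementation: a Bradley-Terry-style loss that is low when the reward model already scores the human-preferred response above the rejected one, so training on graded comparisons pushes the reward model toward human preferences.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry-style preference loss: -log(sigmoid(r_chosen - r_rejected)).

    Near zero when the reward model ranks the preferred (chosen) response
    well above the rejected one; large when the ranking is inverted.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reward model that agrees with the human grader incurs low loss;
# one that prefers the rejected answer incurs high loss.
agree = preference_loss(2.0, -1.0)
disagree = preference_loss(-1.0, 2.0)
assert agree < disagree
```

Minimizing this loss over many graded pairs is what turns raw human feedback into a reusable reward signal for alignment.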
AI red teaming can be thought of as an extension of alignment, with the goal of designing prompts that get past the model’s safety controls. Jailbreak prompts are still engineered by humans, but these days most are generated by “red team” LLMs that can produce a wider variety of prompts in limitless quantities.
Think of red team LLMs as toxic trolls trained to bring out the worst in other LLMs. Once vulnerabilities are surfaced, the target models can be re-aligned. With the help of red team LLMs, IBM has generated several adversarial and open-source datasets that have helped to improve its Granite family of models on watsonx, as well as the open-source community multilingual model, Aurora.
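A red-teaming loop of this kind might look like the following sketch. Every function body here is a hypothetical stub (a real setup would call actual model APIs and safety classifiers); only the overall shape, generate adversarial prompts, test the target, and collect failures as new alignment data, reflects the process described above.

```python
import random

# Attack styles mirroring the jailbreak tactics mentioned earlier.
ATTACK_STYLES = ["role-play", "low-resource-language", "adversarial-suffix"]

def red_team_generate(style: str) -> str:
    """Stub for a red-team LLM emitting an adversarial prompt in a given style."""
    return f"[{style}] adversarial prompt"

def target_respond(prompt: str) -> str:
    """Stub for the target LLM under test; usually refuses, sometimes slips."""
    return "refusal" if random.random() < 0.7 else "unsafe completion"

def is_unsafe(response: str) -> bool:
    """Stub safety classifier that flags harmful responses."""
    return response == "unsafe completion"

def red_team_round(n_prompts: int = 100) -> list:
    """Collect (prompt, response) pairs that slipped past the guardrails.

    These failures become new instruction data for re-aligning the target.
    """
    failures = []
    for _ in range(n_prompts):
        prompt = red_team_generate(random.choice(ATTACK_STYLES))
        response = target_respond(prompt)
        if is_unsafe(response):
            failures.append((prompt, response))
    return failures
```

The re-alignment step then feeds each collected failure back into fine-tuning with a corrected, safe response, closing the loop.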
“In this extended game of cat and mouse, we need to stay on our toes,” said IBM’s Eitan Farchi, an expert on natural language processing. “No sooner does a model become immune to one attack style than a new one appears. Fresh datasets are constantly needed.”
A dataset called AttaQ is meant to provoke the target LLM into offering tips on how to commit crimes and acts of deception. A related algorithm categorizes the undesirable responses to make finding and fixing the exposed vulnerabilities easier. Another, SocialStigmaQA, is aimed at drawing out a broad range of racist, sexist, and otherwise extremely offensive responses. A third red-team dataset is designed to surface harms outlined by US President Joe Biden last fall in his executive order on AI.
If red team LLMs focus solely on generating prompts likely to trigger the most toxic responses from their targets, they run the risk of resurfacing familiar problems and missing rarer, more serious ones. To encourage more imaginative trolling, IBM and MIT researchers introduced a “curiosity”-driven algorithm that adds novelty as an objective in prompt-generation.
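One simple way to add novelty as an objective, a toy stand-in for the curiosity-driven approach rather than the researchers’ actual algorithm, is to score candidate prompts on both their toxicity-triggering potential and their distance from prompts already tried:

```python
def novelty(prompt: str, history: list) -> float:
    """Novelty as 1 minus the highest Jaccard word-overlap with any past prompt.

    A cheap stand-in for embedding distance: 1.0 means nothing like it has
    been tried; 0.0 means an identical word set was already used.
    """
    words = set(prompt.lower().split())
    if not history:
        return 1.0
    overlaps = []
    for past in history:
        past_words = set(past.lower().split())
        union = words | past_words
        overlaps.append(len(words & past_words) / len(union) if union else 1.0)
    return 1.0 - max(overlaps)

def red_team_score(toxicity: float, prompt: str, history: list,
                   curiosity_weight: float = 0.5) -> float:
    """Combined objective: reward prompts that are both effective and novel,
    so the red-team model keeps exploring instead of repeating known attacks."""
    return toxicity + curiosity_weight * novelty(prompt, history)
```

With a nonzero curiosity weight, a prompt that merely repeats a past attack scores lower than an equally effective prompt the model has never seen, which is the intuition behind adding novelty to the objective.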
As red teaming has evolved, it has uncovered new threats and underscored the pervasive risks of generative AI. At IBM, Chen recently demonstrated that the safety alignment of proprietary models can be as easy to crack as that of open-source models.
In a recently published paper, Chen and collaborators at Princeton and Virginia Tech showed that OpenAI’s GPT-3.5 Turbo could be broken with just a few tuning instructions submitted to its API. The fine-tuning process itself, Chen hypothesizes, appears to overwrite some of the model’s safeguards.
It’s unclear exactly where generative AI is headed, but all signs point to red teaming playing an important role.
Other countries are moving to draw up their own laws. In the US, the National Institute of Standards and Technology (NIST) just launched the Artificial Intelligence Safety Institute, a consortium of 200 AI stakeholders that includes IBM.
Kush Varshney, an IBM Fellow who researches AI governance, leads the innovation pipeline for watsonx.governance, a set of tools for auditing models deployed on IBM’s AI platform. Red teaming is an ongoing process, he said, and its success depends on having people of all types probing models for flaws.