No foolproof method exists so far for protecting artificial intelligence systems from misdirection, warns an American standards body, and AI developers and users should be wary of anyone who claims otherwise.
The caution comes from the U.S. National Institute of Standards and Technology (NIST), in new guidance for application developers on the vulnerabilities of predictive and generative AI and machine learning (ML) systems, the types of attacks they might expect, and approaches to mitigating them.
“Adversaries can deliberately confuse or even ‘poison’ artificial intelligence (AI) systems to make them malfunction — and there’s no foolproof defense that their developers can employ,” says NIST.
The paper, titled Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (NIST.AI.100-2), is part of NIST’s effort to support the development of trustworthy AI. It can also help put NIST’s AI Risk Management Framework into practice.
One major issue is that the data used to train AI systems may not be trustworthy, NIST says. Those data sources may include websites and interactions with the public. There are many opportunities for bad actors to corrupt this data, both during an AI system’s training period and afterward, while the AI continues to refine its behaviors by interacting with the physical world. This can cause the AI to perform in an undesirable manner. Chatbots, for example, might learn to respond with abusive or racist language when their guardrails are circumvented by carefully crafted malicious prompts.
“For the most part, software developers need more people to use their product so it can get better with exposure,” NIST computer scientist Apostol Vassilev, one of the publication’s authors, said in a statement. “But there is no guarantee the exposure will be good. A chatbot can spew out bad or toxic information when prompted with carefully designed language.”
In part because the datasets used to train an AI are far too large for people to successfully monitor and filter, there is no foolproof way as yet to protect AI from misdirection. To assist the developer community, the new report offers an overview of the sorts of attacks AI products might suffer and corresponding approaches to reduce the damage. It sorts the attacks into four major types:
— evasion attacks, which occur after an AI system is deployed, attempt to alter an input to change how the system responds to it. Examples would include adding markings to stop signs to make an autonomous vehicle misinterpret them as speed limit signs or creating confusing lane markings to make the vehicle veer off the road;
— poisoning attacks, which occur in the training phase by introducing corrupted data. An example would be slipping numerous instances of inappropriate language into conversation records, so that a chatbot interprets these instances as common enough parlance to use in its own customer interactions;
— privacy attacks, which occur during deployment, are attempts to learn sensitive information about the AI or the data it was trained on in order to misuse it. An adversary can ask a chatbot numerous legitimate questions, and then use the answers to reverse engineer the model to find its weak spots, or guess at its sources. Adding undesired examples to those online sources could make the AI behave inappropriately, and making the AI unlearn those specific undesired examples after the fact can be difficult;
— abuse attacks, which involve the insertion of incorrect information into a source, such as a webpage or online document, that an AI then absorbs. Unlike poisoning attacks, abuse attacks attempt to give the AI incorrect pieces of information from a legitimate but compromised source to repurpose the AI system’s intended use.
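To make the evasion idea concrete, here is a minimal sketch (not taken from the NIST report) of an evasion attack on a toy linear classifier. The weights, bias, feature values, class labels, and the function names `score` and `evade` are all hypothetical. For a linear model the gradient of the score with respect to the input is simply the weight vector, so nudging every feature by at most `eps` against the sign of its weight, a fast-gradient-sign-style step, is enough to push this input across the decision boundary.

```python
# Minimal sketch of an evasion attack on a hypothetical linear classifier.
# All weights, inputs, and labels below are made up for illustration.


def score(w, b, x):
    """Linear decision score: > 0 means class A (say, 'stop sign'), otherwise class B."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign(v):
    """Return 1, -1, or 0 depending on the sign of v."""
    return (v > 0) - (v < 0)


def evade(w, b, x, eps):
    """Fast-gradient-sign-style step against a linear model.

    The gradient of score() with respect to x is w, so moving each feature by
    eps * sign(w_i) against the current prediction shifts the score toward the
    other class by eps * sum(|w_i|), the largest shift possible when no single
    feature may change by more than eps.
    """
    direction = sign(score(w, b, x))
    return [xi - direction * eps * sign(wi) for xi, wi in zip(x, w)]


if __name__ == "__main__":
    w, b = [0.5, -0.3, 0.8, 0.1], -0.2  # hypothetical trained weights and bias
    x = [0.6, 0.2, 0.4, 0.5]            # hypothetical clean input features
    x_adv = evade(w, b, x, eps=0.25)
    print(score(w, b, x) > 0, score(w, b, x_adv) > 0)  # True False
```

Markings on a stop sign play the same role as `eps` here: a bounded change to the input that flips the output. Real attacks on deep image models have to compute or estimate gradients through many nonlinear layers rather than reading them off a single weight vector.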
“Most of these attacks are fairly easy to mount and require minimum knowledge of the AI system and limited adversarial capabilities,” said report co-author Alina Oprea, a professor at Northeastern University. “Poisoning attacks, for example, can be mounted by controlling a few dozen training samples, which would be a very small percentage of the entire training set.”
Many mitigations focus on data and model sanitization. However, the report adds, they should be combined with cryptographic techniques for origin and integrity attestation of AI systems. Red teaming, in which an internal team attacks the system, is also vital as part of pre-deployment testing and evaluation to identify vulnerabilities, the report says.
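As a rough illustration of what integrity checking for training data can look like (a generic technique, not a scheme taken from the report), the sketch below fingerprints each training record with SHA-256, authenticates the resulting manifest with an HMAC, and later reports which records were changed. The function names, record IDs, and key are hypothetical. An HMAC only proves integrity to parties who share the secret key; attesting origin to outside parties would typically call for public-key signatures instead.

```python
import hashlib
import hmac
import json


def build_manifest(records):
    """Map each training record ID to the SHA-256 hex digest of its bytes."""
    return {rid: hashlib.sha256(data).hexdigest() for rid, data in records.items()}


def sign_manifest(manifest, key):
    """Compute an HMAC-SHA256 tag over a canonical (key-sorted) JSON encoding of the manifest."""
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(records, manifest, tag, key):
    """Check the manifest's tag, then return sorted IDs of records that were altered, added, or removed."""
    if not hmac.compare_digest(sign_manifest(manifest, key), tag):
        raise ValueError("manifest tag does not match; the manifest itself may have been altered")
    current = build_manifest(records)
    all_ids = manifest.keys() | current.keys()
    return sorted(rid for rid in all_ids if manifest.get(rid) != current.get(rid))


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"  # illustrative only; a real key would be stored securely
    records = {
        "chat-001": b"How do I reset my password?",
        "chat-002": b"Thanks, that worked.",
    }
    manifest = build_manifest(records)
    tag = sign_manifest(manifest, key)

    records["chat-002"] = b"(attacker-inserted text)"  # simulate tampering after the fact
    print(verify(records, manifest, tag, key))  # ['chat-002']
```

A check like this only flags data that changed after the manifest was created; it cannot tell whether the original records were already poisoned, which is why the report pairs such techniques with data and model sanitization.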
However, the report also admits that a lack of reliable benchmarks can make it difficult to evaluate how well proposed mitigations actually perform.
“Given the multitude of powerful attacks, designing appropriate mitigations is a challenge that needs to be addressed before deploying AI systems in critical domains,” says the report.
This challenge, it notes, is exacerbated by the lack of secure machine learning algorithms for many tasks. “This implies that presently designing mitigations is an inherently ad hoc and fallible process,” the report says.
The report also says developers and buyers of AI systems will have to accept certain trade-offs, because the trustworthiness of an AI system depends on all of the attributes that characterize it. For example, an AI system that is accurate but easily susceptible to adversarial exploits is unlikely to be trusted. Conversely, an AI system optimized for adversarial robustness may exhibit lower accuracy and deteriorated fairness outcomes.
“In most cases, organizations will need to accept trade-offs between these properties and decide which of them to prioritize depending on the AI system, the use case, and potentially many other considerations about the economic, environmental, social, cultural, political, and global implications of the AI technology.”
Joseph Thacker, principal AI engineer and security researcher at AppOmni, called the report “the best AI security publication I’ve seen. What’s most noteworthy are the depth and coverage. It’s the most in-depth content about adversarial attacks on AI systems that I’ve encountered. It covers the different forms of prompt injection, elaborating and giving terminology for components that previously weren’t well-labeled. It even references prolific real-world examples like the DAN (Do Anything Now) jailbreak, and some amazing indirect prompt injection work. It includes multiple sections covering potential mitigations, but is clear about it not being a solved problem yet.
“It also covers the open vs closed model debate. There’s a helpful glossary at the end, which I personally plan to use as extra ‘context’ to large language models when writing or researching AI security. It will make sure the LLM and I are working with the same definitions specific to this subject domain. Overall, I believe this is the most successful over-arching piece of content covering AI security.”