There’s a scene in William Gibson’s 2010 novel Zero History, in which a character embarking on a high-stakes raid dons what the narrator refers to as the “ugliest T-shirt” in existence — a garment which renders him invisible to CCTV. In Neal Stephenson’s Snow Crash, a bitmap image is used to transmit a virus that scrambles the brains of hackers, leaping through computer-augmented optic nerves to rot the target’s mind. These stories, and many others, tap into a recurring sci-fi trope: that a simple image has the power to crash computers.
But the concept isn’t fiction — not completely, anyway. Last year, researchers were able to fool a commercial facial recognition system into thinking they were someone else just by wearing a pair of patterned glasses. A sticker overlay with a hallucinogenic print was stuck onto the frames of the specs. The twists and curves of the pattern look random to humans, but to a computer designed to pick out noses, mouths, eyes, and ears, they resembled the contours of someone’s face — any face the researchers chose, in fact. These glasses won’t delete your presence from CCTV like Gibson’s ugly T-shirt, but they can trick an AI into thinking you’re the Pope. Or anyone you like.
These types of attacks are bracketed within a broad category of AI cybersecurity known as “adversarial machine learning,” so called because it presupposes the existence of an adversary of some sort (in this case, a hacker). Within this field, the sci-fi tropes of ugly T-shirts and brain-rotting bitmaps manifest as “adversarial images” or “fooling images,” but adversarial attacks can take other forms, including audio and perhaps even text. These phenomena were discovered independently by a number of teams in the early 2010s. They usually target a type of machine learning system known as a “classifier,” something that sorts data into different categories, like the algorithms in Google Photos that tag pictures on your phone as “food,” “holiday,” and “pets.”
To a human, a fooling image might look like a random tie-dye pattern or a burst of TV static, but show it to an AI image classifier and it’ll say with confidence: “Yep, that’s a gibbon,” or “My, what a shiny red motorbike.” Just as with the facial recognition system that was fooled by the psychedelic glasses, the classifier picks up visual features of the image that are so distorted a human would never recognize them.
These patterns can be used in all sorts of ways to bypass AI systems, and have substantial implications for future security systems, factory robots, and self-driving cars — all places where AI’s ability to identify objects is crucial. “Imagine you’re in the military and you’re using a system that autonomously decides what to target,” Jeff Clune, co-author of a 2015 paper on fooling images, tells The Verge. “What you don’t want is your enemy putting an adversarial image on top of a hospital so that you strike that hospital. Or if you are using the same system to track your enemies; you don’t want to be easily fooled [and] start following the wrong car with your drone.”
These scenarios are hypothetical, but perfectly viable if we continue down our current path of AI development. “It’s a big problem, yes,” Clune says, “and I think it’s a problem the research community needs to solve.”
The challenge of defending against adversarial attacks is twofold: not only are we unsure how to effectively counter existing attacks, but we keep discovering more effective attack variations. The fooling images described by Clune and his co-authors, Jason Yosinski and Anh Nguyen, are easily spotted by humans. They look like optical illusions or early web art, all blocky color and overlapping patterns, but there are far more subtle approaches to be used.
One type of adversarial image, referred to by researchers as a “perturbation,” is all but invisible to the human eye. It exists as a ripple of pixels on the surface of a photo, and can be applied to an image as easily as an Instagram filter. These perturbations were first described in 2013, and in a 2014 paper titled “Explaining and Harnessing Adversarial Examples,” researchers demonstrated how flexible they were. That pixely shimmer is capable of fooling a whole range of different classifiers, even ones it hasn’t been trained to counter. A recently revised study named “Universal Adversarial Perturbations” made this feature explicit by successfully testing the perturbations against a number of different neural nets, a result that excited a lot of researchers last month.
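To get a feel for why such a faint shimmer can matter, here is a minimal sketch in the spirit of the gradient-sign trick from “Explaining and Harnessing Adversarial Examples.” It is not the researchers’ code: a toy linear scorer stands in for a real neural network, and the weights, pixel values, `epsilon`, and the “panda” and “gibbon” labels are all made up for illustration. The point is only that a thousand tiny nudges, each too small to notice, can add up to a flipped answer.

```python
# Toy sketch: a linear scorer stands in for an image classifier.
# All weights, pixel values, and labels here are invented for illustration.

def score(weights, pixels):
    """Positive score means "panda", negative means "gibbon" (hypothetical labels)."""
    return sum(w * x for w, x in zip(weights, pixels))

def sign(value):
    """Return -1, 0, or 1 depending on the sign of value."""
    return (value > 0) - (value < 0)

def perturb(weights, pixels, epsilon):
    """Nudge each pixel by at most epsilon in the direction that lowers the score.

    For a linear scorer the gradient with respect to the input is the weight
    vector itself, so the most damaging small step is -epsilon * sign(weight).
    """
    return [x - epsilon * sign(w) for w, x in zip(weights, pixels)]

n_pixels = 1000
weights = [0.01 if i % 2 == 0 else -0.01 for i in range(n_pixels)]
pixels = [0.52 if i % 2 == 0 else 0.48 for i in range(n_pixels)]

adversarial = perturb(weights, pixels, epsilon=0.03)

clean_score = score(weights, pixels)               # positive: "panda"
adversarial_score = score(weights, adversarial)    # negative: "gibbon"
largest_change = max(abs(a - p) for a, p in zip(adversarial, pixels))
```

No single pixel moves by more than about 3 percent of its range, yet the sign of the score flips. Real attacks work against far more complicated models, but the underlying arithmetic is similar.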
Using fooling images to hack AI systems does have its limitations. First, it takes more time to craft scrambled images in such a way that an AI system thinks it’s seeing a specific image, rather than making a random mistake. Second, you often, but not always, need access to the internal code of the system you’re trying to manipulate in order to generate the perturbation in the first place. And third, attacks aren’t consistently effective. As shown in “Universal Adversarial Perturbations,” what fools one neural network 90 percent of the time may only have a success rate of 50 or 60 percent on a different network. (That said, even a 50 percent error rate could be catastrophic if the classifier in question is guiding a self-driving semi truck.)
To better defend AI against fooling images, engineers subject classifiers to “adversarial training.” This involves feeding a classifier adversarial images so it can identify and ignore them, like a bouncer learning the mugshots of people banned from a bar. Unfortunately, as Nicolas Papernot, a graduate student at Pennsylvania State University who’s written a number of papers on adversarial attacks, explains, even this sort of training is weak against “computationally intensive strategies” (i.e., throw enough images at the system and it’ll eventually fail).
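A rough sketch of the idea, under heavy simplifying assumptions: the classifier below is a two-feature perceptron rather than a neural network, the two data points are invented, and the helper names (`worst_case`, `train`, `count_fooled`) are hypothetical. With the `adversarial` flag on, the training loop also shows the model a worst-case nudged copy of each example, correctly labeled, so the learned boundary has to leave some breathing room around the data.

```python
# Toy adversarial training with a perceptron. Data and names are illustrative only.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(value):
    return (value > 0) - (value < 0)

def worst_case(w, x, y, eps):
    """Shift each feature by eps in the direction that most hurts label y."""
    return [xi - eps * y * sign(wi) for wi, xi in zip(w, x)]

def train(data, eps, adversarial, max_epochs=100):
    """Perceptron updates on clean samples, plus attacked copies if requested."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            for use_attack in ([False, True] if adversarial else [False]):
                sample = worst_case(w, x, y, eps) if use_attack else x
                if y * (dot(w, sample) + b) <= 0:
                    w = [wi + y * si for wi, si in zip(w, sample)]
                    b += y
                    mistakes += 1
        if mistakes == 0:
            break
    return w, b

def count_fooled(w, b, data, eps):
    """How many points are misclassified once they are nudged by eps?"""
    return sum(1 for x, y in data
               if y * (dot(w, worst_case(w, x, y, eps)) + b) <= 0)

data = [([1.0, 0.0], 1), ([-1.5, 0.0], -1)]
eps = 0.75

plain_w, plain_b = train(data, eps, adversarial=False)
hardened_w, hardened_b = train(data, eps, adversarial=True)

fooled_plain = count_fooled(plain_w, plain_b, data, eps)
fooled_hardened = count_fooled(hardened_w, hardened_b, data, eps)
```

In this toy setup the plainly trained model gets one of its two points wrong once an attacker nudges it, while the hardened model holds up against the same nudge. That says nothing about stronger or different attacks, which is the weakness Papernot describes.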
To add to the difficulty, it’s not always clear why certain attacks work or fail. One explanation is that adversarial images take advantage of a feature found in many AI systems known as “decision boundaries.” These boundaries are the invisible rules that dictate how a system can tell the difference between, say, a lion and a leopard. A very simple AI program that spends all its time identifying just these two animals would eventually create a mental map. Think of it as an X-Y plane: in the top right it puts all the leopards it’s ever seen, and in the bottom left, the lions. The line dividing these two sectors — the border at which lion becomes leopard or leopard a lion — is known as the decision boundary.
The problem with the decision boundary approach to classification, says Clune, is that it’s too absolute, too arbitrary. “All you’re doing with these networks is training them to draw lines between clusters of data rather than deeply modeling what it is to be leopard or a lion.” Systems like these can be manipulated in all sorts of ways by a determined adversary. To fool the lion-leopard analyzer, you could take an image of a lion and push its features to grotesque extremes, but still have it register as a normal lion: give it claws like digging equipment, paws the size of school buses, and a mane that burns like the Sun. To a human it’s unrecognizable, but to an AI checking its decision boundary, it’s just an extremely liony lion.
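The lion-leopard example can be made concrete with a very small sketch. Everything here is assumed for illustration: the two features (“mane” and “spots”), the hand-picked weights, and the logistic scoring function are stand-ins for whatever a real network learns. The boundary is simply the line where the score crosses 0.5, and nothing in the math penalizes an input for being absurdly far from anything the system has seen.

```python
import math

# Hypothetical hand-picked weights for a two-feature lion/leopard scorer.
MANE_WEIGHT = 2.0
SPOTS_WEIGHT = -2.0

def lion_probability(mane, spots):
    """Logistic score: above 0.5 falls on the lion side of the decision boundary."""
    z = MANE_WEIGHT * mane + SPOTS_WEIGHT * spots
    return 1.0 / (1.0 + math.exp(-z))

ordinary_lion = lion_probability(mane=1.0, spots=0.1)
leopard = lion_probability(mane=0.1, spots=1.0)

# A "mane that burns like the Sun": far outside any plausible animal,
# but still on the lion side of the line, and further from it than ever.
grotesque_lion = lion_probability(mane=50.0, spots=0.0)
```

The grotesque input scores as more lion-like than the ordinary lion, because distance from the boundary is all this kind of scorer measures.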
As far as we know, adversarial images have never been used to cause real-world harm. But Ian Goodfellow, a research scientist at Google Brain who co-authored “Explaining and Harnessing Adversarial Examples,” says they’re not being ignored. “The research community in general, and especially Google, take this issue seriously,” says Goodfellow. “And we’re working hard to develop better defenses.” A number of groups, like the Elon Musk-funded OpenAI, are currently conducting or soliciting research on adversarial attacks. The conclusion so far is that there is no silver bullet, but researchers disagree on how much of a threat these attacks are in the real world. There are already plenty of ways to hack self-driving cars, for example, that don’t rely on calculating complex perturbations.
Papernot says such a widespread weakness in our AI systems isn’t a big surprise — classifiers are trained to “have good average performance, but not necessarily worst-case performance — which is typically what is sought after from a security perspective.” That is to say, researchers are less worried about the times the system fails catastrophically than how well it performs on average. One way of dealing with dodgy decision boundaries, suggests Clune, is simply to make image classifiers that more readily suggest they don’t know what something is, as opposed to always trying to fit data into one category or another.
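One simple way to picture Clune’s suggestion is a classifier that is allowed to abstain. The sketch below is an assumption-laden illustration, not a method from his research: the raw scores, the label list, the `classify_or_abstain` helper, and the 0.9 threshold are all hypothetical. It converts scores into probabilities and answers “unknown” whenever no category is a clear winner.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_or_abstain(scores, labels, threshold=0.9):
    """Return the top label, or "unknown" if its probability is below threshold."""
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    if probabilities[best] < threshold:
        return "unknown"
    return labels[best]

labels = ["lion", "leopard", "starfish"]
confident = classify_or_abstain([5.0, 0.1, 0.2], labels)   # one clear winner
uncertain = classify_or_abstain([1.0, 0.9, 1.1], labels)   # three-way near tie
```

A threshold like this would not, on its own, stop fooling images that a network labels with high confidence. It only shows what “I don’t know” could look like as an output.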
Meanwhile, adversarial attacks also invite deeper, more conceptual speculation. The fact that the same fooling images can scramble the “minds” of AI systems developed independently by Google, Mobileye, or Facebook, reveals weaknesses that are apparently endemic to contemporary AI as a whole.
“It’s like all these different networks are sitting around saying why don’t these silly humans recognize that this static is actually a starfish,” says Clune. “That is profoundly interesting and mysterious; that all these networks are agreeing that these crazy and non-natural images are actually of the same type. That level of convergence is really surprising people.”
For Clune’s colleague, Jason Yosinski, the research on fooling images points to an unlikely similarity between artificial intelligence and intelligence developed by nature. He noted that the same category errors made by AIs and their decision boundaries also exist in the world of zoology, where animals are tricked by what scientists call “supernormal stimuli.”
These stimuli are artificial, exaggerated versions of qualities found in nature that are so enticing to animals that they override their natural instincts. This behavior was first observed around the 1950s, when researchers used it to make birds ignore their own eggs in favor of fakes with brighter colors, or to get red-bellied stickleback fish to fight pieces of trash as if they were rival males. The fish would fight trash, so long as it had a big red belly painted on it. Some people have suggested human addictions, like fast food and pornography, are also examples of supernormal stimuli. In that light, one could say that the mistakes AIs are making are only natural. Unfortunately, we need them to be better than that.