Thoughts on "Adversarial examples in the physical world"
composition.al 2016-10-26
In machine learning, computers “learn” by being shown lots of examples; for instance, we might use millions of pictures to train a model that can identify pictures of cats. The idea is that if we do a good job training the model, then it should be able to correctly identify new cat pictures (that is, ones it never saw during training) most of the time.
But it’s possible to take a picture that the model classifies correctly as being of a cat, and then tweak that picture very slightly in a way that would be undetectable to a human, but that fools the same model into classifying it with high confidence as, say, a philodendron. These subtly tweaked images are called adversarial examples, and there are known techniques for generating lots of them. Such techniques rely on making tiny, precise perturbations to the original correctly-classified example.
In addition to being hilarious, adversarial examples (and the fact that we know how to generate them) have important implications for computer security. Ian Goodfellow pointed out that although adversarial examples are unlikely to occur naturally, they could come up in practice in situations where there really is an adversary — someone who is intentionally trying to fool the system.1
Suppose I’m an attacker who has something to gain from making your model misclassify an image. Not only that, but let’s say that I want the misclassified image to be one that a human could look at and not suspect that anything fishy was going on. This distinction is important. For instance, we know that certain dramatic makeup and hair styles can fool state-of-the-art face recognition systems, but they’re pretty likely to be noticed by human observers. Adversarial examples, on the other hand, have the potential to fool machine learning systems while flying under humans’ radar. In their recently published technical report, “Adversarial examples in the physical world”, Goodfellow and his co-authors Alexey Kurakin and Samy Bengio write, “An adversarial example for the face recognition domain might consist of very subtle markings applied to a person’s face, so that a human observer would recognize their identity correctly, but a machine learning system would recognize them as being a different person.” Or imagine subtle stickers on a road sign that would go unnoticed by a human driver, but would fool the sign recognition system used by an autonomous vehicle.
But is it really possible to carry out such an attack? An adversarial input is created by making subtle, per-pixel perturbations to the original input. If the attack has to be carried out in the physical world, and the input has to be perceived by the machine learning system through a sensor — like, say, the camera used by a face detection system, or one on a self-driving car that “reads” road signs — then is it still possible to pass in an adversarial input? The “noise” of the physical world would interfere with the fine-grained perturbations that make the image adversarial. And you couldn’t just, say, print out a carefully-crafted adversarial image, take a picture of it with an ordinary camera, and expect the resulting image to still be adversarial…
…or could you?
That’s the question that the “Adversarial examples in the physical world” work addresses. I found the paper to be approachable and fun to read, even for someone like me without much machine learning background. Quoting from the introduction:
In this paper we explore the possibility of creating adversarial examples in the physical world for image classification tasks. For this purpose we conducted an experiment with a pre-trained ImageNet Inception classifier (Szegedy et al., 2015). We generated adversarial examples for this model, then we fed these examples to the classifier through a cell-phone camera and measured the classification accuracy. This scenario is a simple physical world system which perceive data through a camera and then runs image classification. We found that a large fraction of adversarial examples generated for the original model remain misclassified even when perceived through a camera.
To do the experiments in this paper, the authors literally just printed out a bunch of known adversarial images with an ordinary office printer and took pictures of them with an ordinary camera phone. They then fed those photos back to the image classification system, along with the original adversarial images, and compared the classifier’s accuracy on the two. From that comparison, they measured the extent to which the print-out-and-take-a-photo process caused “adversarial destruction”. Adversarial destruction is when a transformation on an adversarial image causes it to become less adversarial. (It’s also the name of my metal band.)
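Here’s a rough sketch of what that compare-the-accuracy step could look like, assuming a TensorFlow/Keras setup with a pre-trained Inception v3 classifier. This isn’t the authors’ code, and the directory names, file layout, and ground-truth file are all made up for illustration; it just shows the shape of the measurement.

```python
# Rough sketch: classify the original adversarial images and their
# printed-and-photographed counterparts with a pre-trained Inception v3
# model, and compare top-1 accuracy on the two sets.
# Directory and file names below are hypothetical.
import glob
import numpy as np
from tensorflow.keras.applications.inception_v3 import InceptionV3, preprocess_input
from tensorflow.keras.preprocessing import image

model = InceptionV3(weights="imagenet")  # pre-trained ImageNet classifier

def top1(image_paths):
    """Return the model's top-1 predicted class index for each image."""
    batch = np.stack([
        image.img_to_array(image.load_img(p, target_size=(299, 299)))
        for p in image_paths
    ])
    return model.predict(preprocess_input(batch)).argmax(axis=1)

originals = sorted(glob.glob("adversarial_originals/*.png"))  # hypothetical paths
photos = sorted(glob.glob("adversarial_photos/*.png"))        # photographed printouts
true_labels = np.loadtxt("true_labels.txt", dtype=int)        # hypothetical ground truth

print("accuracy on original adversarial images:",
      np.mean(top1(originals) == true_labels))
print("accuracy on photographed adversarial images:",
      np.mean(top1(photos) == true_labels))
```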
After that, they did several more experiments to try to tease out which aspects of the process did and didn’t contribute to adversarial destruction. They found that brightness and contrast changes didn’t contribute much, but that Gaussian noise and JPEG conversion did. They also considered different ways to generate adversarial images to learn which ones were more robust to adversarial destruction.
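For a sense of what those synthetic transformations involve, here’s a rough sketch of the brightness, contrast, Gaussian-noise, and JPEG steps using PIL and NumPy. The parameter values are placeholders of my own choosing, not the settings from the paper, and you’d still need to run the transformed images through a classify-and-compare loop like the one sketched above to estimate how much adversarial destruction each transformation causes.

```python
# Sketch of the kinds of image transformations studied in the paper.
# Parameter values are illustrative placeholders, not the paper's settings.
import io
import numpy as np
from PIL import Image, ImageEnhance

def change_brightness(img, factor=1.2):
    return ImageEnhance.Brightness(img).enhance(factor)

def change_contrast(img, factor=0.8):
    return ImageEnhance.Contrast(img).enhance(factor)

def add_gaussian_noise(img, stddev=8.0):
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, stddev, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def jpeg_round_trip(img, quality=75):
    # Encode to JPEG in memory and decode again, simulating JPEG conversion.
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```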
How to generate robust adversarial images that cause “interesting mistakes”?
Section 2 of the paper describes three techniques the authors used to generate adversarial images for their experiments. It’s the most technically dense part of the paper, and I don’t understand much of what’s happening here, but I want to make note of a few things that I do understand.
First of all, all of the techniques they present for generating adversarial images rely on knowledge of the internals of the trained model. The authors write, “We intentionally omit network weights (and other parameters) $\theta$ in the cost function because we assume they are fixed (to the value resulting from training the machine learning model) in the context of the paper.” In other words, they leave $\theta$ out of their equations because they are always dealing with a particular $\theta$. To be able to apply their methods, though, you still need to know what $\theta$ is!2 Furthermore, you need to know the cost function $J$ that was used to train the model.
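To make that concrete, here’s roughly what the first of the three techniques, which the paper calls the “fast” method, boils down to: a single gradient step of $J$ with respect to the input image, which you can only take if you can backpropagate through the trained model, that is, through the fixed $\theta$. This is a minimal sketch assuming a TensorFlow/Keras classifier that outputs raw logits; it’s not the authors’ implementation, and the $\epsilon$ value is just a placeholder.

```python
# Sketch of the "fast" method: X_adv = X + eps * sign(dJ/dX).
# Computing this gradient requires the trained model (the fixed theta)
# and the cost function J it was trained with.
import tensorflow as tf

def fast_adversarial_example(model, x, y_true, eps=0.007):
    """x: batch of images as floats, y_true: integer class labels."""
    x = tf.convert_to_tensor(x)
    with tf.GradientTape() as tape:
        tape.watch(x)                      # we want dJ/dX, not dJ/dtheta
        logits = model(x, training=False)  # assumes the model outputs raw logits
        loss = tf.keras.losses.sparse_categorical_crossentropy(
            y_true, logits, from_logits=True)
    grad = tape.gradient(loss, x)
    x_adv = x + eps * tf.sign(grad)        # one step in the direction that increases J
    return tf.clip_by_value(x_adv, 0.0, 1.0)  # keep pixels valid, assuming [0, 1] inputs
```

Taking the sign of the gradient and scaling by $\epsilon$ caps how much any individual pixel can change, which is what keeps the perturbation imperceptible.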
At first glance, this might seem like a showstopper problem for an adversary, who doesn’t have access to the guts of the trained model that they’re trying to fool. But the authors do point out earlier work by Szegedy et al. that provides evidence that an adversarial example designed to be misclassified by one model is often misclassified by another model. The earlier paper had an experiment that partitioned the set of 60,000 MNIST training images into two parts of 30,000 images each, and trained separate models on them. If I’m understanding the results in Table 4 of Szegedy et al. correctly, they found that adversarial examples created for one of the networks would fool the other network between 5.1% and 8.2% of the time. While 5.1 to 8.2% might not sound like a lot, I imagine plenty of adversaries would be happy with an attack that worked that often. Still, this isn’t an easy attack to pull off: the adversary has to train a model of their own, which is a lot of work, and they don’t necessarily know much about what training data the original system used or how to get their hands on more data like it. If you were trying to fool, say, a commercial road sign recognition system, it seems to me like it’d be hard to come up with training data that is as similar to the original system’s training data as the two halves of the MNIST data set are to each other.
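Measuring that kind of transferability is at least conceptually simple once you do have two trained models in hand: generate adversarial examples against one and count how often they also fool the other. Here’s a rough sketch; the models, the data, and the `make_adversarial` helper are all stand-ins of mine, not the actual Szegedy et al. setup.

```python
# Sketch: of the examples that are adversarial for model_a, what fraction
# are also misclassified by model_b?  Assumes Keras-style models whose
# predict() returns class probabilities; all names here are hypothetical.
import numpy as np

def transfer_rate(model_a, model_b, x_test, y_test, make_adversarial):
    x_adv = make_adversarial(model_a, x_test, y_test)   # attack model_a only
    preds_a = np.argmax(model_a.predict(x_adv), axis=1)
    preds_b = np.argmax(model_b.predict(x_adv), axis=1)
    fooled_a = preds_a != y_test                         # adversarial for model_a
    fooled_b_too = fooled_a & (preds_b != y_test)        # ...and also for model_b
    return fooled_b_too.sum() / max(fooled_a.sum(), 1)
```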
It’s also noteworthy that of the three techniques that the “physical world” paper presents for generating adversarial examples, the first two, the “fast” method and the “basic iterative” method, just try to get the model to make any incorrect classification. If you perturb a cat picture using one of the first two techniques, you’re trying to get it misclassified as anything other than a cat. However, an adversary would most likely be trying to get the classifier to make a specific mistake, not just any mistake. The authors address this issue by proposing a third technique that results in “more interesting mistakes”:
On ImageNet, with a much larger number of classes and the varying degrees of significance in the difference between classes, these methods can result in uninteresting misclassifications, such as mistaking one breed of sled dog for another breed of sled dog. In order to create more interesting mistakes, we introduce the iterative least-likely class method. This iterative method tries to make an adversarial image which will be classified as a specific desired target class. […] For a well-trained classifier, the least-likely class is usually highly dissimilar from the true class, so this attack method results in more interesting mistakes, such as mistaking a dog for an airplane.
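Concretely (if I’m reading Section 2 right), the update they iterate looks something like this, where $\mathrm{Clip}_{X,\epsilon}\{\cdot\}$ clamps each pixel of the perturbed image to within $\epsilon$ of the original image $X$, and $y_{LL}$ is the class the model considers least likely for $X$:

$$X^{adv}_0 = X, \qquad X^{adv}_{N+1} = \mathrm{Clip}_{X,\epsilon}\left\{ X^{adv}_N - \alpha\,\mathrm{sign}\!\left(\nabla_X J\!\left(X^{adv}_N, y_{LL}\right)\right) \right\}$$

Subtracting the signed gradient of $J(\cdot, y_{LL})$ nudges the image toward being classified as $y_{LL}$, whereas the first two methods add it in order to push the image away from its true class.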
Although this third method is called the iterative least-likely class method, it seems like it would be possible to use a version of it to target any specific class instead of just the least-likely class. In the procedure they give, $y_{LL}$ is the least-likely class label, but unless I’m really misunderstanding things, any other class label could be swapped in. So, if you want to fool a face-recognition system into thinking that you’re not merely someone other than you (or the person whose face is least like yours), but some specific other person, then it seems like this third method has you covered. In any case, it’s a great day whenever I get to read a paper containing a sentence that begins, “In order to create more interesting mistakes…”.
Finally, the paper’s experiments found that the “fast” method for generating adversarial images was the most robust of the three to adversarial destruction. This is convenient for adversaries who want to generate adversarial images quickly. However, the “fast” adversarial images are also those that are most obviously different from the original “clean” image (see, for instance, Figure 2 of the paper), so they might be easier for a suspicious human to detect. (Still, although to a human a “fast” adversarial image of a bird might indeed look different from a clean image of a bird, it doesn’t look like an image of anything else in particular. It just looks like a somewhat noisy image of a bird.) The authors hypothesize that the robustness of the “fast” images “could be explained by the fact that iterative methods exploit more subtle kind of perturbations, and these subtle perturbations are more likely to be destroyed by photo transformation.” One question I was left with after reading this, though, was whether images that remained adversarial after the photo transformation were still adversarial in the same way that they were beforehand: did the model make the same misclassification that it did before? An adversary who wants to cause a specific misclassification would care about this. I’m imagining the answer is probably yes, but from what I can tell, the paper doesn’t say.
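If you had the model’s predictions on the matched pairs of images (pre-photo and post-photo), answering that question would just be a matter of comparing predicted labels pairwise rather than comparing overall accuracies. A tiny sketch, with all of the variable names made up:

```python
# Sketch: of the adversarial images that are still misclassified after the
# photo transformation, what fraction are misclassified as the *same* wrong
# class as before?  preds_before / preds_after are top-1 class indices for
# corresponding images; true_labels are ground truth.  Names are hypothetical.
import numpy as np

def same_mistake_rate(preds_before, preds_after, true_labels):
    still_adversarial = (preds_before != true_labels) & (preds_after != true_labels)
    same_mistake = still_adversarial & (preds_before == preds_after)
    return same_mistake.sum() / max(still_adversarial.sum(), 1)
```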
The phone in your pocket right now
One thing I really like about this paper is that it includes lots of details about how the physical-world experiments were carried out, like what kind of printer they used (it was a Ricoh MP C5503, set to 600 dpi) and what kind of phone they used (a Nexus 5x). The paper even gives the details of the automatic cropping technique they used to crop correctly-sized images out of each photo after they were taken.
These details are, of course, helpful for anyone who might want to try to reproduce the experiment, and I appreciated them on an intellectual level for that reason. But they also just delighted me as a reader in a much less intellectual, much more visceral way. In my line of work, I spend so much time dealing with the abstract, “only slightly removed from pure thought-stuff”, and most of the time, that suits me fine. But then along comes this paper, saying, “We printed out these pictures on a printer like the one you’re sitting a few feet away from, then took pictures of them with a phone like the one in your pocket right now!” It was incredibly refreshing.
I felt the same way recently when I came across this paper about anomaly detection in autonomous robots. The authors had done experiments with a mobile robot to evaluate how it would react to anomalous situations. They had done awful, hilarious things to the robot, like taping a coin to one of its wheels that would cause it to change its heading every time the coin touched the floor. As I read the paper, I couldn’t help but empathize with the poor robot, thumping along on its bum wheel, desperately trying to correct its heading. Maybe to a roboticist, this sort of thing is unremarkable, but for me, getting something so wonderfully visceral out of a computer science paper doesn’t happen all that often. As I continue to explore areas that are new to me with my new lab, I’m looking forward to getting to learn a lot more about systems that touch the physical world.
1. Even if you aren’t worried about encountering an adversary in practice, there’s evidence that training on adversarial inputs can improve your model’s performance on clean inputs, too.
2. It may or may not be possible to recover information about the weights of a trained model based on observing inputs and outputs – if anyone wants to point me to any work that’s been done on that, I’d be interested. But even if that’s possible, we’re also assuming a setting here where the adversary doesn’t have direct access to the model, so they wouldn’t be able to reverse-engineer the weights this way.