Robustness in train and test time

Windows On Theory 2021-04-01

Scribe notes by Praneeth Vepakomma

Previous post: Unsupervised learning and generative models Next post: TBD. See also all seminar posts and course webpage.

Lecture slides (pdf) | lecture slides (PowerPoint with animation and annotation) | video

In this blog post, we will focus on the topic of robustness – how well (or not so well) do machine learning algorithms perform when either their training or testing/deployment data differs from our expectations.

We will cover the following areas:

  1. Math/stat refresher
  2. Multiplicative weights algorithm
  3. Robustness
    • train-time
      • robust statistics
      • robust mean estimation
      • data poisoning
    • test-time
      • distribution shifts
      • adversarial perturbations

📝Math/stat refresher

KL refresher

The KL-divergence between two probability distributions p and q is the expectation (under p) of the log of the probability ratio: \Delta_{KL}(p \| q) = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right].

KL-divergence is always non-negative and can be written as the difference between the cross-entropy and the entropy: \Delta_{KL}(p \| q) = H(p,q) - H(p). The non-negativity property thereby implies that for all distributions p,q, H(p,q) \geq H(p).
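As a quick numerical sanity check of these facts (a minimal sketch; the two distributions below are arbitrary choices):

```python
import numpy as np

# Two arbitrary distributions on 4 outcomes (illustrative choices).
p = np.array([0.5, 0.2, 0.2, 0.1])
q = np.array([0.25, 0.25, 0.25, 0.25])

kl = np.sum(p * np.log(p / q))                  # KL(p || q)
entropy = -np.sum(p * np.log(p))                # H(p)
cross_entropy = -np.sum(p * np.log(q))          # H(p, q)

print(kl)                                       # always >= 0
print(np.isclose(kl, cross_entropy - entropy))  # KL = H(p,q) - H(p)
print(cross_entropy >= entropy)                 # hence H(p,q) >= H(p)
```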

Introduction to concentration

Consider i.i.d. random variables X_1, X_2, \ldots, X_n with expectation \mu and unit variance (for concreteness, think of X_i \sim N(\mu, 1)). Concentration is the phenomenon that the empirical average \frac{1}{n}\sum_i X_i is distributed approximately like N(\mu, \frac{1}{n}), i.e., like \mu+\frac{1}{\sqrt{n}}N(0,1). Therefore the standard deviation of the empirical average is a factor \sqrt{n} smaller than the standard deviation of each of the X_i‘s. The central limit theorem ensures this asymptotically in n, while non-asymptotic versions of this phenomenon are given by inequalities in the Chernoff/Hoeffding/Bernstein style. These roughly have the form \mathrm{Pr}\left[\left|\sum_i X_i - \mu \cdot n\right| \geq \epsilon n\right] \approx \exp(-\epsilon^2 n)
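A quick simulation of this shrinking standard deviation (a minimal sketch; the sample sizes and number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 2.0

for n in [100, 1_000, 10_000]:
    # 500 independent trials; each trial averages n i.i.d. N(mu, 1) samples.
    averages = rng.normal(mu, 1.0, size=(500, n)).mean(axis=1)
    # The standard deviation of the empirical average is roughly 1/sqrt(n).
    print(n, averages.std(), 1 / np.sqrt(n))
```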

Matrix notations

Norms

We denote the spectral norm of a matrix \mathbf{A} by \left| \mathbf{A} \right| = \underset{| v| = 1}{\max}\left| \mathbf{A}v \right| (which for a symmetric psd matrix equals \lambda_{\max}(\mathbf{A})) and the Frobenius norm by \left| \mathbf{A} \right|_F = \sqrt{\sum_{i,j}\mathbf{A}_{ij}^2}.

PSD Ordering

We use the matrix ordering \mathbf{A} \preceq \mathbf{B} if v^T\mathbf{A} v \leq v^T\mathbf{B} v for all vectors v. We similarly refer to \mathbf{A} \in [a,b] \mathbf{I} if a\mathbf{I} \preceq \mathbf{A} \preceq b\mathbf{I}.

Matrix/vector valued random variables

Vector valued normals: If \mu \in \mathbb{R}^{d} is a vector and \mathbf{V} \in \mathbb{R}^{d \times d} is a psd covariance matrix, then x \sim N(\mu, \mathbf{V}) denotes the normal distribution over \mathbb{R}^{d} with

\mathbb{E}\left[x_{i}\right]=\mu_{i}, \;\; \mathbb{E}\left[\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right)\right]=\mathbf{V}_{i, j}

For every psd matrix V, there exists a Normal distribution where \mathbb{E}\left[\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right)\right]=\mathbf{V}_{i, j}

Standard vector-valued normal: This refers to the case when x \sim N\left(0^{d}, I_{d}\right) (or x \sim N(0, I)) and

\mathbb{E}[x]=\vec{0},\;\; \mathbb{E}\left[x x^{\top}\right]=I \quad\left(\text{i.e., } \mathbb{E}\left[x_{i}^{2}\right]=1 \text{ and } \mathbb{E}\left[x_{i} x_{j}\right]=0 \text{ for } i \neq j\right)

Matrix concentration

For scalar random variables, we saw that if Y_{1}, \ldots, Y_{n} are i.i.d. bounded random variables over \mathbb{R} with expectation \mu then

\mathrm{Pr}\left[\left|\sum Y_{i}-\mu \cdot n\right| \geq \epsilon n\right] \approx \exp \left(-\epsilon^{2} n\right)

If the random variables were vectors instead, we can easily generalize this to n samples of Y_i \in \mathbb{R}^d. Generalizing to matrices is far more interesting, especially when the norm under consideration is the spectral norm instead of the Frobenius norm.

Matrix Bernstein inequality:

If Y_{1}, \ldots, Y_{n} are i.i.d. symmetric matrices in \mathbb{R}^{d \times d} with \mathbb{E} Y_{i}=\mu and \left|Y_{i}\right| \leq O(1) (i.e., bounded in spectral norm), then

\mathrm{Pr}\left[\left|\sum Y_{i}-\mu \cdot n\right| \geq \epsilon n\right] \approx d \cdot \exp \left(-\epsilon^{2} n\right)

Note that \mu is a matrix in this case, and the norm under consideration is the spectral norm. The difference w.r.t. the scalar case is the additional multiplicative factor of d, or equivalently an additive factor of \log(d) in the exponent, as in the equations above. Please refer to Tropp, 2015 Chapter 6 for formally precise statements.

Another property that follows is that the expected norm of the deviation satisfies \mathbb{E}\left|\sum Y_{i}-\mu \cdot n\right| \leq O(\sqrt{n \log d})
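A small simulation of this phenomenon (a sketch; here each Y_i is a random symmetric sign matrix scaled so that |Y_i| = O(1) and \mu = 0):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 2000

# Each Y_i: symmetric matrix with +-1/sqrt(d) entries, so ||Y_i|| = O(1) and E[Y_i] = 0.
total = np.zeros((d, d))
for _ in range(n):
    g = rng.choice([-1.0, 1.0], size=(d, d)) / np.sqrt(d)
    total += (g + g.T) / 2

spectral_norm = np.linalg.norm(total, ord=2)   # spectral norm of the sum (deviation from mean 0)
print(spectral_norm)
print(np.sqrt(n * np.log(d)))                  # the O(sqrt(n log d)) scale from matrix Bernstein
```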

There are some long-standing conjectures and results on cases when one can get rid of this additional factor of \log(d).

Trivia on log factors: As a tangent (and some trivia!), one example of a result in this flavor, where a \log factor was replaced by a constant, is Spencer’s 1985 paper titled ‘Six standard deviations suffice’, which showed that a constant of 6 suffices in certain discrepancy bounds where a naive argument would incur an extra logarithmic factor. The Kadison–Singer problem (resolved by Marcus–Spielman–Srivastava) and the Paulsen problem are two other examples of works in this flavor.

Random matrices

For a random d \times d symmetric matrix \mathbf{A} with \mathbf{A}_{i,j} \sim N(0,1), the spectrum of eigenvalues is distributed according to the Wigner semi-circle law, supported on [-2\sqrt{d},+2\sqrt{d}], as shown in the figure below. Note that most of the mass is close to 0.

Marchenko-Pastur distribution (for random empirical covariance matrices)

Let \mathbf{X} = \frac{1}{n} \mathbf{A}\mathbf{A}^{\top} where \mathbf{A} \in \mathbb{R}^{d\times n} has entries \mathbf{A}_{ij} \sim N(0,1); that is, \mathbf{X} is the empirical estimate of the covariance of x_1,\ldots,x_n \sim N(0,\mathbf{I}_d). Its eigenvalues are distributed according to the Marchenko–Pastur distribution. When d < n, the eigenvalues are bounded away from zero. When d \approx n, there is much more mass close to 0 (far more than in the semi-circle law). When d > n, there is a spike of d-n eigenvalues at 0 and the rest are bounded away from 0.
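A quick way to see these three regimes numerically (a sketch; the dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

for d in [50, 200, 400]:   # d < n, d ~ n, d > n
    A = rng.normal(size=(d, n))
    X = A @ A.T / n                 # empirical covariance of n samples from N(0, I_d)
    eigs = np.linalg.eigvalsh(X)
    print(d, n, eigs.min(), eigs.max())
    # d < n : smallest eigenvalue bounded away from 0
    # d ~ n : many eigenvalues pile up near 0
    # d > n : d - n zero eigenvalues, the rest bounded away from 0
```

The smallest nonzero eigenvalue here is what controls the stability (condition number) of the corresponding linear regression problem discussed next.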

Connections to the stability of linear regression: In the regime d \approx n, linear regression is most unstable, since many eigenvalues of the empirical covariance are close to 0. When d > n, although linear regression is over-parametrized and the solution is not unique, the solution is still fairly stable, as in the case d < n. When d\gg n the condition number is infinite, and therefore we use a pseudo-inverse rather than the usual inverse when computing the regression solution; this ignores the subspace on which the matrix has zero eigenvalues. Restricted to the span of the non-zero eigenvectors, the condition number is finite.

Digression: Multiplicative Weights Algorithm

Note: Its variants and connections include Follow The Regularized Leader, regret minimization, and mirror descent. Elad Hazan’s lecture notes on online convex optimization and Arora, Hazan, and Kale’s article on multiplicative and matrix multiplicative weights are a great reading source for these topics.

The following is a general setup in online optimization and online learning.

Setup: n possible actions a_{1}, \ldots, a_{n}

At time t=1,2, \ldots, T, we incur loss L_{i, t} if we take action i at time t. After acting, we also learn the losses of the actions we did not take. (This is referred to as the experts model in online learning, in contrast to the bandit model where we only learn the loss of the action taken; the experts model is an easier setup than the bandit model.)

The following is the overall approach for learning to optimize in this online setup.

Overall Approach:

  • Initialize p_{0} distribution over action space [n]. We then take a step based on this distribution and observe the incurred loss.
  • Then the distribution is updated to p_{t+1} by letting p_{t+1}(i) \propto p_{t}(i) \exp \left(-\eta L_{i, t}\right), i.e., if an action gave a bad outcome in terms of loss, we downweight its probability for the next draw of the action to be taken. \eta is the penalty term (step size) that governs the aggressiveness of the downweighting.
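Here is a minimal sketch of this update loop in Python (the loss matrix is an arbitrary illustration; in the experts model the full loss vector is revealed each round):

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """losses: T x n array; losses[t, i] is the loss of action i at time t."""
    T, n = losses.shape
    p = np.ones(n) / n                       # p_0: uniform over the n actions
    total_loss = 0.0
    for t in range(T):
        total_loss += p @ losses[t]          # expected loss of playing p_t
        p = p * np.exp(-eta * losses[t])     # downweight actions that did badly
        p = p / p.sum()                      # normalization Z_t
    return total_loss / T

rng = np.random.default_rng(0)
T, n = 1000, 10
losses = rng.uniform(size=(T, n))
eta = np.sqrt(np.log(n) / T)
avg_loss = multiplicative_weights(losses, eta)
best_fixed = losses.mean(axis=0).min()       # best single action in hindsight
print(avg_loss, best_fixed, avg_loss - best_fixed)   # regret ~ O(sqrt(log(n)/T))
```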

The hope is that this approach converges to a “good aggregation or strategy”. We measure the quality of the strategy using regret.

The difference between the cost we incur and the cost of the optimal action (or probability distribution over actions) in hindsight is known as the (average) regret.

Note that we compare the loss we incur with the best loss over a fixed (i.e., nonadaptive) probability distribution over the set of possible actions. It can be shown that the best such loss can always be achieved by a distribution that puts all the weight on a single action.

The following theorem bounds the regret; roughly, it says that \mathrm{Regret} \leq \frac{\Delta_{KL}(p^{\ast} \| p_{0})}{\eta \cdot T} + O(\eta), which is O(\sqrt{\log(n)/T}) for a suitable choice of \eta.

The ‘prior ignorance’ term captures how good the initialization is. The ‘sensitivity per step’ term captures how aggressive the exploration is (i.e., this term governs the potential difference between the loss incurred at p_t versus at p_{t-1}). As \eta also occurs in the denominator of the prior ignorance term, it needs to be carefully chosen to balance the two terms properly.

The first inequality of this theorem is always true for any p^{\ast}, p_0, whether p^{\ast} is the optimal strategy or not. The second (right-most) inequality is true when the optimal p^{\ast} is a delta function (which it will be for the optimal strategy), p_0 is initialized to the uniform distribution, and \eta is set to \sqrt{\frac{\log(n)}{T}}. Note that in this case, the divergence \Delta_{KL}(p^{\ast} \| p_0) = \log n.

To state the ‘overall approach’ more precisely, there needs to be a normalization Z_{t} at every step: p_{t+1}(i)=p_{t}(i) \exp \left(-\eta L_{i, t}\right) / Z_{t}. Taking logarithms and rearranging gives L_{i, t}=\frac{1}{\eta} \log \frac{p_{t}(i)}{p_{t+1}(i) \cdot Z_{t}}.

By substituting this in below

we get the following expanded version of what the regret amounts to be:

The last equation above is due to the telescoping sum where adjacent terms cancel out except the first and last terms.

The last inequality here is because the cross-entropy is always at least as large as the entropy.

So far we have shown:

Regret \leq \frac{\Delta_{K L}\left(p^{\ast} \| p_{0}\right)}{\eta \cdot T}+\frac{1}{\eta \cdot T} \sum_{t} \Delta_{K L}\left(p_{t} \| p_{t+1}\right)

Now we use the property that if p, q are such that p(i) \propto q(i) \rho_{i} for \rho_{i} \in[1-\eta, 1+\eta], then \Delta_{K L}(p \| q) \leq O\left(\eta^{2}\right). Note that p_{t} and p_{t+1} satisfy this relation with \rho_i = \exp(-\eta L_{i,t}) \in [1-O(\eta), 1+O(\eta)] for bounded losses. Substituting this bound proves the stated upper bound on the regret.

This subclaim that was used can be proved as follows
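Briefly, here is a sketch: write \rho_i = 1 + \delta_i with |\delta_i| \leq \eta, and let Z = \sum_j q(j)\rho_j = 1 + \sum_j q(j)\delta_j, so that p(i) = q(i)\rho_i / Z = q(i)(1+O(\eta)). Then \Delta_{KL}(p \| q) = \sum_i p(i) \log\frac{\rho_i}{Z} = \sum_i p(i)\log \rho_i - \log Z. Using \log(1+x) = x + O(x^2), we have \log\rho_i = \delta_i + O(\eta^2) and \log Z = \sum_j q(j)\delta_j + O(\eta^2); moreover, replacing p(i) by q(i) in the first sum changes it by only O(\eta)\cdot O(\eta) = O(\eta^2). The first-order terms \sum_i q(i)\delta_i then cancel, leaving \Delta_{KL}(p \| q) \leq O(\eta^2).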

Generalization of multiplicative weights: Follow The Regularized Leader (FTRL)

Suppose now that the set K of actions is convex (for example, the set of probability distributions over discrete actions is convex).

At time t+1, the algorithm makes a choice x_{t+1} \in K and then learns the cost function L_{t+1}: K \rightarrow \mathbb{R} (so the setup is again the experts model rather than the bandit model).

Now the new action in FTRL is obtained from the following optimization, which minimizes the loss so far regularized by R(x): FTRL: x_{t+1}=\arg \min_{x \in K}\left(R(x)+\sum_{i=1}^{t} L_{i}(x)\right)

In this case, the regret bound is given by the following theorem.

x^\ast refers to the optimal choice. This indeed has a similar flavor to the theorem we saw for multiplicative weights. To be precise, when K=\{\text{set of all distributions on action space }[n]\} and R(x)=-\frac{1}{\eta} H(x), FTRL becomes exactly multiplicative weights. Similarly there is a view that connects multiplicative weights to mirror descent. (Nisheeth Vishnoi has notes on this.)
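As a quick sanity check of this correspondence (a sketch, ignoring boundary issues): over the simplex, the FTRL objective with R(x) = -\frac{1}{\eta}H(x) = \frac{1}{\eta}\sum_i x(i)\log x(i) and linear losses L_s(x) = \sum_i x(i) L_{i,s} has the Lagrangian stationarity condition \frac{1}{\eta}(\log x(i) + 1) + \sum_{s \leq t} L_{i,s} + \lambda = 0 (with \lambda the multiplier for \sum_i x(i) = 1). Solving gives x_{t+1}(i) \propto \exp\left(-\eta \sum_{s \leq t} L_{i,s}\right) = \prod_{s \leq t} \exp(-\eta L_{i,s}), which is exactly the multiplicative weights distribution.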

Train-Time Robustness

We now look at some robustness issues specific to the training phase in machine learning.

For example, during training, there could be adversarially poisoned examples in the training dataset that damage the trained model’s performance.

Setup: Suppose a (1-\epsilon) fraction of the samples are generated from a genuine data distribution and an \epsilon fraction are maliciously chosen from an arbitrary distribution, i.e., in formal notation: x_{1}, \ldots, x_{(1-\epsilon) n} \sim X, a distribution over \mathbb{R}^{d}, \;\; \text {and } x_{(1-\epsilon) n+1}, \ldots, x_{n} are arbitrary. (While for convenience we assume here that the last \epsilon n items are maliciously chosen, the learning algorithm does not get the data in order.)

Assume \left|x_{i}\right|^{2} \approx 1 for i<(1-\epsilon) n

Let us start with the task of estimating the mean under this setup.

Mean estimation: Estimate \mu = \mathbb{E}X

Noiseless case: In the noiseless case, the best estimate for the mean under many measures is the empirical mean \hat{\mu}=\frac{1}{n}\sum_i X_i

Standard concentration results show that for d=1 |\hat{\mu}-\mu| \leq O(1 / \sqrt{n}) and for a general d that |\hat{\mu}-\mu| \leq O(\sqrt{d / n}).

Adversarial case: If there is even a single malicious sample (that can be arbitrarily large or small) then it is well known that the empirical mean can give an arbitrarily bad approximation of the true population mean. For d=1 we can use the empirical median \mu^{\ast}=\mathrm{sort}\left(x_{1}, \ldots, x_{n}\right)_{n / 2}. This is guaranteed to lie in the \left(\frac{1}{2}-\epsilon, \frac{1}{2}+\epsilon\right) quantile of the real data. This is essentially optimal: for X \sim N(\mu, 1) it guarantees \left|\mu-\mu^{\ast}\right| \leq O(\epsilon+1 / \sqrt{n}).
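A small simulation of this contrast between the empirical mean and the median under contamination (a sketch; the contamination value M and the parameters are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, mu, M = 10_000, 0.05, 1.0, 1e6

good = rng.normal(mu, 1.0, size=int((1 - eps) * n))
bad = np.full(int(eps * n), M)            # adversary places eps*n points far away
data = np.concatenate([good, bad])

print(abs(data.mean() - mu))              # ruined: grows with M (roughly eps * M)
print(abs(np.median(data) - mu))          # robust: O(eps + 1/sqrt(n))
```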

In a higher-dimension, this story instead goes as follows:

Setup: x_{1}, \ldots, x_{(1-\epsilon) n} \sim X \subseteq \mathbb{R}^{d}, \quad x_{(1-\epsilon) n+1}, \ldots, x_{n} arbitrary.

Median of coordinates (a starter solution): When d \geq 1, we can take \mu_{i}^{\ast}=\mathrm{sort}\left(x_{1, i}, \ldots, x_{n, i}\right)_{n / 2} per coordinate as an initial naive solution (i.e., a median of coordinates). In this case, say if we have X \sim N(\mu, I), then an adversary can perturb the data to make an \epsilon fraction of the points be (\mu_1+M,\mu_2+M,\ldots,\mu_d+M) where M is some large number. This will shift the median in each coordinate i to \mu_i + \Omega(\epsilon) instead of \mu_i, and hence make \mu^{\ast} \approx \mu + c\cdot (\epsilon, \ldots, \epsilon), which implies that \left|\mu^{\ast}-\mu\right| \approx c\cdot \epsilon \sqrt{d} for some constant c>0.
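This \sqrt{d} blow-up is easy to see numerically (a sketch of the attack just described; \epsilon, d, and M are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps, M = 20_000, 0.05, 100.0

for d in [10, 100, 1000]:
    mu = np.zeros(d)
    data = rng.normal(mu, 1.0, size=(n, d))
    k = int(eps * n)
    data[:k] = mu + M                       # eps fraction of points moved to (M, ..., M)
    coord_median = np.median(data, axis=0)
    print(d, np.linalg.norm(coord_median - mu))   # grows roughly like eps * sqrt(d)
```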

The obvious next question is: Can we do better? i.e, can we avoid paying this dimension-dependent \sqrt{d} price?

Yes we can! We can use the ‘Tukey median’!

What is a Tukey median?

Informally, via the picture below: if you have a Tukey median (red), then no matter which direction you take from it, the corresponding half-space contains about half of the points.

Formal definition of Tukey median is given as follows (fixing some parameters):

A Tukey median* of x_{1}, \ldots, x_{n} \in \mathbb{R}^d is a vector \mu^{\ast} \in \mathbb{R}^d s.t. for every nonzero v \in \mathbb{R}^{d}, the fraction of points i with \left\langle x_i - \mu^{\ast}, v\right\rangle > 0 is \frac{1}{2} \pm O(\epsilon) (say, within \frac{1}{2} \pm 2\epsilon).

A Tukey median need not always exist for a given dataset. However, we will show that if the data is generated at random according to a nice distribution, and then at most an \epsilon fraction of it is perturbed, a Tukey median will exist with high probability. In fact, the true population mean will be such a median.

Existence of Tukey median

THM: If x_{1}, \ldots, x_{(1-\epsilon) n} \sim N(\mu, I) and \sqrt{d / n} \ll \epsilon then

  1. Tukey median \mu^{\ast} exists and
  2. For every Tukey median \mu^{\ast} , \left|\mu-\mu^{\ast}\right| \leq O(\epsilon).

Together 1. and 2. mean that if we search over all vectors and output the first Tukey median that we find, then (a) this process will terminate and (b) its output will be a good approximation for the population mean. In particular, we do not have to pay the extra \sqrt{d} cost needed in the median of coordinates!

Proof of 1 (existence): The mean itself is a Tukey median in this case because, \forall v \neq 0, if we define Y_{1}=\mathrm{sign}\left(\left\langle x_{1}-\mu, v\right\rangle\right), \ldots, Y_{k}=\mathrm{sign}\left(\left\langle x_{k}-\mu, v\right\rangle\right) then these are i.i.d. \pm 1 variables of mean zero, and thus:

\mathrm{Pr}\left[\sum Y_{i}>\epsilon k\right]<\exp \left(-\epsilon^{2} n\right) \ll \exp (-d) for \sqrt{d / n} \ll \epsilon and k=(1-\epsilon)n. Through discretization, we can pretend (losing some constants) that there are only 2^{O(d)} unit vectors in \mathbb{R}^d. Hence we can use the union bound to conclude that for every unit v, if we restrict attention to the “good” (i.e., unperturbed) vectors x_1,\ldots, x_k, then the fraction of them satisfying \langle x_i -\mu , v \rangle >0 will be in 1/2 \pm \epsilon. Since the adversary can perturb at most \epsilon n vectors, the overall fraction of x_i‘s such that \langle x_i -\mu , v \rangle >0 will be in 1/2 \pm 2\epsilon. QED(1)

Proof of 2: Let \mu be the population mean (i.e., the “good” x_i‘s are distributed according to N(\mu,I)). Suppose for simplicity, and toward a contradiction, that \left|\mu-\mu^{\ast}\right| = 10 \epsilon.

Let v=\mu-\mu^{\ast}. Then, \left\langle x_{i}-\mu^{\ast}, v /|v|\right\rangle=\left\langle x_{i}-\mu+v, v /|v|\right\rangle=\left\langle x_{i}-\mu, v /|v|\right\rangle +|v| = \left\langle x_{i}-\mu, v /|v|\right\rangle +10 \epsilon.

Note that \left\langle x_{i}-\mu, v /|v|\right\rangle is distributed as N(0,1), and so \langle x_i - \mu^* , v/|v|\rangle is distributed as N(10\epsilon,1). Hence, if Y_{i}=\mathrm{sign}\left(\left\langle x_{i}-\mu^{\ast}, v /|v|\right\rangle\right) then \mathrm{Pr}\left[Y_{i}=-1\right]=\mathrm{Pr}[N(10\epsilon ,1) < 0] \leq 1 / 2-5 \epsilon. By a similar concentration argument as above, with high probability, for every v at most a 1/2 - 4\epsilon fraction of the i‘s satisfy Y_{i}=-1, contradicting our assumption that \mu^\ast was a Tukey median. QED(2)

Exactly computing the Tukey median is NP-hard, but efficient algorithms for robust mean estimation of normals and other distributions exist as referred to in Jerry Li’s lecture notes. In particular, we can use the following approach using spectral signatures and filtering.

Spectral Signatures can efficiently certify that a given vector is in fact a robust mean estimator. Let x_{1}, \ldots, x_{(1-\epsilon) n} \sim N(\mu, I) and x_{(1-\epsilon) n+1}, \ldots, x_{n} be arbitrary. Let \hat{\mu}=\frac{1}{n} \sum_{i=1}^{n} x_{i} be the empirical mean and \widehat{\Sigma}=\frac{1}{n} \sum_{i=1}^{n}\left(x_{i}-\hat{\mu}\right)\left(x_{i}-\hat{\mu}\right)^{\top} be the empirical covariance. Then we can bound the error of \hat{\mu} as follows:

Claim: |\hat{\mu}-\mu| \leq O\left(\sqrt{\frac{d}{n}}+\sqrt{\epsilon}|\hat{\Sigma}|\right)

In other words, if the spectral norm of the empirical covariance matrix is small, then the empirical mean is a good estimator for the population mean.

Note*: If all points are from N(\mu, I) then |\hat{\Sigma}|=O(\sqrt{d / n}).

The Proof for this claim is given below

Explanation: Here, we assume \mu=0 for simplicity, without loss of generality. The norm is split into a sum over the good points and a sum over the malicious points. The first term of this inequality follows from standard concentration. For the second term (the malicious points), we add and subtract \hat{\mu} and then apply the Cauchy–Schwarz inequality. Upon rearranging the terms and dividing by the norm of \hat{\mu}, we get the desired result.

Filtering is an approach to turn the certificate into an algorithm that actually finds the estimator. The idea is that the same claim holds for non-uniform reweightings of the data points used to estimate the empirical mean and covariance. Hence we can use a violation of the spectral norm condition (the existence of a large eigenvalue) to assign “blame” to some data points and down-weight their contributions, until we reach a probability distribution over points that on the one hand is spread over roughly a 1-O(\epsilon) fraction of the points and on the other hand leads to a weighted empirical covariance matrix with small spectral norm. This motivates the following robust mean estimation algorithm:

Explanation: We first compute the mean and covariance using uniform weighting, and the certificate is checked via the spectral norm of the covariance. If the quality isn’t good enough, then the blame is assigned via the largest eigenvector of the covariance, which contributes the most error. The weighting is then updated from the uniform initialization via a multiplicative-weights style update as given in step 3. Jerry Li and Jacob Steinhardt have wonderful lecture notes on these topics. The algorithm above is computationally efficient, although its bounds are not the tightest.
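Here is a minimal sketch of this filtering idea in Python (not the precise algorithm from the lecture; it down-weights points along the top eigendirection until the spectral certificate holds, and the threshold tol is an arbitrary choice):

```python
import numpy as np

def filter_robust_mean(X, tol=1.5, max_iter=100):
    """Sketch of robust mean estimation by filtering (X: n x d, a small fraction corrupted)."""
    n, d = X.shape
    w = np.ones(n) / n                                # step 1: start from uniform weights
    for _ in range(max_iter):
        mu = w @ X                                    # weighted mean
        centered = X - mu
        cov = (w[:, None] * centered).T @ centered    # weighted covariance
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= tol:                        # step 2: spectral certificate holds
            return mu
        v = eigvecs[:, -1]                            # top eigendirection gets the "blame"
        scores = (centered @ v) ** 2
        w = w * (1 - scores / scores.max())           # step 3: down-weight high-scoring points
        w = w / w.sum()
    return w @ X

# Example: N(mu, I) data with an eps fraction of points shifted far away.
rng = np.random.default_rng(0)
n, d, eps = 5000, 50, 0.05
mu_true = np.ones(d)
X = rng.normal(mu_true, 1.0, size=(n, d))
X[: int(eps * n)] += 20.0
print(np.linalg.norm(X.mean(axis=0) - mu_true))            # empirical mean: badly off
print(np.linalg.norm(filter_robust_mean(X) - mu_true))     # filtered mean: much closer
```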

SoS algorithms: Another approach is via sum-of-squares algorithms, where the guarantees for statistical robustness are much tighter, but which are computationally not very efficient (although they run in polynomial time). A hybrid approach may give a balance of the two, depending on the problem at hand, bridging the gap between computational and statistical efficiency.

A list of relevant references is given below.

Test-time robustness

We now cover robustness issues with distribution shift and adversarial data poisoning during the testing phase in machine learning.

Warm-up with pictures

As shown in Steinhardt, Koh, Liang, 2017 the images below illustrate how poisoning samples can drastically alter the performance of a classifier from good to bad.

Shafahi, Huang, Najibi, Suciu, Studer, Dumitras, Goldstein, 2018 showed how poisoning images can look perfectly fine to human perception, while flipping the model’s predictions so that a fish is classified as a dog and vice versa, as shown below.

Another problem with regards to test-time robustness is the issue of domain shift where the distribution of test data is different from the distribution of training data.

We want to compare \underset{x,y \sim D}{\mathbb{E}} L(f(x),y) vs. \underset{x,y \sim D'}{\mathbb{E}} L(f(x),y). If L is bounded, or if a Lipschitz constant is known, then distances between D and D' such as the earth mover’s distance or the total variation distance can be used to bound the difference. But one needs to be quite careful, or these bounds become too large or close to vacuous.

For example, images of dogs taken in a forest vs. images of dogs taken on roads have a huge distance in measures like the ones mentioned above, since many pixels (although not important ones) differ greatly across the images, and there could be a classifier that performs terribly on the test distribution while doing well on the training one. But, almost magically, there appears to be a linear relationship between the accuracy on D and the accuracy on D'. Typically one would expect the line to lie below the x=y (45 degree) line.

That is, say a model was trained on CIFAR-10 (D) and tested on CIFAR-10.2 (D'); finding this line to be a bit lower than the x=y line is not surprising, but the linear correlation is quite surprising. It is even more surprising when the datasets are more different, say when comparing performance on photos vs. illustrations or cartoons of those photos.

What are the potential reasons (intuitively)?

i) Overfitting on the CIFAR-10 test set,

ii) Fewer human annotators per image introduces a skew in the hardness of the dataset,

iii) Annotators running out of images, as they might end up choosing images that are easier to annotate.

If we achieve better accuracy on D' than we achieved on D, it is a strong indicator of a drop in hardness. If the errors are in the other direction, then this reasoning isn’t as clear.

A toy theoretical model

Here is a toy model that can demonstrate this surprising linear relationship between accuracies under domain shift. Let us do a thought experiment where everything is linear. Assume there is a true “cat direction” CAT in representation (feature) space, as shown in the cartoon below, together with some idiosyncratic, dataset-specific correlations. An example idiosyncrasy: cats tend to be photographed indoors more often than outdoors.

Consider x to be a point that needs to be labeled.

In some dataset D, the probability that x is labeled as a cat is proportional to the following exponential: \mathrm{Pr}[x labeled cat ] \propto \exp (\beta\langle x, C A T+\alpha I\rangle)

where \alpha denotes the idiosyncratic correlation factor. The exponent is the dot product of the image x to be labeled with the CAT direction plus the idiosyncratic direction. The same can be done for a dataset D':

Dataset D^{\prime}: \mathrm{Pr}[x labeled cat ] \propto \exp \left(\beta^{\prime}\left\langle x, C A T+\alpha^{\prime}I^{\prime}\right\rangle\right)

Intuitively \beta' is like the signal to noise ratio. That is, if \beta' < \beta then D' is a harder dataset to classify than D.

So in this toy model, the accuracy that a linear classifier C reaches is given by the following, where the softmax of the RHS is the classification probability.

\mathrm{Acc}_{D}(C)=\beta\langle C, C A T\rangle+\beta \alpha\langle C, I\rangle, \qquad \mathrm{Acc}_{D'}(C)=\beta^{\prime}\langle C, C A T\rangle+\beta^{\prime} \alpha^{\prime}\left\langle C, I^{\prime}\right\rangle. The first term of this accuracy is the universal, transferable part and the second term is the idiosyncratic part.

For \mathrm{Acc}_{D'}(C), if C is trained on D then we assume \langle C, I'\rangle \approx 0. This is because if C is trained on D, then the idiosyncratic directions of D and D' are essentially orthogonal to each other. So the idiosyncratic term is \approx 0 and the accuracy on D' is just the term \beta^{\prime}\langle C, C A T\rangle.

If the model is learnt by gradient descent, then the gradient direction will always be proportional to CAT + \alpha I + \text{Noise}, since the gradient is of the form \nabla(\beta\langle C, C A T\rangle+\alpha \beta\langle C, I\rangle)=\beta \cdot C A T+\beta \alpha \cdot I. So, if C is trained on D, then C \propto C A T+\alpha \cdot I+ \text{Noise} =\frac{\gamma}{|C A T+\alpha \cdot I|^{2}}(C A T+\alpha \cdot I)+ \text{Noise}, where the proportionality constant \frac{\gamma}{|C A T+\alpha \cdot I|^{2}} is chosen so that \langle C, C A T+\alpha \cdot I\rangle = \gamma. Therefore the accuracies will be as follows:

\mathrm{Acc}_{D}(C)=\beta\langle C, C A T+\alpha \cdot I\rangle=\beta \gamma, \qquad \mathrm{Acc}_{D^{\prime}}(C)=\beta^{\prime}\left\langle C, C A T+\alpha \cdot I^{\prime}\right\rangle=\beta^{\prime} \gamma \cdot \frac{|C A T|^{2}}{|C A T|^{2}+\alpha^{2}|I|^{2}}

Therefore, we see a linear relationship of the form: \mathrm{Acc}_{D^{\prime}}(C)=\frac{\beta^{\prime}}{\beta\left(1+\theta^{2}\right)} \cdot \mathrm{Acc}_{D}(C), where \theta^{2} = \alpha^{2}|I|^{2}/|C A T|^{2}.

Note that:

  • \beta^{\prime} / \beta<1 iff D^{\prime} harder than D
  • \theta^{2} grows with idiosyncratic component of D

Although this is a toy theoretical model, it does explain a linear relationship. Finding a model that explains this linear relationship in real life would be an interesting project to think about.
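A small simulation directly instantiating the accuracy formulas of this toy model (a sketch; the dimension, \alpha, \alpha', \beta, \beta' and the noise scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512

# Orthonormal "cat" direction and idiosyncratic directions for D and D'.
CAT = rng.normal(size=d); CAT /= np.linalg.norm(CAT)
I_D = rng.normal(size=d);  I_D -= (I_D @ CAT) * CAT;   I_D /= np.linalg.norm(I_D)
I_D2 = rng.normal(size=d); I_D2 -= (I_D2 @ CAT) * CAT; I_D2 /= np.linalg.norm(I_D2)

alpha, alpha2 = 0.7, 0.5   # idiosyncratic strengths of D and D'
beta, beta2 = 1.0, 0.8     # beta2 < beta: D' is harder than D

accs_D, accs_D2 = [], []
for gamma in np.linspace(0.2, 2.0, 10):
    # Classifiers "trained on D": proportional to CAT + alpha * I_D, plus a little noise.
    C = gamma * (CAT + alpha * I_D) / np.linalg.norm(CAT + alpha * I_D) ** 2
    C = C + 0.01 * rng.normal(size=d)
    accs_D.append(beta * (C @ CAT) + beta * alpha * (C @ I_D))
    accs_D2.append(beta2 * (C @ CAT) + beta2 * alpha2 * (C @ I_D2))

slope = np.polyfit(accs_D, accs_D2, 1)[0]
# Here theta^2 = alpha^2 since |CAT| = |I| = 1; the fitted slope matches the formula.
print(slope, beta2 / (beta * (1 + alpha ** 2)))
```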

Adversarial perturbations

We now move to the last (yet another active) topic of this post: adversarial perturbations. As an introductory example (taken from this tutorial), the hog image x was originally classified as a hog with \approx 99.6 \% probability. A small amount of noise \Delta was added to obtain the image x+\Delta, which to the human eye looks indistinguishable from the original. Nevertheless, the model now misclassifies the noised hog as something that is not a hog.

Should we be surprised by the efficacy of such adversarial perturbations? Originally, certainly yes, but not as much now in hindsight!

In this example it turns out that the magnitude of each element satisfies \left|\Delta_{i}\right| \approx \frac{1}{64}\left|x_{i}\right|, and hence also |\Delta| \approx \frac{1}{64}|x| in 2-norm. Note that the ResNet50 model’s penultimate-layer representation r(x) has dimension d=2048, and we scale |x| such that |r(x)| \approx \sqrt{d}. Suppose the representation map has Lipschitz constant L, in the sense that |r(x+\Delta)-r(x)| \approx L|\Delta|. The final classification decision is made using a unit vector HOG, with \mathrm{Pr}[x \text{ is a hog}] \propto \exp(\langle H O G, r(x)\rangle). For simplicity we model the rest of the representation as random noise, so r(x)=C \cdot H O G+\sqrt{1-C^{2} / d} \cdot N(0, I) where C=\langle r(x), HOG \rangle. We now have the probability that x is not a hog as

\mathrm{Pr}[x is not hog ]=\frac{\exp (-C)}{\exp (C)+\exp (-C)}.

As the observed probability of x not being a hog is \approx 0.004 (it was classified as a hog with \approx 99.6\% probability), we can calculate that C \approx 3. Upon normalizing |r(x)|^2 to be d=2048, we can expect the square of this dot product with the representation r(x) to satisfy roughly \langle r(x), H O G\rangle^{2} \approx 3 \approx |r(x)|^2 / 700. Therefore the norm of the projection of r(x) onto the HOG direction is \left|r(x)_{H O G}\right| \approx \frac{1}{25}|r(x)|.

So, say the 2048-dimensional vector had one larger element that accounts for the HOG direction; it would still account for only a small proportion of the total norm. Therefore it wouldn’t take too much noise to flip the class to the wrong label, as shown in the cartoons below.

So if the Lipschitz constant is greater than around 2.5 or 3, then a perturbation of relative magnitude 1/64 is enough to move the representation by the 1/25 fraction needed to zero out the hog direction (as L \geq \frac{64}{25} \approx 2.5 suffices).

Robust loss

What are some strategies for training neural networks that are robust to perturbations?

  • A set \mathcal{T} of transformations that do not change the true nature of the data, e.g., t_{\Delta}(x)=x+\Delta, where the set is \mathcal{T} = \{t_\Delta : |\Delta| \leq \epsilon\},

i.e., they only perturb the image or data sample by at most \epsilon. Now, given a loss function \mathcal{L}: Y \times Y \rightarrow \mathbb{R} and a classifier f: X \rightarrow Y, the robust loss of f at point (x, y) is defined as \max_{t \in \mathcal{T}} \mathcal{L}(f(t(x)), y).

Robust training:

Given x_{1}, y_{1}, \ldots, x_{n}, y_{n} and a robust loss function, robust training involves minimizing this loss, which gives us: \min_f \frac{1}{n}\sum_{i=1}^{n} \max_{t \in \mathcal{T}} \mathcal{L}\left(f\left(t(x_{i})\right), y_{i}\right)

Now, for the subgoal of finding \nabla_{f} \max_{t \in \mathcal{T}} \mathcal{L}\left(f\left(t(x_{i})\right), y_{i}\right) for the sake of optimization, invoking Danskin’s theorem greatly helps. The theorem basically says that if g(f, t) is nice (differentiable, continuous) and \mathcal{T} is compact, then \nabla_f \max_{t \in \mathcal{T}} g(f,t) = \nabla_f g(f, t^{\ast}(f)), where t^{\ast}(f) is a maximizer of g(f, \cdot).

That is, for any function g(f,t) that depends on f and t, as long as the set \mathcal{T} from which t is chosen is compact and g is nice (differentiable, continuous), then to find the gradient of the maximum of g(f,t) one can find a maximizer t^{\ast} for the particular f at hand, and the gradient of g(f, t^{\ast}) with respect to f gives the required gradient (after some fine print given in the note below). Note: the result extends to the case when t^{\ast}(f) is non-unique, though there is other fine print. See Appendix A of (Madry, Makelov, Schmidt, Tsipras, Vladu 2017).
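As an illustration of this recipe (not the method of any particular paper), here is a sketch of robust training for a linear classifier with \ell_\infty-bounded perturbations, where the inner maximization has a closed form (\Delta^{\ast} = -\epsilon y\, \mathrm{sign}(w)), so by Danskin’s theorem we can simply differentiate the loss evaluated at that maximizer:

```python
import numpy as np

def robust_train_linear(X, y, eps, lr=0.1, steps=2000):
    """Sketch: adversarially robust logistic regression under |Delta|_inf <= eps.

    For a linear model the inner max is exact: the worst-case perturbation of x is
    Delta* = -eps * y * sign(w), so the robust margin is y*<w, x> - eps*||w||_1, and
    (by Danskin) we take gradients of the loss evaluated at that maximizer.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ w) - eps * np.abs(w).sum()        # robust margins
        s = 1.0 / (1.0 + np.exp(np.clip(margins, -30, 30)))  # logistic loss derivative factor
        grad = -(s[:, None] * (y[:, None] * X - eps * np.sign(w))).mean(axis=0)
        w -= lr * grad
    return w

# Toy data: two Gaussian blobs with labels y in {-1, +1}.
rng = np.random.default_rng(0)
n, d, eps = 500, 20, 0.1
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * 0.5 + rng.normal(size=(n, d))

w_robust = robust_train_linear(X, y, eps)
worst_case_acc = np.mean(y * (X @ w_robust) - eps * np.abs(w_robust).sum() > 0)
print(worst_case_acc)    # accuracy against the worst-case eps-bounded perturbation
```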

On the empirical side, there seems to be a trade-off: adversarial robustness can be achieved by giving up some standard accuracy. See the discussion here and the papers cited below.