
Adversarial Training

Discriminative neural networks trained on real data can have adversarial examples. One kind of adversarial example we generate is an out-of-distribution data point that the network nevertheless classifies with high confidence as belonging to some class. Our generation method does not ensure that the generated examples are out of distribution; it only ensures that they are classified with high confidence as belonging to a desired target class, and that they satisfy the following conditions:

  • Individual features (pixels) have valid values (between 0 and 1)
  • Generated examples are sparse

For MNIST, the real data domain also satisfies the above conditions, so the generated examples are not guaranteed to be OOD. For other domains, such as CIFAR, the real data will typically not be sparse.
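A minimal sketch of a generation procedure satisfying these constraints, assuming a trained PyTorch classifier; the function name, hyperparameters, and the Adam-based optimization are illustrative assumptions, not necessarily what this repository does:

```python
import torch
import torch.nn.functional as F

def generate_sparse_adversarial(model, target_class, lam=1e-3, steps=1000, lr=0.1,
                                shape=(1, 1, 28, 28)):
    """Optimize an input image so the (frozen) model assigns high probability to
    target_class, with an L1 penalty encouraging sparsity. Pixels are kept in
    [0, 1] by clamping after every step."""
    x = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    target = torch.full((shape[0],), target_class, dtype=torch.long)
    for _ in range(steps):
        opt.zero_grad()
        # classification loss toward the target class + L1 sparsity penalty
        loss = F.cross_entropy(model(x), target) + lam * x.abs().sum()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)  # condition 1: valid pixel values
    return x.detach()
```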

Objective of adversarial training experiments

Our end-goal is to train a neural network which does not have sparse adversarial images.

We train the neural network to recognize adversarial images generated via our method as belonging to fake classes. However, this trained network may again have sparse adversarial images different from the ones it was trained on. Hence we generate such images again and retrain the network on these new adversarial images. We carry on this process until our adversarial image generation method can no longer generate sparse adversarial images.
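A high-level sketch of this loop (the helper callables, the fake-class labeling scheme, and the stopping thresholds below are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def adversarial_training_loop(model, real_loader, generate_adversarial_batch,
                              retrain_on, num_real_classes=10, max_rounds=20,
                              prob_threshold=0.9, sparsity_threshold=300):
    """Alternate between generating sparse adversarial images and retraining the
    network to classify them into fake classes, until generation stops working.
    generate_adversarial_batch and retrain_on are passed in as callables."""
    for round_idx in range(max_rounds):
        adv_images, target_classes = generate_adversarial_batch(model)
        with torch.no_grad():
            probs = F.softmax(model(adv_images), dim=1)
            target_probs = probs.gather(1, target_classes.unsqueeze(1)).squeeze(1)
            sparsity = (adv_images.flatten(1) > 0).sum(dim=1).float()
        # Stop once the generator no longer yields confident, sparse images:
        # the network is then deemed robust to this generation process.
        if target_probs.mean() < prob_threshold or sparsity.mean() > sparsity_threshold:
            break
        # Relabel generated images as fake classes (here: real class c maps to
        # fake class num_real_classes + c, an assumed convention) and retrain.
        fake_labels = target_classes + num_real_classes
        retrain_on(model, real_loader, adv_images, fake_labels)
    return model
```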

Current adversarial example generation limitations

We use L1 penalization to produce sparse adversarial images. However, this does not guarantee images that are both sparse and classified with high confidence as the target class: the result may lack confidence, may not be sparse, or both.

  • Problem 1: ensure the generated images are sparse
  • Problem 2: ensure the images are classified with high confidence as the target class
  • Problem 3: ensure the images are out of distribution

If at any time during training we detect that our generated images violate any one of these conditions, then we should deem the trained network to be robust to adversarial images generated by our process.

However, there is a caveat: our generation process may simply not be good enough. For example, we may not be running enough optimization steps, or our lambda may be too high (resulting in low-confidence images) or too low (resulting in non-sparse images). Perhaps we can perform early stopping, with the number of recovery steps set very high (say 10k), and stop when, for example, probability >= 0.9 and sparsity <= 300.
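As a concrete illustration of that early-stopping rule, using the thresholds mentioned above (the nonzero-pixel sparsity count and the function name are assumptions):

```python
import torch
import torch.nn.functional as F

def should_stop(model, x, target_class, prob_threshold=0.9, sparsity_threshold=300):
    """Early-stopping check for the recovery loop: stop once the image is
    classified as the target class with high confidence AND is sparse enough."""
    with torch.no_grad():
        prob = F.softmax(model(x), dim=1)[0, target_class].item()
        nonzero_pixels = int((x.abs() > 1e-6).sum().item())
    return prob >= prob_threshold and nonzero_pixels <= sparsity_threshold
```

This check would be evaluated every few recovery steps, with the overall step budget set very high (e.g. 10k) as suggested above.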

To ensure OOD, one may use a random mask to mask out parts of the generated image. These parts are held at 0, and gradients do not flow to them. This may give the generated images enough randomness that they do not look like the input distribution. However, it only covers one kind of OOD adversarial image, and a network trained on such images may still be fooled by other OOD sparse images.
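A sketch of the masking idea in the PyTorch setting above (the keep_fraction parameter and the exact placement inside the recovery loop are assumptions):

```python
import torch

def make_random_mask(shape=(1, 1, 28, 28), keep_fraction=0.5):
    """Random binary mask; positions where the mask is 0 are held at 0 in the
    image seen by the network, and receive zero gradient (multiplying by 0
    zeroes the gradient at those positions)."""
    return (torch.rand(shape) < keep_fraction).float()

# Inside the recovery loop, the masked image is fed to the model instead of x:
#   masked_x = x * mask      # masked pixels are 0; no gradient flows to them
#   logits = model(masked_x)
```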

Metrics to be logged for Adversarial Training

What do we need to verify?

  1. Our adversarial image generation method successfully generates sparse adversarial images.
  2. Our adversarial image generation method successfully generates OOD images.
  3. Our adversarial training ensures that the network can detect adversarial images it was just trained on.
  4. Our adversarial training ensures that the network does not forget past images it was trained on.
  5. Our adversarial training ensures that the network can detect adversarial images that it was not trained on, but generated via the same process on the most recent network.
  6. Our adversarial training ensures that the network can detect adversarial images that it was not trained on, but generated via the same process on a past network.
  7. Our adversarial training ensures that the network can detect adversarial images that it was not trained on, but generated via the same process on a completely different network, trained on a different dataset of the same distribution.
  8. Our adversarial training ensures that the network can detect adversarial images that it was not trained on, and generated via a different process on the same or a different dataset coming from the same distribution. For example, use differential evolution to generate these images.
  9. Our adversarial training ensures that the final trained network does not have sparse adversarial images.
  10. Our adversarial training ensures that the network does well on real training data.
  11. Our adversarial training ensures that the network does well on real test data.
  • For (1), we should log the average probability of the adversarial image belonging to the target class, as well as the average sparsity of the image. We should also do this class-wise (see the sketch after this list).
  • For (2), we don't know what metric should be monitored, or whether there is any metric other than human inspection.
  • For (3), we should log the current adversarial batch's probability of belonging to the fake class. We should also log the loss, and log both for individual fake classes.
  • For (4), we should keep around all past training images and evaluate on them after each epoch, with the fake class as the target during evaluation. If this is too costly, only evaluate on a sample of 1000 images per epoch. Log both aggregate and per-class stats.
  • For (5) and (6), we should generate test adversarial images on each epoch, on which the network is not trained. We should log the same stats as for (4).
  • For (7), keep aside half the training dataset and train a network B before starting adversarial training. We should generate adversarial images for network B using our process and use these as test data with fake classes. Log the same stats as for (4).
  • For (8), we should use the above network B and generate adversarial images using a different process such as differential evolution. Log the same stats as for (7).
  • (4), (5), (6), (7) and (8) are proxies for (9). Use as many adversarial image generation processes in (8) as possible.
  • For (10), log the training loss and the average probability of belonging to the real target class for each batch. Log both per-class and aggregate probabilities.
  • For (11), after each epoch, log the same metrics as (10) on test data.
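A sketch of how the aggregate and per-class probability and sparsity statistics above could be computed for one batch of generated images (the metric names and the nonzero-pixel sparsity measure are assumptions):

```python
import torch
import torch.nn.functional as F

def adversarial_batch_stats(model, images, target_classes, num_classes=10):
    """Aggregate and per-class average target-class probability and sparsity
    for a batch of generated adversarial images."""
    with torch.no_grad():
        probs = F.softmax(model(images), dim=1)
        target_probs = probs.gather(1, target_classes.unsqueeze(1)).squeeze(1)
        sparsity = (images.flatten(1).abs() > 1e-6).sum(dim=1).float()
    stats = {
        "avg_target_prob": target_probs.mean().item(),
        "avg_sparsity": sparsity.mean().item(),
    }
    for c in range(num_classes):
        sel = target_classes == c
        if sel.any():
            stats[f"class_{c}/avg_target_prob"] = target_probs[sel].mean().item()
            stats[f"class_{c}/avg_sparsity"] = sparsity[sel].mean().item()
    return stats
```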

Final Model Selection

The final model needs to be accurate on real data but robust to adversarial data. For this reason, we can use the harmonic mean of real data accuracy and adversarial data accuracy. The reason for taking the harmonic mean instead of the arithmetic mean is that a low score on any one criterion results in a low overall score; hence, scoring high on the harmonic mean implies that all accuracies are reasonably high. Note that the F-score commonly used in NLP is the harmonic mean of precision and recall.

Adversarial data accuracy can be computed in one of the following ways:

  1. whether the fake target class was correctly predicted
  2. whether the final prediction points to any non-real class
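The two definitions above could be computed as in this sketch (the label layout, real classes 0..9 followed by fake classes, is an assumption):

```python
import torch

def adversarial_accuracy(logits, fake_targets, num_real_classes=10, mode="fake_class"):
    """mode='fake_class': prediction must equal the intended fake target class.
    mode='non_real':  prediction only needs to land on any non-real (fake) class."""
    preds = logits.argmax(dim=1)
    if mode == "fake_class":
        correct = preds == fake_targets
    else:
        correct = preds >= num_real_classes
    return correct.float().mean().item()
```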

In case of more than one adversarial test dataset, harmonic mean should be taken in this way:

2/H = 1/A_real + (1/n) * \sum_{i=1}^{n} (1/A_adv_i)

This gives priority to real data accuracy over robustness on any individual adversarial dataset: A_real carries the same weight as all adversarial accuracies combined. We should treat each intermediate adversarial dataset as a separate dataset in this score computation, so that the final selected network performs well on every adversarial dataset.
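The formula above, solved for H, as a small sketch:

```python
def selection_score(real_acc, adv_accs):
    """Selection score H defined by 2/H = 1/A_real + (1/n) * sum_i 1/A_adv_i.
    Real accuracy carries the same weight as the average over all n adversarial
    test datasets combined."""
    n = len(adv_accs)
    return 2.0 / (1.0 / real_acc + sum(1.0 / a for a in adv_accs) / n)

# Example: selection_score(0.98, [0.95, 0.90])
```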

Perturbing an in-distribution image

We can create fooling images by perturbing a real data image with an adversarially generated image.

Let image P = clip(Q + \alpha * R), where Q is a real data image and R is an adversarial image with class(R) != class(Q). Let Q be classified by the network A with high confidence (> 0.9). Then, find 0 < \alpha < \epsilon such that class(P) = class(R) with high confidence (> 0.9). Try \epsilon \in {0.1, 0.2, 0.3, 0.4, 0.5}.

Can we find such an \alpha?
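A simple way to search for such an \alpha for a single pair (Q, R), as a sketch assuming a PyTorch classifier (the grid-search granularity is an assumption):

```python
import torch
import torch.nn.functional as F

def find_alpha(model, Q, R, target_class, epsilon=0.3, num_steps=50, conf=0.9):
    """Grid search over alpha in (0, epsilon]: return the smallest alpha for which
    P = clip(Q + alpha * R) is classified as class(R) with probability > conf."""
    with torch.no_grad():
        for alpha in torch.linspace(epsilon / num_steps, epsilon, num_steps):
            P = torch.clamp(Q + alpha * R, 0.0, 1.0)
            prob = F.softmax(model(P), dim=1)[0, target_class].item()
            if prob > conf:
                return float(alpha), P
    return None, None
```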
