Paper Review - Week 22
Here is the list of the most interesting papers published this week:
Black Box Adversarial Prompting for Foundation Models
Before summarizing this paper, I need to explain and categorize some of the concepts used frequently in the paper: adversarial examples, adversarial attacks, adversarial training, and the square attack.
-
Adversarial Example: An adversarial example is a data sample that has been intentionally designed to mislead a machine learning model into making mistakes in its predictions. These examples are created by applying small, often imperceptible, perturbations to the original input data. For example, let us say that we have a language model for sentiment analysis of input sentences, and assume that the original sentence is: "I love this product. It is amazing." Then, an adversarial sentence could be: "I love this product. It is terrible." As you can see, the adversarial sentence is ambiguous and can be interpreted in multiple ways. But how can generating such examples be beneficial for the model? To answer this question, I would like to highlight the fact that the purpose of adversarial examples is not to help the model learn better. The main purpose is to expose the model's vulnerabilities and limitations. For example, here, the adversarial example is crafted in a way that introduces ambiguity and uncertainty.
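To make this concrete, here is a minimal sketch of how one could compare a sentiment classifier's predictions on the original and the adversarial sentence. It assumes the Hugging Face transformers library and its default sentiment-analysis checkpoint; this is only an illustration of the idea, not the setup used in the paper.

```python
from transformers import pipeline  # assumption: the transformers library is installed

# Default sentiment-analysis pipeline (an off-the-shelf classifier, not the paper's model).
classifier = pipeline("sentiment-analysis")

original = "I love this product. It is amazing."
adversarial = "I love this product. It is terrible."  # small edit that introduces a contradiction

for text in (original, adversarial):
    result = classifier(text)[0]
    print(f"{text!r} -> {result['label']} (score={result['score']:.2f})")
```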
-
Adversarial Attacks: Adversarial attacks are a set of techniques that are used for generating adversarial examples. Adversarial attacks encompass various methods and strategies aimed at making perturbations or modifications to the input data to create adversarial examples that can deceive or mislead machine learning models.
The goal of adversarial attacks is to expose vulnerabilities in machine learning models, understand their weaknesses, and evaluate their robustness against potential threats. By generating adversarial examples, researchers can assess how well a model performs in the presence of perturbations that are designed to mislead it.
-
How can adversarial attacks be used to optimize an objective function? Let us talk about it in more detail. Let us say that we want to find a prompt that maximizes the performance of a language model on a specific downstream task. First, we initialize a prompt and measure how well the model performs using that prompt. After that, we generate a new prompt from the previous one using one of the adversarial attack techniques. Then, based on the performance of the language model with the new prompt, we decide which prompt to keep. For example, we can update a prompt using a genetic algorithm, an evolutionary optimization algorithm that can be considered a type of adversarial attack for generating adversarial examples.
This process continues until we find a prompt that leads to an optimized performance, i.e., a minimized loss value.
-
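As a rough illustration of this loop, here is a minimal sketch of a black-box prompt search. The `score` function and the word list are toy stand-ins of my own; in practice, `score` would evaluate the language model on the downstream task, and `perturb` would be one of the adversarial attack techniques discussed above.

```python
import random

CANDIDATE_WORDS = ["please", "carefully", "answer", "briefly", "step"]  # toy vocabulary

def score(prompt: str) -> float:
    """Toy stand-in for the black-box objective (lower is better).
    Replace with the language model's loss on the downstream task."""
    return abs(len(prompt) - 40)

def perturb(prompt: str) -> str:
    """Generate a new candidate by randomly replacing one word (a stand-in for an attack)."""
    words = prompt.split()
    i = random.randrange(len(words))
    words[i] = random.choice(CANDIDATE_WORDS)
    return " ".join(words)

def search(initial_prompt: str, iterations: int = 100) -> str:
    best_prompt, best_loss = initial_prompt, score(initial_prompt)
    for _ in range(iterations):
        candidate = perturb(best_prompt)
        loss = score(candidate)
        if loss < best_loss:  # keep the new prompt only if it improves the objective
            best_prompt, best_loss = candidate, loss
    return best_prompt

print(search("please answer the question step by step"))
```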
Adversarial Training: In adversarial training, the main focus is on the training process itself. The objective function is often a combination of the regular loss and an adversarial loss computed on perturbed inputs: the perturbations are chosen to maximize the loss, while the model parameters are trained to minimize the resulting worst-case objective (a min-max formulation).
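A common way to write this combined objective (following the standard min-max formulation; the weighting factor \(\alpha\), perturbation \(\delta\), and budget \(\epsilon\) are generic symbols, not notation from the paper) is:
\[
\min_{\theta} \; \mathbb{E}_{(x,y)} \left[ \alpha \, \mathcal{L}\big(f_{\theta}(x), y\big) + (1-\alpha) \max_{\lVert \delta \rVert \le \epsilon} \mathcal{L}\big(f_{\theta}(x+\delta), y\big) \right]
\]
Here the inner maximization plays the role of the attack (finding the worst-case perturbation), and the outer minimization is the usual training of the model parameters \(\theta\).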
-
What is the difference between adversarial training and an adversarial attack? Adversarial training modifies the objective function during the model's training phase; the objective function includes both the regular loss (e.g., cross-entropy loss) and the adversarial loss. On the other hand, an adversarial attack does not modify the model's training objective function. Instead, it aims to generate specific adversarial examples that can mislead the model after it has been trained.
In other words, adversarial training is a defense mechanism applied during model training to improve its resilience against adversarial examples. It alters the objective function to encourage the model to make correct predictions on both clean and adversarial data. An adversarial attack, on the other hand, does not modify the model's training process but instead focuses on finding specific inputs that can deceive the model after training, highlighting areas of vulnerability in the model's decision-making process.
The UP5 paper uses an adversarial training technique, whereas this paper leverages adversarial attack techniques.
-
Square Attack: It is a type of adversarial attack which can also be applied in the NLP domain. Let us say that we want to optimize a soft prompt using the square attack technique. The steps are as follows (a small code sketch follows the list):
-
Subset Selection: Among all the tokens in the soft prompt, we select a subset of them (e.g., \(k\) tokens) as candidates to be replaced with other tokens. To replace them, we need to keep the indices of these candidate to-be-replaced tokens.
-
Sampling Values: To replace the \(k\) tokens, we generate \(k\) random vectors of the same size. Accordingly, we can build \(k\) modified prompts \(p_{i}\) for \(i = 1, \ldots, k\), where \(p_i\) is a new modified prompt in which the \(i\)-th candidate token is replaced with a generated random vector of the same size.
-
Update Prompt: Let us say that, for prompt \(p_{i}\) where \(i = 1, \ldots, k\), the objective function value is \(f(p_{i})\). Then, the prompt \(p_j\) with the best value \(f(p_{j})\) is chosen as the updated prompt.
-
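Here is a minimal sketch of a single square-attack-style update over a soft prompt, following the three steps above. The objective `f`, the array shapes, and all names are toy assumptions of mine, not the paper's actual implementation; in practice `f` would be the black-box language model loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(prompt: np.ndarray) -> float:
    """Toy objective over the soft prompt (lower is better); stand-in for the LM loss."""
    return float(np.linalg.norm(prompt))

def square_attack_step(prompt: np.ndarray, k: int = 3, scale: float = 0.5) -> np.ndarray:
    """One update: try k single-token replacements and keep the best-scoring candidate."""
    num_tokens, embed_dim = prompt.shape
    # Subset selection: indices of the k candidate tokens to replace.
    indices = rng.choice(num_tokens, size=k, replace=False)
    best_prompt, best_loss = prompt, f(prompt)
    # Sampling values: build k modified prompts, each replacing one candidate token
    # with a random vector of the same size.
    for i in indices:
        candidate = prompt.copy()
        candidate[i] = rng.normal(scale=scale, size=embed_dim)
        # Update prompt: keep the candidate whose objective value is best so far.
        loss = f(candidate)
        if loss < best_loss:
            best_prompt, best_loss = candidate, loss
    return best_prompt

soft_prompt = rng.normal(size=(8, 16))   # 8 prompt tokens, 16-dim embeddings (toy sizes)
soft_prompt = square_attack_step(soft_prompt)
```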
So, after providing this background information, it is much easier and more straightforward to explain the paper. This paper focuses on using adversarial attack techniques to optimize soft prompts in order to improve the performance of foundation models. The main objective of the paper is that the solution should be independent of the model architecture: it treats the foundation model as a black box and can therefore be adapted to other foundation models with different architectures.
The authors proposed their solution for both language models and vision models. However, since I am more familiar with language models than with vision models, I only explain the language-model-related part of the paper.
The process of the soft prompt optimization in this paper has the following stages:
1. Initialize a random candidate word embedding as the discrete, hard prompt.
2. Project the hard prompt into another token embedding space to obtain soft prompts.
3. Use adversarial attack techniques (more specifically, the square attack) to generate several adversarial soft prompts.
4. Convert the generated soft prompts back to hard prompts, i.e., the hard prompts that are closest and most similar to the soft ones (a small sketch of this projection follows the list).
5. Calculate the language model loss for the newly generated hard prompts.
6. Select the prompt that results in the best performance.
7. Repeat steps 2 to 6 with the newly selected prompt.
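Steps 2 and 4 both move between the discrete (hard) and continuous (soft) prompt spaces. Below is a minimal sketch of the nearest-neighbor projection from soft-prompt vectors back to token ids; the embedding table, sizes, and names are toy assumptions of mine rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table standing in for the model's token embeddings (vocab_size x embed_dim).
vocab_size, embed_dim = 1000, 16
embedding_table = rng.normal(size=(vocab_size, embed_dim))

def project_to_hard_prompt(soft_prompt: np.ndarray) -> np.ndarray:
    """Map each soft-prompt vector to the id of its nearest token embedding."""
    # Squared Euclidean distances between every prompt position and every vocabulary token.
    dists = ((soft_prompt[:, None, :] - embedding_table[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # one token id per prompt position

soft_prompt = rng.normal(size=(8, embed_dim))      # 8 prompt tokens
hard_prompt_ids = project_to_hard_prompt(soft_prompt)
print(hard_prompt_ids)
```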