Paper Review - Week 13
Several interesting NLP papers were published this week. Here is a list of them:
Fairness-guided Few-shot Prompting for LLMs
The paper aims to design algorithms that select in-context examples for prompts to enhance the accuracy of language models across various downstream tasks. The central hypothesis is that constructing prompts with less biased in-context examples can improve model accuracy compared to other selection strategies, such as choosing the most similar, most diverse, or random examples. The authors introduce the concept of “predictive bias” as a metric to measure how much a given prompt influences the model’s predictions. Predictive bias refers to how far the model’s predictions, when a given prompt is paired with a content-free input, deviate from a uniform distribution over the labels.
Paper General Overview
In the paper, the concept of predictive bias is quantified using entropy, specifically the entropy of the distribution the model predicts for a content-free input. Here’s why entropy is used to measure predictive bias:
Ideal Uniform Distribution: In an ideal scenario, when a model makes predictions in the absence of meaningful information (also known as a “content-free input”), the predicted probabilities should be roughly equal for all possible outcomes. In other words, the model should be uncertain and not favor any particular outcome. This corresponds to a uniform distribution.
Measuring Deviation from Uniformity: Predictive bias refers to the deviation of the model’s predicted distribution from the ideal uniform distribution. When the model’s predictions are skewed toward certain outcomes, the distribution moves away from uniform and its entropy drops, since the uniform distribution is the one with maximum entropy.
Entropy as a Quantitative Metric: Entropy provides a quantitative measure of this uncertainty. Higher entropy values indicate a prediction closer to the uniform distribution and therefore a lower degree of predictive bias, while lower entropy values mean the prompt concentrates probability on a few outcomes, i.e., a higher degree of predictive bias.
So, by calculating the entropy of the distribution the model predicts when a given prompt is paired with a content-free input, the paper is able to quantify how much the prompt biases the model’s predictions. Higher entropy values imply lower predictive bias, meaning the prompt does not push the model toward particular labels on its own. Lower entropy values indicate higher predictive bias, suggesting that the prompt strongly shapes the model’s predictions in a non-uniform way.
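To make the relationship concrete, here is a minimal sketch (not the authors’ code) of how a prompt’s predictive bias could be scored from the label probabilities the model assigns to a content-free input; the `entropy` and `predictive_bias` helpers and the example distributions are assumptions made purely for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy (in bits) of a predicted label distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def predictive_bias(label_probs):
    """Gap between the maximum possible entropy (the uniform distribution
    over the labels) and the observed entropy; a smaller gap = fairer prompt."""
    return math.log2(len(label_probs)) - entropy(label_probs)

# Hypothetical label distributions predicted for a content-free input
biased_prompt = [0.70, 0.10, 0.10, 0.10]   # prompt pushes the model toward label 0
fair_prompt   = [0.25, 0.25, 0.25, 0.25]   # prompt leaves the model undecided

print(predictive_bias(biased_prompt))  # ~0.64 bits of bias
print(predictive_bias(fair_prompt))    # 0.0 -> perfectly fair
```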
Prompt Selection Strategies
Algorithm 1: T-Fair-Prompting:
- For T-Fair-Prompting, each training sample is individually used as the in-context example and passed to the model together with a content-free input.
- The bias of each sample is measured as the entropy of the resulting prediction. Higher entropy indicates lower bias.
- Samples are sorted by their entropy scores in descending order (less biased first).
- The top-k samples with the highest entropy scores are selected as in-context examples (a small sketch follows this list).
- Computational complexity: \(O(N)\), where \(N\) is the number of training samples.
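Below is a minimal sketch of this Top-k selection, assuming a hypothetical helper `fairness_score(examples)` that returns the entropy of the model’s prediction on a content-free input when `examples` are used as demonstrations (it stands in for an actual model call):

```python
def t_fair_prompting(train_samples, k, fairness_score):
    """T-Fair-Prompting sketch: score every candidate demonstration on its own
    and keep the k individually fairest (highest-entropy) ones."""
    # One model call per training sample -> O(N) calls overall.
    scored = [(fairness_score([sample]), sample) for sample in train_samples]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # fairest first
    return [sample for _, sample in scored[:k]]
```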
Algorithm 2: G-Fair-Prompting:
- G-Fair-Prompting evaluates fairness jointly: at each step it considers combining the already selected in-context examples with each remaining training sample.
- It calculates the fairness score for each such candidate combination and selects the one that most reduces predictive bias.
- It operates iteratively, gradually building the prompt by selecting and adding in-context examples, and stops once adding further examples no longer reduces bias (a sketch of this greedy loop follows the list).
- The approach operates from a local to global perspective, considering individual sample bias initially and then reducing global predictive bias.
- Computational complexity: \(O(N^2)\), where \(N\) is the number of training samples.
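And a rough sketch of the greedy loop described above, again relying on the hypothetical `fairness_score(examples)` helper; the stopping rule here is a simplification and may differ from the paper’s exact criterion:

```python
def g_fair_prompting(train_samples, fairness_score):
    """G-Fair-Prompting sketch: greedily grow the demonstration set with the
    sample that most improves fairness, stopping when no candidate helps."""
    selected, remaining = [], list(train_samples)
    current = fairness_score(selected)  # fairness of the prompt with no demonstrations
    while remaining:
        # Evaluate every remaining sample appended to the current demonstrations:
        # up to N calls per iteration over up to N iterations -> O(N^2) overall.
        best_score, best_sample = max(
            ((fairness_score(selected + [s]), s) for s in remaining),
            key=lambda pair: pair[0],
        )
        if best_score <= current:  # adding more examples no longer reduces bias
            break
        selected.append(best_sample)
        remaining.remove(best_sample)
        current = best_score
    return selected
```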
The selected prompts, with in-context examples chosen to minimize bias, are then used for inference on the downstream tasks, where the model’s accuracy is measured according to each task’s objective. The expectation is that a fairer prompt improves the model’s overall accuracy and effectiveness on these tasks.
How Does Entropy Measure Uncertainty?
This section explains how the entropy metric can be used to measure the model’s confidence.
Entropy as a Measure of Confidence in Language Models:
- Entropy is a concept borrowed from information theory and probability theory.
- In the context of language models, entropy quantifies the uncertainty or disorder in the model’s predicted probability distribution over possible outcomes or classes.
- It serves as a measure of the model’s confidence in its predictions.
- High entropy indicates high uncertainty, while low entropy indicates high confidence in predictions.
Entropy calculation: Here is the entropy formula: \(H = -\sum_i p_i \log_2 p_i\), where \(p_i\) represents the probability assigned to class \(i\).
Example Scenarios: Let us consider the following two scenarios (both are verified in code after this list):
- Scenario 1 (High Confidence):
- In this scenario, a language model predicts the probabilities \([1, 0, 0, 0]\) for four classes.
- Entropy Calculation: \(H = -(1 \cdot \log_2 1 + 0 + 0 + 0) = 0\), since \(\log_2 1 = 0\) and each \(0 \cdot \log_2 0\) term is taken to be \(0\) by convention.
- Interpretation: The entropy of 0 indicates that the model is highly confident and certain that the input belongs to the first class, as it assigns a probability of 1 to that class and 0 to all others.
- Scenario 2 (Low Confidence):
- In this scenario, a language model predicts the probabilities \([0.25, 0.25, 0.25, 0.25]\) for four classes.
- Entropy Calculation: \(H = -\left(4 \times 0.25 \log_2 0.25\right) = -\left(4 \times 0.25 \times (-2)\right) = 2\)
- Interpretation: The entropy of 2 indicates that the model is very uncertain and evenly distributes its probabilities across all four classes. It lacks confidence in its prediction and is unsure about the correct class.
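Both numbers can be checked with a few lines of Python, using the convention that \(0 \cdot \log_2 0 = 0\):

```python
import math

def entropy(probs):
    # Terms with p = 0 are skipped, following the 0 * log2(0) = 0 convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 -> maximally confident
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 -> maximally uncertain over 4 classes
```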
Use in Decision-Making:
- In classification tasks, entropy can be used as a criterion for decision-making.
- Higher entropy suggests that the model is uncertain, and additional information may be needed for confident decisions.
- Lower entropy indicates higher confidence in the model’s prediction (a small thresholding example follows this list).
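As a small illustration (not taken from the paper), an entropy threshold could be used to decide whether to trust a prediction or defer; the 1-bit threshold below is arbitrary:

```python
import math

def decide(probs, max_entropy_bits=1.0):
    """Return the predicted class index only when the model is confident
    enough (entropy below the threshold); otherwise defer the decision."""
    h = -sum(p * math.log2(p) for p in probs if p > 0)
    if h <= max_entropy_bits:
        return max(range(len(probs)), key=probs.__getitem__)  # confident prediction
    return None  # too uncertain: gather more information before deciding

print(decide([0.9, 0.05, 0.03, 0.02]))  # 0    (low entropy, accept)
print(decide([0.3, 0.3, 0.2, 0.2]))     # None (high entropy, defer)
```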
In summary, entropy is a valuable metric for assessing the confidence of a language model’s predictions. It quantifies how spread out or concentrated the predicted probabilities are, with higher entropy indicating greater uncertainty and lower confidence, while lower entropy indicates higher confidence in the model’s predictions.
Experiments and Results
Here are the key points from the experimental results:
- Model Comparison: The experiments compare different language models, including BLOOM and various sizes of LLaMA models (e.g., 65B). The choice of LLaMA models is due to API access restrictions for GPT-3.
- Datasets: The paper uses several text classification datasets, such as SST-2, AGNews, CoLA, TREC, and RTE. The RTE dataset has sentences that are too long for LLaMA, given its maximum input length of 512 tokens.
- Performance Metrics: The primary metric for evaluating the strategies is accuracy on the downstream classification tasks.
- Comparison Strategies: The paper compares its approach, G-fair-Prompting (Greedy), with two existing strategies: the diversity-guided strategy (Global view) and the similarity-guided strategy (Local view). These strategies select demonstrations from the training set based on diversity or similarity criteria.
- Performance Findings:
  - G-fair-Prompting Approximates Enumeration: G-fair-Prompting achieves results that closely approximate the performance of enumerating all possible candidates, indicating that it is effective in finding high-quality prompts.
  - Outperformance of T-fair-Prompting: G-fair-Prompting consistently outperforms T-fair-Prompting, the other strategy proposed in the paper, demonstrating that its greedy search is more effective in improving prompt quality.
  - Importance of the Number of Demonstrations: Selecting a smaller number of demonstrations (Top-2) can significantly outperform selecting more demonstrations (Top-4) in most cases, suggesting that the number of demonstrations chosen plays a crucial role in prompt quality.

Overall, the results demonstrate that G-fair-Prompting is an effective approach for selecting demonstrations and improving prompt quality across different language models and datasets, achieving performance close to enumeration at a lower computational cost.
Here are some more articles relevant to this one: