Here is a list of the most interesting papers published this week:


Chain-of-Verification Reduces Hallucination in LLMs

Introduction

In the world of Large Language Models (LLMs), generating accurate responses is a challenge, especially when it comes to lesser-known facts. Despite being trained on massive amounts of data, models often produce plausible but incorrect information, known as hallucination. This paper tackles the issue head-on by introducing the Chain-of-Verification (CoVe) method. CoVe is all about making language models think twice before they speak. It guides the model through a four-step process: first, it generates a response; then it plans and executes verification questions to fact-check itself; and finally, it refines the initial response based on the verification results. The goal is to reduce hallucinations and make language models more reliable.

Research Gap

The authors address the challenge of hallucination in large language models (LLMs). Hallucination refers to the generation of plausible yet factually incorrect information by language models. The paper identifies this as an unsolved problem for LLMs, particularly when dealing with lesser-known facts or facts that occur relatively rarely in training corpora.

Solution: CoVe

The CoVe (Chain-of-Verification) method consists of four main steps aimed at reducing hallucinations in LLMs. Each step is performed by prompting the same language model in a different way to obtain the desired response for that stage. The overall goal of CoVe is to leverage the model’s ability to reason about and fact-check its own responses, yielding a final verified response that is more accurate and reliable than the initial one.

Here’s a brief explanation of each step; a short prompting sketch follows the list:

  1. Generate Baseline Response: The process begins by having the language model generate an initial response to a given query. This initial response serves as the baseline that the method aims to improve upon.

  2. Plan Verifications: Conditioned on both the original query and the baseline response, the model is prompted to generate a series of verification questions. These questions are designed to test the factual claims made in the baseline response. The language model is free to phrase these questions in any form, and they need not match the wording of the original text.

  3. Execute Verifications: The planned verification questions are then answered independently. Instead of relying on external tools, the language model itself is used to check its own work. The goal is to assess whether there are any inconsistencies or mistakes in the original response by systematically answering the verification questions.

  4. Generate Final Verified Response: Based on the answers to the verification questions and any identified inconsistencies, a final improved response is generated. This step incorporates the results of the verification process, aiming to produce a response that is more accurate and less prone to hallucination than the baseline generated in the first step.
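
To make these four steps concrete, here is a minimal Python sketch of the pipeline. It is only an illustration under some assumptions: the `llm` helper stands in for any call to a language model, and the prompt wording is invented rather than taken from the paper’s templates.

```python
def llm(prompt: str) -> str:
    """Placeholder for an actual model call (an API or a local model)."""
    raise NotImplementedError


def chain_of_verification(query: str) -> str:
    # 1. Generate Baseline Response
    baseline = llm(f"Answer the question.\nQuestion: {query}\nAnswer:")

    # 2. Plan Verifications: questions that fact-check the claims in the baseline
    plan = llm(
        "Write one verification question per line that checks the factual claims "
        f"in this answer.\nQuestion: {query}\nAnswer: {baseline}\nVerification questions:"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]

    # 3. Execute Verifications: answer each question independently, without
    #    showing the baseline answer (the "factored" setup described below)
    verifications = [
        (q, llm(f"Answer concisely.\nQuestion: {q}\nAnswer:")) for q in questions
    ]

    # 4. Generate Final Verified Response, conditioned on the verification results
    checks = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return llm(
        f"Original question: {query}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{checks}\n"
        "Rewrite the draft answer, correcting anything the verifications contradict:"
    )
```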

The Execute Verifications step in the CoVe method involves answering the planned verification questions. The paper explores different variants of this step to further improve the overall performance. Here’s a brief explanation of each variant:

  1. Joint: In the joint method, the planning and execution of verification questions are accomplished using a single prompt. The few-shot demonstrations provided to the model include both the verification questions and their corresponding answers immediately after the questions. This approach aims to streamline the process but has potential downsides, as the verification answers may condition on the initial response, possibly leading to repetition of hallucinations.

  2. Two-Step: The two-step approach separates the planning and execution into distinct steps, each with its own prompt. The planning prompt conditions on the baseline response, generating the verification questions. The answers to these questions are then obtained in a separate execution step, where the context given to the prompt contains only the questions and not the original baseline response. This separation aims to avoid repetition and potential biases introduced by the baseline response.

  3. Factored: The factored approach involves answering each verification question independently as separate prompts. The key distinction is that these prompts do not contain the original baseline response, reducing the likelihood of simply copying or repeating hallucinations. This method provides the advantage of removing potential interference not only from the baseline response but also between answer contexts.

  4. Factor+Revise: The factor+revise approach introduces an extra step after answering the verification questions. Here, the CoVe pipeline explicitly cross-checks whether the verification answers indicate an inconsistency with the original response. This is executed as a deliberate step via an additional prompt. Unlike the verification questions and answers, this phase needs to condition on both the baseline response and the verification question-and-answer pairs (a sketch of this cross-check follows below).

These variants offer different ways to structure the execution of verification questions, with the aim of improving the correctness of the overall response and reducing hallucinations. The paper investigates and compares these variants across various tasks to assess their effectiveness in mitigating inaccuracies.
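
As an illustration of how Factor+Revise differs from the plain factored setup, here is a hedged sketch of the extra cross-check step. It reuses the hypothetical `llm` helper from the earlier snippet, and the CONSISTENT/INCONSISTENT verdict format is an assumption for illustration, not the paper’s exact prompt.

```python
def cross_check(baseline: str, verifications: list[tuple[str, str]]) -> list[str]:
    """Flag verification answers that contradict the baseline response,
    checking one (question, answer) pair per prompt."""
    inconsistencies = []
    for question, answer in verifications:
        verdict = llm(
            f"Draft answer: {baseline}\n"
            f"Verification question: {question}\n"
            f"Verification answer: {answer}\n"
            "Does the verification answer contradict the draft? "
            "Reply CONSISTENT or INCONSISTENT, then explain briefly:"
        )
        if verdict.strip().upper().startswith("INCONSISTENT"):
            inconsistencies.append(verdict)
    return inconsistencies
```

The flagged inconsistencies can then be passed, together with the baseline and the verification Q&A pairs, into the final revision prompt.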

Experiments

The primary aspects of the experimental setup are as follows:

  • Tasks:
    1. Wikidata:
      • Task: List-based questions generated using the Wikidata API.
      • Example: “Who are some [Profession]s who were born in [City]?”
    2. Wiki-Category List:
      • Task: Set-generation task using the QUEST dataset created from Wikipedia Category lists.
      • Example: “Name some Mexican animated horror films.”
    3. MultiSpanQA:
      • Task: Reading comprehension benchmark with questions having multiple independent answers.
      • Example: “Who invented the first printing press and in what year?”
    4. Longform Generation of Biographies:
      • Task: Generating biographies using the prompt “Tell me a bio of <entity>.”
  • Dataset: The experiments use specific datasets for each task, including automatically generated questions from the Wikidata API, the QUEST dataset derived from Wikipedia Category lists, and benchmarks like MultiSpanQA for reading comprehension.
  • Baselines: Llama 65B and Llama 2 (Instruction Fine-Tuned)
  • Evaluation Metrics:
    • Wikidata and Wiki-Category List:
      • Precision (micro-averaged) for measuring performance on list-based questions (a small worked example follows this list).
    • MultiSpanQA:
      • F1 score for closed-book question-answering tasks.
    • Longform Generation of Biographies:
      • FACTSCORE metric, which uses retrieval-augmented language models for fact-checking responses.
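
On the list-question metric: micro-averaged precision pools the model’s predicted entities across all questions and divides the number of correct items by the total number of predictions. The sketch below shows a generic version of this computation; the paper’s exact bookkeeping may differ, and the gold/predicted lists are made up for illustration.

```python
# Micro-averaged precision over list answers: pool predictions across all
# questions, then divide correct items by total predicted items.
def micro_precision(gold: list[set[str]], predicted: list[list[str]]) -> float:
    correct = sum(sum(item in g for item in preds) for g, preds in zip(gold, predicted))
    total = sum(len(preds) for preds in predicted)
    return correct / total if total else 0.0

# Made-up example: 2 correct predictions out of 3 total -> precision ≈ 0.67
gold = [{"Ada Lovelace", "Alan Turing"}, {"Frida Kahlo"}]
predicted = [["Ada Lovelace", "Grace Hopper"], ["Frida Kahlo"]]
print(micro_precision(gold, predicted))
```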

Conclusion

This paper takes on a persistent problem in language models: hallucination. By introducing deliberate reasoning steps, CoVe offers a systematic approach for language models to assess and correct their own responses. The experiments conducted across diverse language tasks, including list-based questions, closed-book QA, and longform text generation, showcase the effectiveness of CoVe in mitigating hallucinations. The exploration of variants such as joint, two-step, factored, and factor+revise provides valuable insight into the nuances of the verification process. Notably, CoVe demonstrates substantial improvements over baseline responses, outperforming existing methods. The findings underscore the significance of incorporating reasoning mechanisms into LLMs to enhance their accuracy, paving the way for further advances in mitigating hallucination and building more reliable language models.