Paper Review - Week 6
Several interesting NLP papers were published this week. Here is a list of them:
- Toolformer: Language Models Can Teach Themselves to Use Tools
- Offsite-Tuning: Transfer Learning without Full Model
Toolformer
This paper is really interesting because it shows that language models, such as GPT-3, can teach themselves to use external tools to perform arithmetic operations, such as addition, subtraction, multiplication, and division, with a high level of accuracy. While probabilistic methods are often used to generate answers to questions, some questions have a definite answer that should be calculated rather than generated probabilistically. In such cases, a language model that can produce exact answers is beneficial.
The idea in Toolformer is to leverage external tools (e.g., a calculator, Q&A systems, search engines, translators, a calendar) to enhance the model’s ability to perform basic tasks, such as arithmetic operations and factual lookups. To train a language model for this purpose, a self-supervised approach is followed. First, given a corpus, the training dataset has to be created. To do so, each sentence of the corpus is modified and augmented by injecting API calls. Among all the available positions in a sentence, the API calls are added where the probability of an API call occurring is highest, based on the predictions of a language model such as GPT-J. Then, the response of the API call is added right after it. But how do we make sure that this sentence is good enough to be added to the training dataset? To ensure that the sentence is good enough, they calculate two loss functions for the augmented sentence. But what are the two loss functions?
- The augmented sentence (the API call together with its response) is passed to GPT-J to generate the output. Then, the cross-entropy of the generated output against the original sentence is calculated. This cross-entropy shows how close the model’s output is to the original version: the lower the loss, the more similar they are, and the better. We use \(loss_1\) to refer to this loss value.
- The original sentence (without any API call) and the augmented sentence without the API response are each passed to the GPT-J model. The cross-entropy of both generated outputs against the original sentence is calculated, and the minimum of the two values is taken as the second loss. We use \(loss_2\) to refer to this loss value.
After calculating \(loss_1\) and \(loss_2\), if \(loss_2 - loss_1\) is greater than a specific threshold, it means that adding the API call helps the model generate a sentence similar to the original one. So, in this case, the augmented sentence is added to the training dataset, as sketched in the example below.
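To make this filtering step concrete, here is a minimal sketch of how \(loss_1\), \(loss_2\), and the threshold check could be computed with a Hugging Face causal LM. This is not the authors’ released code: the model name, the bracketed API-call format, the example sentence, and the threshold `tau` are all illustrative assumptions.

```python
# Minimal sketch of the Toolformer-style filtering criterion (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"  # any causal LM works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def continuation_loss(prefix: str, continuation: str) -> float:
    """Cross-entropy of `continuation` given `prefix`, averaged over the continuation tokens."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100  # ignore the prefix tokens in the loss
    with torch.no_grad():
        out = model(input_ids, labels=labels)
    return out.loss.item()

# Hypothetical sentence and its augmented variants (the API-call format is illustrative).
continuation = " so roughly 29% of the seats were taken."
plain = "Out of 1400 seats, 400 were sold,"                                               # no API call
call_only = "Out of 1400 seats, 400 were sold [Calculator(400 / 1400)],"                  # call, no response
call_and_response = "Out of 1400 seats, 400 were sold [Calculator(400 / 1400) -> 0.29],"  # call + response

loss_1 = continuation_loss(call_and_response, continuation)
loss_2 = min(continuation_loss(plain, continuation),
             continuation_loss(call_only, continuation))

tau = 1.0  # hypothetical threshold
keep_augmented_sentence = (loss_2 - loss_1) > tau  # keep only if the API call actually helps
```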
Offsite-Tuning
This paper focuses on a very interesting problem. Large Language Models (LLMs) have recently become increasingly popular because of their impressive performance on different tasks. However, since these models are very large and require significant computational power for training or even fine-tuning, users cannot simply download the model weights onto their own systems to adapt them to their data and task. Apart from that, in some cases the model owners are reluctant to share their models publicly, such as GPT-3 from OpenAI. The approach currently used by model owners and data owners is that the data owner sends the data to the model owner; the model owner then fine-tunes the model on that data and gives the data owner access to the fine-tuned model. This process violates the privacy of data owners, because they need to send their data directly to the model owners.
Offsite-tuning provides a solution to this problem. First of all, let us make a list of the requirements that such a solution needs to satisfy:
- The solution should not require the model owner to share the model weights.
- The solution should not require data owners to share their data directly with the model owner.
- The solution should result in better performance than relying on the model’s zero-shot reasoning capability.
Keeping these requirements in mind, the approach in offsite-tuning can be summarized as follows:
- The LM can be separated into two parts, the concatenation of which makes up the whole language model:
- A frozen part: in this part, all the weights are fixed. No matter what the data or the task is, the weights in this part always stay frozen.
- An adaptable part: this part is fine-tuned based on the data, so it can vary for each data owner.
- According to this separation, the model owner sends two things to data owners:
- A compressed version of the frozen part, named the emulator.
- The initialized parameters of the adaptable part, named the adapter.
- Data owners can use the concatenation of the emulator and the adapter to train the adapter part on their own data.
- After training the adapter, the data owners can send the trained adapter back to the model owner. The model owner can then create the fine-tuned model by concatenating the frozen part and the trained adapter, and provides the data owner with access to this newly-created model (a minimal code sketch of this split-and-plug-in flow follows this list).
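Assuming the backbone is a decoder-only transformer whose blocks are available as an `nn.ModuleList`, the following minimal sketch illustrates the split and the plug-in flow described above. The number of adapter layers, the plain layer dropping used to build the emulator, and the function name are illustrative assumptions; the actual paper additionally distills the emulator and places adapter layers at both ends of the network.

```python
# Simplified sketch of the offsite-tuning split (illustrative, not the paper's exact recipe).
import copy
import torch.nn as nn

def make_offsite_parts(blocks: nn.ModuleList, num_adapter_layers: int = 2, keep_every: int = 2):
    """Split transformer blocks into (frozen part, emulator, adapter)."""
    frozen = blocks[:-num_adapter_layers]    # stays with the model owner, never updated
    adapter = blocks[-num_adapter_layers:]   # initial adapter weights, sent to the data owner

    # Emulator: a compressed stand-in for the frozen part (here: keep every k-th layer).
    emulator = nn.ModuleList(copy.deepcopy(layer) for layer in frozen[::keep_every])
    for p in emulator.parameters():
        p.requires_grad = False              # the emulator itself is never fine-tuned

    return frozen, emulator, adapter

# Data-owner side: fine-tune `adapter` on top of the frozen `emulator`, using private data.
# Model-owner side (plug-in): concatenate the original `frozen` part with the returned, trained
# `adapter` to obtain the fine-tuned model, without ever seeing the data owner's data.
```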
Accordingly, the three requirements mentioned above can be formulated as follows:
- The emulator’s performance should be significantly lower than the full model’s performance. In this way, we can ensure that the model owner would not be concerned about sharing the emulator with users.
- The emulator and the adapter should be lightweight enough to be fine-tuned by data owners.
- Using the newly-created model (provided by the model owner) should result in better performance than using the original model in its zero-shot mode.
In general, this is a very novel approach to addressing the data-privacy problem in fine-tuning LLMs.