Here is a list of the most interesting papers published this week:


LLM-Rec: Personalized Recommendation via Prompting Large Language Models

The main focus of this paper is to explore how various prompting approaches can improve recommendation quality. Consider a recommendation system that receives an item embedding and a user embedding as input and, as output, determines whether or not the item should be recommended to the user.

The core architecture of LLM-Rec is a two-layer Multi-Layer Perceptron (MLP), which receives the item description as input. We will delve into how item descriptions are generated shortly. The item description is passed through the MLP, and the output of its last layer is taken as the final item embedding. The next step is to compute the dot product between the final item embedding and the user ID embedding (the paper does not clarify how the user ID embedding is constructed). This product is then passed through a sigmoid function, producing the final relevance score between the user and the item. Notably, a binary cross-entropy loss is used for training the MLP, treating each user-item interaction as a binary label.
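
To make this concrete, below is a minimal sketch of the scoring head in PyTorch. The embedding dimension, the variable names, and the assumption that the item description is first encoded into a fixed-size text embedding are mine, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class LLMRecScorer(nn.Module):
    """Two-layer MLP over the encoded item description, dot product with a
    user ID embedding, and a sigmoid to produce a relevance score."""

    def __init__(self, text_dim: int, num_users: int, embed_dim: int = 128):
        super().__init__()
        self.item_mlp = nn.Sequential(
            nn.Linear(text_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        # The paper does not specify how the user ID embedding is built;
        # here it is simply a learnable lookup table.
        self.user_emb = nn.Embedding(num_users, embed_dim)

    def forward(self, item_text_emb: torch.Tensor, user_ids: torch.Tensor) -> torch.Tensor:
        item_emb = self.item_mlp(item_text_emb)      # (batch, embed_dim) final item embedding
        user_emb = self.user_emb(user_ids)           # (batch, embed_dim)
        logits = (item_emb * user_emb).sum(dim=-1)   # dot product per user-item pair
        return torch.sigmoid(logits)                 # relevance score in (0, 1)

# Training would use binary cross-entropy over observed interactions, e.g.:
# loss = nn.BCELoss()(model(item_text_emb, user_ids), labels)
```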

Now, let us focus on how the item descriptions are created. In some cases, as in the MovieLens benchmark, a language model such as GPT-3 is first used to obtain a more detailed description of each item; for MovieLens, GPT-3 is instructed to produce a summary of each movie. Once the detailed item description is generated, various prompting approaches are employed to enrich it further. These prompting approaches are:

  1. Basic Prompting: The basic prompting approach comes in three variants, all of which share the common characteristic of not providing additional information to the language model.
    • \(p_{para}\): The language model is instructed to paraphrase item descriptions. For example, the prompt can be: The description of an item is as follows: {DESCRIPTION}. Paraphrase it!
    • \(p_{tag}\): The language model is instructed to summarize item descriptions using tags. For example: The description of an item is as follows: {DESCRIPTION}. Summarize it with tags.
    • \(p_{infer}\): The language model is instructed to infer the characteristics of the provided item description. For instance: The description of an item is as follows: {DESCRIPTION}. What kind of emotions can it evoke?
  2. Recommendation-driven Prompting: In this approach, a recommendation-driven instruction is added to the basic prompts. Depending on which basic prompting type is used, the recommendation-driven prompts are denoted by \(p^{rec}_{para}\), \(p^{rec}_{tag}\), and \(p^{rec}_{infer}\).
    • \(p^{rec}_{para}\): An example can be: The description of an item is as follows: {DESCRIPTION}. What else should I say if I want to recommend it to others?
    • \(p^{rec}_{tag}\): An example can be: The description of an item is as follows: {DESCRIPTION}. What tags should I use if I want to recommend it to others?
    • \(p^{rec}_{infer}\): An example can be: The description of an item is as follows: {DESCRIPTION}. Recommend it to others with a focus on the emotions it can evoke.
  3. Engagement-guided Prompting: This type of prompting, denoted by \(p^{eng}\), takes the \(T\) most important neighbor items into account. In other words, the \(T\) most important neighbor items are first identified, using Personalized PageRank (PPR) as the importance measure. Let \(d_{target}\) denote the description of the target item (the item we want to decide whether to recommend to a specific user or not), and let \(d_1, d_2, \ldots, d_T\) denote the descriptions of the \(T\) most important neighbor items. An example of this prompting approach is (see the sketch after this list for how the prompts and neighbors can be constructed):
    Summarize the commonalities among the following descriptions: \(d_{target}, d_1, d_2, \ldots, d_T\)
    
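The following sketch shows how the three prompt families could be assembled in Python. The templates mirror the examples above; the Personalized PageRank neighbor selection uses networkx on a hypothetical item-item engagement graph, which is my approximation of the paper's setup rather than the authors' exact implementation.

```python
import networkx as nx

# Prompt templates mirroring the examples in the text ({d} placeholders are mine).
BASIC = {
    "para":  "The description of an item is as follows: {d}. Paraphrase it!",
    "tag":   "The description of an item is as follows: {d}. Summarize it with tags.",
    "infer": "The description of an item is as follows: {d}. What kind of emotions can it evoke?",
}
REC_DRIVEN = {
    "para":  "The description of an item is as follows: {d}. What else should I say if I want to recommend it to others?",
    "tag":   "The description of an item is as follows: {d}. What tags should I use if I want to recommend it to others?",
    "infer": "The description of an item is as follows: {d}. Recommend it to others with a focus on the emotions it can evoke.",
}

def build_prompt(kind: str, description: str, recommendation_driven: bool = False) -> str:
    """Basic or recommendation-driven prompting for a single item description."""
    templates = REC_DRIVEN if recommendation_driven else BASIC
    return templates[kind].format(d=description)

def engagement_prompt(graph: nx.Graph, target, descriptions: dict, T: int = 3) -> str:
    """Engagement-guided prompting: pick the T most important neighbors of `target`
    via Personalized PageRank and ask for the commonalities."""
    personalization = {n: (1.0 if n == target else 0.0) for n in graph}
    ppr = nx.pagerank(graph, personalization=personalization)
    neighbors = sorted(graph.neighbors(target), key=lambda n: ppr[n], reverse=True)[:T]
    docs = [descriptions[target]] + [descriptions[n] for n in neighbors]
    return "Summarize the commonalities among the following descriptions: " + " ".join(docs)
```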

To evaluate their approach, the authors use two datasets:

  1. MovieLens-1M, which contains user ratings for movies. In MovieLens, the available movie data only includes the title and genre. However, LLM-Rec uses more detailed movie information: the movie title is passed to a language model, specifically GPT-3, which is asked to provide a one-sentence summary of the movie. So, instead of only the title and genre, this more detailed summary is used as input (a sketch of this step appears after this list). The prompt given to GPT-3 is:
    Summarize the movie [title] with one sentence. The answer cannot include the movie title.
    
  2. Recipe, which contains the details and reviews of different recipes extracted from Food.com. The details of each recipe include ratings, reviews, recipe names, descriptions, and ingredients.
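
Below is a hedged sketch of the description-generation step for MovieLens. The paper uses GPT-3; this example calls the current OpenAI chat API with a placeholder model name, so treat it as an approximation of the pipeline rather than the authors' exact code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_movie(title: str) -> str:
    """Generate a one-sentence, title-free summary of a movie, as in the prompt above."""
    prompt = (
        f"Summarize the movie {title} with one sentence. "
        "The answer cannot include the movie title."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name; the paper used GPT-3
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```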

The evaluation metrics used to measure performance are as follows (a short computation sketch appears after the list):

  • Precision@K: Precision measures the accuracy of the positive predictions made by a recommendation system. Precision@K evaluates the precision of the top \(K\) recommendations: it calculates the ratio of relevant items among the top \(K\) recommended items. Higher Precision@K indicates that a higher proportion of the recommended items are relevant to the user.
  • Recall@K: Recall measures the ability of a recommendation system to find all relevant items. Recall@K evaluates the recall of the top \(K\) recommendations. It calculates the ratio of relevant items found among the top \(K\) recommendations. Higher Recall@K indicates that a higher proportion of all relevant items were included in the top \(K\) recommendations.
  • NDCG@K (Normalized Discounted Cumulative Gain at \(K\)): NDCG is a metric that considers both the relevance of recommended items and their ranking. NDCG@K evaluates the quality of recommendations at a specific cutoff point \(K\). It assigns higher scores to highly relevant items that are ranked higher in the recommendation list. It takes into account the diminishing returns of relevance for items further down the list, normalizing the score between 0 and 1. Higher NDCG@K indicates better-ordered recommendations with more relevant items at the top of the list.
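
For reference, here is a minimal sketch of the three metrics for a single user, assuming binary relevance; `recommended` is the ranked list of recommended item IDs and `relevant` is the set of ground-truth items (both names are mine).

```python
import math

def precision_at_k(recommended, relevant, k):
    # Fraction of the top-k recommendations that are relevant.
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def recall_at_k(recommended, relevant, k):
    # Fraction of all relevant items that appear in the top-k recommendations.
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / max(len(relevant), 1)

def ndcg_at_k(recommended, relevant, k):
    # Discounted gain for each relevant hit, normalized by the ideal ordering.
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, hence +2
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```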

My Comments: This summary is for the second version of this paper, published on August 16. In this version, the results are presented in Table 2 and Figure 4. However, it is not clear which prompting strategy is used for Table 2. According to the numbers in Table 2, LLM-Rec outperforms the other baselines on both MovieLens and Recipe. However, Figure 4 reports performance per prompting strategy, and its numbers seem inconsistent with those in Table 2. According to Figure 4, LLM-Rec does not seem to outperform two of the baselines (AutoInt and DCN-V2) on Recipe. Furthermore, the improvement on MovieLens, as indicated in Figure 4, does not appear to be significant.