SQL Query Recommendation Systems

2009: Recommending Join Queries via Query Log Analysis; In this pioneering paper, the authors designed a system for recommending join predicates—including join tables and join conditions. The sources of such recommendations are query logs and database schema. This system receives two inputs; input specification—the tables used in WHERE, and output specification—attributes in SELECT clauses. Then, given these two specifications as the system's input, the system generates a join query graph based on which it recommends join predicates. Therefore, the output of the system is a join query graph. The authors evaluated their work on a dataset named 'AT&T Proprietary Database,' which is not publicly available. The evaluation metric is the accuracy of the suggested join tables.; However, this work has several critical limitations. First, the recommendations are based on a static join query graph that does not update by the addition of new data to the database. Second, it only recommends the join tables and conditions without focusing on the query structure or any other query parts. Third, it is neither session-aware nor sequence-aware.; Interactive Query Refinement; The main focus of this paper is to address the many/few answers problem. This problem states that, while submitting queries to databases, for some queries, too many or very few tuples are returned. In such situations, the proposed innovative model aids users in refining their queries to return a reasonable number of tuples. Specifically, this model receives a query as input and returns a refined set of recommendations for adjusting the ranges of WHERE attributes. The recommendations are according to the database schema and data. However, the challenge is that multiple ways exist for narrowing down or expanding the ranges of WHERE attributes. Thus, in this work, the policy used for choosing a proper way to refine the ranges is to interact with users.; The limitations of this work can be outlined as follows: first, it may require resubmitting queries, which can cause a heavy workload for most databases. Second, it only recommends the ranges of WHERE attributes without considering other aspects of the query structure. Third, the refinement process heavily relies on user involvement. Fourth, it does not take user sessions or the sequence of queries in each session into account for recommendations.
2010: SnipSuggest: Context-aware autocompletion for SQL; SnipSuggest is a context-aware auto-completion system for SQL queries. In this case, context-aware denotes that the suggestions given by the system depend on the query written thus far. The main approach in this insightful work is as follows; first, a Directed Acyclic Graph (DAG) is created based on the log of submitted SQL queries, which is called workload DAG. Subsequently, depending on what the user is typing, the nodes of the workload DAG are ranked according to the probability of their appearance in the continuation of the query. Finally, the most probable nodes are provided as suggested fragments to the user. The recommendations are generated based on query log files. This model is evaluated on Sloan Digital Sky Survey (SDSS) dataset. The evaluation metric is the accuracy of the recommended fragments.; The limitations of this approach can be summarized as follows: first, the recommendations rely on a static DAG that is created only once and not updated as new data is added to the database. Second, the system does not consider user sessions or the sequence of queries within each session for generating recommendations.
2011: Interactive SQL Query Suggestion: Making Databases User-friendly; In this paper, the authors proposed an SQL recommendation system, named SQLSugg, which takes the current partial query written thus far as input; then, as output, the system recommends the fragments for the same query. The suggestions are based on database schema and data. The approach has two steps: (a) an offline step, in which two sets of graphs—schema and data graphs, known as templates—are created and indexed. (b) an online step in which users type the query keywords. Then, the keywords are mapped to database attributes. Based on the matching between keywords and attributes, the most relevant schema graphs are selected. These selected graphs are ranked based on the number and relevance of the matched data graphs. The model is evaluated on two datasets: (1) DBLP, a dataset of publication records, and (2) DBLife, a dataset of activity information of top people in the database community. To evaluate their approach, they asked experts to score the relevance of their suggestions.; However, this approach has several critical limitations, including its reliance on having a static database schema and data, failure to consider user sessions, and neglect of the sequence of queries in each session.
2013: QueRIE: Collaborative Database Exploration; In this paper, the authors proposed a model, named QueRIE, which takes a full query from users as input. Then, as output, it selects the most probable query in the log file and returns it as the recommended next query. This work is evaluated on the SDSS dataset. Notably, this work considers the concept of user sessions; to do so, it leverages the queries within each session to create a vector representation for each session. Then, by comparing these vectors, it can identify similar user sessions.; The limitation of this work is that it does not consider the sequence of queries in each user session for recommendations
2016: Cluster-driven Navigation of the Query Space; Blaeu is an interactive system that facilitates data exploration and refinement. As input, Blaeu takes the users' initial query and clusters the data based on the result set of the query. The system then presents interactive cluster maps to users, allowing them to navigate and zoom into areas of interest. Moreover, Blaeu provides users with a query that can be used for selecting that area of data. The authors proposed several algorithms for creating the data clusters and evaluated their accuracy on two datasets: (1) US Bureau of Transportation Statistics, which describes delays of US internal flights during January 2010. (2) Hollywood films, which describes a few economic indicators for 785 movies released between 2007 and 2012.; The limitations of this work are: first, generating data maps requires resubmitting queries, which can be computationally expensive for large databases. Second, users need to have a high level of involvement in the process. Third, it does not consider user sessions and sequence of queries for recommendation.
2019: ExplIQuE: Interactive Databases Exploration with SQL; ExplIQuE is a framework for query refinement recommendations to assist users in improving their queries. The model takes a full query as input and produces clustered result sets, accompanied by the WHERE clauses needed to retrieve each cluster. The recommendations are based on data records rather than logs and database schema. The system is evaluated on a dataset for bacteria growth on solid plates.; The limitations of this work can be summarized as follows: the approach presented is only applicable when the database schema and data are static and do not change. Furthermore, the recommendation system does not consider user sessions or the sequence of queries in each session for recommendations.
2021: PyExplore: Query Recommendations for Data Exploration without Query Logs; PyExplore is a framework that takes users' initial query with WHERE clause. Then, as output, it recommends a query with refined WHERE clause based on the data. The method involves measuring the correlation between attributes in the database and dividing them into several groups. Each group is then represented by one attribute, and a decision tree is created per representative attribute. Each node in each of these decision trees defines the split point of the corresponding attribute at that level. Then, the data rows of the database are clustered based on the leaves of the created decision tree. Thus, when users type an input query, the input query is mapped to a node of the decision tree. By navigating from the mapped node to the root, sample data is selected for each node and recommended to the user. The framework is evaluated on several datasets: CORDIS, SDSS, Movies, Car Sales (IBM), Intel Lab Data. For measuring the performance, expert users are asked to score the recommendations.; The limitations of this work include the assumption that the database schema and data are static and never change, the absence of user session and sequence information for recommendation.; Scalable and Data-aware SQL Query Recommendations; In this inspiring paper, the authors provide a data-aware query recommendation system, called DASQR, capable of suggesting complete queries, query templates, and query fragments. The system is considered data-aware as it takes into account actual data values when recommending filtering conditions and predicates. In this work, the methods used for query representation are (1) feature-based, (2) tuple-based, and (3) access-area-based. The similarity between queries is evaluated based on the utilized representation approach. In case of using the first and second representation approach, cosine similarity is used as the metric. In case of using the third representation approach, two metrics—named overlap and closeness—are proposed for evaluation.; The limitation of this work is that it does not consider the sequence of queries in each user session for recommendations.
2023: 1) Sequence-Aware Query Recommendation Using Deep Learning; 2) Workload-Aware Query Recommendation Using Deep Learning; In these insightful papers, an innovative recommendation system is introduced, which takes a full query as input and recommends the templates and fragments for the next query as output. This system is capable of providing sequence-aware and session-based suggestions at both the query fragment and query template levels. For the query fragment level, they exploited several models based on the encoder-decoder architecture—such as Sequence-to-Sequence (Seq2Seq) CNN, Seq2Seq RNN, and transformers—to predict the next query; then, the predicted query is parsed to extract its fragments. Also, for the query template level, the same Seq2Seq models are used as a classification task. In both levels, one-hot encoding is leveraged as the way of query vectorization. They have evaluated their methods on two open-source datasets—SDSS and SQLShare. The evaluation metrics for fragment prediction are precision—which equals the number of correct fragment predictions over the number of total fragment predictions, and recall, which equals the number of correct fragment predictions over the number of total target fragments. The evaluation metric for template prediction is prediction accuracy. A noteworthy advantage of this system is that it is session-aware and sequence-aware.; However, a critical limitation of this work is the utilization of one-hot encoding for embedding queries, which does not consider word similarities in queries and consequently leads to reduced performance.