Algorithms For Building Compact Representatives And Processing Ranking Queries
Abstract
Ranked retrieval model has rapidly replaced the traditional Boolean retrieval model
as the de facto way for query processing when a large portion of (big) data matches
a given query. Returning all the query results in these cases is not efficient nor
informative. Unlike the Boolean retrieval model, the ranked retrieval model orders
the matching tuples according to an often proprietary ranking function and returns
the top-k of them. In this dissertation, we study ranked retrieval model and propose
exact and approximate algorithms for (i) building representatives for fast query
processing, and (ii) online processing of ranking queries. We study the problem
both in the general cases and in the special environment of web databases, a natural
fit for the ranked retrieval model.
We start the dissertation by building representatives that serve as indices for
ranking query processing. A critical observation is that skyline, also known as
Pareto-optimal, (resp. k sky-band) is a set that contains the top-1 (resp. top-k) for
every possible ranking function following the monotonic order of attribute values.
Thus, first, we study the problem crowdsourcing Pareto-optimal object finding,
in the case where objects do not have explicit attributes and preference relations
on objects are strict partial orders. Then, we initiate the research into the novel
problem of skyline discovery over hidden web databases, which enables a wide
variety of innovative third-party applications over one or multiple web databases.
A major problem with the ranking queries representatives, i.e., skyline and
convex hull, is that as in real-world applications the representative can be a
significant portion of the data, its performance in the ranking query processing is
greatly reduced. Thus, computing a subset limited to r tuples that minimize the
user’s dissatisfaction with the result from the limited set is of interest. We make
several fundamental theoretical as well as practical advances in developing such a
compact set.
Finally, considering the limitations of top-k indices, while focusing on the
client-server databases, we propose query reranking third-party service that uses
public interface of the database to enable the on-the-fly processing of ranking
queries.