Enabling Exploratory Mining Over Hidden Databases
MetadataShow full item record
Almost all popular websites (such as Amazon, EBay, microblogs such as Twitter, Instagram, collaborative content sites such as IMDB, Yelp etc) are powered internally by large data repositories. We designate them as hidden databases as their underlying data is accessible only through proprietary form-like interfaces that require users to query the system by entering desired values for a few attributes. Further, these web databases also impose a number of restrictions. For example, the top-k output constraint ensures that when there are a large number of tuples matching the query, only a few of them (top-k) are preferentially selected and returned by the website, often according to a proprietary ranking function. The rate limit constraint restricts the number of queries/API calls that could be issued per day. The rank information of low ranked tuples not in top-k are often not provided due to rank constraint. Most microblogging platforms such as Twitter also enforce recency constraint that limits the results of their APIs to recent data. This stymies the efforts to perform analytics over historic data. Finally, most collaborative content sites provide only aggregate information over items such as number of likes, average rating etc instead of granular information needed for mining. These restrictions prevent scientists with limited resources from performing novel analytics tasks. Similarly, it also prevents third parties from building innovative services over these data.Most prior work on exploratory analysis and mining are not applicable for hidden databases due to the aforementioned restrictions. In this dissertation, we present efficient techniques for enabling exploratory mining over hidden databases. This is achieved by developing novel algorithms that allows a third party (such as an analyst or a scientist) to retrieve relevant data from hidden databases for exploratory mining by issuing a small number of carefully constructed queries that enables them to work around the restrictions. We design algorithms that sidestep the top-k output constraint so that it is possible to retrieve the top-h tuples where h > k. In order to work around rank constraint, we designed statistical estimators that can estimate the rank of a given tuple which works well for both high and lowly ranked tuples. For microblog platforms, we design algorithms that allows users to perform aggregate estimation over historic content thereby circumventing the recency constraint. Finally, we propose a novel featureset uncertainty model and algorithms that can enable exploratory mining over coarse aggregate user feedback data. For all the problems, we provide rigorous theoretical analysis and extensive experiments over real-world data and online experiments over popular hidden web databases.