Enabling Exploratory Mining Over Hidden Databases

Thirumuruganathan, Saravanan

dc.contributor.author	Thirumuruganathan, Saravanan	en_US
dc.date.accessioned	2015-12-11T23:20:05Z
dc.date.available	2015-12-11T23:20:05Z
dc.date.submitted	January 2015	en_US
dc.identifier.other	DISS-13278	en_US
dc.identifier.uri	http://hdl.handle.net/10106/25352
dc.description.abstract	Almost all popular websites (e-commerce sites such as Amazon and eBay, microblogs such as Twitter and Instagram, and collaborative content sites such as IMDb and Yelp) are powered internally by large data repositories. We designate them as hidden databases because their underlying data is accessible only through proprietary form-like interfaces that require users to query the system by entering desired values for a few attributes. Further, these web databases impose a number of restrictions. For example, the top-k output constraint ensures that when a large number of tuples match the query, only a few of them (the top-k) are preferentially selected and returned by the website, often according to a proprietary ranking function. The rate limit constraint restricts the number of queries/API calls that can be issued per day. Under the rank constraint, rank information for low-ranked tuples outside the top-k is often not provided. Most microblogging platforms such as Twitter also enforce a recency constraint that limits the results of their APIs to recent data, which stymies efforts to perform analytics over historic data. Finally, most collaborative content sites provide only aggregate information over items, such as the number of likes or the average rating, instead of the granular information needed for mining. These restrictions prevent scientists with limited resources from performing novel analytics tasks. They also prevent third parties from building innovative services over this data. Most prior work on exploratory analysis and mining is not applicable to hidden databases because of the aforementioned restrictions. In this dissertation, we present efficient techniques for enabling exploratory mining over hidden databases.
This is achieved by developing novel algorithms that allow a third party (such as an analyst or a scientist) to retrieve relevant data from hidden databases for exploratory mining by issuing a small number of carefully constructed queries that work around the restrictions. We design algorithms that sidestep the top-k output constraint so that it is possible to retrieve the top-h tuples where h > k. To work around the rank constraint, we design statistical estimators that can estimate the rank of a given tuple and that work well for both highly and lowly ranked tuples. For microblog platforms, we design algorithms that allow users to perform aggregate estimation over historic content, thereby circumventing the recency constraint. Finally, we propose a novel featureset uncertainty model and algorithms that enable exploratory mining over coarse aggregate user feedback data. For all the problems, we provide rigorous theoretical analysis, extensive experiments over real-world data, and online experiments over popular hidden web databases.	en_US
dc.description.sponsorship	Das, Gautam	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Science & Engineering	en_US
dc.title	Enabling Exploratory Mining Over Hidden Databases	en_US
dc.type	Ph.D.	en_US
dc.contributor.committeeChair	Das, Gautam	en_US
dc.degree.department	Computer Science & Engineering	en_US
dc.degree.discipline	Computer Science & Engineering	en_US
dc.degree.grantor	University of Texas at Arlington	en_US
dc.degree.level	doctoral	en_US
dc.degree.name	Ph.D.	en_US

Files in this item

Name: Thirumuruganathan_uta_2502D_13 ...
Size: 3.850Mb
Format: PDF
