dc.contributor.author: Thirumuruganathan, Saravanan (en_US)
dc.date.accessioned: 2015-12-11T23:20:05Z
dc.date.available: 2015-12-11T23:20:05Z
dc.date.submitted: January 2015 (en_US)
dc.identifier.other: DISS-13278 (en_US)
dc.identifier.uri: http://hdl.handle.net/10106/25352
dc.description.abstract: Almost all popular websites (such as Amazon and eBay, microblogs such as Twitter and Instagram, and collaborative content sites such as IMDB and Yelp) are powered internally by large data repositories. We designate them as hidden databases because their underlying data is accessible only through proprietary form-like interfaces that require users to query the system by entering desired values for a few attributes. These web databases also impose a number of restrictions. For example, the top-k output constraint ensures that when a large number of tuples match the query, only a few of them (the top-k) are selected and returned by the website, often according to a proprietary ranking function. The rate limit constraint restricts the number of queries or API calls that can be issued per day. Under the rank constraint, rank information for low-ranked tuples outside the top-k is often not provided. Most microblogging platforms such as Twitter also enforce a recency constraint that limits the results of their APIs to recent data, which stymies efforts to perform analytics over historic data. Finally, most collaborative content sites provide only aggregate information over items, such as the number of likes or the average rating, instead of the granular information needed for mining. These restrictions prevent scientists with limited resources from performing novel analytics tasks and prevent third parties from building innovative services over these data.
Most prior work on exploratory analysis and mining is not applicable to hidden databases due to the aforementioned restrictions. In this dissertation, we present efficient techniques for enabling exploratory mining over hidden databases. This is achieved by developing novel algorithms that allow a third party (such as an analyst or a scientist) to retrieve relevant data from hidden databases for exploratory mining by issuing a small number of carefully constructed queries that work around the restrictions. We design algorithms that sidestep the top-k output constraint so that it is possible to retrieve the top-h tuples where h > k. To work around the rank constraint, we design statistical estimators that can estimate the rank of a given tuple and that work well for both highly and lowly ranked tuples. For microblog platforms, we design algorithms that allow users to perform aggregate estimation over historic content, thereby circumventing the recency constraint. Finally, we propose a novel featureset uncertainty model and algorithms that enable exploratory mining over coarse aggregate user feedback data. For all the problems, we provide rigorous theoretical analysis, extensive experiments over real-world data, and online experiments over popular hidden web databases. (en_US)
dc.description.sponsorship: Das, Gautam (en_US)
dc.language.iso: en (en_US)
dc.publisher: Computer Science & Engineering (en_US)
dc.title: Enabling Exploratory Mining Over Hidden Databases (en_US)
dc.type: Ph.D. (en_US)
dc.contributor.committeeChair: Das, Gautam (en_US)
dc.degree.department: Computer Science & Engineering (en_US)
dc.degree.discipline: Computer Science & Engineering (en_US)
dc.degree.grantor: University of Texas at Arlington (en_US)
dc.degree.level: doctoral (en_US)
dc.degree.name: Ph.D. (en_US)

