FRAMEWORKS, ALGORITHMS, AND SYSTEMS FOR EFFICIENT DISCOVERY OF DATA-BACKED FACTS

Zhang, Gensheng

dc.contributor.advisor	Li, Chengkai
dc.creator	Zhang, Gensheng
dc.date.accessioned	2018-02-15T20:36:45Z
dc.date.available	2018-02-15T20:36:45Z
dc.date.created	2017-12
dc.date.issued	2017-12-07
dc.date.submitted	December 2017
dc.identifier.uri	http://hdl.handle.net/10106/27183
dc.description.abstract	This thesis studies the problem of finding facts from semi-structured and structured data. The amount of data in our world is exploding, and the proliferation of data is making them increasingly inaccessible. It is now more challenging than ever how to efficiently identify useful information where a vast amount of data is available. This thesis first studies the problem of finding facts in semi-structured data, specifically, in knowledge graphs. We built Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search on the patterns which uses a match-based pattern construction method to evade the evaluation of invalid patterns. It applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces organized as a set enumeration tree by exploiting the upper bound properties of exceptionality scoring functions. Maverick demonstrated substantial performance improvement of the proposed framework over the baselines as well as its effectiveness in discovering exceptional facts. This thesis further investigates the problem of finding facts in structured data. In particular, our objective is to find top-k prominent streaks in multi-dimensional multi-sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values, e.g., consecutive games of outstanding performance in sports. To efficiently discover prominent streaks, we exploited the properties of local prominent streaks (LPS), which is a superset of prominent streaks. We showed that LPS-based algorithms exhibited orders of magnitude performance improvement against the baseline method. This thesis also presents a fact-finding system FactWatcher, which helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-the-few facts, and prominent streaks, through a unified suite of data model, algorithm framework, and fact ranking measure. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.subject	Computational journalism
dc.subject	Fact-finding
dc.subject	Knowledge graph mining
dc.subject	Sequence data mining
dc.subject	Interestingness ranking
dc.title	FRAMEWORKS, ALGORITHMS, AND SYSTEMS FOR EFFICIENT DISCOVERY OF DATA-BACKED FACTS
dc.type	Thesis
dc.degree.department	Computer Science and Engineering
dc.degree.name	Doctor of Philosophy in Computer Science
dc.date.updated	2018-02-15T20:38:52Z
thesis.degree.department	Computer Science and Engineering
thesis.degree.grantor	The University of Texas at Arlington
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy in Computer Science
dc.type.material	text
dc.creator.orcid	0000-0002-9936-4407

Files in this item

Name:: ZHANG-DISSERTATION-2017.pdf
Size:: 3.453Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Show simple item record