Show simple item record

dc.contributor.advisorLi, Chengkai
dc.creatorZhang, Gensheng
dc.date.accessioned2018-02-15T20:36:45Z
dc.date.available2018-02-15T20:36:45Z
dc.date.created2017-12
dc.date.issued2017-12-07
dc.date.submittedDecember 2017
dc.identifier.urihttp://hdl.handle.net/10106/27183
dc.description.abstractThis thesis studies the problem of finding facts from semi-structured and structured data. The amount of data in our world is exploding, and the proliferation of data is making them increasingly inaccessible. It is now more challenging than ever how to efficiently identify useful information where a vast amount of data is available. This thesis first studies the problem of finding facts in semi-structured data, specifically, in knowledge graphs. We built Maverick, a general, extensible framework that discovers exceptional facts about entities in knowledge graphs. We model an exceptional fact about an entity of interest as a context-subspace pair, in which a subspace is a set of attributes and a context is defined by a graph query pattern of which the entity is a match. The entity is exceptional among the entities in the context, with regard to the subspace. The search spaces of both patterns and subspaces are exponentially large. Maverick conducts beam search on the patterns which uses a match-based pattern construction method to evade the evaluation of invalid patterns. It applies two heuristics to select promising patterns to form the beam in each iteration. Maverick traverses and prunes the subspaces organized as a set enumeration tree by exploiting the upper bound properties of exceptionality scoring functions. Maverick demonstrated substantial performance improvement of the proposed framework over the baselines as well as its effectiveness in discovering exceptional facts. This thesis further investigates the problem of finding facts in structured data. In particular, our objective is to find top-k prominent streaks in multi-dimensional multi-sequence data. Given a sequence of values, a prominent streak is a long consecutive subsequence consisting of only large (small) values, e.g., consecutive games of outstanding performance in sports. To efficiently discover prominent streaks, we exploited the properties of local prominent streaks (LPS), which is a superset of prominent streaks. We showed that LPS-based algorithms exhibited orders of magnitude performance improvement against the baseline method. This thesis also presents a fact-finding system FactWatcher, which helps journalists identify data-backed, attention-seizing facts which serve as leads to news stories. FactWatcher discovers three types of facts, including situational facts, one-of-the-few facts, and prominent streaks, through a unified suite of data model, algorithm framework, and fact ranking measure. Furthermore, FactWatcher provides multiple features in striving for an end-to-end system, including fact ranking, fact-to-statement translation and keyword-based fact search.
dc.format.mimetypeapplication/pdf
dc.language.isoen_US
dc.subjectComputational journalism
dc.subjectFact-finding
dc.subjectKnowledge graph mining
dc.subjectSequence data mining
dc.subjectInterestingness ranking
dc.titleFRAMEWORKS, ALGORITHMS, AND SYSTEMS FOR EFFICIENT DISCOVERY OF DATA-BACKED FACTS
dc.typeThesis
dc.degree.departmentComputer Science and Engineering
dc.degree.nameDoctor of Philosophy in Computer Science
dc.date.updated2018-02-15T20:38:52Z
thesis.degree.departmentComputer Science and Engineering
thesis.degree.grantorThe University of Texas at Arlington
thesis.degree.levelDoctoral
thesis.degree.nameDoctor of Philosophy in Computer Science
dc.type.materialtext
dc.creator.orcid0000-0002-9936-4407


Files in this item

Thumbnail


This item appears in the following Collection(s)

Show simple item record