Prominent Streaks Discovery On Blog Articles

Philip, Jijo John

dc.contributor.author	Philip, Jijo John	en_US
dc.date.accessioned	2013-03-20T19:12:05Z
dc.date.available	2013-03-20T19:12:05Z
dc.date.issued	2013-03-20
dc.date.submitted	January 2012	en_US
dc.identifier.other	DISS-11964	en_US
dc.identifier.uri	http://hdl.handle.net/10106/11580
dc.description.abstract	We are surrounded by data in many forms, such as instant messages, tweets, Facebook status updates, news, media, and blogs. Extracting meaning from such a massive collection of unstructured data can reveal interesting stories, for example "Who was the most popular actor in a particular month?" or "Which diseases were people most concerned about in 2008?". In this thesis, we propose to discover popular entities mentioned in blog articles based on the concept of a prominent streak. Given a sequence of values for a named entity (e.g., a person or a place), where each value is the occurrence frequency of the entity in blog articles during a corresponding period of time, a prominent streak is a long consecutive subsequence of only large (or only small) values. Whether a streak is prominent also depends on how it fares against streaks for comparable entities. Using the distributed data processing framework MapReduce, specifically Hadoop, one of its open-source implementations, we find entity occurrences in a set of blog articles with a trie-based data structure. Prominent streak discovery algorithms are then applied to the detected sequences of entity occurrences to derive interesting stories. Our experiments and evaluation use the ICWSM'09 Spinn3r blog dataset, which contains over 44 million blog articles from August and September 2008.	en_US
dc.description.sponsorship	Li, Chengkai	en_US
dc.language.iso	en	en_US
dc.publisher	Computer Science & Engineering	en_US
dc.title	Prominent Streaks Discovery On Blog Articles	en_US
dc.type	M.S.	en_US
dc.contributor.committeeChair	Li, Chengkai	en_US
dc.degree.department	Computer Science & Engineering	en_US
dc.degree.discipline	Computer Science & Engineering	en_US
dc.degree.grantor	University of Texas at Arlington	en_US
dc.degree.level	masters	en_US
dc.degree.name	M.S.	en_US
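
Illustrative note on the abstract: the sketch below shows one way the prominent-streak idea could be computed for a single entity's frequency sequence. It is not taken from the thesis. It assumes a streak is summarized by its length and its minimum value, and that a streak is prominent when no other streak is at least as long with an at least as large minimum (and strictly better in one of the two). The function names and the sample sequence are hypothetical, and the brute-force enumeration is for clarity only, not for the 44-million-article scale described above.

```python
from typing import List, Tuple

# A streak is (start index, end index inclusive, minimum value in that range).
Streak = Tuple[int, int, int]


def all_streaks(values: List[int]) -> List[Streak]:
    """Enumerate every consecutive subsequence with its minimum value."""
    streaks = []
    for i in range(len(values)):
        lowest = values[i]
        for j in range(i, len(values)):
            lowest = min(lowest, values[j])
            streaks.append((i, j, lowest))
    return streaks


def dominates(a: Streak, b: Streak) -> bool:
    """Assumed dominance rule: a is no shorter and no lower than b, and strictly better in one."""
    len_a = a[1] - a[0] + 1
    len_b = b[1] - b[0] + 1
    return len_a >= len_b and a[2] >= b[2] and (len_a > len_b or a[2] > b[2])


def prominent_streaks(values: List[int]) -> List[Streak]:
    """Keep only the streaks that no other streak dominates."""
    candidates = all_streaks(values)
    return [s for s in candidates
            if not any(dominates(t, s) for t in candidates)]


if __name__ == "__main__":
    # Hypothetical per-period mention counts for one entity.
    counts = [3, 1, 7, 7, 2, 5]
    for start, end, lowest in prominent_streaks(counts):
        print(f"periods {start}..{end}: at least {lowest} mentions each")
```

Under these assumptions, a short burst of very high counts and a long run of moderate counts can both be prominent, since neither dominates the other. Comparing streaks across comparable entities, as the abstract describes, would mean pooling candidates from several sequences before filtering.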

Files in this item

Name: Philip_uta_2502M_11964.pdf
Size: 333.3Kb
Format: PDF
