dc.contributor.author | Philip, Jijo John | en_US |
dc.date.accessioned | 2013-03-20T19:12:05Z | |
dc.date.available | 2013-03-20T19:12:05Z | |
dc.date.issued | 2013-03-20 | |
dc.date.submitted | January 2012 | en_US |
dc.identifier.other | DISS-11964 | en_US |
dc.identifier.uri | http://hdl.handle.net/10106/11580 | |
dc.description.abstract | We are surrounded by data in various forms, such as instant messages, Twitter tweets, Facebook status updates, news, media, blogs, and much more. Extracting meaning from such a massive collection of unstructured data can reveal interesting stories. Examples of such stories are "Who was the most popular actor in a particular month?" or "Which diseases were people most concerned about in 2008?". In this thesis, we propose to discover popular entities mentioned in blog articles based on the concept of prominent streaks. Given a sequence of values for a named entity (e.g., a person or a place), where each value is the occurrence frequency of the entity in blog articles during a corresponding period of time, a prominent streak is a long consecutive subsequence of only large (or only small) values. Whether a streak is prominent also depends on how it fares against streaks for comparable entities. Using the distributed data processing framework MapReduce, specifically Hadoop, one of its open-source implementations, we find entity occurrences in a set of blog articles with a trie-based data structure. Prominent streak discovery algorithms are then applied to the detected sequences of entity occurrences to derive interesting stories. Our experiments and evaluation use the ICWSM'09 Spinn3r blog dataset, which contains over 44 million blog articles from August and September 2008. | en_US |
dc.description.sponsorship | Li, Chengkai | en_US |
dc.language.iso | en | en_US |
dc.publisher | Computer Science & Engineering | en_US |
dc.title | Prominent Streaks Discovery On Blog Articles | en_US |
dc.type | M.S. | en_US |
dc.contributor.committeeChair | Li, Chengkai | en_US |
dc.degree.department | Computer Science & Engineering | en_US |
dc.degree.discipline | Computer Science & Engineering | en_US |
dc.degree.grantor | University of Texas at Arlington | en_US |
dc.degree.level | masters | en_US |
dc.degree.name | M.S. | en_US |
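The abstract's notion of a prominent streak can be illustrated with a small brute-force sketch. It is not the thesis's algorithm, and the abstract does not spell out the exact criterion. The sketch assumes one possible formalization: a streak is described by its length and its minimum value, and it is prominent when no other streak is at least as long and at least as high while being strictly better in one of the two. The function names (`all_streaks`, `dominates`, `prominent_streaks`) are hypothetical, and the sketch covers a single entity's sequence of "large value" streaks only.

```python
from typing import List, Tuple

# A streak is (start index, end index, length, minimum value in the interval).
Streak = Tuple[int, int, int, int]


def all_streaks(values: List[int]) -> List[Streak]:
    """Enumerate every consecutive subsequence with its length and minimum."""
    streaks: List[Streak] = []
    for start in range(len(values)):
        low = values[start]
        for end in range(start, len(values)):
            low = min(low, values[end])
            streaks.append((start, end, end - start + 1, low))
    return streaks


def dominates(a: Streak, b: Streak) -> bool:
    """Assumed criterion: a is no shorter and no lower than b, and strictly better in one."""
    return a[2] >= b[2] and a[3] >= b[3] and (a[2] > b[2] or a[3] > b[3])


def prominent_streaks(values: List[int]) -> List[Streak]:
    """Keep only the streaks that no other streak dominates (quadratic in the number of streaks)."""
    candidates = all_streaks(values)
    return [s for s in candidates if not any(dominates(o, s) for o in candidates)]


if __name__ == "__main__":
    # Hypothetical per-period occurrence counts for one entity.
    for streak in prominent_streaks([1, 5, 5, 2]):
        print(streak)
```

Enumerating every interval is only practical for short sequences. Comparing streaks across comparable entities, as the abstract describes, would require pooling candidates from several sequences before filtering.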
Files in this item
- Name: Philip_uta_2502M_11964.pdf
- Size: 333.3Kb
- Format: PDF