PARSING CODE-SWITCHED TAGLISH LANGUAGE BY CREATING CONSTITUENTS
Abstract
When extracting meaning from language, a common first step is to break the language down into constituents, or words that work together as a unit. This task, known as parsing, typically follows a specific grammar in order to decompose the language into its underlying constituent structure. Grammar-based parsing runs into difficulties, however, with real-world natural language because of its unstructured nature. Code-switching, the phenomenon of alternating between languages while communicating, further complicates the task by requiring us to parse based on two (or more) languages instead of one. This thesis presents a data-driven method for parsing code-switched language into its constituents. The code-switched language used in this thesis is Taglish, a mix of English and Tagalog, and the data is collected from the social media site Twitter.