
dc.contributor.advisor  Huber, Manfred
dc.creator  Bhatt, Saumya
dc.date.accessioned  2022-01-25T18:28:27Z
dc.date.available  2022-01-25T18:28:27Z
dc.date.created  2021-12
dc.date.issued  2022-01-06
dc.date.submitted  December 2021
dc.identifier.uri  http://hdl.handle.net/10106/30243
dc.description.abstract  The Vision and Language Navigation task arose from the idea that we can build a robot or an autonomous system that can be instructed in human language and will navigate using the instructions given. For example, we tell the agent to “Go down past some room dividers toward a glass top desk and turn into the dining area. Wait next to the large glass dining table,” and it not only reaches the goal state but follows the instructions while navigating. With current developments, this no longer seems like a distant problem, and in recent years a number of systems have been developed that attempt to address this task. To accomplish it, the artificial agent must understand the two modalities with which humans perceive the world, vision and language, and then translate these into actions. While significant progress has been made in recent years toward systems capable of performing this task, these systems still fail in a significant number of cases. To investigate the reasons and potential ways to overcome them, this thesis explores several ways in which the navigation task with multiple modalities can be grounded and aligned temporally and visually. The thesis analyzes the failures of the previously used Environment Drop method with back translation and investigates what happens when pre-trained embeddings, as well as auxiliary tasks, are utilized with it. In particular, it proposes an augmentation to the architecture for the Vision and Language Navigation task with pre-trained language tokens and a navigator with reasoning that oversees progress and co-grounds vision and language rather than relying only on a temporal attention mechanism. The underlying base architecture on which the modifications were implemented is a highly successful method that uses Environment Drop with back translation. While results with the modified architecture and the proposed improvements did not show a significant increase in the success rate of the chosen base architecture, the analysis of the results has provided valuable insights to help determine the direction of potential further research.
dc.format.mimetype  application/pdf
dc.language.iso  en_US
dc.subject  Vision and Language Navigation
dc.subject  Multimodal
dc.subject  Natural language processing
dc.subject  Machine translation
dc.subject  Matterport3D
dc.subject  Room2Room
dc.subject  Autonomous systems
dc.subject  LSTM
dc.subject  Vision and language grounding
dc.subject  Robot
dc.subject  Navigation
dc.title  Language Pre-Training and Auxiliary Tasks for Vision and Language Navigation
dc.type  Thesis
dc.contributor.committeeMember  Park, Deok Gun
dc.degree.department  Computer Science and Engineering
dc.degree.name  Master of Science in Computer Science
dc.date.updated  2022-01-25T18:28:27Z
thesis.degree.department  Computer Science and Engineering
thesis.degree.grantor  The University of Texas at Arlington
thesis.degree.level  Masters
thesis.degree.name  Master of Science in Computer Science
dc.type.material  text

