Using antipatterns to improve database code fragments, and utilizing knowledge graphs and NLP patterns to extract standardized data element names

Alshemaimri, Bader

dc.contributor.advisor	Elmasri, Ramez
dc.creator	Alshemaimri, Bader
dc.date.accessioned	2022-06-28T15:10:35Z
dc.date.available	2022-06-28T15:10:35Z
dc.date.created	2022-05
dc.date.issued	2022-04-06
dc.date.submitted	May 2022
dc.identifier.uri	http://hdl.handle.net/10106/30366
dc.description.abstract	Database code fragments appear in software systems that use SQL, the standard language for relational databases. Traditionally, developers bind databases as backends to software systems to support user applications. However, these bindings are low-level code implemented to persist user data, so Object Relational Mapping (ORM) frameworks are used to abstract database access details. Both approaches are prone to problematic database code fragments that negatively impact the quality of software systems. In the first part of the dissertation, we survey problematic database code fragments in the literature and examine antipatterns that occur in low-level database access code using SQL and in their high-level counterparts in ORM frameworks. We also study problematic database code fragments in popular software architectures such as Service Oriented Architecture (SOA), Microservice Architecture (MA), and Model View Controller (MVC). We create a novel categorization of both SQL schema and query antipatterns in terms of performance, maintainability, portability, and data integrity. In the second part of this dissertation, we create NLP patterns that support data architects when modeling and naming data element definitions. We design and develop rule-based natural language processing (NLP) techniques to automatically extract standardized data element names from data element definitions written in American English. The goal is to study how NLP techniques can improve the accuracy of extracting standardized data element names in a domain-independent context. Devising NLP patterns for natural language definitions is challenging because, unlike code, natural language is ambiguous. To achieve automated data element naming, we first identify heuristic patterns that mine noun phrases and relationships from data element definitions. Then, we use these noun phrases and relationships as input to determine the components of data element names.
The output of the patterns is reviewed by a domain expert. We apply our method to extract the five standard components of a data element name in the Railway and Transportation domains. We first achieved 80% accuracy; then, by improving the rules and adding a similarity function using knowledge graphs, we improved the accuracy to 95% in our final experiments. We also introduce our tool, the Data Element Naming Automation (DENA) tool. The tool consists of four components: DENA NLP, DENA assembly, preprocessing, and duplicate checker. In the last part of the dissertation, we describe how we preprocess data element definitions and evaluate duplicate detection.
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.subject	Data management
dc.subject	Artificial intelligence
dc.title	Using antipatterns to improve database code fragments, and utilizing knowledge graphs and NLP patterns to extract standardized data element names
dc.type	Thesis
dc.degree.department	Computer Science and Engineering
dc.degree.name	Doctor of Philosophy in Computer Science
dc.date.updated	2022-06-28T15:10:35Z
thesis.degree.department	Computer Science and Engineering
thesis.degree.grantor	The University of Texas at Arlington
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy in Computer Science
dc.type.material	text

Files in this item

Name: ALSHEMAIMRI-DISSERTATION-2022.pdf
Size: 2.555Mb
Format: PDF
