On the Efficacy of Knowledge Graph Completion Methods, Accuracy Measures and Evaluation Protocols

Akrami, Farahnaz

dc.contributor.advisor	Li, Chengkai
dc.creator	Akrami, Farahnaz
dc.date.accessioned	2021-09-14T13:54:46Z
dc.date.available	2021-09-14T13:54:46Z
dc.date.created	2021-08
dc.date.issued	2021-08-25
dc.date.submitted	August 2021
dc.identifier.uri	http://hdl.handle.net/10106/29977
dc.description.abstract	In the active research area of employing embedding models for knowledge graph completion, particularly for the task of link prediction, most prior studies used some specific benchmark datasets to evaluate such models. Most triples in those datasets belong to reverse and duplicate relations, which exhibit high data redundancy due to semantic duplication, correlation, or data incompleteness. This is a case of excessive data leakage—a model is trained using features that otherwise would not be available when the model needs to be applied for real prediction. There are also Cartesian product relations for which every triple formed by the Cartesian product of applicable subjects and objects is a true fact. Link prediction on the aforementioned relations is easy and can be achieved with even better accuracy using straightforward rules instead of sophisticated embedding models. A more fundamental defect of these models is that the link prediction scenario, given such data, is non-existent in the real world. This dissertation thoroughly examines the impact of the aforementioned problems on the performance of different embedding models and provides a systematic evaluation of the true effectiveness of the models when the unrealistic triples are removed. Our experiment results show that the reverse triples led to a substantial overestimation of the accuracy of the embedding models. We argue that the popular benchmark datasets are entirely misleading and should not be used anymore. In this dissertation, we also demonstrate the inadequacy of existing evaluation metrics. The frequently used metrics are based on the closed-world assumption and thus have flaws when a model correctly predicts a triple that does not exist in the benchmark dataset. Models are penalized for generating such correct predictions, which contradicts with the exact goal of link prediction—finding correct triples that do not already exist in the knowledge graph. Another limitation of the current metrics is that they aggregate the predictions’ accuracy of all triples into a single value. This makes it impossible to discern the specific strengths and weaknesses of the models for different predictions tasks. We present the results per relation for various models and the results on relations with varying difficulty levels (1-1 vs. 1-N vs. N-M), in order to provide more insights into embedding models and show performance differences that global metrics would not reveal. Link prediction task, the most generic evaluation protocol, verifies that models prioritize correct answers over wrong ones for a question that is already known to have an answer. This evaluation setup can be misleading as we cannot verify if a model ranks false or nonsensical triples lower than correct triples. Hence, we report the results of an alternative protocol called entity-pair ranking to rank all possible triples for a specific relation. Also, the current evaluation protocol is based on the assumption that the presence of a particular property on an entity is already known. The evaluation focuses on whether a model can derive the correct property values. In reality, though, it remains a challenge to determine whether a property is valid for a given entity in the first place. Therefore, we propose the property prediction task as another evaluation protocol. The performances of the models on these two tasks are unsatisfactory and differ considerably from link prediction results. Another way of evaluation is to find the models' performance on triple classification. This task is the binary classification of triples regarding whether they are true or false facts. We compared the classification results using two sets of negative triples, one that complies with type constraints of Freebase and one that violates them. The results show that when negative triples are type consistent, classification performance degrades considerably. The results of these different evaluation protocols suggest that better knowledge graph embedding models or training strategies are needed.
dc.format.mimetype	application/pdf
dc.language.iso	en_US
dc.subject	Knowledge graph
dc.subject	Embedding models
dc.subject	Knowledge graph completion
dc.title	On the Efficacy of Knowledge Graph Completion Methods, Accuracy Measures and Evaluation Protocols
dc.type	Thesis
dc.degree.department	Computer Science and Engineering
dc.degree.name	Doctor of Philosophy in Computer Science
dc.date.updated	2021-09-14T13:54:47Z
thesis.degree.department	Computer Science and Engineering
thesis.degree.grantor	The University of Texas at Arlington
thesis.degree.level	Doctoral
thesis.degree.name	Doctor of Philosophy in Computer Science
dc.type.material	text

Files in this item

Name:: AKRAMI-DISSERTATION-2021.pdf
Size:: 1.974Mb
Format:: PDF

View/Open

This item appears in the following Collection(s)

Show simple item record