DEEP LEARNING FOR PROTEIN PROPERTY AND STRUCTURE PREDICTION
Abstract
I present my work towards solving the fundamental, challenging, and valuable problem of protein property and structure prediction. Specifically, I address the problem from three critical aspects: (1) designing powerful deep learning networks for specific protein structure property prediction tasks; (2) proposing general methods that enhance the protein sequence homologous feature, an important input feature for the relevant tasks; and (3) developing a self-supervised pre-training model that learns structure embeddings from protein tertiary structures. To evaluate the effectiveness of the developed methods, I apply them to several downstream protein tasks, including the prediction of protein secondary structure, solvent accessibility, and backbone dihedral angles, protein structure quality assessment, and protein-protein interaction site prediction.
I accomplish this work step by step. First, I start from the protein secondary structure prediction task and iteratively design different deep learning networks, tailored to the characteristics of each prediction task, to learn representations of protein data. To learn a powerful representation of protein data and exploit the characteristics of protein secondary structure, I propose EnsembleASP, an ensemble learning method with Atrous Spatial Pyramid networks for secondary structure prediction. Moreover, since the homologous information of some proteins is insufficient, I propose a Bagging method that targets improving prediction performance on such low-quality data. In addition, to further address the uneven distribution of homologous information in the data, and to let researchers quickly apply and experiment with it on existing models, I propose WeightAln, a plug-and-play method based on the attention mechanism. WeightAln learns weights for the homologous feature of a target protein and applies them in the calculation process to obtain stronger sequence homologous information for that protein. Finally, to support protein structure-related downstream tasks, I propose a pre-training model that learns structure embeddings from protein tertiary structures. The model is optimized with a self-supervised loss function that relies only on protein structures and requires no additional supervision.
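To make the idea of weighting homologous sequences concrete, the following is a minimal sketch of a weighted sequence profile. It is not the WeightAln implementation: the function names, the use of a softmax over raw per-sequence scores, and the toy alignment are all assumptions made for illustration. In WeightAln the weights are learned by an attention mechanism, whereas here the scores are simply supplied by the caller.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def softmax(scores):
    """Turn raw per-sequence scores into weights that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return exp / exp.sum()


def weighted_profile(alignment, scores):
    """Build a (length x 20) residue frequency profile from aligned sequences.

    alignment: list of equal-length aligned sequences (hypothetical input format)
    scores:    one raw score per sequence (stand-in for learned attention scores)
    """
    weights = softmax(scores)
    length = len(alignment[0])
    profile = np.zeros((length, len(AMINO_ACIDS)))
    for seq, w in zip(alignment, weights):
        for pos, aa in enumerate(seq):
            idx = AA_INDEX.get(aa)
            if idx is not None:  # gaps and unknown residues contribute nothing
                profile[pos, idx] += w
    # Renormalize each column over the sequences that had a residue there.
    totals = profile.sum(axis=1, keepdims=True)
    return np.divide(profile, totals, out=np.zeros_like(profile), where=totals > 0)


# Toy alignment: the target sequence followed by two homologs.
toy_alignment = ["ACD", "ACE", "A-D"]
uniform = weighted_profile(toy_alignment, [0.0, 0.0, 0.0])
skewed = weighted_profile(toy_alignment, [0.0, 5.0, 0.0])
```

With equal scores every sequence counts the same, which corresponds to an unweighted frequency profile. Raising the score of one homolog shifts the profile towards its residues, which is the general effect that learned weights are meant to achieve.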