GAN-Based Domain Translation for Hand Pose Estimation and Face Reconstruction

Farahanipad, Farnaz

Date

2022-08-05

Abstract

Deep learning solutions for hand pose estimation rely heavily on comprehensive datasets covering diverse camera perspectives, lighting conditions, hand shapes, and pose variations. Since acquiring such datasets is a challenging task that may be infeasible for many novel applications, several studies aim to develop semi-/self-supervised learning methods that learn to estimate hand pose from only a small amount of labeled or unlabeled data. In this dissertation, we therefore investigate advances in semi-/self-supervised learning that use generative adversarial networks (GANs) to remove the bottleneck of time-consuming frame-by-frame manual annotation. To address these challenges, this thesis makes the following contributions. First, we present a comprehensive study of effective hand pose estimation approaches that leverage GANs to provide comprehensive training data across different modalities. We also evaluate related hand pose datasets and compare the performance of several of these methods on the hand pose estimation problem. The quantitative and qualitative results indicate that these methods outperform the baseline approaches, with better visual quality and better scores on most metrics (PCK and ME) on benchmark hand pose datasets. The second contribution builds on progress in GANs and image style transfer. We propose a two-stage semi-supervised pipeline that uses a cycle-consistent generative adversarial network (Cycle-GAN) to accurately localize fingertip positions in depth images, even under severe self-occlusion. Because training neural networks requires a huge amount of labeled data, semi-/self-supervised learning is very appealing for CNN training.
Experiments on the challenging NYU hand dataset demonstrate that our approach outperforms state-of-the-art approaches on 2D fingertip estimation by a significant margin, even in the presence of severe self-occlusion and irrespective of user orientation. Moreover, we develop a GUI in MATLAB R2020a to obtain 12-joint hand pose annotations of depth images. We prepare a comprehensive dataset of 10,000 depth hand images collected with a Microsoft Kinect V2, together with 7 keypoints annotated on the depth hand dataset. Third, we present a novel framework that formulates 2D hand keypoint localization in sequential data as a conditional video generation problem. We aim to learn a mapping function from an input depth video in the source domain to a target depth video by enforcing temporal consistency constraints. To the best of our knowledge, this is the first work on fingertip localization in depth videos through domain adaptation. Our comparative experiments against state-of-the-art single-frame hand pose estimation on the challenging NYU dataset demonstrate that, by exploiting temporal information, our model achieves better hand appearance consistency in the video-to-video synthesis stage, which leads to accurate estimation of 2D hand poses under motion blur caused by fast hand motion. In addition, we design and develop a novel game-based system for wrist rehabilitation, called HandReha. The approach is novel because its gestures are selected from a set of human gestures suitable for wrist rehabilitation and are used to control a game built in a 3D environment, whereas most previous rehabilitation games are built in a 2D environment. Finally, we propose a general domain translation framework that can be used to reconstruct the part of a face concealed by a mask.
We employ a GAN-based unpaired domain translation technique to translate masked face images in the source domain to unmasked images in the destination domain, which can be used for facial identification and secure authentication in human-computer interaction. The results demonstrate that our model outperforms other representative state-of-the-art face completion approaches both qualitatively and quantitatively.
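
The abstract reports results in terms of PCK and ME without defining them. As a minimal sketch, assuming PCK means the percentage of correct keypoints (the fraction of predicted keypoints that fall within a distance threshold of the ground truth) and ME means the mean Euclidean error, the two metrics could be computed as follows. The function names, the data layout (a list of frames, each a list of keypoint coordinates), and the toy values are illustrative assumptions, not taken from the dissertation.

```python
import math


def keypoint_errors(pred, gt):
    """Euclidean distance for every keypoint in every frame.

    pred and gt are lists of frames; each frame is a list of
    coordinate tuples, e.g. (x, y) in pixels. Layout is assumed.
    """
    return [
        math.dist(p, g)
        for pred_frame, gt_frame in zip(pred, gt)
        for p, g in zip(pred_frame, gt_frame)
    ]


def mean_error(pred, gt):
    """Assumed meaning of ME: average keypoint distance (lower is better)."""
    errors = keypoint_errors(pred, gt)
    return sum(errors) / len(errors)


def pck(pred, gt, threshold):
    """Fraction of keypoints within `threshold` of the ground truth
    (higher is better)."""
    errors = keypoint_errors(pred, gt)
    return sum(e <= threshold for e in errors) / len(errors)


# Toy example: 2 frames with 3 keypoints each (made-up values).
gt = [[(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],
      [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]]
pred = [[(3.0, 4.0), (0.0, 0.0), (0.0, 0.0)],   # first keypoint is off by 5
        [(0.0, 0.0), (0.0, 0.0), (0.0, 1.0)]]   # last keypoint is off by 1

print(mean_error(pred, gt))
print(pck(pred, gt, threshold=2.0))
```

Under these definitions the two metrics move in opposite directions: a better model has a higher PCK at a given threshold and a lower ME.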

URI

http://hdl.handle.net/10106/30943