GAN-Based Domain Translation for Hand Pose Estimation and Face Reconstruction
Abstract
Deep learning solutions for hand pose estimation rely heavily on comprehensive
datasets covering diverse camera perspectives, lighting conditions, hand shapes,
and pose variations. Since acquiring such datasets is challenging and may be
infeasible for many novel applications, several studies aim to develop semi/self-supervised
learning methods that learn to estimate hand pose from limited labeled and/or unlabeled
data. In this dissertation, we therefore investigate advances in semi/self-supervised
learning based on generative adversarial networks (GANs) that remove the bottleneck of
time-consuming frame-by-frame manual annotation.
To address the above-mentioned challenges, this thesis makes the following contributions.
First, we present a comprehensive study of effective hand pose estimation
approaches that leverage generative adversarial networks (GANs) to provide
comprehensive training data across different modalities. We
also review related hand pose datasets and compare the performance of several of
these methods on the hand pose estimation problem. The quantitative and qualitative
results indicate that these methods outperform the baseline approaches,
with better visual quality and better scores on most of the metrics (PCK and ME)
on benchmark hand pose datasets.
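As an illustration of the two metrics named above, the following minimal sketch
assumes the usual reading of the abbreviations: PCK as the percentage of correct
keypoints (the fraction of predicted keypoints within a distance threshold of the
ground truth) and ME as the mean Euclidean error over keypoints. The function names,
the threshold values, and the sample coordinates are hypothetical and are not taken
from the thesis.

```python
import math


def joint_errors(pred, gt):
    """Euclidean distance between each predicted and ground-truth keypoint."""
    if len(pred) != len(gt) or not pred:
        raise ValueError("pred and gt must be non-empty and of equal length")
    return [math.dist(p, g) for p, g in zip(pred, gt)]


def mean_error(pred, gt):
    """ME (assumed: mean keypoint error); lower is better."""
    errors = joint_errors(pred, gt)
    return sum(errors) / len(errors)


def pck(pred, gt, threshold):
    """PCK (assumed: fraction of keypoints within `threshold`); higher is better."""
    errors = joint_errors(pred, gt)
    return sum(e <= threshold for e in errors) / len(errors)


# Hypothetical 2D keypoints in pixels, for illustration only.
gt = [(10.0, 10.0), (20.0, 20.0), (30.0, 30.0), (40.0, 40.0)]
pred = [(10.0, 10.0), (23.0, 24.0), (30.0, 36.0), (40.0, 52.0)]

print(mean_error(pred, gt))  # per-keypoint errors are 0, 5, 6, 12 -> 5.75
print(pck(pred, gt, 5.0))    # two of four keypoints within 5 px -> 0.5
print(pck(pred, gt, 10.0))   # three of four keypoints within 10 px -> 0.75
```

Note that the two metrics move in opposite directions: a better method yields a
higher PCK but a lower ME.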
The second contribution builds on progress in generative adversarial
networks (GANs) and image style transfer. We propose a two-stage semi-supervised
pipeline that uses a cycle-consistent generative adversarial network (Cycle-GAN)
to accurately localize fingertip positions on depth images, even under severe
self-occlusion. Because training neural networks requires large amounts of labeled data,
semi/self-supervised learning is very appealing for CNN training. Experiments
on the challenging NYU hand dataset demonstrate that our approach outperforms
state-of-the-art approaches on 2D fingertip estimation by a significant margin,
even in the presence of severe self-occlusion and irrespective of user orientation.
Moreover, we develop a GUI in MATLAB R2020a to obtain 12-joint hand pose
annotations of depth images. We prepare a comprehensive dataset of 10,000 depth
hand images collected with a Microsoft Kinect V2, along with 7 keypoints annotated
on the depth hand dataset.
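For reference, the cycle-consistency constraint that gives Cycle-GAN its name is
commonly written as below. The notation follows the standard Cycle-GAN formulation
and is an assumption here, not the thesis's own: G maps domain X to domain Y, F maps
Y back to X, and the term is added to the two adversarial losses.

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\left[ \lVert F(G(x)) - x \rVert_1 \right]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}\left[ \lVert G(F(y)) - y \rVert_1 \right]
```

The term penalizes translations that cannot be mapped back to their input, which is
what allows training from unpaired images in the two domains.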
Third, we present a novel framework that formulates 2D hand keypoint localization
in sequential data as a conditional video generation problem. We aim to learn
a mapping function from an input depth video in the source domain to a target depth
video while enforcing temporal consistency constraints. To the best of our knowledge,
this is the first work on fingertip localization in depth videos through
domain adaptation. Our comparative experiments against state-of-the-art
single-frame hand pose estimation on the challenging NYU dataset demonstrate that,
by exploiting temporal information, our model achieves better hand appearance consistency
in the video-to-video synthesis stage, which leads to accurate estimation of 2D
hand poses under motion blur caused by fast hand motion.
In addition, we design and develop a novel game-based system for wrist rehabilitation,
called HandReha. The approach is novel in that its gestures
are selected from a set of human gestures suitable for wrist rehabilitation and are used
to control a game built in a 3D environment, whereas most games designed
for rehabilitation in previous work are built in a 2D
environment.
Finally, we propose a general domain translation framework that can be used
to reconstruct the part of a face concealed by a mask. We employ a GAN-based
unpaired domain translation technique to translate masked face images in
the source domain to unmasked images in the target domain, which can be used for
facial identification and secure authentication in human-computer interaction. The
obtained results demonstrate that our model outperforms other representative
state-of-the-art face completion approaches both qualitatively and quantitatively.