Semi Automatic Hand Pose Annotation using a Single Depth Camera
Abstract
This thesis investigates the problem of 3D hand pose annotation using a single depth camera. While hand pose annotations are critically important for training deep neural networks, creating reliable training data is challenging and labor-intensive. Current datasets that rely on manual annotation of real images are limited in size because annotating them is difficult. Although large datasets have been generated using tracking-based methods followed by manual refinement, these methods are prone to annotation errors caused by tracking failure. Synthetic images have also been used to create large datasets, but synthetic frames do not capture sensor characteristics such as noise, and synthetic generation also produces kinematically implausible and unnatural hand poses. We propose a semi-automatic method for efficiently and accurately labeling the 3D hand key-points in a hand depth video. The process starts by selecting a subset of frames that is representative of all the frames in the dataset; the user provides an estimate of the 2D hand key-points only in these selected frames. We use this information to infer the 3D locations of the joints in all frames by enforcing appearance, temporal and distance constraints. Finally, we demonstrate that our method generates 3D training data more accurately, with less manual intervention and more flexibility, than other state-of-the-art methods.