Human motion capture in images and videos using discriminative and hybrid learning methods
Vision-based human pose estimation and tracking is a popular research area that has generated a great deal of interest in the last decade. The interest is motivated by its many applications, including video surveillance, clinical rehabilitation and the analysis of athlete performance. Vision-based capture is also non-intrusive: unlike marker-based motion capture systems, it does not require markers to be attached to the body parts. In this thesis, two machine learning techniques and one feature representation technique have been developed to automatically capture human motion from images and videos.
During the last two decades there has been much work in markerless human motion capture. This thesis contributes to the existing body of work by providing three new algorithms. First, an appearance descriptor is proposed for human pose estimation from monocular images. Second, a discriminative learning-based fusion algorithm is proposed to combine shape and appearance features for human pose estimation from monocular images. Third, a hybrid discriminative and generative method that takes into account prediction uncertainty of the discriminative model is proposed for 3D human pose tracking from both single and multiple cameras.
Shape-based features such as silhouettes and appearance features are commonly used for pose estimation from monocular images using regression-based techniques. Silhouette features require a segmentation step to obtain only the information pertinent to the shape of the occluding body parts, and they discard appearance information that can potentially be useful for pose estimation. In order to utilize appearance information, we present an appearance descriptor that involves dimensionality reduction and vector quantization and that is suitable for regression-based human pose estimation. To objectively compare state-of-the-art shape and appearance descriptors with our appearance descriptor, we conducted a quantitative evaluation using the HumanEva-I dataset.
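To make the two ingredients named above concrete, the following is a minimal sketch of a descriptor built from dimensionality reduction followed by vector quantization. It is not the descriptor developed in the thesis: PCA, k-means, the histogram encoding and all function names (`fit_pca`, `fit_codebook`, `describe`) are assumptions chosen only for illustration.

```python
import numpy as np

def fit_pca(X, n_components):
    """Return (mean, basis) of a PCA projection fitted on the rows of X."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def nearest(Z, centroids):
    """Index of the closest centroid (squared Euclidean) for each row of Z."""
    d = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def fit_codebook(Z, k, n_iter=20, seed=0):
    """Plain k-means (Lloyd's algorithm) on the rows of Z; returns k centroids."""
    rng = np.random.default_rng(seed)
    centroids = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(Z, centroids)
        for j in range(k):
            members = Z[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def describe(patches, mean, basis, centroids):
    """Encode one image's local patch features as a normalized codeword histogram."""
    Z = (patches - mean) @ basis.T          # dimensionality reduction
    labels = nearest(Z, centroids)          # vector quantization
    hist = np.bincount(labels, minlength=len(centroids)).astype(float)
    return hist / hist.sum()
```

The resulting fixed-length histogram could then serve as the input of a regression model that maps image features to poses; how the thesis actually constructs and normalizes its descriptor is not stated in this summary.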
Shape-based features such as silhouettes are insensitive to background variations, but a single silhouette can be associated with more than one pose, resulting in ambiguities. Appearance features, on the other hand, can be more distinctive than shape features, but they may be made unstable by background clutter and by variations in the clothing of the human subject. While neither shape nor appearance features are sufficient on their own for a robust estimation of human poses, they have the potential to complement each other because one may not be sensitive to conditions that affect the other. This thesis presents a novel fusion method based on discriminative learning to combine the proposed appearance descriptor with a shape descriptor to exploit their complementary properties for human pose estimation from monocular images. The proposed method, named the “localized decision level fusion” technique, is based on clustering the output pose space into several partitions and learning a decision level fusion of the regression models for the shape and appearance descriptors in each region. The evaluation of the proposed fusion method using the HumanEva-I dataset demonstrates that the proposed feature combination method gives more accurate pose estimates than either individual feature type and than other fusion techniques.
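The idea of partitioning the pose space and learning a separate fusion rule per region can be sketched as follows. This is an illustrative simplification, not the thesis algorithm: it assumes k-means for the partitioning, a single convex weight per region fitted by least squares, and a region lookup at test time based on the average of the two predictions. All names (`fit_local_fusion`, `fuse`) are hypothetical.

```python
import numpy as np

def nearest(Y, centroids):
    """Index of the closest centroid for each row of Y."""
    d = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(Y, k, n_iter=20, seed=0):
    """Plain k-means used to partition the output pose space."""
    rng = np.random.default_rng(seed)
    centroids = Y[rng.choice(len(Y), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(Y, centroids)
        for j in range(k):
            members = Y[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids

def fit_local_fusion(Y, P_shape, P_app, k, seed=0):
    """Learn one fusion weight per pose-space region.

    Y       : (N, D) ground-truth training poses
    P_shape : (N, D) poses predicted by the shape-based regressor
    P_app   : (N, D) poses predicted by the appearance-based regressor
    """
    centroids = kmeans(Y, k, seed=seed)
    labels = nearest(Y, centroids)
    weights = np.full(k, 0.5)  # fall back to a plain average
    for j in range(k):
        m = labels == j
        diff = P_shape[m] - P_app[m]
        denom = (diff ** 2).sum()
        if denom > 0:
            # Least-squares w for y ~ w * p_shape + (1 - w) * p_app.
            w = ((Y[m] - P_app[m]) * diff).sum() / denom
            weights[j] = np.clip(w, 0.0, 1.0)
    return centroids, weights

def fuse(p_shape, p_app, centroids, weights):
    """Fuse two pose predictions with the weight of the nearest region."""
    guess = 0.5 * (p_shape + p_app)  # assumed way to pick a region at test time
    j = nearest(guess[None, :], centroids)[0]
    return weights[j] * p_shape + (1.0 - weights[j]) * p_app
```

In this sketch a region where the shape regressor is reliable ends up with a weight close to 1, and a region where the appearance regressor is reliable ends up with a weight close to 0, which is the complementary behaviour the paragraph above describes.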
Discriminative methods are limited by the domain of their training examples, while generative methods are flexible and provide room for using partial knowledge of the solution space. Because the two approaches predict the final output in different ways, a hybrid model has the potential to improve performance. In this thesis, we propose a hybrid discriminative and generative method to track the 3D human pose from both single and multiple cameras. The discriminative models are obtained by training a mixture of Gaussian Process (GP) regression models. In the tracking step, the probabilistic predictions from the GP regression models are combined with a particle filter (which is a generative method) and annealing to track the 3D pose in each video frame. To the best of our knowledge, this is the first method that takes into account the predictive uncertainty of the discriminative model and combines it within a generative method. To validate the proposed method in terms of correctness, we tested our algorithm on the HumanEva-I and HumanEva-II datasets.
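One way a probabilistic discriminative prediction could enter an annealed particle filter is sketched below. The summary above does not specify how the two are combined, so this sketch makes its own assumptions: the discriminative prediction is treated as a diagonal Gaussian (mean and variance per pose dimension, as a GP regressor would supply) that multiplies the annealed generative likelihood, and the motion model is a random walk. The function name `annealed_step` and all parameter values are hypothetical.

```python
import numpy as np

def annealed_step(particles, observe_loglik, disc_mean, disc_var, rng,
                  betas=(0.2, 0.5, 1.0), motion_std=0.5):
    """Track one frame with an annealed particle filter guided by a
    discriminative prediction.

    particles      : (N, D) pose hypotheses from the previous frame
    observe_loglik : callable mapping (N, D) poses to (N,) generative log-likelihoods
    disc_mean      : (D,) predictive mean of the discriminative model
    disc_var       : (D,) predictive variance of the discriminative model
    """
    n = len(particles)
    # Random-walk motion model (an assumption of this sketch).
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    for layer, beta in enumerate(betas):
        # Unnormalized log of a Gaussian centred on the discriminative
        # prediction. A large predictive variance flattens this term, so an
        # uncertain prediction has little influence on the weights.
        log_disc = -0.5 * (((particles - disc_mean) ** 2) / disc_var).sum(axis=1)
        # Annealing: beta < 1 smooths the generative likelihood in early layers.
        log_w = beta * observe_loglik(particles) + log_disc
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        particles = particles[rng.choice(n, size=n, p=w)]  # resample
        if layer < len(betas) - 1:
            # Shrinking diffusion between annealing layers.
            noise = motion_std * 0.5 ** (layer + 1)
            particles = particles + rng.normal(0.0, noise, particles.shape)
    return particles, particles.mean(axis=0)
```

Calling `annealed_step` once per video frame, with `disc_mean` and `disc_var` recomputed from the current image features, yields a pose estimate per frame. The key design point illustrated here is that the discriminative variance controls how strongly the regression output constrains the generative search.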
Vision-based motion capture systems use images of human subjects captured using single or multiple digital cameras. Since digital cameras are non-invasive visual sensors, they can be used in ambient environments to record the motions. Such systems are promising and can be made accessible at a relatively low cost for commercial use and to the general public. Furthermore, automatic understanding of human motions from images and videos is applicable to a wide range of areas including security, health and entertainment. The list of applications includes:
1. Security/Surveillance: In video-based smart surveillance systems, human motion can be used to infer the action of a human subject in a scene, e.g. for behavior analysis purposes. The system can help security personnel focus their attention on events of interest.
2. Human Computer Interface (HCI): Human motion can provide pervasive computer interfaces whereby computers can be controlled by natural human gestures. For example, in an “intelligent home scenario”, cameras located in a room would be able to perform several additional tasks such as turning lights and the TV on or off based on the estimated human gestures. Likewise, it can provide an advanced human computer interface for gaming and virtual reality applications.
3. Health: The captured human motions can be used in clinical gait analysis for medical rehabilitation and in the diagnosis of orthopedic patients, to identify posture-related problems in people with injuries. It has been argued that gait analysis can have the same level of importance as other clinical tests. It can also be incorporated with motion analysis to improve the performance of athletes and to minimize injuries.
4. Biometric: Human motions have been used in biometric systems for gait analysis to perform gender discrimination, age discrimination, and person identification.
5. Entertainment: Motion capture is increasingly being used in the movie and animation industries to create special effects. Many movies use motion capture to interpret the action of an actor and then animate it using a digital character. For instance, the animation of several characters in the motion picture Avatar was produced using marker-based motion capture systems.
6. Content-based image retrieval: Human motion analysis is useful for searching and indexing large image databases using the pose of persons as a key or a search query.