Automatic understanding of human activity and action is an important and challenging research area in computer vision, with wide-ranging applications in video surveillance, motion analysis, virtual reality interfaces, robot navigation and recognition, video indexing, content-based video retrieval, human-computer interaction (HCI), health care, choreography, and sports video analysis. This thesis presents a series of techniques for human action recognition in video. The first approach is based on a probabilistic optimization model of body parts using a hidden Markov model (HMM). This strongly model-based approach distinguishes between similar actions by considering only the body parts that contribute most to each action, for example, legs for walking and jogging, and arms for boxing and clapping. The second approach is based on the observation that action recognition can be performed using only a visual cue, namely the human pose during the action, even from a few frames rather than the whole sequence. In this method, actions are represented by a Bag-of-key-poses model that captures the variation in human pose during an action. To recognize actions in complex scenes, we propose a model-free approach based on spatio-temporal interest points (STIPs) and local features. To this end, a novel STIP detector is proposed that uses a mechanism similar to the non-classical receptive field inhibition exhibited by most orientation-selective neurons in the primary visual cortex. The selective-STIP-based approach is then extended to human action recognition in multi-camera systems. In this case, selective STIPs from each camera viewpoint are combined using the 3D reconstructed data to form 4D STIPs [3D space + time] for multi-view action recognition. The concluding part of the thesis is dedicated to continuous visual event recognition (CVER) on large-scale video datasets.
This is an extremely challenging problem because of demanding scalability requirements, diverse real-world environmental conditions, and wide scene variability. To address these issues, a motion region extraction technique is applied as a preprocessing step. A max-margin generalized Hough transform framework is used to learn the distribution of feature votes around the activity center and obtain an activity hypothesis, which is then verified by a Bag-of-words + SVM action recognition system. We validate the proposed approaches on several benchmark action recognition datasets, as well as on small-scale and large-scale activity recognition datasets. We obtain state-of-the-art results, which show a progressive improvement of the proposed techniques for human action and activity recognition in video.
- Action recognition
- Bag-of-words models