Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video
Lionel Pigou / February 2016
Disclaimer: This post was summarized by an AI chatbot and is based on a chapter of my Ph.D. dissertation (page 105) and my IJCV 2016 paper.
Recent studies have demonstrated the power of recurrent neural networks for machine translation, image captioning and speech recognition. For the task of capturing temporal structure in video, however, there still remain numerous open research questions. Current research suggests using a simple temporal feature pooling strategy to take into account the temporal aspect of video. We demonstrate that this method is not sufficient for gesture recognition, where temporal information is more discriminative compared to general video classification tasks.
We explore deep architectures for gesture recognition in video and propose a new end-to-end trainable neural network architecture incorporating temporal convolutions and bidirectional recurrence. Our main contributions are twofold: first, we show that recurrence is crucial for this task; second, we show that adding temporal convolutions leads to significant improvements. We evaluate the different approaches on the Montalbano gesture recognition dataset, where we achieve state-of-the-art results.
Network Architectures
We investigate several architectures for gesture recognition in video. The baseline models are a single-frame CNN, which processes each frame independently, and a temporal pooling model, which aggregates per-frame spatial features over a fixed time window. We also propose two architectures that combine CNNs with RNNs: the first uses bidirectional recurrence to capture temporal dependencies; the second additionally adds temporal convolutions to the CNN layers to extract motion features. Sketches of these model families follow below.
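To make the model families concrete, here is a minimal sketch in PyTorch. The framework, layer sizes, and class names are illustrative assumptions, not the paper's implementation: a per-frame CNN, a temporal-max-pooling baseline on top of it, a CNN feeding a bidirectional LSTM, and a variant whose 2D convolutions are replaced by spatiotemporal 3D convolutions to realize temporal convolutions.

```python
import torch
import torch.nn as nn

class FrameCNN(nn.Module):
    """Per-frame feature extractor; layer sizes are placeholders."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.fc = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):            # x: (batch*time, 1, H, W)
        return self.fc(self.features(x).flatten(1))

class TemporalPoolingModel(nn.Module):
    """Baseline: max-pool per-frame CNN features over the time axis,
    yielding a single prediction per clip."""
    def __init__(self, n_classes=21, feat_dim=256):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, x):            # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        return self.classifier(f.max(dim=1).values)

class BiRNNModel(nn.Module):
    """CNN features fed through a bidirectional LSTM; one prediction
    per frame, so temporal order is preserved end to end."""
    def __init__(self, n_classes=21, feat_dim=256, hidden=128):
        super().__init__()
        self.cnn = FrameCNN(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True,
                           batch_first=True)
        self.classifier = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):            # x: (batch, time, 1, H, W)
        b, t = x.shape[:2]
        f = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        h, _ = self.rnn(f)           # (batch, time, 2*hidden)
        return self.classifier(h)

class TemporalConvCNN(nn.Module):
    """Temporal convolutions sketched as 3D convolutions over
    (time, height, width); the time axis is preserved so the output
    sequence can feed a recurrent layer."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),         # pool space, not time
            nn.Conv3d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),
        )
        self.fc = nn.Linear(32 * 4 * 4, out_dim)

    def forward(self, x):            # x: (batch, 1, time, H, W)
        f = self.features(x)         # (batch, 32, time, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)
        return self.fc(f)            # (batch, time, out_dim)
```

In the strongest configuration, temporal convolutions and bidirectional recurrence are combined; swapping TemporalConvCNN in as the feature extractor for the recurrent head above approximates that combination.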
Experiments and Results
We evaluate our models on the Montalbano gesture recognition dataset. We preprocess the data by cropping and resizing the images. We train the models end-to-end, optimizing the network parameters using gradient descent. We also apply data augmentation and regularization techniques to improve generalization.
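As a rough illustration of the training setup, here is a minimal sketch, assuming a hypothetical data loader that yields clips of shape (batch, time, 1, H, W) with per-frame integer labels; the crop size and augmentation are stand-ins, not the settings used in the paper.

```python
import torch
import torch.nn.functional as F

def random_crop(clip, size=64):
    """Simple augmentation: crop the same random window from every
    frame so the spatial jitter is consistent across the clip.
    clip: (time, 1, H, W)."""
    _, _, H, W = clip.shape
    y = torch.randint(0, H - size + 1, (1,)).item()
    x = torch.randint(0, W - size + 1, (1,)).item()
    return clip[:, :, y:y + size, x:x + size]

def train_epoch(model, loader, optimizer, device="cpu"):
    """One epoch of end-to-end training for a model that predicts
    per-frame class scores, e.g. the BiRNN sketch above."""
    model.train()
    for clips, labels in loader:           # labels: (batch, time)
        clips, labels = clips.to(device), labels.to(device)
        logits = model(clips)              # (batch, time, n_classes)
        loss = F.cross_entropy(logits.flatten(0, 1), labels.flatten())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```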
Our results show that the temporal convolution architecture outperforms the single-frame and temporal pooling models, and that the models combining CNNs with RNNs perform better still. Both LSTM cells and standard RNN cells prove effective at capturing temporal dependencies. Our best model achieves a state-of-the-art Jaccard index score on this benchmark, indicating accurate frame-wise gesture recognition.
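The Jaccard index used here is the intersection over union of the predicted and ground-truth frame sets, computed per gesture class. A simplified per-video version is sketched below; the official Montalbano challenge metric averages over gesture instances and sequences and ignores the no-gesture background, so treat the details here as assumptions.

```python
import numpy as np

def jaccard_index(pred, truth, n_classes):
    """Mean frame-wise Jaccard index for one video.
    pred, truth: 1-D integer arrays of per-frame class labels."""
    scores = []
    for c in range(n_classes):
        a, b = truth == c, pred == c
        union = np.logical_or(a, b).sum()
        if union > 0:                  # skip classes absent from both
            scores.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(scores))

# Example: the third frame, truly class 1, is predicted as class 2,
# so classes 1 and 2 each score 0.5 and class 0 scores 1.0.
print(jaccard_index(np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2]), 3))
```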
Conclusion and Future Work
In this work, we demonstrated the effectiveness of combining temporal convolutions and recurrence for gesture recognition in videos. The unified end-to-end neural network architecture achieved significant improvements in accuracy compared to traditional methods. In future work, we plan to apply these techniques to sign language recognition, which poses additional challenges such as a larger vocabulary and context-dependent signs.
By leveraging the power of deep learning, we can continue to advance gesture recognition and make progress in understanding and interpreting human movements.