
Multimodal Language Model for Sign Language Recognition (SLR)

Tags: machine learning, sign language recognition

Disclaimer: This post was summarized by an AI chatbot and is based on a chapter of my Ph.D. dissertation (page 155).

VRT news broadcast with an overlaid sign language interpreter.

TV broadcasting organizations like the BBC (British Broadcasting Corporation) and the VRT (Flemish Radio and Television Broadcasting Organization) make news broadcasts accessible to deaf viewers by overlaying a sign language interpreter on the screen. This produces a large amount of data in which spoken language is interpreted into sign language. However, translating sign language into written Dutch remains challenging due to the scarcity of annotated examples and the absence of a one-to-one mapping between signed and written language. In this work, we explore sign language recognition in TV news broadcasts and propose a model that aligns subtitles with sign language video fragments.

VRT news dataset

We collected a dataset of news videos with Flemish Sign Language interpreter overlays in collaboration with VRT. The dataset contains daily news broadcasts from September 2012 to July 2015, totaling 575 hours of usable and subtitled footage. The interpreter overlay is mixed with the background news broadcast, and there are four different interpreters and two different static backgrounds in the dataset.

Methodology

Our approach involves embedding the source sign language video fragments into a shared vector space using a convolutional neural network (CNN). We leverage an existing Word2Vec model trained on Dutch words to encode the target subtitles into the same vector space. We then calculate the matching scores between the source video fragments and the target words using dot products in the embedding vector space.
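As a rough illustration of this matching step (not the exact implementation from the dissertation), the sketch below shows how scores could be computed once both modalities live in the same embedding space. `video_embedding` stands in for the output of the video encoder, and `word_vectors` for rows of a pretrained Dutch Word2Vec matrix; the mean aggregation is an assumption.

```python
import numpy as np

d_emb = 100  # dimensionality of the shared embedding space

# Placeholders: in the real pipeline these come from the 3D CNN encoder
# and the pretrained Dutch Word2Vec model, respectively.
video_embedding = np.random.randn(d_emb)      # one encoded video fragment
word_vectors = np.random.randn(32, d_emb)     # vectors of the 32 subtitle words

# Matching score s_i between the fragment and every target word w_i
# is a dot product in the shared space.
scores = word_vectors @ video_embedding       # shape: (32,)

# Aggregate the per-word scores into a single fragment-level score
# (the exact aggregation is defined in the dissertation; mean is an assumption).
s_aggr = scores.mean()
```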

An overview of how the source is embedded into the existing Word2Vec space. The model embeds the source frames into the shared $d_\text{emb}$-dimensional vector space. To find a matching score $s_i$ between the source and every target word $w_i$, the dot-product is used.

To train the model, we employ a ranking-based objective that aims to maximize the similarity between the video representation and the target subtitles while minimizing the similarity with random subtitles.
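A common way to implement such a ranking objective is a hinge-style margin loss over the aggregated scores of the true subtitle versus a randomly sampled one. The PyTorch sketch below is a minimal illustration; the margin value and batch size are assumptions, not the settings from the dissertation.

```python
import torch

def margin_ranking_loss(s_pos, s_neg, margin=1.0):
    """Push the aggregated score of the matching subtitle (s_pos) above the
    score of a random subtitle (s_neg) by at least `margin`.
    The margin value is an assumption for illustration."""
    return torch.clamp(margin - s_pos + s_neg, min=0.0).mean()

# Dummy aggregated scores for a batch of 64 video fragments.
s_pos = torch.randn(64)   # scores against the true subtitles
s_neg = torch.randn(64)   # scores against randomly sampled subtitles
loss = margin_ranking_loss(s_pos, s_neg)
```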

Experiments

We preprocess the data by cropping the video frames, converting them to grayscale, and subtracting the previous frame to remove static information. We also filter and clean the subtitles, using a pretrained Word2Vec model for word embeddings. The dataset is split into training, validation, and test sets. We train the model using the Adam update rule and orthogonal weight initialization. The margin loss converges to around 0.9, and the accuracy of matching positive and negative targets reaches approximately 74%.
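The frame preprocessing (crop, grayscale, frame differencing) can be sketched as follows. This is an illustration only: the crop region, frame size, and grayscale weights are assumptions, not the values used in the experiments.

```python
import numpy as np

def preprocess_clip(frames, crop=(slice(0, 128), slice(0, 96))):
    """Illustrative preprocessing: crop the interpreter region, convert to
    grayscale, and subtract the previous frame to suppress static content.
    The crop coordinates are placeholders.

    frames: uint8 array of shape (T, H, W, 3)
    returns: float32 array of shape (T - 1, h, w)
    """
    # Crop each frame to the (assumed) interpreter region.
    cropped = frames[:, crop[0], crop[1], :]

    # Grayscale conversion with standard luminance weights.
    gray = cropped @ np.array([0.299, 0.587, 0.114], dtype=np.float32)

    # Temporal difference removes static information (background, overlay).
    diff = gray[1:] - gray[:-1]
    return diff.astype(np.float32)
```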

The margin loss and the $s_{\text{aggr}}^+ > s_{\text{aggr}}^-$ accuracy over time during the training phase. Note that one iteration consists of 640 updates, but does not cover a full pass through the dataset.
The architecture of the 3D convolutional neural network used to encode a video fragment into a 100-dimensional embedding.
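The exact layer configuration is given in the dissertation figure; purely to illustrate the idea of a 3D CNN that convolves over space and time and ends in a 100-dimensional embedding, a PyTorch sketch with assumed layer sizes could look like this:

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """Illustrative 3D CNN mapping a clip of grayscale difference frames to a
    100-dimensional embedding. Layer sizes and depth are assumptions, not the
    architecture from the dissertation."""

    def __init__(self, d_emb=100):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(64, d_emb)

    def forward(self, clip):
        # clip: (batch, 1, T, H, W)
        h = self.features(clip).flatten(1)
        return self.fc(h)

# Example: a batch of 2 clips, each 16 difference frames of 64x64 pixels.
emb = VideoEncoder()(torch.randn(2, 1, 16, 64, 64))   # shape: (2, 100)
```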

Results

The model shows promising results in predicting the general theme of the video fragments in some cases, with an estimated 25% of fragments yielding sensible results in terms of topic. The model also learns certain key points in the vector space, with specific neighboring words consistently emerging for different topics. However, the model's performance is still far from accurate translation between sign language and written Dutch.
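The nearest-neighbor inspection behind the table below can be reproduced by comparing a predicted fragment embedding against the Word2Vec vocabulary. A sketch using gensim (which ranks by cosine similarity rather than the raw dot product used for the matching scores); the model path and the fragment embedding are placeholders:

```python
from gensim.models import KeyedVectors
import numpy as np

# Load the pretrained Dutch Word2Vec vectors (path is a placeholder).
word_vectors = KeyedVectors.load_word2vec_format("dutch_word2vec.bin", binary=True)

# Placeholder for the 100-dimensional embedding predicted for one test fragment.
fragment_embedding = np.random.randn(100).astype(np.float32)

# Top-10 most similar words in the Word2Vec space.
for word, similarity in word_vectors.similar_by_vector(fragment_embedding, topn=10):
    print(f"{word}\t{similarity:.3f}")
```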

A list of cherry-picked samples from the test set. The Target is the group of 32 subtitle words surrounding the fragment. The Neighbors are the top-10 most similar words in the Word2Vec space. The Gloss is a list of the signs performed in the fragment.

Conclusion

We have developed a model that embeds sign language video fragments into a shared vector space with Dutch subtitles. While the model shows some ability to predict the topic of video fragments and learns specific patterns, the overall results are not yet at the level of accurate translation. Future improvements could involve hand tracking, using pretrained language models for encoding groups of words or sentences, incorporating sign language corpora in training, and exploring larger models.



© 2024 Lionel Pigou