Sign to Language Translation

This project explores the domain of Sign Language Translation (SLT) using the How2Sign dataset. Initially, we implemented a Transformer-based model trained on I3D video features, following prior research that benchmarks SLT performance on large-scale datasets.

Building upon this foundation, we introduced several enhancements to improve translation accuracy. First, we replaced the conventional I3D features with VideoSwin features pre-trained on BOBSL, investigating their impact on translation performance. Additionally, we incorporated rephrasing techniques during both training and inference, ranging from simple synonym replacement to advanced neural-based methods. This augmentation strategy aimed to diversify the dataset, improve robustness, and refine the fluency of generated translations.

These innovations contribute to advancing SLT by exploring new feature representations and data augmentation strategies, ultimately improving the quality of sign language translation into spoken text.


Introduction

The translation of sign language into spoken language presents unique challenges, primarily due to the visual and dynamic nature of sign languages like American Sign Language (ASL). These languages combine hand gestures, facial expressions, and body movements, each crucial for conveying meaning. This complexity poses significant challenges for computational translation models, which often struggle to capture the nuances of these visual languages. The How2Sign dataset represents a significant advancement in this field. It offers a comprehensive collection of over 80 hours of ASL videos spanning a wide range of topics, providing a larger vocabulary and more varied expressions than previous datasets like Phoenix-2014T. The model from the paper builds on this dataset by using a Transformer with attention mechanisms to process video data for translation. The Transformer, known for its effectiveness in natural language processing, is adapted to handle the sequential and complex nature of sign language videos.

Data

The How2Sign dataset is considerably richer than the former state-of-the-art dataset, Phoenix-2014T. It includes more than 80 hours of video recordings, 3 different angles, English transcripts, 10 different topics, and more than 7,000 words. Another particularity of How2Sign is that it does not include glosses, which formerly served as an intermediary between signs and their translation.

Model Architecture and pipeline

The use of Transformers and attention mechanisms is crucial for translating sign language videos into text. Transformers are adept at handling sequential data, making them ideal for interpreting the time-based sequences of sign language. The attention mechanism within the Transformer is particularly essential, as it allows the model to focus on specific parts of the video sequence at a time, effectively capturing the nuances and subtleties of sign language gestures and expressions.

The model is a Transformer with 6 encoder layers that use attention mechanisms to process I3D features extracted from the video sequences. This attention allows the model to focus on significant gestures and movements within the sequence, effectively capturing the nuances of sign language. After feature extraction, the video data is tokenized and passed through the encoder layers, where attention-driven processing occurs. In parallel, the textual data is preprocessed (lowercasing and tokenization) before entering the decoder. The decoder, composed of 3 layers, translates these tokenized inputs into text. The final step post-processes the decoder's output through detokenization and truecasing, so that the translated text is correctly cased and readable. Together, the Transformer and its attention mechanisms turn the rich, contextual visual information of sign language into coherent text.
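The text-side steps described above (lowercasing and tokenization before the decoder, detokenization and truecasing afterwards) can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the tokenizer is a simple word/punctuation splitter rather than whatever tokenizer the original pipeline uses, and the truecasing rule is a naive stand-in for a trained truecaser.

```python
import re


def preprocess(sentence: str) -> list[str]:
    """Lowercase a sentence and split it into word and punctuation tokens."""
    # Words may contain one internal apostrophe (e.g. "i'm"); any other
    # non-space, non-word character becomes its own token.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence.lower())


def postprocess(tokens: list[str]) -> str:
    """Detokenize and apply a naive truecasing rule (assumption, not a real truecaser)."""
    text = " ".join(tokens)
    # Re-attach punctuation to the preceding word.
    text = re.sub(r"\s+([.,!?;:])", r"\1", text)
    # Naive truecasing: capitalize the pronoun "i" and the first character.
    text = re.sub(r"\bi\b", "I", text)
    return text[:1].upper() + text[1:]
```

A round trip through both functions recovers a readable sentence from the lowercased token sequence, which is the role detokenization and truecasing play at the end of the pipeline.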

Metrics

The BLEU score and reduced BLEU (rBLEU) are used to assess the performance of the sign language translation models. BLEU measures the similarity between the machine-generated translation and the ground-truth text through the overlap of their n-grams. However, it can be inflated by repetitive patterns that may not be meaningful in context. To address this, rBLEU excludes certain common but less semantically important words, such as articles and prepositions, from the evaluation. This metric aims to give a more accurate picture of the model's ability to capture meaningful content, making it particularly suited to the evaluation of sign language translation, where context and semantic accuracy are crucial.
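To make the relationship between the two metrics concrete, here is a minimal sketch of an unsmoothed sentence-level BLEU and of rBLEU as "BLEU after removing blacklisted words". The blacklist below is hypothetical (the actual word list is not given here), and reported scores are normally computed at corpus level with a standard tool rather than with a hand-written function like this one.

```python
import math
from collections import Counter

# Hypothetical blacklist of articles and prepositions; the real rBLEU list may differ.
BLACKLIST = {"a", "an", "the", "of", "in", "on", "to", "at", "for", "with"}


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0..100 scale."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty for hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity * math.exp(sum(log_precisions) / max_n)


def strip_blacklist(sentence):
    """Drop blacklisted words from a sentence."""
    return " ".join(w for w in sentence.lower().split() if w not in BLACKLIST)


def rbleu(hypothesis, reference, max_n=4):
    """BLEU computed after removing blacklisted words from both sentences."""
    return bleu(strip_blacklist(hypothesis), strip_blacklist(reference), max_n)
```

Because the blacklisted words are removed from both the prediction and the reference, matches on function words alone no longer contribute to the score.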

Training and Performance

We trained our model on a T4 GPU within a Google Cloud VM, using the specified hyperparameters: a vocabulary size of 7,000 words, a batch size of 32, and a total of 108 epochs. The model's architecture remained consistent with that detailed in the paper, featuring six encoder layers, three decoder layers, four attention heads, a feed-forward network (FFN) size of 256, and a hidden size of 1024. After close to 20 hours of training to complete the epochs, the model obtained a BLEU score of 8.13 and an rBLEU score of 2.24 on the test set.
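For reference, the hyperparameters listed above can be collected in a single configuration object. This is only a sketch: the class and field names are hypothetical and do not correspond to the actual training configuration file; the values are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLTConfig:
    """Hyperparameters as stated in the text; field names are illustrative."""
    vocab_size: int = 7000
    batch_size: int = 32
    epochs: int = 108
    encoder_layers: int = 6
    decoder_layers: int = 3
    attention_heads: int = 4
    ffn_size: int = 256
    hidden_size: int = 1024

    def head_dim(self) -> int:
        """Per-head dimension; the hidden size must divide evenly across heads."""
        if self.hidden_size % self.attention_heads != 0:
            raise ValueError("hidden_size must be divisible by attention_heads")
        return self.hidden_size // self.attention_heads
```

Keeping the values in one frozen object makes it easy to reuse exactly the same settings across the later experiments, which all retain the original architecture and hyperparameters.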

Video features

In the architecture of the SLT model, video features play a pivotal role in the translation process. These features are extracted from a series of images that, when sequenced together, form a video corresponding to a sentence in sign language. Specifically, the model uses pre-trained I3D (Inflated 3D ConvNet) features, which are adept at capturing spatial and temporal information from the video data. This is essential because it allows the model to understand and encode the dynamic and complex motions of sign language. The I3D model inflates the filters and pooling kernels of a 2D ConvNet into 3D, enabling it to learn from both the appearance and motion features of the sign language video sequences. This extraction of rich, multi-dimensional features is a critical step in accurately modeling and translating sign language.

From I3D features to VideoSwin

Replacing I3D features with VideoSwin features in SLT models presents an intriguing prospect due to the architectural differences between the two. While I3D features capture spatiotemporal information through 3D convolutions inflated from 2D ConvNet architectures, VideoSwin features come from a hierarchical Transformer whose inductive biases are conducive to modeling long-range interactions, providing a more refined understanding of the temporal dynamics in videos. This ability to capture complex dependencies across video frames could potentially lead to more accurate representations of the nuanced gestures and movements in sign language, which are critical for effective translation. The shift to VideoSwin features is an exploration of the capacity of Transformers to handle the intricacies of sign language translation.

Incorporating VideoSwin features into our existing codebase required care because their format differs from that of the I3D features. Unlike the I3D features, which were provided per clip or sentence, VideoSwin features were extracted for entire videos. To bridge this gap, we began by downloading the timestamps of the How2Sign sentences from the dataset's official repository. These timestamps were then converted to VideoSwin feature indices, which required a conversion function that accounts for the features being extracted at a stride of 2 with a sliding window of 16 frames.
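A conversion of this kind could look like the sketch below. It is not the project's actual function: the function names are hypothetical, the frame rate `fps` must be supplied by the caller (it is not stated here), and the alignment rule (keep the features whose 16-frame window is centered inside the sentence) is one reasonable convention among several.

```python
import math


def num_features_for(num_frames: int, stride: int = 2, window: int = 16) -> int:
    """Number of sliding-window features produced for a video of num_frames frames."""
    if num_frames < window:
        return 0
    return (num_frames - window) // stride + 1


def sentence_to_feature_range(start_s, end_s, fps, num_features, stride=2, window=16):
    """Map sentence timestamps (seconds) to a half-open range of feature indices.

    Feature k is assumed to cover frames [k * stride, k * stride + window);
    a feature is kept when the center of its window falls inside the sentence.
    """
    start_frame = math.floor(start_s * fps)
    end_frame = math.ceil(end_s * fps)
    half = window / 2
    first = math.ceil((start_frame - half) / stride)
    last = math.ceil((end_frame - half) / stride)
    # Clamp to the available features and keep at least one feature per sentence.
    first = min(max(first, 0), num_features - 1)
    last = min(max(last, first + 1), num_features)
    return first, last
```

The returned pair can be used to slice the per-video feature array, for example `features[first:last]`, yielding a per-sentence sub-sequence.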

Post conversion, we saved these sub-sequences of features in a manner compatible with the I3D feature repository structure, thus enabling the seamless integration of VideoSwin features into the SLT model's pipeline.

Results

We trained the model on a T4 GPU for 15 hours, strictly adhering to the model architecture and hyperparameters previously established. The conversion function was applied to align the VideoSwin features with our model framework, enabling us to proceed with training. However, the outcomes were disappointing. Our model achieved a BLEU score of less than 0.1 and a reduced BLEU score of less than 0.01 on both the validation and test sets.

This underperformance could be anticipated, considering that the VideoSwin model was not fine-tuned on the How2Sign dataset, unlike the I3D model, which was specifically fine-tuned for this dataset. This lack of dataset-specific fine-tuning likely contributed to the subpar results.

Reformulation module

The integration of a rephrasing module holds considerable promise for enhancing SLT models, particularly in augmenting text during training and testing. This augmentation can introduce linguistic diversity to the training process, potentially improving the model's generalization abilities. In the context of BLEU score evaluation, where the congruence of predicted sentences to reference text is crucial, rephrasing at test time could be invaluable. By comparing the predicted sentences against multiple rephrased ground truth sentences, the evaluation could become more forgiving for semantically correct but lexically diverse translations. This approach could result in a more comprehensive assessment of the model's translation accuracy, leading to improvements in BLEU scores.

Initially, our rephrasing module employed a simple synonym replacement strategy, which proved suboptimal because it occasionally altered the meaning of sentences. Recognizing this limitation, we pivoted to advanced language models, specifically GPT-2 and GPT-3. The process involved fine-tuning these models by prepending the prompt "reformulate:" to sentences; we then tokenized the prefix and the sentence with a GPT tokenizer and fed them into the pre-trained model. While GPT-2 was effective at rephrasing longer sentences, its performance on shorter ones was less impressive. Conversely, GPT-3 showed remarkable reformulation capabilities across all sentence lengths, although it was accessed through an API and its reformulation time was much longer.

Reformulation during Training

Rephrasing as a data augmentation technique during training offers several benefits, such as improving model generalization and mitigating overfitting. By introducing a range of linguistic variations, the model is less likely to memorize specific sentence structures and more likely to focus on the meaning. During the training phase, we aimed to enrich the linguistic variety within our dataset by applying rephrasing to the training set sentences destined for the decoder. We used the GPT-2 and GPT-3 models to generate 3 additional paraphrases per sentence, which were then stored in a text file. In each epoch, the decoder was fed either the original sentence or one of the three paraphrased versions. We trained two separate models: one using the GPT-2-generated reformulations and the other using those from GPT-3. Both models underwent a similar training duration of approximately 15 hours on a Google Cloud T4 GPU. However, creating the reformulated dataset was time-intensive, especially for the GPT-3 reformulations, due to the slower response time of the API calls.
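The per-epoch choice between the original sentence and its paraphrases can be sketched as below. The function names and the in-memory data layout are hypothetical, and uniform sampling over the four candidates is an assumption; the text does not state how the choice was weighted.

```python
import random


def pick_target(original, paraphrases, rng):
    """Pick the original sentence or one of its paraphrases uniformly at random."""
    return rng.choice([original, *paraphrases])


def epoch_targets(dataset, rng):
    """Build the decoder targets for one epoch.

    dataset: list of (original_sentence, [paraphrase, ...]) pairs.
    Returns one target sentence per example, resampled on every call.
    """
    return [pick_target(original, paraphrases, rng) for original, paraphrases in dataset]


# Example usage with a seeded generator for reproducibility.
rng = random.Random(0)
data = [("i will show you how", ["let me show you how", "i'll demonstrate", "here is how"])]
targets_epoch_1 = epoch_targets(data, rng)
targets_epoch_2 = epoch_targets(data, rng)
```

Resampling once per epoch means that, over many epochs, the decoder sees all variants of each sentence while the video features on the encoder side stay unchanged.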

Reformulation during Testing

At testing time, we explored the impact of the rephrasing module on the test and validation sets. For each translated sentence produced by the model, we generated 3 reformulations with both the GPT-2 and GPT-3 models. The BLEU and rBLEU scores were computed for each of the rephrased sentences, and the highest score was selected. This approach aimed to reduce result variability and emphasize meaning over word choice. By evaluating whether the model captures the essence of the sentence rather than just the literal words, we gain insight into its semantic accuracy. After training, we applied this method to the output files from the test and validation sets across three models: the original, one trained with GPT-2 rephrasing, and another trained with GPT-3 rephrasing, to assess the effectiveness of our rephrasing strategy. For each model, we produced two reformulation files (one from GPT-2 and one from GPT-3) from its outputs on the test and validation sets.
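The selection step described above (score every candidate against the reference and keep the best) can be sketched as follows. The function names are hypothetical, and the default scoring function is a simple unigram F1 used only as a self-contained stand-in; in the actual evaluation the score would be BLEU or rBLEU.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Stand-in sentence score in [0, 1]; a real run would plug in BLEU or rBLEU."""
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def best_of_reformulations(prediction, reformulations, reference, score_fn=unigram_f1):
    """Score the prediction and its reformulations; return (best_score, best_sentence)."""
    candidates = [prediction, *reformulations]
    scored = [(score_fn(candidate, reference), candidate) for candidate in candidates]
    # max() keeps the first candidate on ties, so the original prediction wins ties.
    return max(scored, key=lambda pair: pair[0])
```

Including the original prediction among the candidates guarantees that the selected score is never lower than the score without rephrasing.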

Results

After implementing data augmentation through rephrasing for the training set, we developed two additional models: one using GPT-2 reformulations and the other using GPT-3 reformulations. The training retained the original hyperparameters and architecture but varied the sentences provided to the decoder. Training took approximately 20 hours per model on a T4 GPU. For each output sentence of each model, we generated 3 rephrasings with both the GPT-2 and GPT-3 models and computed BLEU and rBLEU scores for the validation and test files, selecting the highest scores from the rephrased and original sentence groups.

The results indicated a relative improvement, particularly with the GPT-3 reformulations on the test set, suggesting that rephrasing at test time contributed more to performance enhancements than training set rephrasing. However, the marginal difference between the models does not conclusively indicate whether rephrasing or inherent model characteristics were the primary drivers of performance.

Conclusion

In our effort to enhance the original SLT model, we integrated VideoSwin features and implemented rephrasing modules during both the training and testing phases. Our experiments with VideoSwin features did not yield the anticipated outcomes, largely because the features were not fine-tuned on the How2Sign dataset, a process that the I3D model underwent and that enhanced its performance. On the other hand, the rephrasing approach showed promising results, especially when applying the GPT-3 model to the test set sentences, which suggested potential slight improvements. We also did not reach the state-of-the-art result reported in the How2Sign paper, a BLEU score of 8.03 on the test set. For a more definitive analysis of the rephrasing module's efficacy, expanding the quantity of GPT-3-generated phrases and retraining could be informative. Furthermore, exploring other avenues of improvement, such as leveraging Large Language Models (LLMs) or fine-tuning VideoSwin features, might offer additional gains. Equally important is the consideration of alternative metrics like ROUGE or BERTScore, which could provide a more nuanced measure of model performance and shed light on whether data augmentation through rephrasing is more effective than initially observed with BLEU scores alone. The path forward offers many opportunities to refine our approach toward the overarching goal of more accurate and reliable sign language translation.