This project is a bidirectional translation system designed to bridge the communication gap between spoken language and Indian Sign Language (ISL). It provides a seamless, real-time solution for two-way communication, enabling spoken language users to understand ISL and ISL users to understand spoken language.
The system integrates speech recognition, Natural Language Processing (NLP), and deep learning-based gesture recognition to function as a complete communication aid.
- Bidirectional Translation: Provides both Speech-to-Sign and Sign-to-Speech functionality.
- Speech-to-Sign Mode: Converts spoken English into animated Indian Sign Language (ISL) gestures (GIFs).
- Sign-to-Speech Mode: Uses a webcam to recognize live ISL hand gestures and converts them into text and natural-sounding spoken audio.
- Hybrid Online/Offline Support:
- Online: Utilizes Google Speech API for high-accuracy speech recognition.
- Offline: Employs CMU Sphinx for on-device, English-only speech recognition when no internet is available.
- Real-time Processing: Optimized for high-speed, low-latency performance using MediaPipe for hand tracking and TensorFlow Lite for efficient gesture classification.
- Intelligent Error Handling: If a spoken phrase doesn't have a direct sign in the database, the system defaults to spelling the words out using the alphabetic ISL representation.
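The fingerspelling fallback can be sketched as follows. This is a minimal illustration, not the project's actual code: the `KNOWN_SIGNS` set stands in for the real GIF database, and the function name `gifs_for_phrase` is an assumption.

```python
# Hypothetical sketch of the fingerspelling fallback: if a phrase has no
# direct sign in the database, spell it out with alphabet GIFs instead.
KNOWN_SIGNS = {"hello", "thank you"}  # stand-in for the real GIF database

def gifs_for_phrase(phrase: str) -> list[str]:
    """Return the GIF names to play for a phrase, spelling unknown words."""
    phrase = phrase.lower().strip()
    if phrase in KNOWN_SIGNS:
        return [phrase + ".gif"]
    # No direct sign: fall back to letter-by-letter alphabet GIFs (a.gif, ...)
    return [c + ".gif" for word in phrase.split() for c in word if c.isalpha()]
```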
The system operates in two distinct modes, supported by a user-friendly Tkinter interface. The overall data flow is illustrated below:
```mermaid
graph TD
    User(User)

    %% Define Subgraphs
    subgraph Backend
        %% Path 1: Speech-to-Sign
        Speak[Speak words or phrase]
        SpeechRec["Speech Recognition (Google Speech API / CMU Sphinx)"]
        LangDetect["Language Detection (Google Translator API)"]
        Translate["Translation (Transformer-based Model)"]
        GenerateGIF[Generate GIF from Dataset]
        SpeakAgain{Speak Again?}

        %% Path 2: Sign-to-Speech
        Webcam[Start Webcam]
        Keypoints["Hand Keypoint Input (MediaPipe Hand Tracking)"]
        Features["Feature Extraction (Keypoint Angles & Distances)"]
        Detect[Detect hand gesture using model]
        GenerateSpeech["Generate Speech (Google TTS)"]
        StartAgain{Start Again?}
    end

    subgraph Storage
        Dataset[GIF Dataset]
    end

    %% Define End State
    ExitApp[Exit]
    End((End))

    %% Define Connections
    User -- Selects Speech-to-Sign --> Speak
    Speak --> SpeechRec
    SpeechRec --> LangDetect
    LangDetect --> Translate
    Translate --> GenerateGIF
    Dataset -- Provides GIFs --> GenerateGIF
    GenerateGIF -- Displays GIF to --> User
    GenerateGIF --> SpeakAgain
    SpeakAgain -- Yes --> Speak
    SpeakAgain -- No --> ExitApp
    User -- Selects Sign-to-Speech --> Webcam
    Webcam -- Captures User's Gesture --> Keypoints
    Keypoints --> Features
    Features --> Detect
    Detect --> GenerateSpeech
    GenerateSpeech -- Plays Audio to --> User
    GenerateSpeech --> StartAgain
    StartAgain -- Yes --> Webcam
    StartAgain -- No --> ExitApp
    ExitApp --> End
```
This mode converts spoken words into visual ISL gestures.
- Speech Input: The user speaks a phrase. The audio is captured using the microphone.
- Speech Recognition: The system uses the Google Speech API (online) or CMU Sphinx (offline) to transcribe the audio into text.
- NLP Processing: The transcribed text undergoes NLP processing to make it compatible with ISL grammar. This includes removing stop words and adjusting tenses.
- Gesture Generation: The system searches its database for the corresponding ISL animation (GIF) for the processed text and displays it on the screen.
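The NLP step above can be sketched roughly as follows. The stop-word list and function name here are illustrative assumptions, not the project's actual scripts; real ISL grammatical alignment also handles tense and word order.

```python
# Illustrative sketch of the NLP step: drop English stop words so the text
# better matches ISL grammar, which omits articles and auxiliary verbs.
STOP_WORDS = {"is", "are", "am", "a", "an", "the", "to", "of"}

def to_isl_gloss(sentence: str) -> list[str]:
    """Convert an English sentence into a rough ISL-style word sequence."""
    words = sentence.lower().split()
    return [w for w in words if w not in STOP_WORDS]
```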
This mode converts ISL gestures from a webcam into spoken audio.
- Video Capture: The system activates the webcam to monitor the user's hand movements.
- Hand Tracking: MediaPipe is used to detect and track 21 key hand landmarks in real-time from the video feed.
- Gesture Classification: The extracted landmark data is fed into a TensorFlow Lite-based CNN model, which classifies the hand gesture.
- Speech Output: The classified gesture is mapped to its corresponding English word or phrase. This text is then converted into natural-sounding speech using the Google Text-to-Speech (TTS) API.
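The feature-extraction step (keypoint angles and distances) can be illustrated with plain geometry on MediaPipe's 21 hand landmarks. The specific distance/angle choices below are an assumption for illustration; only the landmark indexing (0 = wrist, fingertips at 4, 8, 12, 16, 20) follows MediaPipe's hand model.

```python
import math

def distance(p, q):
    """Euclidean distance between two (x, y) landmarks."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def angle(a, b, c):
    """Angle at point b (degrees) formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos = max(-1.0, min(1.0, (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)))
    return math.degrees(math.acos(cos))

def extract_features(landmarks):
    """landmarks: list of 21 (x, y) tuples -> distance/angle feature vector."""
    wrist = landmarks[0]
    tips = [4, 8, 12, 16, 20]  # thumb..pinky fingertip indices
    feats = [distance(wrist, landmarks[i]) for i in tips]
    # Bend angle at each finger's middle joint (base, mid, tip indices)
    for base, mid, tip in [(5, 6, 8), (9, 10, 12), (13, 14, 16), (17, 18, 20)]:
        feats.append(angle(landmarks[base], landmarks[mid], landmarks[tip]))
    return feats
```

A vector like this, normalized for hand size and position, is the kind of input a lightweight TFLite classifier can score in real time.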
The dataset used for training the Sign-to-Speech model can be found here: WLASL Processed Dataset
To expand the dataset with new signs, follow these steps:
- Place your `.mp4` video files into the `Sign-to-speech/data/videos` directory.
- Ensure the video file names are descriptive (e.g., `hello-01.mp4`, `goodbye-01.mp4`). The system will use the part before the first hyphen as the sign name.
- The system will automatically process these videos and extract hand landmarks when the `load_dataset()` function is executed (e.g., during model training or reference sign creation).
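The filename convention above (sign name before the first hyphen) can be sketched as:

```python
from pathlib import Path

# Minimal sketch of mapping a video filename to its sign label: the text
# before the first hyphen in the filename is the sign name. The helper name
# is an assumption, not part of the project's load_dataset() code.
def sign_name(video_path: str) -> str:
    return Path(video_path).stem.split("-", 1)[0]
```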
To run the Universal Sign Language Communicator, execute the `main.py` file located in the root directory of the project. This will launch the graphical user interface (GUI), from which you can select either "Speech to Sign" or "Sign to Speech" mode.
It is highly recommended to use a virtual environment to manage dependencies.
```shell
# Create a virtual environment
python -m venv venv

# Activate the virtual environment
# On Windows:
.\venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate

# Install dependencies
pip install -r Requirements.txt

# Run the application
python main.py
```

- Core Language: Python
- Gesture Recognition: MediaPipe
- Deep Learning Model: TensorFlow Lite (TFLite)
- Speech-to-Text (Online): Google Speech API
- Speech-to-Text (Offline): CMU Sphinx
- Text-to-Speech: Google Text-to-Speech (gTTS) API
- User Interface (GUI): Tkinter
- NLP: Custom scripts for ISL grammatical alignment.
The system was evaluated for accuracy, speed, and efficiency.
- Sign-to-Speech Accuracy: The gesture recognition model achieved 96.4% classification accuracy on the ISL dataset.
- Speech-to-Sign Accuracy: The module achieved a 90.8% translation accuracy, with a low Word Error Rate (WER) of 7.9% and a high BLEU score of 0.81.
- Real-time Latency:
  - Sign-to-Speech: ~0.75 seconds per gesture.
  - Speech-to-Sign: ~1.1 seconds per phrase.
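For reference, the Word Error Rate metric reported above is conventionally computed as the word-level Levenshtein distance between reference and hypothesis, divided by the reference length. A minimal sketch (not the project's evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```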
Future plans to enhance this system include:
- Expanding the Database: Adding a wider range of ISL phrases, sentences, and expressions.
- Improved Offline Mode: Training on-device models for speech recognition and translation to reduce internet dependency.
- Enhanced Robustness: Improving speech recognition to handle diverse accents and noisy environments.
- 3D Avatars: Integrating 3D avatar-based sign synthesis for more expressive and fluid ISL translations.
We were proud to virtually present this project at the 16th International IEEE Conference on Computing, Communication and Networking Technologies (ICCCNT 2025).
