This repository implements a Speech Emotion Recognition (SER) and Automatic Speech Recognition (ASR) pipeline built on pre-trained Wav2Vec2 models.
The goal is to detect the emotional state from raw speech audio and transcribe spoken content into text.
Tasks addressed:
- Emotion Classification (e.g., Happy, Sad, Angry)
- Audio-to-Text Transcription (using an ASR model)
We combined several popular open-source emotional speech datasets:
- CREMA-D (Crowd-sourced Emotional Multimodal Actors Dataset)
- RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song)
- TESS (Toronto Emotional Speech Set)
- SAVEE (Surrey Audio-Visual Expressed Emotion)
All datasets were downloaded automatically using the `kagglehub` library.
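For reference, a minimal download sketch; the Kaggle dataset slugs below are assumptions (common community uploads), not pinned by this README:

```python
import kagglehub

# Hypothetical dataset slugs; verify against the Kaggle pages before use.
slugs = {
    "CREMA-D": "ejlok1/cremad",
    "RAVDESS": "uwrfkaggler/ravdess-emotional-speech-audio",
    "TESS": "ejlok1/toronto-emotional-speech-set-tess",
    "SAVEE": "ejlok1/surrey-audiovisual-expressed-emotion-savee",
}
paths = {name: kagglehub.dataset_download(slug) for name, slug in slugs.items()}
print(paths)  # local cache directories, one per dataset
```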
The unified label set consists of 8 emotion classes:
- Neutral
- Happy
- Sad
- Angry
- Fear
- Disgust
- Surprise
- Calm
Each dataset has its own labeling format; a unified mapping was applied to create a consistent labeling system across datasets.
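As an illustration, a sketch of such a mapping for two of the corpora; the codes follow each dataset's published filename scheme, but the notebook's exact tables may differ:

```python
# CREMA-D encodes emotion as the 3rd underscore-separated field,
# e.g. 1001_DFA_ANG_XX.wav -> ANG
CREMA_D = {"NEU": "Neutral", "HAP": "Happy", "SAD": "Sad",
           "ANG": "Angry", "FEA": "Fear", "DIS": "Disgust"}

# RAVDESS encodes emotion as the 3rd dash-separated field,
# e.g. 03-01-05-01-02-01-12.wav -> 05 (Angry)
RAVDESS = {"01": "Neutral", "02": "Calm", "03": "Happy", "04": "Sad",
           "05": "Angry", "06": "Fear", "07": "Disgust", "08": "Surprise"}

def unify_label(dataset: str, filename: str):
    """Map a dataset-specific filename to the shared 8-class label set."""
    if dataset == "CREMA-D":
        return CREMA_D.get(filename.split("_")[2])
    if dataset == "RAVDESS":
        return RAVDESS.get(filename.split("-")[2])
    return None  # TESS/SAVEE are handled analogously
```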
- `.wav` files were traversed recursively.
- Emotion labels were extracted based on filename patterns.
- Only 8 main emotions were retained.
- Samples were split into Train/Validation/Test sets:
  - Train: 70%
  - Validation: 20%
  - Test: 10%
- Class distribution for each split was visualized using bar plots.
The splits were saved as CSV files: `train_dataset.csv`, `valid_dataset.csv`, and `test_dataset.csv`.
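A sketch of the traversal and split, reusing a helper like the hypothetical `unify_label` function above (`data/CREMA-D` is a placeholder path, and the stratified split is an assumption for keeping class balance):

```python
import os
import pandas as pd
from sklearn.model_selection import train_test_split

rows = []
for dirpath, _, files in os.walk("data/CREMA-D"):    # placeholder root
    for name in files:
        if name.lower().endswith(".wav"):
            label = unify_label("CREMA-D", name)
            if label is not None:                    # keep the 8 emotions only
                rows.append({"path": os.path.join(dirpath, name),
                             "emotion": label})
df = pd.DataFrame(rows)

# 70/20/10: hold out 30%, then split it 2:1 into validation/test
train_df, rest = train_test_split(
    df, test_size=0.30, stratify=df["emotion"], random_state=42)
valid_df, test_df = train_test_split(
    rest, test_size=1/3, stratify=rest["emotion"], random_state=42)

train_df.to_csv("train_dataset.csv", index=False)
valid_df.to_csv("valid_dataset.csv", index=False)
test_df.to_csv("test_dataset.csv", index=False)
```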
Emotion classification (SER) model:
- Base Model: `facebook/wav2vec2-large-xlsr-53`
- Head: Classification head for 8 emotion classes.
- Feature Extractor: `Wav2Vec2FeatureExtractor` from Hugging Face Transformers.
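One way to instantiate this pair (a sketch; the notebook may attach its classification head differently):

```python
from transformers import (Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForSequenceClassification)

# Pre-trained encoder with a freshly initialized 8-class head
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53", num_labels=8)
extractor = Wav2Vec2FeatureExtractor.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53")
```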
The model was fine-tuned on the emotion labels with:
- Input sampling rate: 16 kHz
- Input length: 2 seconds (32,000 samples)
- Padding or trimming applied as needed
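With the `extractor` loaded above, the 2-second constraint can be enforced like this (a sketch; `sample.wav` is a placeholder):

```python
import librosa

speech, _ = librosa.load("sample.wav", sr=16_000)   # resample to 16 kHz
inputs = extractor(
    speech,
    sampling_rate=16_000,
    max_length=32_000,        # 2 s × 16 kHz
    padding="max_length",     # pad shorter clips up to 32,000 samples
    truncation=True,          # trim longer clips down to 32,000 samples
    return_tensors="pt",
)
```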
ASR (speech-to-text) model:
- Model: `facebook/wav2vec2-large-960h`
- Processor: `Wav2Vec2Processor`
- Transcribes input audio to English text.
- The Trainer API (from Hugging Face) was used for model fine-tuning.
- Loss Function: CrossEntropyLoss
- Optimizer: AdamW
- Training Arguments:
  - Epochs: 20
  - Learning Rate: 2e-5
  - Batch Size: 32
  - Mixed Precision: Enabled (fp16)
  - Evaluation Strategy: Every epoch
The model checkpoints and results were saved under the `./results_sentiment` directory.
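Put together, the training setup roughly corresponds to the sketch below, where `train_ds`/`valid_ds` stand for the prepared datasets:

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results_sentiment",
    num_train_epochs=20,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    fp16=True,                      # mixed precision
    evaluation_strategy="epoch",    # evaluate every epoch
)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=valid_ds)
trainer.train()
```

AdamW and cross-entropy need no extra wiring here: the Trainer defaults to AdamW, and the classification head computes CrossEntropyLoss internally for single-label classification.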
After training, the model was evaluated on the test set using:
- Accuracy
- Precision
- Recall
- F1-Score
- Confusion Matrix
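These can all be computed with scikit-learn; in the sketch below, `y_true`/`y_pred` are placeholders for the test-set labels and model predictions:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

y_true = ["Happy", "Sad", "Angry", "Happy"]    # placeholder values
y_pred = ["Happy", "Angry", "Angry", "Happy"]  # placeholder values
labels = sorted(set(y_true))

print("Accuracy:", accuracy_score(y_true, y_pred))
print(classification_report(y_true, y_pred))   # precision / recall / F1 per class

cm = confusion_matrix(y_true, y_pred, labels=labels)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.xlabel("Predicted"); plt.ylabel("True")
plt.show()
```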
Random samples from the test set were selected and predictions were made.
For each sample:
- The true label vs. the predicted label is printed.
- Audio playback is available directly inside the notebook (using `IPython.display.Audio`).
Example:
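A minimal sketch of this check for a single sample, assuming the fine-tuned `model` and `extractor` from above, a `test_df` with `path`/`emotion` columns, and that `model.config.id2label` was set to the 8 emotion names:

```python
import librosa
import torch
from IPython.display import Audio, display

row = test_df.sample(1).iloc[0]                      # one random test sample
speech, _ = librosa.load(row["path"], sr=16_000)
inputs = extractor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred = model.config.id2label[logits.argmax(-1).item()]
print(f"True: {row['emotion']}  |  Predicted: {pred}")
display(Audio(speech, rate=16_000))                  # in-notebook playback
```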
In addition to emotion detection, the project includes speech-to-text transcription using a separately pre-trained ASR model:
- Audio files (max 2.5 s) were processed and decoded into English text.
- Transcriptions were printed alongside audio playback.
Example:
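A transcription sketch using the standard CTC decode path (`sample.wav` is a placeholder; the printed text is whatever the model decodes):

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
asr_model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

speech, _ = librosa.load("sample.wav", sr=16_000, duration=2.5)  # cap at 2.5 s
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = asr_model(inputs.input_values).logits
ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(ids)[0])   # decoded English transcription
```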
Built with:
- Python 3.10
- PyTorch for deep learning model implementation
- Hugging Face Transformers for pre-trained model loading
- Librosa for audio processing
- Scikit-learn for evaluation metrics
- Seaborn/Matplotlib for visualization
- kagglehub for automatic dataset download
To run the project:
- Install dependencies:
  ```bash
  pip install torch librosa transformers kagglehub scikit-learn matplotlib seaborn pandas
  ```
- Clone the repository and navigate inside:
  ```bash
  git clone https://github.com/Seyed07/Speech-Emotion-Recognition-using-Wav2Vec2.git
  cd Speech-Emotion-Recognition-using-Wav2Vec2
  ```
- Run the Jupyter notebook:
  ```bash
  jupyter notebook
  ```
- Execute all cells step-by-step to train and evaluate the models.
For any questions or suggestions, feel free to create an issue or contact me via email or GitHub.