Commit 6963f1a

add speech recognition tutorial with transformers
1 parent 1facc7f commit 6963f1a

8 files changed

+2611
-0
lines changed

README.md

+1
@@ -46,6 +46,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [Conversational AI Chatbot with Transformers in Python](https://www.thepythoncode.com/article/conversational-ai-chatbot-with-huggingface-transformers-in-python). ([code](machine-learning/nlp/chatbot-transformers))
 - [How to Pretrain BERT using Transformers in Python](https://www.thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python). ([code](machine-learning/nlp/pretraining-bert))
 - [How to Perform Machine Translation using Transformers in Python](https://www.thepythoncode.com/article/machine-translation-using-huggingface-transformers-in-python). ([code](machine-learning/nlp/machine-translation))
+- [Speech Recognition using Transformers in Python](https://www.thepythoncode.com/article/speech-recognition-using-huggingface-transformers-in-python). ([code](machine-learning/nlp/speech-recognition-transformers))
 - ### [Computer Vision](https://www.thepythoncode.com/topic/computer-vision)
 - [How to Detect Human Faces in Python using OpenCV](https://www.thepythoncode.com/article/detect-faces-opencv-python). ([code](machine-learning/face_detection))
 - [How to Make an Image Classifier in Python using TensorFlow and Keras](https://www.thepythoncode.com/article/image-classification-keras-python). ([code](machine-learning/image-classifier))
Binary file not shown.
Binary file not shown.
Binary file not shown.

machine-learning/nlp/speech-recognition-transformers/AutomaticSpeechRecognition_PythonCodeTutorial.ipynb

+2,457
Large diffs are not rendered by default.
@@ -0,0 +1,143 @@
# %%
# !pip install transformers==4.11.2 datasets soundfile sentencepiece torchaudio pyaudio

# %%
# import only the two classes this script uses instead of a wildcard import
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import torch
import soundfile as sf
# import librosa
import os
import torchaudio

# %%
# model_name = "facebook/wav2vec2-base-960h" # 360MB
model_name = "facebook/wav2vec2-large-960h-lv60-self" # 1.18GB

# the processor prepares raw audio and decodes predictions; the model outputs CTC logits
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# %%
# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2bjrop1.0.wav"
# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2bjrop1.1.wav"
# audio_url = "http://www.fit.vutbr.cz/~motlicek/sympatex/f2btrop6.0.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/16-122828-0002.wav"
audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/30-4447-0004.wav"
# audio_url = "https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/speech-recognition/7601-291468-0006.wav"
# audio_url = "https://file-examples-com.github.io/uploads/2017/11/file_example_WAV_1MG.wav"
# audio_url = "http://www0.cs.ucl.ac.uk/teaching/GZ05/samples/lathe.wav"

# %%
# load our wav file
speech, sr = torchaudio.load(audio_url)
# drop the channel dimension (assumes mono audio)
speech = speech.squeeze()
# or using librosa
# speech, sr = librosa.load(audio_file, sr=16000)
sr, speech.shape

# %%
# resample from the audio's original sampling rate to 16000 Hz, the rate the model expects
resampler = torchaudio.transforms.Resample(sr, 16000)
speech = resampler(speech)
speech.shape

# %%
# prepare the waveform as model input values
input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
input_values.shape

# %%
# perform inference (no gradients are needed)
with torch.no_grad():
    logits = model(input_values)["logits"]
logits.shape

# %%
# use argmax to get the predicted IDs
predicted_ids = torch.argmax(logits, dim=-1)
predicted_ids.shape

# %%
# decode the IDs to text
transcription = processor.decode(predicted_ids[0])
transcription.lower()

# %%
def get_transcription(audio_path):
    """Transcribe the audio file at `audio_path` and return lowercase text."""
    # load our wav file
    speech, sr = torchaudio.load(audio_path)
    # drop the channel dimension (assumes mono audio)
    speech = speech.squeeze()
    # or using librosa
    # speech, sr = librosa.load(audio_path, sr=16000)
    # resample from the audio's original sampling rate to 16000 Hz
    resampler = torchaudio.transforms.Resample(sr, 16000)
    speech = resampler(speech)
    # prepare the waveform as model input values
    input_values = processor(speech, return_tensors="pt", sampling_rate=16000)["input_values"]
    # perform inference (no gradients are needed)
    with torch.no_grad():
        logits = model(input_values)["logits"]
    # use argmax to get the predicted IDs
    predicted_ids = torch.argmax(logits, dim=-1)
    # decode the IDs to text
    transcription = processor.decode(predicted_ids[0])
    return transcription.lower()

# %%
get_transcription(audio_url)

# %%
import pyaudio
import wave

# name of the output file to record into
filename = "recorded.wav"
# read the audio in chunks of 1024 samples
chunk = 1024
# sample format: 16-bit integers
FORMAT = pyaudio.paInt16
# mono, change to 2 if you want stereo
channels = 1
# 16000 samples per second, the rate the model expects
sample_rate = 16000
record_seconds = 10
# initialize PyAudio object
p = pyaudio.PyAudio()
# open stream object as input & output
stream = p.open(format=FORMAT,
                channels=channels,
                rate=sample_rate,
                input=True,
                output=True,
                frames_per_buffer=chunk)
frames = []
print("Recording...")
for i in range(int(sample_rate / chunk * record_seconds)):
    data = stream.read(chunk)
    # if you want to hear your voice while recording
    # stream.write(data)
    frames.append(data)
print("Finished recording.")
# stop and close stream
stream.stop_stream()
stream.close()
# terminate pyaudio object
p.terminate()
# save audio file
# open the file in 'write bytes' mode
wf = wave.open(filename, "wb")
# set the channels
wf.setnchannels(channels)
# set the sample format
wf.setsampwidth(p.get_sample_size(FORMAT))
# set the sample rate
wf.setframerate(sample_rate)
# write the frames as bytes
wf.writeframes(b"".join(frames))
# close the file
wf.close()

# %%
get_transcription("recorded.wav")

# %%

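
Review note on the recording cell: the loop reads `int(sample_rate / chunk * record_seconds)` chunks, so the saved clip is slightly shorter than `record_seconds` whenever the sample rate is not a multiple of the chunk size. The sketch below works through that arithmetic and writes a silent WAV with the same header settings, using only the standard library. The helper names (`chunks_to_read`, `recorded_seconds`, `write_silence_wav`) are hypothetical and not part of the commit, and the 2-byte sample width is assumed to match `pyaudio.paInt16`.

```python
import io
import wave


def chunks_to_read(sample_rate: int, chunk: int, seconds: float) -> int:
    """Number of whole chunks the recording loop reads (same formula as the script)."""
    return int(sample_rate / chunk * seconds)


def recorded_seconds(sample_rate: int, chunk: int, seconds: float) -> float:
    """Actual duration captured once the chunk count is truncated to an integer."""
    return chunks_to_read(sample_rate, chunk, seconds) * chunk / sample_rate


def write_silence_wav(target, n_frames: int, sample_rate: int = 16000,
                      channels: int = 1, sample_width: int = 2) -> None:
    """Write n_frames of silence using the header settings from the script.

    sample_width=2 bytes is assumed to correspond to 16-bit samples (paInt16).
    """
    with wave.open(target, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00" * (n_frames * channels * sample_width))


# With the values from the script: 16000 Hz, 1024-sample chunks, 10 seconds
n_chunks = chunks_to_read(16000, 1024, 10)
print(n_chunks, recorded_seconds(16000, 1024, 10))  # 156 chunks, 9.984 seconds

# Round-trip through an in-memory buffer to check the header that was written
buffer = io.BytesIO()
write_silence_wav(buffer, n_chunks * 1024)
buffer.seek(0)
with wave.open(buffer, "rb") as rf:
    print(rf.getframerate(), rf.getnchannels(), rf.getnframes())
```

With the script's settings the loop captures 156 chunks, about 9.98 seconds, which is harmless for transcription but worth knowing if exact clip lengths matter.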
@@ -0,0 +1,5 @@
# [Speech Recognition using Transformers in Python](https://www.thepythoncode.com/article/speech-recognition-using-huggingface-transformers-in-python)
To get it running:
- `pip3 install -r requirements.txt`

Check [the tutorial](https://www.thepythoncode.com/article/speech-recognition-using-huggingface-transformers-in-python) and the [Colab notebook](https://colab.research.google.com/drive/1-0M8zvQrOzlZ8U8l7KdPOuLBNtzqtlsz?usp=sharing) for more information.
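
For readers wondering what the `torch.argmax` and `processor.decode` steps in the script do, here is a minimal pure-Python sketch of greedy CTC decoding. It is an illustration under stated assumptions, not the library's implementation: the vocabulary is a made-up toy one, index 0 is assumed to be the CTC blank (Wav2Vec2 uses its padding token for this, as far as I understand), and `|` is assumed to be the word delimiter. The function names are hypothetical.

```python
from typing import List, Sequence

# Toy vocabulary (hypothetical); the real model ships its own vocabulary file.
VOCAB = ["<pad>", "|", "h", "e", "l", "o", "w", "r", "d"]
BLANK_ID = 0           # assumed CTC blank token
WORD_DELIMITER = "|"   # assumed word separator token


def argmax_ids(logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring token ID for every time step."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]


def greedy_ctc_decode(ids: Sequence[int]) -> str:
    """Collapse consecutive repeats, drop blanks, and map delimiters to spaces."""
    collapsed = [tok for i, tok in enumerate(ids) if i == 0 or tok != ids[i - 1]]
    chars = [VOCAB[tok] for tok in collapsed if tok != BLANK_ID]
    return "".join(chars).replace(WORD_DELIMITER, " ").strip()


# Per-frame predictions: repeats are merged, and a blank between the two
# "l" frames keeps the double letter in "hello".
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 1, 1, 6, 5, 7, 4, 8, 0]
print(greedy_ctc_decode(frame_ids))  # hello world
```

The blank token is what lets CTC emit a genuinely doubled letter: without the blank between the two runs of `l`, the repeats would collapse into a single character.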
@@ -0,0 +1,5 @@
transformers==4.11.2
soundfile
sentencepiece
torchaudio
pyaudio
