# AutoSub

- [AutoSub](#autosub)
  - [About](#about)
  - [Installation](#installation)
  - [Docker](#docker)
  - [How-to example](#how-to-example)
  - [How it works](#how-it-works)
  - [Motivation](#motivation)
  - [Contributing](#contributing)
  - [References](#references)

## About

AutoSub is a CLI application to generate subtitle files (.srt, .vtt, and .txt transcript) for any video file using [Mozilla DeepSpeech](https://github.com/mozilla/DeepSpeech). I use the DeepSpeech Python API to run inference on audio segments and [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) to split the initial audio on silent segments, producing multiple small files.

⭐ Featured in [DeepSpeech Examples](https://github.com/mozilla/DeepSpeech-examples) by Mozilla

## Installation

* Clone the repo
    ```bash
    $ git clone https://github.com/abhirooptalasila/AutoSub
    $ cd AutoSub
    ```
* Create a virtual environment to install the required packages. All further steps should be performed while in the `AutoSub/` directory
    ```bash
    $ python3 -m pip install --user virtualenv
    $ virtualenv sub
    $ source sub/bin/activate
    ```
* Use the corresponding requirements file depending on whether you have a GPU or not. For GPU inference, make sure you have the appropriate [CUDA](https://deepspeech.readthedocs.io/en/v0.9.3/USING.html#cuda-dependency-inference) version installed
    ```bash
    # CPU
    $ pip3 install -r requirements.txt
    # OR, for GPU inference:
    $ pip3 install -r requirements-gpu.txt
    ```
* Use `getmodels.sh` to download the model and scorer files, passing the version number as an argument
    ```bash
    $ ./getmodels.sh 0.9.3
    ```
* Install FFMPEG. If you're on Ubuntu, this should work fine
    ```bash
    $ sudo apt-get install ffmpeg
    $ ffmpeg -version # I'm running 4.1.4
    ```

## Docker

* If you don't have the model files, get them
    ```bash
    $ ./getmodels.sh 0.9.3
    ```
* For a CPU build
    ```bash
    $ docker build -t autosub .
    $ docker run --volume=`pwd`/input:/input --name autosub autosub --file /input/video.mp4
    $ docker cp autosub:/output/ .
    ```
* For a GPU build that is reusable (saving time on instantiating the program)
    ```bash
    $ docker build --build-arg BASEIMAGE=nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --build-arg DEPSLIST=requirements-gpu.txt -t autosub-base . && \
    docker run --gpus all --name autosub-base autosub-base --dry-run || \
    docker commit --change 'CMD []' autosub-base autosub-instance
    ```
* Finally
    ```bash
    $ docker run --volume=`pwd`/input:/input --name autosub autosub-instance --file /input/video.mp4
    $ docker cp autosub:/output/ .
    ```

## How-to example

* The model files should be in the repo root directory and will be loaded automatically. If you have multiple versions, use the `--model` and `--scorer` args to pick one explicitly.
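    For example, assuming the default 0.9.3 filenames downloaded by `getmodels.sh` (adjust the paths to whatever model files you actually have):
    ```bash
    $ python3 autosub/main.py --file ~/movie.mp4 --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer
    ```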
* After following the installation instructions, you can run `autosub/main.py` as given below. The `--file` argument is the video file for which subtitles are to be generated
    ```bash
    $ python3 autosub/main.py --file ~/movie.mp4
    ```
* After the script finishes, the SRT file is saved in `output/`
* The optional `--split-duration` argument allows customization of the maximum number of seconds any given subtitle is displayed for. The default is 5 seconds
    ```bash
    $ python3 autosub/main.py --file ~/movie.mp4 --split-duration 8
    ```
* By default, AutoSub outputs SRT, VTT and TXT files. To only produce the file formats you want, use the `--format` argument
    ```bash
    $ python3 autosub/main.py --file ~/movie.mp4 --format srt txt
    ```
* Open the video file and add this SRT file as a subtitle. You can just drag and drop it into VLC.

## How it works

Mozilla DeepSpeech is an open-source speech-to-text engine with support for fine-tuning on custom datasets, external language models, exporting memory-mapped models and a lot more. You should definitely check it out for STT tasks. When you run the script, I use FFMPEG to **extract the audio** from the video and save it in `audio/`. By default, DeepSpeech expects 16kHz audio samples for inference, so while extracting I make FFMPEG use a 16kHz sampling rate.

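The extraction step is roughly equivalent to running a command like the one below yourself (illustrative only; the filenames are placeholders and the script builds its own FFMPEG call):

```bash
$ ffmpeg -i movie.mp4 -vn -ac 1 -ar 16000 audio/movie.wav
```
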
Then, I use [pyAudioAnalysis](https://github.com/tyiannak/pyAudioAnalysis) for silence removal. It takes the large audio file extracted initially and splits it wherever silent regions are encountered, resulting in smaller audio segments that are much easier to process. I haven't used the whole library; instead, I've integrated parts of it in `autosub/featureExtraction.py` and `autosub/trainAudio.py`. All these audio files are stored in `audio/`. Then, for each audio segment, I perform DeepSpeech inference and write the inferred text to an SRT file. After all segments are processed, the final SRT file is stored in `output/`.

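If you want to sanity-check a single segment by hand, the DeepSpeech package also installs a `deepspeech` command-line client that performs the same kind of inference (the segment filename below is just an example of what ends up in `audio/`):

```bash
$ deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio audio/segment_0.wav
```
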
When I tested the script on my laptop, it took about **40 minutes to generate the SRT file for a 70-minute video file**. My config is an i5 dual-core @ 2.5 GHz with 8GB of RAM. Ideally, the whole process shouldn't take more than 60% of the duration of the original video file.

## Motivation

In the age of OTT platforms, there are still some who prefer to download movies/videos from YouTube/Facebook or even torrents rather than stream. I am one of them, and on one such occasion, I couldn't find the subtitle file for a particular movie I had downloaded. Then the idea for AutoSub struck me, and since I had worked with DeepSpeech previously, I decided to use it.

## Contributing