A simple audio transcription application that uses Google's Gemini model to transcribe audio files and generate both text and SRT format transcripts. Tested for short news
- Simple web interface using Gradio
- Utilizing Google's gemini multimodal LLM
- Supports various audio file formats. The supported format currently supported are mp3, m4a, wav
- Generates timestamped transcripts
- Exports in both TXT and SRT formats
- Runs all components locally except for the Gemini model
- Install the required dependencies:
pip install -r requirements.txt
- Set up your Google API key:
# On Windows
set GOOGLE_API_KEY=your_api_key_here
# On Linux/macOS
export GOOGLE_API_KEY=your_api_key_here
- Run the application:
python app.py
or
$env:GOOGLE_API_KEY='your_api_key_here'; python app.py #if you want just to use disposable API
- Open your browser and navigate to
http://localhost:7860
- Upload an audio file using the file upload button
- Click "Transcribe" to process the file
- Once processing is complete, you can:
- Download the transcript in TXT format (timestamped text)
- Download the transcript in SRT format (subtitle format)
- Preview the transcript directly in the browser
The TXT format includes timestamps in [MM:SS] format followed by the transcribed text:
[00:00] This is the beginning of the transcript
[00:15] This is the next segment
The SRT format follows the standard subtitle format:
1
00:00:00,000 --> 00:00:02,000
This is the beginning of the transcript
2
00:00:15,000 --> 00:00:17,000
This is the next segment
- The application uses Gradio for the web interface, making it lightweight and easy to use
- All processing is done locally except for the transcription which uses Google's Gemini model
- Currently no speaker diarization is performed - the transcript focuses on the content only
- Timestamps are grouped logically to maintain context rather than breaking at every pause
- Make translation feature
- Supporting diarization
- Token counter and API price counter
- Multi batch processing
- Adjusting API rate limit