# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

OpenAI Whisper is a robust automatic speech recognition (ASR) system built on a Transformer sequence-to-sequence model. It performs multilingual speech recognition, speech translation, spoken language identification, and voice activity detection as a unified multitask model.

## Development Commands

### Installation
```bash
# Install package in development mode with dependencies
pip install -e ".[dev]"

# Or install from requirements
pip install -r requirements.txt
```

### Code Quality & Linting
```bash
# Format code with black
black .

# Sort imports with isort
isort .

# Lint with flake8
flake8

# Run all pre-commit hooks
pre-commit run --all-files
```

### Testing
```bash
# Run all tests
pytest

# Run tests with verbose output
pytest -v

# Run specific test file
pytest tests/test_transcribe.py

# Run tests requiring CUDA
pytest -m requires_cuda
```

### Package Building
```bash
# Build package
python -m build

# Install built package
pip install dist/openai_whisper-*.whl
```

## Architecture Overview

### Core Components

**whisper/__init__.py**: Main entry point with model loading (`load_model()`) and model registry (`_MODELS` dict mapping model names to download URLs)
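
A minimal sketch of this registry-plus-cache pattern (the URLs below are placeholders, not the real download links, and `model_path` is a hypothetical helper, not the library's API):

```python
from pathlib import Path

# Miniature stand-in for the _MODELS registry; the real dict maps each
# model name to a checkpoint URL that embeds a SHA-256 checksum.
_MODELS = {
    "tiny": "https://example.com/tiny.pt",  # placeholder URL
    "base": "https://example.com/base.pt",  # placeholder URL
}


def model_path(name: str, root: str = "~/.cache/whisper") -> Path:
    """Resolve where a named model's checkpoint would live on disk."""
    if name not in _MODELS:
        raise ValueError(f"unknown model: {name}; available: {sorted(_MODELS)}")
    return Path(root).expanduser() / f"{name}.pt"
```

`load_model()` follows the same shape: look the name up in the registry, download to the cache directory if the file is missing, then load the checkpoint.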

**whisper/model.py**:
- `ModelDimensions`: Configuration dataclass for model architecture
- `Whisper`: Main model class implementing the Transformer architecture
- Audio encoder and text decoder components with multi-head attention
- Optimized layers (`LayerNorm`, `Linear`) for mixed-precision training
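
The configuration object can be pictured as a plain dataclass. The field names mirror `whisper/model.py`; the example values are roughly the `tiny` checkpoint's dimensions and should be treated as illustrative:

```python
from dataclasses import dataclass


@dataclass
class ModelDimensions:
    n_mels: int          # mel-spectrogram channels fed to the encoder
    n_audio_ctx: int     # encoder sequence length
    n_audio_state: int   # encoder hidden width
    n_audio_head: int    # encoder attention heads
    n_audio_layer: int   # encoder blocks
    n_vocab: int         # tokenizer vocabulary size
    n_text_ctx: int      # decoder sequence length
    n_text_state: int    # decoder hidden width
    n_text_head: int     # decoder attention heads
    n_text_layer: int    # decoder blocks


# Illustrative values in the ballpark of the `tiny` checkpoint
tiny_dims = ModelDimensions(80, 1500, 384, 6, 4, 51865, 448, 384, 6, 4)
```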

**whisper/transcribe.py**:
- `transcribe()`: High-level transcription function with sliding window processing
- `cli()`: Command-line interface implementation
- Handles batch processing, temperature sampling, and output formatting
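
The temperature-sampling fallback can be sketched roughly as follows: decode each window at temperature 0, and retry at progressively higher temperatures when quality checks fail. The threshold defaults below match the CLI's documented defaults; `decode` is a hypothetical stand-in callable:

```python
def decode_with_fallback(
    segment,
    decode,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,  # reject likely repetition loops
    logprob_threshold=-1.0,           # reject low-confidence decodes
):
    """Retry decoding at increasing temperatures until quality checks pass."""
    result = None
    for t in temperatures:
        result = decode(segment, t)
        if (result["compression_ratio"] <= compression_ratio_threshold
                and result["avg_logprob"] >= logprob_threshold):
            return result
    return result  # last attempt, even if it failed the checks
```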

**whisper/decoding.py**:
- `DecodingOptions`/`DecodingResult`: Configuration and result classes
- `decode()`: Core decoding logic with beam search and sampling strategies
- `detect_language()`: Language identification functionality
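
Conceptually, language identification is a single decoder step: the model scores every language token and a softmax over those scores gives per-language probabilities. A stdlib-only sketch with hypothetical logits:

```python
import math


def detect_language(logits: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Softmax over per-language logits; return (best language, probabilities)."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    probs = {k: e / z for k, e in exps.items()}
    return max(probs, key=probs.get), probs
```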

**whisper/audio.py**: Audio preprocessing utilities including mel-spectrogram computation, padding/trimming to 30-second windows
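
The 30-second windowing can be pictured like this (the real `pad_or_trim` operates on NumPy arrays or tensors; this stdlib sketch mirrors the behavior):

```python
SAMPLE_RATE = 16_000                     # Whisper expects 16 kHz audio
CHUNK_LENGTH = 30                        # seconds per window
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480,000 samples


def pad_or_trim(samples: list[float], length: int = N_SAMPLES) -> list[float]:
    """Trim audio longer than one window; zero-pad audio shorter than one."""
    if len(samples) > length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))
```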

**whisper/tokenizer.py**: BPE tokenization with special tokens for task specification (transcription vs translation) and language identification
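
The task and language conditioning boils down to a short special-token prefix the decoder is primed with; the token spellings below match the ones Whisper uses, while `sot_sequence` is a simplified stand-in for the tokenizer's prefix-building logic:

```python
def sot_sequence(language: str = "en", task: str = "transcribe") -> list[str]:
    """Build the start-of-transcript special-token prefix."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    return ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
```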

**whisper/timing.py**: Word-level timestamp alignment using cross-attention weights from specific attention heads
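
The alignment step is essentially dynamic time warping over a token-by-frame cost matrix derived from those cross-attention weights (the in-repo version is numba-jitted). A minimal pure-Python DTW sketch:

```python
def dtw_path(cost):
    """Cheapest monotonic path through a token-by-frame cost matrix."""
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    # acc[i][j] = cheapest cost to align the first i tokens with j frames
    acc = [[INF] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # backtrace from the bottom-right corner
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, (i, j) = min((acc[i - 1][j - 1], (i - 1, j - 1)),
                        (acc[i - 1][j], (i - 1, j)),
                        (acc[i][j - 1], (i, j - 1)))
    return path[::-1]
```

Each `(token, frame)` pair on the returned path gives the frame (and hence the time offset) at which that token is anchored.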

**whisper/normalizers/**: Text normalization for different languages to improve transcription accuracy

### Model Pipeline Flow

1. Audio → Mel-spectrogram (whisper/audio.py)
2. Spectrogram → Audio encoder features (whisper/model.py)
3. Language detection via decoder (whisper/decoding.py)
4. Text generation with task-specific tokens (whisper/transcribe.py)
5. Optional word-level timestamp alignment (whisper/timing.py)
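
The five stages above can be wired together as a stub, with each stage a stand-in callable (all names here are hypothetical, not the library's API):

```python
def transcribe_pipeline(audio, *, mel, encode, detect_language, generate, align=None):
    """Stub wiring of the five pipeline stages; each argument is a callable."""
    spec = mel(audio)                     # 1. audio -> mel-spectrogram
    features = encode(spec)               # 2. spectrogram -> encoder features
    language = detect_language(features)  # 3. language detection
    text = generate(features, language)   # 4. token generation
    words = align(features, text) if align else None  # 5. optional timestamps
    return {"language": language, "text": text, "words": words}
```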

### Available Models

Six model sizes with different accuracy/speed tradeoffs:
- `tiny`, `base`, `small`, `medium`, `large`, `turbo`
- English-only variants (`tiny.en` through `medium.en`): often more accurate for English-only audio
- Models auto-download to `~/.cache/whisper/`

## Testing Structure

- **tests/conftest.py**: pytest configuration with CUDA markers and random seeds
- **tests/jfk.flac**: Reference audio file for integration tests
- Tests cover audio processing, tokenization, normalization, timing, and transcription functionality

## Code Style

- **Black** formatter (88 char line length)
- **isort** for import sorting (black profile)
- **flake8** linting with specific ignores for E203, E501, W503, W504
- **pre-commit hooks** enforce consistency

## Key Dependencies

- **PyTorch**: Core ML framework
- **tiktoken**: Fast BPE tokenization
- **numba**: JIT compilation for audio processing
- **tqdm**: Progress bars for model downloads and processing
- **triton**: GPU kernel optimization (Linux x86_64)