Testing Speech-to-Text Transcription Implementation in OpenAI Whisper

This test suite validates the transcription functionality of the Whisper speech recognition model across its available model variants, focusing on transcription accuracy, language detection, word-level timestamp generation, and token-level decoding for speech-to-text conversion.

Test Coverage Overview

A single parametrized test exercises Whisper’s transcription path end to end, from model loading through decoding, for every variant returned by whisper.available_models().

Key areas tested include:
  • Model loading and inference across all available Whisper models (enumerated in the sketch after this list)
  • Language detection and text extraction
  • Word-level timestamp accuracy
  • Token processing and decoding verification
  • Specific phrase detection in transcribed output
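
For orientation, here is a minimal sketch of how the model variants under test can be enumerated; the exact names printed depend on the installed whisper release.

import whisper

# List every model variant the parametrized test fans out over.
# Typical names include "tiny", "tiny.en", "base", "small", "medium", "large".
for name in whisper.available_models():
    print(name)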

Implementation Analysis

The testing approach uses pytest’s parametrize feature to run the same test against every available Whisper model variant. The implementation leverages PyTorch for model operations and includes device-aware loading (CPU/CUDA).
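
The pattern can be condensed into the following sketch (the test name here is hypothetical; the real test appears in full at the end of this page):

import pytest
import torch
import whisper


# One test function, fanned out over every model name; loads onto CUDA when
# available and falls back to CPU otherwise.
@pytest.mark.parametrize("model_name", whisper.available_models())
def test_model_loads_on_available_device(model_name: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name).to(device)
    assert next(model.parameters()).device.type == device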

Notable patterns include:
  • Parametrized test execution
  • Device-agnostic model loading
  • Assertion chains spanning text, tokens, and word timing
  • Tokenizer integration testing (see the round-trip sketch after this list)
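
The tokenizer integration boils down to a round-trip property: decoding the tokens of a string should reproduce that string exactly. A minimal sketch, with the "tiny" model chosen only for speed:

import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("tiny")  # model choice is illustrative
tokenizer = get_tokenizer(model.is_multilingual, num_languages=model.num_languages)

# Encoding then decoding should be lossless, mirroring the test's assertion
# that decoding all segment tokens reproduces result["text"] exactly.
text = " And so my fellow Americans"
assert tokenizer.decode(tokenizer.encode(text)) == text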

Technical Details

Testing tools and configuration:
  • pytest as the testing framework
  • PyTorch for model operations
  • Whisper’s native tokenizer
  • Sample audio file (jfk.flac)
  • Zero-temperature inference settings (greedy decoding, so output is deterministic)
  • Word-level timestamp validation (both illustrated in the sketch after this list)
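
The following sketch shows these settings in use; the model name is illustrative, and the audio path assumes the script is run from a checkout of openai/whisper:

import whisper

model = whisper.load_model("tiny.en")

# temperature=0.0 disables sampling (greedy decoding), making the output
# deterministic for a given model and file; word_timestamps=True attaches
# per-word "start"/"end" times to every segment.
result = model.transcribe("tests/jfk.flac", temperature=0.0, word_timestamps=True)

print(result["text"])
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f}  {word['end']:6.2f}  {word['word']}")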

Best Practices Demonstrated

The test implementation showcases several testing best practices.

Notable examples include:
  • Comprehensive assertion coverage
  • Dynamic model testing
  • Hardware adaptation (CPU/GPU)
  • Explicit timestamp verification guarded by a sentinel flag (sketched after this list)
  • Token-level accuracy validation
  • Integration of multiple component tests in a single flow
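
The sentinel-flag idiom is easy to miss but worth copying: a flag set inside the loop and asserted afterwards guarantees the interesting branch actually ran, so an unexpectedly empty word list cannot produce a false pass. A self-contained sketch of the idiom, reusing the illustrative model and path from above:

import whisper

model = whisper.load_model("tiny.en")
result = model.transcribe("tests/jfk.flac", temperature=0.0, word_timestamps=True)

timing_checked = False
for segment in result["segments"]:
    for timing in segment["words"]:
        assert timing["start"] < timing["end"]  # every word spans positive time
        if timing["word"].strip(" ,") == "Americans":
            timing_checked = True

# Fails loudly if the loop never reached the word we expected to verify.
assert timing_checked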

openai/whisper

tests/test_transcribe.py

import os

import pytest
import torch

import whisper
from whisper.tokenizer import get_tokenizer


@pytest.mark.parametrize("model_name", whisper.available_models())
def test_transcribe(model_name: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name).to(device)
    audio_path = os.path.join(os.path.dirname(__file__), "jfk.flac")

    language = "en" if model_name.endswith(".en") else None
    result = model.transcribe(
        audio_path, language=language, temperature=0.0, word_timestamps=True
    )
    assert result["language"] == "en"
    assert result["text"] == "".join([s["text"] for s in result["segments"]])

    transcription = result["text"].lower()
    assert "my fellow americans" in transcription
    assert "your country" in transcription
    assert "do for you" in transcription

    tokenizer = get_tokenizer(model.is_multilingual, num_languages=model.num_languages)
    all_tokens = [t for s in result["segments"] for t in s["tokens"]]
    assert tokenizer.decode(all_tokens) == result["text"]
    assert tokenizer.decode_with_timestamps(all_tokens).startswith("<|0.00|>")

    timing_checked = False
    for segment in result["segments"]:
        for timing in segment["words"]:
            assert timing["start"] < timing["end"]
            if timing["word"].strip(" ,") == "Americans":
                assert timing["start"] <= 1.8
                assert timing["end"] >= 1.8
                timing_checked = True

    assert timing_checked