Testing Speech-to-Text Transcription Implementation in OpenAI Whisper

This test suite validates the transcription functionality of the Whisper speech recognition model across its available model variants, focusing on transcription accuracy, language detection, word-level timestamp generation, and token-level decoding for speech-to-text conversion.

Test Coverage Overview

A single parametrized test exercises Whisper’s transcription path end to end, from model loading through decoding, for every variant returned by whisper.available_models().

Key areas tested include:
  • Model loading and inference across all available Whisper models (enumerated in the sketch after this list)
  • Language detection and text extraction
  • Word-level timestamp accuracy
  • Token processing and decoding verification
  • Specific phrase detection in transcribed output
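
For orientation, here is a minimal sketch of how the model variants under test can be enumerated; the exact names printed depend on the installed whisper release.

import whisper

# List every model variant the parametrized test fans out over.
# Typical names include "tiny", "tiny.en", "base", "small", "medium", "large".
for name in whisper.available_models():
    print(name)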

Implementation Analysis

The testing approach uses pytest’s parametrize feature to run the same test against every available Whisper model variant. The implementation leverages PyTorch for model operations and includes device-aware loading (CPU/CUDA).
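
The pattern can be condensed into the following sketch (the test name here is hypothetical; the real test appears in full at the end of this page):

import pytest
import torch
import whisper


# One test function, fanned out over every model name; loads onto CUDA when
# available and falls back to CPU otherwise.
@pytest.mark.parametrize("model_name", whisper.available_models())
def test_model_loads_on_available_device(model_name: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name).to(device)
    assert next(model.parameters()).device.type == device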

Notable patterns include:
  • Parametrized test execution
  • Device-agnostic model loading
  • Assertion chains spanning text, tokens, and word timing
  • Tokenizer integration testing (see the round-trip sketch after this list)
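
The tokenizer integration boils down to a round-trip property: decoding the tokens of a string should reproduce that string exactly. A minimal sketch, with the "tiny" model chosen only for speed:

import whisper
from whisper.tokenizer import get_tokenizer

model = whisper.load_model("tiny")  # model choice is illustrative
tokenizer = get_tokenizer(model.is_multilingual, num_languages=model.num_languages)

# Encoding then decoding should be lossless, mirroring the test's assertion
# that decoding all segment tokens reproduces result["text"] exactly.
text = " And so my fellow Americans"
assert tokenizer.decode(tokenizer.encode(text)) == text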

Technical Details

Testing tools and configuration:
  • pytest as the testing framework
  • PyTorch for model operations
  • Whisper’s native tokenizer
  • Sample audio file (jfk.flac)
  • Zero-temperature inference settings (greedy decoding, so output is deterministic)
  • Word-level timestamp validation (both illustrated in the sketch after this list)
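
The following sketch shows these settings in use; the model name is illustrative, and the audio path assumes the script is run from a checkout of openai/whisper:

import whisper

model = whisper.load_model("tiny.en")

# temperature=0.0 disables sampling (greedy decoding), making the output
# deterministic for a given model and file; word_timestamps=True attaches
# per-word "start"/"end" times to every segment.
result = model.transcribe("tests/jfk.flac", temperature=0.0, word_timestamps=True)

print(result["text"])
for segment in result["segments"]:
    for word in segment["words"]:
        print(f"{word['start']:6.2f}  {word['end']:6.2f}  {word['word']}")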

Best Practices Demonstrated

The test implementation showcases several testing best practices.

Notable examples include:
  • Comprehensive assertion coverage
  • Dynamic model testing
  • Hardware adaptation (CPU/GPU)
  • Explicit timestamp verification guarded by a sentinel flag (sketched after this list)
  • Token-level accuracy validation
  • Integration of multiple component tests in a single flow
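
The sentinel-flag idiom is easy to miss but worth copying: a flag set inside the loop and asserted afterwards guarantees the interesting branch actually ran, so an unexpectedly empty word list cannot produce a false pass. A self-contained sketch of the idiom, reusing the illustrative model and path from above:

import whisper

model = whisper.load_model("tiny.en")
result = model.transcribe("tests/jfk.flac", temperature=0.0, word_timestamps=True)

timing_checked = False
for segment in result["segments"]:
    for timing in segment["words"]:
        assert timing["start"] < timing["end"]  # every word spans positive time
        if timing["word"].strip(" ,") == "Americans":
            timing_checked = True

# Fails loudly if the loop never reached the word we expected to verify.
assert timing_checked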

openai/whisper

tests/test_transcribe.py

import os

import pytest
import torch

import whisper
from whisper.tokenizer import get_tokenizer


@pytest.mark.parametrize("model_name", whisper.available_models())
def test_transcribe(model_name: str):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = whisper.load_model(model_name).to(device)
    audio_path = os.path.join(os.path.dirname(__file__), "jfk.flac")

    language = "en" if model_name.endswith(".en") else None
    result = model.transcribe(
        audio_path, language=language, temperature=0.0, word_timestamps=True
    )
    assert result["language"] == "en"
    assert result["text"] == "".join([s["text"] for s in result["segments"]])

    transcription = result["text"].lower()
    assert "my fellow americans" in transcription
    assert "your country" in transcription
    assert "do for you" in transcription

    tokenizer = get_tokenizer(model.is_multilingual, num_languages=model.num_languages)
    all_tokens = [t for s in result["segments"] for t in s["tokens"]]
    assert tokenizer.decode(all_tokens) == result["text"]
    assert tokenizer.decode_with_timestamps(all_tokens).startswith("<|0.00|>")

    timing_checked = False
    for segment in result["segments"]:
        for timing in segment["words"]:
            assert timing["start"] < timing["end"]
            if timing["word"].strip(" ,") == "Americans":
                assert timing["start"] <= 1.8
                assert timing["end"] >= 1.8
                timing_checked = True

    assert timing_checked