
Testing Tacotron2 D-Vector Speaker Embeddings Training in Coqui-AI TTS

This test suite validates training of the Tacotron2 text-to-speech model with d-vector speaker embeddings in the Coqui-AI TTS framework. It covers the complete training pipeline, including configuration, model training, checkpoint management, and inference.

Test Coverage Overview

The test suite provides comprehensive coverage of Tacotron2 TTS model training and inference workflows with speaker embeddings.

  • Model configuration validation and persistence (see the round-trip sketch after this list)
  • Initial training epoch execution
  • Checkpoint management and restoration
  • Inference pipeline with speaker embeddings
  • Training continuation from saved checkpoints
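
The configuration round-trip in the first item is the foundation the later assertions build on. A minimal sketch of that round-trip, using the same Tacotron2Config and save_json API as the test (the temp-dir path is illustrative; the real test writes to its tests output directory):

import json
import os
import tempfile

from TTS.tts.configs.tacotron2_config import Tacotron2Config

# Save a Tacotron2Config to JSON, reload it, and check that the
# d-vector fields survive the trip.
config = Tacotron2Config(use_d_vector_file=True, d_vector_dim=256)
config_path = os.path.join(tempfile.gettempdir(), "test_model_config.json")
config.save_json(config_path)

with open(config_path, "r", encoding="utf-8") as f:
    loaded = json.load(f)
assert loaded["use_d_vector_file"] is True
assert loaded["d_vector_dim"] == 256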

Implementation Analysis

The test validates the Tacotron2 training pipeline end to end, driving it through CLI commands and persisted configuration.

It shells out to the training and inference entry points via a run_cli helper, asserts that the saved configuration round-trips intact, and restores training runs with the trainer's get_last_checkpoint utility.
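
The run_cli helper is imported from the tests package rather than defined in this file. A plausible minimal reconstruction (the actual helper may differ) is simply a shell invocation plus an exit-status assertion:

import os

def run_cli(command: str) -> None:
    # Execute a shell command and fail the test if it exits non-zero.
    exit_status = os.system(command)
    assert exit_status == 0, f"command failed with status {exit_status}: {command}"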

Technical Details

  • Uses Tacotron2Config for model configuration
  • Pins each run to a single device via CUDA_VISIBLE_DEVICES and get_device_id
  • Configures the d-vector dimension (d_vector_dim=256)
  • Sets a phoneme cache path, though use_phonemes is disabled in this config
  • Sets audio preprocessing options (silence trimming at trim_db=60)
  • Implements checkpoint discovery and restoration (see the sketch below)
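
The checkpoint discovery mentioned in the last item pairs a glob over run directories with the trainer's get_last_checkpoint helper. A self-contained sketch of the directory-discovery half, mirroring the glob/getmtime pattern used in the test below:

import glob
import os

def latest_run_dir(output_path: str) -> str:
    # Return the most recently modified run directory under output_path,
    # the same pattern the test uses to locate its training run.
    run_dirs = glob.glob(os.path.join(output_path, "*/"))
    if not run_dirs:
        raise FileNotFoundError(f"no run directories found under {output_path}")
    return max(run_dirs, key=os.path.getmtime)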

Best Practices Demonstrated

The test exhibits robust testing practices for deep learning model training validation.

  • Systematic configuration validation
  • Resource cleanup after test execution (see the fixture sketch after this list)
  • Comprehensive pipeline validation from training to inference
  • Proper GPU device management
  • Efficient test data organization
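
The script below cleans up with shutil.rmtree only after a successful run. A hypothetical pytest-fixture variant (names are illustrative, not part of the actual test) would guarantee cleanup even when an assertion fails mid-test:

import shutil

import pytest

@pytest.fixture
def train_output_dir(tmp_path):
    # Hand the test a working directory and remove it afterwards,
    # regardless of whether the test body passed or failed.
    out_dir = tmp_path / "train_outputs"
    out_dir.mkdir()
    yield out_dir
    shutil.rmtree(out_dir, ignore_errors=True)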

coqui-ai/tts

tests/tts_tests/test_tacotron2_d-vectors_train.py

import glob
import json
import os
import shutil

from trainer import get_last_checkpoint

from tests import get_device_id, get_tests_output_path, run_cli
from TTS.tts.configs.tacotron2_config import Tacotron2Config

config_path = os.path.join(get_tests_output_path(), "test_model_config.json")
output_path = os.path.join(get_tests_output_path(), "train_outputs")

config = Tacotron2Config(
    r=5,
    batch_size=8,
    eval_batch_size=8,
    num_loader_workers=0,
    num_eval_loader_workers=0,
    text_cleaner="english_cleaners",
    use_phonemes=False,
    phoneme_language="en-us",
    phoneme_cache_path=os.path.join(get_tests_output_path(), "train_outputs/phoneme_cache/"),
    run_eval=True,
    test_delay_epochs=-1,
    epochs=1,
    print_step=1,
    print_eval=True,
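    # d-vector setup: load precomputed speaker embeddings from a file
    # instead of training a speaker-embedding layer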
    use_speaker_embedding=False,
    use_d_vector_file=True,
    test_sentences=[
        "Be a voice, not an echo.",
    ],
    d_vector_file="tests/data/ljspeech/speakers.json",
    d_vector_dim=256,
    max_decoder_steps=50,
)

config.audio.do_trim_silence = True
config.audio.trim_db = 60
config.save_json(config_path)

# train the model for one epoch
command_train = (
    f"CUDA_VISIBLE_DEVICES='{get_device_id()}' python TTS/bin/train_tts.py --config_path {config_path} "
    f"--coqpit.output_path {output_path} "
    "--coqpit.datasets.0.formatter ljspeech_test "
    "--coqpit.datasets.0.meta_file_train metadata.csv "
    "--coqpit.datasets.0.meta_file_val metadata.csv "
    "--coqpit.datasets.0.path tests/data/ljspeech "
    "--coqpit.test_delay_epochs 0 "
)
run_cli(command_train)

# Find latest folder
continue_path = max(glob.glob(os.path.join(output_path, "*/")), key=os.path.getmtime)

# Inference using TTS API
continue_config_path = os.path.join(continue_path, "config.json")
continue_restore_path, _ = get_last_checkpoint(continue_path)
out_wav_path = os.path.join(get_tests_output_path(), "output.wav")
speaker_id = "ljspeech-1"
continue_speakers_path = config.d_vector_file

# Check integrity of the config
with open(continue_config_path, "r", encoding="utf-8") as f:
    config_loaded = json.load(f)
assert config_loaded["characters"] is not None
assert config_loaded["output_path"] in continue_path
assert config_loaded["test_delay_epochs"] == 0

# Load the model and run inference
inference_command = f"CUDA_VISIBLE_DEVICES='{get_device_id()}' tts --text 'This is an example.' --speaker_idx {speaker_id} --speakers_file_path {continue_speakers_path} --config_path {continue_config_path} --model_path {continue_restore_path} --out_path {out_wav_path}"
run_cli(inference_command)

# restore the model and continue training for one more epoch
command_train = f"CUDA_VISIBLE_DEVICES='{get_device_id()}' python TTS/bin/train_tts.py --continue_path {continue_path} "
run_cli(command_train)
shutil.rmtree(continue_path)