
Testing Silero TTS Voice Synthesis Integration in text-generation-webui

This test suite validates the Silero TTS (Text-to-Speech) functionality in the text-generation-webui project. It tests voice synthesis parameters, text preprocessing, and audio output generation with configurable voice characteristics.

Test Coverage Overview

The test suite covers essential TTS functionality including:
  • Voice parameter configuration (pitch, speed, speaker selection)
  • Text preprocessing and XML escaping
  • Audio file generation and output handling
  • Integration with Silero TTS model loading
Edge cases include empty string handling and special character processing.
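The special-character edge case can be sketched as below. This is a minimal illustration, assuming the escape table maps each XML-special character to its standard entity reference; the name XML_ESCAPES is introduced here for illustration and is not taken from the repository.

```python
# Assumed mapping: each XML-special character to its standard entity.
XML_ESCAPES = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
})


def xmlesc(txt: str) -> str:
    """Escape characters that would otherwise break SSML markup."""
    # str.translate works character by character, so an ampersand that is
    # produced by one replacement is never escaped a second time.
    return txt.translate(XML_ESCAPES)


if __name__ == "__main__":
    for sample in ["", "a < b & c > d", "it's \"ok\""]:
        print(repr(sample), "->", repr(xmlesc(sample)))
```

An empty string passes through unchanged, so the empty-reply check has to happen separately from escaping.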

Implementation Analysis

The script is organized into separate steps for model loading (load_model), text preprocessing (tts_preprocessor.preprocess followed by xmlesc), and output generation (output_modifier). It loads the Silero model through PyTorch's torch.hub and uses SSML (Speech Synthesis Markup Language) prosody tags to control speaking rate and pitch.

Key patterns include a single params dictionary that holds all configuration and fixed lists of the accepted pitch and speed values (voice_pitches and voice_speeds).
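How a parameter dictionary drives the SSML wrapper can be sketched as follows. The function build_ssml and the validation step are assumptions added for illustration; the script itself formats the prosody tag inline and does not validate the values.

```python
# Accepted values, matching the lists used in the script.
VOICE_PITCHES = ['x-low', 'low', 'medium', 'high', 'x-high']
VOICE_SPEEDS = ['x-slow', 'slow', 'medium', 'fast', 'x-fast']


def build_ssml(escaped_text: str, params: dict) -> str:
    """Wrap already XML-escaped text in <speak> and <prosody> tags.

    Hypothetical helper: the validation below is not in the original script.
    """
    pitch = params['voice_pitch']
    speed = params['voice_speed']
    if pitch not in VOICE_PITCHES:
        raise ValueError(f'unknown voice_pitch: {pitch!r}')
    if speed not in VOICE_SPEEDS:
        raise ValueError(f'unknown voice_speed: {speed!r}')
    prosody = f'<prosody rate="{speed}" pitch="{pitch}">'
    return f'<speak>{prosody}{escaped_text}</prosody></speak>'


if __name__ == "__main__":
    print(build_ssml('Hello there', {'voice_pitch': 'medium', 'voice_speed': 'fast'}))
```

Rejecting unknown values early is one possible design choice; it surfaces a typo in the configuration before any audio is synthesized.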

Technical Details

Testing tools and components:
  • PyTorch for model operations
  • Silero Models hub integration
  • Custom tts_preprocessor module
  • WAV file output handling
  • XML escape character mapping

Best Practices Demonstrated

The test implementation showcases several quality practices:
  • Global parameter management for configuration flexibility
  • Fallback message for empty input instead of attempting synthesis
  • Clean separation of preprocessing and synthesis steps
  • Model loaded once at import time and reused across calls
  • Structured voice parameter organization
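The separation of preprocessing from synthesis, together with the empty-input fallback, can be sketched without loading the model. In this sketch render_reply and its preprocess and synthesize arguments are hypothetical stand-ins for tts_preprocessor.preprocess and the Silero save_wav call; only the fallback message and the audio tag format are taken from the script.

```python
from typing import Callable

EMPTY_REPLY = '*Empty reply, try regenerating*'


def render_reply(
    text: str,
    preprocess: Callable[[str], str],
    synthesize: Callable[[str], str],
    autoplay: bool = True,
) -> str:
    """Preprocess text, then synthesize it unless nothing is left to speak.

    synthesize is assumed to write an audio file and return its path.
    """
    processed = preprocess(text)
    if processed == '':
        # Skip synthesis entirely for an empty reply.
        return EMPTY_REPLY
    audio_path = synthesize(processed)
    autoplay_attr = 'autoplay' if autoplay else ''
    return f'<audio src="file/{audio_path}" controls {autoplay_attr}></audio>'


if __name__ == "__main__":
    # Stub callables stand in for the real preprocessor and TTS model.
    print(render_reply('  hello  ', str.strip, lambda s: 'out.wav'))
    print(render_reply('   ', str.strip, lambda s: 'out.wav'))
```

Passing the two steps in as callables keeps the control flow checkable with stubs, so the empty-input path can be exercised without downloading a model.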

oobabooga/text-generation-webui

extensions/silero_tts/test_tts.py

            
import time
from pathlib import Path

import torch
import tts_preprocessor

torch._C._jit_set_profiling_mode(False)


params = {
    'activate': True,
    'speaker': 'en_49',
    'language': 'en',
    'model_id': 'v3_en',
    'sample_rate': 48000,
    'device': 'cpu',
    'show_text': True,
    'autoplay': True,
    'voice_pitch': 'medium',
    'voice_speed': 'medium',
}

current_params = params.copy()
voices_by_gender = ['en_99', 'en_45', 'en_18', 'en_117', 'en_49', 'en_51', 'en_68', 'en_0', 'en_26', 'en_56', 'en_74', 'en_5', 'en_38', 'en_53', 'en_21', 'en_37', 'en_107', 'en_10', 'en_82', 'en_16', 'en_41', 'en_12', 'en_67', 'en_61', 'en_14', 'en_11', 'en_39', 'en_52', 'en_24', 'en_97', 'en_28', 'en_72', 'en_94', 'en_36', 'en_4', 'en_43', 'en_88', 'en_25', 'en_65', 'en_6', 'en_44', 'en_75', 'en_91', 'en_60', 'en_109', 'en_85', 'en_101', 'en_108', 'en_50', 'en_96', 'en_64', 'en_92', 'en_76', 'en_33', 'en_116', 'en_48', 'en_98', 'en_86', 'en_62', 'en_54', 'en_95', 'en_55', 'en_111', 'en_3', 'en_83', 'en_8', 'en_47', 'en_59', 'en_1', 'en_2', 'en_7', 'en_9', 'en_13', 'en_15', 'en_17', 'en_19', 'en_20', 'en_22', 'en_23', 'en_27', 'en_29', 'en_30', 'en_31', 'en_32', 'en_34', 'en_35', 'en_40', 'en_42', 'en_46', 'en_57', 'en_58', 'en_63', 'en_66', 'en_69', 'en_70', 'en_71', 'en_73', 'en_77', 'en_78', 'en_79', 'en_80', 'en_81', 'en_84', 'en_87', 'en_89', 'en_90', 'en_93', 'en_100', 'en_102', 'en_103', 'en_104', 'en_105', 'en_106', 'en_110', 'en_112', 'en_113', 'en_114', 'en_115']
voice_pitches = ['x-low', 'low', 'medium', 'high', 'x-high']
voice_speeds = ['x-slow', 'slow', 'medium', 'fast', 'x-fast']

# Used for making text xml compatible, needed for voice pitch and speed control
table = str.maketrans({
    "<": "&lt;",
    ">": "&gt;",
    "&": "&amp;",
    "'": "&apos;",
    '"': "&quot;",
})


def xmlesc(txt):
    return txt.translate(table)


def load_model():
    model, example_text = torch.hub.load(repo_or_dir='snakers4/silero-models', model='silero_tts', language=params['language'], speaker=params['model_id'])
    model.to(params['device'])
    return model


model = load_model()


def output_modifier(string):
    """
    This function is applied to the model outputs.
    """

    global model, current_params

    original_string = string
    string = tts_preprocessor.preprocess(string)
    processed_string = string

    if string == '':
        string = '*Empty reply, try regenerating*'
    else:
        output_file = Path(f'extensions/silero_tts/outputs/test_{int(time.time())}.wav')
        prosody = '<prosody rate="{}" pitch="{}">'.format(params['voice_speed'], params['voice_pitch'])
        silero_input = f'<speak>{prosody}{xmlesc(string)}</prosody></speak>'
        model.save_wav(ssml_text=silero_input, speaker=params['speaker'], sample_rate=int(params['sample_rate']), audio_path=str(output_file))

        autoplay = 'autoplay' if params['autoplay'] else ''
        string = f'<audio src="file/{output_file.as_posix()}" controls {autoplay}></audio>'

        if params['show_text']:
            string += f'\n\n{original_string}\n\nProcessed:\n{processed_string}'

    print(string)


if __name__ == '__main__':
    import sys
    output_modifier(sys.argv[1])