Testing OpenAI API Server Implementation with Multi-Model Workers in FastChat

This test script launches an OpenAI-compatible API server backed by multiple model workers in FastChat. It covers both standard language models and a multimodal configuration, coordinating the worker processes and their model-specific settings.

Test Coverage Overview

The test coverage focuses on launching and coordinating multiple model workers behind an OpenAI-compatible API server; a quick verification sketch follows the list. Key functionality includes:

  • Controller and API server initialization
  • Dynamic model worker allocation
  • Support for both standard and multimodal models
  • Port management and worker addressing
  • Custom tokenizer configurations
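
Once the script is up, the simplest end-to-end check is to list the models registered with the API server through its OpenAI-compatible endpoint. This is a minimal sketch, assuming the server is listening on its default localhost:8000 and that the requests library is installed:

import requests

# Ask the OpenAI-compatible server which models are available
# (assumes the FastChat API server default of localhost:8000).
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
model_ids = [m["id"] for m in resp.json()["data"]]
print(model_ids)  # e.g. ["vicuna-7b-v1.5", ...]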

Implementation Analysis

The testing approach uses a process-based architecture to simulate a production deployment: the controller, the API server, and each model worker run as separate shell-launched processes. Worker-specific command-line flags handle the differences between model types and worker variants.

The implementation leverages argument parsing for flexible test configuration and assigns each worker its own GPU through environment-based device allocation.
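
The script's os.popen calls are fire-and-forget: no handles are kept, so the workers cannot be shut down programmatically. A sketch of a hypothetical variant that retains process handles for teardown (not part of the FastChat script) might look like this:

import subprocess

procs = []

def launch_process(cmd):
    # Keep a handle so each child can be terminated on teardown.
    proc = subprocess.Popen(cmd, shell=True)
    procs.append(proc)
    return proc

def shutdown():
    # Terminate every launched process when the test run ends.
    for proc in procs:
        proc.terminate()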

Technical Details

  • Uses Python’s os.popen for fire-and-forget process launching
  • Implements argparse for CLI argument handling
  • Pins each worker to a GPU via CUDA_VISIBLE_DEVICES
  • Manages multiple worker types (model_worker, vllm_worker, sglang_worker); registration can be checked as sketched after this list
  • Assigns each worker a sequential port starting at 40000
  • Supports specialized tokenizer configurations
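
Whether each worker actually registered can be checked against the FastChat controller. A minimal sketch, assuming the controller is running on its default localhost:21001:

import requests

# Ask the controller which models have registered workers
# (assumes the FastChat controller default of localhost:21001).
resp = requests.post("http://localhost:21001/list_models")
resp.raise_for_status()
print(resp.json()["models"])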

Best Practices Demonstrated

The test implementation demonstrates several quality practices for test infrastructure, including modular process management, clear separation of concerns, and flexible configuration options; a sketch of extending the model selection follows the list.

  • Configurable model selection
  • Systematic resource allocation
  • Clean process management
  • Environment-aware setup
  • Extensible worker configuration
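
The model list itself is hard-coded. One way to make the selection configurable, in the spirit of the script's existing argparse setup, would be a hypothetical --models flag (illustrative only, not in the repository):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--multimodal", action="store_true", default=False)
# Hypothetical extension: select model paths from the command line
# instead of the hard-coded list.
parser.add_argument("--models", nargs="+", default=["lmsys/vicuna-7b-v1.5"])
args = parser.parse_args()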

lm-sys/FastChat

tests/launch_openai_api_test_server.py

"""
Launch an OpenAI API server with multiple model workers.
"""
import os
import argparse


def launch_process(cmd):
    os.popen(cmd)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--multimodal", action="store_true", default=False)
    args = parser.parse_args()

    launch_process("python3 -m fastchat.serve.controller")
    launch_process("python3 -m fastchat.serve.openai_api_server")

    if args.multimodal:
        models = [
            ("liuhaotian/llava-v1.5-7b", "sglang_worker"),
        ]
    else:
        models = [
            ("lmsys/vicuna-7b-v1.5", "model_worker"),
            ("lmsys/fastchat-t5-3b-v1.0", "model_worker"),
            ("THUDM/chatglm-6b", "model_worker"),
            ("mosaicml/mpt-7b-chat", "model_worker"),
            ("meta-llama/Llama-2-7b-chat-hf", "vllm_worker"),
        ]

    for i, (model_path, worker_name) in enumerate(models):
        cmd = (
            f"CUDA_VISIBLE_DEVICES={i} python3 -m fastchat.serve.{worker_name} "
            f"--model-path {model_path} --port {40000+i} "
            f"--worker-address http://localhost:{40000+i} "
        )

        if "llava" in model_path.lower():
            cmd += f"--tokenizer-path llava-hf/llava-1.5-7b-hf"

        if worker_name == "vllm_worker":
            cmd += "--tokenizer hf-internal-testing/llama-tokenizer"

        launch_process(cmd)

    while True:
        pass
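
The closing busy-wait keeps the launcher alive but spins a CPU core at 100%. A gentler variant, assuming no other teardown logic is needed, could sleep between iterations:

import time

# Keep the launcher alive without burning a CPU core.
while True:
    time.sleep(60)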