
Testing GPU Communication Parameter Extraction in ColossalAI

This test suite validates the alpha-beta profiling functionality in ColossalAI’s device management module, which measures communication parameters between GPU devices to guide distributed training optimization.

Test Coverage Overview

The test suite covers the extraction and validation of alpha-beta communication parameters between GPU devices.

Key areas tested include:
  • Alpha parameter extraction for device mesh communication
  • Beta parameter extraction for bandwidth measurements
  • Parameter validation across different device configurations
  • Multi-GPU setup with varying physical device combinations
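The "alpha" and "beta" names come from the standard alpha-beta model of point-to-point communication cost: sending a message of n bytes takes roughly alpha + n·beta, where alpha is the fixed per-message latency and beta is the per-byte transfer time (the inverse of bandwidth). A minimal sketch of the model, using illustrative values rather than anything measured by ColossalAI:

```python
def predicted_comm_time(n_bytes: float, alpha: float, beta: float) -> float:
    """Alpha-beta cost model: fixed latency plus per-byte transfer time."""
    return alpha + n_bytes * beta

# Illustrative (assumed) values: 20 us message latency, 10 GB/s bandwidth.
alpha = 20e-6       # seconds per message (latency term)
beta = 1 / 10e9     # seconds per byte (inverse bandwidth)

# For a 1 MiB message the bandwidth term dominates at these values.
t = predicted_comm_time(1 << 20, alpha, beta)
```

This is why the assertions in the test below bound alpha and beta separately: each captures a different physical property of the link.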

Implementation Analysis

The tests use pytest’s distributed testing support with parameterized test cases, spawning one process per rank and initializing the NCCL backend in each.

Key patterns include:
  • Distributed test decorators for multi-GPU scenarios
  • Parameterized device configurations
  • Process spawning for distributed environment simulation
  • Automatic port management and address reuse handling
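One common way to implement the dynamic port allocation mentioned above (a sketch of the general technique, not ColossalAI's internal code) is to bind a socket to port 0, letting the OS pick a free ephemeral port, and then hand that port number to the distributed launcher:

```python
import socket

def find_free_port() -> int:
    """Bind to port 0 so the OS assigns a free ephemeral port, then release it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        # SO_REUSEADDR lets a retried test rebind the address promptly.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("localhost", 0))
        return s.getsockname()[1]

port = find_free_port()  # pass this to the per-rank launch call
```

Even with this scheme a race is possible (the port can be taken between release and reuse), which is exactly the failure mode the @rerun_if_address_is_in_use decorator guards against.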

Technical Details

Testing infrastructure includes:
  • pytest framework with dist markers
  • NCCL backend for GPU communication
  • AlphaBetaProfiler class for parameter extraction
  • Custom decorators: @parameterize, @rerun_if_address_is_in_use
  • Launch configuration with localhost and dynamic port allocation
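To illustrate what a @parameterize-style decorator does, here is a simplified sketch (not colossalai.testing's actual implementation): it re-invokes the wrapped test once per listed value, injecting each value under the named keyword argument.

```python
import functools

def parameterize(arg_name, values):
    """Simplified sketch of a parameterizing decorator: call the wrapped
    function once per value in `values`, passed as keyword `arg_name`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            results = []
            for value in values:
                kwargs[arg_name] = value
                results.append(func(*args, **kwargs))
            return results
        return wrapper
    return decorator

# record_devices (a hypothetical stand-in for a test body) runs twice,
# once per device configuration, mirroring the test file below.
@parameterize("physical_devices", [[0, 1, 2, 3], [0, 3]])
def record_devices(physical_devices):
    return list(physical_devices)
```

Calling record_devices() therefore exercises both the 4-device and the 2-device configuration in a single invocation.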

Best Practices Demonstrated

The test implementation showcases robust distributed testing practices for GPU-based systems.

Notable practices include:
  • Proper test isolation with process spawning
  • Flexible device configuration testing
  • Error handling for address conflicts
  • Clear parameter validation boundaries
  • Skipping tests when CI environment is unsuitable

hpcaitech/colossalai

tests/test_device/test_extract_alpha_beta.py

import pytest

from colossalai.device import AlphaBetaProfiler
from colossalai.initialize import launch
from colossalai.logging import disable_existing_loggers
from colossalai.testing import parameterize, rerun_if_address_is_in_use, spawn


def check_extract_alpha_beta(rank, world_size, port, physical_devices):
    disable_existing_loggers()
    # Initialize the distributed environment for this rank over NCCL.
    launch(rank=rank, world_size=world_size, host="localhost", port=port, backend="nccl")
    profiler = AlphaBetaProfiler(physical_devices)

    # alpha: per-message latency term; beta: per-byte transfer-time term.
    mesh_alpha, mesh_beta = profiler.extract_alpha_beta_for_device_mesh()
    for alpha in mesh_alpha:
        assert alpha > 0 and alpha < 1e-3
    for beta in mesh_beta:
        assert beta > 0 and beta < 1e-10


@pytest.mark.skip(reason="Skip because assertion may fail for CI devices")
@pytest.mark.dist
@parameterize("physical_devices", [[0, 1, 2, 3], [0, 3]])
@rerun_if_address_is_in_use()
def test_profile_alpha_beta(physical_devices):
    spawn(check_extract_alpha_beta, 4, physical_devices=physical_devices)


if __name__ == "__main__":
    test_profile_alpha_beta()