
Testing CUDA Pre-Layer Normalization Kernel Implementation in DeepSpeed

This test suite validates the Pre-Layer Normalization (PreLN) kernel implementation in DeepSpeed’s inference v2 core operations. It verifies that the fused residual-add and layer normalization produces correct results, for both the residual sum and the normalized hidden states, across different tensor dimensions and data types.

Test Coverage Overview

The test suite provides comprehensive coverage of the CUDA PreLN kernel functionality.

Key areas tested include:
  • Multiple input dimensions with varying token and channel configurations
  • Support for different data types (dtype parametrization)
  • Residual connection handling
  • Layer normalization accuracy
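The math being validated can be sketched in plain Python for a single token. This is a simplified model of the PreLN computation (residual add followed by layer normalization over the channel dimension), not DeepSpeed's kernel:

```python
import math

def pre_ln(residual, hidden, gamma, beta, eps=1e-5):
    """Pre-LN for one token: add the residual, then layer-normalize the sum."""
    summed = [r + h for r, h in zip(residual, hidden)]          # residual_out
    mean = sum(summed) / len(summed)
    var = sum((x - mean) ** 2 for x in summed) / len(summed)
    inv_std = 1.0 / math.sqrt(var + eps)
    out = [g * (x - mean) * inv_std + b                         # hidden_out
           for g, x, b in zip(gamma, summed, beta)]
    return summed, out

summed, out = pre_ln([1.0, 2.0, 3.0, 4.0], [0.5] * 4, [1.0] * 4, [0.0] * 4)
```

Note that, like the kernel under test, the sketch returns two tensors: the residual sum (fed to later layers) and the normalized hidden states.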

Implementation Analysis

The testing approach uses pytest’s parametrization to validate the kernel across multiple scenarios.

Technical implementation includes:
  • Reference implementation in PyTorch for comparison
  • CUDA kernel execution with varying tensor shapes
  • Precision-aware type conversion handling
  • Epsilon-based numerical stability checks
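The precision-aware comparison idea can be illustrated with a small sketch. The names and tolerance values below are illustrative assumptions, not the actual `allclose` helper from `inference_test_utils`:

```python
# Hypothetical dtype-keyed (rtol, atol) table; looser bounds for half precision.
TOLERANCES = {
    "float32": (5e-4, 5e-5),
    "float16": (3e-2, 2e-3),
    "bfloat16": (3e-2, 2e-3),
}

def within_tolerance(ref, out, dtype_name):
    """Elementwise |out - ref| <= atol + rtol * |ref|, with dtype-chosen bounds."""
    rtol, atol = TOLERANCES[dtype_name]
    return all(abs(a - b) <= atol + rtol * abs(b) for a, b in zip(out, ref))
```

Picking tolerances per dtype matters here because fp16/bf16 kernels legitimately diverge from an fp32 reference by more than fp32 kernels do.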

Technical Details

Testing infrastructure utilizes:
  • PyTest framework with custom markers
  • DeepSpeed accelerator utilities
  • Custom allclose comparison function
  • CUDA device management
  • Dynamic tensor allocation and type casting
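The custom marker seen on the test (`inference_v2_ops`) has to be registered so pytest can filter on it without warnings. A minimal `conftest.py` sketch (assumed setup; DeepSpeed's actual configuration may differ):

```python
# conftest.py -- register the custom mark so `pytest -m inference_v2_ops`
# selects these tests and unknown-mark warnings are avoided.
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "inference_v2_ops: tests for inference v2 core-op kernels")
```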

Best Practices Demonstrated

The test implementation showcases several testing best practices:

  • Parameterized test cases for comprehensive coverage
  • Reference implementation for validation
  • Device-aware tensor handling
  • Proper cleanup and resource management
  • Clear separation of input setup and validation logic

microsoft/deepspeed

tests/unit/inference/v2/kernels/core_ops/test_pre_ln.py

            
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import pytest
import torch

from deepspeed.accelerator import get_accelerator
from deepspeed.inference.v2.kernels.core_ops import CUDAFPPreLN
from ....v2.inference_test_utils import get_dtypes, allclose


def reference_implementation(residual: torch.Tensor, hidden_states: torch.Tensor, gamma: torch.Tensor,
                             beta: torch.Tensor, epsilon: float) -> "tuple[torch.Tensor, torch.Tensor]":
    residual_f = residual.to(torch.float32)
    hidden_states_f = hidden_states.to(torch.float32)
    gamma_f = gamma.to(torch.float32)
    beta_f = beta.to(torch.float32)
    residual_out = residual_f + hidden_states_f
    hidden_out = torch.nn.functional.layer_norm(residual_out, (hidden_states_f.size(-1), ),
                                                weight=gamma_f,
                                                bias=beta_f,
                                                eps=epsilon)
    return residual_out.to(hidden_states.dtype), hidden_out.to(hidden_states.dtype)


@pytest.mark.inference_v2_ops
@pytest.mark.parametrize("tokens, channels", [(1, 4096), (37, 2048), (112, 14432), (1024, 6144)])
@pytest.mark.parametrize("dtype", get_dtypes())
def test_cuda_pre_ln(tokens: int, channels: int, dtype: torch.dtype) -> None:

    # Input vals
    hidden_states = torch.randn((tokens, channels), dtype=dtype, device=get_accelerator().current_device_name())
    residual = torch.randn((tokens, channels), dtype=dtype, device=get_accelerator().current_device_name())
    gamma = torch.randn((channels, ), dtype=dtype, device=get_accelerator().current_device_name())
    beta = torch.rand((channels, ), dtype=dtype, device=get_accelerator().current_device_name())
    epsilon = 1e-5

    # Reference output
    ref_output_res, ref_output_hid = reference_implementation(residual, hidden_states, gamma, beta, epsilon)

    # New output
    pre_ln_kernel = CUDAFPPreLN(hidden_states.size(-1), residual.dtype)
    ds_output_res = torch.empty_like(residual)
    ds_output_hid = torch.empty_like(hidden_states)
    pre_ln_kernel(ds_output_res, ds_output_hid, residual, hidden_states, gamma, beta)

    # Check
    assert allclose(ds_output_res, ref_output_res)
    assert allclose(ds_output_hid, ref_output_hid)