Testing FP8 Quantization Matrix Operations in DeepSpeed

This test suite validates FP8 quantization and FP8 matrix multiplication (GEMM) operations in DeepSpeed. It checks the numerical accuracy of 8-bit floating-point weight quantization across a range of batch sizes and configurations by comparing the fused FP8 GEMM output against a dequantize-then-matmul reference.

Test Coverage Overview

The test suite covers FP8 quantization and matrix multiplication operations with comprehensive parameter variations.

  • Tests multiple batch sizes from 1 to 2048
  • Validates quantization accuracy with 8-bit precision
  • Verifies matrix multiplication with quantized weights
  • Tests error thresholds across different configurations (the error metric is sketched after this list)
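
The accuracy criterion used throughout the suite is a mean per-element relative error between the fused FP8 GEMM output and a dequantize-then-matmul reference, with a threshold of 0.004. A small helper capturing that formula (the function name is illustrative, not part of DeepSpeed's API) might look like:

import torch

def mean_relative_error(out: torch.Tensor, ref: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # Average per-element relative error, mirroring the expression in the test below;
    # `out` is the fused FP8 result and `ref` the dequantize-then-matmul reference.
    return ((out - ref).abs() / (out.abs() + eps)).sum() / out.numel()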

Implementation Analysis

The testing approach uses pytest parametrization (rather than fixtures) to validate quantization operations across different scenarios; a minimal, self-contained sketch of the pattern follows the list below.

  • Implements systematic parameter testing using pytest.mark.parametrize
  • Uses CUDA-enabled tensor operations
  • Validates against reference implementations
  • Implements error threshold verification
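
The following standalone sketch mirrors the skip-and-parametrize structure without DeepSpeed's fused kernel: it round-trips the weight through PyTorch's FP8 dtype (assuming a PyTorch build that provides torch.float8_e4m3fn) and checks a relative error in the Frobenius norm, rather than the per-element metric and matmul_fp8 call used in the real test. The test name and threshold are hypothetical.

import pytest
import torch

# Skip the whole module when hardware or dtype support is missing,
# mirroring the module-level skip used in the DeepSpeed test file.
if not torch.cuda.is_available():
    pytest.skip("CUDA device is required", allow_module_level=True)
if not hasattr(torch, "float8_e4m3fn"):
    pytest.skip("this PyTorch build has no FP8 dtypes", allow_module_level=True)


@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
@pytest.mark.parametrize("M", [1, 8, 128, 2048])
def test_fp8_roundtrip_matmul(dtype, M):
    H, N = 4096, 8192
    x = torch.randn(M, H, dtype=dtype, device="cuda")
    w = torch.randn(H, N, dtype=dtype, device="cuda")

    # Stand-in for quantize/dequantize: round-trip the weight through FP8 (E4M3).
    w_fp8 = w.to(torch.float8_e4m3fn).to(dtype)

    out = torch.matmul(x, w_fp8)
    ref = torch.matmul(x, w)

    # Relative error in the Frobenius norm; the threshold is loose because
    # no per-group scaling is applied in this sketch.
    error = (out - ref).float().norm() / ref.float().norm()
    assert error < 0.1, f"failed on batch-size {M} with error {error}"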

Technical Details

  • Uses PyTorch’s bfloat16 data type
  • Implements 8-bit quantization
  • Utilizes DeepSpeed’s FPQuantizer operations
  • Employs a quantization group size of 128 (a group-wise sketch follows this list)
  • Tests GEMM shapes of (M × 4096) activations times (4096 × 8192) weights, with M varying
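
FP_Quantize scales values in groups of 128 elements before reducing them to 8 bits, returning the packed data and per-group scales (the get_scales and return_meta_tensor calls in the listing). As a rough pure-PyTorch illustration of group-wise scaling (a simplification assuming torch.float8_e4m3fn is available; DeepSpeed's actual kernel packs data and scales differently), one could write:

import torch

def fake_quantize_fp8_groupwise(w: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    # Simulate group-wise 8-bit quantization: scale each group of 128 contiguous
    # values so its largest magnitude lands near the E4M3 maximum (448), round-trip
    # through FP8, then rescale back to the original range.
    groups = w.reshape(-1, group_size).float()
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 448.0
    q = (groups / scale).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)   # quantize
    return (q.float() * scale).to(w.dtype).reshape(w.shape)             # dequantize

# Round-trip a weight of the shape used in the test and inspect the error.
w = torch.randn(4096, 8192, dtype=torch.bfloat16, device="cuda")
w_dq = fake_quantize_fp8_groupwise(w)
print(((w - w_dq).abs().mean() / w.abs().mean()).item())   # typically a few percent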

Best Practices Demonstrated

The test implementation showcases robust testing practices for numerical operations.

  • Systematic parameter space exploration
  • Precise error threshold validation
  • Comprehensive batch size testing
  • Hardware-specific compatibility checks
  • Clear test organization and modularity

microsoft/deepspeed

tests/unit/ops/fp_quantizer/test_fp8_gemm.py

# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import pytest
import torch
import deepspeed

from deepspeed.ops.op_builder import FPQuantizerBuilder

if not deepspeed.ops.__compatible_ops__[FPQuantizerBuilder.NAME]:
    pytest.skip("FPQuantizer op is not available on this system", allow_module_level=True)

from deepspeed.ops.fp_quantizer import FP_Quantize, matmul_fp8


@pytest.mark.parametrize("dtype", [torch.bfloat16], ids=["bf16"])
@pytest.mark.parametrize("q_bits", [8], ids=[
    "qbits8",
])
@pytest.mark.parametrize("M", [1, 2, 4, 8, 32, 64, 128, 256, 512, 1024, 2048])
def test_fp_quant(dtype, q_bits, M):
    quantization_group_size = 128
    fpq = FP_Quantize(group_size=quantization_group_size)

    N = 8192
    H = 4096

    x = torch.randn(M, H, dtype=dtype, device='cuda')
    weight_bf16 = torch.randn(H, N, dtype=dtype, device='cuda')

    weight, _ = fpq.quantize(weight_bf16.data, q_bits=8, return_meta_tensor=True)
    scale = fpq.get_scales()
    out = matmul_fp8(
        x,
        weight,
        scale,
        quantization_group_size,
    )

    out_q = torch.matmul(x, fpq.dequantize(weight, scale=fpq.scale))

    error = ((out - out_q).abs() / (out.abs() + 1e-5)).sum() / out.numel()
    assert 0.004 > error, f"failed on batch-size {M} with error {error}"
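
On a system where the FPQuantizer op is compatible and a CUDA device is available, the module expands into eleven parametrized cases (one per batch size) and can be run directly with pytest, for example pytest -v tests/unit/ops/fp_quantizer/test_fp8_gemm.py.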