
Validating CPU Lion Optimizer Implementation in DeepSpeed

This test suite validates the CPU implementation of the Lion optimizer in DeepSpeed by comparing its results against the CUDA-based FusedLion implementation. It checks that parameter updates from the two implementations agree numerically across different data types and model sizes.
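For context on what a "parameter update" means here, the following is a minimal pure-Python sketch of the Lion update rule as published. It is not DeepSpeed's CPU or CUDA kernel. The names `sign` and `lion_step` and the hyperparameter values are assumptions chosen for illustration.

```python
def sign(x: float) -> float:
    """Return -1.0, 0.0, or 1.0 depending on the sign of x."""
    return float((x > 0) - (x < 0))


def lion_step(params, grads, momentum, lr=1e-3, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Apply one Lion step to plain Python lists and return (new_params, new_momentum).

    Hyperparameter defaults are illustrative, not taken from DeepSpeed.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is the sign of an interpolation between momentum and gradient.
        update = sign(beta1 * m + (1.0 - beta1) * g)
        # Decoupled weight decay followed by a fixed-magnitude step of size lr.
        p = p * (1.0 - lr * weight_decay) - lr * update
        # The momentum is refreshed after the parameter update, using beta2.
        m = beta2 * m + (1.0 - beta2) * g
        new_params.append(p)
        new_momentum.append(m)
    return new_params, new_momentum


params, momentum = lion_step([1.0, -1.0], [0.5, -0.5], [0.0, 0.0], lr=0.1)
print(params, momentum)
```

Because every element moves by exactly `lr` per step in this rule, two implementations should diverge mainly when a rounding difference flips the sign of an update, which is one reason a loose tolerance is reasonable for reduced-precision runs.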

Test Coverage Overview

The test suite exercises the CPU-based Lion optimizer across a grid of data types and parameter sizes, plus one error path.

Key areas tested include:

Multiple data types (FP16, BF16, FP32)
Five parameter-tensor sizes, from 22 to 1048576 elements
Agreement of parameter norms between the CPU and GPU implementations after 10 optimizer steps
An AssertionError when DeepSpeedCPULion steps on accelerator-resident parameters
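As a rough illustration of how the stacked `pytest.mark.parametrize` decorators expand, the sketch below enumerates the same grid with `itertools.product`. The variable names are hypothetical, and the ordering shown is not claimed to match the order in which pytest collects the cases.

```python
from itertools import product

# The ids and sizes are copied from the parametrize decorators in the test file.
dtype_ids = ["fp16", "bf16", "fp32"]
model_sizes = [64, 22, 128, 1024, 1048576]

# Stacked parametrize decorators yield the cross product of their argument lists.
cases = list(product(model_sizes, dtype_ids))

print(len(cases))
for size, dtype_id in cases[:3]:
    print(size, dtype_id)
```

Some of these cases may still be skipped at run time, for example FP16 on AMD CPUs.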

Implementation Analysis

The testing approach uses stacked `pytest.mark.parametrize` decorators to validate optimizer behavior across different configurations.

Key patterns include:

Distributed test environment setup
A single norm-relative tolerance (1% of the parameter norm) applied to all precisions
Systematic comparison between CPU and GPU implementations
Vendor-specific compatibility checks

Technical Details

Testing infrastructure utilizes:

PyTest framework with parametrization
DeepSpeed’s distributed testing utilities
CPU info detection for vendor-specific features
NumPy for numerical comparisons
Custom tolerance calculations based on parameter norms
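The norm-based tolerance can be sketched in pure Python as follows. This mirrors the idea in `_compare_optimizers` (an absolute tolerance equal to 1% of the first parameter's L2 norm) but is only an approximation: the helper names `l2_norm` and `norms_match` are hypothetical, and the sketch ignores the small default relative tolerance that `np.testing.assert_allclose` also applies.

```python
import math


def l2_norm(values):
    """Return the Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def norms_match(reference, candidate, rel=1e-2):
    """Check that two vectors have L2 norms within rel * ||reference|| of each other."""
    ref_norm = l2_norm(reference)
    atol = ref_norm * rel  # the absolute tolerance scales with the reference norm
    return abs(ref_norm - l2_norm(candidate)) <= atol


print(norms_match([3.0, 4.0], [3.0, 4.01]))  # small deviation
print(norms_match([3.0, 4.0], [3.0, 5.0]))   # large deviation
print(norms_match([3.0, 4.0], [4.0, 3.0]))   # permuted values, same norm
```

The last call shows a limitation of this style of check: comparing norms alone cannot distinguish vectors that differ element-wise but share the same magnitude, so it is a weaker test than an element-wise comparison.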

Best Practices Demonstrated

The test suite exemplifies several testing best practices:

Systematic parameter space exploration
Precise numerical comparison with appropriate tolerances
Explicit assertion of the expected failure via pytest.raises
Clear separation of test cases
Hardware-specific test skipping logic
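The vendor-based skip condition can be isolated as a small predicate, which makes the rule easy to check without any hardware. This is a sketch only: `should_skip_half_on_amd` is a hypothetical helper, and the vendor strings are the usual x86 CPUID values rather than anything read from a live machine.

```python
def should_skip_half_on_amd(cpu_vendor: str, dtype_id: str) -> bool:
    """Return True when the test file would skip: FP16 on a CPU whose vendor string contains 'amd'."""
    return "amd" in cpu_vendor.lower() and dtype_id == "fp16"


# "AuthenticAMD" and "GenuineIntel" are the common x86 vendor strings.
print(should_skip_half_on_amd("AuthenticAMD", "fp16"))
print(should_skip_half_on_amd("AuthenticAMD", "bf16"))
print(should_skip_half_on_amd("GenuineIntel", "fp16"))
```

In the test file the same condition is evaluated inside the test body and triggers `pytest.skip`, so the skip is reported per parameter combination rather than for the whole class.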

microsoft/deepspeed

tests/unit/ops/lion/test_cpu_lion.py

            
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import torch
import numpy as np
import pytest
from cpuinfo import get_cpu_info

import deepspeed
from deepspeed.accelerator import get_accelerator
from deepspeed.ops.lion import FusedLion
from deepspeed.ops.op_builder import CPULionBuilder
from unit.common import DistributedTest

pytest.cpu_vendor = get_cpu_info()["vendor_id_raw"].lower()


def check_equal(first, second, atol=1e-2, verbose=False):
    # Compare two CPU tensors as float32 NumPy arrays within an absolute tolerance.
    x = first.detach().float().numpy()
    y = second.detach().float().numpy()
    print("ATOL", atol)
    if verbose:
        print("x = {}".format(x.flatten()))
        print("y = {}".format(y.flatten()))
        print('-' * 80)
    np.testing.assert_allclose(x, y, err_msg="param-update mismatch!", atol=atol)


def _compare_optimizers(model_size, param1, optimizer1, param2, optimizer2):
    # Feed identical random gradients to both optimizers for 10 steps.
    for _ in range(10):
        param1.grad = torch.randn(model_size, device=param1.device).to(param1.dtype)
        param2.grad = param1.grad.clone().detach().to(device=param2.device, dtype=param2.dtype)

        optimizer1.step()
        optimizer2.step()

    # Allow a deviation of 1% of param1's norm; only the norms are compared, not individual elements.
    tolerance = param1.float().norm().detach().numpy() * 1e-2
    check_equal(param1.float().norm(), param2.float().cpu().norm(), atol=tolerance, verbose=True)


@pytest.mark.parametrize('dtype', [torch.half, torch.bfloat16, torch.float], ids=["fp16", "bf16", "fp32"])
@pytest.mark.parametrize('model_size',
                         [
                             (64),
                             (22),
                             #(55),
                             (128),
                             (1024),
                             (1048576),
                         ]) # yapf: disable
class TestCPULion(DistributedTest):
    world_size = 1
    reuse_dist_env = True
    requires_cuda_env = False
    if not get_accelerator().is_available():
        init_distributed = False
        set_dist_env = False

    @pytest.mark.skipif(not get_accelerator().is_available(), reason="only supported in CUDA environments.")
    @pytest.mark.skipif(not deepspeed.ops.__compatible_ops__[CPULionBuilder.NAME],
                        reason="CPULionBuilder has not been implemented on this system.")
    def test_fused_lion_equal(self, dtype, model_size):
        if ("amd" in pytest.cpu_vendor) and (dtype == torch.half):
            pytest.skip("cpu-lion with half precision not supported on AMD CPUs")

        from deepspeed.ops.lion import DeepSpeedCPULion

        cpu_data = torch.randn(model_size, device='cpu').to(dtype)
        cpu_param = torch.nn.Parameter(cpu_data)
        cuda_param = torch.nn.Parameter(cpu_data.to(get_accelerator().device_name()))

        cpu_optimizer = DeepSpeedCPULion([cpu_param])
        cuda_optimizer = FusedLion([cuda_param])

        _compare_optimizers(model_size=model_size,
                            param1=cpu_param,
                            optimizer1=cpu_optimizer,
                            param2=cuda_param,
                            optimizer2=cuda_optimizer)


class TestCPULionGPUError(DistributedTest):

    @pytest.mark.skipif(not deepspeed.ops.__compatible_ops__[CPULionBuilder.NAME],
                        reason="CPULionBuilder has not been implemented on this system.")
    def test_cpu_lion_gpu_error(self):
        model_size = 64
        from deepspeed.ops.lion import DeepSpeedCPULion
        device = get_accelerator().device_name(0)  # 'cuda:0' or 'xpu:0'
        param = torch.nn.Parameter(torch.randn(model_size, device=device))
        optimizer = DeepSpeedCPULion([param])

        param.grad = torch.randn(model_size, device=device)
        # The test expects DeepSpeedCPULion to reject parameters that do not live on the CPU.
        with pytest.raises(AssertionError):
            optimizer.step()