
Validating CPU Lion Optimizer Implementation in DeepSpeed

This test suite validates the CPU implementation of the Lion optimizer in DeepSpeed (DeepSpeedCPULion) by comparing its parameter updates against the accelerator-based (e.g. CUDA) FusedLion implementation. It checks numerical agreement across data types and model sizes, and verifies that the CPU optimizer rejects parameters placed on an accelerator device.

Test Coverage Overview

The test suite exercises the CPU-based Lion optimizer across every combination of data type and model size, plus one negative test for device placement.

Key areas tested include:
  • Multiple data types (FP16, BF16, FP32)
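  • Five model sizes (parameter counts): 22, 64, 128, 1024, and 1048576
  • Parameter update consistency between CPU and GPU implementations after 10 optimizer steps
  • Error handling for incorrect device placement (an AssertionError is expected when stepping on accelerator-resident parameters)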
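
For context on what "parameter update consistency" is checking, the Lion update rule can be sketched in plain Python. This is only an illustrative reference, not DeepSpeed's CPU or fused kernel; the function name `lion_step` and the hyperparameter defaults are assumptions chosen for the example.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    """Apply one Lion update to lists of floats (hypothetical helper).

    The update direction is the sign of an interpolation between the
    momentum and the current gradient; the momentum is then refreshed
    with a second interpolation factor.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        c = beta1 * m + (1.0 - beta1) * g          # interpolated direction
        new_params.append(p - lr * (sign(c) + weight_decay * p))
        new_momentum.append(beta2 * m + (1.0 - beta2) * g)
    return new_params, new_momentum


# One step from zero momentum, with an exaggerated learning rate for readability.
params, momentum = lion_step([1.0, -2.0], [0.5, -0.25], [0.0, 0.0], lr=0.1)
```

Because every element moves by exactly `lr` in magnitude (ignoring weight decay), two correct implementations should stay close even in reduced precision, which is what the CPU versus accelerator comparison relies on.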

Implementation Analysis

The testing approach uses pytest parametrization (stacked pytest.mark.parametrize decorators on the test class) to validate optimizer behavior across different configurations.

Key patterns include:
  • Distributed test environment setup
  • A norm-relative tolerance (1% of the CPU parameter norm) shared across all precisions
  • Systematic comparison between CPU and GPU implementations
  • Vendor-specific compatibility checks
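
Stacked `pytest.mark.parametrize` decorators produce the cross product of their argument lists. The sketch below reproduces that test matrix with `itertools.product` so the number of generated cases is easy to see; `build_test_matrix` is a hypothetical helper, and the string labels stand in for the torch dtypes used in the real test.

```python
from itertools import product

# Labels mirror the ids used in the test file; sizes are the active model sizes.
DTYPES = ["fp16", "bf16", "fp32"]
MODEL_SIZES = [64, 22, 128, 1024, 1048576]


def build_test_matrix(dtypes, model_sizes):
    """Return every (dtype, model_size) pair, as stacked parametrize would."""
    return list(product(dtypes, model_sizes))


matrix = build_test_matrix(DTYPES, MODEL_SIZES)
print(len(matrix))  # 3 dtypes x 5 sizes
```

Some of these combinations may still be skipped at run time, for example fp16 on AMD CPUs.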

Technical Details

Testing infrastructure utilizes:
  • PyTest framework with parametrization
  • DeepSpeed’s distributed testing utilities
  • CPU info detection for vendor-specific features
  • NumPy for numerical comparisons
  • Custom tolerance calculations based on parameter norms
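
The norm-based tolerance can be sketched as follows, assuming the comparison works the way `_compare_optimizers` does: the absolute tolerance is 1% of the reference parameter norm, and the two norms must differ by no more than that. The helpers `l2_norm` and `norms_match` are hypothetical and use plain Python lists instead of tensors; the small default relative term that `np.testing.assert_allclose` adds is omitted here.

```python
import math


def l2_norm(values):
    """Euclidean norm of a sequence of floats."""
    return math.sqrt(sum(v * v for v in values))


def norms_match(reference, candidate, rel=1e-2):
    """Compare L2 norms using an absolute tolerance scaled by the reference norm.

    Returns (matches, atol) so the caller can report the tolerance used.
    """
    ref_norm = l2_norm(reference)
    atol = ref_norm * rel
    return abs(ref_norm - l2_norm(candidate)) <= atol, atol


ok, atol = norms_match([3.0, 4.0], [3.0, 4.01])
bad, _ = norms_match([3.0, 4.0], [3.0, 4.1])
```

Scaling the tolerance by the norm keeps the check meaningful for both 22-element and million-element tensors. Comparing norms rather than individual elements is a coarse check, so it trades sensitivity for robustness to low-precision rounding.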

Best Practices Demonstrated

The test suite exemplifies several testing best practices:

  • Systematic parameter space exploration
  • Precise numerical comparison with appropriate tolerances
  • Proper error handling and validation
  • Clear separation of test cases
  • Hardware-specific test skipping logic
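
The hardware-specific skip conditions can be expressed as a small pure function, which makes the decision logic easy to test on its own. This is a sketch under assumptions: `skip_reason` is a hypothetical helper, the dtype is passed as a string label, and the reasons are copied from the skip messages in the test file.

```python
def skip_reason(cpu_vendor, dtype_name, accelerator_available):
    """Return a skip message, or None when the test should run.

    cpu_vendor: raw vendor string, e.g. "AuthenticAMD" or "GenuineIntel".
    dtype_name: one of "fp16", "bf16", "fp32" (labels, not torch dtypes).
    accelerator_available: whether a CUDA-like accelerator is present.
    """
    if not accelerator_available:
        return "only supported in CUDA environments."
    if "amd" in cpu_vendor.lower() and dtype_name == "fp16":
        return "cpu-lion with half precision not supported on AMD CPUs"
    return None


print(skip_reason("AuthenticAMD", "fp16", True))
```

In the actual test these conditions are split between `pytest.mark.skipif` decorators and an in-test `pytest.skip` call, and there is an additional check that the CPULion op is compatible with the system.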

microsoft/deepspeed

tests/unit/ops/lion/test_cpu_lion.py

            
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team

import torch
import numpy as np
import pytest
from cpuinfo import get_cpu_info

import deepspeed
from deepspeed.accelerator import get_accelerator
from deepspeed.ops.lion import FusedLion
from deepspeed.ops.op_builder import CPULionBuilder
from unit.common import DistributedTest

pytest.cpu_vendor = get_cpu_info()["vendor_id_raw"].lower()


def check_equal(first, second, atol=1e-2, verbose=False):
    x = first.detach().float().numpy()
    y = second.detach().float().numpy()
    print("ATOL", atol)
    if verbose:
        print("x = {}".format(x.flatten()))
        print("y = {}".format(y.flatten()))
        print('-' * 80)
    np.testing.assert_allclose(x, y, err_msg="param-update mismatch!", atol=atol)


def _compare_optimizers(model_size, param1, optimizer1, param2, optimizer2):
    # Drive both optimizers with identical random gradients for 10 steps.
    for i in range(10):
        param1.grad = torch.randn(model_size, device=param1.device).to(param1.dtype)
        param2.grad = param1.grad.clone().detach().to(device=param2.device, dtype=param2.dtype)

        optimizer1.step()
        optimizer2.step()

    # Compare the L2 norms of the final parameters; the absolute tolerance is 1% of param1's norm.
    tolerance = param1.float().norm().detach().numpy() * 1e-2
    check_equal(param1.float().norm(), param2.float().cpu().norm(), atol=tolerance, verbose=True)


@pytest.mark.parametrize('dtype', [torch.half, torch.bfloat16, torch.float], ids=["fp16", "bf16", "fp32"])
@pytest.mark.parametrize('model_size',
                         [
                             (64),
                             (22),
                             #(55),
                             (128),
                             (1024),
                             (1048576),
                         ]) # yapf: disable
class TestCPULion(DistributedTest):
    world_size = 1
    reuse_dist_env = True
    requires_cuda_env = False
    if not get_accelerator().is_available():
        init_distributed = False
        set_dist_env = False

    @pytest.mark.skipif(not get_accelerator().is_available(), reason="only supported in CUDA environments.")
    @pytest.mark.skipif(not deepspeed.ops.__compatible_ops__[CPULionBuilder.NAME],
                        reason="CPULionBuilder has not been implemented on this system.")
    def test_fused_lion_equal(self, dtype, model_size):
        if ("amd" in pytest.cpu_vendor) and (dtype == torch.half):
            pytest.skip("cpu-lion with half precision not supported on AMD CPUs")

        from deepspeed.ops.lion import DeepSpeedCPULion

        cpu_data = torch.randn(model_size, device='cpu').to(dtype)
        cpu_param = torch.nn.Parameter(cpu_data)
        cuda_param = torch.nn.Parameter(cpu_data.to(get_accelerator().device_name()))

        cpu_optimizer = DeepSpeedCPULion([cpu_param])
        cuda_optimizer = FusedLion([cuda_param])

        _compare_optimizers(model_size=model_size,
                            param1=cpu_param,
                            optimizer1=cpu_optimizer,
                            param2=cuda_param,
                            optimizer2=cuda_optimizer)


class TestCPULionGPUError(DistributedTest):

    @pytest.mark.skipif(not deepspeed.ops.__compatible_ops__[CPULionBuilder.NAME],
                        reason="CPULionBuilder has not been implemented on this system.")
    def test_cpu_lion_gpu_error(self):
        model_size = 64
        from deepspeed.ops.lion import DeepSpeedCPULion
        device = get_accelerator().device_name(0)  # 'cuda:0' or 'xpu:0'
        param = torch.nn.Parameter(torch.randn(model_size, device=device))
        optimizer = DeepSpeedCPULion([param])

        param.grad = torch.randn(model_size, device=device)
        with pytest.raises(AssertionError):
            optimizer.step()