
Testing GPT-2 Model Configurations and Loss Functions in ColossalAI

This test utility module defines GPT-2 model configurations and a language-modeling loss for the ColossalAI framework's FX profiler tests, supporting model architecture verification and performance profiling.

Test Coverage Overview

The suite covers GPT-2 model implementations in medium and XL variants, exposed through the gpt2_medium and gpt2_xl factory functions.

Key areas tested include (see the sketch after this list):
  • Model initialization with configurable parameters
  • Forward pass functionality with attention masks
  • Loss calculation mechanisms
  • Gradient checkpointing implementation
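
The sketch below shows how these behaviors might be exercised. The tiny hyperparameters and the gpt_utils import path are illustrative assumptions, not code from the repository.

import torch

from gpt_utils import GPTLMModel  # hypothetical import path

vocab_size, seq_len = 128, 16
model = GPTLMModel(
    hidden_size=32,
    num_layers=2,
    num_attention_heads=2,
    max_seq_len=seq_len,
    vocab_size=vocab_size,
    checkpoint=False,
)

input_ids = torch.randint(0, vocab_size, (2, seq_len))
attention_mask = torch.ones_like(input_ids)

logits = model(input_ids, attention_mask)        # forward pass with attention mask
assert logits.shape == (2, seq_len, vocab_size)  # only the lm_logits are returned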

Implementation Analysis

The implementation wraps Hugging Face's GPT2LMHeadModel in PyTorch's nn.Module framework, exposing GPT-2 variants with customizable architectures.

Notable implementation patterns include (see the factory sketch after this list):
  • Modular class structure for model and loss components
  • Flexible parameter configuration
  • Integration with Hugging Face Transformers library
  • Optional gradient checkpointing support
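
Because every architecture knob is a constructor argument, new variants reduce to one-line factories. The gpt2_small function below is a hypothetical example following the same pattern (GPT-2 small matches the GPTLMModel defaults); it is not defined in the repository.

from gpt_utils import GPTLMModel  # hypothetical import path


def gpt2_small(checkpoint=False):
    # Hypothetical variant mirroring the repository's factory pattern;
    # GPT-2 small uses the GPTLMModel defaults (768 hidden, 12 layers, 12 heads).
    return GPTLMModel(hidden_size=768, num_layers=12, num_attention_heads=12, checkpoint=checkpoint)


model = gpt2_small(checkpoint=True)  # enables gradient checkpointing in the wrapper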

Technical Details

Testing infrastructure leverages the following, with the loss's label shift illustrated after the list:
  • PyTorch nn.Module for model architecture
  • Transformers library’s GPT2Config and GPT2LMHeadModel
  • CrossEntropyLoss for language modeling
  • Gradient checkpointing for memory optimization
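
The loss component follows the standard causal language-modeling recipe: logits at position t are scored against the token at position t + 1. The toy snippet below illustrates that shift with made-up shapes, independently of the repository code.

import torch
import torch.nn as nn

batch, seq_len, vocab = 2, 8, 50
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

# Drop the last prediction and the first label so position t predicts token t + 1,
# then flatten to the (N, C) / (N,) shapes CrossEntropyLoss expects.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = nn.CrossEntropyLoss()(shift_logits.view(-1, vocab), shift_labels.view(-1))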

Best Practices Demonstrated

The test implementation showcases robust software engineering practices.

Notable practices include (see the profiling-style sketch after this list):
  • Clean separation of model and loss components
  • Configurable architecture parameters
  • Memory optimization options
  • Consistent interface design
  • Standard PyTorch model structure adherence
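
The sketch below shows how profiling code might exploit that consistent interface by treating the variants interchangeably. The loop and the gpt_utils import path are illustrative, not the repository's test code; note that instantiating gpt2_xl is memory-intensive.

from gpt_utils import gpt2_medium, gpt2_xl  # hypothetical import path

# Every factory shares the same signature and every model shares the same
# forward(input_ids, attention_mask) interface, so the variants are interchangeable.
for build_model in (gpt2_medium, gpt2_xl):
    model = build_model(checkpoint=True)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{build_model.__name__}: {n_params / 1e6:.1f}M parameters")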

hpcaitech/colossalai

tests/test_fx/test_profiler/gpt_utils.py

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


class GPTLMModel(nn.Module):
    def __init__(
        self,
        hidden_size=768,
        num_layers=12,
        num_attention_heads=12,
        max_seq_len=1024,
        vocab_size=50257,
        checkpoint=False,
    ):
        super().__init__()
        self.checkpoint = checkpoint
        self.model = GPT2LMHeadModel(
            GPT2Config(
                n_embd=hidden_size,
                n_layer=num_layers,
                n_head=num_attention_heads,
                n_positions=max_seq_len,
                n_ctx=max_seq_len,
                vocab_size=vocab_size,
            )
        )
        if checkpoint:
            # Trade recompute for activation memory during the backward pass
            self.model.gradient_checkpointing_enable()

    def forward(self, input_ids, attention_mask):
        # Return only the lm_logits; disable the KV cache when gradient
        # checkpointing is active, since caching conflicts with recomputation.
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]


class GPTLMLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))


def gpt2_medium(checkpoint=False):
    return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)


def gpt2_xl(checkpoint=False):
    return GPTLMModel(hidden_size=1600, num_layers=48, num_attention_heads=32, checkpoint=checkpoint)
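
Putting the helpers together, a single training step might look like the sketch below. The import path, batch shape, and optimizer choice are illustrative assumptions rather than repository code.

import torch

from gpt_utils import GPTLMLoss, gpt2_medium  # hypothetical import path

model = gpt2_medium(checkpoint=True)  # recompute activations to save memory
criterion = GPTLMLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

vocab_size, batch_size, seq_len = 50257, 2, 128
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)

logits = model(input_ids, attention_mask)  # (batch_size, seq_len, vocab_size)
loss = criterion(logits, input_ids)        # labels are the inputs; the loss shifts them
loss.backward()
optimizer.step()
optimizer.zero_grad()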