
Testing GPT-2 Model Configurations and Loss Functions in ColossalAI

This test utility module defines GPT-2 model configurations and a language-modeling loss for the ColossalAI framework's FX profiler tests, supporting model architecture verification and performance profiling.

Test Coverage Overview

The suite covers GPT-2 model implementations in medium and XL variants, exposed through the gpt2_medium and gpt2_xl factory functions.

Key areas tested include (see the sketch after this list):
  • Model initialization with configurable parameters
  • Forward pass functionality with attention masks
  • Loss calculation mechanisms
  • Gradient checkpointing implementation
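
The sketch below shows how these behaviors might be exercised. The tiny hyperparameters and the gpt_utils import path are illustrative assumptions, not code from the repository.

import torch

from gpt_utils import GPTLMModel  # hypothetical import path

vocab_size, seq_len = 128, 16
model = GPTLMModel(
    hidden_size=32,
    num_layers=2,
    num_attention_heads=2,
    max_seq_len=seq_len,
    vocab_size=vocab_size,
    checkpoint=False,
)

input_ids = torch.randint(0, vocab_size, (2, seq_len))
attention_mask = torch.ones_like(input_ids)

logits = model(input_ids, attention_mask)        # forward pass with attention mask
assert logits.shape == (2, seq_len, vocab_size)  # only the lm_logits are returned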

Implementation Analysis

The implementation wraps Hugging Face's GPT2LMHeadModel in PyTorch's nn.Module framework, exposing GPT-2 variants with customizable architectures.

Notable implementation patterns include (see the factory sketch after this list):
  • Modular class structure for model and loss components
  • Flexible parameter configuration
  • Integration with Hugging Face Transformers library
  • Optional gradient checkpointing support
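
Because every architecture knob is a constructor argument, new variants reduce to one-line factories. The gpt2_small function below is a hypothetical example following the same pattern (GPT-2 small matches the GPTLMModel defaults); it is not defined in the repository.

from gpt_utils import GPTLMModel  # hypothetical import path


def gpt2_small(checkpoint=False):
    # Hypothetical variant mirroring the repository's factory pattern;
    # GPT-2 small uses the GPTLMModel defaults (768 hidden, 12 layers, 12 heads).
    return GPTLMModel(hidden_size=768, num_layers=12, num_attention_heads=12, checkpoint=checkpoint)


model = gpt2_small(checkpoint=True)  # enables gradient checkpointing in the wrapper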

Technical Details

Testing infrastructure leverages the following, with the loss's label shift illustrated after the list:
  • PyTorch nn.Module for model architecture
  • Transformers library’s GPT2Config and GPT2LMHeadModel
  • CrossEntropyLoss for language modeling
  • Gradient checkpointing for memory optimization
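
The loss component follows the standard causal language-modeling recipe: logits at position t are scored against the token at position t + 1. The toy snippet below illustrates that shift with made-up shapes, independently of the repository code.

import torch
import torch.nn as nn

batch, seq_len, vocab = 2, 8, 50
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))

# Drop the last prediction and the first label so position t predicts token t + 1,
# then flatten to the (N, C) / (N,) shapes CrossEntropyLoss expects.
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
loss = nn.CrossEntropyLoss()(shift_logits.view(-1, vocab), shift_labels.view(-1))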

Best Practices Demonstrated

The test implementation showcases robust software engineering practices.

Notable practices include (see the profiling-style sketch after this list):
  • Clean separation of model and loss components
  • Configurable architecture parameters
  • Memory optimization options
  • Consistent interface design
  • Standard PyTorch model structure adherence
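
The sketch below shows how profiling code might exploit that consistent interface by treating the variants interchangeably. The loop and the gpt_utils import path are illustrative, not the repository's test code; note that instantiating gpt2_xl is memory-intensive.

from gpt_utils import gpt2_medium, gpt2_xl  # hypothetical import path

# Every factory shares the same signature and every model shares the same
# forward(input_ids, attention_mask) interface, so the variants are interchangeable.
for build_model in (gpt2_medium, gpt2_xl):
    model = build_model(checkpoint=True)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{build_model.__name__}: {n_params / 1e6:.1f}M parameters")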

hpcaitech/colossalai

tests/test_fx/test_profiler/gpt_utils.py

import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel


class GPTLMModel(nn.Module):
    def __init__(
        self,
        hidden_size=768,
        num_layers=12,
        num_attention_heads=12,
        max_seq_len=1024,
        vocab_size=50257,
        checkpoint=False,
    ):
        super().__init__()
        self.checkpoint = checkpoint
        self.model = GPT2LMHeadModel(
            GPT2Config(
                n_embd=hidden_size,
                n_layer=num_layers,
                n_head=num_attention_heads,
                n_positions=max_seq_len,
                n_ctx=max_seq_len,
                vocab_size=vocab_size,
            )
        )
        if checkpoint:
            # Trade recompute for activation memory during the backward pass
            self.model.gradient_checkpointing_enable()

    def forward(self, input_ids, attention_mask):
        # Return only the lm_logits; disable the KV cache when gradient
        # checkpointing is active, since caching conflicts with recomputation.
        return self.model(input_ids=input_ids, attention_mask=attention_mask, use_cache=not self.checkpoint)[0]


class GPTLMLoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, logits, labels):
        shift_logits = logits[..., :-1, :].contiguous()
        shift_labels = labels[..., 1:].contiguous()
        # Flatten the tokens
        return self.loss_fn(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))


def gpt2_medium(checkpoint=False):
    return GPTLMModel(hidden_size=1024, num_layers=24, num_attention_heads=16, checkpoint=checkpoint)


def gpt2_xl(checkpoint=False):
    return GPTLMModel(hidden_size=1600, num_layers=48, num_attention_heads=32, checkpoint=checkpoint)
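
Putting the helpers together, a single training step might look like the sketch below. The import path, batch shape, and optimizer choice are illustrative assumptions rather than repository code.

import torch

from gpt_utils import GPTLMLoss, gpt2_medium  # hypothetical import path

model = gpt2_medium(checkpoint=True)  # recompute activations to save memory
criterion = GPTLMLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

vocab_size, batch_size, seq_len = 50257, 2, 128
input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
attention_mask = torch.ones_like(input_ids)

logits = model(input_ids, attention_mask)  # (batch_size, seq_len, vocab_size)
loss = criterion(logits, input_ids)        # labels are the inputs; the loss shifts them
loss.backward()
optimizer.step()
optimizer.zero_grad()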