Testing Embedding Utility Functions and MMR Implementation in llama_index

This test suite validates embedding utility functions in llama_index, focusing on Maximum Marginal Relevance (MMR) and top-k embedding retrieval implementations. The tests ensure accurate similarity calculations and proper handling of embedding vectors.

Test Coverage Overview

The suite consists of a single test function, test_get_top_k_mmr_embeddings, which exercises the MMR implementation and cross-checks it against plain top-k retrieval.

  • Checks MMR scores against hand-computed values at mmr_threshold=0.8
  • Validates result ordering and the mapping from results to custom embedding IDs
  • Verifies that MMR scores match regular top-k scores when mmr_threshold=1
  • Covers thresholds of 0.5, 0.8, and 1, plus the similarity_top_k cutoff
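
To make the scores in the first scenario easier to follow, here is a minimal NumPy sketch of greedy MMR selection. It is an illustration written for this summary, not the llama_index implementation: the names cosine and mmr_sketch are hypothetical, and the scoring rule (threshold times relevance, minus one-minus-threshold times the highest similarity to any already selected vector) is an assumption inferred from the expected values in the test.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def mmr_sketch(query, embeddings, threshold=0.5, top_k=None):
    """Hypothetical greedy MMR: returns (scores, ids) in selection order."""
    top_k = len(embeddings) if top_k is None else top_k
    relevance = [cosine(query, e) for e in embeddings]
    remaining = list(range(len(embeddings)))
    scores, ids = [], []
    while remaining and len(ids) < top_k:
        best_idx, best_score = None, -np.inf
        for i in remaining:
            # Redundancy is 0 for the first pick, when nothing is selected yet.
            redundancy = max(
                (cosine(embeddings[i], embeddings[j]) for j in ids),
                default=0.0,
            )
            score = threshold * relevance[i] - (1 - threshold) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        scores.append(best_score)
        ids.append(best_idx)
        remaining.remove(best_idx)
    return scores, ids


# Same data as the first scenario in the test file.
scores, ids = mmr_sketch(
    [5.0, 0.0, 0.0],
    [[4.0, 3.0, 0.0], [3.0, 4.0, 0.0], [-4.0, 3.0, 0.0]],
    threshold=0.8,
)
print(ids)  # [0, 1, 2]
```

With threshold=1 the redundancy term vanishes and the sketch reduces to ranking by cosine similarity alone, which is the property the last scenario in the test relies on.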

Implementation Analysis

Expected MMR scores are written out by hand from the scoring formula and compared with the function's output using NumPy.

Key patterns include:
  • Numerical similarity assertions using np.isclose() with atol=0.00001
  • Small 3- and 4-dimensional input vectors chosen so the expected similarities are easy to compute
  • Repeated calls with different mmr_threshold values on the same data
  • ID mapping verification with string IDs ("A", "B", "C")
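
The tolerance-based comparison can be illustrated in isolation. The snippet below is a small sketch using only NumPy; the variable names are made up for the example, and the 1e-7 offset merely simulates floating-point drift. np.isclose(a, b) treats two values as equal when |a - b| <= atol + rtol * |b|, with rtol defaulting to 1e-5.

```python
import numpy as np

# Hand-computed expectation for the second MMR score in the test.
expected = 0.8 * 3 / 5 - (1 - 0.8) * (24 / 25)

# Simulate a result that differs only by tiny floating-point drift.
computed = expected + 1e-7

exact_match = computed == expected  # exact equality is too strict
close_match = bool(np.isclose(expected, computed, atol=0.00001))
far_match = bool(np.isclose(expected, expected + 1e-3, atol=0.00001))

print(exact_match, close_match, far_match)  # False True False
```

An explicit atol matters most for expected values near zero, where the relative tolerance alone contributes almost nothing.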

Technical Details

Testing tools and configuration:
  • A plain test function with bare assert statements (pytest-style)
  • NumPy for numerical comparisons
  • Functions under test: get_top_k_embeddings and get_top_k_mmr_embeddings, imported from llama_index.core.indices.query.embedding_utils
  • Hard-coded 3- and 4-dimensional test vectors
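
For comparison with MMR, plain top-k retrieval can be sketched as a sort by similarity to the query. The function below is a hypothetical stand-in written for this summary, not the library's get_top_k_embeddings; it assumes cosine similarity as the metric and positional IDs when none are supplied.

```python
import numpy as np


def top_k_sketch(query, embeddings, k=None, ids=None):
    """Hypothetical top-k by cosine similarity: returns (scores, ids)."""
    q = np.asarray(query, dtype=float)
    m = np.asarray(embeddings, dtype=float)
    # Cosine similarity of every row of m against the query.
    sims = m @ q / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))
    # Stable sort keeps the original order for tied scores; k=None keeps all.
    order = np.argsort(-sims, kind="stable")[:k]
    ids = list(range(len(m))) if ids is None else ids
    return [float(sims[i]) for i in order], [ids[i] for i in order]


# Same data as the final scenario in the test file.
scores, result_ids = top_k_sketch(
    [10, 23, 90, 78],
    [[1, 23, 89, 68], [1, 74, 144, 23], [0.23, 0.0, 1.0, 9]],
    ids=["A", "B", "C"],
)
print(result_ids)  # ['A', 'B', 'C']
```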

Best Practices Demonstrated

The test suite illustrates several useful practices for testing vector-based algorithms.

  • Boundary coverage: mmr_threshold=1 must reproduce plain similarity ordering
  • Floating-point comparisons with an explicit absolute tolerance rather than exact equality
  • Inline comments stating what each scenario is expected to show
  • Checks on both numerical results (scores) and structural outputs (ID order)

run-llama/llama_index

llama-index-core/tests/indices/query/test_embedding_utils.py

"""Test embedding utility functions."""

import numpy as np
from llama_index.core.indices.query.embedding_utils import (
    get_top_k_embeddings,
    get_top_k_mmr_embeddings,
)


def test_get_top_k_mmr_embeddings() -> None:
    """Test Maximum Marginal Relevance."""
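    # Result scores should follow from the MMR algorithm. The expected values
    # below are threshold * sim(query, e) minus (1 - threshold) times the
    # highest cosine similarity between e and any already selected embedding.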
    query_embedding = [5.0, 0.0, 0.0]
    embeddings = [[4.0, 3.0, 0.0], [3.0, 4.0, 0.0], [-4.0, 3.0, 0.0]]
    result_similarities, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, mmr_threshold=0.8
    )

    assert np.isclose(0.8 * 4 / 5, result_similarities[0], atol=0.00001)
    assert np.isclose(
        0.8 * 3 / 5 - (1 - 0.8) * (3 * 4 / 25 + 3 * 4 / 25),
        result_similarities[1],
        atol=0.00001,
    )
    assert np.isclose(
        0.8 * -4 / 5 - (1 - 0.8) * (3 * -4 / 25 + 4 * 3 / 25),
        result_similarities[2],
        atol=0.00001,
    )
    assert result_ids == [0, 1, 2]

    # Tests that when the second embedding is nearly identical to the first,
    # the third embedding is ranked ahead of it
    query_embedding = [1.0, 0.0, 1.0]
    embeddings = [[1.0, 0.0, 0.9], [1.0, 0.0, 0.8], [0.7, 0.0, 1.0]]

    _, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, mmr_threshold=0.5
    )
    assert result_ids == [0, 2, 1]

    # Tests that embedding ids map properly to results
    _, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, embedding_ids=["A", "B", "C"], mmr_threshold=0.5
    )
    assert result_ids == ["A", "C", "B"]

    # Tests that the original similarity order is restored at a threshold of 1
    _, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, mmr_threshold=1
    )
    assert result_ids == [0, 1, 2]

    # Test similarity_top_k works
    _, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, mmr_threshold=1, similarity_top_k=2
    )
    assert result_ids == [0, 1]

    # Test the results for get_top_k_embeddings and get_top_k_mmr_embeddings are the
    # same for threshold = 1
    query_embedding = [10, 23, 90, 78]
    embeddings = [[1, 23, 89, 68], [1, 74, 144, 23], [0.23, 0.0, 1.0, 9]]
    result_similarities_no_mmr, result_ids_no_mmr = get_top_k_embeddings(
        query_embedding, embeddings
    )
    result_similarities, result_ids = get_top_k_mmr_embeddings(
        query_embedding, embeddings, mmr_threshold=1
    )

    for result_no_mmr, result_with_mmr in zip(
        result_similarities_no_mmr, result_similarities
    ):
        assert np.isclose(result_no_mmr, result_with_mmr, atol=0.00001)