
Testing TrieDict Text Tokenization Implementation in HanLP

This test suite validates the TrieDict implementation in HanLP, focusing on dictionary-based text tokenization and batch processing for Chinese text analysis.

Test Coverage Overview

The test suite provides comprehensive coverage of TrieDict functionality:

  • Basic tokenization with dictionary matching
  • Batch processing capabilities
  • Multi-word phrase detection
  • Empty dictionary handling
  • Dictionary modification operations
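To make the first two bullets concrete, here is a minimal sketch of dictionary-based longest-match tokenization. The `build_trie` and `tokenize` functions are simplified stand-ins written for illustration, not HanLP's API; the dictionary and text mirror the suite's fixtures, and the expected spans match its `test_tokenize` case.

```python
def build_trie(dictionary):
    """Build a nested-dict trie; '$' marks end-of-word and stores its value."""
    root = {}
    for word, value in dictionary.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = value
    return root


def tokenize(trie, text):
    """Return (start, end, value) spans for the longest dictionary matches."""
    spans = []
    i = 0
    while i < len(text):
        node, j, match = trie, i, None
        while j < len(text) and text[j] in node:
            node = node[text[j]]
            j += 1
            if '$' in node:
                match = (i, j, node['$'])  # remember the longest match so far
        if match:
            spans.append(match)
            i = match[1]  # resume scanning after the matched span
        else:
            i += 1
    return spans


trie = build_trie({'重要': 'important'})
print(tokenize(trie, '第一个词语很重要,第二个词语也很重要'))
# → [(6, 8, 'important'), (16, 18, 'important')]
```

The spans are half-open character offsets into the original text, which is why the suite's assertions compare tuples like (6, 8, 'important') rather than matched strings.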

Implementation Analysis

The tests use Python’s unittest framework to systematically validate core TrieDict operations.

Key patterns include setUp fixture initialization, assertion-based validation, and sequential test organization focusing on individual features.

The implementation leverages unittest’s TestCase class for structured test execution and assertion handling.
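The setUp-fixture pattern described above can be sketched as a minimal TestCase. The class and method names here are hypothetical; the fixture values mirror the suite listed below.

```python
import unittest


class TestTokenizePattern(unittest.TestCase):
    """Minimal illustration of the setUp-fixture pattern used by the suite."""

    def setUp(self) -> None:
        # setUp runs before every test method, so each case gets
        # freshly initialized fixtures and stays isolated.
        self.text = '第一个词语很重要,第二个词语也很重要'
        self.dictionary = {'重要': 'important'}

    def test_dictionary_contains_entry(self):
        self.assertEqual(self.dictionary['重要'], 'important')

    def test_text_mentions_keyword_twice(self):
        self.assertEqual(self.text.count('重要'), 2)
```

Running this with `python -m unittest` executes each test method against its own fixture instance, which is what keeps the real suite's cases independent of one another.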

Technical Details

  • Testing Framework: Python unittest
  • Test Setup: TestCase inheritance with setUp method
  • Assertion Methods: assertEqual, assertSequenceEqual, assertTrue, assertFalse
  • Test Data: Mixed Chinese-English text samples
  • Dictionary Operations: Initialization, modification, and deletion
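The dictionary operations in the last bullet follow the mutable-mapping contract that `test_empty_dict` exercises: an empty dictionary is falsy, becomes truthy once an entry is added, and falsy again after deletion. The `SimpleDict` class below is an illustrative stand-in for that contract, not HanLP's TrieDict.

```python
class SimpleDict:
    """Illustrative mapping whose truthiness tracks whether it has entries."""

    def __init__(self, mapping=None):
        self._entries = dict(mapping or {})

    def __setitem__(self, key, value):
        self._entries[key] = value

    def __delitem__(self, key):
        del self._entries[key]

    def __bool__(self):
        # bool(d) is False exactly when the dictionary holds no entries.
        return bool(self._entries)


d = SimpleDict()
assert not d       # empty → falsy
d['one'] = 1
assert d           # has an entry → truthy
del d['one']
assert not d       # empty again → falsy
```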

Best Practices Demonstrated

The test suite exemplifies several testing best practices:

  • Isolated test cases with clear objectives
  • Proper test fixture setup
  • Comprehensive edge case handling
  • Consistent assertion patterns
  • Clean test method naming conventions

hankcs/hanlp

plugins/hanlp_trie/tests/test_trie_dict.py

import unittest

from hanlp_trie import TrieDict


class TestTrieDict(unittest.TestCase):

    def setUp(self) -> None:
        super().setUp()
        self.text = '第一个词语很重要,第二个词语也很重要'
        self.trie_dict = TrieDict({'重要': 'important'})

    def test_tokenize(self):
        self.assertEqual([(6, 8, 'important'), (16, 18, 'important')], self.trie_dict.tokenize(self.text))

    def test_split_batch(self):
        data = [self.text]
        new_data, new_data_belongs, parts = self.trie_dict.split_batch(data)
        predictions = [list(x) for x in new_data]
        self.assertSequenceEqual(
            [['第', '一', '个', '词', '语', '很', 'important', ',', '第', '二', '个', '词', '语', '也', '很', 'important']],
            self.trie_dict.merge_batch(data, predictions, new_data_belongs, parts))

    def test_tokenize_2(self):
        t = TrieDict({'次世代', '生产环境'})
        self.assertSequenceEqual(t.tokenize('2021年HanLPv2.1为生产环境带来次世代最先进的多语种NLP技术。'),
                                 [(15, 19, True), (21, 24, True)])

    def test_empty_dict(self):
        trie_dict = TrieDict()
        self.assertFalse(bool(trie_dict))
        trie_dict['one'] = 1
        self.assertTrue(bool(trie_dict))
        del trie_dict['one']
        self.assertFalse(bool(trie_dict))


if __name__ == '__main__':
    unittest.main()