Testing Chinese Text Tokenization Possibilities in HanLP
This test validates the possible_tokenization function in HanLP's string utilities (hanlp.utils.string_util). For a Chinese string of n characters, it checks that the function returns 2^(n-1) distinct segmentations and that the tokens of every segmentation concatenate back to the original string.
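To make the 2^(n-1) count concrete, here is a minimal sketch of one way to enumerate every segmentation of a string. It is not HanLP's implementation: the helper name enumerate_segmentations is hypothetical, and the approach (one cut-or-no-cut choice per gap between adjacent characters) is simply one straightforward way to produce the set that the test expects.

```python
from itertools import product


def enumerate_segmentations(text):
    """Hypothetical helper, not HanLP's code: list every way to split
    `text` into contiguous, non-empty pieces."""
    if not text:
        return []
    results = []
    # One boolean per gap between adjacent characters: True means "cut here".
    for cuts in product([False, True], repeat=len(text) - 1):
        tokens, start = [], 0
        for gap, cut in enumerate(cuts, start=1):
            if cut:
                tokens.append(text[start:gap])
                start = gap
        tokens.append(text[start:])  # the final piece after the last cut
        # Tuples are hashable, so the results can be placed in a set.
        results.append(tuple(tokens))
    return results


segs = enumerate_segmentations('商品和服务')
print(len(segs))  # 16, i.e. 2 ** (5 - 1)
```

Because each of the n-1 gaps is independently either a boundary or not, every cut pattern yields a different segmentation, which is why the test can compare the size of the deduplicated result against 2 ** (len(text) - 1).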
hankcs/hanlp
tests/test_string_util.py
# -*- coding:utf-8 -*-
# Author: hankcs
# Date: 2022-03-22 17:17
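import unittest

from hanlp.utils.string_util import possible_tokenization


class TestStringUtility(unittest.TestCase):
    def test_enumerate_tokenization(self):
        text = '商品和服务'
        toks = possible_tokenization(text)
        # n characters have n - 1 gaps, each either a boundary or not,
        # so there are 2 ** (n - 1) distinct segmentations.
        assert len(set(toks)) == 2 ** (len(text) - 1)
        # Every segmentation must concatenate back to the original text.
        for each in toks:
            assert ''.join(each) == text


if __name__ == '__main__':
    unittest.main()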