
Testing Whoosh Search Integration with Chinese Text Analysis in jieba

This test suite validates Whoosh search functionality integrated with jieba's Chinese text analysis. It exercises an existing Whoosh index whose content field is analyzed by ChineseAnalyzer, issuing queries for a set of Chinese and English keywords and printing the highlighted matches.

Test Coverage Overview

The test coverage focuses on validating the Whoosh search index functionality with Jieba’s Chinese text analysis capabilities.

Key areas tested include:
  • Index directory handling (the tmp directory is created if missing and the existing index is opened from it)
  • Schema definition with TEXT and ID fields (see the index-creation sketch after this list)
  • Search functionality across multiple keywords
  • Mixed language search support (Chinese and English)
  • Content highlighting in search results
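
This read-oriented test assumes an index already exists under tmp. As a rough orientation, a minimal sketch of how such an index could be built with the same schema is shown below; the document titles, paths, and content strings are illustrative placeholders, not taken from the repository.

# Sketch: build a Whoosh index whose content field is analyzed by jieba.
# Field names mirror the test's schema; the sample documents are hypothetical.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp"):
    os.mkdir("tmp")
ix = create_in("tmp", schema)  # creates (or replaces) the index in ./tmp

writer = ix.writer()
writer.add_document(title="document one", path="/a",
                    content="我们正在测试中文分词与 first example 的混合检索")
writer.add_document(title="document two", path="/b",
                    content="交换机和交换设备的使用说明")
writer.commit()  # flush the documents to disk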

Implementation Analysis

The test systematically verifies the integration of the Whoosh search engine with jieba's ChineseAnalyzer. It relies on the following key patterns:

  • Schema-based index configuration
  • Directory management for search indexes
  • QueryParser implementation for content searching (see the query-parsing sketch after this list)
  • Iterative keyword testing with result highlighting
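
Because the content field is bound to ChineseAnalyzer, QueryParser applies the same analyzer to the query text, so a multi-character Chinese keyword may be segmented into several terms before matching. A small sketch that prints the parsed query objects (the exact segmentation depends on jieba's dictionary) illustrates this:

from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))
parser = QueryParser("content", schema=schema)

# Each parsed query shows how the keyword was tokenized for the content field.
for keyword in ("中文", "交换机", "first"):
    print(keyword, "->", parser.parse(keyword))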

Technical Details

Testing tools and configuration:

  • Whoosh search engine framework
  • Jieba ChineseAnalyzer component (see the tokenization sketch after this list)
  • Schema configuration with TEXT and ID fields
  • Temporary directory handling for index storage
  • Unicode support for mixed character sets
  • QueryParser for content field searches
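
To observe the analysis step in isolation, the analyzer can be called directly on a string; Whoosh analyzers yield Token objects whose text attribute holds each term. The sample sentence below is illustrative:

from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
# Tokenize a mixed Chinese/English string; each item is a Whoosh Token.
for token in analyzer("我的好朋友是李明, he studies English every day"):
    print(token.text)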

Best Practices Demonstrated

The test demonstrates several quality practices, including a clear separation of index setup from query execution and search validation across a broad set of keywords.

Notable practices include:
  • Systematic keyword testing across languages
  • Proper index and searcher management (a context-manager variant is sketched after this list)
  • Structured schema definition
  • Result highlighting implementation
  • Clean directory handling and management
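
The test keeps a single searcher open for the whole keyword loop. Whoosh also lets the searcher act as a context manager, which closes it deterministically; a sketch of the same search loop written that way (keyword list shortened for brevity) is:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("tmp")
parser = QueryParser("content", schema=ix.schema)

# The context manager closes the searcher when the block exits.
with ix.searcher() as searcher:
    for keyword in ("中文", "first"):
        results = searcher.search(parser.parse(keyword))
        for hit in results:
            print(hit.highlights("content"))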

fxsjy/jieba

test/test_whoosh_file_read.py

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function
import sys
import os

# Make the local jieba package (repository root) importable.
sys.path.append("../")

from whoosh.index import open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()

# Schema of the index stored in ./tmp: stored title/path fields plus a
# content field analyzed by jieba.
schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp"):
    os.mkdir("tmp")
# Open the existing index in ./tmp; it is expected to already contain documents.
ix = open_dir("tmp")

searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)

# Query the content field with a mix of Chinese and English keywords and
# print the highlighted fragments of each hit.
for keyword in ("水果小姐", "你", "first", "中文", "交换机", "交换", "少林", "乔峰"):
    print("result of", keyword)
    q = parser.parse(keyword)
    results = searcher.search(q)
    for hit in results:
        print(hit.highlights("content"))
    print("=" * 10)