
Testing Whoosh Search Integration with Chinese Text Analysis in jieba

This test suite validates Whoosh search functionality integrated with jieba's Chinese text analysis. It exercises an existing Whoosh index whose content field is analyzed by ChineseAnalyzer, issuing queries for a set of Chinese and English keywords and printing the highlighted matches.

Test Coverage Overview

The test coverage focuses on validating the Whoosh search index functionality with Jieba’s Chinese text analysis capabilities.

Key areas tested include:
  • Index directory handling (the tmp directory is created if missing and the existing index is opened from it)
  • Schema definition with TEXT and ID fields (see the index-creation sketch after this list)
  • Search functionality across multiple keywords
  • Mixed language search support (Chinese and English)
  • Content highlighting in search results
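
This read-oriented test assumes an index already exists under tmp. As a rough orientation, a minimal sketch of how such an index could be built with the same schema is shown below; the document titles, paths, and content strings are illustrative placeholders, not taken from the repository.

# Sketch: build a Whoosh index whose content field is analyzed by jieba.
# Field names mirror the test's schema; the sample documents are hypothetical.
import os
from whoosh.index import create_in
from whoosh.fields import Schema, TEXT, ID
from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))

if not os.path.exists("tmp"):
    os.mkdir("tmp")
ix = create_in("tmp", schema)  # creates (or replaces) the index in ./tmp

writer = ix.writer()
writer.add_document(title="document one", path="/a",
                    content="我们正在测试中文分词与 first example 的混合检索")
writer.add_document(title="document two", path="/b",
                    content="交换机和交换设备的使用说明")
writer.commit()  # flush the documents to disk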

Implementation Analysis

The test systematically verifies the integration of the Whoosh search engine with jieba's ChineseAnalyzer. It relies on the following key patterns:

  • Schema-based index configuration
  • Directory management for search indexes
  • QueryParser implementation for content searching (see the query-parsing sketch after this list)
  • Iterative keyword testing with result highlighting
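
Because the content field is bound to ChineseAnalyzer, QueryParser applies the same analyzer to the query text, so a multi-character Chinese keyword may be segmented into several terms before matching. A small sketch that prints the parsed query objects (the exact segmentation depends on jieba's dictionary) illustrates this:

from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
from jieba.analyse import ChineseAnalyzer

schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=ChineseAnalyzer()))
parser = QueryParser("content", schema=schema)

# Each parsed query shows how the keyword was tokenized for the content field.
for keyword in ("中文", "交换机", "first"):
    print(keyword, "->", parser.parse(keyword))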

Technical Details

Testing tools and configuration:

  • Whoosh search engine framework
  • Jieba ChineseAnalyzer component (see the tokenization sketch after this list)
  • Schema configuration with TEXT and ID fields
  • Temporary directory handling for index storage
  • Unicode support for mixed character sets
  • QueryParser for content field searches
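
To observe the analysis step in isolation, the analyzer can be called directly on a string; Whoosh analyzers yield Token objects whose text attribute holds each term. The sample sentence below is illustrative:

from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()
# Tokenize a mixed Chinese/English string; each item is a Whoosh Token.
for token in analyzer("我的好朋友是李明, he studies English every day"):
    print(token.text)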

Best Practices Demonstrated

The test demonstrates several quality practices, including a clear separation of index setup from query execution and search validation across a broad set of keywords.

Notable practices include:
  • Systematic keyword testing across languages
  • Proper index and searcher management (a context-manager variant is sketched after this list)
  • Structured schema definition
  • Result highlighting implementation
  • Clean directory handling and management
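
The test keeps a single searcher open for the whole keyword loop. Whoosh also lets the searcher act as a context manager, which closes it deterministically; a sketch of the same search loop written that way (keyword list shortened for brevity) is:

from whoosh.index import open_dir
from whoosh.qparser import QueryParser

ix = open_dir("tmp")
parser = QueryParser("content", schema=ix.schema)

# The context manager closes the searcher when the block exits.
with ix.searcher() as searcher:
    for keyword in ("中文", "first"):
        results = searcher.search(parser.parse(keyword))
        for hit in results:
            print(hit.highlights("content"))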

fxsjy/jieba

test/test_whoosh_file_read.py

# -*- coding: UTF-8 -*-
from __future__ import unicode_literals, print_function
import sys
import os

# Make the local jieba package (repository root) importable.
sys.path.append("../")

from whoosh.index import open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

from jieba.analyse import ChineseAnalyzer

analyzer = ChineseAnalyzer()

# Schema of the index stored in ./tmp: stored title/path fields plus a
# content field analyzed by jieba.
schema = Schema(title=TEXT(stored=True),
                path=ID(stored=True),
                content=TEXT(stored=True, analyzer=analyzer))
if not os.path.exists("tmp"):
    os.mkdir("tmp")
# Open the existing index in ./tmp; it is expected to already contain documents.
ix = open_dir("tmp")

searcher = ix.searcher()
parser = QueryParser("content", schema=ix.schema)

# Query the content field with a mix of Chinese and English keywords and
# print the highlighted fragments of each hit.
for keyword in ("水果小姐", "你", "first", "中文", "交换机", "交换", "少林", "乔峰"):
    print("result of", keyword)
    q = parser.parse(keyword)
    results = searcher.search(q)
    for hit in results:
        print(hit.highlights("content"))
    print("=" * 10)