A simple Chinese word segmenter

zhaozj, 2021-02-16

CLucene - A C++ Search Engine: http://sourceforge.net/projects/clucene/

Traditional full-text search is built on a database. SQL Server, Oracle and MySQL all provide full-text retrieval, but they are fairly heavyweight and not well suited to standalone or small applications (although MySQL 4.0 can be embedded), and MySQL's full-text search does not support Chinese. Later I learned that Apache has a widely used open-source full-text search engine: Lucene, Apache's Java full-text search engine. It is quite good, but unfortunately it is Java, and I had been hoping for a C or C++ version. Then one day on http://sourceforge.net I found a nice thing: CLucene! CLucene is a C++ full-text search engine, a complete port of Lucene, but it has little support for Chinese, and it has quite a few memory leaks :P

Since CLucene does not support Chinese word segmentation, I wrote a simple Chinese segmenter. The rough idea is the traditional two-character (bigram) cut. Unlike English and similar languages, where a space or punctuation mark signals the end of a word, Chinese has no word delimiters, so the bigram method simply cuts the text two characters at a time, overlapping by one. For example, "Beijing City" (北京市) is cut into "Beijing" (北京) and "jing shi" (京市). The index this produces is rather large, but the method is simple (later I will write about some other ideas for Chinese segmentation). Of course, when searching you cannot just type "北京市" and get a hit; you have to enter the same overlapping pairs, "北京 京市", to retrieve documents about Beijing City. The precision is not very high, but it is good enough for simple use, and it is not afraid of missing any words.

I wrote a ChineseTokenizer; this module is responsible for the segmentation. The main functions are shown below.
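
First, to make the overlapping two-character cut itself concrete, here is a small standalone sketch. It is my own illustration, not part of CLucene or of the tokenizer code that follows; the helper name cutBigrams is made up, and it assumes wide-character input.

    #include <locale.h>
    #include <wchar.h>
    #include <string>
    #include <vector>

    // Cut a run of Chinese characters into overlapping two-character tokens,
    // e.g. L"北京市" -> L"北京", L"京市". A single character is emitted as-is.
    std::vector<std::wstring> cutBigrams(const std::wstring& text) {
        std::vector<std::wstring> tokens;
        if (text.size() < 2) {
            if (!text.empty())
                tokens.push_back(text);
            return tokens;
        }
        for (size_t i = 0; i + 1 < text.size(); ++i)
            tokens.push_back(text.substr(i, 2));   // overlap by one character
        return tokens;
    }

    int main() {
        setlocale(LC_ALL, "");                     // so wprintf can print Chinese
        std::vector<std::wstring> tokens = cutBigrams(L"北京市");
        for (size_t i = 0; i < tokens.size(); ++i)
            wprintf(L"[%ls] ", tokens[i].c_str()); // prints [北京] [京市]
        wprintf(L"\n");
        return 0;
    }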

ChineseTokenizer.cpp:

Token* ChineseTokenizer::next() {
    while (!rd.eos()) {
        char_t ch = rd.getNext();
        if (isspace((char_t)ch) != 0) {
            continue;                                  // skip whitespace between tokens
        }
        // read for alpha-nums and Chinese characters
        if (isalnum((char_t)ch) != 0 || (((char_t)ch >> 8) && (char_t)ch >= 0xA0)) {
            start = rd.column();
            return readChinese(ch);
        }
    }
    return NULL;
}

Token* ChineseTokenizer::readChinese(const char_t prev) {
    bool isChinese = false;
    StringBuffer str;
    str.append(prev);
    char_t ch = prev;

    // the high-byte test marks a double-byte (Chinese) character
    if (((char_t)ch >> 8) && (char_t)ch >= 0xA0)
        isChinese = true;

    while (!rd.eos() && isspace((char_t)ch) == 0) {
        ch = rd.getNext();
        if (isalnum((char_t)ch) != 0 || (((char_t)ch >> 8) && (char_t)ch >= 0xA0)) {
            // for digits/English: read until the next space or the next Chinese character
            // for Chinese: take one more Chinese character, or stop at a space or at English
            if (isChinese) {
                if (((char_t)ch >> 8) && (char_t)ch >= 0xA0) {
                    // ch is also Chinese: emit the pair, then push ch back so the
                    // next token starts from it (the overlapping two-character cut)
                    str.append(ch);
                    rd.unget();
                    // wprintf(_T("[%s]"), str.getBuffer());
                    return new Token(str.getBuffer(), start, rd.column(),
                                     tokenImage[lucene::analysis::chinese::CHINESE]);
                } else {
                    // ch is a letter or digit: the Chinese token ends here
                    rd.unget();
                    // wprintf(_T("[%s]"), str.getBuffer());
                    return new Token(str.getBuffer(), start, rd.column(),
                                     tokenImage[lucene::analysis::chinese::CHINESE]);
                }
            } else {
                // the token so far is not Chinese (letters or digits)
                if (((char_t)ch >> 8) && (char_t)ch >= 0xA0) {
                    // a Chinese character ends the alpha-num token
                    // wprintf(_T("[%s]"), str.getBuffer());
                    rd.unget();
                    return new Token(str.getBuffer(), start, rd.column(),
                                     tokenImage[lucene::analysis::chinese::CHINESE]);
                }
                str.append(ch);
            }
        }
    }
    // wprintf(_T("[%s]"), str.getBuffer());
    return new Token(str.getBuffer(), start, rd.column(),
                     tokenImage[lucene::analysis::chinese::ALPHANUM]);
}
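
If it helps, a driver loop over the tokenizer might look roughly like the sketch below. This is only an assumption-laden sketch: the ChineseTokenizer constructor, the Reader type and the Token text accessor getText() are guesses on my part; only next() returning a heap-allocated Token*, as in the code above, is taken from the snippet.

    // Hypothetical driver loop; everything except next() returning Token* is assumed.
    void dumpTokens(Reader& reader) {
        ChineseTokenizer tokenizer(reader);        // assumed: the tokenizer wraps the reader as rd
        Token* t;
        while ((t = tokenizer.next()) != NULL) {
            wprintf(_T("[%s] "), t->getText());    // assumed accessor for the token text
            delete t;                              // each Token is allocated with new in next()
        }
        wprintf(_T("\n"));
    }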

Also, this segmenter does not support files; it only works on in-memory streams, because I use rd.unget(). If the input were a file, heh, it could only push back half of a (double-byte) character :P
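
To show why an in-memory stream sidesteps the half-character problem: when the buffer already holds whole wide characters, unget() only needs to step back one index. A minimal sketch of such a reader (my own illustration with assumed names, not the actual CLucene Reader) could look like this:

    #include <string>

    // Minimal in-memory reader with just the operations the tokenizer relies on.
    // Because the buffer stores whole wide characters, unget() simply moves the
    // position back one character, so no partial double-byte sequence can occur.
    class MemoryReader {
    public:
        explicit MemoryReader(const std::wstring& data) : buf(data), pos(0) {}
        bool eos() const { return pos >= buf.size(); }
        wchar_t getNext() { return buf[pos++]; }        // caller checks eos() first
        void unget() { if (pos > 0) --pos; }            // push back one full character
        int column() const { return (int)pos; }         // current position in characters
    private:
        std::wstring buf;
        std::wstring::size_type pos;
    };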

OK, I will stop here for now; I am in a bit of a rush today. Next time I will post my improvements to CLucene.

Please cite the original source when reprinting: https://www.9cbs.com/read-16510.html
