Lucene language analysis and its extension



Lucene, the full-text search engine, is quite well designed at the software architecture level. In particular, almost every part of the full-text search pipeline can be taken apart independently, letting programmers extend or replace it. We recently ran into an example that nicely illustrates this flexibility.

Anyone who has programmed with Lucene knows the major classes involved in lexical analysis and their respective roles:

org.apache.lucene.analysis.Token: the smallest unit produced by lexical analysis; it represents a single word that cannot be divided further.
org.apache.lucene.analysis.TokenStream: a stream of Token objects that lets you take out one Token at a time, iterator-style.
org.apache.lucene.analysis.Tokenizer: inherits from TokenStream; it divides a given text into a sequence of Tokens and exposes them to the outside as a TokenStream.
org.apache.lucene.analysis.TokenFilter: inherits from TokenStream; its constructor parameter is itself a TokenStream. It applies its "filtering" action to the Tokens read from the incoming stream and exposes the result to the outside as another TokenStream.
org.apache.lucene.analysis.Analyzer: constructs a TokenStream from input text. This is the class that user code faces directly; in most cases user code never needs to touch the other classes listed above.
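To make these roles concrete, here is a minimal sketch against the Lucene 1.x API used throughout this article (the class name TokenDemo, the field name "content", and the sample text are arbitrary choices of mine):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class TokenDemo
{
    public static void main(String[] args) throws Exception
    {
        // The Analyzer hides the whole Tokenizer/TokenFilter chain
        // behind a single TokenStream.
        TokenStream stream = new StandardAnalyzer()
            .tokenStream("content", new StringReader("The quick brown fox"));
        // next() returns null once the stream is exhausted.
        for (Token t = stream.next(); t != null; t = stream.next())
            System.out.println(t.termText() + " [" + t.startOffset() + "," + t.endOffset() + "]");
    }
}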

Analyzer is the most important class in the whole language analysis subsystem. Lucene ships with a number of Analyzer subclasses, each implementing a different analysis strategy; org.apache.lucene.analysis.WhitespaceAnalyzer, for example, divides text on whitespace. In most cases org.apache.lucene.analysis.standard.StandardAnalyzer is the convenient choice.

Internally, each Analyzer applies one Tokenizer and, as needed, one or more TokenFilters to filter the resulting stream (TokenStream). What it presents to the outside is the combined effect of the Tokenizer and all the TokenFilters, while hiding their very existence. In this sense, the Analyzer is the representative of the entire language analysis subsystem.

Two benefits follow from this design: (1) Since an Analyzer presents a single analysis policy by assembling a Tokenizer and TokenFilters, several Analyzers whose behavior overlaps can share the same Tokenizer or TokenFilter. For example, every Analyzer that needs to convert text to lowercase can use the same LowerCaseFilter, so the lowercasing code is reused. (2) Since the Analyzer represents the entire language analysis subsystem, swapping the Analyzer swaps the analysis behavior of the whole application; there is no need to hunt down and modify analysis logic scattered through user code.
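Benefit (2) shows up right at the indexing entry point. A minimal sketch, assuming the Lucene 1.x IndexWriter constructor (the index path "/tmp/index" and class name SwapDemo are arbitrary): change one argument and the analysis behavior of the whole application changes with it.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.index.IndexWriter;

public class SwapDemo
{
    public static void main(String[] args) throws Exception
    {
        // Swap in, say, new StandardAnalyzer() here and every field
        // indexed through this writer is analyzed differently.
        Analyzer analyzer = new WhitespaceAnalyzer();
        IndexWriter writer = new IndexWriter("/tmp/index", analyzer, true);
        writer.close();
    }
}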

The design also visibly uses the Decorator pattern. Notice that TokenFilter not only inherits from TokenStream, but its constructor parameter is itself a TokenStream: a textbook Decorator. The same design appears in java.io.FilterInputStream, which inherits from InputStream while the only parameter of its constructor is again an InputStream. Essentially, a Decorator mimics its parent class (so it shares the parent's interface), but the real work is delegated to the object, of the same parent type, that was passed into its constructor, with the Decorator's own behavior layered on top. Decorators are well suited to dynamic assembly, and TokenFilter exploits exactly this. Let's look at the tokenStream() method of org.apache.lucene.analysis.standard.StandardAnalyzer:

TokenStream result = new StandardTokenizer(reader);
result = new StandardFilter(result);
result = new LowerCaseFilter(result);
result = new StopFilter(result, stopSet);

The tokenStream() method is how every Analyzer provides a token stream to the outside world. The biggest differences between Analyzers lie almost entirely here, because this method usually reveals which Tokenizer and TokenFilters the Analyzer assembles. So we can see that StandardAnalyzer is a StandardTokenizer wrapped in StandardFilter, LowerCaseFilter, StopFilter, and so on. Notice that constructing StandardTokenizer gives the program the most primitive TokenStream, which is then fed into the first TokenFilter. You can picture this as plumbing: at first we have one short water pipe, and water flows in one end and out the other. Then we attach a second short pipe that dyes the water, so when the water reaches the new outlet its color has changed. The program keeps connecting pipe segments in the same fashion, and in the end we get the combined effect of all four segments. Here you can see the shadow of the Decorator: every TokenFilter is a TokenStream, accepts a TokenStream object as input (the source TokenStream), and transforms what that input produces.
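The java.io parallel mentioned above assembles in exactly the same pipe-by-pipe fashion; a minimal sketch (the file name and class name are arbitrary):

import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

public class PipeDemo
{
    public static void main(String[] args) throws Exception
    {
        InputStream in = new FileInputStream("data.bin");  // the original water pipe
        in = new BufferedInputStream(in);                  // decorator: adds buffering
        DataInputStream data = new DataInputStream(in);    // decorator: adds typed reads
        data.close();
    }
}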

There are three levels at which you can customize analysis behavior. The first, and simplest, is a custom Analyzer: you do little more than choose which Tokenizer and TokenFilters to assemble in tokenStream(). The next is a custom TokenFilter, in which you can choose to discard Tokens from the incoming TokenStream or modify their content. The hardest is a custom Tokenizer, for which you must devise the way text is divided into words.

The application we ran into recently is this: when our text contains a URL, for example www.yahoo.com.tw, we want a search for yahoo to find it as well. This fails with StandardAnalyzer, however. StandardTokenizer is generated by JavaCC, which produces StandardTokenizer's source code from the grammar in the StandardTokenizer.jj file, and that file contains a production of this form:

| <HOST: <ALPHANUM> ("." <ALPHANUM>)+ >

So www.yahoo.com.tw matches this production and is treated as a single word (if you don't know JavaCC, just take this conclusion on faith). Once it is treated as a single word, the whole of www.yahoo.com.tw becomes one indexed term when the index is built, and a subsequent search for yahoo naturally finds nothing.
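A few lines are enough to confirm this behavior (a hypothetical check of mine; the field name "f" and class name HostDemo are arbitrary):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class HostDemo
{
    public static void main(String[] args) throws Exception
    {
        TokenStream ts = new StandardAnalyzer()
            .tokenStream("f", new StringReader("www.yahoo.com.tw"));
        // Prints the whole URL as a single token: www.yahoo.com.tw
        System.out.println(ts.next().termText());
    }
}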

There are two ways I could solve this problem. First, I could modify a copy of the .jj file into a custom .jj file and generate another Tokenizer from it. Or, I could write my own TokenFilter that inspects each Token and, if it is of the HOST type, subdivides it.

I prefer the latter, because once this TokenFilter is written it can also be handed to other, different Analyzers in the future, so the code can be mixed, matched, and reused.

So we wrote the following TokenFilter:

import java.io.IOException;
import java.util.Vector;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class UrlPlusFilter extends TokenFilter
{
    // Holds the pieces of a dismantled Token until they are handed out
    private Vector vTokenBuffer = new Vector();

    public Token next() throws IOException
    {
        // Hand out buffered pieces first, if any remain
        if (vTokenBuffer.size() > 0)
            return (Token) vTokenBuffer.remove(0);

        Token token = input.next();
        if (token == null)
            return null;

        String text = token.termText();
        int p = 0;
        int q = text.indexOf('.');
        if (q < 0)          // no dot: pass the Token through untouched
            return token;

        // Split the text on '.', preserving each piece's offsets
        while (p < (text.length() - 1))
        {
            if (q < 0)
            {
                vTokenBuffer.add(new Token(text.substring(p),
                    token.startOffset() + p, token.startOffset() + text.length()));
                break;
            }
            else
                vTokenBuffer.add(new Token(text.substring(p, q),
                    token.startOffset() + p, token.startOffset() + q));
            p = q + 1;
            q = text.indexOf('.', p + 1);
        }
        return next();      // recurse to pop the first buffered piece
    }

    public UrlPlusFilter(TokenStream input)
    {
        super(input);
    }
}

The most important responsibility of this TokenFilter lies in its next() method. The filter works as follows: the parent class's constructor remembers the incoming source TokenStream (the TokenFilter.input field). next() basically just returns input.next(). The exception is when the Token returned by input.next() has text containing ".", in which case that Token is dismantled further. Since dismantling one Token yields several Tokens, the pieces are stored in a buffer, and every call to next() first checks whether the buffer still holds a Token: if so, it is returned directly; if not, input.next() pulls the next Token from the source TokenStream. In this way, whenever the source TokenStream returns a Token containing ".", it is dismantled into several Tokens that are passed back one at a time.
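A quick way to sanity-check the filter in isolation (a hypothetical test of mine; any Tokenizer can serve as the source, WhitespaceTokenizer is just the simplest):

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class FilterDemo
{
    public static void main(String[] args) throws Exception
    {
        TokenStream ts = new UrlPlusFilter(
            new WhitespaceTokenizer(new StringReader("visit www.yahoo.com.tw")));
        // Prints: visit, www, yahoo, com, tw
        for (Token t = ts.next(); t != null; t = ts.next())
            System.out.println(t.termText());
    }
}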

Finally, all that remains is to write an Analyzer that picks and assembles the Tokenizer and filters:

import java.io.Reader;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class UrlPlusAnalyzer extends Analyzer
{
    public static final String[] STOP_WORDS;
    public static final String[] STOP_WORDS1 = StopAnalyzer.ENGLISH_STOP_WORDS;
    public static final String[] STOP_WORDS2 = { "http", "www", "com", "net" };

    private Set stopSet;

    // Merge the standard English stop words with the URL-specific ones
    static
    {
        STOP_WORDS = new String[STOP_WORDS1.length + STOP_WORDS2.length];
        for (int i = 0; i < STOP_WORDS1.length; i++)
            STOP_WORDS[i] = STOP_WORDS1[i];
        for (int i = 0; i < STOP_WORDS2.length; i++)
            STOP_WORDS[STOP_WORDS1.length + i] = STOP_WORDS2[i];
    }

    public UrlPlusAnalyzer()
    {
        this(STOP_WORDS);
    }

    public UrlPlusAnalyzer(String[] stopWords)
    {
        stopSet = StopFilter.makeStopSet(stopWords);
    }

    public TokenStream tokenStream(String fieldName, Reader reader)
    {
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new UrlPlusFilter(result);
        result = new StopFilter(result, stopSet);
        return result;
    }
}

This UrlPlusAnalyzer imitates StandardAnalyzer, with two differences. One is an extra set of stop words. Stop words are fed to StopFilter as input and represent words we do not want indexed. For example, "www", "com", and "net" in URLs occur far too often and carry little substantive meaning, so they should be added to the stop words. The other difference is the set of TokenFilters assembled in tokenStream(): every analyzed Token now passes through UrlPlusFilter's inspection, and any Token in dotted URL form is subdivided.
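Putting it all together, a hypothetical check (field name "f" and class name AnalyzerDemo are arbitrary) shows UrlPlusAnalyzer reducing the URL to its meaningful pieces:

import java.io.StringReader;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

public class AnalyzerDemo
{
    public static void main(String[] args) throws Exception
    {
        TokenStream ts = new UrlPlusAnalyzer()
            .tokenStream("f", new StringReader("www.yahoo.com.tw"));
        // Prints yahoo and tw; www and com are removed as stop words
        for (Token t = ts.next(); t != null; t = ts.next())
            System.out.println(t.termText());
    }
}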

Note that UrlPlusFilter must do its work before StopFilter does; this ordering is essential. If UrlPlusFilter were placed after StopFilter, the entire URL would still be a single word when StopFilter sees it, and since the whole URL matches neither "com" nor "net" in the stop words, nothing would be filtered out. UrlPlusFilter must therefore come first, splitting out these words so that StopFilter can then filter them.
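To see the difference concretely, here is the broken ordering as a hypothetical counter-example (only the relevant excerpt of tokenStream() is shown):

// Wrong order: StopFilter runs while the URL is still a single token.
result = new StopFilter(result, stopSet);  // "www.yahoo.com.tw" matches no stop word, passes through
result = new UrlPlusFilter(result);        // splits too late: "www" and "com" reach the index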

As this article shows, Lucene lets you change its language analysis behavior in a variety of ways, and such changes can be made without modifying the original Lucene classes, merely by extending them, which demonstrates the flexibility and extensibility of its architecture. Through this example, we hope readers take away not just how Lucene's language analysis works, nor merely how to extend Lucene, but more importantly, after observing an architecture like this, how to learn to design software with the same kind of flexibility and extensibility.

Posted by Qing September 29, 2004 10:17 PM


