Primary desktop search system implementation (Small Dream Search V0.5 Released)


Emilmatthew (Emilmatthew@126.com)

Summary:

A basic desktop search system is implemented, which currently supports querying a single string composed of ASCII letters. The TF-IDF method is used to determine the weight of the keyword in each related document.

Keywords: IR, TF_IDF, Desktop Search

The Implementation of a Basic Desktop Search Engine

(Small Dream Search V0.5 Released)

Emilmatthew (Emilmatthew@126.com)

Abstract:

In this paper, I describe the implementation of a basic desktop search engine which I have just finished recently. This engine, called "Small Dream Search", currently supports queries for a single string as a keyword. I use the TF-IDF algorithm to determine the keyword's weight in each related document.

Key Words: IR, TF_IDF, Desktop Search

1 Background introduction:

In the Internet era, search has become an important way to access knowledge online, and some famous search engines have become synonymous with search itself. Although we should not over-rely on search engines to obtain knowledge, they have created a new avenue for people to do so. From a programming perspective, designing a compact desktop search system is undoubtedly a very good opportunity to go from theory to practice.

2 Overall design of the program:

This desktop search system is mainly composed of two parts: the host end (the crawler) and the client query end (Client). I introduce them separately:

2.1 Host (crawler)

The host is mainly responsible for scanning all files under the specified path, counting the frequency of each word in each file, and building the corresponding database (stored in the form of inverted files).

2.2 Client (Client)

The client mainly performs the query work: when a query keyword is typed, if the database contains an inverted file for that word, all the information related to the word can be retrieved, and the results are finally presented as a web page of file paths and summaries.

3 Detailed design of the program:

Let's talk about some of the key points when designing:

3.1 Host (crawler)

The host's run framework is like this:

for each file in searchDir (including its subdirs)
{
    if the file's suffix is legal then
        parse the file and save its info to the DB
    end if
}

In general, it is nothing more than a traversal of all files under the specified path (including files in subdirectories), plus scanning each individual file and recording its information into the library.

3.1.1 Traverse all files under the specified path:

Recursive traversal is used here, which is less efficient than a non-recursive version.

parseDir(dirStruct curDir)
{
    for each file or subdir in curDir
    {
        if it is a subdir
            parseDir(subdir)
        else    // it is a file
            parseSingleFile(fileName)
    }
}

Here are the functions required for the implementation; they are among the more important functions of the C run-time library, together with the structure _finddata_t:

struct _finddata_t {
    unsigned    attrib;
    time_t      time_create;    /* -1 for FAT file systems */
    time_t      time_access;    /* -1 for FAT file systems */
    time_t      time_write;
    _fsize_t    size;
    char        name[260];
};

long _findfirst(const char *, struct _finddata_t *);   // get a search handle for the first file or directory matching the pattern under the current folder
int _findnext(long, struct _finddata_t *);             // find the next file or directory matching the previous search
int _findclose(long);                                  // close the search handle so as not to leak resources

In addition, to determine whether an entry is a directory, the following test can be employed:

if (tempFileInfo.attrib & _A_SUBDIR)  // tempFileInfo is the second parameter passed to _findfirst or _findnext

Every directory also contains the two entries "." (itself) and ".." (its parent); if they are not wanted, they should be skipped.

In addition, you also need a function that changes the current directory:

int _chdir(const char *dirname);
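Since the _findfirst family is Windows-specific, the same recursive parseDir framework can be sketched portably with the POSIX &lt;dirent.h&gt; API instead. This is only an illustrative sketch: parse_single_file is a stub standing in for the real per-file scanner, and the names are hypothetical.

```c
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

static int file_count = 0;   /* number of regular files seen so far */

/* Stub standing in for the real per-file scanner described in 3.1.2. */
void parse_single_file(const char *path) {
    file_count++;
}

/* Recursive directory walk, mirroring the parseDir framework above. */
void parse_dir(const char *dir) {
    DIR *d = opendir(dir);
    if (d == NULL)
        return;
    struct dirent *e;
    char path[1024];
    while ((e = readdir(d)) != NULL) {
        /* skip the "." and ".." entries mentioned in the text */
        if (strcmp(e->d_name, ".") == 0 || strcmp(e->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof path, "%s/%s", dir, e->d_name);
        struct stat st;
        if (stat(path, &st) != 0)
            continue;
        if (S_ISDIR(st.st_mode))
            parse_dir(path);            /* it is a subdir: recurse */
        else
            parse_single_file(path);    /* it is a file: scan it */
    }
    closedir(d);
}
```

The suffix check from the run framework would go just before the parse_single_file call.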

3.1.2 Traverse a single file, count the frequency of each word, and record it into the corresponding inverted file

3.1.2.1 Inverted files

First, let me introduce what an inverted file is. Usually, when recording word-frequency statistics for a file, the following form can be used:

someFile:
    str1 freq1
    str2 freq2
    ...

This is the so-called forward form, which uses the file name as the index. And the inverted form? It uses the word as the index and stores, as the content, the names of the files in which the word appears together with its frequency in each, as follows:

str1:
    relatedFile1 freq1
    relatedFile2 freq2
    ...

Of course, relatedFile1 could be encoded, for example with the MD5 algorithm, to save storage space. Currently it is simply saved as the original path string.

3.1.2.2 Scan files, extract the qualifying strings, and collect statistics

Here, three pieces of work are involved: first, extracting the words from the file; second, performing a legality check; and finally, collecting the statistics.

[a] Extracting the words from the file:

This can be regarded as a filtering job:

The main frame is as follows:

while (file not at end)
{
    do
    {
        get a new char → tmpChar;
        if the file reached its end
            goto out1;
    } while (tmpChar is not an ASCII letter);

    while (tmpChar is an ASCII letter)
    {
        if tmpChar is a capital letter, change it to lower case;
        append tmpChar to newString;
        get a new char → tmpChar;
        if the file reached its end
            break;
    }

    terminate newString;
}
out1:
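The filter loop above can be sketched in C roughly as follows; next_word is a hypothetical helper name, not a function from the original program.

```c
#include <stdio.h>
#include <ctype.h>

/* Extract the next alphabetic word from fp into buf, folded to lower case.
   Returns 1 on success, 0 at end of file.  buf must hold at least max bytes;
   longer words are truncated (the legality check rejects them anyway). */
int next_word(FILE *fp, char *buf, int max) {
    int c, len = 0;
    /* skip characters until an ASCII letter is found */
    while ((c = fgetc(fp)) != EOF && !isalpha(c))
        ;
    if (c == EOF)
        return 0;               /* file ended before any letter: no word */
    /* collect consecutive letters, changing capitals to lower case */
    while (c != EOF && isalpha(c)) {
        if (len < max - 1)
            buf[len++] = (char)tolower(c);
        c = fgetc(fp);
    }
    buf[len] = '\0';            /* make the new string's ending */
    return 1;
}
```

Calling next_word repeatedly until it returns 0 yields every word of the file in order.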

[b] Legality check:

The main legality checks are: the string is too short (only one character), the string is too long (a single word of more than 30 letters), and the filtering of strings with no search meaning (stop words such as "a", "the", "what", "be", etc.). This part is quite simple, so I omit the details.

[c] Counting the word's frequency in the current file and recording it into the library:

Counting the word frequencies is straightforward; here I focus on the operation of writing them into the library:

For word storage, a hash function is used to convert the ASCII code values of each word into an integer value, and the inverted files are stored in folders in units of 10,000 files. For example, one folder stores the inverted files for the words whose hash codes are 0-9999, the next folder stores those with hash codes 10000-19999, and so on.
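The article only says that the ASCII codes of the word are converted to an integer, without giving the exact hash function, so the sketch below uses a common multiplicative scheme as a placeholder; the total hash range HASH_SPACE is likewise an assumption, while the 10,000-files-per-folder grouping comes from the text.

```c
#include <stdio.h>

#define FILES_PER_FOLDER 10000   /* inverted files per folder, as in the text */
#define HASH_SPACE 100000        /* assumed total hash range (not given in the article) */

/* Hypothetical word hash: mixes the ASCII codes of the word into an integer. */
unsigned word_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % HASH_SPACE;
}

/* Folder holding the inverted file for a given hash code: folder 0 covers
   codes 0-9999, folder 1 covers 10000-19999, and so on. */
unsigned folder_index(unsigned hash) {
    return hash / FILES_PER_FOLDER;
}
```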

So, when a word goes into the library, the following framework is used to find the right inverted-file location for it:

flag = 0
do
{
    if (the invert file named hash(curString) has not been created)
    {
        create the invert file and save the first file address and curString's freq to it
        flag = 1
    }
    else
    {
        if (the invert file named hash(curString) exists && it corresponds to curString)
        {
            append the file address and curString's freq to the invert file
            flag = 1
        }
        else
        {
            use the hashValue = (hashValue + 1) % maxLen strategy to probe for the string's file
        }
    }
} while (!flag)

Since each word needs its own file, this causes many small IO operations, so there is still large room for improvement.
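The collision strategy above is ordinary linear probing. Here is a toy in-memory version where a slot plays the role of an inverted file; the real engine probes file names on disk instead, and TABLE_SIZE stands in for the engine's much larger maxLen.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 16   /* tiny table; the engine's maxLen is much larger */

static char *slots[TABLE_SIZE];   /* slot i plays the role of inverted file i */

static unsigned toy_hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Find (or claim) the slot for word, probing linearly on collision,
   like hashValue = (hashValue + 1) % maxLen in the framework above. */
int probe_slot(const char *word) {
    unsigned h = toy_hash(word);
    while (slots[h] != NULL && strcmp(slots[h], word) != 0)
        h = (h + 1) % TABLE_SIZE;       /* occupied by another word: step on */
    if (slots[h] == NULL) {             /* the "create the invert file" case */
        slots[h] = malloc(strlen(word) + 1);
        strcpy(slots[h], word);
    }
    return (int)h;                      /* otherwise the "append" case */
}
```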

3.2 Client query (Client)

Compared to the host end, the client's work is more relaxed: it mainly hashes the input keyword, finds the corresponding inverted file, extracts the file-path information, applies the corresponding weight-calculation policy to sort the results, and finally presents them in the form of a summary page.

Here I introduce the currently implemented weight-calculation strategy, the TF-IDF algorithm.

The so-called IDF reflects the discriminating power of a word in search. For example, if a word appears in only a few files, then when this keyword is searched, the documents related to it stand out; if a word appears in many files, it is hard to get a well-distinguished result when searching for it, because the relevant documents are drowned in the crowd.

So we can define idfi = log(n / ni), where n is the total number of files in the library, and ni is the number of files containing string i.

fi,j is the frequency of string i in document j, normalized by the maximum term frequency in that document: fi,j = freqi,j / maxl freql,j // this step is processed by the host

The weight of string i with respect to document j is then:

wi,j = fi,j * idfi

Another weight calculation with a stronger distinguishing effect:

wi,j = (0.5 + 0.5 * fi,j) * log(n / ni)

Sorting the weight values in descending order yields the final ordering of the related documents.

4 Description of the parameters in the program configuration file:

4.1 Host:

fargus.txt

gSupChs = 0 // whether Chinese is supported; 0 means not supported
gSearchDB = f:/testsearch/dsearchdb // the library folder storing the inverted files; you need to create this folder yourself
gGlobalVal = f:/testsearch/global // the folder where the global variables are stored; you need to create it yourself
gMode = 0 // mode: 0 means the search path is read from a file, 1 means it is read from the screen
gInputDbDir = f:/testsearch/searchadd // the folder to be searched; spaces are not supported, and please do not add a trailing /, even when scanning a whole drive such as C:

felewords.txt

Stores the words that need to be filtered out (stop words):

a

the

...

fsuffx.txt

Stores the acceptable file suffixes:

suf1

suf2

...

4.2 Client:

gSearchDB = f:/testsearch/dsearchdb // same as the host

gGlobalVal = f:/testsearch/globalval // same as the host

gHtmlHome = f:/testsearch/htmlhome // the folder storing the search-result pages; it also holds the engine icon flag.jpg

gAlgoType = 1 // algorithm selection; 1: TF_IDF

gcMode = 0 // mode: same as the host

gcSearchWord = apple // the search keyword

If related files are found, the following search-results interface will appear:

You are welcome to try it out and debug it, and to offer criticism and suggestions for improvement!

5 Improvements planned for version 1.0:

1 Change the file traversal into a non-recursive form.

2 Support multiple keywords.

3 Support Chinese paths (but not yet Chinese file contents).

4 Improve the file IO when writing into the library, to speed up crawling.

5 Adopt a multi-threaded mode to speed up crawling.

6 Handle files that are scanned repeatedly.

7 Support multiple search paths.

8 Enhance the reliability of the system: for example, when a scan is interrupted, guarantee that the files and related data in the library are affected as little as possible.

9 Recognize file names that contain more than one ".".

10 Accept paths containing spaces in the configuration files.

11 Compress the paths in the library files with MD5 or a similar algorithm.


Completion date:

06/06/20

Appendix:

1 Test program download:

http://emilmatthew.51.net/emilpapers/0623dreamsearch/code.rar

If clicking the link does not start the download (or open the page), copy the hyperlink into the browser's address bar (MyIE or GreenBrowser are recommended) and press Enter. Barring accidents, the download should then begin.

If there is a problem in the download, please refer to:

http://blog.9cbs.net/emilmatthew/archive/2006/04/08/655612.aspx
