Intelligent Information Processing System Kernel Implementation


Cheng Jun

1 Introduction: The State of Intelligent Information Processing

At present, new service models such as intelligent agents, intelligent retrieval, information mining, information push, information navigation, and knowledge discovery all share one core problem: how to classify information automatically. There are two main approaches to automatic classification. One is based on the vector space model (VSM) theory proposed by the American professor G. Salton; the other is an agent-based hypertext classification method founded on Adaptive Resonance Theory (ART). Most online classification systems today use the VSM approach with a word list as the variables of the vector space. Because the Chinese vocabulary is enormous, generally tens of thousands of words, the main problem with this method is that it is slow.

The kernel described in this paper also uses the VSM approach, but it exploits a characteristic of Chinese: only the Chinese characters appearing in a file serve as the variables of the vector space. It also uses a two-level index, consisting of a file index and a character index. The goal is to achieve faster retrieval while maintaining recall and precision. The test data were drawn from the million-book digitization effort of the China Digital Library Project: files from 220 books covering all 22 top-level categories of the Chinese Library Classification, 24,518 files of about 3 KB each, 75 MB in total. Building the index takes approximately 2 minutes 16 seconds, and queries respond within seconds. Another feature is that the index files are small, generally only about 1/6 the size of the original files. The test environment was a Pentium 1.6 GHz PC; running speed depends mainly on CPU computation speed.

The kernel consists of two parts. One part creates the index: it builds index files from the original files for later retrieval. The other part is matching search: using the created index files and a new file chosen by the user, it finds the original files similar to the new file.

2 Creating the Index

Character frequency strongly affects the computation of file correlation, so we must first find the characters that occur with high frequency in all kinds of files and exclude them. Then, for all of the selected original files, three indexes are built in order: the file number index, the file index, and the character index.

2.1 The Character Frequency Table

By collecting statistics over 83,456,990 characters of text from the electronic edition of the People's Daily, we found that, apart from a few individual characters, character frequencies follow fairly regular patterns. Many of the high-frequency characters are function characters whose frequency is very high in every kind of text; since they contribute nothing to the classification and retrieval of text, they are removed when the text is preprocessed. The low-frequency characters do not appear in most files at all; to improve retrieval speed they are removed as well, namely the roughly 2,600 characters whose frequency falls below 0.0001%.
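As a concrete illustration of this statistics pass, here is a minimal sketch in C. It assumes GB2312-encoded input and uses the two-byte test described later in section 2.3 (with an extra upper-bound guard on the first byte, which the paper's test leaves implicit); the file name and the reporting format are illustrative assumptions, not part of the original system.

    #include <stdio.h>

    #define ROWS 72            /* GB2312 first byte 0xB0..0xF7  */
    #define COLS 95            /* GB2312 second byte from 0xA1  */

    static long freq[ROWS][COLS];
    static long total;

    /* Scan one file, counting GB2312 characters with the byte test of
     * section 2.3: first byte > 175 (0xAF), second byte > 160 (0xA0). */
    static void count_file(const char *path)
    {
        int c1, c2;
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return;
        while ((c1 = fgetc(fp)) != EOF) {
            if (c1 > 0xAF && c1 < 0xF8 &&
                (c2 = fgetc(fp)) != EOF && c2 > 0xA0) {
                freq[c1 - 0xB0][c2 - 0xA1]++;
                total++;
            }
        }
        fclose(fp);
    }

    int main(void)
    {
        /* "corpus.txt" is a stand-in for the People's Daily text. */
        count_file("corpus.txt");

        /* Report the characters below the 0.0001% cutoff, i.e. the
         * candidates for removal. The hand-picked list of high-frequency
         * function characters is not reproduced here. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                if (freq[i][j] > 0 &&
                    (double)freq[i][j] / (double)total < 0.000001)
                    printf("drop %c%c (count %ld)\n",
                           0xB0 + i, 0xA1 + j, freq[i][j]);
        return 0;
    }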

2.2 Creating the File Number Index

This index is very simple: it records the paths of all files included in the database. The user selects a directory, and the paths of all web-page files under it are saved. Its structure is:

    File path:   A  B  C  ...
    File number: 1  2  3  ...

Given a file number, the corresponding file can be located and opened quickly; this index is used both when building the other indexes and during retrieval.

2.3 Creating the File Index

For each web-page file, count the number of occurrences of every character, find the 20 characters with the highest frequency, and record each character together with its count in the file index. These 20 characters constitute the file's vector space; during relevance ranking, the cosine of the angle between the new file's space vector and a candidate file's is computed, and that cosine value determines their correlation. The file index has the structure:

    File 1: character A1 count, character A2 count, ..., character A20 count
    File 2: character B1 count, character B2 count, ..., character B20 count
    File 3: character C1 count, character C2 count, ..., character C20 count
    ...

When building the indexes, the file index costs the most CPU time, because every character must be read, its occurrences counted, and the counts sorted to find the top 20. The author exploits the contiguity of the GB2312 encoding (in a two-byte character, the first byte's code is greater than 175 and the second byte's greater than 160), which hashes each Chinese character directly to an array slot. A single array, int count[72][95];, can hold the occurrence counts of every character in the standard GB2312 character set, and counting one character requires only two subtractions and an increment: count[0x** - 0xB0][0x** - 0xA1]++;, where the 0x** stand for the character's two bytes. Sorting uses bubble sort. Although it is the least efficient of the common sorting algorithms, only the 20 most frequent characters are needed here, so it becomes an efficient choice: no matter how large the file, 20 passes complete the job (see the first sketch following section 3).

2.4 Generating the Character Index

The character index inverts the records of the file index: for each character, it lists the files in which that character appears. Arranged character by character, these records form the character index (see the second sketch following section 3). This index is used in the rough screening described below to find the original files that share characters with the new file. Its structure is:

    Character 1: file A1, file A2, ...   (character 1 appears in files A1, A2, ...)
    Character 2: file B1, file B2, ...
    Character 3: file C1, file C2, ...
    ...

3 Retrieval

Retrieval is divided into two steps: rough screening, then relevance ranking. Rough screening finds the files that may be similar to the new file; relevance ranking then computes the cosine of the vector-space angle between the new file and each remaining candidate, quantifying the correlation between the two. This two-level retrieval greatly reduces the amount of computation at query time. The multiplications and divisions of the vector calculation are slow: if the correlation of 1,000,000 files had to be computed directly, a response time on the order of seconds could not be guaranteed. Rough screening reduces the candidates to about 1/1000 of the original files, so with 1,000,000 originals only about 1,000 remain after screening; computing the cosine for just these files keeps the retrieval response within seconds.
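The following sketch makes the counting and partial-sort step of section 2.3 concrete, built around the count[72][95] array the paper itself gives. It assumes GB2312 input as above; the struct layout, the function name top_chars, and the upper-bound byte guard are illustrative assumptions.

    #include <stdio.h>
    #include <string.h>

    #define ROWS 72
    #define COLS 95
    #define TOPN 20

    struct entry { unsigned char b1, b2; int n; };  /* char bytes + count */

    /* Fill ent[] with the TOPN most frequent characters of one file;
     * returns how many entries were produced. */
    static int top_chars(const char *path, struct entry ent[TOPN])
    {
        static int count[ROWS][COLS];      /* the count[72][95] of 2.3 */
        static struct entry all[ROWS * COLS];
        int c1, c2, m = 0;
        FILE *fp = fopen(path, "rb");
        if (!fp)
            return 0;

        memset(count, 0, sizeof count);
        while ((c1 = fgetc(fp)) != EOF)
            if (c1 > 0xAF && c1 < 0xF8 &&             /* first byte  */
                (c2 = fgetc(fp)) != EOF && c2 > 0xA0)  /* second byte */
                count[c1 - 0xB0][c2 - 0xA1]++;  /* two subtractions   */
        fclose(fp);

        /* Collect the nonzero counts, then run TOPN bubble passes: each
         * pass floats the next-largest count to the front, so 20 passes
         * yield the top 20 without sorting the whole array. */
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                if (count[i][j]) {
                    all[m].b1 = (unsigned char)(0xB0 + i);
                    all[m].b2 = (unsigned char)(0xA1 + j);
                    all[m].n  = count[i][j];
                    m++;
                }
        for (int pass = 0; pass < TOPN && pass < m; pass++)
            for (int k = m - 1; k > pass; k--)
                if (all[k].n > all[k - 1].n) {
                    struct entry t = all[k - 1];
                    all[k - 1] = all[k];
                    all[k] = t;
                }

        if (m > TOPN)
            m = TOPN;
        for (int k = 0; k < m; k++)
            ent[k] = all[k];
        return m;
    }

Each bubble pass moves the largest remaining count to the front of the unsorted region, which is why 20 passes suffice regardless of file size: the cost is about 20 comparisons per distinct character rather than a full sort.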
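And a minimal sketch of the character index of section 2.4, built by inverting the per-file top-20 entries and reusing struct entry and COLS from the previous sketch. The in-memory postings lists with on-demand growth are an assumption for illustration; the paper describes only the index's logical layout.

    #include <stdlib.h>

    #define NCHARS (72 * 95)

    /* One postings list per possible GB2312 character: the numbers of
     * the files (per the file number index of 2.2) whose top-20 set
     * contains that character. */
    struct postings { int n, cap; int *file; };
    static struct postings inv[NCHARS];

    /* Register one file's top-20 characters in the inverted index. */
    static void index_file(int file_no, const struct entry *ent, int k)
    {
        for (int i = 0; i < k; i++) {
            struct postings *p =
                &inv[(ent[i].b1 - 0xB0) * COLS + (ent[i].b2 - 0xA1)];
            if (p->n == p->cap) {          /* grow the list on demand */
                p->cap = p->cap ? 2 * p->cap : 16;
                p->file = realloc(p->file, p->cap * sizeof *p->file);
                if (!p->file)
                    abort();
            }
            p->file[p->n++] = file_no;
        }
    }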
The two steps are described in detail below.

3.1 Rough Screening

Since the vector-space calculation requires a large amount of computation, first roughly screening out the files that could possibly match greatly reduces that computation and improves the matching speed of the program.
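To tie the two stages together, here is a hedged sketch of the retrieval path, reusing struct entry, struct postings, inv, and COLS from the sketches above. The vote threshold, the candidate buffer, and the MAXFILES bound are illustrative parameters; the paper does not specify how many shared characters a file must have to survive screening.

    #include <math.h>
    #include <string.h>

    #define MAXFILES 1000000
    static unsigned char votes[MAXFILES];  /* shared chars per file */

    /* Cosine of the angle between two 20-character frequency vectors. */
    static double cosine(const struct entry *a, int na,
                         const struct entry *b, int nb)
    {
        double dot = 0.0, la = 0.0, lb = 0.0;
        for (int i = 0; i < na; i++) {
            la += (double)a[i].n * a[i].n;
            for (int j = 0; j < nb; j++)
                if (a[i].b1 == b[j].b1 && a[i].b2 == b[j].b2)
                    dot += (double)a[i].n * b[j].n;  /* shared dim */
        }
        for (int j = 0; j < nb; j++)
            lb += (double)b[j].n * b[j].n;
        return (la > 0 && lb > 0) ? dot / (sqrt(la) * sqrt(lb)) : 0.0;
    }

    /* Rough screening: each of the query's top-20 characters votes,
     * through its postings list, for the files that also contain it;
     * files reaching `threshold` shared characters become candidates
     * for the expensive cosine step. */
    static int screen(const struct entry *q, int nq, int threshold,
                      int *cand, int maxcand)
    {
        int ncand = 0;
        memset(votes, 0, sizeof votes);    /* reset between queries */
        for (int i = 0; i < nq; i++) {
            const struct postings *p =
                &inv[(q[i].b1 - 0xB0) * COLS + (q[i].b2 - 0xA1)];
            for (int k = 0; k < p->n; k++)
                if (++votes[p->file[k]] == threshold && ncand < maxcand)
                    cand[ncand++] = p->file[k];
        }
        return ncand;
    }

Only the files returned by screen() reach cosine(), which is what confines the slow multiplications and divisions of the vector computation to roughly a thousand candidates instead of a million files.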

