A Spelling Corrector Based on Similarity Analysis
Wu Wei 1, Zhou Wenjun 2, Wang Lisheng 1
1 Department of Computer Science, Tongji University; 2 School of Software, Tongji University; 1239 Siping Road, Shanghai 200092
wweric2000@163.com, zhou_wenjun@21cn.com, liswang@online.sh.cn
Abstract: This paper applies similarity analysis to English spelling correction. When users query a search engine, spelling mistakes in the query words often prevent them from retrieving the information they need. This paper attempts to correct such spelling errors through similarity analysis, taking the number of elementary operations [1] between two words as the similarity measure and finally returning the three words most similar to the user's input. Experiments show that the method described here achieves a relatively high correction accuracy.
Key words: spelling correction; similarity; Levenshtein distance; clustering
Introduction
The search engine is the tool web users turn to most frequently, but today's mainstream search engines retrieve mainly by keyword, so the correctness of the keywords the user enters determines the accuracy of the retrieval results. Most of the information on the Internet is published in English, and when non-native speakers query English information, the probability of spelling errors is considerable.
This paper attempts to determine the correct form of the word the user intended by analyzing the similarity between words. String distance, namely the number of elementary operations between two strings, is used as the basis of the similarity measure. Current spell checkers generally retrieve by exact word matching, which often causes words that share the same root to be treated as two entirely different words because of changes in tense or inflection. This paper instead clusters words belonging to the same root by similarity: with an appropriately tuned minimum similarity threshold, the inflected forms produced by a common stem fall into the same cluster; within a cluster the words are then sorted by similarity in descending order, and the first three are returned to the user as candidate words.
The rest of the paper is organized as follows. Section 1 introduces common methods of similarity-based clustering; Section 2 presents the Levenshtein distance algorithm [4] used here to compute string distance; Section 3 describes the design of the spelling corrector and some improvements; Section 4 gives experimental results for the corrector and discusses some open issues; the final section is the conclusion.
1 Cluster Analysis Based on Similarity
The purpose of clustering is to group data objects into classes or clusters so that objects within a cluster are highly similar to one another while objects in different clusters differ. Choosing a good similarity measure is a tricky problem. This paper takes the number of elementary operations [1] between two strings as the basis of the similarity measure: the fewer the operations, the more similar the two strings. There is, of course, the problem of setting a threshold, for example discarding string pairs whose similarity is below 0.5. This threshold should not be static, but should be set dynamically according to string length: when the strings are long the threshold can be raised appropriately, and when they are short it can be lowered. A single-character error carries different weight in strings of different lengths: one wrong character in a 10-character string is clearly a milder error than one wrong character in a string of only 4 characters.
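The length-dependent threshold idea can be sketched as follows; the cut-off values used here are illustrative assumptions, not the paper's empirically tuned settings:

```python
def similarity_threshold(length: int) -> float:
    """Minimum similarity required for two strings of the given length to
    be clustered together.  Longer strings tolerate proportionally fewer
    errors, so the threshold rises with length.
    (Illustrative cut-offs; the paper tunes these empirically.)"""
    if length <= 4:
        return 0.5   # one error in a 4-char word already costs 0.25 similarity
    elif length <= 8:
        return 0.6
    else:
        return 0.8   # a 10-char word with one error still keeps similarity 0.9
```

The point is simply that the bar is raised for long words, where a single-character slip barely dents the similarity, and lowered for short words, where the same slip is much more damaging.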
To calculate the similarity between strings, we first calculate the distance between them. Several common distance measures, also widely used in cluster analysis, are introduced below [5]:
1) The most commonly used distance metric is the Euclidean distance, defined as:

d(i, j) = sqrt((x_i1 - x_j1)^2 + (x_i2 - x_j2)^2 + ... + (x_ip - x_jp)^2)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects. In string comparison, a string can be regarded as a p-dimensional data object, where p is the number of characters in the string.

2) Another well-known metric is the Manhattan distance, defined as:

d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|

3) The Minkowski distance is a generalization of the Euclidean and Manhattan distances, defined as:

d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

where q is a positive integer. With q = 1 it reduces to the Manhattan distance, and with q = 2 to the Euclidean distance.
4) The Levenshtein distance measures the number of elementary operations between a source string and a target string; this is the algorithm selected in this paper for computing string distance, because the Levenshtein distance can compare strings of different lengths, whereas the first three methods can only compare strings of equal length. Section 2 gives a brief introduction to the Levenshtein distance.
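To illustrate how the first three measures relate, a string of length p can be treated as a vector of its p character codes (a toy encoding assumed here purely for the example; the corrector itself uses the Levenshtein distance precisely because real queries have unequal lengths):

```python
def minkowski(s: str, t: str, q: int) -> float:
    """Minkowski distance between two equal-length strings, treating each
    string as a p-dimensional vector of character codes."""
    assert len(s) == len(t), "these measures require equal-length strings"
    return sum(abs(ord(a) - ord(b)) ** q for a, b in zip(s, t)) ** (1.0 / q)

# q = 1 gives the Manhattan distance, q = 2 the Euclidean distance.
d_manhattan = minkowski("test", "text", 1)
d_euclidean = minkowski("test", "text", 2)
```

Note the `assert`: none of these three measures is even defined when the strings differ in length, which is exactly the limitation that motivates the Levenshtein distance below.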
2 Levenshtein Distance
The Levenshtein distance is named after the Russian scientist Vladimir Levenshtein, who devised the algorithm for computing it in 1965. The Levenshtein distance is also used in fields such as speech recognition and DNA analysis.
The Levenshtein distance is defined as the number of elementary operations [1] required to transform a source string into a target string. An elementary operation is a basic edit that converts one string toward another: inserting one character, deleting one character, or replacing one character. For example, transforming "move" into "movie" requires one operation, inserting the character "i", so the Levenshtein distance between "move" and "movie" is 1. The larger the Levenshtein distance, the greater the difference between the two strings.
Computing string distance with the Levenshtein distance has advantages the other methods lack:
1) It can compute the distance between strings of different lengths, while the Euclidean, Manhattan, and Minkowski distances can only compare strings of equal length;
2) Elementary operations reflect human spelling errors more faithfully.
However, the Levenshtein distance algorithm also has the drawback of higher computational overhead, caused mainly by the matrix computation.
The algorithm for computing the Levenshtein distance is as follows [4]:
Step 1: Set n to the number of characters in s and m to the number of characters in t. If n = 0, return m and exit; if m = 0, return n and exit. Construct a matrix with rows 0..m and columns 0..n.
Step 2: Initialize the first row to 0..n and the first column to 0..m.
Step 3: Examine each character of s (i from 1 to n).
Step 4: Examine each character of t (j from 1 to m).
Step 5: If s[i] equals t[j], the cost is 0; if s[i] does not equal t[j], the cost is 1.
Step 6: Set cell d[i, j] of the matrix to the minimum of: a. the cell immediately above plus 1: d[i-1, j] + 1; b. the cell immediately to the left plus 1: d[i, j-1] + 1; c. the cell diagonally above and to the left plus the cost: d[i-1, j-1] + cost.
Step 7: After the iteration completes, the distance between s and t is found in cell d[n, m].
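The seven steps above translate directly into the standard dynamic-programming implementation; a minimal Python sketch follows (the paper's own implementation is in C#):

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and replacements that turn s into t."""
    n, m = len(s), len(t)
    if n == 0:                        # step 1: trivial cases
        return m
    if m == 0:
        return n
    # steps 1-2: build the matrix and initialize first row and column
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    # steps 3-6: fill in the matrix
    for i in range(1, n + 1):         # each character of s
        for j in range(1, m + 1):     # each character of t
            cost = 0 if s[i - 1] == t[j - 1] else 1       # step 5
            d[i][j] = min(d[i - 1][j] + 1,                # deletion
                          d[i][j - 1] + 1,                # insertion
                          d[i - 1][j - 1] + cost)         # replacement
    return d[n][m]                    # step 7

# The paper's example: "move" -> "movie" needs one insertion.
print(levenshtein("move", "movie"))  # -> 1
```

The nested loops over the matrix are also the source of the computational overhead mentioned above: the cost is proportional to the product of the two string lengths.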
3 Spelling Corrector Based on Similarity Analysis
The core of the spelling corrector designed here is similarity computation, and the core of the similarity computation is the Levenshtein distance. Similarity is computed with the following formula [2]:

Similarity(s, t) = 1 - EditDistance(s, t) / max(Wn(s), Wn(t))

where EditDistance is computed with the Levenshtein distance algorithm and Wn() is the number of characters in a word.
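A sketch of this similarity computation in Python, assuming the normalized form Similarity = 1 - EditDistance / max length (the paper's implementation is in C#):

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the usual row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (a != b)))  # replacement
        prev = curr
    return prev[-1]

def similarity(s: str, t: str) -> float:
    """Similarity(s, t) = 1 - EditDistance(s, t) / max(Wn(s), Wn(t)),
    where Wn() is the character count of a word."""
    return 1.0 - edit_distance(s, t) / max(len(s), len(t))

print(similarity("move", "movie"))  # 1 - 1/5 = 0.8
```

With this normalization the similarity always lies in [0, 1], which is what makes a fixed numeric threshold such as 0.5 meaningful across word pairs.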
The corrector first pre-processes the raw input entered by the user. Pre-processing includes removing stop words [2] from the input string, such as "a", "an", "the", and "this", and extracting the primary word [1]. In general, stop words carry no meaning for a query; the query rests mainly on the primary words. After pre-processing, a matching range is determined according to the length of the (pre-processed) string. The main purpose is to reduce computational overhead: a 10-character string, for example, need not be matched against strings shorter than 6 characters. The width of the range is set according to the probability of human error; since spelling errors generally involve no more than 2 characters, a 10-character string is matched only against dictionary words of length 8 to 12. Once the matching range is determined, the similarity between the source word and each target word is computed, and finally the three words with the highest similarity are returned to the user as candidates. The specific flow is shown in Figure 1. The similarity threshold is adjusted dynamically for different lengths: the longer the string, the higher the threshold. Only words whose similarity is greater than or equal to the threshold can become candidates, and the three most similar among them are returned as the candidate words.
Figure 1. Corrector workflow
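The workflow in Figure 1 can be sketched end to end in Python; the stop-word list, the +/- 2-character matching window, the fixed 0.5 threshold, and the tiny dictionary below are simplified stand-ins for the paper's tuned settings and CET-4 word list:

```python
STOP_WORDS = {"a", "an", "the", "this"}          # sample stop words [2]
DICTIONARY = ["move", "movie", "model", "modem", "test", "rest", "tent"]

def edit_distance(s, t):
    """Levenshtein distance, row-by-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def correct(query, max_candidates=3, window=2, threshold=0.5):
    """Return up to three candidate corrections per misspelled word."""
    suggestions = {}
    for word in query.lower().split():
        if word in STOP_WORDS:        # pre-processing: drop stop words
            continue
        if word in DICTIONARY:        # exact match: already correct
            continue
        # matching range: dictionary words within +/- window characters
        pool = [w for w in DICTIONARY
                if abs(len(w) - len(word)) <= window]
        scored = [(1 - edit_distance(word, w) / max(len(word), len(w)), w)
                  for w in pool]
        scored = [(sim, w) for sim, w in scored if sim >= threshold]
        scored.sort(reverse=True)     # highest similarity first
        suggestions[word] = [w for _, w in scored[:max_candidates]]
    return suggestions

print(correct("the movve"))
```

On this toy dictionary, "the" is dropped as a stop word and "movve" yields "move" and "movie" (both at similarity 0.8), mirroring the pipeline of pre-processing, length windowing, thresholding, and top-3 ranking.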
The organization of the dictionary is also critical to the corrector's performance. This paper chose the college English CET-4 word list (from "Jinshan Word 2002") as the corrector's dictionary. Since the matching range is determined by word length, the dictionary is partitioned into classes by word length, and each class is stored in its own dictionary file.
The corrector was developed on Microsoft's ASP.NET platform and written in C#. The program interface is shown in Figure 2.
Figure 2. The spelling corrector
The corrector can be embedded in a search engine as its pre-processing stage. In the center of the interface are the input box and search button, similar to a common search-engine interface; a selection bar offers some common search engines, and a prompt bar shows candidate words for the user's reference; the numbers at the upper left of the interface are the similarities of the three candidate words. If the word entered by the user is judged to be a correct word, it is searched for directly. If no exact match is found, up to three candidate correct words are offered, and the user can click the intended word to search for it.
4 Corrector Accuracy Tests
To measure the corrector's accuracy precisely, a series of tests was carried out: one correction test for single-character errors and one for double-character errors. The tests used 130 of the most commonly used words, selected from the CET-4 word list in "Jinshan Word 2002"; the error types are inserting a character, deleting a character, and replacing a character. The data for the first experiment were words with one error each; the data for the second were words with two errors each. To produce the test data, we designed an error generator that randomly inserts, deletes, and replaces characters to generate erroneous forms of the 130 correct words. Because the errors are program-generated, they differ somewhat from human errors, and some of the generated words contain mistakes people would rarely make. Even under these harsh conditions the corrector performed well. In the single-error case, of the 130 erroneous words the corrector correctly corrected 128: 121 as the first candidate (see Chart 1), 6 as the second, and 1 as the third, for an accuracy of 98.5%. The two words it failed to correct happened to coincide with other correct words, so the corrector treated them as already correct; there were no true misjudgments. The accuracy for single-character errors is thus quite high. Detailed results are shown in Chart 1.
Chart 1. Single-error correction results. "First (second/third) candidate" means the correct word appeared in the first (second/third) of the three candidate positions; these count as correct corrections. "Length" denotes cases where the erroneous word's length differs from the original word's by 2 characters, so the correct word falls outside the matching range and cannot be corrected. "Word" denotes cases where the erroneous word happens to be another correct word. Words not among the candidates are not listed separately. The length, word, and misjudgment cases count as errors.
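The error generator described in the experiments can be sketched as follows; this is a simplified Python reconstruction under stated assumptions, not the authors' actual program:

```python
import random

def make_error(word, n_errors=1, seed=None):
    """Corrupt a correct word with n_errors random elementary operations
    (insert, delete or replace one character), mimicking the test-data
    generator.  A seed makes the corruption reproducible."""
    rng = random.Random(seed)
    letters = "abcdefghijklmnopqrstuvwxyz"
    for _ in range(n_errors):
        op = rng.choice(("insert", "delete", "replace"))
        # insertion may also target the position just past the last char
        pos = rng.randrange(len(word) + (op == "insert"))
        if op == "insert":
            word = word[:pos] + rng.choice(letters) + word[pos:]
        elif op == "delete" and len(word) > 1:
            word = word[:pos] + word[pos + 1:]
        else:  # replace (also the fallback if deletion would empty the word)
            word = word[:pos] + rng.choice(letters) + word[pos + 1:]
    return word
```

Each elementary operation changes the word's length by at most one, which is exactly why double insertions or deletions can push a word outside the +/- 1 matching window discussed in the double-error results below.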
In the double-error case the correction accuracy is not as high as for single errors. Of the 130 erroneous words, the corrector correctly corrected 94: 75 as the first candidate, 14 as the second, and 5 as the third, for an accuracy of 72.3%. Most of the failures were caused by the length restriction, and 6 were true misjudgments. Detailed results are shown in Chart 2. The main cause of this result is that the corrector only matches words whose length differs from the input by at most one character; for a 6-character input, for example, it matches only words of length 5 to 7. When both errors are insertions or both are deletions, the erroneous word's length differs from the correct word's by 2 characters, so the correct word falls outside the matching range and the corrector cannot list it. Enlarging the matching range would raise the accuracy, but it would also raise the computational overhead; finding the best trade-off between matching range and overhead is a problem still to be explored.
Chart 2. Double-error correction results. Legend as in Chart 1.
4.1 Discussion of Some Problems
Several problems in correcting spelling errors by similarity analysis remain open. First, the authors believe the three candidate words should be processed further after they are computed. The experimental results above show that many candidates carry the same similarity value, which fails to express how differently each resembles the original word. This is a consequence of the distance measure: for example, the Levenshtein distance between "test" and "rest" is 1, and the Levenshtein distance between "test" and "tent" is also 1; but since "test" and "rest" differ in the first letter, which affects similarity more strongly, "tent" should be regarded as the more similar word. This is really a question of how to choose the distance and similarity measures: the distance measure should truly reflect the difference between two strings, and the similarity measure should capture the fine distinctions between them. In fact, the Levenshtein distance could be weighted: errors at different positions, and in strings of different lengths, should carry different weights; the shorter the string and the further left the character, the greater the weight of an error. This is a topic for the next stage of research. Second, there is the problem of setting the similarity threshold: above what similarity should two strings be considered comparable? The threshold should be adjusted dynamically for words of different lengths, and also tuned from experience. Improving correction accuracy is the corrector's goal. If the corrector were combined with the search engine's query clustering [2], accuracy could in theory be improved further, because every web user has specific interests and tends to visit pages related to them.
By analyzing a user's previous query keywords, one can infer the direction of his interests, and this can serve as an important basis for narrowing the range of candidate words during correction, greatly improving accuracy. In effect this adds a semantic component: the semantics of the input word are inferred from the user's interests, and from them the word he intended, which amounts to intelligent correction. The authors believe this is the direction in which correctors will develop, and it is the goal we are working toward.

5 Conclusion
The spelling corrector is an important component of a search engine. This paper matches the correct word closest to the original word by similarity analysis, and the experiments show that the accuracy of this method is satisfactory. Since the test errors were generated randomly by a program, some of them do not match human spelling habits, so the accuracy in practice should be higher still. The corrector does not yet take the positions of the letters into account, nor does it use semantic information, so there is room to improve its accuracy; this is where the corrector most needs refinement. How to combine query clustering to correct errors intelligently will be the subject of our further research.
References:
[1] Vladimir Cherkassky, Nikolaos Vassilas and Gregory L. Brodt. Conventional and Associative Memory-based Spelling Checkers. IEEE, 1990.
[2] Ji-Rong Wen and Hong-Jiang Zhang. Query Clustering in the Web Context. Kluwer Academic Publishers, 2002.
[3] M.-W. Du and S. C. Chang. An Approach to Designing Very Fast Approximate String Matching Algorithms. IEEE Transactions on Knowledge and Data Engineering, Vol. 6, No. 4, August 1994.
[4] Michael Gilleland. Levenshtein Distance, in Three Flavors. http://www.merriampark.com/ld.htm
[5] Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques. Machinery Industry Press, 2001.
About the authors:
Wu Wei, male, master's candidate; research interests: data warehousing, data mining, Web mining.
Zhou Wenjun, male, lecturer; research interests: data warehousing, massive data mining, Web mining.
Wang Lisheng, male, associate professor and master's supervisor; research interests: data mining, business intelligence (BI), embedded systems.
Contact information for the first author:
No. 1239, Siping Road, Shanghai, China. Postal code: 200092
Email: wweric2000@163.com
Tel: 021-65975130