Foundations of Automatic Text Classification: The Term Frequency Calculation Method



It is said that the number of documents on the Internet has grown past a million. Growth on this scale means Google may take a month or even longer to visit your website again, so if you optimize your page today, you will have to wait a while to see Google's response. This is truly an age of information explosion. When the Internet was young, people found the information they needed through directory navigation, and Yahoo seized that opportunity and succeeded. Later, as the Internet became more popular, the explosion of information made directory navigation ineffective; Google seized the new opportunity, proposed a search algorithm that let people find information again, and Google succeeded. Yet just as the Internet has not made us discard the newspaper, the directory navigation mechanism still plays a role. Observing Google's launch of personalized search, one can see that in order to make search results more relevant, Google encourages users to search within predefined channels. In other words, the directory classification mechanism behind search still exists, but it no longer faces end users directly; instead it faces the search engine, which classifies documents automatically.

There are many ways to classify documents automatically. This article introduces the term frequency calculation method.

The basic idea of the vector space model is to view a document as a vector whose components are the frequencies of the words (terms) that appear in it. To reduce the noise in this representation, the words need to be processed through the following steps:

1. Segment the document and extract all the words (terms) it contains;
2. Eliminate meaningless words (terms), i.e. stop words, such as the Chinese equivalents of "is", "the", and so on;
3. Count the frequency of each word (term);
4. As needed, filter out the terms with the highest and lowest frequencies (similar to dropping the highest and lowest scores in a talent contest);
5. Assuming some number of distinct words remain after the steps above, assign each of them a unique tag.

From this point on, the steps differ from algorithm to algorithm, but they share one common feature: they all rely on the weight of each word (term), and that weight depends directly on the frequencies with which the term appears. Because we want to analyze thousands of documents, the frequency of a term within a single document does not tell the whole story, so the weight must take multiple documents into account. Let us abstract the problem: suppose the documents to be processed form a collection D of objects, and we need to decide which documents belong to a subset A (a classification). The weight of a word should then include the following three parts:

1. The frequency of the word within the current document, which determines its importance in that document;
2. A factor for the length of the document;
3. The frequency with which the term appears across all documents, which determines its importance in the collection as a whole.
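The steps above, together with the three weighting factors, can be sketched in Python. The stop-word list and the sample documents are illustrative assumptions, and the weighting shown is the standard tf-idf scheme, one common way to combine in-document frequency, document length, and collection-wide document frequency:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "is", "a", "of"}  # illustrative stop-word list (step 2)

def preprocess(text):
    """Steps 1-3: split a document into terms, drop stop words, count frequencies."""
    terms = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(terms)

def tfidf_weights(documents):
    """Weight each term by the three factors described above:
    1. its frequency in the current document,
    2. the length of that document (used for normalization),
    3. how many documents in the collection contain it (inverse document frequency)."""
    counts = [preprocess(doc) for doc in documents]
    n = len(documents)
    df = Counter()  # document frequency: in how many documents each term appears
    for c in counts:
        df.update(c.keys())
    weights = []
    for c in counts:
        length = sum(c.values())
        weights.append({
            term: (freq / length) * math.log(n / df[term])
            for term, freq in c.items()
        })
    return weights

docs = ["the search engine indexes the web",
        "a web directory is curated by hand",
        "search queries hit the index of the engine"]
for w in tfidf_weights(docs):
    print(sorted(w, key=w.get, reverse=True)[:2])  # top-weighted terms per document
```

A term such as "web", which appears in most of the sample documents, receives a small idf factor and thus a low weight, while terms unique to one document are boosted.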

The more accurately the term frequencies are obtained, and the better the statistical methods applied on top of them, the more accurately the documents can be classified.
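Once every document is reduced to a weighted term vector, deciding whether a document belongs to the subset A can be done, for example, by comparing it against documents already known to be in A. Cosine similarity is one common statistical choice; the threshold and the sample vectors below are illustrative assumptions, not part of the original method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as term->weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def belongs_to_a(doc_vec, class_vecs, threshold=0.2):
    """Assign the document to class A if it is similar enough
    to at least one known member of A."""
    return any(cosine(doc_vec, v) >= threshold for v in class_vecs)

# Hypothetical weighted term vectors for two documents known to be in class A.
a_members = [{"search": 0.4, "engine": 0.3}, {"index": 0.5, "query": 0.2}]
candidate = {"search": 0.3, "query": 0.1}
print(belongs_to_a(candidate, a_members))  # → True
```

Using dicts keeps the vectors sparse: terms absent from a document simply contribute zero to the dot product, which matters when the vocabulary runs to thousands of terms.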