Origins of text mining
Text databases (Web document data)
Semi-structured data
Information retrieval (IR)
Web text mining process
General processing flow of Web text mining: document set → establishing text features → reducing the feature set → extraction of learning and knowledge patterns → knowledge patterns → model quality evaluation
Establishing text features
Definition: text features are metadata that describe a text.
Classification:
Descriptive features: the name, date, size, type, etc. of the text.
Semantic features: the author, title, institution, content, etc. of the text.
Representation (document modeling):
Vector space model (VSM) (matrix representation)
Feature vector: V(D) = (T1, W1(D); T2, W2(D); …; Tn, Wn(D)), where Ti is a term and Wi(D) is the weight of Ti in document D.
Mathematical representation of text feature evaluation functions
Information Gain
Expected Cross Entropy
Mutual Information
F is the feature corresponding to word w;
P(w) is the probability that word w appears;
P(Ci) is the probability of class Ci;
P(Ci|w) is the conditional probability of class Ci given that word w appears.
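The standard forms of these evaluation functions, which the slides reference but do not reproduce, can be reconstructed from the symbol definitions above (a hedged reconstruction, not copied from the original):

\[ IG(w) = -\sum_i P(C_i)\log P(C_i) + P(w)\sum_i P(C_i\mid w)\log P(C_i\mid w) + P(\bar{w})\sum_i P(C_i\mid \bar{w})\log P(C_i\mid \bar{w}) \]

\[ CrossEntropy(w) = P(w)\sum_i P(C_i\mid w)\log\frac{P(C_i\mid w)}{P(C_i)} \]

\[ MI(w, C_i) = \log\frac{P(C_i\mid w)}{P(C_i)} \quad \text{(averaged or maximized over the classes to score } w\text{)} \]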
Mathematical representation of text feature evaluation functions (continued)
Weight of Evidence for Text
Word Frequency
P(w) is the probability that word w appears;
P(Ci) is the probability of class Ci;
P(Ci|w) is the conditional probability of class Ci given that word w appears;
TF(w) is the number of times word w appears in the document set.
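The usual forms of these two functions, again reconstructed rather than copied from the slides, are:

\[ WeightOfEvidence(w) = P(w)\sum_i P(C_i)\left|\log\frac{P(C_i\mid w)\,(1-P(C_i))}{P(C_i)\,(1-P(C_i\mid w))}\right| \]

\[ TF(w) = \text{number of occurrences of } w \text{ in the document set} \]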
Document modeling
Word frequency matrix
Rows correspond to keywords t and columns to documents d. Each document is treated as a vector v in this space; the entry values reflect the strength of association between word t and document d (the word frequency of t in d).
Document similarity calculation
Cosine measure
Cosine similarity definition: omitted in the original slides ("略"); a minimal computation sketch is given below.
Disadvantage: the number of dimensions (terms) is unbounded, so the matrix grows very large and the amount of computation increases.
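A minimal sketch of the cosine measure over term-frequency vectors (toy example, not taken from the slides; the helper name cosine_similarity is ours):

```python
# Cosine similarity between two documents represented as word-frequency vectors.
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two documents."""
    va, vb = Counter(doc_a.split()), Counter(doc_b.split())
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("data mining of web text", "text mining for web data"))
```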
Reducing the feature set
The Latent Semantic Indexing (LSI) method applies the Singular Value Decomposition (SVD) technique from matrix theory to the word-frequency matrix, keeping only a K × K matrix of the largest singular values.
Basic steps of the latent semantic indexing method (a code sketch follows the list):
1. Build the word-frequency matrix (frequency matrix).
2. Compute the SVD of the frequency matrix:
decompose it into three matrices U, S, V, where U and V are orthogonal (UᵀU = I) and S is a diagonal matrix of singular values; keep only the K largest singular values (a K × K diagonal block).
3. For each document d, replace its original vector with a new vector that omits the terms eliminated by the SVD.
4. Store all the vectors and index them with advanced multidimensional indexing techniques.
5. Compute similarities using the transformed document vectors.
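A minimal sketch of the LSI steps above, assuming a small toy term-document matrix and using numpy's SVD:

```python
# LSI sketch: truncated SVD of a toy term-document matrix.
import numpy as np

# Term-document frequency matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

K = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reduced representation of each document in the K-dimensional latent space.
doc_vectors = np.diag(s[:K]) @ Vt[:K, :]        # shape (K, n_docs)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity between document 0 and document 2 in the latent space.
print(cosine(doc_vectors[:, 0], doc_vectors[:, 2]))
```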
Other text retrieval indexing techniques
Inverted index
An index structure consisting of two hash-indexed or B+-tree-indexed tables: a term table and a document table.
Find all documents related to a given term.
Find all terms related to a given document.
Easy to implement, but it cannot handle synonyms or polysemous words, and the posting lists can become very long, requiring a large amount of storage. (A code sketch follows the example tables below.)
Term table (term_table): each row pairs a term_id with its posting_list of documents, e.g.
Term_1: Doc_1, …, Doc_i
Term_2: Doc_1, …, Doc_j
…
Term_n: Doc_1, …, Doc_n
Document table (document_table): each row pairs a doc_id with its posting_list of terms, e.g.
Doc_1: t1_1, …, t1_n
Doc_2: t2_1, …, t2_n
…
Doc_n: tn_1, …, tn_n
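A minimal in-memory sketch of the two tables (an assumption: plain Python dictionaries stand in for the hash-table or B+-tree indexed files):

```python
# Toy inverted index: a term table and a document table built from a few documents.
from collections import defaultdict

documents = {
    "doc_1": "web text mining process",
    "doc_2": "text feature extraction and reduction",
    "doc_3": "web document retrieval",
}

term_table = defaultdict(set)      # term   -> posting list of documents
document_table = defaultdict(set)  # doc_id -> posting list of terms

for doc_id, text in documents.items():
    for term in text.split():
        term_table[term].add(doc_id)
        document_table[doc_id].add(term)

print(sorted(term_table["text"]))       # all documents containing "text"
print(sorted(document_table["doc_1"]))  # all terms occurring in doc_1
```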
Word class tagging (part-of-speech tagging)
Definition: determine, from the context, the word class of each word in a sentence.
Classification of class-ambiguous words:
Words of the same form serving as several classes, e.g., 'lead/leader' (verb/noun).
Words of the same form across classes with different senses, e.g., the slide's example glossed as 'hour' (measure word/noun).
Words of different form with the same meaning, e.g., 'computer' (计算机 / 电脑).
Automatic word class tagging assigns a word class (part of speech) to each word in a text.
Natural languages such as English and Chinese contain large numbers of class-ambiguous words, which makes automatic tagging difficult; resolving word-class ambiguity is therefore the key problem in automatic part-of-speech tagging research.
Main technical routes: based on probability statistics and based on rules.
Automatic word class tagging
Research on automatic part-of-speech tagging of English text began abroad as early as the 1960s; several methods for resolving word-class ambiguity were proposed and a number of automatic tagging systems were built.
In 1971, Greene and Rubin of Brown University built the TAGGIT system, which used 3,300 context frame rules to resolve ambiguity; its tagging accuracy reached 77%.
In 1983, Leech and Garside built the CLAWS system, which used probability statistics for automatic tagging; using a 133 × 133 matrix of tag co-occurrence probabilities and a statistical model to resolve class ambiguity, its tagging accuracy reached 96%.
In 1988, DeRose improved the CLAWS system, using a linear optimization method to reduce its complexity; the resulting VOLSUNGA algorithm greatly improved processing efficiency and brought the accuracy of automatic tagging to a practical level.
The CLAWS algorithm (based on probability statistics)
CLAWS stands for Constituent-Likelihood Automatic Word-tagging System. It was proposed by Marshall in 1983 as the automatic tagging algorithm for the LOB corpus (a corpus of British English covering many text types, with a capacity of one million words). The approach is as follows:
First, select some already-tagged text from the LOB corpus as the training set. From the tagged words in the training set, use a computer to count how often each pair of adjacent tags co-occurs, forming a matrix of adjacent-tag co-occurrence probabilities.
During automatic tagging, the system repeatedly takes a string of limited length from the input text such that the words at its head and tail have unique (unambiguous) tags. Such a word string is called a span, written W0, W1, W2, …, Wn, Wn+1, where W0 and Wn+1 are unambiguous words and W1, W2, …, Wn are class-ambiguous words. Using the data in the co-occurrence probability matrix, the system computes the probability of every possible tag sequence for the span and outputs the maximum-probability sequence as the chosen path. (A toy sketch of this scoring follows.)
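A hedged sketch of the span-scoring idea with toy tag-bigram probabilities (not the real LOB statistics; the table `trans` and the tag set are invented for illustration):

```python
# Score every possible tag path over a span and keep the most probable one.
from itertools import product

# Hypothetical adjacent-tag co-occurrence probabilities P(tag_j | tag_i).
trans = {
    ("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}

def best_path(span_tags):
    """span_tags: list of candidate-tag sets; the first and last must be unambiguous."""
    best, best_p = None, 0.0
    for path in product(*span_tags):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= trans.get((a, b), 1e-6)
        if p > best_p:
            best, best_p = path, p
    return best, best_p

# A span whose two middle words are ambiguous (noun or verb), with unambiguous ends.
print(best_path([{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}, {"NOUN"}]))
```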
The VOLSUNGA algorithm
VOLSUNGA improves on the CLAWS algorithm mainly in two respects. For finding the optimal path, instead of computing the probability of every tag sequence and then taking the maximum, it moves from left to right using a "step-by-step best choice" strategy: for the word currently under consideration it keeps only the best path so far and discards the others, then moves on to the next word, matches all of its candidate tags, again keeps only the best path and discards the rest, and so on until the whole span has been processed; the remaining path is output as the optimal path of the span.
Second, the relative frequency of each tag for each word is estimated from the corpus, and these relative tag probabilities are used to assist the selection of the best path.
The VOLSUNGA algorithm greatly reduces the time and space complexity of the CLAWS algorithm and improves the accuracy of automatic tagging. (A toy sketch of the greedy strategy follows.)
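A hedged sketch of the step-by-step (greedy) strategy, again with invented toy probabilities:

```python
# Keep only the single best partial path at each step of the span.
def greedy_path(span_tags, trans):
    """span_tags: candidate-tag sets per word; trans: P(tag_j | tag_i) lookup."""
    path, score = [next(iter(span_tags[0]))], 1.0
    for candidates in span_tags[1:]:
        prev = path[-1]
        tag = max(candidates, key=lambda t: trans.get((prev, t), 1e-6))
        score *= trans.get((prev, tag), 1e-6)
        path.append(tag)
    return path, score

toy_trans = {("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.1,
             ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
             ("VERB", "NOUN"): 0.4}
print(greedy_path([{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}, {"NOUN"}], toy_trans))
```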
Shortcomings of the statistical approach
Both the CLAWS and VOLSUNGA algorithms are statistics-based tagging methods that choose tags by co-occurrence probability. However, the most probable tag is only the most likely possibility; a decision made purely by probability may keep a wrong tag and discard the correct but less probable one.
To improve tagging accuracy, the statistical approach must therefore be supplemented by rule-based methods that apply linguistic rules.
Rule-based tagging
The rule-based approach resolves ambiguity by considering how the surrounding words and tags constrain the word class of the current word. It is usually used as a supplement to probability-/statistics-based methods; the combination of statistical and rule-based methods is considered the best way to solve the ambiguity problem.
When the tagged corpus is large, association-rule mining can be applied: given a minimum support and a minimum confidence, first find the frequent pattern sets whose support exceeds the minimum support, then generate association rules from them; a rule whose confidence exceeds the given minimum confidence becomes a tagging rule. As long as the minimum confidence is set high enough, the resulting rules can be used to handle the ambiguous cases. (Because the rules involve many combinations of words and tags, the mining process is fairly complex.)
Rule-based tagging (continued)
The word class is determined mainly from the context.
Example: "白纸" ('white paper') — "白" appears before the noun "纸", so it is tagged as an adjective.
Example: "白跑" ('run in vain') — "白" appears before the verb "跑", so it is tagged as an adverb.
Coordination rule: in a coordinate structure, the two conjoined components should have the same word class; if one of them is unambiguous and the other is class-ambiguous, the ambiguous word can be given the class of the unambiguous one.
I have read several articles and reports.
"Articles" is a noun and unambiguous; "reports" can be a verb or a noun; since the two occur in a coordinate structure, "reports" can be tagged as a noun.
Huang Changning of the Department of Computer Science at Tsinghua University used statistical methods to build an automatic part-of-speech tagging system; its tagging accuracy is 96.8% and its tagging speed is 175 Chinese characters per second.
Automatic semantic tagging
Because words are polysemous, automatic semantic tagging is mainly concerned with disambiguating polysemous words.
Polysemy is a common phenomenon in natural language; however, in a given context a word generally takes only one sense.
So-called automatic semantic tagging means that the computer examines a polysemous word appearing in a particular context, determines its correct sense, and tags it. Methods for automatic semantic tagging:
Word senses
A word may carry several senses (word = sense … sense).
Determine the sense of a polysemous word by retrieving related words that appear in its context,
using the degree of affinity between words (the slide's example: 'pen').
Determine the sense of a polysemous word from its contextual relations
(the slide's example: 'plan').
Disambiguation by the most probable sense:
that is, always choose the sense in which the polysemous word is most frequently used in texts. This is obviously not a scientific method, but it still achieves a certain accuracy.
According to statistics, disambiguating with the most probable sense yields an accuracy of only 67.5% on closed texts, and an even lower 64.8% on open texts.
At present many machine translation systems use this most-probable-sense approach to determine the senses of polysemous words, which is one of the main reasons for their poor translation quality.
Other text retrieval indexing techniques (continued)
Signature file
Definition: a file that stores a signature record for each document in the database.
Method: each document corresponds to a fixed-length bit string (its signature), with one bit per word; if the word corresponding to a bit appears in the document, that bit is set to 1, otherwise it is 0.
Example signatures S1 and S2 (figure omitted in the original).
Matching is performed with bit operations on the signatures to judge how similar documents are.
Multiple words can be mapped to the same bit to shorten the bit string, but this increases the search overhead and introduces the drawbacks of a many-to-one mapping (possible false matches). (A code sketch follows.)
(Details omitted.)
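A minimal sketch of a signature file with one bit per vocabulary word and matching by bitwise AND (the vocabulary and documents are invented for illustration):

```python
# Toy signature file: fixed-length bit strings, one bit per vocabulary word.
VOCAB = ["web", "text", "mining", "retrieval", "index", "cluster"]
BIT = {w: i for i, w in enumerate(VOCAB)}

def signature(text: str) -> int:
    """Bit string (stored in an int): bit i is set iff VOCAB[i] occurs in the text."""
    sig = 0
    for word in text.split():
        if word in BIT:
            sig |= 1 << BIT[word]
    return sig

docs = {"doc_1": "web text mining", "doc_2": "text retrieval and index structures"}
query_sig = signature("text mining")

for doc_id, text in docs.items():
    doc_sig = signature(text)
    # The document may match if every query bit is also set in its signature.
    print(doc_id, (doc_sig & query_sig) == query_sig)
```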
Extraction of learning and knowledge patterns
Structure diagram of a Chinese text mining model (components shown in the figure): text source; text structure analyzer; word segmentation and stop-word removal; feature extraction; name recognition; date processing; number processing; text classifier; text summary generator; user interface; user browsing; retrieval results.
Extraction of learning and knowledge patterns (continued)
Word segmentation
Definition: inserting spaces between the words of a Chinese text.
Stop words
Definition: words that play only an auxiliary role in a text.
Classification:
Function words: "a, the, of, for, with, in, at, …" in English; "的, 地, 得, …" in Chinese.
Content words: in the papers of a database conference, the word "database" itself is treated as a stop word.
Stemming
compute, computes, and computed are treated as inflected forms of the same word.
Automatic word segmentation
Uses of automatic word segmentation:
Automatic retrieval, filtering, classification, and summarization of Chinese text
Automatic proofreading of Chinese text
Chinese-to-foreign machine translation
Post-processing for Chinese character recognition and Chinese speech recognition
Chinese speech synthesis
Whole-sentence Chinese character keyboard input
Conversion between simplified and traditional Chinese characters
Main word segmentation methods
Maximum Matching Method (MM): take a string of 6-8 Chinese characters as the maximum candidate string and match it against the entries in the dictionary; if it does not match, drop one character and match again, until the corresponding word is found in the dictionary. Matching proceeds forward through the text, from left to right. (A code sketch follows this list.)
Reverse Maximum Matching Method (RMM): the matching direction is opposite to that of the MM method, i.e., from right to left. Experiments show that, for Chinese, the reverse maximum matching method is more effective than the maximum matching method.
Bidirectional Matching Method (BM): compare the segmentation results of the MM and RMM methods to determine the correct segmentation.
Optimum Matching Method (OM): order the dictionary entries by their frequency in the text, putting high-frequency words first and low-frequency words later, which speeds up matching. Association-Backtracking Method (AB): perform segmentation with the help of association and backtracking.
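A minimal sketch of forward maximum matching with a toy dictionary (the dictionary is illustrative, and the window here is 3 characters rather than the 6-8 suggested on the slide):

```python
# Forward maximum matching over a toy dictionary.
DICT = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_LEN = 3  # maximum word length to try (the slide suggests 6-8 characters)

def forward_max_match(sentence: str):
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shorten until a dictionary word is found.
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in DICT or length == 1:
                words.append(piece)
                i += length
                break
    return words

print(forward_max_match("研究生命的起源"))  # ['研究生', '命', '的', '起源']
```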
Model quality evaluation
Basic metrics for text retrieval
{Relevant}: the set of documents relevant to the query.
{Retrieved}: the set of documents retrieved by the system.
{Relevant} ∩ {Retrieved}: the set of documents that are both relevant and actually retrieved.
Precision: the percentage of retrieved documents that are actually relevant to the query.
Recall: the percentage of documents relevant to the query that are actually retrieved.
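Written as formulas (these follow directly from the definitions above):

\[ precision = \frac{|\{Relevant\}\cap\{Retrieved\}|}{|\{Retrieved\}|} \qquad recall = \frac{|\{Relevant\}\cap\{Retrieved\}|}{|\{Relevant\}|} \]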
Model quality evaluation example
{Relevant} = {A, B, C, D, E, F, G, H, I, J}, |{Relevant}| = 10
{Retrieved} = {B, D, F, W, Y}, |{Retrieved}| = 5
{Relevant} ∩ {Retrieved} = {B, D, F}, |{Relevant} ∩ {Retrieved}| = 3
Precision = 3/5 = 60%
Recall = 3/10 = 30%
(Venn diagram in the original: B, D, F are relevant and retrieved; A, C, E, G, H, I, J are relevant only; W, Y are retrieved only; all within the set of all documents.)
Text Categorization
General method
Use a set of pre-classified documents as a training set, derive a classification model from it (this requires a testing process and continual refinement), and then use the derived classification model to classify other documents (a minimal sketch is given below).
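A minimal sketch of this general supervised approach (toy data; scikit-learn is assumed to be available and is not mentioned in the original slides):

```python
# Train a simple text classifier on labeled documents, then classify a new one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stock market falls", "team wins the match",
              "shares and bonds rally", "player scores twice"]
train_labels = ["finance", "sports", "finance", "sports"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)                   # derive the classification model
print(model.predict(["the market rally continues"]))  # classify a new document
```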
Association-based classification
Using information retrieval techniques and association analysis, extract keywords and terms from the documents; from existing (pre-classified) documents, generate associations between keywords/terms and document classes; apply association-rule mining to discover the terms associated with each class, so that each class of documents corresponds to a set of association rules; then use these association rules to classify new documents.
Automatic classification of web documents
Statistical classification methods that use the information contained in hyperlinks.
Combine Markov Random Fields (MRF) with relaxation labeling (RL) to classify Web document data.
Text clustering
Hierarchical clustering
Flat partitioning methods (the k-means algorithm)
Simple Bayesian clustering
K-nearest-neighbor clustering
Graded clustering
Concept-based text clustering
Hierarchical clustering
Procedure (a code sketch follows):
Treat each document di in the document set D = {d1, …, di, …, dn} as a singleton class Ci = {di}; these classes form the clustering C = {C1, …, Ci, …, Cn};
compute the similarity sim(Ci, Cj) of every pair of classes (Ci, Cj) in C;
select the pair with the maximum similarity, arg max sim(Ci, Cj), and merge Ci and Cj into a new class Ck = Ci ∪ Cj, giving a new clustering C = {C1, …, Cn-1};
repeat the steps above until only one class remains in C.
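A hedged sketch of the agglomerative procedure on toy 2-D document vectors, using average pairwise cosine similarity between classes (the inter-class similarity is an assumption; the slides do not fix it):

```python
# Agglomerative clustering: repeatedly merge the most similar pair of classes.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_sim(ca, cb, docs):
    # Average similarity over all document pairs across the two classes.
    return sum(cos(docs[i], docs[j]) for i in ca for j in cb) / (len(ca) * len(cb))

docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]   # toy document vectors
clusters = [{i} for i in range(len(docs))]                 # start: one document per class

while len(clusters) > 1:
    # Find the most similar pair of classes and merge them.
    a, b = max(((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
               key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], docs))
    merged = clusters[a] | clusters[b]
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    print(clusters)   # show the clustering after each merge
```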
Flat partitioning
Partition the document set D = {d1, …, di, …, dn} into several classes. Procedure (a code sketch follows the list):
1. Determine the number k of classes to generate;
2. Generate k cluster centers as the clustering seeds S = {s1, …, sj, …, sk};
3. For each document di in D, compute in turn its similarity sim(di, sj) to each seed sj;
4. Select the seed with the maximum similarity, arg max sim(di, sj), and assign di to the class Cj whose cluster center is sj, obtaining a clustering C = {C1, …, Ck};
5. Repeat steps 2 to 4 until a relatively stable clustering result is obtained.
This method is fast, but k must be fixed in advance and the seeds are difficult to choose.
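A hedged k-means-style sketch on the same kind of toy vectors (assumptions: Euclidean distance as the dissimilarity and fixed initial seeds instead of random ones):

```python
# k-means-style flat partitioning: assign to nearest seed, then recompute seeds.
def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def kmeans(docs, seeds, iterations=10):
    for _ in range(iterations):
        # Assign each document to the nearest seed (cluster center).
        clusters = [[] for _ in seeds]
        for d in docs:
            j = min(range(len(seeds)), key=lambda i: dist(d, seeds[i]))
            clusters[j].append(d)
        # Recompute each seed as the mean of its cluster (repeat until stable).
        seeds = [tuple(sum(x) / len(c) for x in zip(*c)) if c else s
                 for c, s in zip(clusters, seeds)]
    return seeds, clusters

docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
print(kmeans(docs, seeds=[(1.0, 0.0), (0.0, 1.0)]))
```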
Automatic abstracting (automatic summarization)
Definition:
using a computer to automatically extract concise, coherent text from an original document that fully and accurately reflects the document's central content.
Automatic abstracting systems
An automatic abstracting system should automatically extract the theme or central content of the original text.
The abstract should be comprehensive, objective, understandable, and readable.
The system should be applicable to any domain.
A 1995 evaluation of automatic abstracting systems:
(1) All three systems could only extract a subset of sentences from the original text.
(2) The extracted abstracts consist of sentences taken from the original; only the abstracts of system 2 had some Chinese numerals removed.
(3) The abstracts produced by the three systems are almost completely different from one another,
and they also differ from the abstracts written by human experts.
Related topics
Chinese character input based on Chinese corpora
Automatic phrase-boundary detection and syntactic tagging of written Chinese text in corpora
Construction of machine dictionaries
Terminology databases
Machine translation
Computer-aided text proofreading
Automatic information (intelligence) retrieval systems
Chinese speech recognition systems
Chinese speech synthesis systems
Chinese character recognition systems
Domestic research situation (reconstructed from a flattened table; the original columns are serial number, institution, leading researcher(s), research content, and major journals with paper counts):
Department of Computer Science and Technology, Tsinghua University — Huang Changning — Chinese base noun-phrase analysis and recognition models, text word-sense tagging, language modeling, word segmentation algorithms, context-free parsing, morphology and pattern studies — Chinese Journal of Computers 99 (1), Journal of Software 98-99 (4), Journal of Tsinghua University (4)
Department of Electronic Engineering, Tsinghua University — Ding Xiaoqing, Wu Youshou — handwritten Chinese character recognition (dynamic matching), multi-classifier integration for Chinese character recognition (integrated recognition methods), "implementation of a business-card automatic entry system"
Computer Language Information Engineering Research Center, Chinese Academy of Sciences — Chen Xiong — machine translation, Chinese word segmentation, natural-language interfaces, syntactic analysis, semantic analysis, phonetic-to-character conversion, automatic tagging — Computer Research and Development 98 (4), Journal of Software 97 (1), others
Department of Information Engineering, Beijing University of Posts and Telecommunications — Zhong Yixin — automatic tagging of Chinese (neural-network models), automatic abstracting (text structure relations), a discourse-analysis method based on speech-act theory — Journal of Beijing University of Posts and Telecommunications, Journal of Information
Computer Application Research Institute, Shanghai Jiaotong University — Wang Yongcheng — "Chinese automatic abstracting system", automatic classification optimization algorithms based on neural networks
Department of Computer Science and Engineering, Harbin Institute of Technology — Wang Kaizhu, Wang Xiaolong — phonetic-to-character conversion, automatic abstracting, handwritten Chinese character recognition, automatic word segmentation, "Chinese word fast retrieval system" — Computer Research and Development 97-99 (5), Journal of Software 98 (1)
Department of Computer Science and Engineering, Shanghai Jiaotong University — Lu Ruzhan — sentence semantics, natural-language models, a constructive (incremental) semantic interpretation model, a tree-layer database method (a knowledge representation for unstructured data), and related reasoning methods — Journal of Software 97-00 (7), Journal of Shanghai Jiaotong University (3)
Northeastern University — Yao Tianshun, Zhu Jingbo — part-of-speech tagging, inheritance theory, transforming unrestricted natural-language processing into restricted categories, automatic extraction of Chinese information, word-class matching rules, speech-recognition models, analysis of temporal information in text, phrase-structure rules and their automatic acquisition, fuzzy cluster analysis applied to speech recognition, language disambiguation, neural-network-based automatic acquisition methods, long-distance collocation of English verbs, Chinese personal-name recognition, design and implementation of an automatic Chinese text classification model, lexical disambiguation — Computer Research and Development 97-99 (5), Journal of Software 97-99 (3), Mini-Micro Systems 97-99 (4), Journal of Northeastern University 97-99 (10) — serial number 9
Fudan University — Wu Lide — a Chinese grammar analyzer — Computer Research and Development 97 (1)
Department of Electronic and Communication Engineering, South China University of Technology — Xu Bingzhen — handwritten Chinese character recognition (elastic-mesh directional decomposition features, radial basis function networks with dynamic decay adjustment, RBF-DDA) — Journal of Circuits and Systems (1), Journal of South China University of Technology (1) — serial number 11
Institute of Computational Linguistics, Peking University — Yu Zhibo, Wei Zhifang — recognition of the predicate head word in Chinese simple sentences — Journal of Peking University (1), Journal of Chinese Information Processing (1) — serial number 10
Domestic research situation (continued): a table in the original relates research topics and systems to the institutions working on them. Topics listed: automatic word segmentation (since the 1960s); automatic part-of-speech tagging (since the 1970s); long-distance collocation of English verbs; automatic abstracting; phonetic-to-character conversion; Chinese character recognition; semantic, grammatical, and syntactic analysis; and specific systems — the CSEG&TAG Chinese word segmentation and tagging system, a "business-card automatic entry system", a "Chinese word fast retrieval system", a "Chinese automatic abstracting system", TH-OCR Chinese character recognition (optical character recognition), "Design and Implementation of a Natural Language Manager", and recognition models. Institutions listed: Department of Computer Science and Technology and Department of Electronic Engineering, Tsinghua University; Department of Information Engineering, Beijing University of Posts and Telecommunications; Computer Language Information Engineering Research Center, Chinese Academy of Sciences; Department of Computer Science and Engineering and Computer Application Research Institute, Shanghai Jiaotong University; Department of Computer Science and Engineering, Harbin Institute of Technology; Northeastern University; Department of Computer Science, Fudan University; Institute of Computational Linguistics, Peking University; Department of Electronic and Communication Engineering, South China University of Technology.
Rule-based machine translation systems (foreign)
Georgetown University's machine translation system
The Russian-French machine translation system of the University of Grenoble, France
The Canadian TAUM-METEO system
Japan's ATLAS system
Other practical machine translation systems in Japan
The TITUS-IV system of the French Textile Research Institute
The American SYSTRAN system
The American Weidner system
The American PAHO system
The German METAL system
The German SUSY system
The EUROTRA system
Japan's MU system and the ODA project
The DLT system
Rule-based machine translation systems (domestic)
The Russian-Chinese machine translation system
The English-Chinese title-record machine translation system
The Chinese-to-French/English/Japanese/Russian/German multilingual machine translation system FAJRA
The "Translation Star" (译星) English-Chinese system
The "Gaoli" English-Chinese system
The 863-IMT/EC English-Chinese system
The Matrix English-Chinese system
The "Translate" English-Chinese / Chinese-English system
The Yaxin English-Chinese system
The ReadWorld English-Chinese system
The SinoTrans Chinese-English and Chinese-Japanese machine translation system
The E-to-J English-Japanese machine translation system
Foreign text mining tools
IBM Intelligent Miner for Text, which includes:
an advanced search engine — TextMiner;
Web access tools — including the Web search engines NetQuestion and Web Crawler;
text analysis tools.
IBM's TextMiner: its main functions are feature extraction, document clustering, document classification, and retrieval.
It can retrieve texts in multiple formats and in 16 languages;
it uses deep text analysis and indexing methods;
it supports both full-text search and index-based search, with search conditions given as natural-language queries or Boolean expressions; it is a client/server tool that supports large numbers of concurrent users running search tasks;
indexes can be updated online while other search tasks are being carried out.
Foreign text mining tools (continued)
Autonomy's core product is Concept Agents;
after training, the agents can automatically extract concepts from text.
TelTech provides:
expert services;
professional literature retrieval services;
product and manufacturer search services.
The key to TelTech's success is a high-performance knowledge structure: a thesaurus of subject terms organized by discipline, containing more than 30,000 terms, maintained by knowledge engineers and updated with 500 to 1,200 terms per week.