Origins of text mining
Text databases (Web document data)
Semi-structured data
Information retrieval (IR)
Web text mining process
General processing flow of Web text mining: document set → establishing text features → reducing the feature set → extraction of learning and knowledge patterns → knowledge patterns → model quality evaluation
Establishing text features
Definition: text features are metadata that describe a text.
Classification:
Descriptive features: the name, date, size, type, etc. of the text.
Semantic features: the author, title, institution, content, etc. of the text.
Representation (document modeling):
Vector space model (VSM) (matrix representation)
Feature vector: V(D) = (T1, W1(D); T2, W2(D); …; Tn, Wn(D)), where Ti is a term and Wi(D) is the weight of Ti in document D.
Mathematical representation of text feature evaluation functions
Information Gain
Expected Cross Entropy
Mutual Information
F is the feature corresponding to word w;
P(w) is the probability that word w appears;
P(Ci) is the probability of class Ci;
P(Ci|w) is the conditional probability of class Ci given that word w appears.
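The standard forms of these evaluation functions, which the slides reference but do not reproduce, can be reconstructed from the symbol definitions above (a hedged reconstruction, not copied from the original):

\[ IG(w) = -\sum_i P(C_i)\log P(C_i) + P(w)\sum_i P(C_i\mid w)\log P(C_i\mid w) + P(\bar{w})\sum_i P(C_i\mid \bar{w})\log P(C_i\mid \bar{w}) \]

\[ CrossEntropy(w) = P(w)\sum_i P(C_i\mid w)\log\frac{P(C_i\mid w)}{P(C_i)} \]

\[ MI(w, C_i) = \log\frac{P(C_i\mid w)}{P(C_i)} \quad \text{(averaged or maximized over the classes to score } w\text{)} \]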
Mathematical representation of text feature evaluation functions (continued)
Weight of Evidence for Text
Word Frequency
P(w) is the probability that word w appears;
P(Ci) is the probability of class Ci;
P(Ci|w) is the conditional probability of class Ci given that word w appears;
TF(w) is the number of times word w appears in the document set.
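The usual forms of these two functions, again reconstructed rather than copied from the slides, are:

\[ WeightOfEvidence(w) = P(w)\sum_i P(C_i)\left|\log\frac{P(C_i\mid w)\,(1-P(C_i))}{P(C_i)\,(1-P(C_i\mid w))}\right| \]

\[ TF(w) = \text{number of occurrences of } w \text{ in the document set} \]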
Document modeling
Word frequency matrix
Rows correspond to keywords t and columns to documents d. Each document is treated as a vector v in this space; the entry values reflect the strength of association between word t and document d (the word frequency of t in d).
Document similarity calculation
Cosine measure
Cosine similarity definition: omitted in the original slides ("略"); a minimal computation sketch is given below.
Disadvantage: the number of dimensions (terms) is unbounded, so the matrix grows very large and the amount of computation increases.
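A minimal sketch of the cosine measure over term-frequency vectors (toy example, not taken from the slides; the helper name cosine_similarity is ours):

```python
# Cosine similarity between two documents represented as word-frequency vectors.
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two documents."""
    va, vb = Counter(doc_a.split()), Counter(doc_b.split())
    common = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in common)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine_similarity("data mining of web text", "text mining for web data"))
```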
Reducing the feature set
The Latent Semantic Indexing (LSI) method applies the Singular Value Decomposition (SVD) technique from matrix theory to the word-frequency matrix, keeping only a K × K matrix of the largest singular values.
Basic steps of the latent semantic indexing method (a code sketch follows the list):
1. Build the word-frequency matrix (frequency matrix).
2. Compute the SVD of the frequency matrix:
decompose it into three matrices U, S, V, where U and V are orthogonal (UᵀU = I) and S is a diagonal matrix of singular values; keep only the K largest singular values (a K × K diagonal block).
3. For each document d, replace its original vector with a new vector that omits the terms eliminated by the SVD.
4. Store all the vectors and index them with advanced multidimensional indexing techniques.
5. Compute similarities using the transformed document vectors.
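A minimal sketch of the LSI steps above, assuming a small toy term-document matrix and using numpy's SVD:

```python
# LSI sketch: truncated SVD of a toy term-document matrix.
import numpy as np

# Term-document frequency matrix: rows = terms, columns = documents.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 3, 1],
    [0, 0, 1, 2],
], dtype=float)

K = 2  # number of latent dimensions to keep
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Reduced representation of each document in the K-dimensional latent space.
doc_vectors = np.diag(s[:K]) @ Vt[:K, :]        # shape (K, n_docs)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Similarity between document 0 and document 2 in the latent space.
print(cosine(doc_vectors[:, 0], doc_vectors[:, 2]))
```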
Other text retrieval indexing techniques
Inverted index
An index structure consisting of two hash-indexed or B+-tree-indexed tables: a term table and a document table.
Find all documents related to a given term.
Find all terms related to a given document.
Easy to implement, but it cannot handle synonyms or polysemous words, and the posting lists can become very long, requiring a large amount of storage. (A code sketch follows the example tables below.)
Term table (term_table): each row pairs a term_id with its posting_list of documents, e.g.
Term_1: Doc_1, …, Doc_i
Term_2: Doc_1, …, Doc_j
…
Term_n: Doc_1, …, Doc_n
Document table (document_table): each row pairs a doc_id with its posting_list of terms, e.g.
Doc_1: t1_1, …, t1_n
Doc_2: t2_1, …, t2_n
…
Doc_n: tn_1, …, tn_n
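A minimal in-memory sketch of the two tables (an assumption: plain Python dictionaries stand in for the hash-table or B+-tree indexed files):

```python
# Toy inverted index: a term table and a document table built from a few documents.
from collections import defaultdict

documents = {
    "doc_1": "web text mining process",
    "doc_2": "text feature extraction and reduction",
    "doc_3": "web document retrieval",
}

term_table = defaultdict(set)      # term   -> posting list of documents
document_table = defaultdict(set)  # doc_id -> posting list of terms

for doc_id, text in documents.items():
    for term in text.split():
        term_table[term].add(doc_id)
        document_table[doc_id].add(term)

print(sorted(term_table["text"]))       # all documents containing "text"
print(sorted(document_table["doc_1"]))  # all terms occurring in doc_1
```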
Word class tagging (part-of-speech tagging)
Definition: determine, from the context, the word class of each word in a sentence.
Classification of class-ambiguous words:
Words of the same form serving as several classes, e.g., 'lead/leader' (verb/noun).
Words of the same form across classes with different senses, e.g., the slide's example glossed as 'hour' (measure word/noun).
Words of different form with the same meaning, e.g., 'computer' (计算机 / 电脑).
Automatic word class tagging assigns a word class (part of speech) to each word in a text.
Natural languages such as English and Chinese contain large numbers of class-ambiguous words, which makes automatic tagging difficult; resolving word-class ambiguity is therefore the key problem in automatic part-of-speech tagging research.
Main technical routes: based on probability statistics and based on rules.
Automatic word class tagging
Research on automatic part-of-speech tagging of English text began abroad as early as the 1960s; several methods for resolving word-class ambiguity were proposed and a number of automatic tagging systems were built.
In 1971, Greene and Rubin of Brown University built the TAGGIT system, which used 3,300 context frame rules to resolve ambiguity; its tagging accuracy reached 77%.
In 1983, Leech and Garside built the CLAWS system, which used probability statistics for automatic tagging; using a 133 × 133 matrix of tag co-occurrence probabilities and a statistical model to resolve class ambiguity, its tagging accuracy reached 96%.
In 1988, DeRose improved the CLAWS system, using a linear optimization method to reduce its complexity; the resulting VOLSUNGA algorithm greatly improved processing efficiency and brought the accuracy of automatic tagging to a practical level.
The CLAWS algorithm (based on probability statistics)
CLAWS stands for Constituent-Likelihood Automatic Word-tagging System. It was proposed by Marshall in 1983 as the automatic tagging algorithm for the LOB corpus (a corpus of British English covering many text types, with a capacity of one million words). The approach is as follows:
First, select some already-tagged text from the LOB corpus as the training set. From the tagged words in the training set, use a computer to count how often each pair of adjacent tags co-occurs, forming a matrix of adjacent-tag co-occurrence probabilities.
During automatic tagging, the system repeatedly takes a string of limited length from the input text such that the words at its head and tail have unique (unambiguous) tags. Such a word string is called a span, written W0, W1, W2, …, Wn, Wn+1, where W0 and Wn+1 are unambiguous words and W1, W2, …, Wn are class-ambiguous words. Using the data in the co-occurrence probability matrix, the system computes the probability of every possible tag sequence for the span and outputs the maximum-probability sequence as the chosen path. (A toy sketch of this scoring follows.)
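A hedged sketch of the span-scoring idea with toy tag-bigram probabilities (not the real LOB statistics; the table `trans` and the tag set are invented for illustration):

```python
# Score every possible tag path over a span and keep the most probable one.
from itertools import product

# Hypothetical adjacent-tag co-occurrence probabilities P(tag_j | tag_i).
trans = {
    ("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.1,
    ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
    ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}

def best_path(span_tags):
    """span_tags: list of candidate-tag sets; the first and last must be unambiguous."""
    best, best_p = None, 0.0
    for path in product(*span_tags):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= trans.get((a, b), 1e-6)
        if p > best_p:
            best, best_p = path, p
    return best, best_p

# A span whose two middle words are ambiguous (noun or verb), with unambiguous ends.
print(best_path([{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}, {"NOUN"}]))
```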
The VOLSUNGA algorithm
VOLSUNGA improves on the CLAWS algorithm mainly in two respects. For finding the optimal path, instead of computing the probability of every tag sequence and then taking the maximum, it moves from left to right using a "step-by-step best choice" strategy: for the word currently under consideration it keeps only the best path so far and discards the others, then moves on to the next word, matches all of its candidate tags, again keeps only the best path and discards the rest, and so on until the whole span has been processed; the remaining path is output as the optimal path of the span.
Second, the relative frequency of each tag for each word is estimated from the corpus, and these relative tag probabilities are used to assist the selection of the best path.
The VOLSUNGA algorithm greatly reduces the time and space complexity of the CLAWS algorithm and improves the accuracy of automatic tagging. (A toy sketch of the greedy strategy follows.)
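A hedged sketch of the step-by-step (greedy) strategy, again with invented toy probabilities:

```python
# Keep only the single best partial path at each step of the span.
def greedy_path(span_tags, trans):
    """span_tags: candidate-tag sets per word; trans: P(tag_j | tag_i) lookup."""
    path, score = [next(iter(span_tags[0]))], 1.0
    for candidates in span_tags[1:]:
        prev = path[-1]
        tag = max(candidates, key=lambda t: trans.get((prev, t), 1e-6))
        score *= trans.get((prev, tag), 1e-6)
        path.append(tag)
    return path, score

toy_trans = {("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.1,
             ("NOUN", "VERB"): 0.5, ("NOUN", "NOUN"): 0.2,
             ("VERB", "NOUN"): 0.4}
print(greedy_path([{"DET"}, {"NOUN", "VERB"}, {"NOUN", "VERB"}, {"NOUN"}], toy_trans))
```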
Shortcomings of the statistical approach
Both the CLAWS and VOLSUNGA algorithms are statistics-based tagging methods that choose tags by co-occurrence probability. However, the most probable tag is only the most likely possibility; a decision made purely by probability may keep a wrong tag and discard the correct but less probable one.
To improve tagging accuracy, the statistical approach must therefore be supplemented by rule-based methods that apply linguistic rules.
Rule-based tagging
The rule-based approach resolves ambiguity by considering how the surrounding words and tags constrain the word class of the current word. It is usually used as a supplement to probability-/statistics-based methods; the combination of statistical and rule-based methods is considered the best way to solve the ambiguity problem.
When the tagged corpus is large, association-rule mining can be applied: given a minimum support and a minimum confidence, first find the frequent pattern sets whose support exceeds the minimum support, then generate association rules from them; a rule whose confidence exceeds the given minimum confidence becomes a tagging rule. As long as the minimum confidence is set high enough, the resulting rules can be used to handle the ambiguous cases. (Because the rules involve many combinations of words and tags, the mining process is fairly complex.)
Rule-based tagging (continued)
The word class is determined mainly from the context.
Example: "白纸" ('white paper') — "白" appears before the noun "纸", so it is tagged as an adjective.
Example: "白跑" ('run in vain') — "白" appears before the verb "跑", so it is tagged as an adverb.
Coordination rule: in a coordinate structure, the two conjoined components should have the same word class; if one of them is unambiguous and the other is class-ambiguous, the ambiguous word can be given the class of the unambiguous one.
I have read several articles and reports.
"Articles" is a noun and unambiguous; "reports" can be a verb or a noun; since the two occur in a coordinate structure, "reports" can be tagged as a noun.
Huang Changning of the Department of Computer Science at Tsinghua University used statistical methods to build an automatic part-of-speech tagging system; its tagging accuracy is 96.8% and its tagging speed is 175 Chinese characters per second.
Automatic semantic tagging
Because words are polysemous, automatic semantic tagging is mainly concerned with disambiguating polysemous words.
Polysemy is a common phenomenon in natural language; however, in a given context a word generally takes only one sense.
So-called automatic semantic tagging means that the computer examines a polysemous word appearing in a particular context, determines its correct sense, and tags it. Methods for automatic semantic tagging:
Word senses
A word may carry several senses (word = sense … sense).
Determine the sense of a polysemous word by retrieving related words that appear in its context,
using the degree of affinity between words (the slide's example: 'pen').
Determine the sense of a polysemous word from its contextual relations
(the slide's example: 'plan').
Disambiguation by the most probable sense:
that is, always choose the sense in which the polysemous word is most frequently used in texts. This is obviously not a scientific method, but it still achieves a certain accuracy.
According to statistics, disambiguating with the most probable sense yields an accuracy of only 67.5% on closed texts, and an even lower 64.8% on open texts.
At present many machine translation systems use this most-probable-sense approach to determine the senses of polysemous words, which is one of the main reasons for their poor translation quality.
Other text retrieval indexing techniques (continued)
Signature file
Definition: a file that stores a signature record for each document in the database.
Method: each document corresponds to a fixed-length bit string (its signature), with one bit per word; if the word corresponding to a bit appears in the document, that bit is set to 1, otherwise it is 0.
Example signatures S1 and S2 (figure omitted in the original).
Matching is performed with bit operations on the signatures to judge how similar documents are.
Multiple words can be mapped to the same bit to shorten the bit string, but this increases the search overhead and introduces the drawbacks of a many-to-one mapping (possible false matches). (A code sketch follows.)
(Details omitted.)
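A minimal sketch of a signature file with one bit per vocabulary word and matching by bitwise AND (the vocabulary and documents are invented for illustration):

```python
# Toy signature file: fixed-length bit strings, one bit per vocabulary word.
VOCAB = ["web", "text", "mining", "retrieval", "index", "cluster"]
BIT = {w: i for i, w in enumerate(VOCAB)}

def signature(text: str) -> int:
    """Bit string (stored in an int): bit i is set iff VOCAB[i] occurs in the text."""
    sig = 0
    for word in text.split():
        if word in BIT:
            sig |= 1 << BIT[word]
    return sig

docs = {"doc_1": "web text mining", "doc_2": "text retrieval and index structures"}
query_sig = signature("text mining")

for doc_id, text in docs.items():
    doc_sig = signature(text)
    # The document may match if every query bit is also set in its signature.
    print(doc_id, (doc_sig & query_sig) == query_sig)
```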
Extraction of learning and knowledge patterns
Structure diagram of a Chinese text mining model (components shown in the figure): text source; text structure analyzer; word segmentation and stop-word removal; feature extraction; name recognition; date processing; number processing; text classifier; text summary generator; user interface; user browsing; retrieval results.
Extraction of learning and knowledge patterns (continued)
Word segmentation
Definition: inserting spaces between the words of a Chinese text.
Stop words
Definition: words that play only an auxiliary role in a text.
Classification:
Function words: "a, the, of, for, with, in, at, …" in English; "的, 地, 得, …" in Chinese.
Content words: in the papers of a database conference, the word "database" itself is treated as a stop word.
Stemming
compute, computes, and computed are treated as inflected forms of the same word.
Automatic word segmentation
Uses of automatic word segmentation:
Automatic retrieval, filtering, classification, and summarization of Chinese text
Automatic proofreading of Chinese text
Chinese-to-foreign machine translation
Post-processing for Chinese character recognition and Chinese speech recognition
Chinese speech synthesis
Whole-sentence Chinese character keyboard input
Conversion between simplified and traditional Chinese characters
Main word segmentation methods
Maximum Matching Method (MM): take a string of 6-8 Chinese characters as the maximum candidate string and match it against the entries in the dictionary; if it does not match, drop one character and match again, until the corresponding word is found in the dictionary. Matching proceeds forward through the text, from left to right. (A code sketch follows this list.)
Reverse Maximum Matching Method (RMM): the matching direction is opposite to that of the MM method, i.e., from right to left. Experiments show that, for Chinese, the reverse maximum matching method is more effective than the maximum matching method.
Bidirectional Matching Method (BM): compare the segmentation results of the MM and RMM methods to determine the correct segmentation.
Optimum Matching Method (OM): order the dictionary entries by their frequency in the text, putting high-frequency words first and low-frequency words later, which speeds up matching. Association-Backtracking Method (AB): perform segmentation with the help of association and backtracking.
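A minimal sketch of forward maximum matching with a toy dictionary (the dictionary is illustrative, and the window here is 3 characters rather than the 6-8 suggested on the slide):

```python
# Forward maximum matching over a toy dictionary.
DICT = {"研究", "研究生", "生命", "命", "的", "起源"}
MAX_LEN = 3  # maximum word length to try (the slide suggests 6-8 characters)

def forward_max_match(sentence: str):
    words, i = [], 0
    while i < len(sentence):
        # Try the longest candidate first, then shorten until a dictionary word is found.
        for length in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if piece in DICT or length == 1:
                words.append(piece)
                i += length
                break
    return words

print(forward_max_match("研究生命的起源"))  # ['研究生', '命', '的', '起源']
```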
Model quality evaluation
Basic metrics for text retrieval
{Relevant}: the set of documents relevant to the query.
{Retrieved}: the set of documents retrieved by the system.
{Relevant} ∩ {Retrieved}: the set of documents that are both relevant and actually retrieved.
Precision: the percentage of retrieved documents that are actually relevant to the query.
Recall: the percentage of documents relevant to the query that are actually retrieved.
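Written as formulas (these follow directly from the definitions above):

\[ precision = \frac{|\{Relevant\}\cap\{Retrieved\}|}{|\{Retrieved\}|} \qquad recall = \frac{|\{Relevant\}\cap\{Retrieved\}|}{|\{Relevant\}|} \]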
Model quality evaluation example
{Relevant} = {A, B, C, D, E, F, G, H, I, J}, |{Relevant}| = 10
{Retrieved} = {B, D, F, W, Y}, |{Retrieved}| = 5
{Relevant} ∩ {Retrieved} = {B, D, F}, |{Relevant} ∩ {Retrieved}| = 3
Precision = 3/5 = 60%
Recall = 3/10 = 30%
(Venn diagram in the original: B, D, F are relevant and retrieved; A, C, E, G, H, I, J are relevant only; W, Y are retrieved only; all within the set of all documents.)
Text Categorization
General method
Use a set of pre-classified documents as a training set, derive a classification model from it (this requires a testing process and continual refinement), and then use the derived classification model to classify other documents (a minimal sketch is given below).
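A minimal sketch of this general supervised approach (toy data; scikit-learn is assumed to be available and is not mentioned in the original slides):

```python
# Train a simple text classifier on labeled documents, then classify a new one.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = ["stock market falls", "team wins the match",
              "shares and bonds rally", "player scores twice"]
train_labels = ["finance", "sports", "finance", "sports"]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)                   # derive the classification model
print(model.predict(["the market rally continues"]))  # classify a new document
```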
Association-based classification
Using information retrieval techniques and association analysis, extract keywords and terms from the documents; from existing (pre-classified) documents, generate associations between keywords/terms and document classes; apply association-rule mining to discover the terms associated with each class, so that each class of documents corresponds to a set of association rules; then use these association rules to classify new documents.
Automatic classification of web documents
Statistical classification methods that use the information contained in hyperlinks.
Combine Markov Random Fields (MRF) with relaxation labeling (RL) to classify Web document data.
Text clustering
Hierarchical clustering
Flat partitioning methods (the k-means algorithm)
Simple Bayesian clustering
K-nearest-neighbor clustering
Graded clustering
Concept-based text clustering
Hierarchical clustering
Procedure (a code sketch follows):
Treat each document di in the document set D = {d1, …, di, …, dn} as a singleton class Ci = {di}; these classes form the clustering C = {C1, …, Ci, …, Cn};
compute the similarity sim(Ci, Cj) of every pair of classes (Ci, Cj) in C;
select the pair with the maximum similarity, arg max sim(Ci, Cj), and merge Ci and Cj into a new class Ck = Ci ∪ Cj, giving a new clustering C = {C1, …, Cn-1};
repeat the steps above until only one class remains in C.
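A hedged sketch of the agglomerative procedure on toy 2-D document vectors, using average pairwise cosine similarity between classes (the inter-class similarity is an assumption; the slides do not fix it):

```python
# Agglomerative clustering: repeatedly merge the most similar pair of classes.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def cluster_sim(ca, cb, docs):
    # Average similarity over all document pairs across the two classes.
    return sum(cos(docs[i], docs[j]) for i in ca for j in cb) / (len(ca) * len(cb))

docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]   # toy document vectors
clusters = [{i} for i in range(len(docs))]                 # start: one document per class

while len(clusters) > 1:
    # Find the most similar pair of classes and merge them.
    a, b = max(((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
               key=lambda p: cluster_sim(clusters[p[0]], clusters[p[1]], docs))
    merged = clusters[a] | clusters[b]
    clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    print(clusters)   # show the clustering after each merge
```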
Flat partitioning
Partition the document set D = {d1, …, di, …, dn} into several classes. Procedure (a code sketch follows the list):
1. Determine the number k of classes to generate;
2. Generate k cluster centers as the clustering seeds S = {s1, …, sj, …, sk};
3. For each document di in D, compute in turn its similarity sim(di, sj) to each seed sj;
4. Select the seed with the maximum similarity, arg max sim(di, sj), and assign di to the class Cj whose cluster center is sj, obtaining a clustering C = {C1, …, Ck};
5. Repeat steps 2 to 4 until a relatively stable clustering result is obtained.
This method is fast, but k must be fixed in advance and the seeds are difficult to choose.
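A hedged k-means-style sketch on the same kind of toy vectors (assumptions: Euclidean distance as the dissimilarity and fixed initial seeds instead of random ones):

```python
# k-means-style flat partitioning: assign to nearest seed, then recompute seeds.
def dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def kmeans(docs, seeds, iterations=10):
    for _ in range(iterations):
        # Assign each document to the nearest seed (cluster center).
        clusters = [[] for _ in seeds]
        for d in docs:
            j = min(range(len(seeds)), key=lambda i: dist(d, seeds[i]))
            clusters[j].append(d)
        # Recompute each seed as the mean of its cluster (repeat until stable).
        seeds = [tuple(sum(x) / len(c) for x in zip(*c)) if c else s
                 for c, s in zip(clusters, seeds)]
    return seeds, clusters

docs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.8)]
print(kmeans(docs, seeds=[(1.0, 0.0), (0.0, 1.0)]))
```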
Automatic abstracting (automatic summarization)
Definition:
using a computer to automatically extract concise, coherent text from an original document that fully and accurately reflects the document's central content.
Automatic abstracting systems
An automatic abstracting system should automatically extract the theme or central content of the original text.
The abstract should be comprehensive, objective, understandable, and readable.
The system should be applicable to any domain.
A 1995 evaluation of automatic abstracting systems:
(1) All three systems could only extract a subset of sentences from the original text.
(2) The extracted abstracts consist of sentences taken from the original; only the abstracts of system 2 had some Chinese numerals removed.
(3) The abstracts produced by the three systems are almost completely different from one another,
and they also differ from the abstracts written by human experts.
Related topics
Chinese character input based on Chinese corpora
Automatic phrase-boundary detection and syntactic tagging of written Chinese text in corpora
Construction of machine dictionaries
Terminology databases
Machine translation
Computer-aided text proofreading
Automatic information (intelligence) retrieval systems
Chinese speech recognition systems
Chinese speech synthesis systems
Chinese character recognition systems
Domestic research situation (reconstructed from a flattened table; the original columns are serial number, institution, leading researcher(s), research content, and major journals with paper counts):
Department of Computer Science and Technology, Tsinghua University — Huang Changning — Chinese base noun-phrase analysis and recognition models, text word-sense tagging, language modeling, word segmentation algorithms, context-free parsing, morphology and pattern studies — Chinese Journal of Computers 99 (1), Journal of Software 98-99 (4), Journal of Tsinghua University (4)
Department of Electronic Engineering, Tsinghua University — Ding Xiaoqing, Wu Youshou — handwritten Chinese character recognition (dynamic matching), multi-classifier integration for Chinese character recognition (integrated recognition methods), "implementation of a business-card automatic entry system"
Computer Language Information Engineering Research Center, Chinese Academy of Sciences — Chen Xiong — machine translation, Chinese word segmentation, natural-language interfaces, syntactic analysis, semantic analysis, phonetic-to-character conversion, automatic tagging — Computer Research and Development 98 (4), Journal of Software 97 (1), others
Department of Information Engineering, Beijing University of Posts and Telecommunications — Zhong Yixin — automatic tagging of Chinese (neural-network models), automatic abstracting (text structure relations), a discourse-analysis method based on speech-act theory — Journal of Beijing University of Posts and Telecommunications, Journal of Information
Computer Application Research Institute, Shanghai Jiaotong University — Wang Yongcheng — "Chinese automatic abstracting system", automatic classification optimization algorithms based on neural networks
Department of Computer Science and Engineering, Harbin Institute of Technology — Wang Kaizhu, Wang Xiaolong — phonetic-to-character conversion, automatic abstracting, handwritten Chinese character recognition, automatic word segmentation, "Chinese word fast retrieval system" — Computer Research and Development 97-99 (5), Journal of Software 98 (1)
Department of Computer Science and Engineering, Shanghai Jiaotong University — Lu Ruzhan — sentence semantics, natural-language models, a constructive (incremental) semantic interpretation model, a tree-layer database method (a knowledge representation for unstructured data), and related reasoning methods — Journal of Software 97-00 (7), Journal of Shanghai Jiaotong University (3)
Northeastern University — Yao Tianshun, Zhu Jingbo — part-of-speech tagging, inheritance theory, transforming unrestricted natural-language processing into restricted categories, automatic extraction of Chinese information, word-class matching rules, speech-recognition models, analysis of temporal information in text, phrase-structure rules and their automatic acquisition, fuzzy cluster analysis applied to speech recognition, language disambiguation, neural-network-based automatic acquisition methods, long-distance collocation of English verbs, Chinese personal-name recognition, design and implementation of an automatic Chinese text classification model, lexical disambiguation — Computer Research and Development 97-99 (5), Journal of Software 97-99 (3), Mini-Micro Systems 97-99 (4), Journal of Northeastern University 97-99 (10) — serial number 9
Fudan University — Wu Lide — a Chinese grammar analyzer — Computer Research and Development 97 (1)
Department of Electronic and Communication Engineering, South China University of Technology — Xu Bingzhen — handwritten Chinese character recognition (elastic-mesh directional decomposition features, radial basis function networks with dynamic decay adjustment, RBF-DDA) — Journal of Circuits and Systems (1), Journal of South China University of Technology (1) — serial number 11
Institute of Computational Linguistics, Peking University — Yu Zhibo, Wei Zhifang — recognition of the predicate head word in Chinese simple sentences — Journal of Peking University (1), Journal of Chinese Information Processing (1) — serial number 10
Domestic research situation (continued): a table in the original relates research topics and systems to the institutions working on them. Topics listed: automatic word segmentation (since the 1960s); automatic part-of-speech tagging (since the 1970s); long-distance collocation of English verbs; automatic abstracting; phonetic-to-character conversion; Chinese character recognition; semantic, grammatical, and syntactic analysis; and specific systems — the CSEG&TAG Chinese word segmentation and tagging system, a "business-card automatic entry system", a "Chinese word fast retrieval system", a "Chinese automatic abstracting system", TH-OCR Chinese character recognition (optical character recognition), "Design and Implementation of a Natural Language Manager", and recognition models. Institutions listed: Department of Computer Science and Technology and Department of Electronic Engineering, Tsinghua University; Department of Information Engineering, Beijing University of Posts and Telecommunications; Computer Language Information Engineering Research Center, Chinese Academy of Sciences; Department of Computer Science and Engineering and Computer Application Research Institute, Shanghai Jiaotong University; Department of Computer Science and Engineering, Harbin Institute of Technology; Northeastern University; Department of Computer Science, Fudan University; Institute of Computational Linguistics, Peking University; Department of Electronic and Communication Engineering, South China University of Technology.
Rule-based machine translation systems (foreign)
Georgetown University's machine translation system
The Russian-French machine translation system of the University of Grenoble, France
The Canadian TAUM-METEO system
Japan's ATLAS system
Other practical machine translation systems in Japan
The TITUS-IV system of the French Textile Research Institute
The American SYSTRAN system
The American Weidner system
The American PAHO system
The German METAL system
The German SUSY system
The EUROTRA system
Japan's MU system and the ODA project
The DLT system
Rule-based machine translation systems (domestic)
The Russian-Chinese machine translation system
The English-Chinese title-record machine translation system
The Chinese-to-French/English/Japanese/Russian/German multilingual machine translation system FAJRA
The "Translation Star" (译星) English-Chinese system
The "Gaoli" English-Chinese system
The 863-IMT/EC English-Chinese system
The Matrix English-Chinese system
The "Translate" English-Chinese / Chinese-English system
The Yaxin English-Chinese system
The ReadWorld English-Chinese system
The SinoTrans Chinese-English and Chinese-Japanese machine translation system
The E-to-J English-Japanese machine translation system
Foreign text mining tools
IBM Intelligent Miner for Text, which includes:
an advanced search engine — TextMiner;
Web access tools — including the Web search engines NetQuestion and Web Crawler;
text analysis tools.
IBM's TextMiner: its main functions are feature extraction, document clustering, document classification, and retrieval.
It can retrieve texts in multiple formats and in 16 languages;
it uses deep text analysis and indexing methods;
it supports both full-text search and index-based search, with search conditions given as natural-language queries or Boolean expressions; it is a client/server tool that supports large numbers of concurrent users running search tasks;
indexes can be updated online while other search tasks are being carried out.
Foreign text mining tools (continued)
Autonomy's core product is Concept Agents;
after training, the agents can automatically extract concepts from text.
TelTech provides:
expert services;
professional literature retrieval services;
product and manufacturer search services.
The key to TelTech's success is a high-performance knowledge structure: a thesaurus of subject terms organized by discipline, containing more than 30,000 terms, maintained by knowledge engineers and updated with 500 to 1,200 terms per week.