Zhang Chengmin (China Pharmaceutical University Library), Zhang Chengzhi (Department of Information Management, Nanjing Agricultural University)
Abstract: This article briefly introduces Internet information mining technology, explains the key technologies and system flow of network information mining, and, drawing on the development and application of an agricultural network information mining system, points out the application prospects of network information mining.
Keywords: data mining, Internet, web pages, information extraction
About the WDM Technology
Zhang Chengzhi (Department of Information Management, Nanjing Agricultural University, Nanjing 210095)
Abstract: This paper introduces Web Data Mining (WDM) and expounds its key technologies and system process. Using Agricultural Web Data Mining (AWDM) as an example, it then shows that WDM has good prospects in practice.
Keywords: Data Mining, Internet, Web Pages, Information Extraction
I. Overview
With the rapid development of the Internet, more and more information is presented to users, but at the same time users find it increasingly difficult to obtain the information they need most. Early on, semi-automated web search engines, represented by Yahoo, appeared in order to address this problem. A web search engine consists mainly of three parts: a network robot, an index database, and a query service [1]. The robot traverses Internet resources, discovering and collecting as much new information as possible. Full-text retrieval technology is used to build an index database over the collected information, which greatly speeds up retrieval. The query service receives and analyzes the user's query, matches it against the index database according to some matching strategy, such as the Boolean model or the fuzzy Boolean model, and returns to the user the set of results that reach a certain degree of match (each result includes a title, a brief summary, and a link address).

Because artificial intelligence research has not yet reached a practical level, current network robots cannot classify information accurately, so search results are often unsatisfactory. For example, a user who searches for "cotton planting" may want data on the regional distribution of cotton planting, yet most search engines return a large number of articles on cotton planting techniques. The reason is that most existing search engines rely on simple keyword matching and cannot truly understand the meaning of the user's query. In addition, most search sites process information manually, so the speed of information processing lags far behind the growth of network information.
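The index database and Boolean query service described above can be illustrated with a small sketch. It is not taken from any particular search engine: the page texts, the page ids, and the function names `build_index` and `boolean_and` are assumptions made for illustration, and only the plain Boolean AND case is shown.

```python
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def boolean_and(index, query):
    """Return the ids of pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical page texts, standing in for what a robot might collect.
pages = {
    "p1": "cotton planting technology for high yield",
    "p2": "regional distribution of cotton planting area",
    "p3": "wheat planting technology",
}
index = build_index(pages)
hits = boolean_and(index, "cotton planting")
print(sorted(hits))  # ['p1', 'p2']
```

Both the technology page and the distribution page match the query, which is the limitation noted above: keyword matching alone cannot tell which of the two the user actually wants.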
To provide personalized, proactive information services, network information mining has become a new research topic in recent years; it is the application of data mining technology to network information processing [2]. Network information mining means learning the intrinsic features shared by data objects from a large number of training samples and, on that basis, extracting information purposefully. For example, when the mining system finds that a user's interest is "cotton planting distribution", it automatically filters out unrelated data such as cotton planting techniques, which greatly reduces the user's search time and cost. Network information mining has much in common with network information retrieval, but the two also differ. Mining inherits proven results from retrieval, such as robots and full-text search, and combines them with techniques from artificial intelligence, pattern recognition, and neural networks. The biggest difference between a network information mining system and network information retrieval is that the former can capture a user's personalized information needs and collect information purposefully, on the network or in an information library, according to the target feature information. This article explains the overall flow and technical implementation of network information mining and points out the feasibility and prospects of applying it to agricultural information.

II. Key Technologies and System Flow of Network Information Mining
1. Key Technologies in Network Information Mining

(1) Feature extraction from target samples

The network information mining system adopts the vector space model (VSM), representing the target information by feature terms (t_1, t_2, ..., t_n) and their weights w_i. During information matching, these feature terms are used to evaluate how relevant an unknown text is to the target samples. Selecting the feature terms and their weights is called feature extraction of the target samples, and the quality of the feature extraction algorithm directly affects the performance of the system. Terms are distributed with different frequencies in documents of different content, so feature extraction and weighting can be based on the frequency characteristics of terms. An effective feature term should both reflect the target content and distinguish the target from other documents, so the weight of a term is taken to be proportional to its frequency within a document and inversely related to its document frequency in the training texts. This gives the following weight evaluation function:

    weight(word) = tf_ik * idf_k = tf_ik * log(N / n_k + 1)

where tf_ik is the frequency of term t_k in document d_i, idf_k is the inverse document frequency, N is the total number of documents in the target samples, and n_k is the number of documents that contain term t_k. If the length factor is taken into account, the weight can be normalized by dividing tf_ik * log(N / n_k + 1) by a normalizing denominator.

Compared with ordinary text files, HTML documents carry explicit markup, so their structural information is more obvious and the properties of their objects are richer. When calculating feature term weights, the system takes full account of these characteristics of HTML documents and gives a higher weight to text that carries more characteristic information.
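As a minimal sketch of the weighting function above, the following code computes tf_ik * log(N / n_k + 1) for documents that have already been split into terms. Two parts are assumptions rather than statements from this article: the normalization divides by the Euclidean length of the weight vector, and the match between two vectors is measured by cosine similarity. The sample documents and the names `tfidf_vectors` and `cosine` are likewise illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each term by tf_ik * log(N / n_k + 1), then scale to unit length."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # n_k: number of documents containing term k
    vectors = []
    for terms in docs:
        tf = Counter(terms)          # tf_ik: frequency of term k in document i
        weights = {t: tf[t] * math.log(n_docs / doc_freq[t] + 1) for t in tf}
        # Assumed normalization: divide by the Euclidean length of the vector.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors (assumed similarity measure)."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

# Hypothetical, already segmented documents.
docs = [
    ["cotton", "planting", "distribution", "cotton"],
    ["cotton", "planting", "technology"],
    ["wheat", "planting", "technology"],
]
vectors = tfidf_vectors(docs)
print(vectors[0])
print(cosine(vectors[1], vectors[2]), cosine(vectors[0], vectors[2]))
```

In this toy collection "planting" occurs in every document, so it receives the lowest weight per occurrence, while "distribution" occurs in only one document and is weighted more heavily.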
To improve operating efficiency, the system reduces the dimensionality of the feature vector: only the terms with high weights are kept, producing a target feature vector of lower dimension.

(2) Chinese word segmentation

English sentences use spaces as fixed separators between words, but Chinese has no such separator, and this is a major obstacle to Chinese information processing. For example, a computer cannot tell whether the same character string should be segmented as "racket, sold" or as "ball, auction". Word segmentation must therefore be carried out before word frequency statistics are computed. A relatively simple and effective segmentation method is matching against a large dictionary. A general-purpose dictionary contains many common words that will never become feature terms, so the system builds a specialized segmentation dictionary according to the mining target, which significantly improves operating efficiency while still ensuring that the features are extracted. During segmentation, the text is first split coarsely at punctuation marks, and then the forward and reverse maximum matching methods are applied. When computing word frequencies, the system takes the diversity of natural language into account by building and using a synonym dictionary, a related-word dictionary, and similar resources to improve the accuracy of information matching.

(3) Acquiring dynamic information

The Robot is an important part of a traditional search engine. It reads web pages over the HTTP protocol and roams automatically through HTML documents; a Robot is also called a Spider, Worm, or Crawler. However, a Robot can only fetch static pages on the web, while valuable information is often stored in network databases.
People cannot obtain these data through a search engine. They can only log in to a specialized information website, submit query requests through the query interface the site provides, and then obtain and browse the dynamic pages generated by the system. The network information mining system traverses the information in the network database through the query interface provided by the website, automatically analyzes and organizes the traversal results according to a domain knowledge base, and finally imports them into the local information library.
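The forward and reverse maximum matching mentioned under Chinese word segmentation in (2) can be sketched as follows. This is an illustrative version, not the system's actual module: the tiny dictionary, the sample string, the `max_len` parameter, and the function names are assumptions, and unknown single characters are simply emitted as one-character words.

```python
def forward_mm(text, vocab, max_len):
    """Scan left to right, taking the longest dictionary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words

def reverse_mm(text, vocab, max_len):
    """Scan right to left, taking the longest dictionary word ending at each position."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                words.append(piece)
                j -= size
                break
    return words[::-1]

# Hypothetical dictionary: racket, auction, sell, ball.
vocab = {"球拍", "拍卖", "卖", "球"}
text = "球拍卖"
forward = forward_mm(text, vocab, max_len=2)
backward = reverse_mm(text, vocab, max_len=2)
print(forward)   # ['球拍', '卖']
print(backward)  # ['球', '拍卖']
```

The two scans disagree on this string, which is the kind of ambiguity described in (2). The article does not say how such conflicts are resolved, so the sketch only reports both results.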
2. Implementation Flow of Network Information Mining

Figure 1 shows the overall flow chart of network information mining technology. The steps are explained below:

Step 1: Establish target samples. The user selects target texts, which serve as the source from which the user's feature information is extracted.
Step 2: Extract feature information. Based on the word frequency distribution of the target samples, extract the feature vector of the mining target with the help of the statistical dictionary and calculate the corresponding weights.
Step 3: Acquire network information. First use search engine sites to select the sites to be collected, then use the Robot program to collect static web pages, and finally obtain the dynamic information in the network databases of the visited sites, generating a WWW resource index library.
Step 4: Match information features. Extract the feature vectors of the source information in the index library, match them against the feature vector of the target samples, and return the information that meets the threshold condition to the user.

III. Application Prospects of Network Information Mining
The Internet provides users with rich resources, but without good information mining tools it is very difficult to obtain useful information. The authors take the application of network information mining to agricultural information as an example for a brief discussion. With the further development of China's telecommunications industry, network information is multiplying. Agriculture is China's primary industry, and its informatization will inevitably require an information mining system for the agricultural domain to meet the demand for agricultural information from users at all levels. An agricultural network information mining system should be built on existing mature theory. In light of the current distribution of agricultural information resources on the WWW, the statistical dictionary can be divided into specialized dictionaries for basic agricultural science, agricultural engineering, agronomy, plant protection, crops, horticulture, forestry, animal husbandry, aquatic products, fisheries, and so on. This helps improve matching accuracy and thereby search precision. During the construction of the system, three relatively important problems are involved:

1. Determining the target samples

The user's feature information is extracted from the network resources the user browses (generally HTML text). The web pages the user browses are submitted to the server and serve as the user's target samples. A suitable number of target samples is 50. With too few, the extracted keywords are too sparse to express the user's characteristic interests; with too many, system overhead increases and computation takes longer.
In the user feature extraction algorithm, term weights are measured mainly by term frequency (tf_ik), inverse document frequency (idf_k), and position. To improve how well the keywords express the user's features, word length and word distribution can also be considered as weighting factors. In general, longer words express more specialized concepts. For example, "crop cultivation" is more specific than "crops", so "crop cultivation" should be given a higher weight. Word distribution refers to how a word is spread across a text: if a non-stop word A appears in every paragraph of an article while another word B appears in only one paragraph, A is considered more characteristic than B and is given a higher weight.

2. Constructing the statistical dictionary

Both the extraction of user feature information and the automatic indexing of Internet information involve word segmentation. The quality of segmentation depends heavily on the segmentation algorithm and on the statistical dictionary it uses. The Chinese word segmentation module of this system uses the maximum matching (MM) method as its segmentation algorithm, and the statistical dictionaries used are mainly a keyword dictionary, a synonym dictionary, and a related-word dictionary. The data in the keyword dictionary come mainly from the "Chinese Library Classification" (fourth edition), the "Chinese Classified Thesaurus", the "Agricultural Professional Classification Table", "Chinese MARC", and the class S data in the "Chinese Technology Journal Database". The specific processing of these data is not described in detail here because of space limits. The data in the synonym dictionary are based mainly on the resources above and on "Synonymous Terms". The synonym dictionary plays a large role in processing user queries and in text classification.
The related-word dictionary covers related words (such as plant inspection and fruit inspection) and implication-related words (such as grafting together with dwarf rootstock, grafted seedlings, sci-shaping, bridge grafting, interstock, rootstock, and graft affinity). This dictionary can be built from the data resources above together with a term-based statistical algorithm. The design of the agricultural network information mining system should also take the mining of user interest into account. If a user's feature vector is found to contain "aloe, planting", the mining system should, after learning, increase the weights of the feature terms "aloe" and "planting", and use a user feedback mechanism to push information to the user in a timely way.
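The idea of raising the weights of "aloe" and "planting" after learning can be sketched as a simple profile update. The article does not give an update rule, so everything here is an assumption for illustration: the additive step `rate`, the renormalization so that weights sum to 1, the starting profile, and the function name `reinforce`.

```python
def reinforce(profile, observed_terms, rate=0.2):
    """Raise the weight of each observed term, then rescale so weights sum to 1."""
    updated = dict(profile)
    for term in observed_terms:
        # Assumed rule: add a fixed step for every term seen in user feedback.
        updated[term] = updated.get(term, 0.0) + rate
    total = sum(updated.values())
    if not total:
        return updated
    return {term: weight / total for term, weight in updated.items()}

# Hypothetical user profile: feature terms and their current weights.
profile = {"aloe": 0.3, "planting": 0.3, "cotton": 0.4}
new_profile = reinforce(profile, ["aloe", "planting"])
print(new_profile)
```

After the update, "aloe" and "planting" outweigh "cotton", so documents about aloe planting would score higher for this user the next time information is matched and pushed.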
In addition, deeper knowledge can be mined from groups of users. If the feature vectors generated by many users in a certain region contain "aloe", it can be inferred that an aloe trend may exist in that region. On this basis, the mining system can analyze the region's demand in the aloe market and thereby provide a scientific basis for the distribution of aloe. At present, while artificial intelligence and related technologies are not yet mature, building an agricultural network information mining system on statistical models has a certain instructive value, and every part of the system still needs further improvement.

References
1. Gudivada V N. Information Retrieval on the World Wide Web. IEEE Internet Computing, 1997, 1(5): 58~68
2. Lee level. Review of data mining technology. Small Mini Computer System, 1998, 19(4): 74~81