Authors: XU Zhen Airlines, Liu Liqin

The Web holds a vast amount of data, and how to put that data to use in complex applications has become a hot topic in current database research. Data mining discovers implicit regularities in large volumes of data, which addresses the question of how well the data can actually be applied. Making full use of useful data is the most important application of data mining technology. Compared with data on the Web, data in a traditional database is strongly structured: it is fully structured data, whereas data on the Web is semi-structured. "Semi-structured" is defined relative to the fully structured data of a traditional database. Clearly, Web-oriented data mining is far more complicated than mining a single data warehouse.

1. Heterogeneous database environment

From the perspective of database research, the information on Web sites can itself be seen as a database, only a larger and more complex one. Each site on the Web is a data source, and the sources are heterogeneous: the information and its organization differ from site to site, which makes the Web one huge heterogeneous database environment. To mine this data, two problems must be solved. First, the integration of heterogeneous data across sites must be studied; only when the data of these sites is integrated and presented to the user through a unified view is it possible to get what is needed out of such huge data resources. Second, the problem of querying data on the Web must be solved, because if the required data cannot be obtained effectively, there is no way to analyze, integrate, or process it.

2. Semi-structured data

Data on the Web differs from data in a traditional database. A traditional database has a definite data model, and specific data can be described according to that model. Data on the Web is very complex: there is no specific model that describes it, each site is designed independently, and the data itself is self-describing and changes dynamically.
Thus, data on the Web has some structure, but because of its self-describing levels it is not fully structured, and it is therefore called semi-structured data. Being semi-structured is the most prominent feature of data on the Web.

3. Solving the semi-structured data source problem

Web data mining technology must first of all solve the problems of modeling semi-structured data sources and of querying and integrating semi-structured data. To solve the integration and query problems for heterogeneous data on the Web, there must be a model that clearly describes that data. Because Web data is semi-structured, finding a semi-structured data model is the key to solving the problem. Besides defining a semi-structured data model, a semi-structured model extraction technique is also needed, that is, a technique for automatically extracting a semi-structured model from existing data. Web-oriented data mining therefore presupposes both a semi-structured data model and a technique for extracting it.

XML and Web Data Mining Technology

The new generation of WWW environment based on XML works directly with Web data. It is not only compatible with existing Web applications but also supports better sharing and exchange of information on the Web. XML can be regarded as a semi-structured data model: the description of an XML document can readily be mapped to attributes in a relational database, which enables precise queries and model extraction.

1. Origin and development of XML

XML (Extensible Markup Language) was designed by the World Wide Web Consortium (W3C) as an important branch of SGML (Standard Generalized Markup Language) intended specifically for Web applications. In general terms, XML is a meta-markup language that provides a format for describing structured data. More specifically, XML is a language similar to HTML, designed to describe data.
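The idea of extracting a semi-structured model from existing data, mentioned above, can be illustrated with a small sketch. It is not the authors' method; it is one simple possibility, assuming that the "model" is just the set of root-to-node label paths found in a document together with how often each occurs. The sample document, its tag names, and the function name `extract_paths` are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document; the tags are invented for illustration.
SAMPLE = """
<catalog>
  <book id="b1"><title>XML Basics</title><price>29.90</price></book>
  <book id="b2"><title>Web Mining</title><author>Unknown</author></book>
</catalog>
"""


def extract_paths(xml_text):
    """Return a Counter mapping each label path in the document to its frequency."""
    root = ET.fromstring(xml_text)
    paths = Counter()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths[path] += 1
        # Attributes are recorded as "@name" steps under their element.
        for name in elem.attrib:
            paths[f"{path}/@{name}"] += 1
        for child in elem:
            walk(child, path)

    walk(root, "")
    return paths


if __name__ == "__main__":
    # Paths that occur in only some records (price, author) show up with
    # lower counts, which is what makes the data semi-structured.
    for path, count in sorted(extract_paths(SAMPLE).items()):
        print(count, path)
```

A path summary like this is only a starting point; it says which structures occur, not what types or constraints the values obey.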
XML provides an application-independent way to share data. It uses a new standard language for self-describing information, which allows computer communication to extend the Internet's role from information transfer to many other kinds of human activity. XML consists of a few rules that can be used to create markup languages, and a compact program called a parser can process every markup language created this way. Just as HTML gave the first computer users a way to display and read Internet documents, XML creates a universal language that anyone can read and write. XML addresses two Web problems that HTML cannot solve: the Internet is growing quickly while access remains slow, and although more and more information is available, it is hard to find the part you need. XML adds structural and semantic information, so that computers and servers can process many forms of information. XML's extensibility therefore makes it possible not only to download large amounts of information from a web server but also to greatly reduce network traffic.

Tags in XML are not predefined; users must define the tags they need, which makes XML a self-describing language. XML uses a DTD (Document Type Definition) to define the structure of the data, while XSL (Extensible Stylesheet Language) is the mechanism that describes how the documents are displayed; it is the style sheet language of XML. XSL is more capable than the CSS (Cascading Style Sheets) used with HTML, and it has two parts: a method for transforming XML documents and a method for formatting them. XLL (Extensible Linking Language) is XML's linking language. It provides links in XML that resemble those of HTML but are more powerful: with XLL, links can run in multiple directions and can exist at the object level, not just the page level.
Since XML can mark up more information, it makes it easy for users to find the information they need. With XML, Web designers can create not only text and graphics but also multi-level, interdependent systems, data trees, metadata, hyperlink structures, and style sheets, all defined by document types.

2. Main features of XML

The characteristics of XML determine its superior performance. As a markup language, XML has several notable features:

(1) Simple. XML has been carefully designed and the whole specification is short. It consists of a few rules that can be used to create markup languages, and a compact program, usually called a parser, can process every markup language created this way. XML can create a universal language that anyone can read and write, which gives it a unifying role. Tags created in XML always come in pairs, and XML relies on the new Unicode encoding standard.

(2) Open. XML is derived from SGML, for which much mature software is already on the market to help with authoring, management, and so on. As an open standard, XML is built on proven standard technology and is optimized for networks. Many leading companies in the industry work with the W3C working group to help ensure interoperability, to support developers, authors, and users on all systems and browsers, and to improve the XML standard. An XML parser can load an XML document programmatically. Once the document is loaded, the user can obtain and manipulate the information of the whole document through the XML document object model, which speeds up network operation.

(3) Efficient and extensible.
XML supports reusable document fragments. Users can invent and use their own tags and share them with others, and because XML is extensible, an unlimited set of tags can be defined. XML provides a framework for marking up structured data. An XML element can declare its associated data to be a retail price, a sales tax, a book title, a quantity, or any other data element. As more and more institutions around the world adopt the XML standard, more related capabilities will follow: once data has been located, it can be transmitted over the wire in any way and rendered in a browser, or passed to other applications for further processing. XML provides an application-independent way to share data. With DTDs, people in different groups can exchange data using a common DTD. Your application can use this standard DTD to verify that the data it receives is valid, and you can also use a DTD to verify your own data.

(4) Internationalized. The standard is internationalized and supports most of the world's written languages. This comes from its reliance on the new Unicode encoding standard, which supports mixed text written in all of the world's major languages. In HTML, as in most word processing, a document is generally written in one particular language, whether English, Japanese, or Arabic; if the user's software cannot read the characters of that language, the user cannot use the document. Software that can read XML, however, can handle any combination of these different languages' characters. XML can therefore exchange information not only between different computer systems but also across national and cultural boundaries.

3. Application of XML in Web data mining

XML has become a formal specification, and developers can mark up and exchange data in XML format. XML provides a good approach to data processing in a three-tier architecture.
Using a scalable three-tier model, XML can be generated from existing data, and data structured with XML can be kept separate from the business rules and the presentation. Integrating, delivering, processing, and displaying the data are the individual steps of this process.

The applications that will drive the adoption of XML are Web applications that cannot be built with standard HTML. They fall into four categories: applications that require the Web client to mediate between two or more heterogeneous databases; applications that try to move most of the processing load from the Web server to the Web client; applications that require the Web client to present the same data to different users in different views; and applications in which intelligent Web agents tailor the information they find to the needs of individual users. Clearly, these applications are closely linked to Web data mining technology, and Web-based data mining must rely on them.

XML gives Web-based applications power and flexibility, and so it brings many benefits to developers and users. One example is more meaningful searching, because Web data can be uniquely identified with XML. Without XML, search software must understand how each database is built, which is practically impossible because almost every database describes its data differently. The same holds for integrating data from different sources: searching across multiple incompatible databases is practically impossible today. XML enables structured data from different sources to be combined easily. Software agents on the middle tier can integrate data from back-end databases and other applications, and the data can then be delivered to clients or to other servers for further aggregation, processing, and distribution. XML's extensibility and flexibility allow it to describe the data held in many different kinds of applications, from descriptions of collected Web pages to data records.
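The middle-tier integration described above can be sketched in a few lines. This is only an illustration under stated assumptions: the two "sources" are in-memory stand-ins for a relational table and a legacy file, and the field names, the `<books>` vocabulary, and the functions `to_unified_xml` and `titles_cheaper_than` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Two hypothetical sources that describe the same kind of record differently.
SOURCE_A = [{"name": "XML Basics", "cost": "29.90"}]  # e.g. rows of a relational table
SOURCE_B = [("Web Mining", 35.5)]                     # e.g. tuples read from a legacy file


def to_unified_xml(rows_a, rows_b):
    """Map both source formats onto one shared XML vocabulary."""
    root = ET.Element("books")
    for row in rows_a:
        book = ET.SubElement(root, "book", source="A")
        ET.SubElement(book, "title").text = row["name"]
        ET.SubElement(book, "price").text = row["cost"]
    for title, price in rows_b:
        book = ET.SubElement(root, "book", source="B")
        ET.SubElement(book, "title").text = title
        ET.SubElement(book, "price").text = f"{price:.2f}"
    return root


def titles_cheaper_than(root, limit):
    """Query the unified view without knowing how either source stores data."""
    return [
        book.findtext("title")
        for book in root.iter("book")
        if float(book.findtext("price")) < limit
    ]


if __name__ == "__main__":
    unified = to_unified_xml(SOURCE_A, SOURCE_B)
    print(ET.tostring(unified, encoding="unicode"))
    print(titles_cheaper_than(unified, 30.0))
```

The query function only sees the shared vocabulary, which is the point of the unified view: the differences between the sources are absorbed once, in the mapping step.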
At the same time, because XML-based data is self-describing, it can be exchanged and processed without a built-in description of the incoming data. With XML, users can easily carry out computation and processing locally: once data in XML format has been delivered to the client, the client can use application software to parse the data, edit it, and process it.
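As a rough illustration of this kind of local processing, the sketch below parses an XML payload on the client side, computes a total, and edits one value, all without contacting a server. The `<order>` document, its attribute names, and the two helper functions are hypothetical examples, not part of any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload, as it might arrive from a server.
RECEIVED = (
    "<order>"
    "<item qty='2' price='3.50'>pen</item>"
    "<item qty='1' price='12.00'>notebook</item>"
    "</order>"
)


def order_total(xml_text):
    """Compute the order total locally from the self-describing data."""
    root = ET.fromstring(xml_text)
    return sum(
        int(item.get("qty")) * float(item.get("price"))
        for item in root.iter("item")
    )


def set_quantity(xml_text, name, qty):
    """Edit the received data locally and return the updated XML text."""
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        if item.text == name:
            item.set("qty", str(qty))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(order_total(RECEIVED))
    print(order_total(set_quantity(RECEIVED, "pen", 4)))
```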
Users can process the data in different ways rather than just display it. The XML Document Object Model (DOM) allows data to be manipulated with scripts or other programming languages, and calculations on the data can be done without going back to the server. XML can be used to separate the interface through which the user views data from the data itself. With this simple, flexible, and open format, powerful applications can be created for the Web that previously could be built only on high-end databases.

In addition, once the data has been delivered to the desktop, it can be displayed in a variety of ways. By describing structured data in a simple, open, and extensible way, XML complements HTML, which is widely used to describe user interfaces. HTML describes the appearance of the data, while XML describes the data itself. Because display is separated from content, data defined in XML can be given different views so that it is presented appropriately. Local data can be presented dynamically according to client configuration, user choice, or other criteria. CSS and XSL provide declarative mechanisms for describing how the data is displayed.

With XML, data can be updated granularly. When part of the data changes, the whole structured data set does not have to be sent again: only the changed element needs to be sent from the server to the client, and the changed data can be shown without refreshing the entire user interface. At present, by contrast, if a single piece of data changes, the whole page must be rebuilt, which severely limits server scalability. XML also allows data to be added, such as predicted temperatures. The added information can flow into the existing page without the server having to send the browser a new page.

XML is applied where a client needs to interact with different data sources. The data may come from different databases, each with its own complex format, but the client interacts with all of these databases through one standard language, namely XML. Because XML is customizable and extensible, it is expressive enough for all kinds of data.
After the client receives the data, it can process it or pass it between different databases. In short, in this kind of application XML solves the problem of a unified interface to the data. Unlike other data transfer standards, however, XML does not define a specific specification for the data in a data file. Instead, it attaches tags to the data to express its logical structure and meaning. This makes XML a specification that programs can interpret automatically.

XML is also applied to distribute a large share of the computational load to the client. Clients can select and build different applications according to their own needs to process the data, while the server only has to deliver the same XML file. In the traditional client/server mode of work, clients send different requests to the server and the server answers each one separately. This not only increases the load on the server itself; network administrators must also investigate the varied needs of different users and write corresponding programs. When user requirements are complex and all the business logic is still concentrated on the server side, the server's programmers may be unable to satisfy the many application needs or keep up with changing requirements, and both sides end up in a passive position. Applying XML hands the initiative to the client: the server's job is only to package the data into XML files as completely and accurately as possible. Because XML is self-describing, the client understands the logical structure and meaning of the data as it receives it, which makes broad, general-purpose distributed computing possible.

XML is also applied in network agents that edit and trim the information they obtain to suit individual users. Some clients obtain data not to use it directly but to organize their own databases as needed. For example, an education department may establish a huge question bank.
At exam time, questions are drawn from the bank to compose a test paper, and the paper is packaged as an XML file. At each school the file then passes through a filter that strips out all the answers before the paper is presented to the candidates, while the unfiltered content can be sent directly to the teachers. After the exam, of course, a compilation of the questions can also be distributed.
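The answer filter in this example can be sketched as follows. The layout of the test paper (a `<paper>` of `<question>` elements, each with a `<text>` and an `<answer>`) and the function name `strip_answers` are assumptions made for illustration; the source does not specify a format.

```python
import xml.etree.ElementTree as ET

# Hypothetical test paper as packaged from the question bank.
PAPER = """
<paper>
  <question id="1">
    <text>What does XML stand for?</text>
    <answer>Extensible Markup Language</answer>
  </question>
  <question id="2">
    <text>Which body designed XML?</text>
    <answer>W3C</answer>
  </question>
</paper>
"""


def strip_answers(xml_text):
    """Return the candidate version of the paper, with every answer removed."""
    root = ET.fromstring(xml_text)
    # Materialize the list first so the tree is not modified while iterating.
    for question in list(root.iter("question")):
        for answer in question.findall("answer"):
            question.remove(answer)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(strip_answers(PAPER))  # candidate copy; PAPER itself is the teacher copy
```

The same source file serves both audiences: teachers receive it unchanged, and candidates receive the filtered copy.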