Contents

- HTML: Advantages and Disadvantages
- Background
- Method Overview
- Getting the Source Information in XHTML Format
- Finding the Reference Points Within the Data
- Mapping the Data to XML
- Merging and Processing the Results
- Conclusion
- Resources
- About the Authors
Automatically Extracting Web Data with HTML, XML, and Java

Jared Jackson (jjared@almaden.ibm.com)
Jussi Myllymaki (jussi@almaden.ibm.com)
IBM Research
June 2001
It is undeniable that the World Wide Web is by far the world's richest and most varied source of information. Unfortunately, its structure makes it difficult to use that information in any systematic way. The methods and tools described in this article will enable developers familiar with the Web's most common technologies to quickly and easily extract the information they need.
The rapid growth of the Web in the information age has put a wide variety of public information into circulation in enormous quantities. Unfortunately, while HTML, the main carrier of this information, provides a convenient way to present information to a human reader, it is not a good structure from which to automatically extract the information relevant to a data-driven service or application. A variety of approaches have been tried to solve this problem. Most of them use some special-purpose query language to map portions of an HTML page onto code that fills a database with the information found on the page. Although these approaches may offer some benefit, most of them are impractical for two reasons: first, they require a developer to spend time learning a query language that cannot be used anywhere else, and second, they are not robust enough to survive even simple changes to the target Web page.

In this article we develop a method of Web-based data extraction that uses only standard Web technologies: HTML, XML, and Java. This approach is as powerful as the dedicated methods, if not more so, and anyone already familiar with these Web technologies needs only a little effort to get good results. As a bonus, the article includes much of the code needed to get started on data extraction.

HTML: Advantages and Disadvantages

HTML is usually a difficult medium to process automatically. Most of the content of a Web page describes formatting that is irrelevant to a data-driven system, and the structure of the document may change as often as the server-side scripts that dynamically generate it. The problem is compounded by the fact that a major portion of all Web pages is not well formed, a consequence of how forgiving today's Web browsers are when parsing HTML.

Despite these problems, HTML still has advantages for data extraction. The data you are interested in can usually be isolated to a single <table> or <div> tag nested deep in the HTML tree, which allows the extraction process to concentrate on a small portion of the document. In the absence of client-side scripts, there is only one way to define a drop-down menu or other list of data. These aspects of HTML allow us to concentrate on the data extraction itself once we have the data in a well-formed format.

Background

The key to the data extraction technique described here is to convert existing Web pages into XML, or rather into XHTML, a subset of XML, and then to use a small sample of the wide range of tools available for processing XML-structured data to retrieve the data we need. Fortunately, there is a solution that corrects the weaknesses of most HTML page designs. Tidy, available as a library for several programming languages, is a free product that can be used to correct common mistakes in HTML documents and produce equivalent documents with good form. Tidy can also be used to produce these documents in XHTML format (see Resources).

The code examples in this article are written in Java, and compiling and running them requires the Tidy JAR file to be in your system's classpath. They also require the XML libraries made available through the Apache project: Xerces and Xalan. Both libraries are based on code contributed by IBM and handle XML parsing and XSL transformation, respectively. All three libraries can be obtained free from the Web; to find them, see the Resources section later in this article. An understanding of the Java programming language, XML, and XSL transforms will help you follow the examples; references for these technologies are also listed in the Resources section.
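As a concrete illustration, if the three JAR files were downloaded into a lib directory, compiling the article's helper class (introduced below) might look like the following. The JAR file names are illustrative assumptions; they vary by distribution and version, and on Windows the classpath separator is a semicolon rather than a colon:

javac -classpath lib/Tidy.jar:lib/xerces.jar:lib/xalan.jar XMLHelper.java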
Method Overview

The method of data extraction is best introduced with an example. Suppose we are interested in tracking the temperature and humidity levels measured in Seattle, Washington, over the course of each day. If no ready-made software reports this kind of information in a way that meets our needs, we still have the opportunity to collect such information from any number of public Web sites. Figure 1 illustrates the extraction process as a whole: a Web page is retrieved and processed, producing a data set that is then merged into an existing data set.

Figure 1. An overview of the extraction process

After only a few steps, we have a suitable and reliable system for collecting our information. The steps are listed here to give a brief summary of the process, a high-level view of which is shown in Figure 1:

1. Identify the data source and map it into XHTML.
2. Find the reference points within the data.
3. Map the data into XML.
4. Merge the results and process the data.

Each of these steps is described in detail below, along with the code necessary to perform it.

Getting the Source Information in XHTML Format

In order to extract data, you of course need to know where to find it. In most cases the source is obvious. If we wanted to collect the titles and URLs of articles on developerWorks, we would use http://www.ibm.com/developerWorks as our target. In the weather case we have a number of sources to choose from. We will use Yahoo! Weather in the example, but other sources would serve equally well. Specifically, we will track the data found at the URL http://weather.yahoo.com/forecast/Seattle_WA_US_f.html. Figure 2 shows a screen snapshot of this page.

Figure 2. The Yahoo! Weather page for Seattle, Washington

When choosing a Web page as a source, it is important to keep the following questions in mind:

- Does the source produce reliable data over a reliable network connection?
- How long will the source remain available? A week, a month, or even a year?
- How stable is the layout structure of the source?

The Web is a dynamic environment, but the more reliable and stable the source we find, the easier our job will be.
Once the source has been identified, the first step of the extraction process is to convert the data from HTML into XML. We will accomplish this task, along with other XML-related tasks, by constructing a Java class named XMLHelper that is made up of static helper functions. The full source of this class can be found through the links to XMLHelper.java and XMLHelperException.java, and we will build it up as the article progresses.

We use the functionality provided by the Tidy library to perform the conversion in the method XMLHelper.tidyHTML(). This method accepts a URL as a parameter and returns an XML Document as its result. Check carefully for exceptions when calling this method or any of the other XML-related methods. Listing 1 shows the code; a rough sketch of the underlying Tidy call appears after Figure 3. Figure 3 shows the result of the code, with Microsoft Internet Explorer's XML viewer displaying the XML produced from the Weather page.

Figure 3. The Yahoo! Weather page converted to XHTML
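The original Listing 1 did not survive in this copy of the article. The following is a minimal sketch of the kind of conversion tidyHTML() performs, calling the JTidy API directly; the class name and the configuration flags chosen here are illustrative assumptions rather than the article's exact code.

import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public class TidySketch {
    /**
     * Fetch a Web page and let Tidy repair it into a well-formed
     * XHTML DOM document. A sketch only, not the article's Listing 1.
     */
    public static Document tidyHTML(String url) throws Exception {
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);          // emit XHTML rather than plain HTML
        tidy.setQuiet(true);          // suppress progress messages
        tidy.setShowWarnings(false);  // real-world pages produce many warnings
        InputStream in = new URL(url).openStream();
        try {
            // The second argument is an optional OutputStream for the
            // pretty-printed result; null means we only want the DOM back.
            return tidy.parseDOM(in, null);
        } finally {
            in.close();
        }
    }
}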
Finding the Reference Points Within the Data

Notice that the vast majority of the information in the source, whether viewed as a Web page or as XHTML, is completely irrelevant to us. Our next task is therefore to locate a specific region in the XML tree from which we can extract our data without concerning ourselves with the extraneous information. For more complex extractions, we might need to find several instances of such regions on a single page.

The easiest way to accomplish this is usually to examine the Web page first and then work with the XML. Simply looking at the page tells us that the information we are after is located in the upper middle section of the page. Even with a very limited familiarity with HTML, it is easy to infer that the data we are looking for is probably contained within the same <table> element, and that this table will probably always contain words such as "Appar Temp" and "Dewpoint", whatever the day's readings happen to be.

Making a note of what we have observed, we now consider the XHTML produced from the page. A text search for "Appar Temp" (shown in Figure 4) reveals that the text is indeed inside a table that holds all of the data we need. We will make that table our reference point, or anchor.

Figure 4. Finding the anchor by searching for a table containing the text "Appar Temp"

Now we need a way of finding this anchor. Since we are about to use XSL to transform our XML anyway, we can use an XPath expression to do the job. We could use the following trivial expression:

/html/body/center/table[6]/tr[2]/td[2]/table[2]/tr/td/table[6]
This expression specifies a path from the root element down to the anchor. This trivial approach breaks very easily whenever the page layout is modified. A better approach is to specify the anchor in terms of the content surrounding it. Using that approach, we rebuild the XPath expression:
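The expression itself was lost from this copy of the article, so what follows is a reconstruction rather than the article's exact code. A content-based anchor of the kind just described would look something like this, selecting the table by the text it contains instead of by its position in the tree:

//table[starts-with(normalize-space(.), 'Appar Temp')]

Because the match is keyed to the words "Appar Temp" rather than to a chain of positional indexes, the expression keeps working when the layout tables around the anchor are added, removed, or reordered.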
Mapping the Data to XML

With this anchor in hand, we can create the code that actually extracts the data. This code takes the form of an XSL file. The purpose of the XSL file is to identify the anchor, to specify how to get from the anchor to the data we are looking for (in short hops), and to construct an XML output file in the format we want. The process is really much simpler than it might sound. Listing 2 gives the XSL code that performs it, which is also available as an XSL text file.

The <xsl:output> element simply tells the processor that the desired result of the transform is XML. The first <xsl:template> establishes the root element of the output and tells the processor to search for the anchor. The second <xsl:template> ensures that we match only what we intend to match. Finally, the last <xsl:template> defines the anchor in its match attribute and then tells the processor to hop over to the temperature and humidity data we are trying to extract.

Of course, just writing the XSL does not finish the job; we also need a tool that performs the transformation. So we make use of XMLHelper class methods to parse the XSL and to perform the transformation. The methods that perform these tasks are parseXMLFromURL() and transformXML(), respectively. Listing 3 gives code that uses these methods.

Listing 3

/**
 * Retrieve the XHTML file written to disk in the Listing 1
 * example and apply our XSL transformation to it. Write the
 * result to disk as XML.
 */
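// The body of Listing 3 was also lost from this copy of the article. What
// follows is a reconstruction consistent with the prose above; the XMLHelper
// method signatures, the output helper, and the file names are assumptions,
// not the original code.
public static void main(String[] args) {
    try {
        // Parse the XHTML document that Listing 1 wrote to disk
        Document xhtml = XMLHelper.parseXMLFromURL("file:weather.xhtml");
        // Parse the stylesheet from Listing 2 and apply the transformation
        Document xsl = XMLHelper.parseXMLFromURL("file:extract.xsl");
        Document result = XMLHelper.transformXML(xhtml, xsl);
        // Write the resulting XML data set to disk (helper name assumed)
        XMLHelper.outputXMLToFile(result, "weather.xml");
    } catch (XMLHelperException e) {
        e.printStackTrace();
    }
}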
Merging and Processing the Results

If we only wanted to perform a single extraction, we would now be finished. However, we do not just want to know the temperature at one moment; we want to know it at several different moments. What we need to do now is perform the extraction process repeatedly, merging the results into a single XML data file. We could do this with XSL once again, but instead we will create one last method for merging XML files in the XMLHelper class. The mergeXML() method allows us to merge the data obtained in the current extraction into a file containing the data from previous extractions.

The code that runs the whole process is given in the WeatherExtractor.java file. I leave the scheduling of the program's execution to the reader, since system-dependent ways of performing this task are generally superior to simple programmatic approaches. Figure 5 shows the result of running WeatherExtractor once a day for four days.

Figure 5. The results of the Web extraction
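WeatherExtractor.java itself is not reproduced in this copy of the article. The following is a minimal sketch of how its main routine might tie the four steps together, assuming the XMLHelper methods described above; the method signatures and the file names here are assumptions.

import org.w3c.dom.Document;

public class WeatherExtractor {
    public static void main(String[] args) {
        try {
            // Step 1: fetch the page and map it into XHTML
            Document xhtml = XMLHelper.tidyHTML(
                "http://weather.yahoo.com/forecast/Seattle_WA_US_f.html");

            // Steps 2 and 3: the stylesheet finds the anchor and maps the
            // temperature and humidity data into our own XML format
            Document xsl = XMLHelper.parseXMLFromURL("file:extract.xsl");
            Document newData = XMLHelper.transformXML(xhtml, xsl);

            // Step 4: merge the new readings into the accumulated data file
            XMLHelper.mergeXML("weather.xml", newData);
        } catch (XMLHelperException e) {
            e.printStackTrace();
        }
    }
}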
Conclusion

In this article we have described and demonstrated the basic principles of a robust way to extract information from the largest source of information in existence, the World Wide Web. We have also discussed the coding tools that allow any Java developer to begin extraction efforts of his or her own with a minimum of effort. Although the example in this article concentrates on extracting weather information about Seattle, Washington, nearly all of the code that appears here is reusable for any data extraction. In fact, aside from small changes to the WeatherExtractor class, the only code that needs to change for other data extraction projects is the XSL transform code (which, incidentally, never needs to be compiled).

The method really is as simple as it sounds. By wisely choosing reliable data sources, and by choosing anchors within those sources that are tied to content rather than to format, you can have a low-maintenance, reliable data extraction system. And, depending on your level of experience and the amount of data to be extracted, you can have it installed and running within an hour.

Resources

- Tidy for Java is maintained by Sami Lempinen and can be downloaded from SourceForge.
- The XML libraries, Xerces and Xalan, are available from the Apache Project Web site.
- For more information on XML, developerWorks has a zone dedicated to the technology.
- Many tutorials on XSL and XPath are available; use your favorite Web search engine to find them.
- Jussi Myllymaki has a paper on the relationship between Web crawling and data extraction in the Andes system, published at WWW10 in Hong Kong.
- For techniques for personalizing your Web site, along with tips on maximizing site performance, see "Managing Web site performance", which covers fine-tuning performance from the browser through to the database server and legacy systems.

About the Authors

Jared Jackson has been working at the IBM Almaden Research Center since graduating from Harvey Mudd College in May 2000. Jared is also a graduate student in the Department of Computer Science at Stanford University. He can be reached at jjared@almaden.ibm.com.

Jussi Myllymaki joined the IBM Almaden Research Center in 1999 and holds a Ph.D. from the University of Wisconsin at Madison. He can be reached at jussi@almaden.ibm.com.