Thoughts on Internet Information Acquisition
This article is posted here to invite discussion. Please send comments to karymay@163.net. Welcome to my homepage: http://www.Websamba.com/karymay/
The rapid development of the Internet provides us with a wealth of information, but it also raises the problem of how to use that information effectively; the phenomenon of "rich data, poor knowledge" is increasingly prominent. Current methods for mining such data are generally referred to as "knowledge discovery" or "data mining". Knowledge discovery involves data collection, data cleaning, data output, and so on, and combines statistics, pattern recognition, artificial intelligence, and machine learning. It can be viewed as a process of collecting information from heterogeneous data sources and converting it into the information the user requires.
The heterogeneity of information sources is the main reason network information is difficult to use. Because of it, both collecting and organizing the information are hard. Many companies and institutions at home and abroad have invested substantial resources in development, and many tools and products exist: general-purpose search engines such as Google; more specialized ones such as MP3 search engines; and tools dedicated to collection, such as "information warehouses". Teleport Pro and Google's back end can both be placed in the information collection category, and the CGROBOT program that I am responsible for developing also has some data reorganization capability. However, these products are usually highly specialized and are not suitable for small businesses and individual users. Although Teleport Pro is also used by individuals, the data (pages) it downloads typically takes a lot of effort to edit before it can be used. So far there has been no convenient data acquisition and organization tool that suits both individual users and companies.
This article tries to approach data mining from another angle. Although the data on the Internet as a whole is highly heterogeneous, it is structured within a specific website or web page. Ignoring that original structure makes for a simple design, but given the current limits of artificial intelligence, even the most advanced system built that way cannot meet the needs of most users.
If instead we analyze the relationships within the original website, between its pages and between the elements of each page, and then convert those relationships into the data the user needs according to the user's instructions, we can say that the acquisition system makes effective use of both the wisdom of the website's creator and the wisdom of the user.
One.
Features of the web page
In this article, the elements that a web page can present to the user are called web elements, including elements related to sight, sound, and window events. They are related to the specific internal elements of the web page, but this article looks at them mainly from the user's point of view. Software that does not start from the user's point of view is likely to be hard to use or weak in function.
1. Properties of web elements themselves
1). Web elements have spatial properties. These include their planar position (X and Y axes) when the page is displayed, and also their order along the Z axis. For example, one web element can cover another element or the page background.
2). Web elements have temporal properties. An element can move, and it can also be displayed only at a certain time.
3). Web elements have event properties. Web elements can respond to mouse events.
4). Web elements can also be animated, or can be auditory (music).
2. Relationships between web elements
1). Spatial positions are often relative. The position of one web element can affect the position of another.
2). There may be a sequential relationship in time. For example, one element is displayed only after another has been displayed, or one element changes after another is clicked.
3). Parent-child relationship. A parent element is composed of child elements. On screen this usually shows as the parent element completely containing its children (although this containment is sometimes broken).
If the concept of a web element is extended, a window can also be regarded as a (composite) web element, and the window's title, status line, URL, and so on are web elements too. However, the scope of the concept must be fixed in the specific design, to avoid cases that are impossible or difficult to implement.
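The properties and relationships described in this section could be modeled roughly as follows. This is only a minimal Python sketch: the class name `WebElement`, its fields, and the sample elements are all hypothetical and are not taken from CGROBOT or any existing tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WebElement:
    """Hypothetical model of a web element as the user sees it."""
    name: str
    x: int                      # left edge on the page plane
    y: int                      # top edge on the page plane
    width: int
    height: int
    z: int = 0                  # stacking order; higher values are drawn on top
    visible_from: float = 0.0   # seconds after page load when the element appears
    events: List[str] = field(default_factory=list)   # e.g. ["click"]
    children: List["WebElement"] = field(default_factory=list)
    # Excluded from repr/compare so the parent/child cycle causes no recursion.
    parent: Optional["WebElement"] = field(default=None, repr=False, compare=False)

    def add_child(self, child: "WebElement") -> None:
        """Record a parent-child relationship."""
        child.parent = self
        self.children.append(child)

    def contains(self, other: "WebElement") -> bool:
        """True if this element's rectangle fully encloses the other's."""
        return (self.x <= other.x
                and self.y <= other.y
                and self.x + self.width >= other.x + other.width
                and self.y + self.height >= other.y + other.height)

    def overlaps(self, other: "WebElement") -> bool:
        """True if the two rectangles share some area on the X/Y plane."""
        return (self.x < other.x + other.width
                and other.x < self.x + self.width
                and self.y < other.y + other.height
                and other.y < self.y + self.height)

    def covers(self, other: "WebElement") -> bool:
        """True if this element overlaps the other and sits above it on the Z axis."""
        return self.overlaps(other) and self.z > other.z


# Made-up sample layout: a window (a composite element), an area inside it,
# and a popup that appears two seconds after load and is drawn above the area.
window = WebElement("window", 0, 0, 800, 600)
area_a = WebElement("A", 0, 100, 200, 500, events=["click"])
popup = WebElement("popup", 50, 150, 100, 100, z=1, visible_from=2.0)
window.add_child(area_a)
```

Here `window.contains(area_a)` expresses the usual parent-child containment, while `popup.covers(area_a)` expresses the Z-axis relationship. Keeping `contains` separate from the `children` list allows for the case where the parent does not fully enclose a child.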
Two.
Information Collection
Information collection works as follows: the user specifies what content needs to be collected, which part of the database it maps to, and any other acquisition rules; the acquisition system then collects the information according to these specifications. Ease of use is essential for such a system. It can be improved in several ways, for example through Teleport- or CGROBOT-style collection rules, through CGROBOT's automatic extraction method, and through the approach discussed here of specifying web layout elements and their relationships. A competitive system should offer all of these.
For now, consider only web elements and their relationships. In this case the user needs to tell the acquisition system which process (or events) to go through to reach an element, and which part of the database to put that element into. This involves three steps: 1) the user defines the process the collection must go through; 2) the user defines which elements to collect; 3) the user maps those elements to the database.
The following is a simple example. In practice, this particular example could be collected more conveniently by other methods.
Suppose we need to collect all the documents listed in Figure 1 and extract the author, translator, and title of each, as shown in Figure 2. Suppose also that we can only enter the site from http://www.websamba.com/karymay. Then the acquisition process can be defined as:
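One possible way to record these three kinds of settings is sketched below in Python. The class names (`Step`, `FieldRule`, `AcquisitionRule`), the action names, and the column names are all hypothetical, chosen only for illustration; the entry URL and the "Translation Work" area are the ones used in this article's example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One step of the process the collection must go through (setting 1)."""
    action: str   # hypothetical action names: "navigate", "click", "wait_for"
    target: str   # a URL or the name of an area


@dataclass
class FieldRule:
    """Which element to collect (setting 2) and where it goes (setting 3)."""
    area: str     # area of the page in which to look
    match: str    # free-form description of how to pick the element
    column: str   # database column that receives the value


@dataclass
class AcquisitionRule:
    steps: List[Step]
    fields: List[FieldRule]

    def columns(self) -> List[str]:
        """Database columns that this rule fills."""
        return [f.column for f in self.fields]


# Illustrative rule; the column names are assumptions.
rule = AcquisitionRule(
    steps=[
        Step("navigate", "http://www.websamba.com/karymay"),
        Step("click", "Translation Work"),
        Step("wait_for", "A"),
    ],
    fields=[
        FieldRule(area="B", match="largest font", column="title"),
        FieldRule(area="B", match='text after "translator:"', column="translator"),
    ],
)
```

Keeping the process (`steps`) separate from the element-to-column mapping (`fields`) mirrors the three settings above, so the user could change one without touching the other.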
Navigate to (http://www.Websamba.com/karymay);
Click on the "Translation Work" area;
When area A has reloaded
{
    For each link in area A
    {
        Click on this link;
        When area B appears, apply the user-defined rules:
        "
        Text in area B with a font size greater than XX is the title.
        Find "translator:" in area B; the text after it is the translator.
        "
    }
}
Note that no sub-areas are defined within area B here. Of course, B could also be divided into three areas: title, body, and translator. The rule for the title area would be a font size above a given threshold and a position at the top, while the translator area could be defined as the line containing the string "translator:".
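The two user-defined rules for area B might be sketched in Python as follows. This is an illustration under assumptions, not the CGROBOT implementation: it assumes area B has already been reduced to a list of text runs with known font sizes, and the `TextRun` type, the function names, the sample pages, and the threshold value 18 (standing in for the unspecified "XX") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TextRun:
    """A piece of text in area B together with its font size (assumed available)."""
    text: str
    font_size: int


def extract_title(runs: List[TextRun], min_font_size: int) -> Optional[str]:
    """Rule 1: text whose font size is greater than the threshold is the title."""
    parts = [r.text.strip() for r in runs if r.font_size > min_font_size]
    return " ".join(parts) if parts else None


def extract_translator(runs: List[TextRun],
                       marker: str = "translator:") -> Optional[str]:
    """Rule 2: the text after the marker string is the translator."""
    for r in runs:
        pos = r.text.find(marker)
        if pos != -1:
            return r.text[pos + len(marker):].strip() or None
    return None


def collect(pages: Dict[str, List[TextRun]], min_font_size: int) -> List[dict]:
    """For each link in area A, apply both rules to the area B it opens."""
    return [
        {
            "link": link,
            "title": extract_title(runs, min_font_size),
            "translator": extract_translator(runs),
        }
        for link, runs in pages.items()
    ]


# Made-up sample data standing in for what clicking each link would load.
sample_pages = {
    "doc1.htm": [
        TextRun("A Sample Title", 24),
        TextRun("translator: Jane Doe", 12),
        TextRun("Body text of the document.", 12),
    ],
    "doc2.htm": [
        TextRun("Body text only, no heading.", 12),
    ],
}

records = collect(sample_pages, min_font_size=18)
```

A page that matches neither rule yields `None` for both fields, which gives the user something concrete to look for when a rule turns out not to be precise enough.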
Figure 1
Figure 2
Three.
Information reintegration
Once the acquired data has been placed in the database, the user's needs have basically been met. But some problems may remain. For example, if the rule definitions are not precise enough, unwanted content will be collected as well, and the user then has to clean up the data manually. A powerful system should also consider how the information can be flexibly reorganized, but this article does not discuss that further.
Four.
Some rules
1. Designing the system requires continually proposing requirements and then revising the system's definition. Iterating in this way makes the system both powerful and easy to use.
2. Only the user knows the mapping between a website's pages and the user's own needs; the program does not. The program only needs to provide a way for the user to tell it. Making effective use of the website creator's wisdom and the user's wisdom is far simpler than relying on the program itself.
3. Excellent design comes from imitating reality. Although this article does not discuss data storage and reorganization, they must be considered at implementation time. The complexity of user needs also leads to complexity in data storage and reorganization.
4. The acquisition system is a tool that maps the structure of Internet information onto user needs.
5. Always look forward. We must consider XML.