Generating an HTML page involves only two main steps:
First, obtain the content of the HTML file to be generated; second, save that content as an HTML file.
Here I will mainly explain the first step: how to get the content of the HTML file to be generated.
There are several ways to obtain that content:
1. Write the HTML content to be generated directly in the script. This makes it inconvenient to preview the generated page, impossible to lay the page out visually, and cumbersome whenever the HTML template changes. Many people work this way, but I find it the most inconvenient method.
Str = " Content HTML Tag>" Str = Str & " Content HTML Tag> Database Read Content ..... html tag> ..... "
2. Make a separate HTML template page and mark each piece of dynamic content with a specific placeholder (for example, some people use $TITLE$ as the placeholder for the page title). Load the template with ADODB.Stream or Scripting.FileSystemObject, then use Replace to substitute the dynamic content for each placeholder, e.g. Replace(loaded template content, "$TITLE$", rs("Title")).
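For instance, a minimal sketch of this template approach (the file name template.htm, the $TITLE$/$CONTENT$ placeholders, and the open recordset rs are illustrative assumptions):

    'load the template file with FileSystemObject
    Set fso = Server.CreateObject("Scripting.FileSystemObject")
    Set tpl = fso.OpenTextFile(Server.MapPath("template.htm"), 1)   '1 = ForReading
    html = tpl.ReadAll
    tpl.Close
    Set tpl = Nothing
    Set fso = Nothing
    'swap each placeholder for the dynamic content
    html = Replace(html, "$TITLE$", rs("Title"))
    html = Replace(html, "$CONTENT$", rs("Content"))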
3. Use XMLHTTP or ServerXMLHTTP to fetch the HTML that the dynamic page renders.
An example I often use when generating an HTML file (the address below is a placeholder):

    'weburl is the address of the dynamic page to fetch
    'GetHttpPage(weburl) is a function that fetches that address
    weburl = "http://.../xxx.asp?id=" & rs("ID")   'build the dynamic page address
    body = GetHttpPage(weburl)                     'fetch the dynamic page's content with the function
The biggest advantage of this method is that you don't have to write a static template page; you just convert the dynamic page into a static HTML page. Its drawback is that generation is not very fast. The method I use most often is this third one: fetch the HTML produced by the dynamic page with XMLHTTP, then save it as an HTML file with ADODB.Stream or Scripting.FileSystemObject. The second step is generating the file:
In ASP, files are commonly generated with either ADODB.Stream or Scripting.FileSystemObject:
1. Generating a file with Scripting.FileSystemObject:

    Set fso = CreateObject("Scripting.FileSystemObject")
    file = Server.MapPath("path and file name to generate.htm")
    Set txt = fso.OpenTextFile(file, 8, True)   '8 = ForAppending; True = create if missing
    data1 = "file content"
    txt.WriteLine data1    'the WriteLine method writes the content plus a line break
    data2 = "file content"
    txt.Write data2        'the Write method writes the content without a line break
    txt.Close
    Set txt = Nothing
    Set fso = Nothing
2. Generating a file with ADODB.Stream:

    Dim objAdoStream
    Set objAdoStream = Server.CreateObject("ADODB.Stream")
    objAdoStream.Type = 2       '2 = adTypeText (use 1, adTypeBinary, when saving raw bytes)
    objAdoStream.Open
    objAdoStream.WriteText "file content"
    objAdoStream.SaveToFile Server.MapPath("path and file name to generate.htm"), 2   '2 = overwrite existing
    objAdoStream.Close
    Set objAdoStream = Nothing

The principle of collection:
The main steps of the collection (scraping) process are as follows:
First, obtain the content of the page being collected; second, extract the required data from the fetched code.
First, obtain the content of the page being collected
The methods I have mastered so far for fetching a target page's content are:
1. Fetch data with the ServerXMLHTTP component:
Function GetBody(weburl)
    'create the object
    Dim objXMLHTTP
    Set objXMLHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
    'request the file asynchronously
    objXMLHTTP.Open "GET", weburl, True
    objXMLHTTP.Send
    While objXMLHTTP.readyState <> 4
        objXMLHTTP.waitForResponse 1000
    Wend
    'get the result
    GetBody = objXMLHTTP.responseBody
    'release the object
    Set objXMLHTTP = Nothing
End Function
Call it as: GetBody(URL of the file)
2. Fetch data with the XMLHTTP component:
Function GetBody(weburl)
    'create the object
    Dim Retrieval
    Set Retrieval = CreateObject("Microsoft.XMLHTTP")
    With Retrieval
        .Open "GET", weburl, False, "", ""
        .Send
        GetBody = .responseBody
    End With
    'release the object
    Set Retrieval = Nothing
End Function
Call it as: GetBody(URL of the file)
The data fetched this way is a raw byte stream; it must be converted to the right character encoding before it can be used:
Function BytesToBstr(body, Cset)
    Dim objStream
    Set objStream = Server.CreateObject("ADODB.Stream")
    objStream.Type = 1       '1 = adTypeBinary
    objStream.Mode = 3       '3 = adModeReadWrite
    objStream.Open
    objStream.Write body
    objStream.Position = 0
    objStream.Type = 2       'switch to text to read it back
    objStream.Charset = Cset
    BytesToBstr = objStream.ReadText
    objStream.Close
    Set objStream = Nothing
End Function
Call it as: BytesToBstr(data to convert, charset). The most common charsets are GB2312 and UTF-8.
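Putting the two functions together, a typical call looks like this (the address is a placeholder):

    weburl = "http://.../xxx.asp?id=1"                 'placeholder page address
    html = BytesToBstr(GetBody(weburl), "GB2312")      'fetch the bytes, then decode as GB2312
    Response.Write html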
Second, extract the required data from the fetched code
1. Use ASP's built-in Mid function to cut out the required data:
Function Body(wstr, start, over)
    start = InStr(wstr, start)              'position of the unique start mark of the data to process
    over = InStr(wstr, over)                'position of the matching unique end mark
    Body = Mid(wstr, start, over - start)   'cut out the range between the two marks
End Function
Call it as: Body(content, start mark, end mark)
2. Extract the required data with a regular expression:
Function Body(wstr, start, over)
    Dim xiaoqi, Matches, Match
    Set xiaoqi = New RegExp                  'create the regular expression object
    xiaoqi.IgnoreCase = True                 'ignore case
    xiaoqi.Global = True                     'search the whole text
    xiaoqi.Pattern = start & ".+?" & over    'the expression: everything between the two marks
    Set Matches = xiaoqi.Execute(wstr)       'run the match
    Set xiaoqi = Nothing
    Body = ""
    For Each Match In Matches                'loop over the matches
        Body = Body & Match.Value
    Next
End Function
Call it as: Body(content, start mark, end mark)
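For example, to pull the page title out of fetched HTML with either version of Body (html is assumed to hold the decoded page):

    title = Body(html, "<title>", "</title>")
    'strip the marks (the Mid version returns the start mark; the regex version returns both)
    title = Replace(Replace(title, "<title>", ""), "</title>", "")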
A detailed outline of a collection program:
1. Get the addresses of the target site's paginated list pages. Most dynamic sites' paging addresses follow a fixed rule, for example:
Dynamic pages: index.asp?page=1 (page 1), index.asp?page=2 (page 2), index.asp?page=3 (page 3) ...
Static pages: page_1.htm (page 1), page_2.htm (page 2), page_3.htm (page 3) ...
To obtain the address of every page in the list, you only need to substitute the part of the address that changes, e.g. "page_" & i & ".htm", where i is the page number; see the sketch below.
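A sketch of enumerating the list pages this way (the page count and address pattern are assumptions):

    For i = 1 To 10                               'assume 10 list pages
        weburl = "http://.../page_" & i & ".htm"  'substitute the changing page number
        'fetch and process this list page here (steps 2 and 3 below)
    Next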
2. Fetch the content of each list page of the target site.
3. Extract the URL links of the content pages from the list-page code. These also follow a fixed rule, for example (illustrative links):
<a href="url1.htm">Link 1</a>
<a href="url2.htm">Link 2</a>
<a href="url3.htm">Link 3</a>
Use code like the following to collect the set of URL links (the pattern shown is a sketch; adjust it to the actual link format):

    Set xiaoqi = New RegExp
    xiaoqi.IgnoreCase = True
    xiaoqi.Global = True
    xiaoqi.Pattern = "<a href="".+?"">.+?</a>"   'match the link tags; adjust to the real format
    Set Matches = xiaoqi.Execute(wstr)           'wstr holds the list-page code
    url = ""
    For Each Match In Matches
        url = url & Match.Value
    Next
4. Fetch each collected content page and, using the "extraction marks", cut the required data out of it.
Because the content pages are generated dynamically, most of them share the same HTML tags, and we can extract the required parts based on these regular tags. For example:
Every page has a page title wrapped between <title> and </title>; those tags can serve as the start and end marks, as in the sketch below.
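A minimal end-to-end sketch of step 4 (the address and the content marker are assumptions; GetBody, BytesToBstr and Body are the functions above):

    weburl = "http://.../view.asp?id=1"                       'placeholder content-page address
    html = BytesToBstr(GetBody(weburl), "GB2312")
    title = Body(html, "<title>", "</title>")                 'cut out the page title
    content = Body(html, "<div id=""content"">", "</div>")    'an assumed content marker
    '... save title and content to your own database here ...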
There are many ways to prevent collection. First, the common anti-collection strategies, each with its drawbacks and the collectors' countermeasures:
1. Count how many times an IP accesses the site within a given period; if it is clearly faster than a person could browse, reject that IP.
Disadvantages: (1) this only works for dynamic pages (ASP/JSP/PHP, etc.); a static page cannot count how many times an IP accesses it within a period. (2) It seriously interferes with search engine indexing, because spiders crawl quickly and with multiple threads, so this method will reject the spiders as well.
Collection countermeasure: collect more slowly, or not at all.
Suggestion: build an IP library of search engine spiders and allow only those IPs to browse the site quickly. Collecting such a library is not easy, though; a search engine spider does not use a single fixed IP address.
Comment: this method is effective against collection, but it hurts the search engines.
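A minimal sketch of this per-IP rate check, using the Application object as a crude counter (the 60-second window and 30-request threshold are arbitrary assumptions; a real site would want a database or component, and would whitelist spider IPs as suggested above):

    ip = Request.ServerVariables("REMOTE_ADDR")
    Application.Lock
    If IsEmpty(Application("time_" & ip)) Or DateDiff("s", Application("time_" & ip), Now()) > 60 Then
        Application("hits_" & ip) = 1              'start a new 60-second window
        Application("time_" & ip) = Now()
    Else
        Application("hits_" & ip) = Application("hits_" & ip) + 1
    End If
    hits = Application("hits_" & ip)
    Application.Unlock
    If hits > 30 Then                              'faster than a person could browse
        Response.Status = "403 Forbidden"
        Response.End
    End If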
2. Encrypt the content page with JavaScript
Disadvantages: this method works for static pages, but it seriously hurts search engine indexing; what the search engine receives is the encrypted content too.
Collection countermeasure: not recommended; if you insist, collect the JS script containing the decryption key as well.
Suggestion: no good improvement to offer.
Comment: webmasters who want traffic from search engines should not use this method.
3. Replace specific tags in the content page with the same tags plus hidden copyright text
Disadvantages: this method is not bad; it only adds a little to the page file size, but it is easy to defeat.
Collection countermeasure: replace the collected copyright text with your own, or delete it.
Suggestion: no good improvement to offer.
Comment: I don't think it has much practical value; even randomizing the hidden text is still superfluous.
4. Only allow logged-in users to browse
Disadvantages: this method seriously interferes with search engine spiders indexing the site.
Suggestion: no good improvement to offer.
Comment: webmasters who want traffic from search engines should not use this method; still, it does have some effect against ordinary collection programs.
5. Use JavaScript or VBScript to do the paging
Disadvantages: interferes with search engine indexing.
Collection countermeasure: analyze the JavaScript/VBScript, work out its paging rules, and write a paging-collection routine tailored to that site.
Suggestion: no good improvement to offer.
Comment: anyone who understands the scripting language can find the paging rules.
6. Only allow requests that come from within the site, e.g. by checking Request.ServerVariables("HTTP_REFERER")
Disadvantages: interferes with search engine indexing.
Collection countermeasure: I am not sure whether the referer can be simulated; at present I have no countermeasure for this method.
Suggestion: no good improvement to offer.
Comment: webmasters who want traffic from search engines should not use this method; still, it does have some effect against ordinary collection programs. A sketch of the check follows below.
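For reference, a sketch of method 6's check (the domain is a placeholder; note that a determined collector can in fact forge the referer header):

    referer = Request.ServerVariables("HTTP_REFERER")
    If InStr(referer, "http://www.yoursite.com/") <> 1 Then   'placeholder domain
        Response.Write "Please browse this page from a link within the site."
        Response.End
    End If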
As you can see from the above, today's common anti-collection methods either seriously hurt search engine indexing or barely stop collection at all. So is there an effective anti-collection method that does not affect the search engines? Read on!
The collection principles described earlier show that most collection programs work by analyzing rules: paging file-name rules and page-code rules.
First, countering analysis of paging file-name rules
Most collectors do bulk, multi-page collection by analyzing the file-name rule of your paging files. If others cannot find that rule, they cannot batch-collect multiple pages from your site.
Implementation:
I think MD5-encrypting the paging file name is a good approach. Some will say: even if you MD5-encrypt the file name, others can imitate your encryption rule and derive your paging file names from it.
I want to point out that when we encrypt the paging file name, we should not encrypt only the part that changes.
If i is the page number, then don't encrypt it like this: page_name = MD5(i, 16) & ".htm"
It is better to append one or more characters to the value being encrypted, e.g.: page_name = MD5(i & "any one or more letters", 16) & ".htm"
Because MD5 cannot be reversed, all others can see is that the page name is an MD5 result; they cannot work out what string you appended, unless they brute-force the MD5, which is not very realistic.
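A sketch of this salted file-name scheme. Classic ASP/VBScript has no built-in MD5, so this assumes an MD5(string, bits) function from an include file such as the widely circulated md5.asp:

    'i is the page number; "xq" is the secret appended string (pick your own)
    i = 5
    page_name = MD5(i & "xq", 16) & ".htm"   'a hash name that cannot be predicted without the salt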
Second, countering analysis of page-code rules
If our content pages have no consistent code structure, others cannot extract the individual items they need from our code. So what we must do in this step is make the code irregular.
Implementation:
Randomize the parts the other side needs to extract:
1. Make several page templates, each with different important HTML tags. When rendering page content, pick a template at random, so some pages use a CSS div layout and others use a table layout (see the sketch after this list). This approach is a bit troublesome, since one kind of content page needs several template pages, but anti-collection is a cumbersome business by nature; making a few extra templates is worth it if it stops collection.
2. If the above is too much trouble, randomizing the important HTML tags within a single page also works. The more templates you make and the more random the HTML, the more troublesome it is for the other side to analyze your content code and the harder it is for them to write a collection strategy. At that point most people will give up, because people collect other sites' data out of laziness in the first place. Besides, most of them collect with collection programs developed by someone else; those who develop their own collection program are a small minority after all.
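A sketch of point 1's random template selection (the template file names are assumptions):

    'pick one of several templates whose important HTML tags differ
    Dim templates(2)
    templates(0) = "template_div.htm"      'CSS div layout
    templates(1) = "template_table.htm"    'table layout
    templates(2) = "template_mixed.htm"
    Randomize
    Set fso = Server.CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(Server.MapPath(templates(Int(Rnd() * 3))), 1)
    html = ts.ReadAll                      'render the page content into this template
    ts.Close
    Set fso = Nothing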
Here are a few more simple ideas for you:
1. Render the content that matters to data collectors, but not to search engines, with client-side scripts. 2. Split one page of data across n pages to increase the difficulty of collection. 3. Put content pages at a deeper link level: most collection programs only collect within the top three link layers of a site, so content that sits deeper can escape them too. However, this may inconvenience your visitors. For example:
Most websites go: home page ---- content index page ---- content page. Change it to: home page ---- content index page ---- content-page entry ---- content page. Note: it is best if the content-page entry carries code that automatically forwards the visitor to the content page, as in the sketch below.
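A sketch of that entry page (entry.asp and view.asp are hypothetical names). It forwards the visitor with a client-side script, which a browser follows automatically but a simple collection program will not:

    'entry.asp: the "content-page entry" layer
    id = Replace(Request.QueryString("id"), "'", "")   'crude cleanup of the passed id
    Response.Write "<script>location.href='view.asp?id=" & id & "';</script>"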