Strengthening the handling of HEAD requests
Recently we noticed that some search engine crawlers, when fetching data, first send a HEAD request to obtain the response's header information and only then request the response body (that is, the page content). The HEAD request is used to get the page's update time (the Last-Modified field in the response header), which determines whether the page has changed since it was last added to the index. If the page has not changed, the crawler skips it; otherwise it fetches the latest content with a GET request and updates the index.
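The crawler-side decision described above can be sketched roughly as follows. This is only an illustration, not the code of any real crawler: the class name HeadFirstCrawlPolicy, the method shouldFetchBody, and the choice to fall back to a GET when the header is missing or unparsable are all assumptions made for this example.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

/** Hypothetical helper: decides whether a HEAD should be followed by a GET. */
class HeadFirstCrawlPolicy {

    /**
     * @param lastModifiedHeader value of Last-Modified from the HEAD response, may be null
     * @param lastIndexedAt      time the page was last added to the index, may be null
     * @return true if the crawler should issue a GET for the body
     */
    static boolean shouldFetchBody(String lastModifiedHeader, Instant lastIndexedAt) {
        // Assumption: with no usable information, fetch the page to be safe.
        if (lastModifiedHeader == null || lastIndexedAt == null) {
            return true;
        }
        try {
            // HTTP dates such as "Sun, 06 Mar 2005 03:21:03 GMT" follow RFC 1123.
            Instant lastModified = ZonedDateTime
                    .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // Only re-fetch when the page changed after it was indexed.
            return lastModified.isAfter(lastIndexedAt);
        } catch (DateTimeParseException e) {
            // Assumption: an unparsable date is treated as "possibly updated".
            return true;
        }
    }
}
```

For example, a page whose Last-Modified is 6 March 2005 and which was indexed on 1 March 2005 would be fetched again, while the same page indexed on 7 March 2005 would be skipped.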
When pages are updated infrequently or cache lifetimes are long, sending a HEAD request first avoids transferring the relatively large body over the network, which reduces network consumption and can also shorten index update time. However, the effect is the opposite when pages are updated frequently or cache lifetimes are short:
If the requested page is in the cache, the situation is slightly better: on receiving the HEAD request, the cache server (for example, one with expires_module installed) returns the header fields of the cached response, and on the subsequent GET request it returns the entire cached response (both header and body) to the crawler;
If the requested page is not in the cache and the program has no dedicated handling for HEAD requests, the page is generated twice. When the HEAD request is processed, there is no special method for it, so the general method for handling GET requests is executed and produces the complete response. The cache server receives that response but returns only its header information to the crawler, and the response is not cached. When the subsequent GET request is processed, there is still nothing in the cache, so the program generates the complete response again; the cache server forwards it to the crawler and caches it. The program therefore runs twice, and the first run is wasted.
One way to solve the problem is to add handling for HEAD requests in the program. When handling a HEAD request, it is usually enough to set Content-Type and Content-Length in the response header. For example, in a servlet this can be done by overriding doHead(HttpServletRequest request, HttpServletResponse response):
public void doHead(HttpServletRequest req, HttpServletResponse resp) throws ServletException, IOException {
    // Set the content type and length only; no body is generated.
    resp.setContentType("text/html; charset=GB2312");
    resp.setContentLength(30000);
}
In JSP, you can do the following:
<%
/* Handle the HEAD request */
if ("HEAD".equals(request.getMethod())) {
    response.setDateHeader("Last-Modified", System.currentTimeMillis()); /* set Last-Modified */
    response.setContentType("text/html; charset=GB2312"); /* set Content-Type */
    response.setContentLength(30000); /* set Content-Length */
    return;
}
%>
Below is an excerpt from an access log showing requests from IP 202.108.1.4, which may be a user, a crawler, or a proxy server (note the strange User-Agent value). Parts of the excerpt that were lost are marked [...]:
202.108.1.4 - - [06/Mar/2005:11:21:03 +0800] "HEAD /2001-03-07/28456.htm HTTP/1.1" 200 0 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
202.108.1.4 - - [06/Mar/2005:11:21:03 +0800] "GET /2001-03-07/28456.htm HTTP/1.1" 200 32182 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
202.108.1.4 - - [06/Mar/2005:11:21:09 +0800] "HEAD /2003-06-26/169417.htm HTTP/1.1" 200 0 "-" "[...] Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
202.108.1.4 - - [06/Mar/2005:11:21:09 +0800] "GET /2003-06-26/169417.htm HTTP/1.1" [...]
[...]-5/361944.htm HTTP/1.1" 200 0 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
202.108.1.4 - - [06/Mar/2005:11:21:11 +0800] "GET /2005-1-5/361944.htm HTTP/1.1" 200 36761 "-" "User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
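To check your own logs for this pattern, something like the following rough sketch could be used. It assumes the log is in the combined format shown above and that the HEAD and GET lines for a page are adjacent; the class name HeadGetPairFinder and the method findPairs are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds paths requested with HEAD and then immediately with GET. */
class HeadGetPairFinder {

    // Captures client IP, request method and request path from a combined-format line.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[[^\\]]+\\] \"(\\S+) (\\S+) [^\"]*\"");

    /** Returns the paths for which a HEAD line is directly followed by a GET from the same IP. */
    static List<String> findPairs(List<String> logLines) {
        List<String> paths = new ArrayList<>();
        String prevIp = null;
        String prevMethod = null;
        String prevPath = null;
        for (String line : logLines) {
            Matcher m = LINE.matcher(line);
            if (!m.find()) {
                // Unparsable line: reset so it cannot link two unrelated requests.
                prevIp = null;
                prevMethod = null;
                prevPath = null;
                continue;
            }
            String ip = m.group(1);
            String method = m.group(2);
            String path = m.group(3);
            if ("GET".equals(method) && "HEAD".equals(prevMethod)
                    && ip.equals(prevIp) && path.equals(prevPath)) {
                paths.add(path);
            }
            prevIp = ip;
            prevMethod = method;
            prevPath = path;
        }
        return paths;
    }
}
```

Each path returned is a page that was probably generated twice if the program has no special handling for HEAD and the page was not cached.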
In addition, only a few older search engine crawlers, such as AOL's, currently work this way. Most search engine crawlers use another approach: they include an If-Modified-Since header in the GET request and let the server determine whether the page has been updated.
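The server-side check for If-Modified-Since can be sketched roughly like this. It is a simplified illustration under assumptions, not what any particular server does: the class name ConditionalGet and the method statusFor are hypothetical, and a missing or unparsable header is simply treated as a normal request.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: picks the status code for a GET carrying If-Modified-Since. */
class ConditionalGet {
    static final int OK = 200;
    static final int NOT_MODIFIED = 304;

    /**
     * @param ifModifiedSince  value of the If-Modified-Since request header, may be null
     * @param pageLastModified time the page was last changed
     * @return 304 if the client's copy is still current, otherwise 200
     */
    static int statusFor(String ifModifiedSince, Instant pageLastModified) {
        if (ifModifiedSince == null) {
            return OK; // plain GET: send the full page
        }
        try {
            Instant since = ZonedDateTime
                    .parse(ifModifiedSince, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            // HTTP dates have one-second resolution, so drop sub-second precision.
            Instant modified = pageLastModified.truncatedTo(ChronoUnit.SECONDS);
            return modified.isAfter(since) ? OK : NOT_MODIFIED;
        } catch (DateTimeParseException e) {
            return OK; // assumption: ignore a header that cannot be parsed
        }
    }
}
```

With this approach an unchanged page costs a single request with no body, so the page never has to be generated twice for the same crawler visit.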
See: