[Repost] A Complete Solution for Web Server Log Statistics and Analysis



From: http://kunming.cyberpolice.cn/aqjs/20004.html

Abstract: For every ICP, besides keeping the website running stably, an important task is producing statistics and analysis reports on website access. These reports are essential for understanding and monitoring how the site is operating and for improving its service capability and quality. By collecting statistics on and analyzing web server logs, you can get a clear picture of system operation and of how the site's content is being accessed, and so strengthen the maintenance and management of the whole site and its content. This article discusses the principles and techniques of web server log analysis. (2003-04-03 13:26:53)

By ideal, http://www.linuxaid.com.cn/. Note: this article was published in "Open System World", 2003, issue 2; the copyright belongs to the magazine. Please do not reprint, and keep this notice.

Keywords: web server, log, statistics, Apache

Related software:
    Webalizer  http://www.mrunix.net/webalizer/
    cronolog   http://www.cronolog.org/
    Apache     http://www.apache.org/

1. Introduction

With the growth of web services on the Internet, almost every government department, company, university, and research institute is building, or has already built, its own website.
At the same time, every organization building a website runs into a variety of problems. Analyzing the operation of the web server and how the site is accessed is clearly essential for understanding the site's strengths and weaknesses and for its further development. Managing a website is not just a matter of monitoring its speed and content delivery. Beyond watching the server's daily throughput, you also need to understand how the site is accessed from outside: which pages are hit most often, so that their content, quality, and readability can be improved; how users step through pages containing business transactions; and what is happening "behind the scenes" of the site. To provide better WWW service, monitoring the web server's operation and understanding in detail how the site's content is accessed is both important and urgent, and both can be achieved by collecting statistics on and analyzing the web server's log files.

2. Principles of Web Log Analysis

The web server log records raw information about every request the server receives and processes, as well as runtime errors. By collecting, analyzing, and summarizing the logs, you can grasp the server's service status, find and eliminate the causes of errors, understand the distribution of client visits, and generally strengthen system maintenance and management. The WWW service model is very simple (see Figure 1):

1) The client (browser) establishes a TCP connection with the web server and, once the connection is up, sends an access request (such as GET) to the server. Per the HTTP protocol, the request carries a series of information including the client's IP address, the browser type, and the requested URL.
Figure 1: The web access mechanism

2) After receiving the request, the server returns the page content the client asked for; if an error occurs, it returns an error code instead.

3) The server writes the access information and any error messages to its log files.

Below is the content of a request datagram that a client sends to a web server:

    GET /engineer/ideal/list.htm HTTP/1.1
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
    Referer: http://www.linuxaid.com.cn/engineer/ideal/
    Accept-Language: zh-cn
    Accept-Encoding: gzip, deflate
    User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
    Host: www.linuxaid.com.cn
    Connection: Keep-Alive

As you can see, the client's request contains a lot of useful information, such as the client type. The web server then sends the requested web page back to the client.

The most commonly used web servers are Apache, Netscape Enterprise Server, MS IIS, and so on. The most common web server on the Internet today is Apache, so the discussion here is based on Apache running on Linux; other application environments are similar. Apache supports multiple log file formats, the most common being the two modes common and combined. The combined format records two fields more than the common format: referer (where the request came from, for example from Yahoo) and user-agent (the client type, such as Mozilla or IE). We discuss the combined format here.
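The information carried in such a request head can be pulled apart programmatically. The following is a small sketch (not from the article) that parses the request line and headers into the fields a server would later record in its access log:

```python
# Minimal sketch: split an HTTP request head into its request line and headers.
def parse_request_head(raw: str):
    """Return (method, path, version, headers) for an HTTP request head."""
    lines = raw.strip().splitlines()
    method, path, version = lines[0].split()
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, path, version, headers

request = (
    "GET /engineer/ideal/list.htm HTTP/1.1\r\n"
    "Referer: http://www.linuxaid.com.cn/engineer/ideal/\r\n"
    "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)\r\n"
    "Host: www.linuxaid.com.cn\r\n"
    "Connection: Keep-Alive\r\n"
)

method, path, version, headers = parse_request_head(request)
print(method, path, headers["user-agent"])
```

The referer and user-agent values extracted here are exactly the two extra fields that distinguish the combined log format discussed below.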

Below are some log entries in the combined format:

    218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
    61.139.226.47 - - [06/Dec/2002:00:00:00 +0000] "GET /cgi-bin/guanggaotmp.cgi? HTTP/1.1" 200 178 "http://www3.beareyes.com.cn/1/index.php" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
    218.75.41.11 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
    61.187.207.104 - - [06/Dec/2002:00:00:00 +0000] "GET /images/logolun1.gif HTTP/1.1" 304 0 "http://www2.beareyes.com.cn/bbs/b.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
    211.150.229.228 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/pub/image_top_l.gif HTTP/1.1" 200 260 "http://www.beareyes.com/2/lib/200201/12/20020112004.htm" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

As you can see, each log entry records the client's IP address, the time of access, the requested page, the status code the web server returned for the request, the size of the content returned (in bytes), the referring address, the client's browser type, and so on.

3. Configuring and Managing Apache Logs

In this article we assume Apache serves two virtual hosts, www.secfocus.com and www.tomorrowtel.com, and that we need separate access log analysis and statistics for each.
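Entries in this format can be broken into their fields with a regular expression. Here is a minimal sketch (not part of the article or of any tool it mentions) that parses a combined-format line:

```python
import re

# Sketch: parse an Apache "combined" log entry into its named fields:
# host, identity, user, time, request, status, bytes, referer, user agent.
LOG_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_combined(line: str) -> dict:
    m = LOG_RE.match(line)
    if m is None:
        raise ValueError("not a combined-format log line: " + line)
    return m.groupdict()

line = ('218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] '
        '"GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 '
        '"http://www.mpsoft.net/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"')
entry = parse_combined(line)
print(entry["host"], entry["status"], entry["referer"])
```

A statistics program like Webalizer, discussed later, does essentially this field extraction over every line before aggregating.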
In the Apache configuration file there are two directives we care about:

    CustomLog /www/logs/access_log common
    ErrorLog  /www/logs/error_log

CustomLog specifies where Apache's access log is stored (here /www/logs/access_log) and in which format (here common); ErrorLog specifies where Apache's error log is stored. For a server without virtual hosts, you only need to find the CustomLog directive in httpd.conf and configure it directly. For a web server with multiple virtual hosts, each virtual host needs its own access log so that each one can be analyzed separately.

Each virtual host therefore needs its own log configuration in its virtual host section, for example:

    NameVirtualHost 75.8.18.19
    <VirtualHost 75.8.18.19>
    ServerName www.secfocus.com
    ServerAdmin secfocus@secfocus.com
    DocumentRoot /www/htdocs/secfocus/
    CustomLog "/www/log/secfocus" combined
    Alias /usage/ "/www/log/secfocus/usage/"
    </VirtualHost>

    <VirtualHost 75.8.18.19>
    ServerName www.tomorrowtel.com
    ServerAdmin tomorrowtel@tomorrowtel.com
    DocumentRoot /www/htdocs/tomorrowtel
    CustomLog "/www/log/tomorrowtel" combined
    Alias /usage/ "/www/log/tomorrowtel/usage/"
    </VirtualHost>

Note two things here: each virtual host has its own CustomLog directive naming the file its access log is written to, and the Alias directive lets the report generated by log analysis be reached at, for example, www.secfocus.com/usage/.

The configuration above takes care of saving the log files, but one problem remains: the log files grow without bound. Left alone, a growing log file will reduce the web server's efficiency; it may fill the server's disk and stop the server from running normally; and if a single log file exceeds the operating system's single-file size limit, it will further disrupt the web service. Moreover, if the logs are never rotated, the log analysis program has to process an ever-growing file and becomes extremely slow. So the web server's log files need to be rotated once a day.

4. Rotating Web Server Logs

There are three ways to rotate web server logs: the first is to use the Linux system's own log rotation mechanism, logrotate; the second is to use Apache's own rotation program, rotatelogs; the third is to use cronolog, a fairly mature rotation tool recommended in the Apache FAQ.
Large web sites often use load-balancing technology to increase service capacity, running several web servers behind the scenes, which greatly eases capacity planning and scalability; but the access logs of these servers are scattered across machines and must be merged before statistical analysis. To keep the statistics accurate, the log files must therefore be rotated strictly on daily boundaries.

4.1 Rotating logs with logrotate

First we discuss using the Linux system's own rotation mechanism, logrotate. Logrotate ships with the system and is the trusted rotator for the various system logs (syslogd, mail, and so on). The program is driven by the cron service: there is a logrotate file in the /etc/cron.daily directory, whose content is:

    #!/bin/sh
    /usr/sbin/logrotate /etc/logrotate.conf

So the logrotate script in /etc/cron.daily is launched by cron every morning.

In /etc/logrotate.conf you can see the following:

    # see "man logrotate" for details
    # rotate log files weekly
    weekly

    # keep 4 weeks worth of backlogs
    rotate 4

    # create new (empty) log files after rotating old ones
    create

    # uncomment this if you want your log files compressed
    #compress

    # RPM packages drop log rotation information into this directory
    include /etc/logrotate.d

    # no packages own wtmp -- we'll rotate them here
    /var/log/wtmp {
        monthly
        create 0664 root utmp
        rotate 1
    }

    # system-specific logs may be also be configured here.

From logrotate's configuration file you can see that, apart from wtmp, every service whose logs need rotating should place a rotation configuration file in the /etc/logrotate.d directory. So we only need to create a configuration file named apache in that directory to tell logrotate how to rotate the web server's log files. Here is an example:

    /www/log/secfocus {
        rotate 2
        daily
        missingok
        sharedscripts
        postrotate
            /usr/bin/killall -HUP httpd 2> /dev/null || true
        endscript
    }

    /www/log/tomorrowtel {
        rotate 2
        daily
        missingok
        sharedscripts
        postrotate
            /usr/bin/killall -HUP httpd 2> /dev/null || true
        endscript
    }

Here "rotate 2" means keeping only two backup files, that is, three log files in all: access_log, access_log.1, and access_log.2. This implements rotation for the logs of both virtual hosts; how to feed the files to the log statistics software is discussed later. The advantage of this method is that no third-party tool is needed, but it is not well suited to heavily loaded servers or to web servers behind a load balancer, because it sends a -HUP restart signal to the service process in order to truncate and archive the log, which interrupts the continuity of the service.
4.2 Rotating logs with Apache's rotatelogs

Apache can write its log not directly to a file but through a pipe to another program, which greatly extends what can be done with the log; the program on the other end of the pipe can be anything, such as a log analyzer or a log compressor. To write the log to a pipe, simply replace the file name in the log configuration with "|program name", for example:

    # compressed logs
    CustomLog "|/usr/bin/gzip -c >> /var/log/access_log.gz" common

In the same way you can use Apache's own rotation tool, rotatelogs, to rotate the log file. rotatelogs is basically used to roll the log over by time or by size.

    CustomLog "|/www/bin/rotatelogs /www/logs/secfocus/access_log 86400" combined

With this directive, Apache's access log is piped to the program rotatelogs, which writes it to /www/logs/secfocus/access_log and rotates it every 86400 seconds (one day). The rotated files are named /www/logs/secfocus/access_log.nnnn, where nnnn is the time at which that log was started. For each rotated file to contain exactly one full day of logs for the access statistics program, the service has to be started at exactly 00:00, so that a new log begins at 00:00 each day; the log rotated at 00:00 is then access_log.0000.

4.3 Rotating logs with cronolog

First download and install cronolog; the latest version can be downloaded from http://www.cronolog.org. After downloading, install it as follows:

    [root@mail root]# tar xvfz cronolog-1.6.2.tar.gz
    [root@mail root]# cd cronolog-1.6.2
    [root@mail cronolog-1.6.2]# ./configure
    [root@mail cronolog-1.6.2]# make
    [root@mail cronolog-1.6.2]# make check
    [root@mail cronolog-1.6.2]# make install

This completes the configuration and installation of cronolog; by default cronolog is installed under /usr/local/sbin. Then modify the Apache log configuration directive as follows:

    CustomLog "|/usr/local/sbin/cronolog /www/logs/secfocus/%w/access_log" combined

Here %w means the log is saved under a different directory according to the day of the week, which keeps a week's worth of logs. For log analysis, the previous day's file needs to be copied (or moved, if you do not want to keep a week of logs) to a fixed location where the log analysis program can process it; use crontab -e to add a scheduled task:

    5 0 * * * /bin/mv /www/logs/secfocus/`date -d yesterday +\%w`/access_log /www/logs/secfocus/access_log_yesterday

The log statistics program then processes the file access_log_yesterday.
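To make the %w directory scheme concrete, here is a small illustration (not part of cronolog) of how a template like the one above maps each date to a day-of-week directory, 0 for Sunday through 6 for Saturday, so that a week of logs is kept before directories are reused:

```python
from datetime import date, timedelta

# Sketch: expand a cronolog-style %w template for a given date.
def cronolog_path(template: str, day: date) -> str:
    # strftime("%w") gives the weekday as "0" (Sunday) .. "6" (Saturday),
    # matching cronolog's %w specifier.
    return template.replace("%w", day.strftime("%w"))

template = "/www/logs/secfocus/%w/access_log"
start = date(2002, 12, 1)          # a Sunday
for offset in range(8):
    d = start + timedelta(days=offset)
    print(d, cronolog_path(template, d))
```

After eight days the path for 2002-12-08 lands back in the same directory as 2002-12-01, which is why yesterday's file must be moved out before it is overwritten a week later.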
For a large site using load-balancing techniques, there is the additional problem of merging the access logs of several servers. In this case the daily log file cannot simply be named access_log_yesterday on every machine; it should carry a server identifier, such as the server's IP address, to tell the files apart. Then run the site mirroring and backup service rsyncd on each server (see the article "Using rsync to implement website mirroring and backup", http://www.linuxaid.com.cn/engineer/ideal/article/rsync.htm), and have a server dedicated to access statistics download each server's daily log via rsync for merging.

To merge the log files of several servers, such as log1, log2, and log3, and write the output to log_all:

    sort -m -t " " -k 4 -o log_all log1 log2 log3

Here -m uses the merge optimization algorithm (the inputs are already sorted), -k 4 sorts on the fourth field, the timestamp, and -o stores the sorted result in the named file.

5. Installing and Configuring the Log Analysis Program Webalizer

Webalizer is an efficient, free web server log analyzer. It produces its results as HTML files, so they can be browsed conveniently through the web server. Many sites on the Internet use Webalizer for their web server log analysis. Webalizer has the following features:

- It is written in C, so it runs fast: on a machine with a 200 MHz CPU, Webalizer can process 10,000 records per second, so analyzing a 40 MB log file takes only about 15 seconds.
- It supports the standard common log file format; in addition, several variants of the combined log format are supported, allowing it to report on client browsers and operating systems. Webalizer also now supports the wu-ftpd xferlog format and the Squid log file format.
- It supports both command-line configuration and configuration files.
- It supports multiple languages, and you can localize it yourself.
- It runs on many platforms, such as UNIX, Linux, NT, OS/2, and MacOS.

The figure above shows the first page of an access report generated by Webalizer, which contains a table and bar chart of average monthly traffic; clicking on a month gives detailed statistics for each day of that month.

5.1 Installation

Before installing, first make sure the system has the GD library installed:

    [root@mail root]# rpm -qa | grep gd
    gd-devel-1.8.4-4
    gdbm-devel-1.8.0-14
    gdbm-1.8.0-14
    sysklogd-1.4.1-8
    gd-1.8.4-4

This confirms that the system has the two required RPM packages, gd-devel and gd, installed.
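The same merge can be sketched in Python (this is an illustration, not from the article): heapq.merge lazily combines already-sorted per-server logs by parsed timestamp, which is what "sort -m -k 4" approximates by sorting on the raw bracketed time field.

```python
import heapq
import re
from datetime import datetime

TIME_RE = re.compile(r"\[([^\]]+)\]")

def entry_time(line: str) -> datetime:
    # Extract and parse the bracketed timestamp, e.g. 06/Dec/2002:00:00:01 +0000.
    stamp = TIME_RE.search(line).group(1)
    return datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")

def merge_logs(*logs):
    """Merge several chronologically sorted lists of log lines into one."""
    return list(heapq.merge(*logs, key=entry_time))

log1 = ['1.1.1.1 - - [06/Dec/2002:00:00:01 +0000] "GET /a HTTP/1.1" 200 10',
        '1.1.1.1 - - [06/Dec/2002:00:00:05 +0000] "GET /c HTTP/1.1" 200 10']
log2 = ['2.2.2.2 - - [06/Dec/2002:00:00:03 +0000] "GET /b HTTP/1.1" 200 10']
merged = merge_logs(log1, log2)
print([line.split()[6] for line in merged])   # request paths in time order
```

Because each input is already in time order, the merge is linear in the total number of lines, which matters when combining a day's logs from many load-balanced servers.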
There are two ways to install Webalizer: build it from source, or install the RPM package directly. Installing the RPM package is very simple: find the Webalizer package on rpmfind.net, download it, and run:

    rpm -ivh webalizer-2.01_10-1.i386.rpm

For the source code, download it from http://www.mrunix.net/webalizer/ and then install. First unpack the source package:

    tar xvzf webalizer-2.01-10-src.tgz

In the resulting directory there is a lang subdirectory containing all the language files, though only a Traditional Chinese version is available; you can convert it to Simplified Chinese or retranslate it yourself. Then enter the generated directory and build:

    cd webalizer-2.01-10
    ./configure --with-language=chinese
    make
    make install

After a successful build and install, a webalizer executable is placed in the /usr/local/bin/ directory.

5.2 Configuring and Running Webalizer

Webalizer can be controlled either through a configuration file or by parameters given on the command line. Using a configuration file is simple and flexible, and suits an application environment with automated web server log statistics.

Webalizer's default configuration file is /etc/webalizer.conf; when started without the "-f" option, Webalizer looks for /etc/webalizer.conf, and "-f" can be used to name a different configuration file (when the server has virtual hosts you need several different Webalizer configuration files, one per virtual host). The options that need modifying in webalizer.conf are as follows:

    LogFile /www/logs/secfocus/access_log

This gives the path of the log file that Webalizer takes as input.

    OutputDir /www/htdocs/secfocus/usage

This gives the directory in which the generated statistics reports are saved; together with the Alias directive set earlier, it lets users view the report at http://www.secfocus.com/usage/.

    HostName www.secfocus.com

This gives the host name, which is quoted in the statistics report.

The other options need no modification. Once the configuration file is ready, Webalizer has to be run every day to generate that day's statistics. As root, run crontab -e to enter the scheduled-task editor and add the following tasks:

    5 0 * * * /usr/local/bin/webalizer -f /etc/secfocus.webalizer.conf
    15 0 * * * /usr/local/bin/webalizer -f /etc/tomorrowtel.webalizer.conf

Here we assume the system serves the two virtual hosts and that separate log analysis configuration files, secfocus.webalizer.conf and tomorrowtel.webalizer.conf, have been defined for them. So the logs of secfocus are analyzed at 00:05 and those of tomorrowtel at 00:15; the next day, the analysis reports can be viewed at http://www.secfocus.com/usage/ and http://www.tomorrowtel.com/usage/.
6. Protecting the Statistics Report from Unauthorized Access

We do not want our access statistics to be browsed by outsiders, so the usage directory needs protecting, allowing only authorized users to access it. Apache's basic authentication mechanism can be used here; once it is configured, a connection to this address must supply a password before the page can be viewed.

1. In the configuration file, the section for the directory should be set up as:

    DocumentRoot /www/htdocs/secfocus/
    AccessFileName .htaccess
    AllowOverride All

2. The requirement: restrict access to http://www.secfocus.com/usage/ so that user authentication is needed. Here the user is "admin" and the password is "12345678".

3. Create the user file with htpasswd:

    htpasswd -c /www/.htpasswd admin

This program asks for the password of user "admin"; enter "12345678" twice and it takes effect.
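This copy of the article does not show the per-directory access file itself. Assuming Apache's standard basic-authentication directives were intended, the .htaccess file placed in the usage directory for the requirement in step 2 might look like this (the AuthName label is an arbitrary string of our choosing; the AuthUserFile path matches the htpasswd command above):

```apacheconf
AuthType Basic
AuthName "Access statistics"
AuthUserFile /www/.htpasswd
require user admin
```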

