I. Introduction
With the growth of Web services on the Internet, government departments, companies, colleges, research institutes, and other organizations are all building, or have already built, their own websites. In the course of building a website, every organization runs into a variety of problems. Analyzing the operation of the web server and how the site is accessed is essential for understanding how well the site works, discovering its shortcomings, and improving it; the value of such analysis for the healthy development of a website is self-evident.
Managing a website is not just a matter of monitoring its speed and delivering its content. Beyond watching the server's daily throughput, an administrator needs to understand how the site is accessed from outside: which pages are visited and how often, so that their content, quality, and readability can be improved; which visits involve business transactions; and what is happening "behind the scenes" of the site.
To provide better WWW service, monitoring the operation of the web server and understanding in detail how the site's content is accessed is both important and urgent. These requirements can be met by collecting statistics on, and analyzing, the web server's log files.
II. The Principle of Web Log Analysis
The web server log records raw information of all kinds: every request the server receives and processes, and every error that occurs at runtime. By collecting statistics from these logs and analyzing and synthesizing them, you can get an effective picture of the server's health, find and eliminate the causes of errors, understand the distribution of client visits, and so better maintain and manage the system.
The WWW service model is quite simple (see Figure 1):
1) The client (browser) establishes a TCP connection with the web server and then sends an access request (such as GET). Following the HTTP protocol, the request carries a series of information including the client's IP address, the browser type, and the requested URL.
Figure 1 Web Access Mechanism
2) After receiving the request, the web server returns the requested page content to the client; if an error occurs, it returns an error code instead.
3) The server writes the access information and any error messages to its log files. Below is the content of the request datagram that a client sends to a web server:
GET /Engineer/ideal/list.htm HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
Referer: http://www.linuxaid.com.cn/ENGINEER/ideal/
Accept-Language: zh-cn
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)
Host: www.linuxaid.com.cn
Connection: Keep-Alive
As you can see, the client's request carries a lot of useful information, such as the client type. The web server then sends the requested web page to the client.
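Such a request can be composed by hand. The following sketch just builds a request like the one above (using the same illustrative host and path) with the CRLF line endings HTTP requires, and prints the request line:

```shell
# Compose a minimal HTTP/1.1 request like the one shown above.
# CRLF ("\r\n") line endings are required by HTTP; the host and path
# are the illustrative ones from the example request.
printf 'GET /Engineer/ideal/list.htm HTTP/1.1\r\nHost: www.linuxaid.com.cn\r\nConnection: Keep-Alive\r\n\r\n' |
    tr -d '\r' | head -n 1
# prints: GET /Engineer/ideal/list.htm HTTP/1.1
```

Piping the same bytes to a TCP connection on port 80 is exactly what the browser does in step 1.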
The most commonly used web servers are Apache, Netscape Enterprise Server, MS IIS, and so on. The most widely deployed web server on the Internet today is Apache, so the discussion here is based on Apache running on Linux; other application environments are similar. Apache supports multiple log file formats, the most common being the Common and Combined formats. The Combined format records two fields more than the Common format: the referer (where the request came from, for example from Yahoo) and the user-agent (the type of the client's browser). We discuss the Combined format here. Below are some log entries in Combined format:

218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"
61.139.226.47 - - [06/Dec/2002:00:00:00 +0000] "GET /cgi-bin/guanggaotmp.cgi? HTTP/1.1" 200 178 "http://www3.beareyes.com.cn/1/index.php" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
218.75.41.11 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98; DigExt)"
61.187.207.104 - - [06/Dec/2002:00:00:00 +0000] "GET /images/logolun1.gif HTTP/1.1" 304 0 "http://www2.beareyes.com.cn/bbs/b.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
211.150.229.228 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/pub/image_top_l.gif HTTP/1.1" 200 260 "http://www.beareys.com/2/lib/200201/12/20020112004.htm" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"
From the log entries above you can see that each record includes the client's IP address, the access time, the requested page, the status code the web server returned for the request, the size of the returned content (in bytes), the referring page, the client's browser type, and other information.
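As a quick sketch of how these fields can be pulled apart, awk can split a Combined-format line on whitespace and pick out the client IP, status code, and byte count (a real parser would have to respect the quoted and bracketed fields; the sample line is taken from the excerpt above):

```shell
# Extract client IP (field 1), status code (field 9) and bytes sent
# (field 10) from a whitespace-split Combined-format line.
line='218.242.102.121 - - [06/Dec/2002:00:00:00 +0000] "GET /2/face/shnew/ad/via20020915logo.gif HTTP/1.1" 304 0 "http://www.mpsoft.net/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98)"'
echo "$line" | awk '{print "ip=" $1, "status=" $9, "bytes=" $10}'
# prints: ip=218.242.102.121 status=304 bytes=0
```

Run over a whole log with `awk '{...}' access_log`, the same one-liner is the seed of the per-page and per-client statistics discussed below.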
III. Configuration and Management of Apache Logs
In this article we assume that our Apache serves two virtual hosts, www.secfocus.com and www.tomorrowtel.com, and that we need to perform access log analysis and statistics for both of them.
In the Apache configuration file, the log-related directives we care about are:

CustomLog /www/logs/access_log common
ErrorLog /www/logs/error_log

CustomLog indicates where Apache's access log is stored (here /www/logs/access_log) and its format (here common); ErrorLog indicates where the Apache error log is stored.
For a server with no virtual hosts configured, you only need to find the CustomLog directive directly in httpd.conf. For a web server carrying multiple virtual hosts, however, you need a separate access log for each virtual host, so that each one's traffic can be counted and analyzed individually. This requires a standalone log configuration inside each virtual host's configuration, for example:
NameVirtualHost 75.8.18.19

<VirtualHost 75.8.18.19>
    ServerName www.secfocus.com
    ServerAdmin secfocus@secfocus.com
    DocumentRoot /www/htdocs/secfocus/
    CustomLog "/www/log/secfocus" combined
    Alias /usage/ "/www/log/secfocus/usage/"
</VirtualHost>

<VirtualHost 75.8.18.19>
    ServerName www.tomorrowtel.com
    ServerAdmin tomorrowtel@tomorrowtel.com
    DocumentRoot /www/htdocs/tomorrowtel
    CustomLog "/www/log/tomorrowtel" combined
    Alias /usage/ "/www/log/tomorrowtel/usage/"
</VirtualHost>
Note that each virtual host's definition contains a CustomLog directive, which specifies the file in which that host's access log is stored, and an Alias directive, which lets the report generated by the log analysis be visited at www.secfocus.com/usage/. The above configuration takes care of saving the log files.
One problem remains, though: the log files keep growing. Left alone, this hurts the web server's efficiency; the logs may even exhaust the server's disk space and stop it from running properly, and if a single log file grows beyond the operating system's single-file size limit, the web service is compromised further. Moreover, the log files must be rotated before the statistical analysis program can work with them conveniently: analyzing one enormous accumulated log would make the statistics program run extremely slowly. So the web server's log files need to be rotated once a day.
IV. Rotating the Web Server Logs
There are three ways to rotate the web server logs: the first is to use the log rotation mechanism that comes with the Linux system itself, logrotate; the second is to use rotatelogs, which ships with Apache; the third is to use cronolog, a relatively mature log rotation tool recommended in Apache's FAQ.
Large web sites often use load balancing to increase their service capacity, so several web servers run behind the scenes. This greatly eases the planning and scaling of the service, but the logs scattered across those servers must be merged before statistical analysis. Therefore, to keep the statistics accurate, the log files must be generated in strict accordance with daily time periods.
4.1 Using logrotate for Daily Rotation
First we discuss using the log rotation mechanism that comes with the Linux system: logrotate. logrotate is the system's own log rotation program, and it is what rotates the various system logs (syslogd, mail, and so on). It is driven by crond, the service that runs scheduled programs, at 4:02 every morning: in the /etc/cron.daily directory you can see a file named logrotate with the following content:

#!/bin/sh
/usr/sbin/logrotate /etc/logrotate.conf
So every morning crond launches the logrotate script in /etc/cron.daily to perform the log rotation.
In /etc/logrotate.conf you can see the following:
# see "man logrotate" for details
# rotate log files weekly
weekly

# keep 4 weeks worth of backlogs
rotate 4

# create new (empty) log files after rotating old ones
create

# uncomment this if you want your log files compressed
#compress

# RPM packages drop log rotation information into this directory
include /etc/logrotate.d

# no packages own wtmp -- we'll rotate them here
/var/log/wtmp {
    monthly
    create 0664 root utmp
    rotate 1
}

# system-specific logs may be also be configured here.
From logrotate's configuration file you can see that, apart from wtmp, the configuration for every log that needs rotating is kept in the /etc/logrotate.d directory. So we only need to create a configuration file named apache in that directory to tell logrotate how to rotate the web server's log files. Here is an example:
/www/log/secfocus {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2> /dev/null || true
    endscript
}

/www/log/tomorrowtel {
    rotate 2
    daily
    missingok
    sharedscripts
    postrotate
        /usr/bin/killall -HUP httpd 2> /dev/null || true
    endscript
}
Here "rotate 2" means that only two rotated backups are kept for each log; for a log file named access_log, for example, there would be at most access_log, access_log.1, and access_log.2. This achieves the rotation of the log files for the two virtual hosts. Later we will discuss how to use log statistics software to process the log files.
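The effect of one daily run with "rotate 2" and "create" can be imitated by hand. The following sketch (in a scratch directory, with hypothetical file names) performs the same shuffle logrotate does: drop the oldest backup, shift each remaining backup up by one, and recreate an empty current log:

```shell
# Imitate one "daily" rotation with "rotate 2" and "create":
# the oldest backup beyond the limit is discarded, access_log.1
# becomes access_log.2, access_log becomes access_log.1, and a
# fresh empty access_log is created.
dir=$(mktemp -d)
cd "$dir"
echo "today's entries" > access_log
rm -f access_log.2                          # rotate 2: drop the oldest
mv -f access_log.1 access_log.2 2>/dev/null || true
mv -f access_log access_log.1
: > access_log                              # the "create" directive
ls
```

Because httpd keeps its file descriptor open on the old inode, the real logrotate must also signal the daemon (the postrotate block above) so it reopens the new empty file.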
The advantage of this method is that log truncation needs no third-party tools, but it is not well suited to heavily loaded servers or to web servers using load balancing, because it sends a -HUP restart signal to the service process to effect the truncation and archiving of the log, which interrupts the continuity of service.
4.2 Using Apache's rotatelogs to Rotate Logs
Apache can write its logs not only directly to a file but also through a pipe to another program, which greatly enhances its log-processing ability. The program on the other end of the pipe can be anything: a log analyzer, a log compressor, and so on. To write the log to a pipe, simply replace the file name in the log configuration with "|program name", for example:

# compressed logs
CustomLog "|/usr/bin/gzip -c >> /var/log/access_log.gz" common
With this mechanism, the rotation tool that comes with Apache, rotatelogs, can rotate the log files. rotatelogs is basically used to roll logs over by time or by size:

CustomLog "|/www/bin/rotatelogs /www/logs/secfocus/access_log 86400" common
Apache's access log entries are sent through the pipe to the program rotatelogs, which writes them to /www/logs/secfocus/access_log and rotates the log every 86400 seconds (one day). The rotated files are named /www/logs/secfocus/access_log.nnnn, where nnnn is the time at which that log segment began. Consequently, for each log to cover exactly one full day for the access statistics program, the service would have to be launched at exactly 00:00; a new log then begins at 00:00 each day, and the rotated file is named access_log.0000.
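The nnnn suffix is easy to predict: it is the Unix time at which the segment began, i.e. the current time rounded down to a multiple of the rotation interval. A small sketch for the 86400-second interval used above:

```shell
# rotatelogs names each rotated file access_log.NNNN, where NNNN is
# the epoch second at which the segment began: the current time
# rounded down to a multiple of the rotation interval.
interval=86400
now=$(date +%s)
segment=$(( now - now % interval ))
echo "current segment would be: access_log.$segment"
```

This also makes the midnight problem visible: the 86400-second boundaries fall at UTC midnights, not necessarily at local midnight, which is one reason the next tool is often preferred.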
4.3 Using cronolog to Rotate Logs
First you need to download and install cronolog; the latest version can be downloaded from http://www.cronolog.org. After the download completes, install it as follows:
[root@mail root]# tar xvfz cronolog-1.6.2.tar.gz
[root@mail root]# cd cronolog-1.6.2
[root@mail cronolog-1.6.2]# ./configure
[root@mail cronolog-1.6.2]# make
[root@mail cronolog-1.6.2]# make check
[root@mail cronolog-1.6.2]# make install
This completes the configuration and installation of cronolog. By default, cronolog is installed under /usr/local/sbin.
Then modify the Apache log configuration directive as follows:

CustomLog "|/usr/local/sbin/cronolog /www/logs/secfocus/%w/access_log" combined
Here %w means that the log is saved in a different directory for each day of the week, which keeps one week of logs. For log analysis, yesterday's log needs to be copied (or moved, if you do not want to keep a week of logs) to a fixed location so that the log analysis and statistics program can process it; use crontab -e to add a scheduled task:
5 0 * * * /bin/mv /www/logs/secfocus/`date -v-1d +%w`/access_log /www/logs/secfocus/access_log_yesterday
The log statistics and analysis program can then process the file access_log_yesterday.
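The %w directory name used in the cron job can be checked interactively. Note that `date -v-1d` is the BSD spelling of "one day ago"; GNU date, found on most Linux systems, spells it `date -d yesterday`. A sketch using the GNU form (the path is the hypothetical one from this article):

```shell
# Compute yesterday's day-of-week number (0-6, Sunday = 0), which is
# the %w directory cronolog wrote yesterday's log into.
yesterday=$(date -d yesterday +%w)
echo "yesterday's log: /www/logs/secfocus/$yesterday/access_log"
```

If the cron entry fails on your system, this one-character difference between the two date dialects is the first thing to check.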
For large sites using load balancing, there is the further problem of merging the access logs from multiple servers. In this case the log file cannot simply be named access_log_yesterday when it is moved; it should carry a server identifier, for example the server's IP address, to tell the files apart. Then run the rsync service on each server (refer to the article "Using rsync to implement website mirroring and backup", http://www.linuxaid.com.cn/Engineer/ideal/article/rsync.htm), and pull each server's daily log down to a server dedicated to access statistics via rsync. The log files from the several servers, say log1, log2, and log3, are then merged and written to log_all:

sort -m -t " " -k 4 -o log_all log1 log2 log3

-m uses the merge optimization algorithm, -k 4 sorts on the time field, and -o stores the sorted result in the specified file.
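A tiny demonstration of the merge, with two hypothetical one-line logs whose 4th field carries the timestamp, as in the command above (note that sorting the [dd/Mon/yyyy:... field lexically is only chronological within a single day, which is fine for logs already rotated daily):

```shell
# Merge two already-sorted log fragments on the timestamp field.
dir=$(mktemp -d)
cd "$dir"
echo '1.1.1.1 - - [06/Dec/2002:00:00:01 +0000] "GET /a HTTP/1.1" 200 10' > log1
echo '2.2.2.2 - - [06/Dec/2002:00:00:02 +0000] "GET /b HTTP/1.1" 200 20' > log2
sort -m -t ' ' -k 4 -o log_all log1 log2
cat log_all
```

Because -m only merges inputs that are each already sorted, it runs in linear time, which matters when the per-server daily logs are large.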
V. Installing and Configuring the Log Statistics Program Webalizer
Webalizer is an efficient, free web server log analyzer. Its analysis results are HTML files, so they can be browsed conveniently through the web server itself. Many sites on the Internet use Webalizer for their web server log analysis. Webalizer has the following features:
It is written in C, so it runs very efficiently: on a machine with a 200 MHz CPU, Webalizer can process about 10,000 records per second, so analyzing a 40 MB log file takes only about 15 seconds.
Webalizer supports the standard Common log file format; in addition, several variants of the Combined log file format are supported, allowing statistics on client browser and operating system types. Webalizer now also supports the wu-ftpd xferlog format and the Squid log file format.
It supports configuration both on the command line and through configuration files.
It supports multiple languages, and you can localize it yourself.
It runs on many platforms, such as UNIX, Linux, NT, OS/2, and MacOS.
The figure above shows the first page of an access statistics report generated by Webalizer, containing a table and bar chart of the average monthly traffic. Clicking on any month brings up detailed daily statistics for that month.
5.1 Installation
First, make sure the GD library is installed on the system. You can check with:

[root@mail root]# rpm -qa | grep gd
gd-devel-1.8.4-4
gdbm-devel-1.8.0-14
gdbm-1.8.0-14
sysklogd-1.4.1-8
gd-1.8.4-4

to confirm that the gd-devel and gd RPM packages are installed.
There are two ways to install Webalizer: either download the source code and build it, or install the RPM package directly.
The RPM route is very simple: find the Webalizer package on rpmfind.net, download it, and run

rpm -ivh webalizer-2.01_10-1.i386.rpm

and the installation is done.
For the source code, first download it from http://www.mrunix.net/webalizer/, then install it. First unpack the source package:

tar xvzf webalizer-2.01-10-src.tgz

In the resulting directory there is a lang directory holding the various language files; there is only a Traditional Chinese version, which can be converted to Simplified, or you can retranslate it yourself. Then enter the directory and build:

cd webalizer-2.01-10
./configure --with-language=Chinese
make
make install

After a successful build, the webalizer executable is installed in the /usr/local/bin/ directory.
5.2 Configuration and Running
Webalizer can be controlled either through a configuration file or by passing parameters on the command line. Using a configuration file is simple and flexible, and suits an environment where web server log analysis is automated.
Webalizer's default configuration file is /etc/webalizer.conf; when started without the "-f" option, Webalizer looks for /etc/webalizer.conf. You can also use "-f" to specify a configuration file (when the server has virtual hosts, you need several different Webalizer configuration files, one per virtual host). The options that need modifying in the webalizer.conf configuration file are as follows:
LogFile /www/logs/secfocus/access_log

indicates the path of the log file that Webalizer will take as input for its statistical analysis;

OutputDir /www/htdocs/secfocus/usage

indicates the directory in which the generated statistics reports are saved; earlier we used an Alias so that users can reach the reports at http://www.secfocus.com/usage/;

HostName www.secfocus.com

indicates the host name, which will be quoted in the statistics report.
The other options need no modification. Once the configuration file has been changed, Webalizer must be run on a schedule so that each day's statistical analysis is generated daily.
As root, run crontab -e to enter the scheduled-task editor and add the following tasks:

5 0 * * * /usr/local/bin/webalizer -f /etc/secfocus.webalizer.conf
15 0 * * * /usr/local/bin/webalizer -f /etc/tomorrowtel.webalizer.conf

Here we assume the system runs the two virtual hosts and that the log analysis configuration files secfocus.webalizer.conf and tomorrowtel.webalizer.conf have been defined for them; the statistical analysis of both hosts' logs then runs in the early hours of each morning.
The next day, visit http://www.secfocus.com/usage/ and http://www.tomorrowtel.com/usage/ to view the respective log analysis reports.
VI. Protecting the Log Statistics Reports from Unauthorized Access
We do not want our site's access statistics browsed by just anyone, so the usage directory needs to be protected so that only legitimate users can access it. Apache's basic authentication mechanism can be used here; once it is configured, visiting the address requires a password:
1. Prerequisites
The directory configuration should be set to:

DocumentRoot /www/htdocs/secfocus/
AccessFileName .htaccess
AllowOverride All
2. Requirement
Restrict access to http://www.secfocus.com/usage/ so that user authentication is required. Here the user is "admin" and the password is "12345678".
3. Use htpasswd to create the user file
htpasswd -c /www/.htpasswd admin
The program asks for user "admin"'s password; enter "12345678" twice and it takes effect.
4. Create the .htaccess file
Use vi to create a file named .htaccess in the /www/logs/secfocus/usage/ directory, and write the following four lines into it:

AuthName admin-only
AuthType Basic
AuthUserFile /www/.htpasswd
Require user admin
5. Test
Now visiting http://www.secfocus.com/usage/ through a browser brings up a username and password prompt; enter admin and 12345678 to reach the access log statistics report.