There are many different packages that allow you to generate reports on who's visiting your site and what they're doing. The most popular at this time appear to be 'Analog', 'The Webalizer' and 'AWStats', which are installed by default on many shared servers.
The Apache access log stores information about events that occurred on your Apache web server. For instance, when someone visits your website, Apache records details such as the IP address of the visitor, the pages they were viewing, status codes and the browser used. If you're experiencing web server difficulties, or you just want to see what Apache is doing, the log files should be your first stop. Apache stores two kinds of logs: the access log, which records all incoming requests processed by the server, and the error log, which records any problems the server encounters. The format of the access log is highly configurable, and its location and content are controlled by the CustomLog directive. So if an individual visits a webpage on your site, the access log file will contain details regarding this event.
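Where the access log is written depends on how Apache was installed; common defaults include:

/var/log/httpd/access_log (Red Hat / CentOS / Fedora)
/var/log/apache2/access.log (Debian / Ubuntu)
/usr/local/apache2/logs/access_log (compiled from source)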
While such programs generate attractive reports, they only scratch the surface of what the log files can tell you. In this section we look at ways you can delve more deeply - focussing on the use of simple command line tools, particularly grep, awk and sed.
Combined log format
The following assumes an Apache HTTP Server combined log format, where each entry in the log file contains the following information:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-agent}i"
where:
%h = IP address of the client (remote host) which made the request
%l = RFC 1413 identity of the client
%u = userid of the person requesting the document
%t = Time that the server finished processing the request
%r = Request line from the client in double quotes
%>s = Status code that the server sends back to the client
%b = Size of the object returned to the client
The final two items, Referer and User-agent, give details on where the request originated and what type of agent made the request.
Sample log entries:
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET /robots.txt HTTP/1.0" 200 468 "-" "Googlebot/2.1"
66.249.64.13 - - [18/Sep/2004:11:07:48 +1000] "GET / HTTP/1.0" 200 6433 "-" "Googlebot/2.1"
Note: The robots.txt file gives instructions to robots as to which parts of your site they are allowed to index. A request for / is a request for the default index page, normally index.html.
Using awk
The principal use of awk is to break up each line of a file into 'fields' or 'columns' using a pre-defined separator. Because each line of the log file is based on the standard format we can do many things quite easily.
Using the default separator, which is any white-space (spaces or tabs), we get the following:
awk '{print $1}' combined_log    # ip address (%h)
awk '{print $2}' combined_log    # RFC 1413 identity (%l)
awk '{print $3}' combined_log    # userid (%u)
awk '{print $4,$5}' combined_log # date/time (%t)
awk '{print $9}' combined_log    # status code (%>s)
awk '{print $10}' combined_log   # size (%b)
You might notice that we've missed out some items. To get to them we need to set the delimiter to the " character, which changes the way the lines are 'exploded' and allows the following:
awk -F\" '{print $2}' combined_log # request line (%r)
awk -F\" '{print $4}' combined_log # referer
awk -F\" '{print $6}' combined_log # user agent
Now that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples.
Examples
You want to list all user agents ordered by the number of times they appear (descending order):
awk -F\" '{print $6}' combined_log | sort | uniq -c | sort -fr
All we're doing here is extracting the user agent field from the log file and 'piping' it through some other commands. The first sort is to enable uniq to properly identify and count unique user agents. The final sort orders the result by number and name (both descending).
The result will look similar to a user agents report generated by one of the above-mentioned packages. The difference is that you can generate this ANY time from ANY log file or files.
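The output is a count followed by the user agent string; the figures below are purely illustrative:

1043 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
 921 Googlebot/2.1 (+http://www.google.com/bot.html)
 241 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) Gecko/20040803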
If you're not particularly interested in which operating system the visitor is using, or what browser extensions they have, then you can use something like the following:
awk -F\" '{print $6}' combined_log \
| sed 's/(\([^;]\+; [^;]\+\)[^)]*)/(\1)/' \
| sort | uniq -c | sort -fr
Note: The \ at the end of a line simply indicates that the command will continue on the next line.
This will strip out the third and subsequent values in the 'bracketed' component of the user agent string. For example, a user agent string such as:

Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)

becomes:

Mozilla/4.0 (compatible; MSIE 6.0)
The next step is to start filtering the output so you can narrow down on a certain page or referer. Would you like to know which pages Google has been requesting from your site?
awk -F\" '($6 ~ /Googlebot/){print $2}' combined_log | awk '{print $2}'
Or who's been looking at your guestbook?
awk -F\" '($2 ~ /guestbook\.html/){print $6}' combined_log
It's just too easy isn't it!
Using just the examples above you can already generate your own reports to back up any kind of automated reporting your ISP provides. You could even write your own log analysis program.
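As a starting point, here's a minimal sketch of such a program, built from the commands above and assuming a combined-format log file (the script name and the top-10 cutoff are arbitrary choices):

#!/bin/sh
# report.sh - quick summary of a combined-format Apache log
# usage: ./report.sh /path/to/combined_log
LOG=${1:?usage: $0 logfile}

echo "== Top 10 user agents =="
awk -F\" '{print $6}' "$LOG" | sort | uniq -c | sort -rn | head -10

echo "== Requests by status code =="
awk '{print $9}' "$LOG" | sort | uniq -c | sort -rn

echo "== Top 10 pages returning 404 =="
awk '($9 == 404){print $7}' "$LOG" | sort | uniq -c | sort -rn | head -10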
Using log files to identify problems with your site
The steps outlined below will let you identify problems with your site by identifying the different server responses and the requests that caused them:
awk '{print $9}' combined_log | sort | uniq -c | sort
The output shows how many of each type of request your site is getting. A 'normal' request results in a 200 code, which means a page or file has been requested and delivered, but there are many other possibilities.
The most common responses are:
200 - OK
206 - Partial Content
301 - Moved Permanently
302 - Found
304 - Not Modified
401 - Unauthorised (password required)
403 - Forbidden
404 - Not Found
Note: For more on Status Codes you can read the article HTTP Server Status Codes.
A 301 or 302 code means that the request has been re-directed. What you'd like to see, if you're concerned about bandwidth usage, is a lot of 304 responses - meaning that the file didn't have to be delivered because the requester already had a cached version.
A 404 code may indicate that you have a problem - a broken internal link or someone linking to a page that no longer exists. You might need to fix the link, contact the site with the broken link, or set up a PURL so that the link can work again.
The next step is to identify which pages/files are generating the different codes. The following command will summarise the 404 ('Not Found') requests:
# list all 404 requests
awk '($9 ~ /404/)' combined_log

# summarise 404 requests
awk '($9 ~ /404/)' combined_log | awk '{print $9,$7}' | sort
Or, you can use an inverted regular expression to summarise the requests that didn't return 200 ('OK'):
awk '($9 !~ /200/)' combined_log | awk '{print $9,$7}' | sort | uniq
Or, you can exclude a whole range of responses - in this case requests that returned 200 ('OK') or 304 ('Not Modified'):
awk '($9 !~ /200|304/)' combined_log | awk '{print $9,$7}' | sort | uniq
Suppose you've identified a link that's generating a lot of 404 errors. Let's see where the requests are coming from:
awk -F\" '($2 ~ /^GET \/path\/to\/brokenlink\.html/){print $4,$6}' combined_log
Now you can see not just the referer, but the user-agent making the request. You should be able to identify whether there is a broken link within your site, on an external site, or if a search engine or similar agent has an invalid address.
If you can't fix the link, you should look at using Apache mod_rewrite or a similar scheme to redirect (301) the requests to the most appropriate page on your site. By using a 301 instead of a normal (302) redirect you are indicating to search engines and other intelligent agents that they need to update their link as the content has 'Moved Permanently'.
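For example, a permanent redirect can be set up in httpd.conf or .htaccess using mod_alias or mod_rewrite - the paths below are placeholders:

# using mod_alias
Redirect 301 /old/page.html http://www.example.net/new/page.html

# or the equivalent using mod_rewrite
RewriteEngine On
RewriteRule ^old/page\.html$ /new/page.html [R=301,L]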
Who's 'hotlinking' my images?
Something that really annoys some people is when their bandwidth is being used by other websites linking directly to their images.
Here's how you can see who's doing this to your site. Just change www.example.net to your domain, and combined_log to your combined log file.
awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.example\.net/){print $4}' combined_log \
| sort | uniq -c | sort
Translation:
- explode each row using the " character;
- the request line (%r) must contain '.jpg' or '.gif';
- the referer must not start with your website address (www.example.net in this example);
- display the referer and summarise.
You can block hot-linking using mod_rewrite, but that can also result in blocking various search engine result pages, caches and online translation software. To see if this is happening, we look for 403 ('Forbidden') errors in the image requests:
# list image requests that returned 403 Forbidden
awk '($9 ~ /403/)' combined_log \
| awk -F\" '($2 ~ /\.(jpg|gif)/){print $4}' \
| sort | uniq -c | sort
Translation:
- the status code (%>s) is 403 Forbidden;
- the request line (%r) contains '.jpg' or '.gif';
- display the referer and summarise.
You might notice that the above command is simply a combination of the previous, and one presented earlier. It is necessary to call awk more than once because the 'referer' field is only available after the separator is set to ", whereas the 'status code' is available directly.
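If you do decide to block hot-linking with mod_rewrite, a commonly-used sketch looks like the following - www.example.net is again a placeholder, and the blank-referer test is there so that agents sending no referer at all aren't blocked:

RewriteEngine On
# allow requests with no referer
RewriteCond %{HTTP_REFERER} !^$
# allow requests referred from your own site
RewriteCond %{HTTP_REFERER} !^http://(www\.)?example\.net/ [NC]
# otherwise refuse image requests with 403 Forbidden
RewriteRule \.(jpg|gif)$ - [F,NC]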
Blank User Agents
A 'blank' user agent is typically an indication that the request is from an automated script or someone who really values their privacy. The following command will give you a list of IP addresses for those user agents so you can decide if any need to be blocked:
awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq
A further pipe through logresolve will give you the hostnames of those addresses.
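For example, using the logresolve utility that ships with Apache:

awk -F\" '($6 ~ /^-?$/)' combined_log | awk '{print $1}' | sort | uniq | logresolve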
Log File Formats

A web server logs all traffic to a log file. There are various formats, and this page will help you understand the log formats that are used. The most popular logging formats are the NCSA (Common or Combined) format, used mostly by Apache, and the W3C standard, used by IIS. These formats are explained in more detail below.
APACHE LOG FILES
One of the many pieces of the website puzzle is web logs. Traffic analysis is central to most websites, and the key to getting the most out of your traffic analysis revolves around how you configure your web logs. Apache is one of the most powerful open source solutions for website operations, and you will find that its logging features are flexible enough for the single website or for managing numerous domains requiring web log analysis.

For the single site, Apache is pretty much configured for logging in the default install. The initial httpd.conf file (found in /etc/httpd/conf/httpd.conf in most cases) should have a section on logs that looks similar to the snippet below (Apache 2.0.x), with descriptive comments for each item. Your default logs folder will be found in /etc/httpd/logs. This location can be changed when dealing with multiple websites, as we'll see later. For now, let's review this section of log configuration.
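A stock Apache 2.0.x httpd.conf contains directives along these lines (comments trimmed; exact contents vary by distribution):

ErrorLog logs/error_log
LogLevel warn
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog logs/access_log common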
Error Logs
The error log contains messages sent from Apache for errors encountered during the course of operation. This log is very useful for troubleshooting Apache issues on the server side.

Apache Log Tip: If you are monitoring errors or testing your server, you can use the command line to interactively watch log entries. Open a shell session and type "tail -f /path/to/error_log". This will show you the last few entries in the file and also continue to show new entries as they occur.

There are no real customization options available, other than telling Apache where to establish the file, and what level of error logging you seek to capture. First, let's look at the error log configuration code from httpd.conf.
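In its simplest form it amounts to just two directives (the path is a typical default and may differ on your system):

ErrorLog logs/error_log
LogLevel warn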
You may wish to store all error-related information in one error log. If so, the above is fine, even for multiple domains. However, you can specify an error log file for each individual domain you have. This is done in the <VirtualHost> container with an entry like the one shown below.
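For example (the domain and file names are placeholders):

<VirtualHost *:80>
    ServerName www.example.net
    ErrorLog logs/example.net-error_log
</VirtualHost>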
If you are responsible for reviewing error log files as a server administrator, it is recommended that you maintain a single error log. If you’re hosting for clients, and they are responsible for monitoring the error logs, it’s more convenient to specify individual error logs they can access at their own convenience.
Apache’s definitions for their error log levels are as follows:
Level | Description |
---|---|
Emerg | Emergencies – system is unusable |
Alert | Action must be taken immediately |
Crit | Critical Conditions |
Error | Error conditions |
Warn | Warning Conditions |
Notice | Normal but significant condition |
Info | Information |
Debug | Debug-level messages |
Tracking Website Activity – Access Logs
Often by default, Apache will generate a log file called access_log. This tracks the accesses to your website, the browsers being used to access the site and the referring URLs that your site visitors have arrived from. It is commonplace now to utilize Apache's "combined" log format, which compiles all three of these logs into one logfile. This is very convenient when using traffic analysis software, as a majority of these third-party programs are easiest to configure and schedule when only dealing with one log file per domain. Let's break down the code in the combined log format and see what it all means.
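The standard definition, as it appears in httpd.conf, looks like this:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined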
LogFormat starts the line and simply tells Apache you are defining a log file type (or nickname), in this case, combined. Now let’s look at the cryptic symbols that make up this log file definition.
Symbol | Description |
---|---|
%h | IP Address of client (remote host) |
%l | Identd of client (normally unavailable) |
%u | User id of user requesting object |
%t | Time of request |
%r | Full request string |
%>s | Status code |
%b | Size of object returned to the client (excluding headers) |
%{Referer}i | The previous webpage |
%{User-agent}i | The Client’s browser |
To review all of the available configuration codes for generating a custom log, see Apache's docs on mod_log_config, which powers log files in Apache.
Apache Log Tip: You could capture more from the HTTP header if you so desired. A full listing and definition of data in the header is found at the World Wide Web Consortium. http Logs Viewer also supports a number of additional log formats and directives.
For a single Website, the default entry would suffice:
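For example, using the combined format defined above (the path is a typical default):

CustomLog logs/access_log combined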
However, for logging multiple sites, you have a few options. The most common is to identify individual log files for each domain. This is seen in the example below, again using the CustomLog directive within the <VirtualHost> container for each domain.
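For example (the three domains and log file names below are placeholders):

<VirtualHost *:80>
    ServerName www.example1.com
    CustomLog logs/example1.com-access_log combined
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example2.com
    CustomLog logs/example2.com-access_log combined
</VirtualHost>

<VirtualHost *:80>
    ServerName www.example3.com
    CustomLog logs/example3.com-access_log combined
</VirtualHost>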
In the above example, we have three domains with three unique Web logs (using the combined format we defined earlier). A traffic analysis package could then be scheduled to process these logs and generate reports for each domain independently.
IIS LOG FILES
IIS uses different formats to create log files. The two most common are the NCSA and W3C standard formats.
NCSA
This format is identical to the Apache Common log format, so you can treat such a log file much as you would an Apache log file.
W3C
The field definitions of the W3C logging format are shown below. Some fields start with a prefix which explains which host (client/server/proxy) the field refers to.
Prefix | Description |
---|---|
c | Client |
s | Server |
r | Remote |
cs | Client to Server. |
sc | Server to Client. |
sr | Server to Remote Server (used by proxies) |
rs | Remote Server to Server (used by proxies) |
Field Definition | Description |
---|---|
date | Date at which transaction completed |
time | Time at which transaction completed |
time-taken | Time taken for transaction to complete in seconds |
bytes | bytes transferred |
cached | Records whether a cache hit occurred |
ip | IP address and port |
dns | DNS name |
status | Status code |
comment | Comment returned with status code |
method | Method |
uri | URI |
uri-stem | Stem portion alone of URI (omitting query) |
uri-query | Query portion alone of URI |
A sample W3C log file is shown below:
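For illustration only - a W3C log begins with directive lines, followed by space-separated fields as declared in #Fields (the entries below are made up):

#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2004-09-18 11:07:48
#Fields: date time c-ip cs-method cs-uri-stem sc-status time-taken
2004-09-18 11:07:48 66.249.64.13 GET /default.htm 200 0.046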