Managing Logging with Numerous VirtualHosts in Apache

Note: this document was recently published in a slightly revised form in the February 2002 issue of SysAdmin.

Introduction

This document is slightly different from previous documents in this series. Rather than describe installation and configuration options that will work across nearly any UNIX system, this article focuses on a particular solution to a problem, one which the author has implemented. The steps described here may work on all platforms, but they may involve major changes to the standard operating procedure at a particular site. Not all administrators will wish to change their procedures, but the information presented here may still prove useful as an alternative means of achieving a goal.

The Problem

Running Apache with numerous virtualhosts can present many problems to the site administrator. One problem I ran into early on was how to handle logging for all of these virtualhosts. It is not uncommon for one of our larger shared hosting servers to handle 500+ virtualhosts. The first difficulty involved file descriptors. Apache holds a file descriptor open for every log file it is configured to write, even if nothing is being written to the log. Earlier versions of the Linux kernel (pre 2.2.x) had static limits on the number of open file descriptors allowed per process. At the time, the only way to raise these limits was to patch and recompile the kernel. When the Apache processes run out of file descriptors, the web server behaves unpredictably; for example, it may refuse to spawn external interpreters to handle CGI scripts.
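
To make the arithmetic concrete, here is a minimal Python sketch that compares the number of descriptors needed for per-virtualhost logs against the current per-process limit. The function names and the headroom figure of 64 are assumptions chosen for illustration; they are not taken from Apache or from the author's tools.

```python
import resource


def log_descriptors_needed(vhosts, logs_per_vhost=2, shared_logs=0):
    """Estimate how many descriptors Apache holds open for log files alone."""
    return vhosts * logs_per_vhost + shared_logs


def fits_in_limit(needed, soft_limit, headroom=64):
    """True if `needed` descriptors still leave `headroom` free under the limit.

    The headroom of 64 is an arbitrary allowance for sockets, CGI pipes,
    and so on; adjust it for the site in question.
    """
    return needed + headroom <= soft_limit


# Per-process limit on open files for the current process (soft, hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

per_vhost = log_descriptors_needed(500)         # access + error log per virtualhost
shared = log_descriptors_needed(500, 0, 2)      # one shared access_log and error_log
print(f"soft limit {soft}: per-vhost logging needs {per_vhost}, shared logging needs {shared}")
```

With 500 virtualhosts and two logs each, the log files alone account for 1000 descriptors per process, whereas a single shared access_log and error_log account for two.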

The second issue with multiple log files is storage: where does each virtualhost write its log file, and which user owns the directory? I have seen numerous examples of administrators allowing virtualhost log files to be stored in user home directories. This is a serious, and potentially fatal (to your system), security violation! Because users own the underlying directory, it is a simple matter for one of them to remove the root-owned log file and replace it with a symbolic link to some other file, such as /vmlinuz or /etc/shadow. Since Apache opens its log files as root at startup, it would then follow the link and write log entries into that file.
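
A quick audit of where logs are written can catch this kind of setup. The following Python sketch flags a log directory that is a symbolic link, is not owned by root, or is writable by group or others. The function name is hypothetical, and the check covers only the directory itself; a thorough audit would also walk the parent directories.

```python
import os
import stat


def unsafe_reasons(path):
    """Return the reasons `path` is a risky place for root-written logs."""
    reasons = []
    st = os.lstat(path)  # lstat: inspect the entry itself, do not follow links
    if stat.S_ISLNK(st.st_mode):
        reasons.append("is a symbolic link")
    if st.st_uid != 0:
        reasons.append(f"owned by uid {st.st_uid}, not root")
    if st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
        reasons.append("writable by group or others")
    return reasons
```

An empty list means none of the three checks fired, for example after calling `unsafe_reasons("/usr/local/apache/var/log")` on a root-owned directory with mode 755.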

The Solution (Overview)

The solution arrived at was to log all virtualhosts to a single log file (the access_log) which is later split into its respective virtualhost log files. This already solves one problem, that of the excessive file descriptors. A nice by-product of this solution is that each virtualhost no longer needs individual TransferLog and ErrorLog statements. Two additional lines of text per virtualhost can really add up to a large amount of extra text in the httpd.conf file when there are 500+ virtualhosts listed.

The Apache distribution ships with a very useful perl script called split-logfile. Given an access_log in a certain format, this script will split the monolithic log into its respective parts. The script returns the new individual log files to a standard format by removing the field that it requires to identify the virtualhost. Finally, split-logfile automatically appends to existing files with the same name, allowing the main log to be split once or more per day while maintaining continuity in the individual virtualhost logs.

The individual log files can be manipulated in the same way as a standard access_log: they can be run through log analyzers, and so on. These logs have the added benefit of not being held open by Apache, so they can be edited, deleted, or moved without having to signal the Apache daemon. The drawback is that they are not updated in real time, only as often as the main log is split.

The Solution (Specifics)

First, we must modify Apache's setup so that log files are written in a format that split-logfile can read. Apache's mod_log_config module allows for granular control of logging in the server. One of the nice features is the ability to determine which fields we want to have logged. A few default examples are already provided in the httpd.conf file when Apache is installed. I prefer to use the combined referer and agent log format. The default format for this log is as follows:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

Apache is told which log format to use through the CustomLog directive, which references the name given to the format ('combined' in the above example). The complete directive is as follows:

CustomLog /usr/local/apache/var/log/access_log combined

In order to determine which virtualhost a log entry belongs to, the name of the virtualhost must be added to each log entry. Luckily, this is easy to do with the LogFormat directive. The token used to represent the virtualhost name is %v. As split-logfile requires this to be the first field in the log, our new LogFormat directive looks like this:

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined

Since most log analysis programs will not recognize log files with the additional virtualhost field at the beginning, split-logfile strips this field from each line as it processes the log, leaving a standard combined format log file.

We also require a specific directory structure to be set up for our new logging methods. Apache normally writes its logs to /usr/local/apache/var/log. Within this directory we will create two additional directories, old-logs and cust-logs. The old-logs directory is used to store the older, compressed, pre-split logs for the system. Local retention and tape archiving policies will determine how long you want to keep these logs. I generally keep them for about a month, or until I notice disk space getting tight on the device. The cust-logs directory is used to store the actual virtualhost log files after they have been split. Because of the number of virtualhosts per server and the size of these logs, we usually dedicate a separate disk of between 4GB and 9GB to them. Due to the way split-logfile works, it is necessary to store the script in the cust-logs directory. This directory should probably be world readable so users can look at or copy their log files.
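
As a rough illustration, the layout described above could be created with a few lines of Python. The function name is hypothetical, and the restrictive mode on old-logs is an assumption (the pre-split logs contain every customer's entries); only the world-readable cust-logs directory comes from the text.

```python
import os


def create_log_layout(log_root):
    """Create old-logs and cust-logs under log_root and return their paths."""
    old_logs = os.path.join(log_root, "old-logs")
    cust_logs = os.path.join(log_root, "cust-logs")
    os.makedirs(old_logs, exist_ok=True)
    os.makedirs(cust_logs, exist_ok=True)
    # Assumption: the archived monolithic logs are kept private to the owner.
    os.chmod(old_logs, 0o700)
    # World readable (but not writable), so users can view or copy their logs.
    os.chmod(cust_logs, 0o755)
    return old_logs, cust_logs
```

On the systems described here, it would be called as `create_log_layout("/usr/local/apache/var/log")` by root.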

The split-logfile and logcron Scripts

split-logfile is a very simple perl script which reads a monolithic log file and splits it, line by line, into separate log files based on the first token on each line. If you made the changes to your LogFormat described above, this first token will be the name of the virtualhost to which that log entry belongs. The script also strips that first token from each line, as the extraneous field would confuse most log analysis programs.
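
For readers who do not read perl, the same idea can be sketched in Python. This is not the script shipped with Apache, only an illustration of the technique: the function name, the `<vhost>.log` output naming, and the decision to skip suspicious names are assumptions made for this example.

```python
import os


def split_log(lines, out_dir):
    """Append each line to <out_dir>/<vhost>.log without its leading vhost field.

    Returns the sorted list of virtualhost names that were written to.
    """
    handles = {}
    try:
        for line in lines:
            vhost, sep, rest = line.partition(" ")
            if not vhost or not sep:
                continue  # no virtualhost field to split on
            name = vhost.lower()
            # Never let a log entry choose a path outside out_dir.
            if "/" in name or "\\" in name or name.startswith("."):
                continue
            if name not in handles:
                # Append mode keeps continuity across repeated runs.
                handles[name] = open(os.path.join(out_dir, name + ".log"), "a")
            handles[name].write(rest)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Note that this sketch keeps one output file open per virtualhost while it runs, so the splitting process itself needs a descriptor limit above the number of virtualhosts in the log. A typical call would pass an open file object, for example `split_log(open("access_log"), "cust-logs")`.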

One minor shortcoming of split-logfile as it is distributed with Apache is that it will only dump its output logs into its current working directory, hence the need to place the script into the cust-logs directory, above.

The split-logfile script can be found in the src/support directory of the Apache distribution. For convenience, the script is also available here. Note that this script is the intellectual property of the Apache Group, which holds all applicable copyrights. Reproduction of the script on this website does not indicate that the website author provides any kind of warranty or support for this script.

While split-logfile is extremely useful, it doesn't achieve its true potential without an external wrapper program which handles all aspects of log file rotation and archival. Thus, logcron, a fairly simple korn shell script, was developed locally. logcron is available here.

The script is intended to be run through the cron facility on a daily basis, or at whatever interval is desired by the local administrator. Please note that if the script is to be run more than once daily, some minor modifications will be necessary to the archival process, as it relies on using the month and day for file identification. The script is heavily commented and should be easy to follow. The basic procedure taken by logcron is as follows:

1. Copy the access_log and error_log to the archive directory and rename them with the date appended to the filename to make it unique.
2. Create new, blank logfiles for the system to work with.
3. Restart the Apache process with the USR1 signal so that it closes the old file descriptors and starts writing to the new ones.
4. Run split-logfile on the access_log we just archived.
5. Gzip the original log files for more efficient storage of the older logs.
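
The procedure above can be sketched in Python as follows. This is an illustration of the listed steps, not the logcron script itself (which is written in korn shell). The function name, its parameters, and the `%m%d` date stamp are assumptions; the splitting step is passed in as a callable so the sketch does not depend on how split-logfile is invoked locally.

```python
import gzip
import os
import shutil
import signal
import time


def rotate_logs(log_dir, archive_dir, apache_pid=None, split=None, stamp=None):
    """Archive, reset, split, and compress access_log and error_log.

    Returns the paths of the compressed archives.
    """
    stamp = stamp or time.strftime("%m%d")  # assumed format: month and day
    archived = {}

    # Step 1: copy each log to the archive directory with the date appended.
    for name in ("access_log", "error_log"):
        dest = os.path.join(archive_dir, f"{name}.{stamp}")
        shutil.copy2(os.path.join(log_dir, name), dest)
        archived[name] = dest

    # Step 2: leave new, blank log files in place for the server.
    # Entries written between the copy and this point would be lost.
    for name in archived:
        open(os.path.join(log_dir, name), "w").close()

    # Step 3: signal Apache with USR1 so it reopens its log files.
    if apache_pid is not None:
        os.kill(apache_pid, signal.SIGUSR1)

    # Step 4: split the archived access_log (caller supplies the splitter).
    if split is not None:
        split(archived["access_log"])

    # Step 5: compress the archived originals and remove the plain copies.
    compressed = []
    for dest in archived.values():
        with open(dest, "rb") as raw, gzip.open(dest + ".gz", "wb") as packed:
            shutil.copyfileobj(raw, packed)
        os.remove(dest)
        compressed.append(dest + ".gz")
    return compressed
```

Because the stamp contains only the month and day, a second run on the same day would overwrite the first archive, which is the limitation mentioned above for sites that rotate more than once daily.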

We run logcron through cron once daily at 11:55 PM, as it can sometimes take a minute or two to copy the monolithic log file to the archive directory and restart Apache. Thorough instructions are provided in the comments of the script itself on running the script from cron for those who are not familiar with the procedure.

logcron is designed to allow unlimited local customization and expansion. Some examples might be using the script to spawn log analysis programs or bandwidth monitoring tools. If you have disabled reverse DNS lookups in the webserver to improve performance while serving requests, the lookups can instead be performed through logcron before the monolithic log file is passed to split-logfile. It is also possible to use logcron to perform other log maintenance, such as removing old logs on a monthly basis, or even to do general Apache maintenance on a specific timeline.
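
As an example of such an extension, the deferred reverse DNS lookups could be done by a small filter like the Python sketch below, run over the monolithic log before it is split. The function name and parameters are hypothetical. It assumes the client address is the second field (after the %v virtualhost field), caches each answer, and leaves the address untouched when a lookup fails. Apache's own logresolve utility serves a similar purpose and may be the better choice where it fits.

```python
import socket


def resolve_hosts(lines, lookup=socket.gethostbyaddr, host_field=1):
    """Yield log lines with the client address replaced by its hostname.

    `host_field` is the zero-based position of the address; 1 assumes the
    virtualhost name occupies field 0.
    """
    cache = {}
    for line in lines:
        fields = line.split(" ", host_field + 1)
        if len(fields) <= host_field:
            yield line  # too short to contain an address; pass through
            continue
        addr = fields[host_field]
        if addr not in cache:
            try:
                cache[addr] = lookup(addr)[0]  # first item is the hostname
            except OSError:
                cache[addr] = addr  # lookup failed: keep the raw address
        fields[host_field] = cache[addr]
        yield " ".join(fields)
```

The cache matters because a busy log repeats the same client addresses many times, and each uncached lookup can block for as long as the resolver takes to time out.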

Conclusion

Through the use of a very handy perl script included with Apache, a simple custom-written korn shell script, and a few simple modifications to Apache's configuration directives, we have brought the task of virtualhost logging under control at our site. With this method we have not only solved the problem of running out of file descriptors (which can cripple or bring down hundreds or even thousands of sites), but have also turned logging into a centralized, organized procedure. Rather than having logs scattered throughout the file system, they are now all maintained on a dedicated disk device where analysis and maintenance can be run more easily. The key goal of this project is a more robust environment which can be easily customized to local administration practices.