This document is slightly different from previous documents in this series. Rather than describe installation and configuration options that will work across nearly any UNIX system, this article focuses on a particular solution to a problem, one which the author has implemented. The steps described here may work on all platforms, but they may involve major changes to the standard operating procedure at a particular site. Not all administrators will wish to change their procedures, but the information presented here may still prove useful as an alternative means of achieving a goal.
Running Apache with numerous virtualhosts can present many problems to the site administrator. One problem I ran into early on was how to handle logging for all of these virtualhosts. It is not uncommon for one of our larger shared hosting servers to handle 500+ virtualhosts. The first difficulty had to do with file descriptors. Apache holds a file descriptor for each log file it has open, even if nothing is being written to the log. Earlier versions of the Linux kernel (pre-2.2.x) had static limits on the number of open file descriptors allowed per process. At the time, the only way to modify these limits was to patch and recompile the kernel. When the Apache processes run out of file descriptors, the web server behaves unusually, including refusing to spawn external interpreters to handle CGI scripts.
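To make the arithmetic concrete, here is a minimal sketch that estimates how many descriptors go to log files alone. It assumes two log files per virtualhost plus the two main server logs; the function name estimate_log_fds is hypothetical, and ulimit -n is the commonly supported shell builtin for showing the per-process descriptor limit.

```shell
#!/bin/sh
# Rough estimate of the descriptors Apache needs just for log files.
# Assumption: each virtualhost has its own access log and error log
# (2 files per host by default), plus the 2 main server logs.
estimate_log_fds() {
    vhosts=$1
    logs_per_vhost=${2:-2}
    echo $(( vhosts * logs_per_vhost + 2 ))
}

needed=$(estimate_log_fds 500)
limit=$(ulimit -n)    # per-process limit for the current shell
echo "log descriptors needed: $needed, per-process limit: $limit"
```

With 500 virtualhosts this comes to 1002 descriptors before Apache has opened a single socket or document, which is why a static per-process limit becomes a problem so quickly.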
The second issue with multiple log files is storage: where does each virtualhost write its log file, and which user owns the directory? I have seen numerous examples of administrators allowing virtualhost log files to be stored in user home directories. This is a serious, and potentially fatal (to your system), security violation! It is a simple matter for a user to remove the root-owned file (as they own the underlying directory) and replace it with a symbolic link of the same name pointing to some other file, such as /vmlinuz or /etc/shadow. Apache opens its logs as root, so it would then write log entries into the file the link points to.
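The mechanics can be seen safely in a scratch directory. The following is a harmless sketch, not something to run against real logs: every path is created under a temporary directory, and the names home, victim, and demo_symlink_clobber are hypothetical stand-ins for a user's home directory, a system file, and the demonstration itself.

```shell
#!/bin/sh
# Harmless demonstration in a scratch directory; all names are hypothetical.
demo_symlink_clobber() {
    scratch=$(mktemp -d) || return 1
    mkdir "$scratch/home"                      # stands in for a user's home directory
    echo "important data" > "$scratch/victim"  # stands in for a system file
    : > "$scratch/home/access_log"             # log file created by the server

    # The directory owner can remove the log and put a symlink in its place...
    rm -f "$scratch/home/access_log"
    ln -s "$scratch/victim" "$scratch/home/access_log"

    # ...so the next append by the "server" lands in the victim file.
    echo "GET / HTTP/1.0" >> "$scratch/home/access_log"
    cat "$scratch/victim"
    rm -rf "$scratch"
}

demo_symlink_clobber
```

The victim file ends up containing the log entry even though nothing wrote to it by name. Keeping logs in a directory that only root can modify removes the opportunity.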
The solution arrived at was to log all virtualhosts to a single log file (the access_log), which is later split into its respective virtualhost log files. This already solves one problem, that of the excessive file descriptors. A nice by-product of this solution is that each virtualhost no longer needs individual TransferLog and ErrorLog statements. Two additional lines of text per virtualhost can really add up to a large amount of extra text in the httpd.conf file when there are 500+ virtualhosts listed.
The Apache distribution ships with a very useful Perl script called split-logfile. Given an access_log in a certain format, this script will split the monolithic log into its respective parts. The script returns the new individual log files to a standard format by removing the field it uses to identify the virtualhost. Finally, split-logfile appends to existing files with the same name, allowing the main log to be split once or more per day while maintaining continuity in the individual virtualhost logs.
The individual log files can be handled in the same way as a standard access_log: run through log analyzers, and so on. Because Apache does not hold these files open, they can be edited, deleted, or moved without having to signal the Apache daemon. The drawback is that they are not updated in real time, only as often as the main log is split.
First, we must modify Apache's setup so that log files are written in a format that split-logfile can read. Apache's mod_log_config module allows for granular control of logging in the server. One of its nice features is the ability to determine which fields we want to have logged. A few default examples are already provided in the httpd.conf file when Apache is installed. I prefer to use the combined referer and agent log format. The default format for this log is as follows:
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Apache is told to use a specific log format through the CustomLog statement, which references the name given to that format ('combined' in the above example). The complete directive is as follows:
CustomLog /usr/local/apache/var/log/access_log combined
In order to determine which virtualhost a log entry belongs to, the name of the virtualhost must be added to each log entry. Luckily, this is easy to do with the LogFormat directive. The token used to represent the virtualhost name is %v. As split-logfile requires this to be the first field in the log, our new LogFormat directive appears as follows:
LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
Since most log analysis programs will not recognize the format of the log files with the additional virtualhost field at the beginning, split-logfile strips this field out of each entry as it processes the log, leaving a standard combined format log file.
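To illustrate what the extra field looks like and what stripping it leaves behind, here is a small sketch. The log entry is invented for illustration, and strip_vhost is a hypothetical helper that only mimics the effect described; it is not how split-logfile itself is written.

```shell
#!/bin/sh
# Strip the leading virtualhost field from log lines read on stdin.
# This mimics the effect described above; it is not split-logfile itself.
strip_vhost() {
    sed 's/^[^ ]* //'
}

# Hypothetical log entry in the modified "combined" format (note the
# virtualhost name in the first field):
line='www.example.com 192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.08"'

printf '%s\n' "$line" | strip_vhost
```

The printed line begins with the client address, exactly as a log analyzer expecting the standard combined format would want.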
We also require a specific directory structure to be set up for our new logging methods. Apache normally writes its logs to /usr/local/apache/var/log. Within this directory we will create two additional directories, old-logs and cust-logs. The old-logs directory is used to store the older, compressed, pre-split logs for the system. Local retention policies and tape archiving policies will determine how long you want to keep these logs. I generally keep them for about a month or so, or until I notice disk space beginning to get tight on the device. The cust-logs directory is used to store the actual virtualhost log files after they have been split. Because of the number of virtualhosts per server and the size of these logs, we usually dedicate a disk of between 4GB and 9GB to them. Because of the way split-logfile works, it is necessary to store the script in the cust-logs directory. This directory should probably be world readable so users can look at or copy their log files.
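The directory layout can be created with a few commands. This is a minimal sketch: setup_log_dirs is a hypothetical helper, the default path is the one used in this article, and mode 755 is one assumed way of making cust-logs world readable while keeping it writable only by its owner.

```shell
#!/bin/sh
# Create the archive and per-virtualhost directories under the log directory.
# The default path is the one used in this article; pass another path to
# experiment without touching a live server.
setup_log_dirs() {
    logdir=${1:-/usr/local/apache/var/log}
    mkdir -p "$logdir/old-logs" "$logdir/cust-logs" || return 1
    chmod 755 "$logdir/cust-logs"   # world readable, writable only by owner
}

# Example (hypothetical scratch path):
#   setup_log_dirs /tmp/apache-log-test
```

If cust-logs lives on its own disk, the device would be mounted at that path before the directory is populated.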
split-logfile and logcron Scripts
split-logfile is a very simple Perl script which parses a monolithic log file and splits it, line by line, into numerous separate log files based on the first token on each line. If you made the changes to your LogFormat described above, this first token will be the name of the virtualhost that the log entry belongs to. The script also strips that first token out of the logs, as the extraneous field would confuse most log analysis programs.
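The technique itself is easy to sketch. The following uses awk rather than Perl and is not the script shipped with Apache; the function name split_by_vhost, the vhost.log file naming, the lowercasing of the name, and the check that skips odd-looking virtualhost names are all assumptions of this sketch.

```shell
#!/bin/sh
# Sketch of the splitting technique, NOT the Perl script shipped with Apache.
# Reads a log with the virtualhost as the first field on stdin and appends
# each line, minus that field, to <outdir>/<vhost>.log.
split_by_vhost() {
    outdir=${1:-.}
    awk -v dir="$outdir" '
        NF > 0 {
            vhost = tolower($1)
            # Assumption: skip names with unexpected characters so a bad
            # entry can never produce a path outside the output directory.
            if (vhost !~ /^[a-z0-9._-]+$/) next
            out = dir "/" vhost ".log"
            sub(/^[^ \t]+[ \t]+/, "")   # drop the first field from the line
            print >> out                # append, preserving earlier contents
            close(out)                  # avoid holding hundreds of files open
        }
    '
}

# Example (hypothetical paths):
#   split_by_vhost /usr/local/apache/var/log/cust-logs < access_log
```

Closing each file after writing is slower than keeping it open, but it keeps the sketch from hitting the same descriptor limits that motivated the single log in the first place.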
One minor shortcoming of split-logfile as it is distributed with Apache is that it will only write its output logs into its current working directory, hence the need to place the script in the cust-logs directory, as described above.
The split-logfile script can be found in the src/support directory of the Apache distribution. For convenience, the script is also available here. Note that this script is the intellectual property of the Apache Group, which holds all applicable copyrights. Reproduction of the script on this website does not indicate that the website author provides any kind of warranty or support for this script.
While split-logfile is extremely useful, it doesn't achieve its true potential without an external wrapper program which handles all aspects of log file rotation and archival. Thus logcron, a fairly simple Korn shell script, was developed locally. logcron is available here.
The script is intended to be run through the cron facility on a daily basis, or at whatever interval the local administrator desires. Please note that if the script is to be run more than once daily, some minor modifications to the archival process will be necessary, as it relies on the month and day for file identification. The script is heavily commented and should be easy to follow. The basic procedure followed by logcron is:
1. Move the access_log and error_log to the archive directory and rename them with the date appended to the filename to make it unique.
2. Send Apache a USR1 signal so that it closes the old file descriptors and starts writing to the new ones.
3. Run split-logfile on the access_log we just archived.
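The steps above can be sketched as a short shell function. This is not the author's logcron script: the function name rotate_and_split, the PID file location, and the pause after signalling Apache are assumptions made for illustration, and compression of the archived logs is left out.

```shell
#!/bin/sh
# Sketch of the three steps above; not the author's logcron script.
# The default paths and the PID file location are assumptions.
rotate_and_split() {
    logdir=${1:-/usr/local/apache/var/log}
    pidfile=${2:-/usr/local/apache/var/run/httpd.pid}
    wait_secs=${3:-60}
    stamp=$(date +%m%d)              # month and day, as described

    # 1. Move the live logs into the archive under a dated name.
    for f in access_log error_log; do
        [ -f "$logdir/$f" ] && mv "$logdir/$f" "$logdir/old-logs/$f.$stamp"
    done

    # 2. Send USR1 so Apache reopens its logs, then give the old child
    #    processes time to finish writing to the archived file.
    if [ -r "$pidfile" ]; then
        kill -USR1 "$(cat "$pidfile")"
        sleep "$wait_secs"
    fi

    # 3. Split the archived access log from inside cust-logs, because
    #    split-logfile writes to its current working directory.
    if [ -f "$logdir/old-logs/access_log.$stamp" ] &&
       [ -x "$logdir/cust-logs/split-logfile" ]; then
        ( cd "$logdir/cust-logs" &&
          ./split-logfile < "$logdir/old-logs/access_log.$stamp" )
    fi
}

# Example (hypothetical): rotate_and_split /usr/local/apache/var/log
```

Because the date stamp contains only the month and day, running this more than once a day would overwrite the earlier archive; adding the hour to the stamp is one way around that.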
We run logcron through cron once daily at 11:55 PM, as it can sometimes take a minute or two to copy the monolithic log file to the archive directory and restart Apache. Thorough instructions on running the script from cron are provided in the comments of the script itself for those who are not familiar with the procedure.
logcron is designed to allow unlimited local customization and expansion. Some examples might be using the script to spawn log analysis programs or bandwidth monitoring tools. If, to increase performance, you do not perform reverse DNS lookups while the webserver is serving requests, the lookups can be performed through logcron before the monolithic log file is passed to split-logfile. It is also possible to use logcron to perform other log maintenance, such as removing old logs on a monthly basis, or even to do general Apache maintenance on a specific timeline.
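As one example of such an extension, removing old archives might look like the sketch below. The function name prune_old_logs is hypothetical, and the 30-day default simply mirrors the "about a month" retention mentioned earlier; local retention policy should decide the real value.

```shell
#!/bin/sh
# Remove archived logs older than a given number of days.
# The default directory and the 30-day default are assumptions.
prune_old_logs() {
    archive=${1:-/usr/local/apache/var/log/old-logs}
    days=${2:-30}
    find "$archive" -type f -mtime +"$days" -exec rm -f {} \;
}

# Example (hypothetical): prune_old_logs /usr/local/apache/var/log/old-logs 30
```

Replacing the rm with a simple print action first is a sensible way to check what would be deleted before trusting the function in a cron job.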
Through the use of a very handy Perl script included with Apache, a simple custom-written Korn shell script, and a few simple modifications to Apache's configuration directives, we have brought the task of virtualhost logging under control at our site. With this method we have not only escaped the problem of running out of file descriptors (which can bring down or cripple hundreds or even thousands of sites), but we have also turned logging into a centralized, organized procedure. Rather than being scattered throughout the file system, the logs are now all maintained on a dedicated disk device where analysis and maintenance can be run more easily. The key goal of this project is a more robust environment which can be easily customized to local administration practices.