Data warehousing Web server log files

Jan

I have begun to create a data warehouse for CRRA (VuFind) Web server log files. This posting introduces the topic.

The problem

There is an understandable need/desire to know how well the "Catholic Portal" is operating. But for the life of me I was not able to enumerate metrics defining success. On the other hand, Pat Lawton had no problem listing quite a few. Here are most of her suggestions:

Are users looking at records?
Are users searching in English? Other languages?
Are users using field searches?
Can we get a sense of the number of records viewed per search?
Do we know how many searches resulted in zero hits?
How many hits came from a google search result? Or other search engine?
How many hits per day?
How many times were each institution’s records viewed?
How many times were the Web 2.0 things used?
How many users set up an account?
How often were the tabs at the top clicked on?
Per searches where records were looked at?
What is the average number of hits retrieved per search?
What percentage of queries resulted in an error message?
What sorts of search strings are entered?
When are the peak periods of use? Is there a pattern?
Where are users coming from?
Which geographic locations and types of institutions?

If you know about Web (Apache) server log files, you know that that answers to many of these questions can be found there, sort of. If you are Web server administrator who deals with these log files, then you probably know about Analog, Webalizer, and Google Analytics. These tools can answer many of the questions, above, but the information would need to be "gleaned" from the reports. Time consuming at best and rather frustrating.

So the problem is, how do I generate regular or on-demand reports answering the questions listed above?

The solution

The initial solution was to write some sort of computer program regularly reading log files, and outputing the desired answers. Upon reflection, this would be tedious because the business logic -- the questions needing answered -- would either be hard-coded into the program, or the program would require an abunanced of command line switches. Complicated and not very flexible. Remember, good computer programs are programs that do one thing and do it well -- the "Unix Way".

Instead, the solution will be to first create a database -- a "data warehouse" -- containing log file content, and second, to provide a front-end to the database enabling people to query it. With this approach, counting the number of times anything occurs could be as easy as a single SQL (Structured Query Language) query as opposed to tabulating 10s of thousands of log file entries.

To date, the database is simple and defined by the following MySQL-specific SQL statement:

  CREATE TABLE IF NOT EXISTS `crra_logs`.`transactions` (
	`id`          INT           NOT NULL  AUTO_INCREMENT PRIMARY KEY,
	`host`        VARCHAR(128)  NOT NULL,
	`username`    VARCHAR(16)   NOT NULL,
	`password`    VARCHAR(16)   NOT NULL,
	`datetime`    DATETIME      NOT NULL,
	`timezone`    VARCHAR(8)    NOT NULL,
	`method`      VARCHAR(8)    NOT NULL,
	`request`     VARCHAR(1024) NOT NULL,
	`protocol`    VARCHAR(8)    NOT NULL,
	`statuscode`  VARCHAR(8)    NOT NULL,
	`bytessent`   INT           NOT NULL,
	`referrer`    VARCHAR(1024) NOT NULL,
	`useragent`   VARCHAR(1024) NOT NULL,
	`hosttype`    VARCHAR(16)   DEFAULT 'unknown',
	`requesttype` VARCHAR(16)   DEFAULT 'unknown'
  );

The astute Web server administrator will notice how the database's structure mirrors almost exactly an Apache "combined" log file, with the following exceptions:

id is a unique key
datetime is a reformulation of the time stamp found in the Apache's logs
hosttype and requesttype are fields used to classify transactions, explained below

I then wrote a Perl script to read log files, parse each entry into fields, and stuff the result into the database. ("Thank goodness for regular expressions!") Once this is done it is almost trivial to answer questions like this:

How many different computers from the University of Notre Dame used the "Portal"? - SELECT COUNT(host) AS c, host FROM transactions WHERE host LIKE '%.nd.edu' GROUP BY host ORDER BY c DESC
What are the 100 most popular requests sent to the server? - SELECT COUNT(request) AS c, request FROM transactions GROUP BY request ORDER BY c DESC LIMIT 100
My computer's address is lib-1234.library.nd.edu. What requests did I make against the Portal on December 13, 2010, and in what order? - SELECT datetime, request FROM transactions WHERE host = 'lib-1234.library.nd.edu' AND datetime LIKE '2010-12-13%' ORDER BY datetime ASC

Unfortunately, without some extra knowledge answering Pat's questions is still problematic. For example, how does one count "hits" against the Portal when requests from Internet robots and spiders bloat the input? How does one accurately count searches for content and record views when so many of the requests include calls for images, javacript files, and cascading stylesheets?

The answers lie in the use of classification as well as the hosttype and requesttype fields. Many (most) of the "hits" on the Portal come from a computer named googlebot.com. I know this is a robot, and I can flag database records accordingly with the following SQL -- UPDATE transactions SET hosttype = 'robot' WHERE host LIKE '%.googlebot.com'. Once I do this for all the robots hitting the Portal, I can accurately answer the question, "What computers operated by humans use the Portal the most?" -- SELECT COUNT(host) AS c, host FROM transactions WHERE hosttype <> 'robot' GROUP BY host ORDER BY c DESC LIMIT 100.

Because VuFind uses HTTP GET methods almost exclusively, all transactions are saved in the Web server log files. These transactions have patterns. Searches contain the string "?lookfor=". Record views all start with "/Record/". Requests for supporting content contain things like ".gif", ".css", ".js", etc. Consequently it is easy to classify the requests with SQL statements like this -- UPDATE transactions SET requesttype = 'record' WHERE request LIKE '/Record/%'. Now it is really easy to count the most frequent record views by humans -- SELECT COUNT(request) AS c, request FROM transactions WHERE hosttype <> 'robot' AND requesttype = 'record' GROUP BY request ORDER BY c DESC LIMIT 100.

Much of the work described above has been implemented in a handful of files -- 4 SQL files and 1 Perl script -- available for downloading. More classification work needs to be done but the foundation has been layed. The next big steps include automating the ingestion of new log file content and building a user interface to query the database.

Summary

Log file analysis will be greatly simplified through the use of data warehousing technqiues, and the consistently structured requests implemented by VuFind will make it much easier to learn who is using the Portal and how.

Data warehousing Web server log files

The problem

The solution

Summary

Comments on "Data warehousing Web server log files"

Comments 0-5 of 0

Announcements

Follow us on Facebook