Java Program: Apache Logfile Parsing
the web’s history. It is one of the world’s best-known open source projects, and one
of many fostered by the Apache Foundation. But the name Apache is a pun on the
origins of the server; its developers began with the free NCSA server and kept
hacking at it or “patching” it until it did what they wanted. When it was sufficiently
different from the original, a new name was needed. Since it was now “a patchy server,”
the name Apache was chosen. One place this patchiness shows through is in the log
file format. Consider this entry:
123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
The file format was obviously designed for human inspection but not for easy
parsing. The problem is that different delimiters are used: square brackets for the date,
quotes for the request line, and spaces sprinkled all through. Consider trying to use a
StringTokenizer; you might be able to get it working, but you’d spend a lot of time
fiddling with it. However, this somewhat contorted regular expression makes it easy
to parse:
^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
You may find it informative to refer back to Table 4-2 and review the full syntax used
here. Note in particular the use of the non-greedy quantifier +? in "(.+?)" to match
a quoted string; you can’t just use .+ since that would match too much (up to the
quote at the end of the line). Code to extract the various fields such as IP address,
request, referer URL, and browser version is shown in the following example:
LogRegExp.java

    import java.util.regex.*;

    /**
     * Parse an Apache log file with Regular Expressions
     */
    public class LogRegExp implements LogExample {

        public static void main(String argv[]) {

            String logEntryPattern = "^([\\d.]+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\" (\\d{3}) (\\d+) \"([^\"]+)\" \"([^\"]+)\"";

            System.out.println("Using regex Pattern:");
            System.out.println(logEntryPattern);

            System.out.println("Input line is:");
            System.out.println(logEntryLine);

            Pattern p = Pattern.compile(logEntryPattern);
            Matcher matcher = p.matcher(logEntryLine);
            if (!matcher.matches() || NUM_FIELDS != matcher.groupCount()) {
                System.err.println("Bad log entry (or problem with regex?):");
                System.err.println(logEntryLine);
                return;
            }
            System.out.println("IP Address: " + matcher.group(1));
            System.out.println("Date&Time: " + matcher.group(4));
            System.out.println("Request: " + matcher.group(5));
            System.out.println("Response: " + matcher.group(6));
            System.out.println("Bytes Sent: " + matcher.group(7));
            if (!matcher.group(8).equals("-"))
                System.out.println("Referer: " + matcher.group(8));
            System.out.println("Browser: " + matcher.group(9));
        }
    }
The implements clause is for an interface that just defines the input string; it was used
in a demonstration to compare the regular expression mode with the use of a
StringTokenizer. The source for both versions is in the online source for this
chapter. Running the program against the sample input shown above gives this output:
    Using regex Pattern:
    ^([\d.]+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) (\d+) "([^"]+)" "([^"]+)"
    Input line is:
    123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "GET /java/javaResources.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)"
    IP Address: 123.45.67.89
    Date&Time: 27/Oct/2000:09:27:09 -0400
    Request: GET /java/javaResources.html HTTP/1.0
    Response: 200
    Bytes Sent: 10450
    Browser: Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)
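The LogExample interface itself is not reproduced here. For readers who want to compile the program without the online source, a minimal sketch of what it might contain follows. It assumes only the two names the program actually uses, logEntryLine and NUM_FIELDS; the real interface may well differ.

```java
/**
 * Hypothetical stand-in for the LogExample interface; the real one lives in
 * the online source and may differ. It supplies only the two constants that
 * LogRegExp refers to.
 */
interface LogExample {

    /** Number of capture groups the log entry pattern is expected to have. */
    int NUM_FIELDS = 9;

    /** The sample log entry shown earlier, as a Java string literal. */
    String logEntryLine =
        "123.45.67.89 - - [27/Oct/2000:09:27:09 -0400] "
        + "\"GET /java/javaResources.html HTTP/1.0\" 200 10450 \"-\" "
        + "\"Mozilla/4.6 [en] (X11; U; OpenBSD 2.8 i386; Nav)\"";
}
```

Fields declared in an interface are implicitly public, static, and final, which is why the program can use logEntryLine and NUM_FIELDS from its static main method without qualification.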
The program successfully parsed the entire log file format with one call to
matcher.matches().
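The earlier remark about the non-greedy quantifier can be checked with a small experiment. The sketch below isolates the quoted-string subpattern and applies it, once with .+ and once with .+?, to a shortened line containing several quoted fields. The class name QuantifierDemo, the helper firstGroup, and the shortened input are invented for illustration and are not part of the original program.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Compares greedy .+ with reluctant .+? on a line with several quoted fields. */
public class QuantifierDemo {

    /** Returns group 1 of the first match of regex in input, or null if none. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened, made-up log fragment: request, status, bytes, referer, browser
        String line = "\"GET /index.html HTTP/1.0\" 200 10450 \"-\" \"Mozilla/4.6\"";

        // Greedy: .+ runs to the end of the line, then backs up to the last quote
        System.out.println("Greedy:    " + firstGroup("\"(.+)\"", line));

        // Reluctant: .+? stops at the first closing quote it can
        System.out.println("Reluctant: " + firstGroup("\"(.+?)\"", line));
    }
}
```

Run on its own, this prints:

    Greedy:    GET /index.html HTTP/1.0" 200 10450 "-" "Mozilla/4.6
    Reluctant: GET /index.html HTTP/1.0

The greedy form swallows everything up to the final quote on the line, while the reluctant form captures only the request.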