Analysing web server logs for WordPress

I’ve found the WordPress stats plugins a bit annoying at times, so I wrote my own web server log analyser. It runs on the command line. Most web stats plugins seem to rely on JavaScript in the pages delivered, so page requests from users who have JavaScript turned off may not be counted. However, if you have access to the web server logs then you have an amazing source of information.

This program looks through the logs and works out which IP addresses have been used to access the admin pages or used by a bot (such as Google’s Googlebot). Requests from these IPs can then be excluded from the overall page counts for the site. The page count also ignores requests for images, RSS feeds or anything else that doesn’t look like a page request. I originally wrote it for a website serving MP3s, so it counts MP3 requests separately. The code is commented, so it should be easy to change.
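As a rough sketch of the first stage, the snippet below classifies IPs from a few made-up log lines. It assumes Apache "combined" log format. The helper name classify_ips(), the sample lines and the short bot list are illustrative only; the full script further down reads from a file and uses a much longer bot list.

```php
<?php
// Hypothetical sample lines in Apache "combined" log format (not real traffic).
$lines = array(
    '203.0.113.5 - - [10/Oct/2010:13:55:36 +0000] "GET /about/ HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '203.0.113.9 - - [10/Oct/2010:13:56:01 +0000] "GET /wp-admin/index.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux)"',
    '198.51.100.7 - - [10/Oct/2010:13:57:12 +0000] "GET /about/ HTTP/1.1" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
);

// classify_ips() is an illustrative helper, not a function from the script.
// Returns an array of IP => "0" (user), "1" (admin) or "2" (bot).
function classify_ips(array $lines, array $botsubmatches) {
    $ips = array();
    foreach ($lines as $line) {
        // the IP is everything up to the first space
        $ip = substr($line, 0, strpos($line, " "));
        if (!isset($ips[$ip])) {
            $ips[$ip] = "0";
        }

        // any request for an admin URL marks the IP as admin
        $i = strpos($line, "GET ");
        if ($i !== false) {
            $url = substr($line, $i + 4, strpos($line, " ", $i + 4) - $i - 4);
            if (strpos($url, "wp-admin") !== false) {
                $ips[$ip] = "1";
            }
        }

        // the user agent is the last quoted field on the line
        $ua = substr($line, 0, -1);
        $ua = strtoupper(substr($ua, strrpos($ua, '"') + 1));
        foreach ($botsubmatches as $botmatch) {
            if (strpos($ua, $botmatch) !== false) {
                $ips[$ip] = "2";
            }
        }
    }
    return $ips;
}

$ips = classify_ips($lines, array("BOT", "SPIDER"));
print_r($ips);
```

Once an IP is marked as admin or bot, every request from it is left out of the counts in the second stage, even requests that looked like ordinary page views.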

Copy and paste into a file called “logs.php” and run from the command line with:

php logs.php logfilename

If no filename is given then it defaults to “access-log”.

This code is available under the GPL. Feel free to use it and tweak it any way you want. Here’s the code:


<?php
error_reporting(E_ALL);
ini_set("display_errors", 1);

/*
Script to parse WordPress log files.
Counts page requests from IPs which have not been used by a bot or an admin.

works in 2 stages:
1. go through log, pulling out all unique IPs, marking IPs used by bots
and admins
2. go back through entire log, pulling out unique URLs, from IPs not used
by bot or admin, counting number of requests for each
(and counting up non-bot non-admin page requests by ignoring ".jpg",
".png", etc and also counting non-bot non-admin feed hits separately)

Whether a user agent is a bot is determined by looking for substrings common in
bot names. I've added all the ones that have been visiting me.

I recommend setting $outputuseragents=true and checking for bots and adding them to
the array $botsubmatches (in capitals), or set $outputnonmozuseragents (outputs
only user agents not starting in "Mozilla") if the list is huge, but be warned that
some obvious bots do start with "Mozilla".

By James Wilkes

*/

// **************************
// *** edit these values: ***
// **************************
$filename = "access-log";

// display all unique URLs requested by real users and counts
$outputpages = true;
$outputpagesmin = 10;

// display breakdown by date of all unique URLs requested by real users and counts
$outputpagesbydate = true;

// display all unique user-agents used by real users (handy for spotting bots)
$outputuseragents = false;

// display all non-Mozilla browsers (handy for spotting bots to add)
$outputnonmozuseragents = false;

// display all user-agent names tagged as bot (handy for spotting false positives on bot check)
$outputbotnames = false;

// these must be UPPERCASE
$botsubmatches = array(
"BOT", "SPIDER", "SURPHACE", "PEAR",
"JAVA", "FACEBOOK", "ATTENTIO", "INCUTIO", "IA_ARCHIVER",
"CORALWEB", "SQUIDOO", "PYCURL", "PERL", "WORDPRESS", "RADIAN",
"ROME CLIENT", "JUSTSIGNAL", "WGET",
"YAHOO", "PYTHON", "YANDEX", "CMS SURVEY", "FEEDFETCHER",
"FEEDBURNER", "BLOGPULSELIVE", "POSTRANK", "JAKARTA",
"LARBIN", "TWINGLY", "CRAWLER", "SCOUT", "ZEND_HTTP_CLIENT",
"UNIVERSALFEEDPARSER", "FINLY", "WHOIS365", "OSSPROXY",
"GOOGLE-SITE-VERIFICATION", "MOZILLA/4.0 (COMPATIBLE;)", "TWICELER"
);
// *********************************
// *** end of edit these values: ***
// *********************************

// Here we go...
echo "Hello world\n";

$linecount = 0;    // total lines in log file
$ips = array();    // unique IPs, marked 0=user, 1=admin, 2=bot
$useragents = array();    // useragents
$botnames = array();    // bot names

if (sizeof($argv) > 1 ) {
$filename = $argv[1];
} else {
echo "No filename given so defaulting to " . $filename . "\n";
}

$fh = fopen($filename, "rb");

if ($fh === false) {
echo "Could not open file '" . $filename . "'\n";
exit(1);    // stop with a non-zero exit status
}

// get unique IPs, mark IPs as bot or admin, 0=user, 1=admin, 2=bot
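while (($oneline = fgets($fh)) !== false) {    // fgets() returns false at end of file

$oneline = trim($oneline);

// skip blank lines so they are not counted as an empty IP
if ($oneline === "") {
continue;
}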

$i = strpos($oneline, " ");

$ip = "" . substr($oneline, 0, $i);

if (!array_key_exists($ip, $ips)) {
$ips[$ip] = "0";        // default: not admin
}

$i = strpos($oneline, "GET ");
if ($i !== false) {
$url = substr($oneline, $i + 4, strpos($oneline, " ", $i + 4) - $i - 4);
if (strpos($url, "wp-admin") !== false || strpos($url, "cpanel") !== false) {
$ips[$ip] = "1";        // mark IP as admin
}
}

// strip last "
$useragent = substr($oneline, 0, -1);
$useragent = "" . substr($useragent, strrpos($useragent, "\"") + 1);

// check if it matches bot list
$useragentuppercase = strtoupper($useragent);
foreach ($botsubmatches as $botmatch) {
if (strpos($useragentuppercase, $botmatch) !== false) {
$ips[$ip] = "2";        // mark IP as bot
if (!array_key_exists($useragent, $botnames)) {
$botnames[$useragent] = $botmatch;
}
}
}

// if not marked as bot in all that then add to legit user agents
if ($ips[$ip] == "0") {
if (!array_key_exists($useragent, $useragents)) {
$useragents[$useragent] = true;
}
}

$linecount++;
}

$admincount = 0; // admin
$botcount = 0; // bot
$nonadminnonbotcount = 0; // non-admin, non-bot
foreach ($ips as $key => $value) {
if ($value == "2") {
$botcount++;
} else if ($value == "1") {
$admincount++;
} else {
$nonadminnonbotcount++;
}
}

// going through again, leaving out admins and bots

$pages = array();
$mp3count = 0;
$pagecount = 0;
$feedcount = 0;

$pagesbydate = array();

// rewind to the start of the log for the second pass (the file is still open)
rewind($fh);

while (($oneline = fgets($fh)) !== false) {    // fgets() returns false at end of file

$oneline = trim($oneline);

$i = strpos($oneline, " ");
$ip = "" . substr($oneline, 0, $i);

$i = strpos($oneline, "GET ");
if ($i !== false) {
$url = "" . substr($oneline, $i + 4, strpos($oneline, " ", $i + 4) - $i - 4);

if ($ips[$ip] == "0") {
if (strpos($url, ".jpg") === false) {
if (strpos($url, ".png") === false) {
if (strpos($url, ".swf") === false) {
if (strpos($url, ".css") === false) {
if (strpos($url, ".gif") === false) {
if (strpos($url, ".txt") === false) {
if (strpos($url, ".js") === false) {
if (strpos($url, ".ico") === false) {
if (strpos($url, "/wordpress/") === false) {

if (strpos($url, ".mp3") !== false) {
$mp3count++;
} else if (strpos($url, "feed") !== false || strpos($url, "rss") !== false) {
$feedcount++;
} else {
$pagecount++;

if (array_key_exists($url, $pages)) {
$pages[$url] = $pages[$url] + 1;
} else {
$pages[$url] = 1;
}

$i = strpos($oneline, "[");
$j = strpos($oneline, ":", $i);
if (($i > 0) && ($j > 0)) {
$datestring = substr($oneline, $i + 1, $j - $i - 1);
if (array_key_exists($datestring, $pagesbydate)) {
$pagesbydate[$datestring] = $pagesbydate[$datestring] + 1;
} else {
$pagesbydate[$datestring] = 1;
}
}
}

}
}
}
}
}
}
}
}
}
}
}
}

fclose($fh);    // done reading the log

echo "\n";

echo "*********************\n";
echo "***    LOG FILE   ***\n";
echo "*********************\n";
echo "filename = " . $filename . "\n";
echo "linecount = " . $linecount . "\n";

echo "\n";

echo "*********************\n";
echo "***    IP DATA    ***\n";
echo "*********************\n";
echo "unique IPs = " . sizeof($ips) . "\n";
echo "unique bot IPs = " . $botcount . "\n";
echo "unique admin IPs = " . $admincount . "\n";
echo "non admin, non bot unique IPs = " . $nonadminnonbotcount . " real users\n";

echo "\n";

echo "*********************\n";
echo "*** REQUEST DATA  ***\n";
echo "*********************\n";
echo "unique urls = " . sizeof($pages) . "\n";
echo "unique useragents = " . sizeof($useragents) . "\n";
echo "unique botnames = " . sizeof($botnames) . "\n";

echo "mp3count  (excl bots + admins) = " . $mp3count . "\n";
echo "feedcount (excl bots + admins) = " . $feedcount . "\n";
echo "pagecount (excl bots + admins) = " . $pagecount . " real pages delivered to real users \n";

echo "\n";

// display all unique URLs requested by real users and counts
if ($outputpages) {
echo "************************\n";
echo "***  PAGES (BY URL)  ***\n";
echo "************************\n";
asort($pages);
$t = 0;
foreach ($pages as $key => $val) {
if ($val >= $outputpagesmin) {
echo $key . " = " . $val . "\n";
$t += $val;
}
}
//    echo "Total=" . $t;
echo "\n";
}

// display count of all real user page hits by date
if ($outputpagesbydate) {
echo "***************************\n";
echo "***  PAGE HITS BY DATE  ***\n";
echo "***************************\n";
$t = 0;
foreach ($pagesbydate as $key => $val) {
echo $key . " = " . $val . "\n";
$t += $val;
}
//    echo "Total=" . $t;
echo "\n";
}

ksort($useragents);    // the user agent strings are the array keys, so sort by key
// display all unique user-agents used by real users (handy for spotting bots)
if ($outputuseragents) {
echo "**************************\n";
echo "***  REAL USER AGENTS  ***\n";
echo "**************************\n";
foreach ($useragents as $key => $val) {
echo $key . "\n";
}
echo "\n";
}

// display all non-Mozilla browsers (handy for spotting bots to add)
if ($outputnonmozuseragents) {
echo "*********************************\n";
echo "***  NON-MOZILLA USER AGENTS  ***\n";
echo "*********************************\n";
foreach ($useragents as $key => $val) {
if (substr($key, 0, 7) != "Mozilla") {
echo $key . "\n";
}
}
echo "\n";
}

// display all user agent strings which have been flagged as bots
// (handy for spotting false positives on bot check)
if ($outputbotnames) {
asort($botnames);
echo "*************************************\n";
echo "***  USER AGENTS FLAGGED AS BOTS  ***\n";
echo "*************************************\n";
foreach ($botnames as $key => $val) {
echo $key . "\n";
}
echo "\n";
}

echo "*********************\n";
echo "***    IN SHORT   ***\n";
echo "*********************\n";
echo "Note: The page count is accurate, the user count will be high as\n";
echo "it includes everyone who connects to the internet with a different\n";
echo "IP address each time.\n";
echo "\n";
echo $pagecount . " real pages, " . $mp3count . " mp3s delivered to " . $nonadminnonbotcount . " IPs of real users \n";

?>
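
The script pulls the IP, URL, date and user agent out of each line with strpos() and substr(). As a rough alternative sketch, assuming the log is in Apache "combined" format, a single preg_match() can capture the same fields in one go. The helper parse_log_line() and the sample line are made up for illustration and are not part of logs.php.

```php
<?php
// parse_log_line() is a hypothetical helper, not part of logs.php.
// Assumes Apache "combined" log format; returns null if the line does not match.
function parse_log_line($line) {
    $pattern = '/^(\S+) \S+ \S+ \[([^:\]]+):[^\]]*\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$/';
    if (preg_match($pattern, trim($line), $m) !== 1) {
        return null;
    }
    return array(
        "ip"        => $m[1],
        "date"      => $m[2],    // day part only, e.g. 10/Oct/2010
        "method"    => $m[3],
        "url"       => $m[4],
        "status"    => (int) $m[5],
        "useragent" => $m[8],
    );
}

// Made-up sample line, not real traffic.
$sample = '203.0.113.5 - - [10/Oct/2010:13:55:36 +0000] "GET /about/ HTTP/1.1" 200 2326 "-" "Mozilla/5.0 (Windows NT 6.1)"';
$hit = parse_log_line($sample);
echo $hit["ip"] . " " . $hit["date"] . " " . $hit["url"] . "\n";
```

A regular expression is stricter than the substring approach: lines that don't fit the expected format come back as null instead of producing half-parsed values, which may or may not be what you want for a messy log.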
