As you already know, we recently launched our new VoIP termination website, and we work hard to keep it up to date and good-looking. Even so, there are times when you make simple link changes to your website and then run a complete site automation test with a tool such as Selenium.

We wanted to automate the check for broken links after each new deployment of the site (we deploy with Capistrano), so we needed a simple web crawler to run the tests. We chose wget:

wget --mirror --no-directories --delete-after http://www.example.com 2> ./examplecom-wgetlog.txt

This tells wget to mirror the entire site, deleting each file once it has been downloaded, and writes wget's log output (which goes to stderr, hence the 2> redirect) to examplecom-wgetlog.txt.
Once it completes, you can print the errors with:

grep -B4 ERROR examplecom-wgetlog.txt

This prints each error together with the lines that precede it, for example:

--2012-08-28 12:21:04--  http://www.commpeak.com/robots.txt
Connecting to www.commpeak.com|108.162.194.180|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2012-08-28 12:21:04 ERROR 404: Not Found.
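
If you only want the failing URLs rather than the surrounding log lines, the log can be condensed with awk. The following is a minimal sketch that assumes the log layout shown above (a line starting with "--" and a timestamp that carries the URL, followed later by an ERROR line); the function name broken_links is hypothetical and not part of our script.

```shell
#!/bin/bash
# broken_links LOGFILE
# Print "<status> <url>" for every ERROR line in a wget log.
# Assumes the log layout shown in the sample output above.
broken_links() {
    awk '
        # Request line, e.g. "--2012-08-28 12:21:04--  http://host/path":
        # remember the URL (third whitespace-separated field).
        /^--[0-9]/ { url = $3 }
        # Error line, e.g. "2012-08-28 12:21:04 ERROR 404: Not Found.":
        # strip the trailing colon from the status and print it with the URL.
        $3 == "ERROR" { code = $4; sub(/:$/, "", code); print code, url }
    ' "$1"
}

# Usage (with the log produced by the wget command above):
#   broken_links examplecom-wgetlog.txt
# For the sample log shown above this prints:
#   404 http://www.commpeak.com/robots.txt
```

The output is one line per broken link, which is easier to scan in a mail report than the raw grep context.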

And now for the fun part: we set up a simple cron job that monitors the modification time (mtime) of a deployed file. When the mtime changes, we run a script that performs the wget test. If the test finds errors (grep exits with status 0, meaning it matched an ERROR line), the script mails us the report so we can fix the problem immediately.

#!/bin/bash
#
# Simple script to run wget in order to catch 404 and other annoying stuff.
#

MAILTO="[email protected]"

CMDLINE='wget --verbose --mirror --no-directories --delete-after domain.example'
LOGFILE="/tmp/.$RANDOM.wgetlog.txt"

# This tracks the mtime of $FILETOTRACK at the time the last test was run.
TRACKFILE="/tmp/.domain.lastdeploytest"
FILETOTRACK="/var/www/index.html"

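# Make sure nobody but us can read or modify the files we create.
umask 077
if [ ! -e "$TRACKFILE" ]; then
 # First run: seed the track file with a timestamp older than the deployed
 # file's mtime, so the comparison below fails and the test runs once.
 /usr/bin/env touch "$TRACKFILE"
 /usr/bin/env chmod 600 "$TRACKFILE"
 /usr/bin/env expr "$(/usr/bin/env stat -c %Y "$FILETOTRACK")" - 100 > "$TRACKFILE"
fi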

TRACKDATA=$(cat "$TRACKFILE")
if [ "$(/usr/bin/env stat -c %Y "$FILETOTRACK")" = "$TRACKDATA" ]; then
 # Our track data matches the timestamp of deployed file, no need to run test again.
 echo "$TRACKFILE timestamp matches $FILETOTRACK, no work is necessary."
 # Nothing to do is not a failure, so exit with status 0.
 exit 0
fi

# Remove any stale log file left over from an earlier run.
if ! /usr/bin/env rm -f "$LOGFILE"; then
 /usr/bin/env echo "Fatal: unable to delete $LOGFILE" >&2
 exit 1
fi

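/usr/bin/env touch "$LOGFILE"
/usr/bin/env chmod 600 "$LOGFILE"
/usr/bin/env echo "Running: $CMDLINE"
# $CMDLINE is deliberately left unquoted so it splits into command and arguments.
# wget writes its log to stderr, which we capture in $LOGFILE.
/usr/bin/env $CMDLINE 2> "$LOGFILE"
/usr/bin/env grep -B3 "ERROR" "$LOGFILE"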

if [ $? = "0" ]; then
 # grep matched, so wget logged at least one error: build and mail the report.
 TMPFILE="/tmp/.$RANDOM$RANDOM$RANDOM"
 /usr/bin/env touch "$TMPFILE"
 /usr/bin/env chmod 600 "$TMPFILE"
 /usr/bin/env echo "Found errors using wget broken links check:" > "$TMPFILE"
 /usr/bin/env grep -B3 "ERROR" "$LOGFILE" >> "$TMPFILE"
 /usr/bin/env echo "" >> "$TMPFILE"
 /usr/bin/env echo "" >> "$TMPFILE"
 /usr/bin/env cat "$LOGFILE" >> "$TMPFILE"
 mail -s "wget error log" "$MAILTO" < "$TMPFILE"
 /usr/bin/env rm -f "$LOGFILE"
 /usr/bin/env rm -f "$TMPFILE"
 # Record the tested mtime so the same deployment is not tested again.
 /usr/bin/env stat -c %Y "$FILETOTRACK" > "$TRACKFILE"
 exit 1
else
 # Wget found no errors, exit successfully
 /usr/bin/env rm -f "$LOGFILE"
 /usr/bin/env stat -c %Y "$FILETOTRACK" > "$TRACKFILE"
 exit 0
fi
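
The script above names its temporary files with $RANDOM, which is guessable and can collide with an existing file. One possible alternative, shown here only as a sketch, is to let mktemp create the file and to register a trap that removes it when the script ends. This assumes the mktemp from GNU coreutils is available; the helper name make_scratch and the template wgetlog.XXXXXX are hypothetical and not part of our script.

```shell
#!/bin/bash
# Sketch: private scratch files via mktemp instead of $RANDOM-based names.
# Assumes GNU coreutils mktemp; make_scratch is a hypothetical helper.
make_scratch() {
    # mktemp creates a new, uniquely named file readable and writable
    # only by its owner, and prints the resulting path on stdout.
    mktemp "${TMPDIR:-/tmp}/wgetlog.XXXXXX"
}

LOGFILE=$(make_scratch) || exit 1

# Remove the scratch file however the script ends (success or failure),
# which replaces the separate rm -f calls in each branch.
trap 'rm -f "$LOGFILE"' EXIT
```

With this in place, the explicit touch and chmod 600 steps for the log file would no longer be needed, since mktemp already creates the file with owner-only permissions.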

There are fancier and perhaps more efficient ways to accomplish this task, but given the way we maintain the CommPeak website, this approach handles our particular scenario well.