Today at the office we were once again forced to deal with a bit of Wikipedia vandalism, something that really shouldn’t happen as frequently as it does.  Not only were two of our pages modified, but one was actually deleted by a Wikipedia moderator for “blatant advertising”.  (To be honest it was a bit spammy, but I don’t think it crossed the line when compared to the pages of other high-profile brands like ours.)  We’ve had to deal with this on Wikipedia before, but I guess that is the nature of high-profile pages.  Since we hadn’t been keeping a close eye on these pages, the changes went unnoticed for quite some time.

We’ve now been tasked with monitoring certain pages for any future vandalism, but what is the best way to handle this?  My first thought was to use the built-in Wikipedia watchlist, but watchlists have proven problematic for us in the past, so I wanted to avoid that from the start.  I also looked at using the Wikipedia RSS feeds to monitor for changes, as discussed on Digital Inspiration, but that requires either setting everyone up with an RSS news reader or building out email alerts that would be vague at best.  (These notifications need to go to various PR teams, not technology folks, so they need to be easy to understand.)  It was at that point that I decided to write a tool that specifically met the requirements of the task at hand.

The functionality I wanted for my first version was pretty simple: monitor pages on Wikipedia for changes, then email an alert to the responsible team when something is detected.  Since these changes seem to happen randomly, this means no one is forced to review the page daily for vandalism; the team only has to react when an alert comes in.  Since I have full control over the email, I can ensure it is BlackBerry friendly, making it even more useful.  (Vandals tend to strike outside of normal business hours, go figure…)

Next I needed to decide how to detect that changes had occurred and how to react to the varying levels of vandalism.  I decided to start with a basic system that computes an MD5 hash of the webpage text and compares it to the last known good value.  If the hashes differ, the page text is compared against the known good text to measure the level of difference, using a slightly modified O(ND) Difference Algorithm.  The text is also scanned at this point against a list of known trigger words (swears).  The level of difference in the text and the weight of the trigger words that were found determine whether the alert email is sent as high priority or not.  This ensures that an alert is not generated for small updates, but only when someone replaces a large block of text or fills the page with profanity.
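To make the flow above a little more concrete, here is a rough sketch of the hash check, the difference measurement and the trigger word scan.  It is written in Python rather than the C# my tool uses, and it is not the actual implementation.  Python’s difflib.SequenceMatcher stands in for the modified O(ND) algorithm, and the trigger words, their weights, both thresholds and the rule for combining them are made-up values for illustration only.

```python
import difflib
import hashlib
import re

# Hypothetical trigger words and weights (mild stand-ins for real swears).
TRIGGER_WEIGHTS = {"darn": 1, "heck": 1, "blasted": 3}

# Hypothetical thresholds, chosen only for this illustration.
CHANGE_THRESHOLD = 0.30   # fraction of the text that differs
TRIGGER_THRESHOLD = 3     # summed weight of trigger words found


def md5_of(text: str) -> str:
    """MD5 hex digest of the page text, used as a cheap change check."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def change_ratio(old: str, new: str) -> float:
    """Fraction of words that differ: 0.0 is identical, 1.0 is unrelated.

    difflib.SequenceMatcher is a stand-in here, not the O(ND) algorithm.
    """
    matcher = difflib.SequenceMatcher(None, old.split(), new.split(),
                                      autojunk=False)
    return 1.0 - matcher.ratio()


def trigger_score(text: str) -> int:
    """Sum the weights of every trigger word found in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(TRIGGER_WEIGHTS.get(word, 0) for word in words)


def classify(known_good: str, current: str) -> str:
    """Return 'unchanged', 'minor' or 'high' for the current page text."""
    if md5_of(known_good) == md5_of(current):
        return "unchanged"
    # Assumed rule: either a large rewrite or enough profanity is 'high'.
    if (change_ratio(known_good, current) >= CHANGE_THRESHOLD
            or trigger_score(current) >= TRIGGER_THRESHOLD):
        return "high"
    return "minor"
```

With these made-up values, changing a single word in a fourteen-word page classifies as “minor”, while appending the one heavily weighted trigger word or replacing the whole text classifies as “high”.  Only a “high” result would produce a high priority email; what to do with “minor” changes (ignore them or send a normal priority note) is a policy choice.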

I set up a local wiki to use as a testbed, and so far things are looking promising.  No false positives yet, and minor updates of a few words have gone by without a single alert.  Adding one strong swear word, however, generates an instant high priority email.  Perfect.

If there is any interest, I can package and release the C# source code; just leave a comment below.