Web Scraping using C#
Now and then I come across a set of data on the Internet that I wish I could toss into an Excel spreadsheet for sorting, but once it spans more than a few pages, copy/paste is out of the question. For times like these I generally write a web scraper using C# and some crazy regular expressions. Today I needed to grab a much larger dataset that is generated by a very old application we have on hand. The interface for the app generates very clean HTML, but unfortunately its data is stored in an old proprietary format and we needed to move it into one of our SQL Servers to be able to report against it.
Since the HTML is much more complex this time around, I knew that regular expressions were not going to cut it, so I started looking around for an HTML parsing engine. I came across the HTML Agility Pack on CodePlex and was excited by the feature list, so I decided to give it a try. I imported the library and instantly noticed that the primary object I would be using, HtmlAgilityPack.HtmlDocument, conflicts with the System.Windows.Forms.HtmlDocument object. If this were a larger project I would build out a class to do the parsing so I didn’t have to fully qualify around the conflicts, but I had to remind myself that I will only be running this import once, so there’s no need to over-engineer things.
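A lighter alternative to a wrapper class is a using alias, which gives the Agility Pack types short names that cannot collide with the Windows Forms ones. This is only a sketch: the alias names HapDocument and HapNode and the AliasDemo helper are made up for illustration and are not part of the original import.

```csharp
// Alias the Agility Pack types so they cannot be confused with
// System.Windows.Forms.HtmlDocument in a WinForms project.
using HapDocument = HtmlAgilityPack.HtmlDocument;
using HapNode = HtmlAgilityPack.HtmlNode;

public static class AliasDemo
{
    // Parses a literal HTML string and returns the text of the first <p>,
    // or an empty string when the markup contains no paragraph.
    public static string FirstParagraph(string html)
    {
        HapDocument doc = new HapDocument();
        doc.LoadHtml(html);
        HapNode p = doc.DocumentNode.SelectSingleNode("//p");
        return p == null ? "" : p.InnerText;
    }
}
```

The aliases apply only to the file that declares them, so a form's code-behind can keep using System.Windows.Forms.HtmlDocument untouched.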
To load your HTML into the HtmlDocument class you can either load directly from a stream object or download the data yourself and pass it in as a string. I opted for the quick and dirty method with WebClient.DownloadData, which blocks until the download finishes, since I didn’t want to deal with asynchronous calls and having to maintain any sort of state in the application. I also added a failsafe try/catch in case the server failed to load a page, but this really should have a proper error handler.
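// Needs System.Net, System.Text and a reference to HtmlAgilityPack.
private HtmlDocument ParseHtml(string URL)
{
    HtmlDocument hDoc = new HtmlDocument();
    try
    {
        // Dispose the WebClient as soon as the download finishes
        using (WebClient wClient = new WebClient())
        {
            byte[] bData = wClient.DownloadData(URL);
            // ASCII decoding turns every non-ASCII character into '?'
            hDoc.LoadHtml(Encoding.ASCII.GetString(bData));
        }
    }
    catch (WebException)
    {
        // Download failed: fall back to an empty document
        hDoc.LoadHtml("");
    }
    return hDoc;
}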
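For comparison, the stream route mentioned above might look like the sketch below. It relies on HtmlDocument.Load(Stream, Encoding) and WebClient.OpenRead. The StreamLoader class name is hypothetical, and UTF-8 is an assumption, so substitute whatever encoding the server really uses.

```csharp
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

public static class StreamLoader
{
    // Parses HTML from any readable stream (network, file, or memory).
    // Assumption: the bytes are UTF-8 encoded.
    public static HtmlDocument FromStream(Stream stream)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(stream, Encoding.UTF8);
        return doc;
    }

    // Opens the URL as a stream and parses it without building
    // an intermediate byte array or string first.
    public static HtmlDocument FromUrl(string url)
    {
        using (WebClient client = new WebClient())
        using (Stream stream = client.OpenRead(url))
        {
            return FromStream(stream);
        }
    }
}
```

Keeping FromStream separate from the download makes it easy to run the parser against a saved copy of a page while working out the XPath.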
Now that we have an HtmlDocument we can use the standard SelectNodes and SelectSingleNode methods with some XPath to grab the proper nodes. For instance, here I will loop through all divs on the page that have a class value of “result”.
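// SelectNodes returns null, not an empty collection, when nothing matches
HtmlNodeCollection hNodes = hDoc.DocumentNode.SelectNodes("//div[@class='result']");
if (hNodes != null)
{
    foreach (HtmlNode hNode in hNodes)
        Log(hNode.InnerText);
}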
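The same pattern extends to elements nested inside each match. As a sketch, suppose every result div wraps a link; that page layout and the ResultExtractor name are assumptions for illustration, not the markup of the application being scraped. A relative XPath starting with "." keeps the inner search scoped to the current div.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ResultExtractor
{
    // Returns one [text, href] pair for each <div class="result"> that contains a link.
    public static List<string[]> ExtractLinks(HtmlDocument doc)
    {
        List<string[]> rows = new List<string[]>();
        HtmlNodeCollection results = doc.DocumentNode.SelectNodes("//div[@class='result']");
        if (results == null)
        {
            return rows; // SelectNodes returns null when nothing matches
        }
        foreach (HtmlNode result in results)
        {
            // The leading "." makes the path relative to this div
            // instead of searching the whole document again.
            HtmlNode link = result.SelectSingleNode(".//a[@href]");
            if (link == null)
            {
                continue; // skip results without a link
            }
            rows.Add(new[] { link.InnerText.Trim(), link.GetAttributeValue("href", "") });
        }
        return rows;
    }
}
```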
By building more and more complex XPath statements you can drill right down to the value you need to store. In this program I used a generic List of rows, where each row is itself a List of column values, which can then be saved into a CSV file with a simple function.
// Setup the data storage: a list of rows, each row a list of column values
List<List<string>> lData = new List<List<string>>();
// Add a few rows of data
List<string> lRow;
lRow = new List<string>();
lRow.Add("column1");
lRow.Add("column2");
lRow.Add("column3");
lData.Add(lRow);
lRow = new List<string>();
lRow.Add("column1");
lRow.Add("column2");
lRow.Add("column3");
lData.Add(lRow);
// Write the rows of data; "using" closes the file even if a write fails
using (TextWriter tWriter = new StreamWriter("data.csv"))
{
    foreach (List<string> lItem in lData)
    {
        string sLine = "";
        // Build the columns
        foreach (string sData in lItem)
        {
            sLine += "\"" + sData + "\",";
        }
        tWriter.WriteLine(sLine);
    }
}
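The loop above wraps each value in quotes but leaves a trailing comma on every line, and a value that itself contains a double quote would break the row. A small helper along these lines handles both. It is a sketch, and the Csv class and its method names are hypothetical. It follows the common CSV convention of doubling embedded quotes.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class Csv
{
    // Wraps one field in quotes, doubling any quote characters inside it.
    // A null field is written as an empty quoted string.
    public static string Escape(string field)
    {
        return "\"" + (field ?? "").Replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields with commas, with no trailing separator.
    public static string ToLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(Escape));
    }
}
```

With this in place the inner loop collapses to a single call such as tWriter.WriteLine(Csv.ToLine(lItem)).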
With the HTML Agility Pack I easily saved two hours on this code and I will be sure to tuck it away in my toolbox for the next time I need to deal with remote HTML.
How to compare and choose web scraping tools
I recommend reading this series of posts, which is aimed at executives taking charge of projects that entail scraping information from one or more websites.
http://www.fornova.net/blog/?p=18