Kakariki


Graphing StackOverflow's Reputation

Mon 15th March

UPDATE: Since writing this I have found that StackOverflow makes all their data downloadable under a Creative Commons licence at the StackOverflow Data Dump, so my fun little screen scraper is of even less use.

For this weekend's project, I decided I wanted to make a graph of the distribution of reputation within the StackOverflow community. I also felt the urge to play with JavaScript, so I tried to do it all in the browser. After being reminded of the cross-site restriction I realised this was impossible, so I created a quick little IHttpHandler that does a screen scrape of the StackOverflow users page and returns JSON with the rank, reputation and name of all the users on a given page.
Here is a sample of the shape of the JSON returned (the property names and values here are illustrative):
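
[
  { "Rank": 1, "Reputation": 100000, "Name": "ExampleUser1" },
  { "Rank": 2, "Reputation": 95000,  "Name": "ExampleUser2" },
  { "Rank": 3, "Reputation": 90000,  "Name": "ExampleUser3" }
]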

Overview

Libraries Used:

  • jQuery - for all its greatness
  • jqplot - a jQuery plugin for drawing graphs. I have yet to be convinced this is great, but it seemed to do the job here and was quick to get going
  • HtmlAgilityPack - to prevent me trying to use RegEx to parse HTML. This is a full-fledged HTML parser and made screen scraping SOF as easy as remembering XPath

Example Request

Client Side:
The page is loaded with no data, the main JavaScript object "sof" gets created and we bind the function drawGraph to its data changed event. We then start a series of getPage calls which use AJAX to fetch UserSummaries serialised as JSON. When an AJAX call completes, sof.data is updated and drawGraph is called.
The function drawGraph shapes the data into what jqplot expects, removes any existing graph (as jqplot does not seem to be able to do this itself) and draws the graph with the current data. As the various getPage calls complete, the graph is updated.

Server Side:
The hosting page does nothing fancy; in fact it can be a static page. Each call to the JavaScript function getPage makes a call through to an IHttpHandler, SOFPagedUserSummary, with two optional parameters: page and format. SOFPagedUserSummary parses the parameters and then delegates to SOFPageFetch.
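Boiled down, the handler looks something like this (a minimal sketch; the SOFPageFetch.Fetch call and the default values are illustrative rather than the exact code):

public class SOFPagedUserSummary : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        // Both parameters are optional: assume page 1 and JSON by default
        int page;
        if (!int.TryParse(context.Request["page"], out page))
            page = 1;
        string format = context.Request["format"] ?? "json";

        // Hand the real work off to SOFPageFetch
        // (Fetch is an illustrative name for that call)
        string output = SOFPageFetch.Fetch(page, format);

        context.Response.ContentType =
            format == "html" ? "text/html" : "application/json";
        context.Response.Write(output);
    }
}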
SOFPageFetch checks to see if this page/format combination is in the cache (written to disk to outlive worker process recycles); if it is present on disk we output the cached version. The heavy use of caching here is to prevent this project becoming a pain for StackOverflow. If it is not present in the cache we delegate to SOFNetPageFetcher, which returns a List<UserSummary>; these are then formatted as either HTML or JSON, and the formatted output is cached on disk and returned to the client.
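In sketch form the caching logic is roughly this (CacheDir, FetchPage and the two Format helpers are illustrative names rather than the real ones):

public static string Fetch(int page, string format)
{
    // Cache on disk, keyed by page and format, so the output
    // outlives worker process recycles and we stay gentle on SOF
    string cacheFile = Path.Combine(CacheDir,
        String.Format("page{0}.{1}", page, format));

    if (File.Exists(cacheFile))
        return File.ReadAllText(cacheFile);

    // Not cached: scrape the page, format it, then cache the result
    List<UserSummary> users = SOFNetPageFetcher.FetchPage(page);
    string output = (format == "html")
        ? FormatAsHtml(users)
        : FormatAsJson(users);

    File.WriteAllText(cacheFile, output);
    return output;
}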
SOFNetPageFetcher connects to the StackOverflow users page, passing the page number through, and parses the returned HTML using the HtmlAgilityPack. We then proceed to scrape the page using XPath to extract data on each of the 35 users returned; currently we are gathering rank, reputation and name. This is returned as a List<UserSummary>. Here is the core of that scraping code:

// Download the requested page of the StackOverflow users list
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(
    String.Format(UserPage, pageNumber), "GET");

// 35 users per page, so this is the rank before the first user shown
int rank = ((pageNumber - 1) * 35);

// Each user on the page sits inside a div with the class "user-info"
foreach (HtmlNode userInfo in
    doc.DocumentNode.SelectNodes("//div[@class='user-info']"))
{
    rank++;
    // Further XPath queries against userInfo pull out the reputation
    // and name, and each result is added to the List<UserSummary>
    // handed back to SOFPageFetch.
}
