Mon 15th March
UPDATE: Since writing this I have found that StackOverflow makes all of their data downloadable under a Creative Commons licence at the StackOverflow Data Dump, so my fun little screen scraper is of even less use.
For this weekend's project, I decided I wanted to make a graph of the distribution of reputation within the StackOverflow community. I also felt the urge to play with fetching the data directly from the browser; on being reminded of the cross-site restriction I realised this was impossible, so instead I created a quick little IHttpHandler that screen-scrapes the StackOverflow users page and returns JSON with the rank, reputation and name of all the users on a given page.
Here is a sample:
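The exact shape of the response has to be inferred from the description above, so treat the field names below as assumptions rather than the handler's actual output, but one page of users might come back looking something like this:

```javascript
// Hypothetical JSON from the IHttpHandler for one users page.
// The field names (rank, reputation, name) are assumptions based on
// the description above, not the handler's real output.
const sample = JSON.parse(`[
  {"rank": 1, "reputation": 12345, "name": "example user one"},
  {"rank": 2, "reputation": 9876,  "name": "example user two"}
]`);
console.log(sample.length); // one array entry per user on the page
```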
The function drawGraph shapes the data into what jqplot expects, removes any existing graph (as jqplot does not seem to be able to), and redraws the graph with the current data. As the various getPage calls complete, the graph is updated.
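That flow can be sketched as below; the helper name toSeries and the '#chart' target are assumptions for illustration, not the post's actual code:

```javascript
// jqplot wants each series as an array of [x, y] pairs, so the user
// objects are reshaped into [rank, reputation] pairs before plotting.
function toSeries(users) {
  return users.map(function (u) { return [u.rank, u.reputation]; });
}

// Sketch of drawGraph: clear the previous plot, then redraw with the
// current data. Guarded so it is a no-op where jQuery is not loaded.
function drawGraph(users) {
  if (typeof $ !== 'undefined') {
    $('#chart').empty();                    // jqplot will not redraw into a used div
    $.jqplot('chart', [toSeries(users)]);
  }
}
```

Calling drawGraph again each time another page arrives is what makes the graph fill in progressively as the getPage calls complete.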
SOFPageFetch checks to see if this page/format combination is in the cache (written to disk so that it outlives worker-process recycles); if it is present on disk, we output the cached version. The heavy use of caching here is to prevent this project becoming a pain for StackOverflow. If it is not present in the cache, we delegate to SOFNetPageFetcher, which returns a List<UserSummary>; these are then formatted as HTML or JSON, and the formatted output is cached on disk and returned to the client.
SOFNetPageFetcher connects to the StackOverflow users page, passing the page number through, and parses the returned HTML using the HtmlAgilityPack. We then proceed to scrape the page using XPath to extract data on each of the 35 users returned; currently we are gathering rank, reputation and name. This is returned as a List<UserSummary>.
HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(String.Format(UserPage, pageNumber), "GET");
int rank = (pageNumber - 1) * 35;
foreach (HtmlNode userInfo in doc.DocumentNode.SelectNodes("//div[@class='user-info']"))
{
    rank++; // ranks run sequentially across pages, 35 users per page
    // further XPath selects against userInfo pull out the reputation and name
}