
Every now and then I prototype an idea for an application or tool and find myself copy/pasting demo data from a service just so the prototype can feel more real. Pulling data off pages like this is known as web scraping or screen scraping, and it can be both fun and excruciatingly tedious. So I wrote a small script for you to use.

From Wikipedia: “Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. Companies like Amazon AWS, Webdatacapture.com, 80legs.com and Google provide web scrapping tools, services and public data available free of cost to end users.”

There are quite a few cool solutions out there, both as services (Import.io) and as software.

In most cases, the structures I’m trying to extract are very simple, and I’ve noticed too much boilerplate in my code whenever I script simple extractors.

I’ve put together a hack-y prototype using CsQuery (C#) that lets you configure one simple JSON file and scrape simple web sites on .NET or Mono:

[Screenshot: input.json]
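
To make the idea of a config-driven scraper concrete, here is a minimal sketch. It is not the CsQuery code from the project: it is written in Python using only the standard library, and the config keys (`item`, `fields`), the sample HTML, and the helper names (`Node`, `TreeBuilder`, `scrape`) are all made up for illustration. The tiny selector matcher only understands `tag`, `.class`, and `tag.class`, which is far less than CsQuery supports.

```python
import json
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal element node holding its tag, attributes, and children."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []  # mix of Node and str, in document order

    def raw_text(self):
        return "".join(
            c if isinstance(c, str) else c.raw_text() for c in self.children
        )

    def text(self):
        # Collapse runs of whitespace so the output is tidy.
        return " ".join(self.raw_text().split())

    def descendants(self):
        for child in self.children:
            if isinstance(child, Node):
                yield child
                yield from child.descendants()

    def matches(self, selector):
        # Supports only "tag", ".class", and "tag.class" (an assumption
        # made to keep the sketch short).
        tag, _, cls = selector.partition(".")
        classes = (self.attrs.get("class") or "").split()
        return (not tag or tag == self.tag) and (not cls or cls in classes)

    def select(self, selector):
        return [n for n in self.descendants() if n.matches(selector)]


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML text."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if any.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def scrape(html, config):
    """Return one dict per element matching config["item"]."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []
    for item in builder.root.select(config["item"]):
        row = {}
        for name, selector in config["fields"].items():
            found = item.select(selector)
            row[name] = found[0].text() if found else None
        rows.append(row)
    return rows


# Hypothetical config and page, invented for this example.
CONFIG_JSON = """
{
  "item": "li.restaurant",
  "fields": {"name": "a.title", "cuisine": "span.cuisine"}
}
"""

SAMPLE_HTML = """
<ul>
  <li class="restaurant"><a class="title" href="/r/1">Cafe One</a> <span class="cuisine">Coffee</span></li>
  <li class="restaurant featured"><a class="title" href="/r/2">Cafe Two</a></li>
</ul>
"""

if __name__ == "__main__":
    rows = scrape(SAMPLE_HTML, json.loads(CONFIG_JSON))
    print(json.dumps(rows, indent=2))
```

The point of the design is that all the site-specific knowledge lives in the JSON: one selector for the repeating item, and one selector per field inside it. Pointing the scraper at a different page then means editing the config, not the code.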

 

Here’s the Urbanspoon page and the elements that will be extracted:

[Screenshot: Urbanspoon page showing the elements to extract]

 

To figure out those CSS selectors, you can use Firefox with Firebug and FirePath:

[Screenshot: FirePath in Firefox]

While the app obviously runs on Azure, I figured I’d show it running on DigitalOcean (an Ubuntu droplet) with Mono:

[Screenshot: the scraper running under mono in an ssh session]

 

This is meant to give you a feel for how CsQuery can be leveraged to grab quick demo data for prototyping.

Feel free to tinker with the code and the Visual Studio 2013 project on GitHub. Like/Share/Tweet if you liked it!