Every now and then I prototype an idea for an application or tool, and find myself copy/pasting demo data from a service just so the prototype can feel more real. Automating that extraction is known as web scraping or screen scraping, and it can be both fun and excruciatingly tedious. So I wrote a small script for you to use.
From Wikipedia: “Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. Companies like Amazon AWS, Webdatacapture.com, 80legs.com and Google provide web scraping tools, services and public data available free of cost to end users.”
There are quite a few cool solutions out there, both in the form of services (Import.io) and software:
- Scrapy / Beautiful Soup (Python)
- CasperJS (JS)
- Html Agility Pack / CsQuery (C# / .NET)
In most cases, the structures I’m trying to extract are very simple, and I’ve noticed I write too much boilerplate whenever I script a simple extractor.
I’ve put together a hacky prototype using CsQuery (C#) that lets you configure a single, simple JSON file and scrape simple web sites on .NET or Mono:
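To give a rough sense of the moving parts, here’s a minimal sketch of the idea: a small config (the URL, field names and selectors below are made up for illustration, not the project’s actual JSON schema) drives CsQuery, which fetches the page and pulls the text out of each matching element.

```csharp
using System;
using System.Collections.Generic;
using CsQuery;

class ScraperSketch
{
    static void Main()
    {
        // Stand-in for the JSON config: a target URL plus named CSS selectors.
        // (Field names, URL and selectors are illustrative only.)
        var url = "http://example.com/restaurants";
        var selectors = new Dictionary<string, string>
        {
            { "name",    ".restaurant .name a" },
            { "cuisine", ".restaurant .cuisine" }
        };

        // CQ.CreateFromUrl downloads the page and parses it into a jQuery-like DOM.
        CQ dom = CQ.CreateFromUrl(url);

        foreach (var field in selectors)
        {
            // The indexer runs a CSS selector, just like $() in jQuery.
            foreach (IDomObject el in dom[field.Value])
            {
                Console.WriteLine("{0}: {1}", field.Key, el.Cq().Text().Trim());
            }
        }
    }
}
```

The real project reads this kind of information from the JSON file instead of hard-coding it; the dictionary above just stands in for that config.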
Here’s the Urbanspoon page and the elements that will be extracted:
To figure out the right CSS selectors, you can use Firefox with Firebug and FirePath:
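A selector copied out of FirePath drops straight into CsQuery’s jQuery-style indexer. Here’s a tiny self-contained check (the markup and selector are invented for illustration):

```csharp
using System;
using CsQuery;

class SelectorCheck
{
    static void Main()
    {
        // Hypothetical markup standing in for a restaurant listing page.
        CQ dom = CQ.Create(
            "<div class='list'><div class='restaurant'><a class='name'>Joe's Diner</a></div></div>");

        // Paste the selector from FirePath verbatim into the indexer.
        CQ names = dom[".list .restaurant > a.name"];
        Console.WriteLine("Matched {0}: {1}", names.Length, names.Text());
    }
}
```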
While the app obviously runs on Azure, I figured I’d show it running on DigitalOcean (an Ubuntu droplet) with Mono:
This is meant to give you a feel for how CsQuery can be leveraged to grab quick demo data for prototyping.
Feel free to tinker with the code on GitHub (it’s a Visual Studio 2013 project). Like/Share/Tweet if you liked it!