
Every now and then I prototype an idea for an application or tool, and find myself copy/pasting demo data from a service just so the prototype can feel more real. This is known as web scraping or screen scraping, and it can be both fun and excruciatingly tedious. So I wrote a small script for you to use.

From Wikipedia: “Web pages are built using text-based mark-up languages (HTML and XHTML), and frequently contain a wealth of useful data in text form. However, most web pages are designed for human end-users and not for ease of automated use. Because of this, tool kits that scrape web content were created. A web scraper is an API to extract data from a web site. Companies like Amazon AWS, Webdatacapture.com, 80legs.com and Google provide web scrapping tools, services and public data available free of cost to end users.”

There are quite a few cool solutions out there, both in the form of services (such as Import.io) and software.

In most cases, the structures I’m trying to extract are very simple, and I’ve noticed too much boilerplate in my code whenever I script simple extractors.

I’ve put together a hacky prototype using CsQuery (C#) that lets you configure one simple JSON file and scrape simple web sites on .NET or Mono:

[Screenshot: input.json]
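
To make the idea concrete, here's a rough sketch of a config-driven extractor. It's written in Python with only the standard library rather than C# and CsQuery, so treat it as an illustration of the approach, not the project's code. The `fields` config key, the restricted `tag.class` selector form, the `FieldExtractor` class, the `scrape` function, and the sample HTML are all made up for this example.

```python
import json
from html.parser import HTMLParser

# Hypothetical config: field name -> "tag.class" selector (assumed format).
CONFIG = json.loads("""
{
  "fields": {
    "name": "a.name",
    "cuisine": "span.cuisine"
  }
}
""")

# Made-up sample markup standing in for a real listing page.
HTML = """
<ul>
  <li><a class="name" href="/r/1">Blue Door</a> <span class="cuisine">Thai</span></li>
  <li><a class="name" href="/r/2">Harbor Grill</a> <span class="cuisine">Seafood</span></li>
</ul>
"""


class FieldExtractor(HTMLParser):
    """Collects the text of elements matching simple 'tag.class' selectors."""

    def __init__(self, fields):
        super().__init__()
        # Split each "tag.class" selector into a (tag, class) pair.
        self.selectors = {n: tuple(s.split(".", 1)) for n, s in fields.items()}
        self.results = {n: [] for n in fields}
        self.capture = None  # [field name, tag, nesting depth, text chunks]

    def handle_starttag(self, tag, attrs):
        if self.capture:
            # Track nested tags of the same name so we close on the right one.
            if tag == self.capture[1]:
                self.capture[2] += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for name, (sel_tag, sel_class) in self.selectors.items():
            if tag == sel_tag and sel_class in classes:
                self.capture = [name, tag, 1, []]
                return

    def handle_data(self, data):
        if self.capture:
            self.capture[3].append(data)

    def handle_endtag(self, tag):
        if self.capture and tag == self.capture[1]:
            self.capture[2] -= 1
            if self.capture[2] == 0:
                name, _, _, chunks = self.capture
                self.results[name].append("".join(chunks).strip())
                self.capture = None


def scrape(html, config):
    """Returns one dict per row by pairing the nth match of every field."""
    parser = FieldExtractor(config["fields"])
    parser.feed(html)
    parser.close()
    names = list(parser.results)
    return [dict(zip(names, row)) for row in zip(*parser.results.values())]


for record in scrape(HTML, CONFIG):
    print(record)
```

The point of this shape is that pointing the scraper at a different page only means editing the config, not the code. A real selector engine like CsQuery handles full CSS selectors; this sketch only understands a tag name plus one class.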

 

Here’s the Urbanspoon page and the elements that will be extracted:

[Screenshot: Urbanspoon Seattle page]

 

To figure out those CSS selectors, you can use Firefox with Firebug and FirePath:

[Screenshot: FirePath]

While the app obviously runs on Azure, I figured I’d show it running on DigitalOcean (an Ubuntu droplet) with Mono:

[Screenshot: the app running under mono in an ssh session]

 

This is meant to give you a feel for how CsQuery can be leveraged to grab quick demo data for prototyping.

Feel free to tinker with the code on GitHub and the Visual Studio 2013 project. Like/Share/Tweet if you liked it!

2 thoughts on “Web Scraping – How to turn web sites into data”

  1. Hi, nice article.

    This is a nice way to scrape very important data from the web. I also want to recommend one more tool, DataCrops, which is a popular web data extraction tool. It extracts data from websites, data directories, financial data, business profiles, social media sites, review sites, and product pricing from Amazon & eBay.

    This tool also extracts images and videos. If you think it is a good fit for this article, kindly add it.

    Thanks
