The following is a guest post by Fabien Durand in the Everyone Can Do Data Science series. Here, Fabien shows us the first step of his tutorial: how to crawl real estate data from the web using Import.IO.
Import.IO is a graphical app for scraping data from web pages and storing it in spreadsheets or CSV files. In the screencast below, I show how I used Import.IO to scrape data from realtor.com, a US-based real estate portal, without writing a single line of code. I managed to get data for 9,000+ real estate properties by letting the crawler run for an hour. I can then use that data to create a predictive model that determines the price of a given house, which I will show in a later article.
The video contains step-by-step instructions that you can easily reproduce. I show how to build a "crawler" and generate a structured spreadsheet of raw data for a large number of real estate properties. Each line of the spreadsheet represents a "data point", i.e. a property for sale. Each column represents an attribute of the properties: number of bedrooms, floor area, etc.
You can then save this as a CSV file (here is mine: realtor_importio_raw.csv) which will be the foundation for the next steps of this tutorial. Here is how it looks in Google Spreadsheet (make sure you scroll to the right):
As you can see in the spreadsheet above, the data we have extracted is not yet prepared and needs to be cleaned. It contains missing fields and useless columns, the "lot size" column mixes two units of measurement (acres and square feet), and it includes misleading entries such as empty lots with no building on them.
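The cleaning steps just described can be sketched in plain Python. This is a minimal illustration, not the exact procedure from the tutorial: the column names (`price`, `beds`, `lot_size`) are hypothetical placeholders for whatever headers your crawler produced, and the conversion assumes 1 acre = 43,560 square feet:

```python
ACRE_SQFT = 43560  # 1 acre = 43,560 square feet

def normalize_lot_size(raw):
    """Convert a raw 'lot size' string to square feet, or None.
    Handles values like '0.5 acres' or '6,000 sq ft'."""
    if not raw:
        return None
    text = raw.lower().replace(",", "").strip()
    try:
        number = float(text.split()[0])
    except (ValueError, IndexError):
        return None
    if "acre" in text:
        return number * ACRE_SQFT
    return number  # assume the value is already in square feet

def clean_rows(rows):
    """Drop rows missing a price or bedroom count, drop empty lots
    (zero bedrooms), and normalize lot size to square feet."""
    cleaned = []
    for row in rows:
        if not row.get("price") or not row.get("beds"):
            continue  # missing field
        if int(row["beds"]) == 0:
            continue  # likely an empty lot without a building
        row["lot_size_sqft"] = normalize_lot_size(row.get("lot_size"))
        cleaned.append(row)
    return cleaned
```

The key design choice is converting everything to a single unit: a model cannot compare "0.5 acres" with "6,000 sq ft" unless both end up on the same scale.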
Having a clean set of data heavily reduces the risk of errors when we make predictions with BigML, which will take place in a later article. BigML takes into account all the information present in the spreadsheet, so we have to make sure it is flawless before we import it.
In this tutorial I chose Las Vegas, Nevada as the location. The location you choose is up to you; just keep in mind that a narrow geographical area is better. Why? Because the pricing of a real estate property is subject to geographical factors, so limiting the area to one particular city makes the task of creating a real estate pricing model simpler.
Other posts in the series
"Everyone can do Data Science"
Fabien is a business student at the EBS University for Business and Law in Germany and at the KEDGE Business School in France. He loves to bring the IT and Business world together. At the moment, everything from Data Science to Digital Innovation brings him much joy! You can find him on LinkedIn and Twitter.