Everyone can do Data Science, Part 3 — BigML

The following is a guest post by Fabien Durand in the Everyone can do Data Science series. Here, Fabien shows us the 3rd step of his tutorial, how to predict real estate prices using BigML. The 1st step dealt with scraping data from the web and the 2nd step with cleaning that data.

By now you've learnt how to scrape data from realtor.com using Import.IO and how to prepare this data using Pandas. Here comes the most interesting part: building a predictive model of real estate pricing from the data we collected. I will show you how to use BigML for that.

BigML: Machine Learning made accessible

BigML is a powerful Machine Learning service that offers an easy-to-use interface for importing your data and getting predictions out of it. The beauty of the service is that you do not need profound knowledge of Machine Learning techniques to get the most out of ML. Advanced options are available, but in our case (and in most cases) you will not need them. BigML creates predictive models easily thanks to its powerful "1-click" features. BigML is also free for tasks up to 16MB in Development Mode, which is more than enough for this tutorial.

How Machine Learning works

It is important to remember that when I talk about predictions here, I mean predictions in the Machine Learning sense. Here is what Machine Learning boils down to:

  • We want to create a model based on our existing data. This process is called Training. We train the model with examples, e.g. real estate data scraped from realtor.com. The model is an interpretation of the dataset and of the relationships between the attribute we want to predict (house price) and the other attributes (lot size, number of bedrooms, etc.). In BigML, the predictive model is represented as a decision tree (see below).
  • Then, we feed new inputs to our model, in our case the characteristics of a house (lot size, number of bedrooms, etc.), and BigML predicts the output (house price). This process is simply called Predicting, or sometimes Scoring. The output (which BigML calls the objective field) is a value; in our case it is a numerical value, the price of the house.

For a more complete explanation on how Machine Learning works with concrete examples, have a look at the book Bootstrapping Machine Learning.
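To make the train-then-predict cycle concrete, here is a minimal sketch in Python with scikit-learn. It is purely illustrative (BigML does all of this for you through its interface), and the column names are assumptions about our dataset:

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    # Load the cleaned data from the previous step
    data = pd.read_csv("realtor_importio_cleaned.csv")

    # Column names are assumptions about our dataset
    features = data[["bedrooms", "bathrooms", "lot_size"]]  # the inputs
    prices = data["price"]                                  # the output (goal)

    # Training: the model learns the relationship between attributes and price
    model = DecisionTreeRegressor()
    model.fit(features, prices)

    # Predicting (scoring): ask the model about a new, unseen house
    new_house = pd.DataFrame(
        [[3, 2, 7000]], columns=["bedrooms", "bathrooms", "lot_size"])
    print(model.predict(new_house))  # -> estimated price in dollars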

Getting Started

All you need in order to avoid mistakes is a well-prepared CSV file that BigML can interpret correctly. If you followed the instructions of the previous article and prepared your CSV file without problems, you are good to go. You can also download mine: realtor_importio_cleaned.csv.

Here is what the CSV looks like in Google Spreadsheet:

Create a Source

Let's start by going to the BigML dashboard, under the "Sources" tab. For the duration of this tutorial we are going to stay in Development Mode, which is the equivalent of a free plan, with minor restrictions that won't affect us.

Follow the instructions in the screenshot below to upload the CSV.

Now the CSV is listed as a source and can be used in BigML. Click on it to have a look at what was imported.

Note that the first field (shown as the first line) is the row index that Pandas added when we saved the CSV in the previous step. It must be removed, as it doesn't represent an attribute influencing the final price and shouldn't be used by BigML to create a predictive model.

Create a Dataset

We need to create a dataset from the source before we generate our predictive model. To do that, follow the instructions below.

Here is what it looks like.

Configure the Dataset

The last line is by default set as the goal (crosshair icon). The goal is what we want to predict; BigML calls it the Objective Field. If your last line is not the goal, you can change it by hovering your mouse over the line you want: an icon will appear for you to make the modification. Removing the line added by Pandas is quite similar, see below. Un/selecting fields and choosing the objective field can also be done using the configuration panel.

It should look like this.

Filter the Dataset

This step is important because we need to exclude the houses that are too expensive, let's say more than US$1 million. Why? Because they are rare, but they could still skew our results when we evaluate the accuracy of our model later on. Discarding them makes our life easier and still allows us to make predictions for most houses.
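As an aside, if you prefer to filter before uploading, this is a one-liner in Pandas (assuming the DataFrame and the "price" column from the previous article); here we will simply use BigML's built-in filter:

    # Keep only the houses listed at US$1 million or less
    df = df[df["price"] <= 1000000]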

Follow the instructions below to filter the dataset.


Below is what we had before filtering the price.

And now after the filtering.

After filtering the dataset, you can see that the price distribution no longer shows the long, sparse tail of expensive houses on the right side of the histogram.

Create a Predictive Model

This is where you can experience the power of BigML. You can choose either to configure your predictive model manually or to let BigML do the job itself thanks to its "1-click" features. We will of course choose the latter and select the "1-click Model" feature. See instructions below.

By default, BigML represents the predictive model as a decision tree.

The decision tree shows you how BigML classified the houses according to their attributes. By clicking on a node you can see how BigML grouped certain houses together and why. The panel on the right shows what the houses in the selected node have in common.

You can also choose a sunburst visualization and interact with it (sunburst embedded below). The only difference with the decision tree is that it starts from the center instead of the top.

BigML also offers a Model Summary Report feature. It helps you visualize the importance of each field in the model in a simple histogram. Moreover, with the Download Actionable Model feature you can export your model as code that you'll be able to run wherever you want.
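The exported code is essentially the decision tree flattened into nested if/else rules. Below is a heavily simplified, hypothetical sketch of what such a function can look like in Python; the real export is much longer and uses the actual fields of your dataset:

    # Hypothetical, heavily simplified sketch of an "actionable model" export.
    # The real file generated by BigML contains many more branches.
    def predict_price(bedrooms=None, bathrooms=None, lot_size=None):
        if lot_size is None:
            return 185000.0
        if lot_size > 6500:
            if bedrooms is not None and bedrooms > 3:
                return 320000.0
            return 260000.0
        return 150000.0

    print(predict_price(bedrooms=3, bathrooms=2, lot_size=7000))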

Evaluating the Predictive Model

Having a predictive model is good; assessing its accuracy is better. To evaluate the model, we will split the dataset we used into two parts.

  • The first part, called the training dataset, represents 80% of the original dataset and will be used to create a training model, exactly as we did previously with the original dataset.
  • The second part, called the test dataset, represents the remaining 20% of the original dataset.
  • We then run an evaluation where the model (built from the training set) will be used to make predictions on the inputs of the test set, and these predictions will be compared to the outputs of the test set.
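For the curious, the same 80/20 procedure can be sketched in a few lines of scikit-learn, reusing the features and prices variables from the earlier sketch (again for illustration only; this is not what BigML runs internally):

    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.metrics import mean_absolute_error

    X_train, X_test, y_train, y_test = train_test_split(
        features, prices, test_size=0.2)         # 80% training, 20% test

    model = DecisionTreeRegressor().fit(X_train, y_train)  # train on the 80%
    predictions = model.predict(X_test)                    # predict the 20%
    print(mean_absolute_error(y_test, predictions))        # average error in $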

All of these steps can be done easily in BigML; here is how.

Below is what you should have by now.

We can now easily create the training model as shown below.

We have successfully created a test dataset and a model from the training dataset. Now we ask BigML to use these two to perform an evaluation of the model and to assess its accuracy.

You need to go to the "Evaluations" tab, see below.

As shown in the following screenshot, you select the model on the left and the test dataset on the right.

The evaluation of the model is shown as a benchmark against two baselines: mean-value predictions and random predictions. Green indicates that BigML outperformed the baselines, and you can compare the numbers to see by how much. The R-squared score indicates how much better the model's predictions are than simply predicting the mean every time.

We are told that the average error is $58,275.83 (the Mean Absolute Error). It would be interesting in this case to also have the average error relative to the true value we tried to predict (the Mean Absolute Percentage Error). Indeed, a $50K error on a house worth $1M is less of a big deal than on a house worth $100K.
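For reference, writing y_i for the true price, ŷ_i for the predicted price, and n for the number of houses in the test set, the standard definitions of these two metrics are:

    \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
    \qquad
    \mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \frac{\left| y_i - \hat{y}_i \right|}{y_i}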

One way to decrease the error is to add even more attributes, like the year of construction/renovation, the presence of a swimming pool, proximity to schools, public transport, etc. We only grabbed data from realtor.com, but adding economic and social data (Open Data) would be a great way to improve the results. Building such a model is more complex and beyond the scope of this tutorial.

Another way to increase the accuracy of the model would be to increase the volume of data used. At the beginning we grabbed data for 9,000+ houses with Import.IO; after cleaning we ended up with data for around 5,000 houses. More data generally means more accuracy, as the Machine Learning algorithm can learn from a larger set of examples when building a predictive model.

You can also try to create an ensemble of models to improve accuracy. What is an ensemble? "By learning multiple models over different subsamples of your data and taking a majority vote at prediction time, the risk of overfitting a single model to all of the data is mitigated." (BigML Blog). Creating an ensemble in BigML is as easy as clicking on "1-click Ensemble" instead of "1-click Model".
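In code terms, this is the same idea as a random forest: many trees trained on subsamples of the data, with their predictions averaged at prediction time (the regression analogue of the majority vote). A sketch with scikit-learn, reusing the variables from the evaluation sketch above:

    # Ensemble sketch (illustrative): many trees over subsamples of the data,
    # predictions averaged at prediction time.
    from sklearn.ensemble import RandomForestRegressor

    ensemble = RandomForestRegressor(n_estimators=10).fit(X_train, y_train)
    print(mean_absolute_error(y_test, ensemble.predict(X_test)))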

How much is your house worth on the market?

Now that you have your model you can enter the attributes of your house and get a prediction back. Note that the fields are ordered by importance, meaning that the first fields contribute more to error reduction.
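If you would rather do this programmatically than through the web form, BigML also provides Python bindings. Here is a sketch; the exact input format is documented with the bindings, and the field names below are assumptions about our dataset:

    # Sketch using the BigML Python bindings (pip install bigml); credentials
    # are read from the BIGML_USERNAME and BIGML_API_KEY environment variables.
    # The input field names are assumptions about our dataset.
    from bigml.api import BigML

    api = BigML()
    model = "model/xxxxxxxxxxxxxxxxxxxxxxxx"  # your model's ID, from the dashboard
    prediction = api.create_prediction(model, {
        "bedrooms": 3, "bathrooms": 2, "lot size": 7000})
    print(prediction["object"]["output"])     # the predicted price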

Wrap up

If you need to remember one thing, remember that the predictions you get will be as good as the data you fed the algorithm with (in our case, the BigML algorithm). In this example we have been using what we call a regression model, in other words our problem was to find the most accurate numerical value (the price) for given attributes. BigML can also deal with classification problems, where the outcome you want to predict is a class (e.g. "blue", "red", "green").


Fabien is a business student at the EBS University for Business and Law in Germany and at the KEDGE Business School in France. He loves to bring the IT and Business world together. At the moment, everything from Data Science to Digital Innovation brings him much joy! You can find him on LinkedIn and Twitter.

Everyone can do Data Science, Part 2 — Pandas

The following is a guest post by Fabien Durand in the Everyone can do Data Science series. Here, Fabien shows us the 2nd step of his tutorial, how to prepare the data extracted with Import.IO in step 1, before using BigML.

The first step dealt with extracting real estate data from realtor.com using Import.IO. The CSV we generated in the end can be downloaded here: realtor_importio_raw.csv. You can also see how it appears in Google Spreadsheet:

The problem with this CSV is that it contains errors we need to get rid of. By errors I mean useless columns, mixed units of measurement in the "lot size" column, listings that are just land without any building, and duplicates. This procedure is called cleaning, or preparing, the data. I will show you how to remove the errors and prepare the data with the help of Pandas.

For those who are unfamiliar with Pandas, it is a very popular Python library designed to facilitate data wrangling tasks. Check out the "10 minutes to Pandas" tutorial for a quick introduction.

To see Pandas capabilities in action, I recommend you have a look at this tutorial by Alexandre Vallette. In it, Alexandre uses open data to predict election abstention rates, and the data-cleaning part is carried out with Pandas.

Wakari to code in the cloud

IPython Notebook is a very nice tool to edit, run and comment your code directly from your web browser. Wakari is a service that can host your notebooks and run them in the cloud. By hosting on Wakari you avoid all the hassle of installing and configuring IPython on your local computer.

I made an IPython Notebook for our example. It is publicly available (see below), so you only need to click on "Run/Edit this notebook" to duplicate it in your Wakari account (you can create one for free). Once you have it in your Wakari account, you will be able to upload your CSV to the same directory as the IPython Notebook (the file manager is in the left panel).

Cleaning the Data

The notebook I have made shows step by step how to clean the data we got from Import.IO. It will only work for the particular data structure that we have here but you can easily adapt it to other needs and other datasets.
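To give you an idea of what is inside, the cleaning essentially boils down to a handful of Pandas operations like the ones below. This is a condensed sketch with assumed column names; the notebook itself remains the step-by-step reference:

    # Condensed sketch of the cleaning steps; the column names are assumptions
    # and the IPython Notebook is the authoritative version.
    import pandas as pd

    df = pd.read_csv("realtor_importio_raw.csv")
    df = df.drop_duplicates()                      # remove duplicate listings
    df = df[df["property_type"] != "land"]         # drop lots without a building
    # Harmonize "lot size": convert acres to square feet (1 acre = 43,560 sqft)
    acres = df["lot_size_unit"] == "acres"
    df.loc[acres, "lot_size"] = df.loc[acres, "lot_size"] * 43560
    df = df.drop("useless_column", axis=1)         # drop columns we don't need
    # Note: to_csv writes the row index as an extra first column by default;
    # we will remove that field in BigML in the next article.
    df.to_csv("realtor_importio_cleaned.csv")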

You can see the embedded IPython Notebook below, but I strongly recommend you go to the original page (click here), as it is much nicer to work directly in Wakari.

Once you have completed all the instructions in the IPython Notebook, you will be able to save the resulting data as a CSV file (the last instruction creates the file). You can download mine here: realtor_importio_cleaned.csv. This is what it looks like in Google Spreadsheet:

Next step: predictive modelling

Now that the data has been cleaned, it is ready to be imported into BigML. We will see in the next article how to use this data to automatically create a model that predicts the value of real estate properties.

Stay tuned!


Everyone can do Data Science, Part 1 — Import.IO

The following is a guest post by Fabien Durand in the Everyone can do Data Science series. Here, Fabien shows us the 1st step of his tutorial, how to crawl real estate data from the web using Import.IO.

Import.IO is a graphical app for scraping data from web pages and storing it in spreadsheets or CSV files. In the screencast below, I show how I used Import.IO to scrape data from realtor.com, a US-based real estate portal, without writing a single line of code. I managed to get data for 9,000+ real estate properties by letting the crawler run for an hour. I can then use that data to create a predictive model that determines the price of a given house, which I will show in a later article.

The video contains step-by-step instructions that you can easily reproduce. I show how to build a "crawler" and generate a structured spreadsheet of raw data for a large number of real estate properties. Each line of the spreadsheet represents a "data point", i.e. a property for sale. Each column represents an attribute of the properties: number of bedrooms, surface, etc.
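To make that structure concrete, a couple of lines of such a spreadsheet could look like this (the values are made up for illustration, and the real spreadsheet has many more columns):

    price,bedrooms,bathrooms,lot size,property type,zip
    259000,3,2,"7,405 sqft",single family home,89117
    104900,2,1,"0.25 acres",condo,89103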

You can then save this as a CSV file (here is mine: realtor_importio_raw.csv) which will be the foundation for the next steps of this tutorial. Here is how it looks in Google Spreadsheet (make sure you scroll to the right):

Data quality

As you can see in the spreadsheet above, the data we have extracted is not yet prepared and needs to be cleaned. It contains missing fields, useless columns, two different units of measurement in the "lot size" column (acres and square feet), and misleading entries such as empty lots without any building.

Having a clean set of data heavily reduces the risk of errors when we make predictions with BigML, which will take place in a later article. BigML takes into account all the information present in the spreadsheet, so we have to make sure that it is flawless before we import it.

Remarks

In this tutorial I chose Las Vegas, Nevada as the location. The geographical location you choose is up to you; just keep in mind that a narrow geographical area is better. Why? Because the price of a real estate property is subject to geographical factors, so limiting the area to one particular city makes the task of creating a real estate pricing model simpler.

You can find helpful resources on Import.IO in the support section of their website and on their YouTube channel.


Everyone can do Data Science, Part 0 — Introduction

The following is a guest post by Fabien Durand in a series called Everyone can do Data Science.

Many data-related tasks that once required you to be a software engineer, a PhD graduate in an analytical field, or a "Data Scientist" can now be performed with easy-to-use graphical tools and beginner-friendly software. There are different types of tools tailored to different stages of data processing, and most of them require little to no coding experience.

In this tutorial, I will show you how these tools can be used in a complementary manner to create a predictive model of the value of real estate properties, based on data extracted from the web. I am not a programmer but a business school student. What I've done will show you that these new tools are democratizing Data Science and making it easier than ever.

Data What?

Before we start, I want to give you my definition of Data Science so you can understand why I structured the following tutorial the way I did. When I talk about Data Science, I think of a combination of activities that help to solve a problem for which data is available. Traditionally, a Data Scientist is someone who has cross-domain knowledge ranging from Mathematics to Statistics, Machine Learning, Computer Science, Business Analytics, Marketing, Social Science, etc. Data Science will not look the same in two different situations; it largely depends on the specificities of a given problem, the skills of the individual, and the domain of application.

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” – Hal Varian, chief economist at Google

How is this tutorial different?

Now that I have given my definition of Data Science, I want to explain why this tutorial is not a traditional Data Science tutorial. I truly believe that the best way to learn something is by actively doing instead of just listening or reading. Here there will be only practice; by that I mean concrete and useful examples.

This tutorial is a basic introduction to the world of Data Science. It aims to show you how to solve a Data Science problem in a pragmatic way, using the latest tools available. The goal here is not academic but very practical.

The tutorial is divided into four parts, and each takes less than 15 minutes to complete.

Formalize a problem

Start by asking a question

“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran

There is a ton of useful information out there, and knowing how to leverage the data available online is an advantage that can be cultivated. All you need is a starting point: a simple question.

For this tutorial we will use a concrete example that anyone can reproduce and adapt. Let us ask ourselves this question:

How much is my house worth right now on the market?

In other words, we want to predict the value of a given house (predict in the sense of a Machine Learning prediction, which we'll explain in part 3 of the tutorial).

Refine with some specifications

Problems are easier to tackle when they've been well specified. Here, I am adapting the framework given in Bootstrapping Machine Learning (you can find further explanations and more examples in the book):

Input (what we're given): Real estate property.

Input representation: Number of bedrooms and bathrooms, surface, year built, type of property (house, townhouse, condo), ZIP code, proximity to good schools, public transport, and access to shops and amenities.

Output (what we need to come up with): Value (in dollars).

Who is concerned: Buyers need to know that the asking price of a property for sale is justified, given its characteristics and in comparison with other properties on the market. Sellers need to find the asking price that will maximize their chances of making a good sale. Therefore, it is beneficial for both types of people to estimate the value of real-estate properties based on data on previous transactions.

Solve the problem in 4 steps

Making predictions does not have to be an overwhelming process. If you look at what has happened in the data science community during the last couple of years, many tools and services have emerged to make data-related work a lot easier and more accessible. We will take advantage of these tools.

Let's go back to our problem: how much is my house worth right now on the market? To answer this question, the rest of the tutorial is divided into three parts (this introduction being Part 0). Each part focuses on a different stage of the process and highlights a specific tool:

  1. Extract real estate data from the web with Import.IO
  2. Clean the data with Pandas
  3. Use the data to create a real estate pricing model with BigML
