Everyone can do Data Science, Part 3 — BigML

The following is a guest post by Fabien Durand in the Everyone can do Data Science series. Here, Fabien shows us the 3rd step of his tutorial, how to predict real estate prices using BigML. The 1st step dealt with scraping data from the web and the 2nd step with cleaning that data.

By now you've learnt how to scrape data from realtor.com using Import.IO and how to prepare this data using Pandas. Here comes the most interesting part: building a predictive model of real estate pricing from the data we collected. I will show you how to use BigML for that.

BigML: Machine Learning made accessible

BigML is a powerful Machine Learning service that offers an easy-to-use interface for you to import your data and get predictions out of it. The beauty of the service is that you do not need a deep knowledge of Machine Learning techniques to get the most out of it. Advanced options are available, but in our case (and in most cases) you will not need them. BigML creates predictive models easily thanks to its powerful "1-click" feature. BigML is also free for tasks up to 16MB in development mode, which is more than enough for this tutorial.

How Machine Learning works

It is important to remember that when I talk about predictions here, I mean predictions in the Machine Learning sense. Machine Learning works in two steps:

  • We want to create a model based on our existing data. This process is called Training. We train the model with examples, e.g. real estate data scraped from realtor.com. The model is an interpretation of the dataset and of the relationships between the attribute we want to predict (house price) and the other attributes (lot size, number of bedrooms, etc.). In BigML, the predictive model is represented as a decision tree (see below).
  • Then, we present new inputs to our model, in our case the characteristics of a house (lot size, number of bedrooms, etc.), and BigML predicts the output (house price). This process is simply called Predicting, or sometimes Scoring. The output (also called 'goal' in BigML) is a value. In our case it is a numerical value: the price of the house.
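To make the two steps concrete, here is a minimal sketch of training and predicting with the smallest possible decision tree: a single split on lot size. The data and the function names (train_stump, predict) are made up for illustration. This is not how BigML is implemented, only the general idea.

```python
# Toy training data: (lot size in sq ft, price in USD). All values are invented.
examples = [(2000, 150_000), (2500, 170_000), (4000, 300_000), (4500, 320_000)]


def train_stump(rows):
    """Training: find the lot-size threshold whose two branches have the lowest squared error."""
    best = None
    sizes = sorted({size for size, _ in rows})
    # Candidate thresholds sit halfway between consecutive lot sizes.
    for low, high in zip(sizes, sizes[1:]):
        threshold = (low + high) / 2
        left = [price for size, price in rows if size <= threshold]
        right = [price for size, price in rows if size > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        # Error made when each branch predicts its own mean price.
        error = sum((p - left_mean) ** 2 for p in left)
        error += sum((p - right_mean) ** 2 for p in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return {"threshold": threshold, "left": left_mean, "right": right_mean}


def predict(model, lot_size):
    """Predicting: follow the single split and return the mean price of that branch."""
    return model["left"] if lot_size <= model["threshold"] else model["right"]


model = train_stump(examples)
print(model["threshold"], predict(model, 2200), predict(model, 4200))
```

A real decision tree repeats this split search on each branch and over every attribute, which is what produces the tree shown later in this tutorial.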

For a more complete explanation on how Machine Learning works with concrete examples, have a look at the book Bootstrapping Machine Learning.

Getting Started

All you need is a well-prepared CSV file that BigML can interpret correctly. If you followed the instructions of the previous article and prepared your CSV file without problems, you are good to go. You can also download mine: realtor_importio_cleaned.csv.

Here is what the CSV looks like in Google Spreadsheet:

Create a Source

Let's start by going to the BigML dashboard page, under the tab "Sources". For the duration of this tutorial we are going to stay in Development Mode, which is the equivalent of a free plan with minor restrictions that won't affect us.

Follow the instructions in the screenshot below to upload the CSV.

Now the CSV is listed as a source and can be used in BigML. Click on it to have a look at what was imported.

Note that the first line in the field list was added when we used Pandas in the previous step. It must be removed, as it is not an attribute that influences the price and should not be used by BigML to create the predictive model.
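If you would rather fix the file itself before uploading, the extra field can be dropped with a few lines of Python. The sketch below assumes the extra field is an unnamed first column holding the row index, which is what pandas writes by default (DataFrame.to_csv accepts index=False to skip it). The sample rows and column names are hypothetical.

```python
import csv
import io

# Hypothetical excerpt of a CSV that was saved with the row index included.
raw = io.StringIO(
    ",bedrooms,lot_size,price\n"
    "0,3,2000,150000\n"
    "1,4,4500,320000\n"
)

rows = list(csv.reader(raw))
# Drop the first column (the unnamed index) from the header and from every record.
cleaned = [row[1:] for row in rows]

out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(cleaned)
print(out.getvalue())
```

With a real file you would open it with open(...) instead of io.StringIO. Removing the field in BigML, as described in the next sections, works just as well.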

Create a Dataset

We need to create a dataset from the source before we generate our predictive model. To do that, follow the instructions below.

Here is what it looks like.

Configure the Dataset

The last line is set as the goal by default (crosshair icon). The goal is what we want to predict. BigML calls it the Objective Field. If your last line is not the goal, you can change it by hovering your mouse over the line you want. An icon will appear that lets you make the change. Removing the line added in Pandas works in much the same way, see below. Selecting or deselecting fields and choosing the objective field can also be done using the configuration panel.

It should look like this.

Filter the Dataset

This step is important because we need to exclude the houses that are too expensive, say those priced above US$1 million. Why? Because they are very rare, yet they could still distort our results when we evaluate the accuracy of our model later on. Discarding them makes our life easier and still allows us to make predictions for most houses.

Follow the instructions below to filter the dataset.


Below is what we had before filtering the price.

And now after the filtering.

After filtering the dataset you see that the representation of the price distribution is less concentrated on the right side of the histogram.
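For readers who prefer to filter before uploading, the same rule can be sketched in plain Python. The column name "price" and the sample rows are assumptions. The US$1 million cap is the one chosen above.

```python
import csv
import io

PRICE_CAP = 1_000_000  # the threshold suggested in the text

# Hypothetical rows; the column name "price" is an assumption.
data = io.StringIO(
    "bedrooms,price\n"
    "3,250000\n"
    "5,1800000\n"
    "4,1000000\n"
    "2,120000\n"
)

houses = list(csv.DictReader(data))
# Keep houses priced at or below the cap; anything above US$1 million is excluded.
kept = [h for h in houses if float(h["price"]) <= PRICE_CAP]
print(len(houses), "->", len(kept))
```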

Create a Predictive Model

This is where you can experience the power of BigML. You can either configure your predictive model manually or let BigML do the job for you thanks to its "1-click" features. We will of course choose the latter and select the "1-click" model feature. See instructions below.

By default, BigML represents the predictive model as a decision tree.

The decision tree shows you how BigML classified the houses according to their attributes. By clicking on a node you can see how BigML grouped certain houses together and why. The panel on the right shows what the houses in the selected node have in common.

You can also choose a sunburst visualization and interact with it (sunburst embedded below). The only difference with the decision tree is that it starts from the center instead of the top.

BigML also offers a Model Summary Report feature. It helps you to visualize the importance of each field in the model in a simple histogram. Moreover, with the Download Actionable Model you can transform your model into code that you'll be able to run wherever you want.
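To give an idea of what "a model as code" means, a decision tree can be written as nested if/else statements. The function below is a hand-written, hypothetical example: the thresholds and prices are invented, and the code BigML actually generates will look different.

```python
def predict_price(bedrooms, lot_size):
    """Hypothetical hand-written tree; thresholds and prices are invented for illustration."""
    if lot_size <= 3250:
        if bedrooms <= 2:
            return 140_000
        return 175_000
    if bedrooms <= 3:
        return 290_000
    return 330_000


print(predict_price(bedrooms=3, lot_size=2000))
```

Each if corresponds to a node of the tree, and each return corresponds to a leaf.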

Evaluating the Predictive Model

Having a predictive model is good; assessing its accuracy is better. To evaluate the model we will split the dataset we used into two parts.

  • The first part is called the training dataset. It represents 80% of the original dataset and will be used to create a training model, exactly as we just did with the original dataset.
  • The second part is called the test dataset. It represents the remaining 20% of the original dataset.
  • We then run an evaluation where the model (built from the training set) will be used to make predictions on the inputs of the test set, and these predictions will be compared to the outputs of the test set.
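The split described above can be sketched in a few lines of Python. The function name split_dataset and the fixed seed are my own choices for illustration. BigML does the equivalent for you.

```python
import random


def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into a training part and a test part."""
    shuffled = rows[:]
    # A fixed seed makes the split reproducible from one run to the next.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 houses
train, test = split_dataset(rows)
print(len(train), len(test))
```

Shuffling before cutting matters: if the file is sorted (by price or by city, for example), taking the first 80% as is would give a training set that does not resemble the test set.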

All of these steps can be done easily in BigML, here is how.

Below is what you should have by now.

We can now easily create the training model as shown below.

We have successfully created a test dataset and a model from the training dataset. Now we ask BigML to use these two to perform an evaluation of the model and to assess its accuracy.

You need to go under the "Evaluations" tab, see below.

As shown in the following screenshot, you select the model on the left and the test dataset on the right.

The evaluation of the model is shown as a benchmark between the model and two baselines: mean-value predictions and random predictions. When a metric is shown in green, we know that the model outperformed the baselines. You can compare the results to see by how much. The R-squared value shows how much better the model performs than always predicting the mean.
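To see what these numbers measure, here is a small sketch that computes the Mean Absolute Error and R-squared for a model and for a mean-value baseline. The prices are invented, and for simplicity the baseline uses the mean of the test prices themselves. An R-squared of 0 means "no better than the mean" and 1 means "perfect predictions".

```python
actual = [100_000, 200_000, 300_000, 400_000]     # true prices (made up)
predicted = [120_000, 180_000, 310_000, 390_000]  # model predictions (made up)


def mean_absolute_error(y_true, y_pred):
    """Average size of the error, in the same unit as the price."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 minus the ratio of the model's squared error to the mean predictor's squared error."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Baseline: always predict the mean price.
baseline = [sum(actual) / len(actual)] * len(actual)

print("model   :", mean_absolute_error(actual, predicted), r_squared(actual, predicted))
print("baseline:", mean_absolute_error(actual, baseline), r_squared(actual, baseline))
```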

We are told that the average error is $58,275.83 (Mean Absolute Error). It would be interesting in this case to also have the average error relative to the true value we tried to predict (the Mean Absolute Percentage Error). Indeed, making an error of 50K for a house that’s worth 1M is less of a big deal than for a house that's worth 100K.
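The Mean Absolute Percentage Error is easy to compute yourself once you have the predictions. The sketch below uses the two houses from the example above: the same 50K error is about 5% of the 1M house but about 50% of the 100K house. The function name is my own.

```python
def mean_absolute_percentage_error(y_true, y_pred):
    """Average absolute error, expressed as a percentage of the true value."""
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100 * sum(ratios) / len(ratios)


true_prices = [1_000_000, 100_000]
predictions = [950_000, 150_000]  # both predictions are off by 50K

print(mean_absolute_percentage_error(true_prices, predictions))
```

Note that this measure is undefined when a true value is 0, which is not a concern for house prices.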

One way to decrease the error is to add more attributes, such as the year of construction or renovation, the presence of a swimming pool, or the proximity to schools and public transport. We only grabbed data from realtor.com, but adding economic and social data (Open Data) would be a great way to improve the results. Building such a model is more complex and out of the scope of this tutorial.

Another way to increase the accuracy of the model would be to increase the volume of data we use. At the beginning we grabbed data for 9,000+ houses with Import.IO. After we cleaned the data we ended up with data for around 5,000 houses. More data generally means more accuracy, as the Machine Learning algorithm can learn from a larger number of examples when building a predictive model.

You can also try to create an ensemble of models to improve accuracy. What is an ensemble? According to the BigML Blog: "By learning multiple models over different subsamples of your data and taking a majority vote at prediction time, the risk of overfitting a single model to all of the data is mitigated." Creating an ensemble in BigML is easily done by clicking on "1-click Ensemble" instead of "1-click Model".
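The idea can be sketched with deliberately trivial models: each one is trained on a different random subsample and the results are combined at prediction time. For a numerical value like a price the natural way to combine is to average, while the majority vote mentioned in the quote applies to classes. The data and helper names are invented, and this illustrates the principle rather than BigML's implementation.

```python
import random
from collections import Counter
from statistics import mean

prices = [150_000, 170_000, 300_000, 320_000, 210_000, 260_000]  # made-up prices


def train_mean_model(sample):
    """Deliberately trivial 'model': always predicts the mean price of its sample."""
    avg = mean(sample)
    return lambda: avg


rng = random.Random(0)
models = []
for _ in range(5):
    # Each model sees a different random subsample (drawn with replacement).
    subsample = [rng.choice(prices) for _ in prices]
    models.append(train_mean_model(subsample))

# Regression: combine the models by averaging their predictions.
ensemble_price = mean(m() for m in models)


def majority_vote(labels):
    """Classification: the most common predicted class wins."""
    return Counter(labels).most_common(1)[0][0]


print(round(ensemble_price), majority_vote(["red", "blue", "red"]))
```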

How much is your house worth on the market?

Now that you have your model you can enter the attributes of your house and get a prediction back. Note that the fields are ordered by importance, meaning that the first fields contribute more to error reduction.

Wrap up

If you need to remember one thing, remember that the predictions you get will only be as good as the data you feed the algorithm (in our case, the BigML algorithm). In this example we have been using what we call a regression model. In other words, our problem was to find the most accurate numerical value (the price) for given attributes. BigML can also deal with classification problems, where the outcome you want to predict is a class (e.g. "blue", "red", "green").

Fabien is a business student at the EBS University for Business and Law in Germany and at the KEDGE Business School in France. He loves to bring the IT and Business world together. At the moment, everything from Data Science to Digital Innovation brings him much joy! You can find him on LinkedIn and Twitter.