Everyone can do Data Science, Part 0 — Introduction

The following is a guest post by Fabien Durand in a series called Everyone can do Data Science.

Many data-related tasks that once required a software engineer, a PhD graduate in an analytical field, or a "Data Scientist" can now be performed with easy-to-use graphical tools and beginner-friendly software. There are different types of tools tailored to different stages of data processing, and most of them require little to no coding experience.

In this tutorial, I will show you how these tools can be combined to create a predictive model of the value of real estate properties, based on data extracted from the web. I am not a programmer but a business school student. What I've done will show you that these new tools are democratizing Data Science and making it easier than ever.

Data What?

Before we start, I want to give you my definition of Data Science so you can understand why I structured the following tutorial the way I did. When I talk about Data Science, I think of a combination of activities that help to solve a problem for which data is available. Traditionally, a Data Scientist is someone who has cross-domain knowledge ranging from Mathematics to Statistics, Machine Learning, Computer Science, Business Analytics, Marketing, Social Science, etc. Data Science will not look the same in two different situations. It largely depends on the specifics of a given problem, the skills of the individual and the domain of application.

“I keep saying that the sexy job in the next 10 years will be statisticians, and I’m not kidding.” – Hal Varian, chief economist at Google

How is this tutorial different?

Now that I have given my definition of Data Science, I want to explain why this tutorial is not a traditional Data Science tutorial. I truly believe that the best way to learn something is by actively doing instead of just listening or reading. Here you will find only practice, by which I mean concrete and useful examples.

This tutorial will be a basic introduction to the world of Data Science. It aims to show you how to solve a Data Science problem in a pragmatic way, using the latest tools available. The goal here is not academic but very practical.

The tutorial is divided into four parts, and each takes less than 15 minutes to complete.

Formalize a problem

Start by asking a question

“You can have data without information, but you cannot have information without data.” – Daniel Keys Moran

There is a ton of useful information out there, and knowing how to leverage the data available online is an advantage that can be cultivated. All you need is a starting point, a simple question.

For this tutorial we will use a concrete example that anyone can reproduce and adapt. Let us ask ourselves this question:

How much is my house worth right now in the market?

In other words, we want to predict the value of a given house (predict in the sense of a Machine Learning prediction, which we'll explain in part 3 of the tutorial).

Refine with some specifications

Problems are easier to tackle when they've been well specified. Here, I am adapting the framework given in Bootstrapping Machine Learning (you can find further explanations and more examples in the book):

Input (what we're given): Real estate property.

Input representation: Number of bedrooms and bathrooms, surface area, year built, type of property (house, townhouse, condo), ZIP code, proximity to good schools and public transportation, and access to shops and amenities.

Output (what we need to come up with): Value (in dollars).

Who is concerned: Buyers need to know that the asking price of a property for sale is justified, given its characteristics and in comparison with other properties on the market. Sellers need to find the asking price that will maximize their chances of making a good sale. Therefore, it is beneficial for both types of people to estimate the value of real-estate properties based on data on previous transactions.
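
For readers curious about how such a specification might look once it reaches code, here is a minimal Python sketch. The class name Property, its field names, the unit of the surface and the sample values are all assumptions made for illustration; they do not come from the book or from the tools used later in this series. The proximity and amenities features are left out to keep the example short.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Property:
    """One real estate property: the input representation plus the output."""
    bedrooms: int
    bathrooms: float
    surface: float                   # unit (e.g. square feet) is an assumption
    year_built: int
    property_type: str               # "house", "townhouse" or "condo"
    zip_code: str                    # kept as text so leading zeros survive
    value: Optional[float] = None    # output, in dollars; None when unknown


# A made-up property whose value we would like to predict.
my_house = Property(
    bedrooms=3,
    bathrooms=2.0,
    surface=1500.0,
    year_built=1995,
    property_type="house",
    zip_code="02134",
)

# Separate what we are given (the features) from what we need to come up with.
features = asdict(my_house)
target = features.pop("value")
```

Writing the specification down this way makes the split explicit: everything in features is what we know about a property, and target is the single number we want a model to estimate.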

Solve the problem in 4 steps

Making predictions does not have to be an overwhelming process. If you look at what has happened in the data science community over the last couple of years, many tools and services have emerged to make data-related work a lot easier and more accessible. We will take advantage of these tools.

Let's go back to our problem: How much is my house worth right now in the market? To answer this question, the tutorial is divided into four parts. Each part focuses on a different stage of the process and highlights a specific tool:

  1. Extract real estate data from the web with Import.IO
  2. Clean the data with Pandas
  3. Use the data to create a real estate pricing model with BigML
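
To give a rough idea of what the cleaning stage in step 2 can involve, here is a small Pandas sketch. The column names and rows are made up for illustration and do not reflect the actual output of any extraction tool; the cleaning steps shown (removing duplicates, dropping rows without a price, converting text to numbers) are common ones, not necessarily the ones used later in this series.

```python
import pandas as pd

# Made-up rows standing in for listings extracted from a website.
raw = pd.DataFrame({
    "price": ["$250,000", "$250,000", "$310,500", None],
    "bedrooms": ["3", "3", "4", "2"],
    "zip_code": ["97201", "97201", "97210", "97205"],
})

clean = (
    raw.drop_duplicates()            # remove listings that appear twice
       .dropna(subset=["price"])     # a row without a price cannot teach a model
       .reset_index(drop=True)
)

# Turn text such as "$250,000" into the number 250000.0.
clean["price"] = (
    clean["price"].str.replace(r"[$,]", "", regex=True).astype(float)
)
clean["bedrooms"] = clean["bedrooms"].astype(int)

print(clean)
```

The ZIP code is deliberately left as text: it is a label for an area rather than a quantity, and converting it to a number would drop any leading zeros.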

Fabien is a business student at the EBS University for Business and Law in Germany and at the KEDGE Business School in France. He loves to bring the IT and Business worlds together. At the moment, everything from Data Science to Digital Innovation brings him much joy! You can find him on LinkedIn and Twitter.