Here is a summarized archive of the emails I sent to my newsletter.
What if you could predict churn?
(June 27, 2014)
A couple of weeks ago I wrote a guest post on churn prediction for Kissmetrics, and I was happy to find out this week that they just published it.
Churn prediction is one of the most popular Big Data use cases in business. It consists in detecting which customers are likely to cancel a subscription to a service based on how they use the service. If you've read Bootstrapping Machine Learning, you already know about this. But I think the article is still interesting, as it brings together several things I have written about.
Being able to predict churn based on customer data has proven extremely valuable to big telecom companies. Now, thanks to prediction services and APIs, it’s become accessible to businesses of all sizes — not only those who can afford to hire teams of data scientists.
The process is as follows:
Step 1: Gather historical customer data that you save to a CSV file.
Step 2: Upload that data to a prediction service that automatically creates a “predictive model.”
Step 3: Use the model on each current customer to predict whether they are at risk of leaving.
Check out my full post on the Kissmetrics blog for details on these steps and for an illustration of how to go through Steps 2 and 3 with BigML.
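To give a feel for what the three steps above look like in code, here is a minimal sketch in plain Python. It does not call BigML or any real prediction service: a toy nearest-centroid classifier stands in for the service, and the column names (logins_per_week, support_tickets, churned), the customers and the data are all made up for illustration.

```python
import csv
import io

# Step 1: historical customer data, as it might appear in a CSV file.
# The columns and values are made up for illustration.
HISTORY_CSV = """logins_per_week,support_tickets,churned
12,0,no
9,1,no
10,0,no
1,4,yes
2,3,yes
0,5,yes
"""

FEATURES = ["logins_per_week", "support_tickets"]


def load_rows(text):
    """Parse CSV text into a list of (feature_vector, label) pairs."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        vector = [float(record[name]) for name in FEATURES]
        rows.append((vector, record["churned"]))
    return rows


def train(rows):
    """Step 2 (stand-in for the prediction service): build a 'model'.

    The model here is just the average feature vector (centroid) per label.
    """
    sums, counts = {}, {}
    for vector, label in rows:
        counts[label] = counts.get(label, 0) + 1
        previous = sums.get(label, [0.0] * len(vector))
        sums[label] = [p + v for p, v in zip(previous, vector)]
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, vector):
    """Step 3: return the label whose centroid is closest to the customer."""
    def squared_distance(centroid):
        return sum((c - v) ** 2 for c, v in zip(centroid, vector))
    return min(model, key=lambda label: squared_distance(model[label]))


model = train(load_rows(HISTORY_CSV))
current_customers = {"alice": [11, 0], "bob": [1, 5]}  # hypothetical customers
at_risk = [name for name, vector in current_customers.items()
           if predict(model, vector) == "yes"]
print(at_risk)
```

With a real prediction service, train and predict would be replaced by calls to the service, which would pick a far better model than this toy one.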
The brain behind the hardware
(June 17, 2014)
The post is about a new trend in connected objects, which is to predict the user's next move. As you might guess, there is some machine learning involved ;) If you've read BML, you'll know for instance that Ford built a prototype of a car that predicts its destination — they just used Google Prediction API for that.
Check out the post on RudeBaguette.
If I had more spare time, I would build a system that would learn from my habits and do things such as figuring out which channel to tune in to when I turn my TV on, or which setting to turn my Hue lights to, based on the context. If you're interested in these ideas or if you have others, let me know. I often talk about use cases in web apps and businesses, but as you can see I am also very excited about the machine learning opportunities in connected objects! What about you?
Automating the data scientist
(May 15, 2014)
In my latest blog post, I am exploring the topic of automating the Data Scientist a bit further. The post is too long to include here, but if you're interested you'll find my answers to the following questions:
- What is a Data Scientist?
- What is it that they do that's being automated? What is not?
- What are the new solutions out there that are bringing Data Science directly to domain experts?
- What are these solutions changing?
The book is live
(May 7, 2014)
This is just a quick email to say that Bootstrapping Machine Learning is now live and ready for purchases: http://www.louisdorard.com/machine-learning-book
The 20% off sale will run for 24 hours until midnight, Pacific Time. If you are thinking about getting the book, now is the time.
P.S. Have a question about the book? Just hit reply.
This will save you a few days' work
(April 22, 2014)
When I started playing with Prediction APIs, I started with the web interfaces. It’s enough to have to deal with setting up accounts and projects at first, without having to worry about code! I recorded screencasts of what I did to help people quickly get a feel for how these APIs work and what they allow you to do, but also to have something to refer to when setting up their own accounts and projects with BigML and Google Prediction API.
Then, once I was able to train models and make predictions, I started coding. The language of choice for hacking and for data-related stuff is Python. It was debatable a few years ago, but not anymore. At the same time as I started using the APIs’ Python wrappers, I discovered IPython notebooks. Essentially, they act as interactive web-based code tutorials. They are web pages in which there are blocks of code that you can edit and run. The code is run on the same server that serves the page and the output is displayed on the page. I realised that IPython notebooks were a great way to show step by step how to use Prediction APIs, and that you could teach things about the APIs in between blocks of code.
So I created a couple of notebooks. Now, when I want to bring someone up to speed on Prediction APIs, I have them go through the notebooks. They can also use them if they need to quickly test a piece of code in a friendly environment. I reckon that the screencasts and notebooks combined can save people who are starting out with these APIs at least a full day’s worth of work. If I’ve had to waste time figuring out the quirks and gotchas of Prediction APIs, you don’t have to.
Also, I wrote code to evaluate the performance of these APIs on any given dataset. That’s something you’re going to need at some point, because there’s no way you’ll use predictions for real if you haven’t evaluated their quality first. So I figured that if you just reused my code, you would save even more time. It’s as simple as running:
> python evaluate.py --filename=yourdatafile.csv --services=bigml,gpred
and it gives you back performance measures.
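The general idea behind this kind of evaluation can be sketched in a few lines of Python. This is not the contents of evaluate.py: it is a generic hold-out evaluation, where holdout_accuracy, train_majority and predict_majority are hypothetical names, the dataset is made up, and a trivial "always predict the most frequent output" baseline stands in for a real prediction service.

```python
import random


def holdout_accuracy(examples, train_fn, predict_fn, test_fraction=0.25, seed=0):
    """Estimate accuracy by training on one part of the data and testing on the rest.

    examples: list of (input, output) pairs.
    train_fn: takes a list of examples, returns a model.
    predict_fn: takes (model, input), returns a predicted output.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test_set, train_set = shuffled[:n_test], shuffled[n_test:]
    model = train_fn(train_set)
    correct = sum(1 for x, y in test_set if predict_fn(model, x) == y)
    return correct / len(test_set)


# Hypothetical stand-in for a prediction service: always predict the most
# frequent output seen during training (a common baseline).
def train_majority(examples):
    outputs = [y for _, y in examples]
    return max(set(outputs), key=outputs.count)


def predict_majority(model, _input):
    return model


# Made-up dataset: 80 "ham" and 20 "spam" examples.
data = [(i, "ham") for i in range(80)] + [(i, "spam") for i in range(80, 100)]
accuracy = holdout_accuracy(data, train_majority, predict_majority)
print(f"baseline accuracy: {accuracy:.2f}")
```

A baseline like this is a useful reference point: predictions from a service are only worth using if they clearly beat always guessing the most common output.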
Still, even with all of this already done for you, you’ll have to install Python plus various Python packages and the wrappers for the APIs, and hope that there won’t be any issues specific to your machine. To save you from doing that, I have created a Virtual Machine that runs with VirtualBox (free virtualization software). You can recreate the VM with Vagrant, a tool that talks to VirtualBox programmatically and thus lets you script the creation of VMs. You can try that now if you want, since the base VM (also called "base box") is public and you just have to run these three commands:
> vagrant init louisdorard/bml-base
> vagrant up
> vagrant ssh
You’re then logged into the VM and you can start using Prediction APIs. Again, it's going to save you lots of time!
Wanna see some concrete uses of prediction APIs?
(March 5, 2014)
Last Thursday I gave a Bootstrapping Machine Learning presentation in which I talked about reimplementing Priority Inbox with Prediction APIs. In case you’re not familiar with it, Priority Inbox is a Gmail feature that detects important incoming emails and separates them from other emails by placing them at the top of the email client interface and marking them with a special icon. Detecting important emails is a classification problem that can be tackled by Machine Learning.
The talk wasn't recorded but I thought you would find some useful content in the slides:
- an explanation of the two phases of all Machine Learning systems;
- snippets of Python and JS code employing Prediction APIs;
- the format of the data I used to learn a model of email importance;
- some methods of Google Apps Script you can use yourself to collect your own data on Google services;
- a link with some more detailed information and example code;
- links to my offline email analysis with BigML;
- a link to a Google Spreadsheet that uses Google Apps Script and Google Prediction API to allow you to make predictions and fill in missing values in your spreadsheet.
Head over to my blog to look at the slides and to get a better idea of how bootstrapping ML can be done:
Obviously, all of this is covered in greater detail in my book. By the way, I completed a first version a few days ago! I’m now bringing some minor corrections to the manuscript and I'm working on the additional content (tutorials, code, etc.). I’ll announce a launch date soon, so stay tuned!
Example uses of machine learning
(January 27, 2014)
Ever wondered what Machine Learning is used for? Here is a list of examples:
- Churn analysis;
- Up-sell opportunity analysis;
- Pricing optimization;
- Sales optimization;
- Fraud detection;
- Credit scoring;
- Targeted advertising;
- Website and email optimization;
- Priority inbox (important email detection);
- Car trip optimization;
- Tweet sentiment analysis;
- Crowd prediction.
In the same way that in ML we try to make machines learn from examples, it's a great idea to learn what ML is about from a few example use cases. But if you're like me, you probably haven't been very satisfied with the ML examples you've read about. Most of the time they are too abstract. You never know which question exactly the predictive model is trying to answer, which are the inputs and outputs, and how the learning happens. Conversely, you can find very detailed examples in books on ML or Data Science, but they would be so detailed that it would take time to follow from beginning to end, or they would be intertwined with the rest of the content of the book.
In Bootstrapping Machine Learning, I have gathered examples in one single chapter, so you can get an overview of diverse uses of ML and you can easily refer to them. I have just finished writing about all the above examples! Each of them is a couple of pages long, which lets me cover quite a few examples while still providing enough information, structured in the following way: who the example concerns, a description of the context and what we're trying to do, which question we are asking the predictive system, what type of ML problem it is, what exactly the input and the output are, which features we use, how we collect data (and how much we collect) and how we use predictions (and how often we make them).
Just a couple more chapters to finish now... it's good to get closer to the end! How are things on your end? Any questions about ML you'd like to ask?
2014 will be the year of machine learning
(January 3, 2014)
As far as I'm concerned, 2013 was a year full of developments. It ended on a very positive note after running my first workshop (and learning quite a lot from it) with people from all around the world. In order to tell you why I am now so excited about 2014, and why you should be too, allow me to go over some history and recent developments related to Machine Learning and Prediction APIs.
In 2011, McKinsey published its Big Data white paper in which it predicted a shortage of people with expertise in machine learning. Funnily enough, this was the same year that the first Prediction APIs appeared. If you've been following me for a few weeks, you'll know that I believe that these APIs are part of the answer to this shortage of talent, because they democratize Machine Learning.
On December 31 I stumbled upon an article in The New York Times that mentioned a Stanford Machine Learning course:
The largest class on campus this fall at Stanford was a graduate level machine-learning course covering both statistical and biological approaches, taught by the computer scientist Andrew Ng. More than 760 students enrolled. “That reflects the zeitgeist,” said Terry Sejnowski, a computational neuroscientist at the Salk Institute, who pioneered early biologically inspired algorithms. “Everyone knows there is something big happening, and they’re trying find out what it is.”
This was also picked up by Forbes in Why Is Machine Learning (CS 229) The Most Popular Course At Stanford?. Besides, Andrew Ng's online ML course is one of the most popular on Coursera. Unfortunately, these courses do not tell you about Prediction APIs but about the inner workings of ML algorithms. This is way more technical than necessary, and I believe that this is why there's a 90% drop-out rate (more on that later). Anyway, it all testifies to the fact that people want to learn ML and to use it in what they are doing.
I believe that 2014 will be the year of Machine Learning because there will be even more Prediction services and APIs, and more resources to help people use them. Competition will push providers to introduce more awesome features and to find ways to make it easier and quicker to start coding with Prediction APIs (last month, Codenvy announced an integration of BigML).
I hope you'll have as much fun as I will with what 2014 will bring us. Happy new year everyone!
(December 3, 2013)
Remember that workshop I told you about a couple of emails back? Here is some more information.
The workshop will be hosted online and it will take place on Monday 16 December at 8am PST (5pm CET). It will last 2 to 3 hours and there will be 3 parts (with breaks in between):
- a presentation covering the main aspects of Bootstrapping Machine Learning;
- a hands-on demo;
- questions and answers.
There will be 10 seats only and they will be given, free of charge, on a first-come-first-served basis.
I will post the link to sign up for the workshop later this week. If you haven't done so already, and if you want to get the sign-up link before it is posted publicly, send me an email now!
Which blogs should I guest-post on?
(November 27, 2013)
In my previous email, I announced that I was giving away seats to my upcoming online workshop on Bootstrapping Machine Learning in December (to be announced soon). If you’re interested but haven’t told me yet, you can still shoot me an email.
I am now giving you the chance to also get the book for free!
For this, I’m asking you to suggest popular blogs to which I should be pitching guest posts in relation to Bootstrapping Machine Learning, Prediction APIs, and Big Data. Do let me know if you also have ideas for what I should write about.
I will send the book for free (when it’s out) to the first 5 people who reply to this email with their list of top 3 popular blogs I should guest-post on!
(November 4, 2013)
I'm going to be traveling a bit during the next couple of months and there might be a chance that I come near where you are... If so, I'd be happy to meet up with you if you like and to talk about what you do and how Machine Learning can help!
Here are my travel dates:
- I’ll be in the Bay Area from 7 November until 1 December (I'll be available during Thanksgiving week);
- I'll be in Bordeaux from 4 until 17 December;
- I'll be in Paris on 2 and 3 December and from 18 until 21 December.
Drop me a line if you're going to be in any of these areas on these dates!
Otherwise, we can still talk online about ML and any problems you have, during my open office hours or during my upcoming workshop.
Chapter 3 is out!
(October 28, 2013)
I have just released chapter 3 of the book, which is about Machine Learning concepts. Here is a summary:
- The most common types of Machine Learning (ML) problems are classification, regression and recommendation. In classification, the output you try to predict is a class (e.g. "spam"/"ham"). In regression it is a number (e.g. the value of a house for sale). Then there is recommendation, where given a user and a number of items N to recommend, you must find the N items that are most likely to be of interest to the user.
- Recommendation can be done by either classifying user-item pairings as relevant or not, or by Collaborative Filtering. CF consists in representing connections between users and items. For a given user u, it finds other users who have items of interest in common with u, and it looks for the items of interest of those users that u hasn't seen yet. The best method for recommendation depends on how many items there are, how much data you have, and how many items a user is connected to on average.
- There are two phases in ML: training and prediction. Training consists in analyzing example input-output data and it returns a model. Prediction consists in taking a new input and a model to return a predicted output.
- Generic Prediction APIs have two core methods that correspond to these two phases. There are also Prediction Algorithms APIs, where one needs to figure out and provide the best algorithmic parameters for training, and Specialized Prediction APIs, where the training has already been done. As you may guess, in my book I focus on teaching how to use Generic Prediction APIs.
- Some of these APIs are specialized in handling Big Data, but they may not be the best option if your data is not so big — one size doesn't fit all. It is actually quite rare to have Big Data. If it is the case and if you're just starting out with ML, it's best to first experiment with data that is small enough to be loaded into a spreadsheet program (for this you could randomly extract a subset of the data).
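The Collaborative Filtering idea from the summary above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular service: the recommend function, the users and the items are made up, and scoring candidate items by the number of shared interests is my own simplifying assumption.

```python
from collections import Counter


def recommend(user, likes, n):
    """Recommend up to n items to `user` with a simple collaborative filter.

    likes: dict mapping each user to the set of items they are connected to.
    """
    seen = likes[user]
    scores = Counter()
    for other, items in likes.items():
        if other == user:
            continue
        overlap = len(seen & items)  # items of interest in common with `user`
        if overlap == 0:
            continue  # no shared interests, so this user tells us nothing
        for item in items - seen:  # items `user` hasn't seen yet
            scores[item] += overlap  # assumption: weight by number of shared items
    # Sort by score (highest first), breaking ties alphabetically.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:n]


# Made-up data: which items each user is connected to.
likes = {
    "u":     {"book_a", "book_b"},
    "ann":   {"book_a", "book_b", "book_c"},
    "bob":   {"book_b", "book_d"},
    "carol": {"book_e"},
}
print(recommend("u", likes, 2))  # prints ['book_c', 'book_d']
```

Here ann shares two items with u and bob shares one, so ann's unseen item ranks first; carol has nothing in common with u and is ignored.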
That’s it for now. If you want more updates, you can follow me on Twitter!
Why machine learning fails
(October 16, 2013)
Do you know the top 3 reasons why machine learning fails?
If you've read chapter 2 of Bootstrapping Machine Learning until the end, you will know 2 reasons: representativeness of the examples and similar inputs being associated with different outputs. I've added a 3rd one: noise. I have also expanded the explanations on the 2nd reason. You can read this new content in this blog post I have just published.
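As a rough illustration of the 2nd reason, here is a small Python sketch that looks for conflicting examples in a dataset. It only catches the simplest case, where exactly identical inputs come with different outputs (similar but not identical inputs would need a notion of distance). The conflicting_inputs function and the example data are made up.

```python
from collections import defaultdict


def conflicting_inputs(examples):
    """Return inputs that appear with more than one distinct output."""
    outputs_by_input = defaultdict(set)
    for features, output in examples:
        outputs_by_input[tuple(features)].add(output)
    return {inp: outs for inp, outs in outputs_by_input.items() if len(outs) > 1}


# Made-up email examples: (sender domain, link flag) -> label.
examples = [
    (("gmail.com", "has_link"), "spam"),
    (("gmail.com", "has_link"), "ham"),
    (("work.com", "no_link"), "ham"),
]
conflicts = conflicting_inputs(examples)
print(conflicts)
```

When many inputs show up here, it is usually a hint that the features do not carry enough information to tell the outputs apart.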