Google

Immediately after PAPIs.io '14 — write-up coming soon! — I spent a couple of days at Strata in Barcelona. Strata has several tracks and I ended up going mostly to “business” sessions, but this synthesis of things I heard at the conference will be of interest to technical people as well. Actually there was one business session that had more code in it than another data science session I went to!

Here is my selection of key take-away messages, from sessions I attended (Strata is a huge conference so this is just a very partial view of it). Click the links to go directly to the ones you're interested in:

Make predictions that you can act on

Vince Darley: Big Data at King

Making predictions about customers (e.g. engagement, churn) is useful, but keep in mind that an action has to follow from this prediction…

  • You could use a descriptive model that tells you why this prediction was made, i.e. which combination of factors results in predicted churn, etc., which will help you choose the right action (e.g. "the customer might churn because of this so we should address this particular point").
  • You could directly predict an action to take, based on data you’ve collected about the success of different actions in different contexts.

When hitting a performance plateau, use new data — not fancier algorithms

Foster Provost: Digging into Predictive Analytics with Fine-grained Behavior Data

When you’re hitting a performance plateau with your predictive model, it probably means that you’ve extracted all the information there was to extract from the (limited) data you had. There’s no need to try ever fancier algorithms, but what you should do is work on the data itself. Foster Provost talks of looking for “massive fine-grained data”. I didn't know this terminology. I would just say to make a new pass on feature engineering (i.e. finding all the things we could measure that could have an impact on the outcome we want to predict). Here are some examples he gave of data to take into account:

  • locations frequented
  • things liked on facebook
  • people connected to
  • words, phrases in text
  • mobile analytics
  • inferred personal traits
  • document classes

Ensuring your work is reproducible has never been easier — Now make it your habit

Jeroen Janssens (@jeroenhjanssens): Data Science Toolbox and the Importance of Reproducible Research

Virtual Machine

The take-away message here is that you should always work off of a Virtual Machine which is created from a common base (e.g. Ubuntu 14.04) and a set of instructions that is contained in a configuration file (this can be through Vagrant or Docker coupled with a provisioning tool such as Ansible). This config file should be added to your version control system and should be “in sync” with your code: if you make changes to the VM config, it should recreate the VM from scratch and you should make sure your code still runs.

Jeroen created a base VM called the Data Science Toolbox, which contains most of what you'll need for Data Science (at the command line). If you need extra things to install, it’s just a change in the config file. The provisioning instructions of the DST are open source.

Drake: Make for data

For Data Science, you also want to be careful when using spreadsheet programs for data transformation. While I personally recommend using Excel (up to a certain point — open but don't save), if you need to transform your data you should do it with scripts and keep a copy of the original data file. You’ll probably need to reproduce your data transformations when you get more data or fresher data. Have a look at Drake to make it easier to do so.

Notebooks

Notebooks are a great way to document how to reproduce your analyses as you’re able to explain in a web page all the steps you’ve taken to go from pre-requirements to final result, with embedded, runnable blocks of code for each step. This is popular in the Python community with IPython notebooks but the Jupyter project extends it to other languages.

Getting a business edge with data requires automating decisions

Uwe Weiss (@WeissU): Why decision automation is key in big data analysis

Decision making is very challenging for humans, for several reasons:

  • situations are complex (many things to take into account)
  • decisions must be made frequently
  • we humans have natural biases which make us take irrational/bad decisions (see Daniel Kahneman's Thinking Fast and Slow)

Using data can help us make more informed decisions and remove some of the bias. But Uwe’s practical experience points out to communication issues:

  • metrics dashboards are contradictory and confusing
  • monthly reports are ignored after two iterations
  • in-house analyst teams are overworked and powerless

Another part of the solution to improve decision making process is to write business rules. But they’re still going to be followed by a human — with the limitations it implies. Business rules are like computer programs, written by non programmers. So how about having them written by a programmer? Better yet: how about having the rules figured out (and written) by a machine, after an analysis of all the data that’s available (i.e. with machine learning)? This data is always much more than a human can process, and often times a small change in one of the things we measure about a given situation can have an important impact on the outcome we observe.

Predictive modelling can be used to extract relationships between situations and outcomes, from historical data. In businesses that sell perishable goods, it is of interest to predict demand based on the current context. You can then decide how much to have in stock accordingly. If you don’t have enough then you’re losing business; if you have too much then you’ve wasted money. The ideal decision-making process is as follows (completely automated): the machine collects and stores data, it analyzes it to create a predictive model, it uses the model to make predictions, and it makes a decision based on predictions.

Others

Google Docs can work as a front-end to your data warehouse

Seeking Alpha have done it and they've showed us how. Google Docs are easily accessible (web, tablet, smartphone), shareable, and they can be used by everyone up to C-level execs. They take care of a lot of the IT and front-end work in making it possible to show your company’s latest data internally to as many people as possible.

A/B testing does not work in social

In many cases where you need to test in the real world how good your predictive models are doing, A/B testing does the job, but not when you’re testing features with social components (e.g. for games): what if 2 friends are in different segments? I’d never come across that problem and I’m not sure what to do in that situation but it’s worth mentioning and thinking about…

Would I recommend Strata?

This was my first Strata conference and it's certainly what I'd call a commercial conference (in my own "commercial" vs "industry community" vs "academic" conference classification). For $1000 it’s not exactly small business-friendly… (I was very lucky to get a free pass) It also feels like you get sold a lot of things and some talks are thinly-veiled sales pitches for the technology they're presenting. But you still get to hear very interesting things and stories (as shown in this post I hope!) and it's a place where many people in the data community come and meet (I was told the audience is around 1500 people).

If you're a big company, Strata will surely be something you'll want to attend. I wish there were a similar event that would be more suited to small businesses...

Enjoyed this article? Follow me!