Skills of a Data Miner

The sexy job in the next ten years will be statisticians

This post was inspired by an interesting, if unusual, quote in the McKinsey Quarterly:

I keep saying the sexy job in the next ten years will be statisticians. People think I’m joking, but who would’ve guessed that computer engineers would’ve been the sexy job of the 1990s? The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, not only at the professional level but even at the educational level for elementary school kids, for high school kids, for college kids. Because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it.

— Hal Varian, Google’s Chief Economist

Gold mine

Hal Varian is referring to what is now considered a discipline in its own right: data mining. Data mining is not only statistics, even if statistics is its most recognised academic component. It also includes data cleaning, machine learning and data visualisation.

To put everything together, you need a good dose of programming skills. A modern statistician, or data miner, should therefore be able to perform most, if not all, of the following activities:

  • Data cleaning. A number of processes that are applied to an initial data set to convert it into a different, but related, data set. These processes fall into four categories: recognition, parsing, filtering, and transformation. In other words, this is the painful-but-fundamental process of cleaning data before anyone performs any meaningful analysis on it.

  • Statistics. Without statistics there would be no data mining, as statistics is the foundation of most of the technologies on which data mining is built. A degree in applied mathematics, if not a master's or a PhD, would be helpful.

  • Machine learning. The discipline of writing computer programs that learn from the data they study. Here you need knowledge of dynamic languages used in scientific computing, such as Python and R.

  • Visualisation. In the end we, and most importantly our clients, need to see our precious results. Those results are unlikely to be easy to display on a simple two-dimensional graph, so advanced visualisation software is required to handle multidimensional datasets.
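To make the four data cleaning categories a little more concrete, here is a minimal Python sketch that uses only the standard library. The sample rows, the "date, value" layout, the 0 to 100 plausibility range and the function name clean are all assumptions invented for illustration; real data sets will need their own rules.

```python
import re
from statistics import mean

# Hypothetical raw records, invented for this illustration.
RAW_ROWS = [
    "2011-03-01, 12.5",
    "2011-03-02, n/a",      # value cannot be parsed as a number
    "garbage line",         # does not look like a record at all
    "2011-03-03, 14.0",
    "2011-03-04, 950.0",    # outside the assumed plausible range
]

# Assumed record layout: an ISO-style date, a comma, then a value.
ROW_PATTERN = re.compile(r"^\s*(\d{4}-\d{2}-\d{2})\s*,\s*(\S+)\s*$")


def clean(rows, low=0.0, high=100.0):
    """Return (date, centred_value) pairs after the four cleaning steps."""
    kept = []
    for row in rows:
        # Recognition: does the row look like a record?
        match = ROW_PATTERN.match(row)
        if match is None:
            continue
        date, raw_value = match.groups()
        # Parsing: turn the text into a number, dropping rows that fail.
        try:
            value = float(raw_value)
        except ValueError:
            continue
        # Filtering: discard values outside the assumed plausible range.
        if not (low <= value <= high):
            continue
        kept.append((date, value))
    if not kept:
        return []
    # Transformation: centre the values on their mean.
    centre = mean(value for _, value in kept)
    return [(date, round(value - centre, 2)) for date, value in kept]


if __name__ == "__main__":
    for date, value in clean(RAW_ROWS):
        print(date, value)
```

Only two of the five sample rows survive, which is a fair picture of why cleaning comes before any analysis.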

These techniques are used together to study data and find previously hidden trends or patterns within it. Data mining is gaining acceptance in areas of science and business that need to analyse large amounts of data to discover trends they could not otherwise find.
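As a toy illustration of finding a trend in data, the sketch below fits a straight line by ordinary least squares, one of the simplest ways a program can "learn" from observations. It is a sketch under stated assumptions rather than a recommended method: the function name fit_line and the sample numbers are invented, and real projects would normally reach for an established statistics library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, so no line can be fitted")
    # How x and y vary together.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical observations, invented for this illustration.
    months = [1, 2, 3, 4]
    sales = [3, 5, 7, 9]
    slope, intercept = fit_line(months, sales)
    print(f"trend: sales grow by about {slope:.2f} per month "
          f"(intercept {intercept:.2f})")
```

The slope is the "trend": a positive value means the quantity rises as x increases, a negative one means it falls.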

We are being swamped with data that we mostly cannot use for any business advantage. The amount of data will double in the next three years. Raw data is useless. Those geeks or sexy statisticians who can model, clean, and visually communicate data are going to have fun.