Becoming A Data Scientist

Sep 02, 2014

What is data science? A buzzword with meanings as varied as butterflies. If you flip through job descriptions for data scientists, they all require similar skill sets and seemingly want the same group of people, but it is still unclear what exactly is expected. I am fortunate enough to work in this area, and have experienced the transition from statistician to data scientist. Having recently graduated and being about to start work as a data scientist, I decided to put down my thoughts about this sexy job in an organized manner. I hope my personal opinions on the constant battle between statisticians and computer scientists, and my personal experience of working as a data scientist, can be of some help if you want to pursue a career in this area.

By definition, data science is the union, or intersection, of statistics, computer science, and domain knowledge. Since I was trained as a statistician, I constantly ask myself: are data scientists any different from old-fashioned statisticians? Conceptually, there is nothing new. The famous paper The Future of Data Analysis, published half a century ago by John Tukey, who, by the way, was a statistician, emphasized all three components. The importance of domain knowledge cannot be overstated, as statistics is by nature an interdisciplinary field. Although theories are the foundation of data analysis, theories by themselves are often limited in the real world. Applying theories built on convenient assumptions that neglect the imperfect reality of the subject may produce misleading results. Insight into the domain defines the direction of the data analysis and leads to valuable results. Besides domain knowledge, computing is another aspect that Dr. Tukey tried hard to emphasize. He sometimes refers to it as practicality. ‘If data analysis is to be helpful and useful, it must be practiced.’ That said, what is expected are engineers who can build something that takes data as the raw material and delivers results as the output product. Much like Diana Prince must twirl into Wonder Woman, a data scientist must be able to gracefully perform the instant transformation between the statistician/mathematician and the engineer/programmer.

Since the road map was laid out half a century ago, is data science really an old gift in a new wrapper? As I understand it, it is new, because every challenge we face on a daily basis is unique by nature. Moreover, as the size of the data makes many attempts very time-consuming, a better understanding of the underlying theory and the target problem turns out to be crucial for getting the job done in time. It is a long way from acknowledging the challenge of computing to fully conquering this field.

The battle between statisticians and computer scientists is still ongoing, and probably will not come to a ceasefire in the foreseeable future. The way I see it, it is a battle between beautiful, complex models and what can be practically implemented. As Tukey pointed out, practicality is an obstacle statisticians have to overcome. This includes data manipulation and building stable, robust models. The tedious process of data cleaning and data organization is a crucial part of the job. It is a conceptually easy task, but it could be the Achilles heel of statistics students. A lot of patience and meticulousness is required. An argument made from the standpoint of human resources says that having data scientists spend tons of time organizing and manipulating the data is a waste. I disagree. Not only does running the model require organized data in a format that only data scientists are familiar with, but interesting insights into the problem also come from exploring the data set, especially when the data is big and complex. The next step, building stable and robust models, should be more fun. However, making them solid is not easy. It is often the case that the data breaks the very assumptions the models are built upon. If the model is run in a system over and over again on updated data sets, more challenges come with corner cases and heterogeneity. This is when the intuitive and practical methods proposed by computer scientists swoop in. To work as a data scientist with a background in statistics, solid knowledge of statistical inference, modeling, and algorithms is the prerequisite. On top of that, we need to get familiar with the practical tools of computer science that have proven useful through extensive empirical experience.
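
To make the point about corner cases a bit more concrete, here is a minimal sketch of the kind of check that might run before a model is refit on an updated data set. Everything in it is an assumption for illustration: the function name find_problems, the field names age and income, and the allowed range are hypothetical, and a real pipeline would check far more than this.

```python
import math


def find_problems(rows, required, ranges):
    """Return (row_index, field, reason) tuples for rows that break basic assumptions.

    rows     -- list of dicts, one per record (hypothetical layout)
    required -- field names that must be present and not NaN
    ranges   -- dict mapping a field name to its allowed (low, high) interval
    """
    problems = []
    for i, row in enumerate(rows):
        # Missing values: None or a float NaN both count as missing.
        for field in required:
            value = row.get(field)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                problems.append((i, field, "missing"))
        # Range checks are only applied to numeric values that are present.
        for field, (low, high) in ranges.items():
            value = row.get(field)
            if isinstance(value, (int, float)) and not math.isnan(value):
                if not low <= value <= high:
                    problems.append((i, field, "out of range"))
    return problems


# Made-up records standing in for a freshly updated data set.
rows = [
    {"age": 34, "income": 52000.0},
    {"age": -1, "income": 48000.0},   # impossible age
    {"age": 51, "income": None},      # missing income
]

problems = find_problems(rows, required=["age", "income"], ranges={"age": (0, 120)})
for problem in problems:
    print(problem)
```

The check only reports problems rather than silently dropping or fixing rows, so that whoever looks at the report can decide whether a flagged record is a data error or a genuine corner case the model should handle.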

On the job market, you may find two types of data scientist jobs, depending on how the analysis results are used. The first type of data scientist produces results that talk to machines. A major employer is the online advertising industry. The second type produces results that communicate to human beings. Although the two types sound dramatically different, they require very similar skill sets, except that the former emphasizes solid coding skills while the latter requires good interpersonal skills. It is probably hard to tell which type a job is from the job description, but you will have a clear answer after the phone interview. If your goal in life is to build a successful career, I believe the two types of data scientist jobs are merely two paths that lead to the same destination. Showing that you can analyze data can help you land a job, while domain knowledge and interpersonal skills determine how far you can go down this career path. It is not easy to crunch the numbers, but the time and effort invested will provide some useful results in return; it is much harder to deal with the parts of the job that do not involve data.