The Ecosystem of Data Jobs – Data Scientist (Definition & Profiles)

Within the ecosystem of data jobs, there is one profile that stands out: the Data Scientist. There is a legitimate criticism that the job title is overhyped. Nowadays almost any quantitative job – and anyone doing one – claims the title because it is the “sexiest job of the 21st century”. However, we argue that the significant advances in the field, and the way it has profoundly changed our practical lives, justify the importance it is given. Still, some clarity is needed to sketch a deeper definition of Data Science that explains what it encompasses and gives an understanding of the specializations that have evolved from it. We are here to provide you with the right definitions and set them into context.

This time, our approach is to zoom in progressively, going from the most general to the particular applications – and job titles – of Data Science. It is important to emphasize that companies, depending on their field and needs, will seek candidates with different degrees of specialization. Whether you are a candidate looking for a job or a company in need of a new Data Scientist, it is fundamental to understand the level of specialization that best fits you.

This guide will take you through:

A deeper definition of what Data Science is and the conditions for its success

And then we zoom in on the Data Science job profiles: Machine Learning Engineer, Deep Learning Engineer, NLP Engineer, and Computer Vision Engineer

This is the second part of a series in which we guide you through the Ecosystem of the data jobs market. If you haven’t read the first part, jump to it to get a full picture of the different profiles of data jobs.

A deeper definition of Data Science

A minimalist definition of Data Science states that:

Data science focuses on the processes and systems that enable us to extract knowledge or insight from data in various forms and translate them into actions.

Source: Realizing the Potential of Data Science

This very succinct definition of Data Science already tells us a lot – and we will come back to it. However, to truly understand Data Science, we need to go further. The challenge is that the field of Data Science is quite novel and still developing. How is it possible to define something that is still in gestation? Isn’t that a contradiction?

In the first part (link) of the series we jumped right into the topic without explaining why we describe the data job industry – and by extension the field of Data Science – as an “ecosystem”, and why we consider it a very productive metaphor. Now it is time to develop this idea:

An ecosystem is defined as a biological community of interacting organisms and their physical environment. This means that to understand an ecosystem, it is necessary to analyze the interactions that occur among its organisms and to situate them – both the organisms and their interactions – within a physical environment to see how they are shaped. We propose the following thought experiment: Can we borrow some basic concepts from evolutionary biology to understand the development – especially the fragmentation and specialization – of Data Science and its job profiles?

The first idea that we explore from the definition of an ecosystem is that the context or physical environment in which these organisms exist is key to the possibilities they have to develop, the concrete forms they take, and the profiles that turn out to be most successful. In the case of the Data Science ecosystem, some of the main characteristics of the physical environment in which it developed were:

The mathematical theories underlying the techniques were mature enough to serve as a basis for the development of new techniques to extract value from data
Starting with the Industrial Revolution, there was a state of increasing technological development that allowed unprecedented interconnectedness
This interconnectedness started with physical mobility of people, goods and capital but with digitalization it reached a new level in which the immediate exchange of ideas was also made possible.

These unique conditions formed a primordial soup containing all the necessary ingredients for life to spark: the innovation loop in Data Science started.

The second important idea to explore is that the interaction among the organisms is a distinguishing feature of the ecosystem. If we consider the Data Science field to evolve as species in the natural world do, can natural selection help us understand its development? Please note that we explicitly avoid equating natural selection with the common notion of “survival of the fittest” *. Why? Because in our view, the concept of “survival of the fittest” understates – or even indirectly denies – the cooperative forms of evolution that are very important in any ecosystem.

What would an evolutionary explanation of Data Science look like?

After the innovation loop started, the first proto Data Scientists appeared: Mathematicians, Statisticians, Business Managers, etc. For some time, there was stability, until the conditions of their environment changed: computers appeared. This set off an unprecedented revolution. Data came to be the main nutrient, the thriving factor of this ecosystem. In the beginning, just a few recognized the advantages that data brought, but with time there was widespread understanding of its potential. Organisms in this ecosystem developed data skills that focused on extracting value from data in different ways – with different techniques or different tools. Moreover, some data organisms specialized in processes like creating more efficient ways of storing this information or spreading their applications back to the environment. Advancements in one field made other data jobs more productive, making cooperative interactions very convenient.

The ground was ready for the Data Scientist to appear. The newly gained computing capabilities allowed Data Scientists to apply new methods based on looking at data in a completely different way. They started focusing on predicting future events, and they got really good at it. They discovered that these techniques could be applied to almost every aspect of human life, from recommending the next movie you should watch to making a more accurate diagnosis of diseases. Digitalization brought an explosion in the amount and diversity of data that was created and fed to the models. Data Scientists developed ways of extracting value from data formats that were deemed useless until then (text, audio, images, video). This allowed them to diversify the applications of their models even further.

Further specialization became a very smart strategy for thriving. Data Scientists who focused on teaching computers to process and understand human language (NLP Engineers), or who taught them to see and recognize images like humans do (Computer Vision Engineers), started having some of the best survival rates – in terms of the paycheck they get at the end of the month. Now the aim is to recreate a general intelligence – one that does not specialize in just one area – an intelligence like the one humans have, or beyond.

* Click here for a very eloquent and entertaining explanation of why “survival of the fittest” is very detrimental to a real understanding of how natural selection works.

This is just an experiment, but it helps to emphasize the fact that Data Science has developed as an interdisciplinary field. Back in the real data job market, particular skills are expected from Data Scientists that depend in part on their specialization level. A detailed description of them is developed in the next sections.

What do Data Scientists do – what skills do they have that make them so successful?

Paraphrasing the minimalist definition of Data Science introduced in the last chapter, Data Science is a field that extracts valuable insights from various types of data to guide action. The approach and techniques Data Scientists have developed to get value from data deserve the most credit for their success and are therefore the main skill that Data Scientists bring to the data jobs ecosystem.

The skills described in this section for Data Scientists also apply to the rest of the profiles. To avoid repetition, only the specific skills needed in each specialization are discussed for the other profiles.

Data Scientist Skills

Strong Math knowledge (Probability, Statistics, Linear Algebra, and Optimization Techniques)
Expertise in mining data and exploratory analysis
Programming languages: Python (TensorFlow, Matplotlib, pandas, and NumPy); R (tidyverse, ggplot2, caret)
Experience with Big Data tools like Scala, Hadoop, Spark, Cassandra

Data Scientist Job Description

Perform exploratory analysis in large data sets
Clean and prepare data for modeling
Decide and test the best approach to solve and optimize a given problem
Model creation, deployment, and optimization
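
To make the first two duties more concrete, here is a minimal sketch of cleaning a small data set and summarizing one of its columns. The rows, the column names, and the helper functions clean and summarize are hypothetical and only serve as illustration; in practice a Data Scientist would more likely use a library such as pandas for this.

```python
import statistics

# Hypothetical raw records, as they might arrive from a CSV export.
RAW_ROWS = [
    {"age": "34", "income": "52000"},
    {"age": "", "income": "61000"},    # missing age
    {"age": "29", "income": "n/a"},    # unparseable income
    {"age": "45", "income": "58000"},
]


def clean(rows):
    """Keep only the rows in which every field parses as a number."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({key: float(value) for key, value in row.items()})
        except ValueError:
            # Drop rows with missing or malformed values.
            continue
    return cleaned


def summarize(rows, column):
    """Return basic descriptive statistics for one numeric column."""
    values = [row[column] for row in rows]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }


clean_rows = clean(RAW_ROWS)
age_summary = summarize(clean_rows, "age")
print(age_summary)
```

Dropping incomplete rows is only one possible cleaning strategy; depending on the problem, filling in missing values may be the better choice.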

Machine Learning Engineer vs Data Scientist – Are they now different job titles for the same job?

The main difference between a Machine Learning Engineer and a Data Scientist is that, if a company posts an opening for an ML Engineer, they expect the candidate to predominantly apply ML algorithms. Other methods, like Data Visualization and Data Exploration, will tend to be used only insofar as they help to design better Machine Learning models.

One of the main distinctions that Machine Learning Engineers make is between two types of ML algorithms: supervised and unsupervised learning techniques. In supervised learning, the models are trained by feeding them a data set with labeled observations. This means that each observation in this training data set contains the correct answer that the computer should learn to predict. Normally, the Machine Learning Engineer defines the features that the computer should use as input to discover the patterns that best predict this answer. This strongly determines the success of the model, which is why feature engineering is a very important skill for an ML Engineer.
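
As an illustration of supervised learning with hand-defined features, here is a minimal sketch. The tiny labeled data set, the two features (share of uppercase letters and number of exclamation marks), and the choice of a 1-nearest-neighbour classifier are all our own assumptions for the example, not a recommendation for real spam filtering.

```python
import math

# Hypothetical labeled training set: each observation pairs a raw text
# with the correct answer ("spam" or "ham") that the model should learn.
TRAINING_DATA = [
    ("WIN A FREE PRIZE NOW!!!", "spam"),
    ("CLICK HERE!!! FREE MONEY", "spam"),
    ("Are we still meeting at noon?", "ham"),
    ("Please find the report attached.", "ham"),
]


def extract_features(text):
    """Feature engineering: turn raw text into numbers chosen by the engineer."""
    letters = [c for c in text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / max(len(letters), 1)
    exclamations = float(text.count("!"))
    return (upper_ratio, exclamations)


def train(labeled_examples):
    """For 1-nearest-neighbour, training just stores features with their labels."""
    return [(extract_features(text), label) for text, label in labeled_examples]


def predict(model, text):
    """Return the label of the stored observation closest in feature space."""
    features = extract_features(text)
    _, label = min(model, key=lambda item: math.dist(item[0], features))
    return label


model = train(TRAINING_DATA)
prediction_spam = predict(model, "FREE TICKETS!!! CLAIM NOW")
prediction_ham = predict(model, "Can you send me the slides?")
print(prediction_spam, prediction_ham)
```

Changing extract_features changes what the model can learn, which is the sense in which feature engineering determines the success of the model.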

Unsupervised Machine Learning methods are much less popular because they are considered less accurate in making predictions, but there are still some very interesting applications. If they are less accurate, why are they still on the data market? In a lot of cases, it is not feasible to find labeled data sets for training models, and ML Engineers can still gain a lot of knowledge from unlabeled data by applying these methods. Some interesting applications of unsupervised methods are clustering (K-means, hierarchical clustering), dimensionality reduction (Principal Component Analysis, Singular Value Decomposition), and topic modeling (Latent Dirichlet Allocation).
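
To show what learning without labels looks like, here is a minimal sketch of K-means on a handful of made-up 2D points. The data, the function name k_means, and the choice to initialise the centroids with the first k points (to keep the example deterministic) are assumptions for illustration; library implementations use more careful initialisation.

```python
import math


def k_means(points, k, iterations=10):
    """Plain K-means: alternate between assigning points and moving centroids."""
    # Assumption: use the first k points as the initial centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
    return centroids, clusters


# Hypothetical unlabeled observations forming two visible groups.
POINTS = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]

centroids, clusters = k_means(POINTS, k=2)
print(centroids)
print(clusters)
```

Note that no observation carries a correct answer here: the algorithm only groups points that lie close together, and it is up to the ML Engineer to interpret what the groups mean.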

Deep Learning Engineer

Within the Machine Learning field, Deep Learning algorithms have gained a lot of attention. So much attention that there is now a job title for people who specialize in applying these algorithms. Why? Because, for certain applications, they have achieved accuracy levels that other models have not been able to match.

One of the most successful, and therefore most popular, types of Deep Learning algorithms is the Convolutional Neural Network (CNN). These models illustrate the main difference between Deep Learning and other Machine Learning algorithms: with CNNs, it is no longer necessary for Deep Learning Engineers to define features for the model. By stacking many layers, these models learn to detect the most useful features for prediction on their own. For example, while analyzing an image, an early layer may come to specialize in detecting edges. These layers are combined in ways that are hard for the Engineers to interpret, but that give some of the best accuracy scores in the industry. Deep Learning Engineers are specialists in building, training, and evaluating these types of models.
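
The edge-detection example can be sketched with a single convolution, the basic operation inside a CNN layer. Note the important simplification: in a real CNN the kernel values are learned during training, whereas here we set them by hand to show what an edge-detecting filter computes. The image, the kernel, and the function name convolve2d are made up for the example.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1), as a CNN layer does."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kernel_h)
                for j in range(kernel_w)
            )
            row.append(total)
        output.append(row)
    return output


# Hypothetical grayscale image: dark left half (0), bright right half (1).
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-set kernel that responds where brightness increases from left to right.
VERTICAL_EDGE_KERNEL = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(IMAGE, VERTICAL_EDGE_KERNEL)
print(feature_map)  # [[0, 2, 0], [0, 2, 0]]
```

The resulting feature map is zero in the uniform regions and large only in the column where the dark and bright halves meet, which is exactly the location of the edge.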

Although there is no specific job title for it yet, Reinforcement Learning is another type of algorithm that is challenging the predominance of Deep Learning techniques in the Machine Learning world. These methods work by trial and error: a feedback system is defined in which the machine is rewarded or penalized for its actions, so that it can learn from its errors and try different approaches.
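
The trial-and-error idea can be sketched with one of the simplest reinforcement learning settings, a multi-armed bandit solved with an epsilon-greedy strategy. The payout probabilities, the number of steps, the exploration rate, and the function name run_bandit are all assumptions chosen for illustration.

```python
import random

# Hypothetical environment: three actions, each paying a reward of 1
# with a different probability that is unknown to the learner.
PAYOUT_PROBABILITIES = [0.2, 0.5, 0.8]


def run_bandit(steps=5000, epsilon=0.1, seed=0):
    """Learn by trial and error which action pays best (epsilon-greedy)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(PAYOUT_PROBABILITIES)  # estimated value per action
    pulls = [0] * len(PAYOUT_PROBABILITIES)        # times each action was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(estimates))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        # Feedback from the environment: reward of 1 or 0.
        reward = 1.0 if rng.random() < PAYOUT_PROBABILITIES[action] else 0.0
        # Move the running average for this action toward the observed reward.
        pulls[action] += 1
        estimates[action] += (reward - estimates[action]) / pulls[action]
    return estimates, pulls


estimates, pulls = run_bandit()
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print(best_action, estimates)
```

The learner is never told which action is best. It only receives rewards, and the occasional random exploration is what lets it discover and correct early mistakes.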

Natural Language Processing (NLP) Engineers

One of the capabilities that make humans so interesting is the complex ways of communication they have developed. The most important materializations of this phenomenon are our languages. But languages are much richer than just words put together in sequential order. As Ray Kurzweil said, “If you write a blog post, you’ve got something to say; you’re not just creating words and synonyms. We’d like the computers to actually pick up on that semantic meaning”. This has turned out to be a very big challenge that is still ongoing. While NLP Engineers have not yet reached the point at which a computer can understand all the subtleties, they have developed successful models for more modest applications that are already revolutionizing our lives.

Natural Language Processing Examples

Language translation
Smart assistants (Apple’s Siri, Amazon’s Alexa)
Topic Modeling
Autocomplete function in telephones or computers
Spelling and Grammar correction (Grammarly)
Bots for customer support
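
To give a feeling for one of these applications, here is a minimal sketch of an autocomplete function based on counting which word most often follows another (a bigram model). The tiny corpus and the function names build_bigram_model and suggest are hypothetical; the autocomplete in real products relies on far larger data sets and more sophisticated models.

```python
from collections import Counter, defaultdict

# Hypothetical message history used to learn word sequences.
CORPUS = [
    "see you tomorrow morning",
    "see you soon",
    "thank you very much",
    "see you tomorrow",
]


def build_bigram_model(corpus):
    """Count, for every word, which words follow it and how often."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current_word, next_word in zip(words, words[1:]):
            following[current_word][next_word] += 1
    return following


def suggest(model, word, n=1):
    """Return up to n of the most frequent next words after the given word."""
    candidates = model.get(word.lower(), Counter())
    return [next_word for next_word, _ in candidates.most_common(n)]


model = build_bigram_model(CORPUS)
suggestion = suggest(model, "you")
print(suggestion)  # ['tomorrow']
```

A model like this only captures word order, not meaning, which is a small-scale picture of why understanding the semantics of language remains such a challenge.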

Computer Vision Engineers

Computer vision is another very good example of how a particular application of Data Science has evolved into a new niche. As with language, the human capacity to see images and understand them is quite challenging for a computer to reproduce. One of the most notable examples has been the difficulty computers have in correctly learning 3D shapes. The ability of machines to move in space is strongly determined by their ability to recognize images and shapes and classify them accordingly. Computer Vision Engineers rely heavily on deep learning techniques, since these have proven to be the most successful ones.

Computer Vision Examples

Facial Recognition
Healthcare applications (diagnosis from images like X-rays)
Movie restoration
Autonomous cars

Conclusion

In this second part, we develop a deep definition of what Data Science is and, by understanding why there has been increasing specialization, we progressively explore Data Science Job Profiles that have derived from the Data Scientist: Machine Learning Engineer, Deep Learning Engineer, NLP Engineer, and Computer Vision Engineer.

If you missed the first part (link) of the series, it is not too late! There we describe different data job profiles that complement the work of Data Scientists, for example, Data Analysts, Business Analysts, Data Engineers, and Data Architects.

Feeling inspired? Here you can find our available positions

And if you have any comments or questions, we will be happy to help you.