Top 9 skills and tools for data science in 2021

Arturo Gonzalez
14 min read · Dec 9, 2020
Source: www.informationweek.com

If you are beginning your data science or data engineering career, or if you want to catch up with the latest trends, I hope this post proves useful and helps answer some of your questions.

As some of you may already know, Data Science, loosely defined, is the intersection of statistics, computing and domain expertise, with the goal of extracting knowledge from data. Whether in finance, medicine, engineering or public policy, data science can provide huge value if done correctly.

In this post we will review the top skills and tools you need to be relevant for a data science job in 2021. We will cover them in a way that gives newcomers a starting point in the data science world, evolving toward more complex topics.

Disclaimer: Being a data scientist and an engineer myself, I am personally more inclined towards data engineering, big data architectures, data processing pipelines and machine learning deployment. This post is based on my personal experience in the field and is not intended to be taken as a universal truth.

Without further ado, let’s jump into it!

1. SOLID MATH AND STATS

The most important skill you need if you want to be a data scientist worth your salt is solid math, ranging from linear algebra all the way to Bayesian statistics. The core of data science is the capacity to extract knowledge from data in a way that enables better decision making by key players in companies.

A data scientist will get his hands on datasets, understand them using descriptive analytics, try to figure out why there are missing parts and inconsistencies, fit probability distributions, build machine learning models and present the results to business stakeholders in order to get proper feedback.
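
To make the first steps of that workflow concrete, here is a minimal sketch using pandas and scipy. The dataset and its columns are hypothetical, invented purely for illustration:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical dataset: daily sales records with a missing value
df = pd.DataFrame({
    "store": ["A", "A", "B", "B", "C", "C"],
    "daily_sales": [120.0, np.nan, 95.5, 101.2, 88.0, 450.0],
})

# Descriptive analytics: summary statistics and missing-value counts
print(df.describe())
print(df.isna().sum())

# Fit a probability distribution to the observed (non-missing) values
sales = df["daily_sales"].dropna()
mu, sigma = stats.norm.fit(sales)
print(f"Fitted normal distribution: mu={mu:.1f}, sigma={sigma:.1f}")
```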

A very important remark that I want to highlight before we go further: in order for a company to take advantage of the data scientist’s work, the company needs to be able to put the results into operation. It makes no sense to have a data science team if their work is going to end up on the desk of some clerk, and that’s where MLflow enters to the rescue, haha (Skill #3).

2. PROGRAMMING LANGUAGE

The second most important skill you need to master is a programming language. By definition, a data scientist is capable of delivering data products, e.g. a machine learning model, an API or a dashboard, and to be able to do so, he needs to be skilled in the tools best suited for the job. In this case, a programming language gives the data scientist the capacity to create projects from scratch. Let’s take a look at the most relevant programming languages according to the TIOBE index.

TIOBE Index, November 2020: https://www.tiobe.com/tiobe-index/

We can see in the table above the top 10 most popular programming languages, both general purpose and domain specific, and notice how Python, R and SQL are among them. Learning Python or R, plus SQL, is a must for every data scientist.

Python: An object-oriented, general purpose programming language with lots of built-in functionality and open source libraries ready for data science and machine learning projects. Some of the most important libraries are pandas for data manipulation, numpy for numerical computation, scikit-learn for common machine learning models, and TensorFlow (Google), PyTorch (Facebook), MXNet (AWS) and Keras for deep neural networks.

Top data science and AI libraries for Python.
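
As a quick taste of how these libraries fit together, here is a minimal, self-contained sketch with synthetic data (the columns and labels are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical tabular dataset built with numpy, wrapped in a pandas DataFrame
rng = np.random.default_rng(42)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["label"] = (df["x1"] + df["x2"] > 0).astype(int)

# scikit-learn: split the data, train a common ML model, evaluate it
X_train, X_test, y_train, y_test = train_test_split(
    df[["x1", "x2"]], df["label"], test_size=0.25, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
print("Test accuracy:", model.score(X_test, y_test))
```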

If you want to learn more about Python and its ecosystem, check this awesome post by Claire D. Costa.

R: A programming language originally focused on providing functionality and libraries for statistical analysis; however, it has evolved into much more than a statistical language, becoming very rich in functionality for API development, dashboard construction, web scraping, and many other interesting things.

I would like to give an honorable mention to Golang, a programming language that is not the best option for data science; however, it is gaining a lot of momentum in the cloud computing community, since the most important projects of the Cloud Native Computing Foundation (CNCF) are built with it.

But why is this important anyway, you might ask yourself. Well, basically it means that the relevant software projects for cloud computing infrastructure are built with Golang, and data science has been powered by cloud computing in many ways; that’s why Go is relevant to us.

Top CNCF projects.

3. VERSION CONTROL

The third most important skill you need to master is version control, in its three parts: code, machine learning and data.

  1. Code versioning: Refers to code management, which allows software project managers and team leaders to know which users modified which parts of the code at what point in time, and gives the team the capacity to go back to a previous working state. The most common tool for code versioning by far is git, which enables versioning of your projects inside your local environment; however, it has some drawbacks regarding collaboration. To address these needs, the community and enterprises have developed git servers, which host git projects and allow for team collaboration and automation.
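
The everyday workflow here is the `git` command line, but as a sketch of the same idea from Python, here is a hedged example using the GitPython package (pip install GitPython); the project path and file are hypothetical:

```python
from git import Repo

# Create a new local repository (hypothetical path)
repo = Repo.init("/tmp/demo-project")
with open("/tmp/demo-project/model.py", "w") as f:
    f.write("print('hello data science')\n")

repo.index.add(["model.py"])                    # stage the file
commit = repo.index.commit("Add first model script")
print("Committed:", commit.hexsha[:8])          # every state is traceable
```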

GitHub: A platform that hosts over 140 million repositories and 40 million users (2020). It was acquired by Microsoft in 2018 and has since developed a very big DevOps ecosystem, the most important piece by far being GitHub Actions, a set of community-backed tools for automation and deployment. To get a glimpse of how crucial GitHub is to the software/machine learning world, here is a list of top data science/engineering projects hosted on GitHub:

  1. Apache Spark
  2. Google Tensorflow
  3. Apache Airflow
  4. Apache Superset
  5. Python
  6. R

One thing that I totally recommend is to check GitHub’s trending repos on a weekly basis, so that you can stay up to date with the most important community-backed projects.

GitLab: An open source git server that you can download, set up and fully manage in your company. It has a lot of built-in functionality for code versioning and CI/CD.

The three main open source Git servers.

2. Machine Learning versioning and lifecycle management

Historically, code versioning was used only in application development and not in data science and machine learning, since they are relatively the new kids on the block. Recently, code versioning has also been applied in data science, but unfortunately it falls short of addressing machine learning model versioning and lifecycle management needs.

Let me present to you MLflow, a platform that enables machine learning developers to version control their models, have full traceability of a model’s behavior and hyperparameter configurations over time, and deploy models to cloud platforms or Docker containers. It is important to mention that there are other platforms that address the ML lifecycle as well, e.g. Metaflow and Kubeflow, among others.
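
Here is a minimal sketch of experiment tracking with MLflow’s Python API; the hyperparameters and metric values are made up for illustration:

```python
import mlflow

with mlflow.start_run(run_name="baseline-model"):
    # Log the hyperparameter configuration used for this training run
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 5)

    # ... train your model here ...

    # Log the resulting metrics so the run is fully traceable over time
    mlflow.log_metric("accuracy", 0.87)
    mlflow.log_metric("auc", 0.91)
```

After a few runs, `mlflow ui` lets you browse and compare every run, its parameters and its metrics in the browser.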

3. Data version control

This part is the least mature of the three parts of version control; however, that doesn’t mean that there are no options to address these needs.

Database Audit Trail: The mechanism by which database administrators track changes made to a database. Most database engines have this functionality built in.

Git LFS: Let me quote the exact words from the Git LFS website:

“Git Large File Storage (LFS) replaces large files such as audio samples, videos, datasets, and graphics with text pointers inside Git, while storing the file contents on a remote server like GitHub.com or GitHub Enterprise.”

DVC: Let me quote the exact words from the DVC website as well:

“Version control machine learning models, data sets and intermediate files. DVC connects them with code, and uses Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud Storage, Aliyun OSS, SSH/SFTP, HDFS, HTTP, network-attached storage, or disc to store file contents.”
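
As a hedged sketch of what that looks like in practice, DVC also exposes a Python API for reading a specific version of a dataset; the repository URL, file path and tag below are hypothetical:

```python
import dvc.api

# Read a DVC-tracked file at a specific Git revision (dataset version)
data = dvc.api.read(
    "data/train.csv",                        # path tracked by DVC
    repo="https://github.com/user/project",  # hypothetical Git repository
    rev="v1.0",                              # tag/commit pinning the version
)
print(data[:100])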

Databricks Delta: An open source project that allows full traceability of files stored in a data lake, as well as CRUD operations. This project is developed by Databricks (the Apache Spark developers).

4. DATABASE QUERY LANGUAGES

The fourth most important skill on my list is database query languages. In 8 or 9 out of 10 jobs, a data scientist will be interacting with databases to get the data he needs for his work. In a well designed data architecture, a data scientist will get the data from a data lake or a data warehouse (snowflake schema); however, in real life there are a lot of cases where the data scientist has to work with transactional databases, Excel files, etc. In either case, the data scientist must know how to fetch the data to do his work.

SQL: Meaning Structured Query Language, it is the language used to interact with relational databases and, more recently, with data lakes. It is the most common data manipulation language and provides functionality for data grouping, aggregation, filtering and windowing. SQL is a must for a data scientist. Below is an example of the snowflake schema, which is the most common schema in data warehouses (DWH).

https://www.guru99.com/star-snowflake-data-warehousing.html
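
To make the grouping, aggregating and filtering concrete, here is a minimal sketch using Python’s built-in sqlite3 module; the fact table and its values are hypothetical:

```python
import sqlite3

# In-memory database with a hypothetical fact table from a star/snowflake schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?)",
    [("north", 60.0), ("north", 80.0), ("south", 200.0)],
)

# Grouping, aggregating and filtering: total sales per region above a threshold
query = """
    SELECT region, SUM(amount) AS total_sales
    FROM fact_sales
    GROUP BY region
    HAVING SUM(amount) > 150
    ORDER BY total_sales DESC
"""
for row in conn.execute(query):
    print(row)  # ('south', 200.0) -- only regions over the threshold
```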

NoSQL: Meaning “Not Only SQL”, these are databases that allow storage and retrieval of information that can vary in its structure, and that can handle horizontal scaling easily. The default format for NoSQL storage is JSON.
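
As a hedged sketch, here is what that schema flexibility looks like with pymongo, the official MongoDB driver; it assumes a MongoDB server on localhost, and the database and collection names are hypothetical:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["demo_db"]["users"]  # hypothetical database and collection

# Documents in the same collection may vary in structure (schema-less JSON)
users.insert_one({"name": "Ana", "age": 31})
users.insert_one({"name": "Luis", "skills": ["python", "sql"]})

for doc in users.find({"name": "Ana"}):
    print(doc)
```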

Graphs: A relatively new field in data science is graph analysis, and as you can imagine, there are many tools for querying graph databases. Cypher is a graph query language developed for the Neo4j database engine; however, Apache Spark 3.0 has incorporated it into its set of functionalities. Another important tool for graph analysis is Gremlin.

Cypher on Spark 3.0 https://neo4j.com/blog/cypher-for-apache-spark/
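
Here is a hedged sketch of running a Cypher query with the official Neo4j Python driver (pip install neo4j); the connection details, labels and property names are hypothetical:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Cypher: pattern-match people and the friends they know
cypher = """
    MATCH (p:Person)-[:KNOWS]->(friend:Person)
    WHERE p.name = $name
    RETURN friend.name AS friend
"""
with driver.session() as session:
    for record in session.run(cypher, name="Ana"):
        print(record["friend"])
driver.close()
```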

5. DATA VISUALIZATION

The fifth most important skill for a data scientist is data visualization. All data scientists at some point must present the results of their work to upper management or to their customers, and what better way than a set of well structured dashboards. For this subject, I totally recommend the book “Storytelling with Data” by Cole Nussbaumer Knaflic.

There are many options out there to create dashboards, including many libraries for programming languages (ggplot2, plotly, bokeh, etc.); however, there are software suites and packages that provide more complete tools for data visualization.
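
As a minimal library example, here is an interactive chart with plotly express, using the gapminder sample dataset that ships with plotly:

```python
import plotly.express as px

# Sample dataset bundled with plotly: country statistics for 2007
df = px.data.gapminder().query("year == 2007")

fig = px.scatter(
    df, x="gdpPercap", y="lifeExp", size="pop", color="continent",
    hover_name="country", log_x=True, title="Life expectancy vs. GDP (2007)",
)
fig.show()  # opens an interactive chart in the browser/notebook
```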

Grafana (Open Source)

Apache Superset (Open Source)

Microsoft PowerBI (Enterprise)

Tableau (Enterprise)

6. BIG DATA PROCESSING

The sixth most important skill for a data scientist is to understand how big data processing works, in order to take full advantage of parallel computing and be able to design models that maximize accuracy while minimizing overall expenditure.

Hadoop: A platform for distributed storage and computing on a cluster, allowing a set of disks to be managed as a single virtual storage unit. From its beginnings it wasn’t only a distributed file system; it also comprised a parallel processing platform based on the MapReduce paradigm. However, MapReduce became deprecated in favor of Spark due to Spark’s better performance and scalability.

HDFS (Hadoop Distributed File System) has been the default underlying storage technology for data lakes in many big data projects, and it is still relevant in 2020; however, since the dawn of cloud computing, other storage technologies and architectures have emerged, such as Azure Blob Storage and Amazon S3.

Hadoop Architecture: https://data-flair.training/blogs/hadoop-architecture/

Spark: An open source data processing platform with horizontal scaling capabilities of up to thousands of nodes, support for Python, R, SQL and Scala, and a vast community; it is by far the most popular big data platform in use today (2020). Spark’s core features include batch processing, real-time processing and machine learning, and it can even serve as a database thanks to its JDBC connectivity. Spark is available for download and on-premises setup if desired; however, this requires a deep understanding of Spark’s internals, so I personally recommend going for a managed service.
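
Here is a minimal PySpark sketch of distributed batch processing; the S3 path and column names are hypothetical, and on a managed platform such as Databricks the `spark` session already exists:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Read a (hypothetical) large CSV file distributed across the cluster
df = spark.read.csv("s3://my-bucket/sales.csv", header=True, inferSchema=True)

# Transformations are lazy and executed in parallel across the nodes
result = df.groupBy("region").agg(F.sum("amount").alias("total_sales"))
result.show()
```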

The Spark runtime is available in most of the cloud providers today; however, the most important provider of managed Spark is by far Databricks, a company founded by the Spark creators.

Kafka: An open source project that enables stream processing for real-time analytics and provides the capability to integrate with many processing engines, including Spark. The main provider of managed Kafka is Confluent. Like Spark, Kafka can be downloaded and configured on the customer’s premises but, as with Spark, I would personally recommend going for the managed service.
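
As a hedged sketch of the publish/subscribe flow, here is an example using the kafka-python client (pip install kafka-python); it assumes a broker on localhost:9092, and the topic name and message are hypothetical:

```python
from kafka import KafkaConsumer, KafkaProducer

# Produce an event to a (hypothetical) topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("clickstream", b'{"user": "ana", "page": "/home"}')
producer.flush()  # make sure the message is actually sent

# Consume events from the same topic, starting from the earliest offset
consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # each event arrives as raw bytes
    break
```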

7. ORCHESTRATION

The seventh most important skill is to understand data process orchestration. Loosely defined, data orchestration refers to how the company’s data processes fit together, how they are linked with each other, and the technology that enables them.

Strictly speaking, this topic may exceed the scope of data science; however, I find it really important for the data scientist to understand the data flows at the company, to see where his work fits in the broader picture.

In real life, the data maturity of companies varies very broadly. The more lagging companies may have a disparate set of data products that don’t interoperate with each other, many employees working on manual processes that are suitable for automation, and many other areas of opportunity. On the other hand, more advanced companies will have standardized and automated processes, and a set of data products with standardized protocols that enable interoperability.

In order to be able to propose projects that are technically and economically feasible given the existing conditions, a data scientist must strive to understand data orchestration.

There are important projects in the field that allow data orchestration at company scale; to name a few:

Airflow: Developed at Airbnb and later open sourced, it is an excellent project that allows data process orchestration at scale, with many built-in connectors to different on-premises and cloud components (Postgres, Databricks, ADLS, S3, among others). Airflow is built on Python. Many companies have adopted Airflow as their orchestration framework.
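
Here is a minimal Airflow DAG sketch (using the Airflow 1.10-style imports current in 2020); the task itself is a hypothetical placeholder:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract_and_load():
    # Hypothetical task: fetch data from a source and load it into the lake
    print("fetching data and loading it into the data lake")

dag = DAG(
    dag_id="daily_etl",
    start_date=datetime(2020, 12, 1),
    schedule_interval="@daily",  # run once per day
)

load_task = PythonOperator(
    task_id="extract_and_load",
    python_callable=extract_and_load,
    dag=dag,
)
```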

Azure Data Factory: A proprietary data orchestration framework available in the Azure cloud platform that enables the graphical design of ETLs and data orchestration processes. It provides a pipeline designer built with sophisticated developers in mind, as well as Data Flow, a drag-and-drop, easy-to-use wizard that allows non-technical users to design their data processes without much struggle.

Credit to: https://www.delorabradish.com/azure/azure-data-factory-adf-pipeline-runs-query-by-factory-avoid-concurrent-pipeline-executions-blog-post-1-of-2

8. CLOUD COMPUTING

The eighth skill a data scientist needs to have is cloud computing knowledge. A cloud computing approach is a must if a business wants to stay relevant in the future, whether a full cloud approach for startups or a hybrid approach for companies that already have installed IT infrastructure and operations. As I previously mentioned, cloud computing has been enabling data science in ways never seen before.

But some of you might wonder, what is cloud computing?

The best definition I have found so far is the one from Amazon, which I quote:

“Cloud computing is the on-demand delivery of IT resources over the Internet with pay-as-you-go pricing. Instead of buying, owning, and maintaining physical data centers and servers, you can access technology services, such as computing power, storage, and databases, on an as-needed basis from a cloud provider like Amazon Web Services (AWS).”

In order to provide a better understanding of cloud services’ capabilities, let me give some examples of what a user can do with a cloud services subscription and a credit card in hand: a user can launch a Spark cluster for big data processing, deploy a webapp for users to visit, launch a relational database, and many other things, all in a matter of minutes and with a pay-as-you-go pricing model (see the sketch after the list below). As you can imagine, there are many benefits to adopting a cloud services strategy; let me mention just a few:

  1. Cost savings
  2. Flexibility
  3. Reliability
  4. Security
  5. Etc.
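
As a hedged sketch of that pay-as-you-go model, here is an example with boto3, the AWS SDK for Python; the bucket and file names are hypothetical, and it assumes configured AWS credentials in the default us-east-1 region:

```python
import boto3

s3 = boto3.client("s3")

# Create object storage and upload a dataset in a matter of seconds
s3.create_bucket(Bucket="my-data-science-bucket")
s3.upload_file("local_dataset.csv", "my-data-science-bucket", "raw/dataset.csv")

# List what we just stored
response = s3.list_objects_v2(Bucket="my-data-science-bucket")
for obj in response.get("Contents", []):
    print(obj["Key"])
```
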
Gartner Magic Quadrant for Cloud Computing Platforms

As you can see in the Gartner diagram above, AWS is the leading cloud services platform, followed closely by Microsoft and Google, which for some of you may not be a surprise; a surprise indeed is to have two Chinese companies (Alibaba and Tencent) among the top 7 players in cloud computing.

As a side note, China’s economy is totally booming in 2020 despite COVID-19. You can see it in the cloud computing field, but this year was also the year of Chinese EVs; check this note.

But I haven’t answered why this is important for the data scientist. Well, I think that you have already figured it out by now, haha!

9. DOMAIN EXPERTISE

The ninth skill you need as a data scientist is a solid background in engineering, finance, marketing, medicine or any STEM field. It’s true that data science can be applied to many industries and scientific fields; however, having any of the mentioned backgrounds will ease the ride in your data science career.

Being an engineer myself, I would recommend studying engineering, since it enables you to develop useful projects from a very early phase of your career, in contrast to other more formal backgrounds.

CONCLUSION

Achieving excellence in any field requires discipline, dedication and talent, and as you can imagine, data science is no different. If you want to be a top data scientist, you need to be constantly challenging yourself in many ways, developing new skills, from the ones mentioned in this post to many others that apply in other fields.

So, you might wonder, how would I recommend challenging yourself? Well, there are many ways: get a data science/engineering degree, participate in Kaggle competitions and hackathons, read data science papers and blog posts, read about how data science is applied to fields other than your own to get new ideas, and give lectures at local meetups about your projects in order to foster the data science community and share your knowledge. It’s up to you!

As Malcolm Gladwell says in his book Outliers, and I quote:

“The key to achieving world-class expertise in any skill, is, to a large extent, a matter of practicing the correct way, for a total of around 10,000 hours”

I hope that you find this post helpful in clarifying the skills you need as a data scientist, and that it motivates you to learn all the amazing things that are out there.
