Understanding Data Science


Most of us are new to this field called Data Science. On my way to becoming a Data Scientist, I got a little confused about the differences between some terms. I did not know how to define myself. Am I a Data Analyst? Am I a Data Scientist? Job descriptions did not help me get out of this confusion. The terms seemed to blend or merge with others like Data Architect, Full Stack Developer, Web Developer, organic Analyst, Business Analyst, Big Data Developer/Analyst… and many others.

If you are at a similar crossroads, let me simplify things. A Data Scientist is someone who works with data in a scientific way. Nothing else to add.

I guess my RLadies colleagues, who mostly come from the science field, had no hesitation about being called “Scientist”. But some of us, coming from the business side of life, need some explanation of what is what.

Data science is a new field, and for that reason there is still some back and forth on its definition. With the increased demand for data scientists, there will be more movement towards a standardized skill set. Traditional education providers, like on-site universities, are still putting together their curricula on the subjects and tools in which a DS must be skilled.

One of the best ways to think about DS is to focus on the second part of the term: on the science, not on the data. As I recall being told in elementary school, the scientific method means experimentation. It means using experiments to build knowledge, with the same back and forth that the definition of the field itself is going through. So a DS must take an empirical approach, gaining insights and knowledge by reacting to the data through experiments and questions. A DS must use this skill daily.

A common approach

Although an exact, agreed definition is still in limbo, there are at least some common tools and techniques that people agree on.

We need:

1.- Software that holds the data. There are spreadsheets, databases, and key/value stores.

2.- Tools used to scrub the data. This means making data easier to work with by modifying or amending it, or by deleting duplicate, incorrectly formatted, incorrect, or incomplete data.

3.- Statistical packages to help analyze the data. The most popular are the open-source environments R and Python. Commercial packages such as IBM SPSS are also widely used.

Holding the data

Big Data is one of those terms people get confused by. The term refers to large datasets that cannot fit in regular systems such as a personal computer. Data Science and Big Data appeared at around the same time, and because of that people usually mix them up in conversation.

Earlier I gave a short and concise definition of Data Science as a scientific methodology. With that in mind, it is easier to understand how both terms are linked.

Nevertheless, one of the most active areas in the subject is big data, and therefore there is specific software to handle it. The open-source Hadoop is currently the most popular, and the most popular tools for computing in the cloud rely on its architecture and cluster methodology. I must confess that understanding what Hadoop is and how it works was my very first battle on the matter. I simply did not understand its role in the Data Science world. After several weeks of training and practice, I could finally understand its implications.

Simplifying a bit, Hadoop uses a distributed file system (HDFS, the Hadoop Distributed File System) to store data across a group of servers called a cluster. The cluster also distributes tasks across the servers, so that you can run applications on them as well. The two most common processing engines run on Hadoop clusters are MapReduce and Spark. MapReduce processes the data in batches, while Spark works in memory and can also process data in near real time.
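To make the MapReduce idea a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are my own, invented for illustration. Everything runs on one machine, whereas on a real cluster each chunk would be mapped on a different server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values under their key (the word)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each group of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map, shuffle and reduce in sequence over a list of text chunks."""
    pairs = []
    for chunk in chunks:  # a cluster would process these chunks in parallel
        pairs.extend(map_phase(chunk))
    return reduce_phase(shuffle(pairs))


# Hypothetical sample data: each string stands for a block stored on one server.
chunks = ["big data is big", "data science is science"]
counts = word_count(chunks)
print(counts)
```

The point of splitting the work this way is that the map step needs to see only its own chunk, so it can run wherever that chunk is stored.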

Scrubbing the data

One of the first things you will hear once you start learning DS is that 80% of your effort and time goes into this phase: understanding and cleaning your data. This is not a very pleasant job for any data scientist, but it is surely one of the most important parts, if not the most important, of analyzing a data set (no matter how big it is).
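As a small illustration of what scrubbing can look like, here is a sketch in plain Python. The record layout (a name and an age), the cleaning rules and the sample rows are assumptions of mine, made up for the example. In real work you would more likely do this with a library in R or Python.

```python
def scrub(records):
    """Return cleaned records: normalised, complete and without duplicates."""
    cleaned = []
    seen = set()
    for record in records:
        # Normalise the format: trim spaces and use consistent capitalisation.
        name = (record.get("name") or "").strip().title()
        age = record.get("age")
        try:
            age = int(str(age).strip())
        except ValueError:
            continue  # incorrect or missing age: drop the row
        if not name:
            continue  # incomplete row: drop it
        if (name, age) in seen:
            continue  # duplicate row: drop it
        seen.add((name, age))
        cleaned.append({"name": name, "age": age})
    return cleaned


# Hypothetical raw data with the usual problems.
raw = [
    {"name": " ana ", "age": "34"},   # stray spaces, lower case
    {"name": "Ana", "age": 34},       # duplicate of the first row
    {"name": "Luis", "age": "n/a"},   # incorrect value
    {"name": "", "age": "51"},        # incomplete row
    {"name": "marta", "age": " 28"},  # stray space in the number
]
clean = scrub(raw)
print(clean)
```

Notice that the duplicate only shows up after the formats are normalised, which is one reason this phase takes so much time.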

Analyzing the data

This third part of the process is, in my opinion, one of the most enjoyable. The tools most commonly used are R and Python. I personally learnt how to use both of them. My recommendation for those starting out is to learn both in parallel and not bias your work towards either of them. They have different pros and cons, and you must be able to choose the right tool for the right moment. For that, keep your DS toolbox updated and well stocked.

These two tools are not the only ones. There are multiple commercial solutions that can be used for the same purpose. But it is almost impossible to know all of them, especially when they come at a cost and are not open source like R and Python.

In the end, you must focus on building the analytical skill set. No matter what tool is used, what remains constant is the ability to give insights and create knowledge from your findings.

Giving insights

The ultimate objective of a DS is to draw conclusions.

One of the best insights I got during my MBA was given to me in the very first week of the course. The professor explained to us how he had had a business idea and had failed to convince his business partners that they should jump into it. In time, some other people did it and became very successful.

He confessed to us that, at the beginning, he thought his partners were foolish for not foreseeing, as he did, the greatness of this business!

But what he finally learnt from this story is that the biggest failure was his own. He had not been good enough at argumentation and communication to convince the others of such greatness. He did not present the data to them properly. Therefore, the result was the same as if he had not spoken at all.

Giving insights and being able to transmit them to others is an important part of the process and of the job description of a Data Scientist. Unfortunately, if a Data Scientist fails at that last step, time, money and who knows what else have been lost in the process. Do not underestimate this part, and try to build the skills you need to do it well.

That means understanding your audience. Are they visual? Learn good data visualization. Are they figures-driven minds? Learn how to present economic impacts to them. Storytelling is always a good way to articulate your speech. But we will talk about it later on…
