Data: where is it coming from?

What is all this hype about Big Data?

I guess you already know that data has been with us, human beings, since the beginning of time. What has changed in recent years is the scale at which we can generate and collect it, which keeps growing at an exponential pace.

IBM dares to put a number on the data generated daily: 2.5 quintillion bytes. Most of you, like me, could not write this number out in full. For a long time, we did not deal with quantities like this even in fiction. It is often said that Big Data (BD) is cooked up by machines, people and organisations, so I will follow that same structure here.
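For the curious, the 2.5 quintillion figure can be written out in full with a few lines of Python. This is only a sketch, and it assumes the short-scale quintillion (10^18) and decimal units (1 exabyte = 10^18 bytes); the helper name `written_out` is mine.

```python
# Assumption: "quintillion" is the short-scale value, 10**18.
QUINTILLION = 10**18

# 2.5 quintillion, computed with integers to avoid float rounding.
daily_bytes = 25 * QUINTILLION // 10


def written_out(n: int) -> str:
    """Return n with thousands separators, so the full figure is readable."""
    return f"{n:,}"


print(written_out(daily_bytes))             # the number with all its digits
print(daily_bytes / 10**18, "exabytes")     # same amount in decimal exabytes
```

Seen this way, 2.5 quintillion bytes is a 2 followed by eighteen more digits, or 2.5 exabytes a day.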

1. Data generated from machines

Have you ever thought about how much data a plane can gather on a short flight? And can you even imagine how much on a transoceanic one?

Planes have multiple sensors. Some models, like the A350, can carry up to 6,000 sensors spread across the different parts of the aircraft, which keep both the flight crew and the ground team updated. They can generate 2.5 terabytes of data every single day. Can you imagine how much data future aircraft will be able to collect?

In our daily life we coexist with many devices and machines (big and small) that, like planes, generate data every day: mobile phones and their apps, health wearables like Fitbit, cars, smart air purifiers and A/C units, etc. In general, all so-called smart devices can collect data from the environment they live in (e.g. an air purifier in a large polluted city such as Beijing can collect the pollution levels of a particular home); they can also communicate with other devices or networks (such as other air purifiers in the city or neighborhood); and they can execute services autonomously (such as switching on automatically if the pollution in the house reaches hazardous levels).
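The three abilities of a smart device (collect, communicate, execute) can be sketched in a few lines of Python. Everything here is hypothetical: the `AirPurifier` class, its method names and the `HAZARDOUS_PM25` threshold are my own placeholders, not a real device API or an official pollution limit.

```python
from dataclasses import dataclass, field

# Hypothetical threshold, chosen only for illustration.
HAZARDOUS_PM25 = 150.0


@dataclass
class AirPurifier:
    home_id: str
    is_on: bool = False
    readings: list = field(default_factory=list)

    def collect(self, pm25: float) -> None:
        """Collect: store a pollution reading from the environment."""
        self.readings.append(pm25)

    def share(self) -> dict:
        """Communicate: build the message that would go to other devices or a network."""
        return {"home_id": self.home_id, "latest_pm25": self.readings[-1]}

    def act(self) -> bool:
        """Execute: switch on when the latest reading is hazardous, off otherwise."""
        self.is_on = self.readings[-1] >= HAZARDOUS_PM25
        return self.is_on


purifier = AirPurifier(home_id="home-42")
purifier.collect(35.0)
purifier.act()           # clean air, stays off
purifier.collect(180.0)
purifier.act()           # hazardous reading, switches on by itself
print(purifier.share(), purifier.is_on)
```

A real device would send the `share()` message over the network; here it only builds the dictionary, so the sketch runs anywhere.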

Have you ever heard of the Internet of Things (IoT)? That’s it! Smart devices using the network to collect and send data, which is then used to trigger services, improve the system and (somehow) make our lives easier.

Most of the data collected by machines is what we call structured data. Every value is collected with a defined data type and position, because it follows a predefined data model. The computation is normally done where the data is produced. We bring the code to the data!
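To picture what "a structure of data type and position" means, here is a minimal Python sketch. The `SensorReading` model, its field names and the sample row are invented for illustration; a real aircraft or IoT feed will have its own schema.

```python
from typing import NamedTuple


class SensorReading(NamedTuple):
    """Hypothetical predefined data model: fixed fields, fixed types, fixed order."""
    sensor_id: str
    timestamp: int        # seconds since the Unix epoch
    temperature_c: float


def parse_row(row: str) -> SensorReading:
    """Turn one comma-separated line into a typed record, relying on position."""
    sensor_id, timestamp, temperature = row.split(",")
    return SensorReading(sensor_id, int(timestamp), float(temperature))


reading = parse_row("engine-1,1700000000,412.5")
print(reading.sensor_id, reading.timestamp, reading.temperature_c)
```

Because every row has the same shape, the parser never has to guess what a value means. That is exactly what we lose with the data people generate.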

2. Data generated from people

We are generating data continuously! On Facebook and Twitter alone, users produce terabytes, even petabytes, of data every day. This data takes the form of text (comments, blogs or tweets), images (a picture of my meal or my last party), video (cute little cats playing around the house), documents (that PDF you attached to your last e-mail), and audio (the latest podcast episode from your favorite weekly show).

Our brain is complex, and that shows in all of our tiny daily decisions. We don’t write, speak or think following an exact pattern, and this is reflected in the ever-changing ways we generate data. Do the maths! Multiply each action taken by each person living in this world by each hair on each one of our heads. Now you have the approximate number of possible data models we are able to generate. Exactly, that much!

This lack of structure produces what we call unstructured data. There is no data model, unlike when machines generate data.

This kind of data is where the challenge of data science lies. Artificial brain thinkers such as neural network algorithms are the main tools we have to deal with it. Are machines able to think as we humans do?

3. Data generated from companies

I worked with data at a company for 12 years, and what I learned is that I don’t know anything about data.

Don’t get me wrong. I did a really good job analysing our data. But I knew I did not have access to all of the organisation’s data. Each department had its own little silo. We were the same company, but not the same people, and we had slightly different objectives when analysing the data. In the end, we all learned that we needed to agree on the number we were going to show to the CEO.

The good thing about data collected by a company is that it is almost always related to the business. There is no single data model, but in the end, any data model is good if it fits the company’s size and purpose.

Big Data and new techniques have given organisations the ability to perform tasks that were previously impossible or took a huge amount of time and money. I am talking about things such as sentiment analysis of the brand; enriching the actions taken (or not taken) by customers with external data, like merging umbrella sales with weather data; or sending the product newsletter according to local events, weather or purchase history…
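The umbrella example can be sketched in plain Python. The numbers, dates and the `enrich` helper below are all invented for illustration; in a real project the sales would come from the company's systems and the weather from an external provider.

```python
# Toy, made-up values for illustration only.
sales = [
    {"date": "2024-03-01", "umbrellas_sold": 12},
    {"date": "2024-03-02", "umbrellas_sold": 87},
    {"date": "2024-03-03", "umbrellas_sold": 9},
]

# External data, keyed by date. Note that one day is missing on purpose.
weather = {
    "2024-03-01": "sunny",
    "2024-03-02": "rain",
}


def enrich(sales_rows, weather_by_date):
    """Keep every sales row and attach the weather for that date, or None if unknown."""
    return [
        {**row, "weather": weather_by_date.get(row["date"])}
        for row in sales_rows
    ]


for row in enrich(sales, weather):
    print(row)
```

This behaves like a left join: no sale is dropped when the external source has a gap, which matters because external data is rarely as complete as our own.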

Did you ever think that companies now have the ability to “steal” some demographic data when we log in through a Facebook or Google account? Before you turned your head towards data science, maybe you thought it was for your convenience and very considerate of them, right? Welcome!