Is data preparation similar to food preparation?

Introduction to Data Preparation

Tejali Gangane
Towards Data Science


Food preparation
Photo by Brandless on Unsplash

Yeah, I know, the title is a bit weird! You must be wondering why I am relating data to food. LIKE, DOES THAT EVEN MAKE SENSE?

Well for me, it does.

I know that statement alone is not convincing enough. Give me a chance to explain, and I’ll make sure you believe it too!

Let’s jump straight to the reason I relate data preparation to food preparation.

Food preparation means gathering all the ingredients and tools needed to make food edible. It may include using tools to handle these items or mixing different ingredients to create taste.

There are different ways to do this. Suppose I wash a potato, peel off its outer skin with a peeler, or cut it into small pieces with a knife. Later, I add spices to make it more flavorful, and then maybe boil it to soften it or bake it at a certain temperature. (I MEAN, OF COURSE, WHY WOULD ANYONE EAT A RAW POTATO? DUH!) And then, finally, I eat it.

So these were all the steps I had to go through before actually eating. Others might like their potato some other way (fries, mashed potatoes and what not!). But the bottom line is that before eating any kind of food, certain steps have to be taken so that, in the end, we can eat it without our teeth falling out. :P

The same goes for data preparation.

In data preparation, we perform cleaning, transformation, etc. on the raw data (which can come from any source: a document, an Excel file, a plain text file, a database, the Internet, etc.). Cleaning generally means filling in missing values, finding rows with empty values, checking for duplicate records, and so on; transformation means converting data from one format to another (most often from the format of the source system to the format of the destination system). And there are many more things that can be done with this raw data.

So now, why are food preparation and data preparation the same?

In food preparation, going back to the earlier example, I washed (cleaned) the potato because it might have had dirt on it. So it’s better to wash it and then eat it. Later on, I boiled it (transformed it), turning it from a hard form into a soft form. And so on. There are many ways in which food can be processed. The same goes for data preparation: there are many ways and algorithms with which you can process the data.

Enough blabbering! Let’s dig deeper into data preparation.

It is often said that data scientists spend 70–80% of their time on data preparation.

Steps in data preparation:

Source: talend.com

Step 1: The very first step is to gather the data. This means extracting the information from whatever sources you have: files, databases, datasets, etc. (You can find various freely available datasets on websites like Kaggle, FiveThirtyEight, BuzzFeed, etc.)

Step 2: This step is about discovering the data, meaning understanding what the data contains and how it can be processed.
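To make the discovery step a bit more concrete, here is a minimal sketch in plain Python. The rows and the `profile` helper are made up for illustration; the idea is simply to count missing values, see which types show up in each column, and spot the most frequent value before deciding how to process anything.

```python
from collections import Counter

# Hypothetical raw rows; None marks a missing value.
rows = [
    {"city": "Pune", "age": 31},
    {"city": "Lyon", "age": None},
    {"city": "Pune", "age": "29"},  # age stored as text in this row
]

def profile(rows):
    """Summarize each column: missing count, value types, most common value."""
    report = {}
    for column in rows[0]:
        values = [r[column] for r in rows]
        present = [v for v in values if v is not None]
        report[column] = {
            "missing": len(values) - len(present),
            "types": sorted({type(v).__name__ for v in present}),
            "most_common": Counter(present).most_common(1),
        }
    return report

summary = profile(rows)
for column, stats in summary.items():
    print(column, stats)
```

A report like this already tells you what the later steps need to handle: one missing age, and an age column that mixes numbers and text.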

Step 3: The next step is to clean the data. As I mentioned before, cleaning includes filling in missing values, removing outliers, masking sensitive data, etc. It is done to avoid errors that might otherwise occur during the later stages of data processing.
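Here is a small sketch of what cleaning could look like in plain Python. Everything in it is an assumption made for the example: the records, the `clean` helper, the age range used as a simple rule-based outlier check, the choice of the median as the fill value, and the way emails are masked. Real projects will pick their own rules.

```python
from statistics import median

# Hypothetical raw records; None marks a missing value.
records = [
    {"name": "Asha", "email": "asha@example.com", "age": 31},
    {"name": "Ben", "email": "ben@example.com", "age": None},
    {"name": "Chen", "email": "chen@example.com", "age": 29},
    {"name": "Dia", "email": "dia@example.com", "age": 240},  # implausible age
    {"name": "Eli", "email": "eli@example.com", "age": 35},
]

def clean(rows, min_age=0, max_age=120):
    """Return a cleaned copy of rows.

    - drops rows whose age lies outside [min_age, max_age]
    - fills missing ages with the median of the remaining known ages
    - masks the part of each email address before the @
    """
    plausible = [
        r for r in rows
        if r["age"] is None or min_age <= r["age"] <= max_age
    ]
    known_ages = [r["age"] for r in plausible if r["age"] is not None]
    fill_value = median(known_ages)

    cleaned = []
    for r in plausible:
        local, _, domain = r["email"].partition("@")
        cleaned.append({
            "name": r["name"],
            "email": local[0] + "***@" + domain,
            "age": r["age"] if r["age"] is not None else fill_value,
        })
    return cleaned

cleaned_records = clean(records)
for row in cleaned_records:
    print(row)
```

Note that the outlier is dropped before the median is computed, so one bad value cannot distort the number used to fill the gaps.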

Step 4: The fourth step is to transform the data, that is, to convert it from one format to another. Since data coming from different sources can be in any format, it is necessary to ensure that all the data to be processed ends up in the same format. For example, a date can be stored as dd/mm/yyyy, mm/dd/yyyy, dd/mm/yy, etc. To ensure that no problem occurs while processing such data, it is important to store all of it in one particular format.
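The date example can be sketched with Python's standard `datetime` module. The `to_iso` helper and the sample values are made up for illustration, and the choice of YYYY-MM-DD as the target format is just one reasonable option.

```python
from datetime import datetime

def to_iso(date_text, source_format):
    """Parse date_text using source_format and return it as YYYY-MM-DD."""
    return datetime.strptime(date_text, source_format).strftime("%Y-%m-%d")

# Hypothetical values, each paired with the format its source is known to use.
raw_dates = [
    ("25/12/2019", "%d/%m/%Y"),  # dd/mm/yyyy
    ("12/25/2019", "%m/%d/%Y"),  # mm/dd/yyyy
    ("25/12/19", "%d/%m/%y"),    # dd/mm/yy
]

iso_dates = [to_iso(text, fmt) for text, fmt in raw_dates]
print(iso_dates)
```

The format is passed in explicitly on purpose. A value like 03/04/2019 could be 3 April or 4 March, so the code cannot safely guess; you need to know which format each source uses.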

Step 5: Enriching the data means enhancing or refining it, for example by adding related information, so that it supports better-informed decisions and insights.
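Enrichment can mean different things depending on the project. One common reading, sketched below, is adding fields that were not in the raw data: either looked up from a reference table or derived from existing columns. The orders, the `CITY_TO_COUNTRY` lookup and the `enrich` helper are all hypothetical.

```python
# Hypothetical lookup table, standing in for an external reference source.
CITY_TO_COUNTRY = {"Pune": "India", "Lyon": "France"}

# Hypothetical prepared rows.
orders = [
    {"order_id": 1, "city": "Pune", "quantity": 2, "unit_price": 150.0},
    {"order_id": 2, "city": "Lyon", "quantity": 1, "unit_price": 80.0},
    {"order_id": 3, "city": "Oslo", "quantity": 3, "unit_price": 20.0},
]

def enrich(order):
    """Return a copy of order with a looked-up country and a derived total."""
    enriched = dict(order)
    # Looked-up field: falls back to "unknown" when the city is not in the table.
    enriched["country"] = CITY_TO_COUNTRY.get(order["city"], "unknown")
    # Derived field: computed from columns the row already has.
    enriched["total"] = order["quantity"] * order["unit_price"]
    return enriched

enriched_orders = [enrich(o) for o in orders]
for row in enriched_orders:
    print(row)
```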

Step 6: Once the data is prepared, it is stored or sent to a third-party application or tool so that the analysis can begin.

Let us list out a few of the main benefits of data preparation:

> Can help you catch errors before the actual processing begins.

> Can help you produce better-quality data as a result.

> Processing better-quality data means better insights.

> Better insights will lead to better decisions by organizations.

We have understood what data preparation is, its process and its benefits. Now the main question is: why exactly do we need it? What if we just skip this part?

Need for data preparation:

Data preparation is an important part of the data science process. Yes, it is a tedious task, but an important one. Your data needs to be prepared before the actual processing. If you have combined data from different sources and want to generate overall insights, there might be duplicate records or outliers, or the dataset might include attributes that you do not actually want to use for the analysis. If the data is not prepared, it is just a huge amount of data with no importance or meaning.

Better-prepared data helps generate better-quality insights, thereby helping the organization make better decisions (as previously mentioned in the benefits of data preparation).

According to Gartner’s research, poor-quality data costs an average organization $13.5 million every year, which is too high a cost for a company to bear.

Also, if the data is not prepared, the dataset will have quantity but not quality.

It is similar to food preparation: as I mentioned earlier, it would be difficult for us to eat a raw potato, but much easier to eat one that is baked or boiled.
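As a rough sketch of that situation, the snippet below combines two made-up sources, keeps only the attributes wanted for analysis, and skips duplicate records. The row data, the `WANTED` tuple, the `combine` helper and the decision to treat a normalized email as the record key are all assumptions for the example.

```python
# Hypothetical rows from two different sources describing the same customers.
crm_rows = [
    {"email": "asha@example.com", "name": "Asha", "plan": "pro", "fax": ""},
    {"email": "ben@example.com", "name": "Ben", "plan": "free", "fax": ""},
]
billing_rows = [
    {"email": "ASHA@example.com ", "name": "Asha", "plan": "pro", "fax": ""},
    {"email": "chen@example.com", "name": "Chen", "plan": "pro", "fax": ""},
]

# Attributes we actually want to analyze; "fax" is left out.
WANTED = ("email", "name", "plan")

def combine(*sources):
    """Merge sources, keeping the first record seen for each normalized email."""
    seen = set()
    combined = []
    for rows in sources:
        for row in rows:
            # Normalize the key so "ASHA@example.com " matches "asha@example.com".
            key = row["email"].strip().lower()
            if key in seen:
                continue  # duplicate record, skip it
            seen.add(key)
            slim = {column: row[column] for column in WANTED}
            slim["email"] = key
            combined.append(slim)
    return combined

customers = combine(crm_rows, billing_rows)
for row in customers:
    print(row)
```

Without the normalization step, the two Asha records would look different to the computer and both would survive, which is exactly the kind of error data preparation is meant to catch.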

References:

Talend, Digitalvidya, Dataquest.io, Dezyre
