There are approximately 7.9 billion people in the world, and on average 1.7 MB of data is generated per person every day. That works out to roughly 13 petabytes of data per day, a 13 followed by 15 zeros in bytes. For context, if you recorded something 24/7 in full HD for 3.5 years, you would end up with about 1 petabyte of data. In other words, the equivalent of some 400 thousand hours of full HD recording is produced every day.
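Just to sanity-check those numbers, here is a rough back-of-the-envelope calculation in Python, using only the figures quoted above:

```python
# Back-of-the-envelope check of the figures above (decimal units: 1 PB = 1e9 MB).
people = 7.9e9                      # world population
mb_per_person_per_day = 1.7

pb_per_day = people * mb_per_person_per_day / 1e9
hours_per_pb = 3.5 * 365 * 24       # ~1 PB per 3.5 years of full HD recording

print(f"{pb_per_day:.1f} PB generated per day")                  # ~13.4 PB
print(f"{pb_per_day * hours_per_pb:,.0f} hours of full HD/day")  # ~412,000 hours
```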
We are at a stage where we produce far more data than we consume. In a time when data is power, it is important to harness as much of it as possible. This is where data science comes in. As the name suggests, it is the science of utilising and making sense of data, especially large volumes of it.
Data science is the process of collecting or building, cleaning, structuring, analysing and gaining insights from data, and finally using those insights to make smart business decisions. When talking about data science, the data being referred to is, more often than not, “Big Data”, i.e., data of high volume, high velocity and high variety. Let’s look at the general processes that constitute data science.
The first and most important step in data science is gathering data. How this is done depends on the use case; the main classification is based on the data source.
Primary data is data collected directly by the entity that intends to use it. For example, the user interaction data an e-commerce website collects on its own platform is primary data. This is first-hand information and generally the most reliable form of data. It can also be collected through mailed questionnaires, surveys, personal interviews, telephone interviews, case studies, focus groups and so on. Because it is collected with a specific use case in mind, it usually contains all the required attributes.
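To make that concrete, a single primary-data record collected first-hand might look something like the minimal sketch below (the field names are hypothetical, not any particular platform’s schema):

```python
from datetime import datetime, timezone

# A hypothetical primary-data record: one user interaction event captured
# first-hand by an e-commerce platform. Field names are purely illustrative.
event = {
    "event_type": "click",                              # click, add_to_cart, purchase, ...
    "user_id": "u_12345",                               # None for anonymous visitors
    "item_id": "sku_987",
    "page": "/product/sku_987",
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# In practice, events like this are appended to a log or event stream
# and aggregated later for analysis.
print(event)
```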
Secondary data is data collected by third parties and sold to interested buyers. This type of data tends to be more structured and closer to ready to use. However, the purpose for which it was collected may differ from what the buyer needs, which can lead to some mismatch when structuring it for a more specific use case.
The next step is the most important, and often the most time-consuming as well. Once the raw data is acquired, it must be cleaned and structured for the use case at hand: getting rid of irrelevant or missing data points and trimming the data down to only the information that is actually relevant.
For example, if one wants to analyse user interactions on an e-commerce website, one would need to clean up the data and keep only the relevant events, such as clicks and purchases. In the case of secondary data, the data will usually already be cleaned, but one might still need to change its structure or do some further filtering based on the use case.
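As a rough sketch of what this filtering can look like, assuming the raw events sit in a pandas DataFrame with hypothetical event_type, user_id and timestamp columns:

```python
import pandas as pd

# Hypothetical raw interaction log; column names are illustrative.
raw = pd.DataFrame({
    "event_type": ["click", "scroll", "purchase", "click", None],
    "user_id":    ["u1", "u2", "u1", None, "u3"],
    "timestamp":  pd.to_datetime([
        "2023-05-01 18:05", "2023-05-01 18:06", "2023-05-01 19:10",
        "2023-05-06 10:30", "2023-05-06 11:00",
    ]),
})

# Drop rows with missing event types and keep only the events that matter
# for this use case (clicks and purchases). Anonymous users are kept.
relevant = ["click", "purchase"]
clean = (
    raw.dropna(subset=["event_type"])
       .loc[lambda df: df["event_type"].isin(relevant)]
       .reset_index(drop=True)
)
print(clean)
```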
Your data is now cleaned up and ready to use. But before you can jump into the cool stuff, you will need to analyse your data to understand its underlying patterns and behaviour. For example, users could be most active from 6 to 9 in the evening on weekdays and from 10 to 12 in the morning on weekends. Similarly, it could be the case that your website sees more interactions from anonymous users than from users who are signed in to your platform.
Such analyses help you understand your data and your users’ behaviour, and they inform smart decisions: which problem you need to solve, and which optimisations can help drive up profits.
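Continuing the same hypothetical example, a simple way to surface such activity patterns with pandas might be:

```python
import pandas as pd

# `clean` is re-created here so the snippet stands alone; in practice it would
# be the filtered DataFrame from the cleaning step above.
clean = pd.DataFrame({
    "event_type": ["click", "purchase", "click", "click"],
    "user_id":    ["u1", "u1", None, "u3"],
    "timestamp":  pd.to_datetime([
        "2023-05-01 18:05", "2023-05-01 19:10",
        "2023-05-06 10:30", "2023-05-06 11:00",
    ]),
})

# Activity by day type and hour: weekday evenings vs. weekend mornings.
clean["hour"] = clean["timestamp"].dt.hour
clean["day_type"] = clean["timestamp"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
print(clean.groupby(["day_type", "hour"]).size())

# Share of interactions coming from anonymous (not signed-in) users.
print(clean["user_id"].isna().mean())
```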
Now for the fun part. You have collected data, cleaned it up and done an initial analysis. At this point you know what problems you want to solve and you have some basic knowledge about your scenario. Time to get your hands dirty!
The first step is to properly frame your problem: what do you expect as an output from your AI? After this comes the exploration and discovery phase. There are a ton of questions one needs to ask during this phase. Some of them are:
This is also a very time-consuming step, but it is extremely rewarding once you start seeing results and underlying patterns in your data that you may have missed in the earlier analysis step.
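As a rough sketch of what this exploration can look like in practice, assuming (purely for illustration) a per-user feature table built from the cleaned events:

```python
import pandas as pd

# Hypothetical per-user feature table derived from the cleaned events;
# column names and values are purely illustrative.
users = pd.DataFrame({
    "sessions_per_week": [3, 1, 7, 2, 5],
    "avg_basket_value":  [20.0, 5.5, 42.0, 11.0, 30.5],
    "purchased":         [0, 0, 1, 0, 1],
})

# Quick exploration: summary statistics, plus how each feature relates
# to the outcome you framed the problem around.
print(users.describe())
print(users.corr()["purchased"])
```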