“What is data science?”
Most people hyping data science have focused on the first word: data. They care about volume, velocity, and whatever other buzzwords describe data that is too big to analyze in Excel. This hype about the size (relative or absolute) of the data being collected fed into a second category of hype: hype about tools. People threw around names like EC2, Hadoop, and Pig, and had huge debates about Python versus R. But the key word in data science is not “data”; it is “science”. Data science is only useful when the data are used to answer a question. That is the science part of the equation.
What Is the Structure of a Data Science Project?
Before looking at the core activities of a data scientist, it helps to understand the phases of a data science project and the kinds of output a data science experiment produces.
A data science project has five main phases:
1. Question
2. Exploratory data analysis
3. Formal modeling
4. Interpretation
5. Communication
Output of a Data Science Experiment
The possible outputs of a data science experiment are nearly limitless. However, four general types are used most often:
- Reports
- Presentations
- Interactive web pages
- Data products or data apps
What Does a Data Scientist Do?
Now you know what data science is, how a data science project is structured, and what a data science experiment produces. A data scientist performs a set of core activities that make up the data analysis epicycle. They include:
1. Define the question
The first step is setting expectations. This includes deciding what question you are going to answer for the business. Define that question first, and then look for the answer through various methods.
2. Define the ideal data set for the experiment
The next step is to find out what kind of data you need to answer the question. In this step, a data scientist usually figures out the ideal data set for the experiment.
3. Get the data
You know what kind of data can answer your question. Now go ahead and collect data from diverse sources.
4. Clean the data
In the real world, the data you are analyzing is often messy, poorly maintained, and difficult to work with. A data scientist takes part in cleaning the data and making it useful for analysis.
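As a rough illustration of what cleaning can involve, here is a minimal sketch using only the Python standard library. The records, the field names (city, temp_c), and the clean_rows helper are all hypothetical; real projects usually need rules specific to their own data.

```python
# Hypothetical raw records: stray whitespace, inconsistent casing,
# a missing value, and a duplicate entry.
raw_rows = [
    {"city": " Boston ", "temp_c": "21.5"},
    {"city": "boston", "temp_c": "21.5"},
    {"city": "Chicago", "temp_c": ""},
    {"city": "Denver", "temp_c": "18"},
]


def clean_rows(rows):
    """Normalize text, drop rows with missing values, and remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        temp = row["temp_c"].strip()
        if not city or not temp:
            continue  # skip rows with a missing field
        record = (city, float(temp))
        if record in seen:
            continue  # skip rows already seen after normalization
        seen.add(record)
        cleaned.append({"city": city, "temp_c": record[1]})
    return cleaned


cleaned_rows = clean_rows(raw_rows)
print(cleaned_rows)
```

Dropping incomplete rows is only one possible policy; depending on the question, filling in or flagging missing values may be more appropriate.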
5. Do exploratory analysis to understand the data better
Do some exploratory analysis to understand the data and gain more insight. Often this means presenting the data in a pictorial or graphical format so that it can be analyzed easily.
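A first exploratory pass can be as simple as a few summary statistics and a rough picture. The sketch below uses Python's standard statistics module on a made-up list of daily order counts; the data and the variable names are assumptions for illustration only.

```python
import statistics

# Hypothetical sample: daily order counts for one week.
daily_orders = [12, 15, 11, 14, 90, 13, 16]

summary = {
    "count": len(daily_orders),
    "min": min(daily_orders),
    "max": max(daily_orders),
    "mean": statistics.mean(daily_orders),
    "median": statistics.median(daily_orders),
}

for name, value in summary.items():
    print(f"{name}: {value}")

# A crude text bar chart: one '#' per five orders.
for day, orders in enumerate(daily_orders, start=1):
    print(f"day {day}: {'#' * (orders // 5)} {orders}")
```

In this made-up sample the mean sits well above the median, which points to an unusual day worth a closer look before any modeling.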
6. Perform feature engineering / feature selection
Feature engineering is the process of creating new features, or selecting appropriate ones, using domain knowledge of the data so that machine learning algorithms perform better.
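To make the idea concrete, here is a small sketch that derives two new features from raw order records. The records, the field names, and the add_features helper are hypothetical; which features are actually useful depends on the domain and the question.

```python
from datetime import date

# Hypothetical raw records for an online store.
orders = [
    {"order_date": date(2023, 1, 7), "total": 120.0, "items": 4},
    {"order_date": date(2023, 1, 9), "total": 45.0, "items": 3},
]


def add_features(order):
    """Return a copy of the order with two derived features."""
    features = dict(order)
    # date.weekday() returns 5 for Saturday and 6 for Sunday.
    features["is_weekend"] = order["order_date"].weekday() >= 5
    features["price_per_item"] = order["total"] / order["items"]
    return features


engineered = [add_features(order) for order in orders]
for row in engineered:
    print(row["is_weekend"], row["price_per_item"])
```

Both derived columns encode domain knowledge (weekend shopping behaves differently, basket size matters) that the raw date and total do not expose directly.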
7. Do prediction / modeling
The next step is building a model. A data scientist may create many models; choosing the right statistical model from a set of candidates is called model selection. The data scientist is also responsible for picking the appropriate model for the analysis.
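One common way to do model selection is to fit each candidate on training data and compare errors on held-out data. The sketch below is a simplified illustration with made-up numbers and two hypothetical candidates (a constant mean predictor and a least-squares line); in practice you would typically use a library and a more careful validation scheme.

```python
# Hypothetical data: (x, y) pairs split into training and validation sets.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]
valid = [(5, 10.1), (6, 11.9)]


def fit_mean(points):
    """Candidate 1: always predict the mean of y."""
    mean_y = sum(y for _, y in points) / len(points)
    return lambda x: mean_y


def fit_line(points):
    """Candidate 2: ordinary least-squares line y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def mse(model, points):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


candidates = {"mean": fit_mean(train), "line": fit_line(train)}
scores = {name: mse(model, valid) for name, model in candidates.items()}
best_name = min(scores, key=scores.get)
print(scores)
print("selected:", best_name)
```

Scoring on data the models did not see during fitting is the key design choice here: it rewards the candidate that generalizes, not the one that merely memorizes the training set.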
8. Interpret the results
Analyzing the data and interpreting results is another important part of the data science process.
9. Create a dashboard
Visualizing and communicating data is really important. Creating reports and dashboards helps people understand the data and make data-driven decisions.
10. Show the results to other people
Now, show the results to the world. It is important that your manager, VP, or colleagues understand what insights you have derived from the data and why they are important. Poor communication may fail to convince people, and that can make the difference between action and inaction on your analysis.
These steps are not always followed in order. It is possible to go back and forth between them to get a better result.
Skill Matrix for a Data Scientist