Basic Statistics for Data Analysis
Why Statistics?
Statistical methods help ensure that your data are interpreted correctly and that apparent relationships are really “significant”, or meaningful, rather than simply the result of chance. In short, statistical analysis helps you find meaning in otherwise meaningless numbers.
A “statistic” is simply a numerical value that describes some property of your data set. Well-known statistics include the average (or “mean”) value and the “standard deviation”. The standard deviation measures the variability within a data set around the mean value. The “variance” is the square of the standard deviation. The linear trend is another example of a data “statistic”.
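As a rough illustration of these statistics, the sketch below computes the mean, standard deviation and variance with Python's standard `statistics` module. The height values are made up for the example, and the population formulas (`pstdev`, `pvariance`) are an assumption; for a sample drawn from a larger population you would normally use `stdev` and `variance` instead.

```python
import statistics

# Hypothetical data: five heights in centimetres (made-up values)
heights_cm = [160.0, 165.0, 170.0, 175.0, 180.0]

mean = statistics.mean(heights_cm)            # the average value
std_dev = statistics.pstdev(heights_cm)       # population standard deviation
variance = statistics.pvariance(heights_cm)   # population variance

# The variance is the square of the standard deviation
print(f"mean={mean}, std_dev={std_dev:.3f}, variance={variance}")
```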
Steps in the Data Analysis Process
Before starting a data analysis pipeline, you should know that it mainly involves seven steps.
Step 1: Decide on the objectives or Pose a Question
The first step of the data analysis pipeline is to decide on objectives. Meeting these objectives usually requires significant data collection and analysis.
Step 2: What to Measure and How to Measure
Measurement generally refers to assigning numbers to indicate different values of variables. Suppose your research asks whether there is a relationship between the height and weight of humans. It would then make sense to measure the height and weight of people, for example with a measuring tape and a scale.
Step 3: Data Collection
Once you know what types of data you need for your statistical study, you can determine whether the data can be gathered from existing sources or databases. If the existing data is not sufficient, you have to collect new data. Even if you have existing data, it is very important to know how it was collected. This helps you determine the limitations on the generalizability of your results and conduct a proper analysis.
The more data you have, the easier it is to find better correlations, build better models and uncover more actionable insights. Data from more diverse sources is especially helpful.
Step 4: Data Cleaning
Another crucial step in the data analysis pipeline is improving the quality of your existing data. Data scientists often have to correct spelling mistakes, handle missing values and remove useless information. This step is critical because junk data may generate inappropriate results and mislead the business.
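The three cleaning tasks mentioned above can be sketched in plain Python. Everything here is hypothetical: the records, the `SPELLING_FIXES` lookup and the `clean` helper are invented for illustration, and filling a missing value with the mean is only one of several possible strategies.

```python
# Hypothetical raw records (made-up values): a misspelled city,
# a missing weight, and a "notes" field that carries no information.
raw_records = [
    {"city": "Londn", "weight_kg": 70.0, "notes": "n/a"},
    {"city": "Paris", "weight_kg": None, "notes": "n/a"},
    {"city": "London", "weight_kg": 80.0, "notes": "n/a"},
]

# Assumed lookup table of known misspellings
SPELLING_FIXES = {"Londn": "London"}


def clean(records):
    """Fix spellings, fill missing weights with the mean, drop 'notes'."""
    known = [r["weight_kg"] for r in records if r["weight_kg"] is not None]
    fill_value = sum(known) / len(known)  # mean of the known weights
    cleaned = []
    for r in records:
        city = SPELLING_FIXES.get(r["city"], r["city"])
        weight = r["weight_kg"] if r["weight_kg"] is not None else fill_value
        cleaned.append({"city": city, "weight_kg": weight})  # 'notes' omitted
    return cleaned


cleaned_records = clean(raw_records)
for record in cleaned_records:
    print(record)
```

In practice a library such as pandas is often used for this kind of work, but the logic stays the same.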
Step 5: Summarizing and Visualizing Data
Exploratory data analysis helps you understand the data better. A picture is worth a thousand words, and many people understand pictures better than a lecture. Alongside charts, summary statistics describe the data numerically: measures of variance indicate how the data are distributed around the center, and correlation refers to the degree to which two variables move in sync with one another.
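To make the idea of correlation concrete, here is a minimal sketch of Pearson's correlation coefficient written out by hand. The `pearson_r` helper and the height and weight values are made up for this example. The weights were chosen to rise in a perfectly straight line with height, so the coefficient comes out at (very nearly) 1.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r)
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(ss_x * ss_y)


# Hypothetical measurements (made-up values)
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 74, 82]

r = pearson_r(heights_cm, weights_kg)
print(f"Pearson's r = {r:.3f}")
```

A value near +1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.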
Step 6: Data Modeling
Now build models that correlate the data with your business outcomes and make recommendations. This is where the unique expertise of data scientists becomes important to business success: correlating the data and building models that predict business outcomes.
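One of the simplest models of this kind is a straight line fitted by least squares. The sketch below is only an illustration under stated assumptions: the `fit_line` helper, the advertising spend and the sales figures are all invented, and real business data would rarely fit a line this neatly.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data (made-up values): sales follow 2 * spend + 1 exactly
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)

# Use the fitted model to predict the outcome for a new spend level
predicted_sales = slope * 5.0 + intercept
print(f"slope={slope}, intercept={intercept}, prediction={predicted_sales}")
```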
Step 7: Optimize and Repeat
Data analysis is a repeatable process and often leads to continuous improvements, both to the business and to the data value chain itself.
Now you know the steps involved in a data analysis pipeline. Before advancing to more sophisticated techniques, I suggest starting your data analysis journey with the following statistics fundamentals.
Here is a road map for getting started with data analysis. Before starting any statistical analysis, we need to explore the data thoroughly, and the topics below are very useful for that.
Basic Statistics
- Cases, Variables, Types of Variables
- Matrix and Frequency Table
- Graphs and Shapes of Distributions
- Mode, Median and Mean
- Range, Interquartile Range and Box Plot
- Variance and Standard deviation
- Z-scores
- Contingency Table, Scatterplot, Pearson’s r
- Basics of Regression
- Elementary Probability
- Random Variables and Probability Distributions
- Normal Distribution, Binomial Distribution & Poisson Distribution
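Several of the topics in this list (mode, median, mean, range and z-scores) can be tried out in a few lines with Python's standard `statistics` module. This is a small sketch on made-up values, and it assumes the population standard deviation when computing z-scores.

```python
import statistics

# Made-up sample values, chosen so the arithmetic is easy to follow
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)            # arithmetic average
median = statistics.median(values)        # middle value of the sorted data
mode = statistics.mode(values)            # most frequent value
value_range = max(values) - min(values)   # largest minus smallest
std_dev = statistics.pstdev(values)       # population standard deviation

# A z-score says how many standard deviations a value lies from the mean
z_scores = [(v - mean) / std_dev for v in values]

print(mean, median, mode, value_range, std_dev)
print(z_scores)
```

A z-score of 0 means the value equals the mean, while a large positive or negative z-score flags a value that sits far from the rest of the data.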
Inferential Statistics
- Observational Studies and Experiments
- Sample and Population
- Population Distribution, Sample Distribution and Sampling Distribution
- Central Limit Theorem
- Point Estimates
- Confidence Intervals
- Introduction to Hypothesis Testing
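As a taste of the inferential topics, the sketch below computes a point estimate (the sample mean) and an approximate 95% confidence interval around it. The sample values are made up, and the interval uses the normal approximation as a simplifying assumption. For a sample this small, a t-distribution would usually be the more appropriate choice.

```python
import math
import statistics

# Hypothetical sample of ten measurements (made-up values)
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0, 5.1, 4.9]

n = len(sample)
sample_mean = statistics.mean(sample)    # point estimate of the population mean
sample_sd = statistics.stdev(sample)     # sample standard deviation
standard_error = sample_sd / math.sqrt(n)

# Critical value for a two-sided 95% interval under the normal approximation
z_critical = statistics.NormalDist().inv_cdf(0.975)

margin = z_critical * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"mean = {sample_mean:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```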