Data Management
In this example, I will show some basics of Data Analysis and Data Engineering. Setting aside missing data, coding valid data and recoding values, creating secondary variables, Grouping values within individual variables helps you to make and implement even more decisions with data. Statisticians often call this task ‘data management’, while data scientists like the term ‘data munging’.
Download the code
Open the Jupyter Notebook file
Data Management : Module 1
Import all libraries first
"""Created on Sun Sep 3 01:17:41 2017 @author: Kalyan """ import pandas as pd import numpy as np from pandas import DataFrame import os
Execute Python Online
Load the Gapminder Data Set
data= pd.read_csv("https://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False) print(data.head()) print("No of Records : ",len(data)) #This will show number of observations print("No of Features : ",len(data.columns)) #This will show number of Features
Check the types of the variable
data.dtypes
Setting the variables you will be working with to numeric
data['incomeperperson']=data['incomeperperson'].convert_objects(convert_numeric=True) data['urbanrate']=data['urbanrate'].convert_objects(convert_numeric=True) data['employrate']=data['employrate'].convert_objects(convert_numeric=True) print(data.dtypes)
First I want to see are there any countries where incomeperperson, urbanrate and employrate all three features are NA values
sub1=data[np.isnan(data['incomeperperson']) & np.isnan(data['urbanrate']) & np.isnan(data['employrate'])] print(sub1)
#Now I made the income variable as categorical. I made them into three categories
print ('counts for original incomeperperson') c1 = data['factor_income'].value_counts(sort=False, dropna=False) print(c1) #Here you will get the frequncy of each catagory as well as number of NA values print(data['factor_income'].describe())
Data management for urban rate; I will use quartile split (use qcut function & ask for 4 groups – gives you quartile split)
data['factor_urbanrate']=pd.qcut(data['urbanrate'],4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"]) print ('counts for original incomeperperson') c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False) c3 = data['factor_urbanrate'].value_counts(sort=False, dropna=False,normalize=True)#It will display percentage of data #falls in each catagory print(c2) #Here you will get the frequncy of each catagory as well as number of NA values print("Percentage for each catagory\n",c3*100)
Now create a new employrate variable which is categorical and cretae three categories as Low, average, high
data['EmpRateCatogory'] =pd.cut( data['employrate'],[0,50,70,data.ix[:,['employrate']].max()], labels=['Low','Average','High']) c4 = data['EmpRateCatogory'].value_counts(sort=False, dropna=False) print(c4)
Here income perperson is numerical variable so I want to make it catagorical to see the Frequency distribution. I make them into three catagory Lower Income, middle Income and upper Income. If range is 0:1000 the Lower Income, if 1000 to 12735 then Middle income and if it is higher than 12735 then Higher Income.
Similarly, I made “ urbanrate” variable as four category. In this case I used categorization as 4.
On the other way, “employmentrate” has been made into three groups. 0 to 50 means Low employment rate, 50 to 70 means average employment rate and more than 70 is high employment rate.
Try it Yourself