Making Data Management Decisions
This example is similar to the previous one. Here, I will show some basics of data analysis and data engineering. Setting aside missing data, coding valid data, recoding values, creating secondary variables, and grouping values within individual variables all help you make and implement more decisions with your data. Statisticians often call this task 'data management', while data scientists prefer the term 'data munging'. Whatever you call it, it is a vital and ongoing process when working with data.
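As a quick illustration of those ideas before we start, here is a minimal sketch of recoding values and creating a secondary variable. It uses a made-up toy DataFrame, not the gapminder data used below:

import numpy as np
import pandas as pd

# Toy data, purely for illustration
df = pd.DataFrame({'sex': [1, 2, 1, 9], 'height_cm': [170, 160, 180, 175]})

# Recode values: map numeric codes to labels, treating the code 9 as missing
df['sex'] = df['sex'].replace({1: 'Male', 2: 'Female', 9: np.nan})

# Create a secondary variable derived from an existing one
df['height_m'] = df['height_cm'] / 100
print(df)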
Download the code
Open the Jupyter Notebook file
Data Management: Module 2
First, import all the required libraries:
"""Created on Sun Sep 3 01:17:41 2017 @author: Kalyan """ import pandas as pd import numpy as np import os from pandas import DataFrame
Data analysis and data engineering: fix the display format to avoid runtime errors, then load the data set.
pd.set_option('display.float_format', lambda x: '%f' % x)
data = pd.read_csv("https://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False)
print(len(data))  # Get the number of rows in the data set
data.head()  # Print a few rows to see what the data looks like
Now check the type of each column:
print(data.dtypes)
Make all columns numeric except the country column, which holds categorical data:
colnames = data.columns.values.tolist()  # Get all the column names
colnames2 = colnames[1:len(data.columns)]  # All columns except the first (country)
for i in colnames2:
    data[i] = pd.to_numeric(data[i], errors='coerce')
Now explicitly make the country column categorical:
data[colnames[0]] = data[colnames[0]].astype('category')
print(data.dtypes)  # Now every column has the data type you chose
I will work with three variables in this example. First, find the number of NA values in each variable of interest:
print(data['incomeperperson'].isnull().sum())  # Number of NAs is 23
print(data['urbanrate'].isnull().sum())  # Number of NAs is 10
print(data['employrate'].isnull().sum())  # Number of NAs is 35
You can also get the number of NA values for every column in the data set:
print(data.isnull().sum())
Now drop all rows where all three variables are NA:
data = data.dropna(subset=['incomeperperson', 'urbanrate', 'employrate'], how='all')
print(len(data))
The code above drops the 4 rows where 'incomeperperson', 'urbanrate', and 'employrate' are all NA.
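If you are unsure about the how argument, a minimal sketch on a made-up two-column frame (not the gapminder data) shows the difference: how='all' drops a row only when every listed column is NA, while how='any' drops it when at least one is.

demo = pd.DataFrame({'a': [1, np.nan, np.nan], 'b': [np.nan, 2, np.nan]})
print(demo.dropna(subset=['a', 'b'], how='all'))  # Drops only the last row, where both columns are NA
print(demo.dropna(subset=['a', 'b'], how='any'))  # Drops all three rows, since each contains at least one NA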
Now find the mean of each of the three columns. I will replace the remaining NA values in each column with that column's mean.
mean = data[['incomeperperson', 'urbanrate', 'employrate']].mean()
mean = np.round(mean, decimals=2)
print(mean)
Replace the missing data in each of the three columns with the corresponding mean:
data['incomeperperson'] = data['incomeperperson'].replace(np.nan, mean['incomeperperson'])
data['urbanrate'] = data['urbanrate'].replace(np.nan, mean['urbanrate'])
data['employrate'] = data['employrate'].replace(np.nan, mean['employrate'])
data = data.iloc[:, 1:]  # Drop the country column
data = data[['incomeperperson', 'urbanrate', 'employrate']]  # Keep only the three variables
print(data.head())
print(data.shape)
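As a side note, the same imputation can be written in one step with fillna, because a Series passed to fillna is aligned to the columns by name. This is just an equivalent sketch, not the code used above:

# One-step sketch: fillna matches each entry of 'mean' to the column with the same name
data[['incomeperperson', 'urbanrate', 'employrate']] = data[['incomeperperson', 'urbanrate', 'employrate']].fillna(mean)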
Now make the income variable categorical, splitting it into three categories:
data['factor_income'] = pd.cut(data['incomeperperson'],
                               [0, 1000, 12735, data['incomeperperson'].max()],
                               labels=['Lower Income', 'Middle Income', 'Upper Income'])
print('counts for each incomeperperson')
c1 = data['factor_income'].value_counts(sort=False, dropna=False)
print(c1)
c5 = data['factor_income'].value_counts(sort=False, normalize=True)
print(c5)
Data management for urban rate: do a quartile split (use the qcut function and ask for 4 groups, which gives you a quartile split).
data['factor_urbanrate'] = pd.qcut(data['urbanrate'], 4, labels=["1", "2", "3", "4"])
print('counts for each urban rate')
c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False)
c3 = data['factor_urbanrate'].value_counts(sort=False, normalize=True)  # Percentage of the data that falls in each category
print(c2)  # Frequency of each category, including the number of NA values
print("Percentage for each category\n", c3)
I will create a new categorical employrate variable with three levels: Low, Average, and High.
data['EmpRateCatogory'] = pd.cut(data['employrate'],
                                 [0, 50, 70, data['employrate'].max()],
                                 labels=['Low', 'Average', 'High'])
c4 = data['EmpRateCatogory'].value_counts(sort=False, dropna=False)
print(c4)
c6 = data['EmpRateCatogory'].value_counts(sort=False, normalize=True)
print("Percentage for each category\n", c6)
To summarize: first I made all columns numeric except the country column, then explicitly changed the country column to the categorical type. Since I work with three variables in this assignment, I found the number of NA values in each variable of interest, dropped the rows where all three variables are NA, and took the subset of the data containing only 'incomeperperson', 'urbanrate', and 'employrate'. Finally, I computed the mean of each of the three columns, because I replace each column's remaining NA values with that column's mean.
Here, incomeperperson is a numerical variable, so I made it categorical to see its frequency distribution. I used three categories: Lower Income, Middle Income, and Upper Income. A value in the range 0 to 1000 is Lower Income, 1000 to 12735 is Middle Income, and anything above 12735 is Upper Income.
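One detail worth knowing: pd.cut builds right-inclusive intervals by default, so the bins above are really (0, 1000], (1000, 12735], and (12735, max]. A minimal sketch with made-up values (not the gapminder data) shows where a boundary value lands:

check = pd.cut(pd.Series([999.0, 1000.0, 1000.01]), [0, 1000, 12735],
               labels=['Lower Income', 'Middle Income'])
print(check)  # 1000.0 falls in 'Lower Income' because the first interval is (0, 1000]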
Similarly, I split the 'urbanrate' variable into four categories using a quartile split. The 'employrate' variable, on the other hand, was grouped into three categories: 0 to 50 is a low employment rate, 50 to 70 is an average employment rate, and above 70 is a high employment rate.
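If you want to see the actual cut points that qcut chose for the quartile split, you can pass retbins=True. A small sketch, reusing the data frame built above:

# retbins=True also returns the quartile boundaries that qcut computed
groups, bins = pd.qcut(data['urbanrate'], 4, labels=["1", "2", "3", "4"], retbins=True)
print(bins)  # The five edges that bound the four quartile groups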