Making Data Management Decisions | Your Second Program

Posted on Posted in Data Science with Python

This example is almost similar to the previous example.  HereI will show some basics of  Data Analysis and Data Engineering.  Setting aside missing data, coding valid data and recoding values, creating secondary variables, Grouping values within individual variables  helps you to make and implement even more decisions with data. Statisticians often call this task ‘data management’, while data scientists like the term ‘data munging’. Whatever you call it, it is a vital and ongoing process when working with data.

Download the code

 Open the Jupyter Notebook file

Data Management : Module 2

First Import all libraries

"""Created on Sun Sep 3 01:17:41 2017
@author: Kalyan
"""
import pandas as pd
import numpy as np
import os
from pandas import DataFrame

Execute Python Online

Data Analysis and Data Engineering; Bug fix for display formats to avoid run time errors and Load the Data Set

pd.set_option('display.float_format', lambda x:'%f'%x)
data= pd.read_csv("http://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False)
print(len(data)) #Get no of rows in the data set
data.head()#Just print fews rows to see how the data looks like

Now check the types of each columns

print(data.dtypes)

Make all the Columns as Numeric Except Country Column as it is categorical data

colnames=data.columns.values.tolist() #Get all the column names
colnames2=colnames[1:len(data.columns)] 
for i in colnames2:
    data[i]=pd.to_numeric(data[i], errors='coerce')

Now explicitly make the country column as categorical

data[colnames[0]]=data[colnames[0]].astype('category') 
print(data.dtypes) #Now you will get all the column datatypes as your choice

I will work with three variables for this example. Now try to find no of NA values in my varible of interest

print(data['incomeperperson'].isnull().sum()) #No of NA is 23
print(data['urbanrate'].isnull().sum()) #No of NA is 10
print(data['employrate'].isnull().sum()) #NO of NA is 35

You can get number of NA values in the full data set

print(data.isnull().sum())

Now I want to drop all rows where both the three variables are NA

data=data.dropna(subset=['incomeperperson','urbanrate','employrate'], how='all')
print(len(data))

Above code will drop 4 rows where both ‘incomeperperson’, ‘urbanrate’, ’employrate’ are NA

Now find the mean value of all there columns. I will replace rest of the NA values with the mean value of that column

mean=data[['incomeperperson','urbanrate','employrate']].mean()
mean=np.round(mean, decimals=2)
print(mean)

Replace the missing data with the mean of each three columns respectively

data['incomeperperson']=data['incomeperperson'].replace(np.nan,mean[0])
data['urbanrate']=data['urbanrate'].replace(np.nan,mean[1])
data['employrate']=data['employrate'].replace(np.nan,mean[2])

data=data.iloc[:,1:]
data=data[['incomeperperson','urbanrate','employrate']]
print(data.head())
print(data.shape)

Now I made the income variable as categorical. I made them into three categories

data['factor_income']=pd.cut(data['incomeperperson'],[0,1000,12735, data.ix[:,['incomeperperson']].max()], 
labels=['Lower Income','Middle Income','Upper Income'])
 
print ('counts for each incomeperperson')
c1 = data['factor_income'].value_counts(sort=False, dropna=False)
print(c1)
c5 = data['factor_income'].value_counts(sort=False, normalize=True)
print(c5)

Data management for urban rate. Do quartile split (use qcut function & ask for 4 groups – gives you quartile split)

data['factor_urbanrate']=pd.qcut(data['urbanrate'],4, labels=["1","2","3","4"])
print ('counts for each urban rate')
c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False)
c3 = data['factor_urbanrate'].value_counts(sort=False, normalize=True)#It will display percentage of data falls in each catagory
print(c2) #You will get the frequncy of each catagory as well as number of NA values
print("Percentage for each catagory\n",c3)

I will create new employrate variable as categorical i.e. Low, Average, High

data['EmpRateCatogory'] =pd.cut( data['employrate'],[0,50,70,data.ix[:,['employrate']].max()],
labels=['Low','Average','High'])

c4 = data['EmpRateCatogory'].value_counts(sort=False, dropna=False)
print(c4)
c6 = data['EmpRateCatogory'].value_counts(sort=False, normalize=True)
print("Percentage for each catagory\n",c6)

 

At first I have Made all the Columns as Numeric Except Country Column as it is categorical data. Then change the type of the country as categorical type. As, I will work with three variables for this assignment so, try to find no of NA values in my variables of interest. Then I have dropped all rows where both the three variables are NA. Now get the subset of data only for ‘incomeperperson’, ‘urbanrate’, and ’employrate’. Now find the mean value of all there columns. Because i will replace rest of the NA values with the mean value of that column.

Here income perperson is numerical variable so I want to make it catagorical to see the Frequency distribution.I will make in three catagory Lower Income, middle Income and upper Income.If range is 0:1000 the Lower Income, if 1000 to 12735 then Middle income and if it is higher than 12735 then Higher Income.

Similarly, I made “ urbanrate” variable  as four category. In this case I used categorization as 4. On the other way, “employmentrate” has been made into three groups. 0 to  50 means Low employment rate, 50 to 70 means average employment rate and more than 70 is high employment rate.

Try it Yourself