Data Management

In this example, I will show some basics of data analysis and data engineering. Handling missing data, coding valid data, recoding values, creating secondary variables, and grouping values within individual variables all help you make and implement more decisions with your data. Statisticians often call this task 'data management', while data scientists prefer the term 'data munging'.


Data Management: Module 1

Import all libraries first

"""Created on Sun Sep 3 01:17:41 2017
@author: Kalyan
"""
import pandas as pd
import numpy as np
from pandas import DataFrame
import os


Load the Gapminder Data Set 

data = pd.read_csv("https://makemeanalyst.com/python-data-science/gapminder.csv", low_memory=False)
print(data.head())
print("No of Records : ", len(data))           #This will show the number of observations
print("No of Features : ", len(data.columns))  #This will show the number of features

Check the types of the variables

data.dtypes

Setting the variables you will be working with to numeric

#convert_objects has been removed from pandas; pd.to_numeric with errors='coerce'
#forces each column to numeric and turns unparseable entries into NaN
data['incomeperperson'] = pd.to_numeric(data['incomeperperson'], errors='coerce')
data['urbanrate'] = pd.to_numeric(data['urbanrate'], errors='coerce')
data['employrate'] = pd.to_numeric(data['employrate'], errors='coerce')
print(data.dtypes)
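The convert_objects method was deprecated and later removed from pandas; pd.to_numeric with errors='coerce' is the current way to force a column to numeric, turning anything unparseable into NaN. A minimal sketch of this coercion behaviour on synthetic data:

```python
import pandas as pd

# Synthetic column with some non-numeric entries
raw = pd.Series(['12.3', '45', '', 'n/a'])

# errors='coerce' turns anything unparseable into NaN
num = pd.to_numeric(raw, errors='coerce')
print(num)
print("missing after coercion:", num.isnull().sum())
```

The empty string and 'n/a' both become NaN, which is exactly what lets the missing-data checks below work on the real columns.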

First, I want to see whether there are any countries where all three features (incomeperperson, urbanrate and employrate) have NA values

sub1=data[np.isnan(data['incomeperperson']) & np.isnan(data['urbanrate']) & np.isnan(data['employrate'])]
print(sub1)
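Before subsetting, it can also help to count the missing values per column. A sketch on a small synthetic frame (the column names mirror the Gapminder features, but the values are made up):

```python
import pandas as pd
import numpy as np

# Small synthetic frame standing in for the Gapminder columns
df = pd.DataFrame({
    'incomeperperson': [1000.0, np.nan, 250.0, np.nan],
    'urbanrate':       [65.0,   np.nan, 30.0,  80.0],
    'employrate':      [55.0,   np.nan, np.nan, 72.0],
})

# Missing-value count per column
na_counts = df[['incomeperperson', 'urbanrate', 'employrate']].isnull().sum()
print(na_counts)

# Rows where all three features are missing (same logic as the sub1 subset)
all_missing = df[df['incomeperperson'].isnull()
                 & df['urbanrate'].isnull()
                 & df['employrate'].isnull()]
print(len(all_missing))
```

Here only one synthetic row is missing on all three features at once, even though several rows are missing at least one.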

Now I make the income variable categorical, splitting it into three categories (the boundaries are described below: 0 to 1000, 1000 to 12735, and above 12735)

data['factor_income'] = pd.cut(data['incomeperperson'],
                               [0, 1000, 12735, data['incomeperperson'].max()],
                               labels=['Lower Income', 'Middle Income', 'Upper Income'])
print('counts for incomeperperson categories')
c1 = data['factor_income'].value_counts(sort=False, dropna=False)
print(c1) #This shows the frequency of each category as well as the number of NA values
print(data['factor_income'].describe())

Data management for urban rate: I will use a quartile split (the qcut function with 4 groups gives you a quartile split)

data['factor_urbanrate'] = pd.qcut(data['urbanrate'], 4, labels=["1=0%tile","2=25%tile","3=50%tile","4=75%tile"])
print('counts for urbanrate quartiles')
c2 = data['factor_urbanrate'].value_counts(sort=False, dropna=False)
c3 = data['factor_urbanrate'].value_counts(sort=False, dropna=False, normalize=True) #This gives the fraction of the data
#that falls in each category
print(c2) #This shows the frequency of each category as well as the number of NA values
print("Percentage for each category\n", c3*100)
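If you want to see exactly which boundaries qcut picked, retbins=True returns them alongside the categories. A sketch on synthetic data (the values 1 to 100 standing in for urbanrate):

```python
import pandas as pd
import numpy as np

urban = pd.Series(np.arange(1, 101, dtype=float))  # synthetic stand-in for urbanrate

# retbins=True also returns the cut points qcut derived from the data
cats, bins = pd.qcut(urban, 4,
                     labels=["1=0%tile", "2=25%tile", "3=50%tile", "4=75%tile"],
                     retbins=True)
print("quartile boundaries:", bins)
print(cats.value_counts())
```

With 100 evenly spread values, each quartile ends up with exactly 25 observations; on real data the boundaries land wherever the 25th, 50th and 75th percentiles fall.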

Now create a new employrate variable which is categorical, with three categories: Low, Average and High

data['EmpRateCategory'] = pd.cut(data['employrate'],
                                 [0, 50, 70, data['employrate'].max()],
                                 labels=['Low', 'Average', 'High'])
c4 = data['EmpRateCategory'].value_counts(sort=False, dropna=False)
print(c4)

Here incomeperperson is a numerical variable, so I made it categorical to see the frequency distribution. I split it into three categories: Lower Income, Middle Income and Upper Income. If the value is between 0 and 1000 it is Lower Income, between 1000 and 12735 it is Middle Income, and above 12735 it is Upper Income.

Similarly, I turned the "urbanrate" variable into four categories, using a quartile split.

In the same way, "employrate" has been split into three groups: 0 to 50 means a low employment rate, 50 to 70 means average, and above 70 means high.
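The difference between the two kinds of split can be seen side by side: pd.cut uses the fixed boundaries you supply, while pd.qcut derives boundaries from the data so each group holds roughly the same number of observations. A sketch on synthetic rates:

```python
import pandas as pd

rates = pd.Series([10., 20., 30., 40., 80., 85., 90., 95.])

# pd.cut: fixed boundaries chosen by hand (like the employrate split)
fixed = pd.cut(rates, [0, 50, 70, 100], labels=['Low', 'Average', 'High'])
print(fixed.value_counts(sort=False))

# pd.qcut: data-driven boundaries, roughly equal-sized groups (like the urbanrate split)
quart = pd.qcut(rates, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
print(quart.value_counts(sort=False))
```

On this skewed sample, pd.cut leaves the Average bin empty, while pd.qcut puts two values in every quartile regardless of the gap in the data.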
