Step 1: Define a Question and Get Your Data Science Project Started

Posted in Data Science with Python

A typical data science project is structured in five main phases.

Structure of a Data Science Project | Different Phases in Data Science Project

The first phase is always the most important one: it is where you ask the question and specify what you are interested in learning from the data. I would like to examine the Gapminder dataset and ask how gross domestic product (GDP) is related to urbanization, since income per person may depend on both urbanization and the employment rate. Alternatively, GDP may fall when the unemployment rate is high. So here I would like to explore the relationship between income per person and two other variables:

1.      Urbanization

2.      Employment Rate

Data Science Questions:

1.      Is GDP associated with urbanization (urban rate)?

2.      Does GDP have any relationship with the employment rate? Does a bigger GDP imply a higher employment rate?

My hypothesis is that the answer to both questions is yes.


My variables of interest are:

The data set contains many variables, but I am interested in only three of them: “incomeperperson”, “urbanrate”, and “employrate”.


I think it is reasonable to hypothesize that both the “employrate” and “urbanrate” variables are positively associated with “incomeperperson”. The following studies informed this hypothesis:
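As a first, rough check of this hypothesis, one option is to compute the Pearson correlation between “incomeperperson” and each of the other two variables. The sketch below is only an illustration under assumptions: the function name income_correlations is hypothetical, the column names are the ones quoted in this post, and pandas is assumed to be installed. Pearson correlation is one possible choice of measure here, not necessarily the method used later in this tutorial.

```python
import pandas as pd


def income_correlations(data):
    """Return Pearson r between incomeperperson and each explanatory variable.

    `data` is assumed to be a DataFrame with the columns named in this post.
    """
    results = {}
    for explanatory in ["urbanrate", "employrate"]:
        # Coerce to numbers (unparseable entries become NaN), then drop
        # rows where either value in the pair is missing.
        pair = data[["incomeperperson", explanatory]].apply(
            pd.to_numeric, errors="coerce"
        ).dropna()
        results[explanatory] = pair["incomeperperson"].corr(pair[explanatory])
    return results


# Hypothetical usage, assuming gapminder.csv is in the working directory:
# print(income_correlations(pd.read_csv("gapminder.csv")))
```

A positive coefficient for both variables would be consistent with the hypothesis, but a correlation alone does not show that urbanization or employment causes higher income.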


1.  “Causal relationship between construction activities, employment and GDP: The case of Hong Kong”, Y.H. Chiang, Li Tao, Francis K.W. Wong, Volume 46, April 2015, Pages 1–12.

2.  “An integrated approach to climate change, income distribution, employment, and economic growth”, Lance Taylor, Armon Rezai, Duncan K. Foley, Ecological Economics, Volume 121, January 2016, Pages 196–205.

3.  “The Urbanization Process and Economic Growth: The So-What Question”, Vernon Henderson, Journal of Economic Growth, Volume 8, Issue 1, March 2003, Pages 47–71.

4.  “Difference among the Growth of GDP and Urbanization of the Provinces and the Cities in West China since the Reform and Opening-up”, Li Zhen, Yang Yongchun, Liu Yuxiang, China Population, Resources and Environment, Volume 18, Issue 5, October 2008, Pages 19–26.

Download and Learn about Gapminder Dataset

For this Data Science with Python tutorial, I will work with a data set called Gapminder, and I will provide some sample Python code for learning data analysis fundamentals. This portion of the Gapminder data includes one year of numerous country-level indicators of health, wealth, and development.


Download Gapminder Data Set: gapminder.csv
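Once the file is downloaded, a minimal sketch for loading it with pandas and keeping only the variables of interest could look like this. The helper name load_gapminder is hypothetical, and the “country” column name is an assumption about the file layout; adjust both to match your copy of gapminder.csv.

```python
import pandas as pd

# Column names as quoted in this post.
VARIABLES = ["incomeperperson", "urbanrate", "employrate"]


def load_gapminder(source):
    """Read the Gapminder CSV and keep the country plus the three variables.

    `source` can be a file path such as "gapminder.csv" or a file-like object.
    The "country" column name is an assumption about the file.
    """
    data = pd.read_csv(source, low_memory=False)
    subset = data[["country"] + VARIABLES].copy()
    # Blank or non-numeric entries are converted to NaN so the columns
    # can be treated as numbers in later analysis.
    for column in VARIABLES:
        subset[column] = pd.to_numeric(subset[column], errors="coerce")
    return subset


# Hypothetical usage, assuming the file sits in the working directory:
# gapminder = load_gapminder("gapminder.csv")
# print(gapminder.describe())
```

Converting with errors="coerce" is a deliberate choice in this sketch: it keeps rows with missing values in the table as NaN rather than dropping them, so you can decide later how to handle them.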


Visit for more information

GapMinder Codebook

Founded in Stockholm by Ola Rosling, Anna Rosling Rönnlund and Hans Rosling, Gapminder is a non-profit venture promoting sustainable global development and achievement of the United Nations Millennium Development Goals. It seeks to increase the use and understanding of statistics about social, economic, and environmental development at local, national, and global levels. Since its conception in 2005, Gapminder has grown to include over 200 indicators, including gross domestic product, total employment rate, and estimated HIV prevalence. Gapminder contains data for all 192 UN members, aggregating data for Serbia and Montenegro. Additionally, it includes data for 24 other areas, generating a total of 215 areas. Gapminder collects data from a handful of sources, including the Institute for Health Metrics and Evaluation, the US Census Bureau’s International Database, the United Nations Statistics Division, and the World Bank.