5 Core Activities of Data Analysis | Epicycles of Data Analysis

Posted in Data Science

“If you torture the data long enough, it will confess.”
-Ronald Coase, Economist

Data analysis is an iterative process. The iteration applies to every step of the analysis, and it can be thought of as an epicycle. So what is an epicycle? An epicycle is a small circle whose center moves around the circumference of a larger circle. Some data analyses appear to be fixed and linear, for example the algorithms embedded in various software platforms, including apps. However, these algorithms are final data analysis products that have emerged from the very non-linear work of developing and refining a data analysis so that it can be “algorithmized.”

A study includes:

  • the development of a hypothesis or question
  • the designing of the data collection process (or study protocol)
  • the collection of the data
  • and the analysis and interpretation of the data.

Because a data analysis presumes that the data have already been collected, it includes development and refinement of a question and the process of analyzing and interpreting the data. It is important to note that although a data analysis is often performed without conducting a study, it may also be performed as a component of a study.

There are 5 core activities of data analysis:

1. Stating and refining the question
2. Exploring the data
3. Building formal statistical models
4. Interpreting the results
5. Communicating the results

These 5 activities can occur on different time scales: you may go through all 5 in a single day, or, if you are dealing with a very large project, you may work through them over a couple of months. Either way, it is important to first understand the overall framework used to approach each of these activities.

Although there are many different types of activities that you might engage in while doing data analysis, every aspect of the entire process can be approached through an iterative process that is called the “epicycle of data analysis”. More specifically, for each of the five core activities, it is critical that you engage in the following steps:

Step 1: Set expectations

First and foremost, set an expectation of what you will find. This comes before anything else in your analysis.

Step 2: Test expectations

Next, collect information (data) and compare it with your expectations. If the data match your expectations, you can move on; if they do not match, go on to Step 3.

Step 3: Revise expectations or fix the data

Revise your expectations or fix the data so that your data and your expectations match.

Iterating through this 3-step process is what we call the “epicycle of data analysis.” As you go through every stage of an analysis, you will need to go through the epicycle to continuously refine your question, your exploratory data analysis, your formal models, your interpretation, and your communication.
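
The three steps can be sketched as a small loop. This is only a minimal illustration in Python, not code from the book: the function name `run_epicycle`, its parameters, and the toy numbers in the usage lines are all hypothetical and chosen for this sketch.

```python
from typing import Callable, Tuple, TypeVar

E = TypeVar("E")  # type of an expectation
D = TypeVar("D")  # type of the collected data


def run_epicycle(
    expectation: E,
    collect: Callable[[], D],
    matches: Callable[[E, D], bool],
    revise: Callable[[E, D], E],
    max_rounds: int = 10,
) -> Tuple[E, int]:
    """Repeat test -> revise until the expectation and the data agree.

    The initial `expectation` argument is Step 1. Returns the final
    expectation and the number of rounds it took to reach agreement.
    """
    for round_number in range(1, max_rounds + 1):
        data = collect()                         # Step 2: collect information
        if matches(expectation, data):           # Step 2: compare with expectation
            return expectation, round_number
        expectation = revise(expectation, data)  # Step 3: revise the expectation
    raise RuntimeError("expectation and data still disagree")


# Toy usage (made-up numbers): we expect an average near 10,
# the data say 12, so the expectation is revised once.
readings = [11, 12, 13]
final, rounds = run_epicycle(
    expectation=10.0,
    collect=lambda: sum(readings) / len(readings),
    matches=lambda e, d: abs(e - d) < 0.5,
    revise=lambda e, d: d,
)
print(final, rounds)  # prints: 12.0 2
```

Fixing the data, the other branch of Step 3, would correspond to changing what `collect` returns rather than calling `revise`; the sketch leaves that branch out for brevity.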

Example

For example, imagine you are treating friends to a birthday dinner at a cash-only establishment and need to stop by the ATM to withdraw money before meeting up. To decide how much money to withdraw, you must have developed some expectation of the cost of dinner. This may be an automatic expectation because you dine at this establishment regularly, so you know what a typical meal costs there; that would be an example of a priori knowledge. Another example of a priori knowledge would be knowing what a typical meal costs at a restaurant in your city, or knowing what a meal costs at the most expensive restaurants in your city. Using that information, you could place a lower and an upper bound on how much the meal will cost.

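What collecting information means depends on the activity. For your question, for instance, you collect information by performing a literature search or asking experts, to make sure that your question is a good one. In the dinner example, the information is the check at the restaurant.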

Now that you have data in hand (the check at the restaurant), the next step is to compare your expectations to the data. There are two possible outcomes: either your expectation of the cost matches the amount on the check, or it does not. If your expectations and the data match, terrific, you can move on to the next activity. Suppose, on the other hand, that you expected a cost of 2000 bucks but the check came to 3000 bucks. Here your expectations and the data do not match. There are two possible explanations for the discordance: first, your expectations were wrong and need to be revised, or second, the check was wrong and contains an error. You review the check, find that you were charged for two desserts instead of the one that you had, and conclude that there is an error in the data, so you ask for the check to be corrected. One key indicator of how well your data analysis is going is how easy or difficult it is to match the data you collected to your original expectations.
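
The comparison in the dinner example can also be written out in a few lines of Python. The totals of 2000 and 3000 come from the example above; the individual line items and their prices are invented purely for illustration, as are the names `check_items`, `ordered_items`, `find_overcharges`, and `corrected_total`.

```python
from collections import Counter

expected_cost = 2000  # expectation set before dinner

# Hypothetical check: item names and prices are made up so that the
# total is 3000, with the dessert charged twice.
check_items = [("main course", 1000), ("dessert", 1000), ("dessert", 1000)]
ordered_items = ["main course", "dessert"]  # what was actually ordered


def find_overcharges(check, ordered):
    """Return items that appear on the check more often than they were ordered."""
    charged = Counter(name for name, _ in check)
    # Counter subtraction keeps only items with a positive difference.
    return dict(charged - Counter(ordered))


def corrected_total(check, ordered):
    """Sum the check, counting each item only as often as it was ordered."""
    remaining = Counter(ordered)
    total = 0
    for name, price in check:
        if remaining[name] > 0:
            remaining[name] -= 1
            total += price
    return total


check_total = sum(price for _, price in check_items)

if check_total == expected_cost:
    print("Expectation and data match; move on.")
else:
    # Mismatch: decide whether the expectation or the data is wrong.
    extra = find_overcharges(check_items, ordered_items)
    if extra:
        print("Error in the data, overcharged items:", extra)
        print("Corrected total:", corrected_total(check_items, ordered_items))
    else:
        print("The check looks right; revise the expectation to", check_total)
```

With these made-up items the mismatch is traced to the data (one extra dessert), and the corrected total agrees with the original expectation of 2000.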

“Data are just summaries of thousands of stories – tell a few of those stories to help make the data meaningful.”

Source: The ATM example and the epicycle of analysis concept were originally discussed by Roger D. Peng and Elizabeth Matsui in their book “The Art of Data Science”.