If you are a data analyst, you know that data analysis requires quite a bit of thinking. And if you’ve completed a good data analysis project, you’ve almost certainly spent more time thinking and designing than doing.
The thinking begins before you even look at a dataset, and it’s well worth devoting careful thought to your question.
Before we delve into stating the question, it’s helpful to consider what the different types of questions are. There are six basic types, and much of the discussion that follows comes from a paper published in Science by Prof. Roger Peng and Jeff Leek.
Understanding the type of question you are asking may be the most fundamental step you can take to ensure that, in the end, your interpretation of the results is correct.
The six types of questions are:
A descriptive question is basically about what is happening. It aims at describing something, mainly its functions and characteristics.
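As a minimal sketch of what answering a descriptive question can look like, the snippet below summarizes one characteristic of a small dataset. The numbers and the `describe` helper are made up for illustration and are not taken from the book or the paper.

```python
from statistics import mean, median

# Hypothetical data: viral illnesses per year for ten adults (made-up values).
illnesses_per_year = [0, 1, 1, 2, 2, 2, 3, 3, 4, 6]

def describe(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }

summary = describe(illnesses_per_year)
print(summary)
```

Nothing here is interpreted or generalized; the output only describes the data at hand.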
An exploratory question is one in which you analyze the data to see if there are patterns, trends, or relationships between variables. These types of analyses are also called “hypothesis-generating” analyses because, rather than testing a hypothesis, you are looking for patterns that would support proposing a hypothesis.
If you had a general thought that diet was linked somehow to viral illnesses, you might explore this idea by examining relationships between a range of dietary factors and viral illnesses. Suppose you find in your exploratory analysis that individuals who ate a diet high in certain foods had fewer viral illnesses than those whose diet was not enriched for these foods. You might then propose the hypothesis that, among adults, eating at least 5 servings a day of fresh fruit and vegetables is associated with fewer viral illnesses per year.
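One way to picture this kind of exploration is to scan several dietary factors for their correlation with illness counts and see which one stands out. The sketch below assumes a tiny made-up survey; the factor names, the values, and the `pearson` helper are all hypothetical, and a strong correlation here would only suggest a hypothesis, not confirm it.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical survey of ten adults: daily servings of each dietary factor.
diet = {
    "fruit_veg_servings": [1, 2, 2, 3, 4, 5, 5, 6, 7, 8],
    "sugary_drinks": [3, 1, 4, 2, 0, 3, 1, 2, 4, 0],
    "dairy_servings": [2, 3, 1, 2, 3, 1, 2, 3, 1, 2],
}
# Hypothetical outcome: viral illnesses per year for the same ten adults.
illnesses = [6, 5, 5, 4, 4, 3, 2, 2, 1, 1]

# Rank factors by the strength (absolute value) of their correlation.
ranked = sorted(
    ((name, pearson(values, illnesses)) for name, values in diet.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: r = {r:+.2f}")
```

In this made-up data, fruit and vegetable servings show by far the strongest (negative) relationship, which is the kind of pattern that would motivate the hypothesis above.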
An inferential question would be a restatement of this proposed hypothesis as a question and would be answered by analyzing a different set of data, which, in this example, is a representative sample of adults in the US. By analyzing this different set of data you are determining both whether the association you observed in your exploratory analysis holds in a different sample and whether it holds in a sample that is representative of the adult US population, which would suggest that the association is applicable to all adults in the US. In other words, you will be able to infer what is true, on average, for the adult population in the US from the analysis you perform on the representative sample.
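To make the idea of inferring a population average from a sample concrete, here is a rough sketch that estimates the difference in mean illnesses between two diet groups and attaches an approximate 95% confidence interval. The data are made up, the group names are hypothetical, and the normal approximation (the 1.96 multiplier) is an assumption that is crude for samples this small; a t-based interval would be somewhat wider.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical representative sample (made-up values): illnesses per year.
high_fruit_veg = [1, 2, 1, 3, 2, 1, 2, 2, 1, 3]  # at least 5 servings a day
low_fruit_veg = [3, 4, 2, 5, 3, 4, 3, 2, 4, 4]   # fewer than 5 servings a day

def diff_in_means_ci(a, b, z=1.96):
    """Difference in means (a - b) with an approximate 95% interval.

    Uses the unpooled standard error and a normal approximation.
    """
    diff = mean(a) - mean(b)
    std_err = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return diff, diff - z * std_err, diff + z * std_err

diff, lower, upper = diff_in_means_ci(high_fruit_veg, low_fruit_veg)
print(f"difference = {diff:.2f}, approx. 95% CI = ({lower:.2f}, {upper:.2f})")
```

An interval that excludes zero would be consistent with the association holding in the population the sample represents, though it says nothing about cause.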
A predictive question would be one where you ask what types of people will eat a diet high in fresh fruits and vegetables during the next year. In this type of question you are less interested in what causes someone to eat a certain diet and more interested in what predicts whether someone will eat this diet. For example, higher income may be one of the final set of predictors, and you may not know (or even care) why people with higher incomes are more likely to eat a diet high in fresh fruits and vegetables; what matters most is that income is a factor that predicts this behavior.
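The income example can be sketched as a very simple prediction rule: pick an income cutoff on one set of people, then check how well it predicts diet for people the rule has not seen. Everything below is hypothetical (the incomes, the labels, and the single-cutoff rule itself); a real analysis would likely use more predictors and a proper model.

```python
def accuracy(cutoff, incomes, eats_high_fruit_veg):
    """Share of people for whom 'income >= cutoff' matches their actual diet."""
    correct = sum(
        (income >= cutoff) == label
        for income, label in zip(incomes, eats_high_fruit_veg)
    )
    return correct / len(incomes)

def best_cutoff(incomes, eats_high_fruit_veg):
    """Return the income cutoff with the highest training accuracy.

    Ties are resolved in favour of the lowest cutoff.
    """
    best, best_acc = None, -1.0
    for cutoff in sorted(set(incomes)):
        acc = accuracy(cutoff, incomes, eats_high_fruit_veg)
        if acc > best_acc:
            best, best_acc = cutoff, acc
    return best

# Hypothetical training data: income in thousands, and diet in the following year.
train_income = [20, 25, 30, 35, 40, 50, 60, 70, 80, 90]
train_diet = [False, False, False, True, False, True, True, True, True, True]

# Hypothetical held-out data, not used to choose the cutoff.
test_income = [22, 33, 45, 55, 75, 28]
test_diet = [False, True, True, True, True, False]

cutoff = best_cutoff(train_income, train_diet)
test_accuracy = accuracy(cutoff, test_income, test_diet)
print(f"cutoff = {cutoff}, held-out accuracy = {test_accuracy:.2f}")
```

Note that the rule is judged only on how well it predicts; it makes no claim about why income and diet go together.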
Although an inferential question might tell us that people who eat certain types of foods tend to have fewer viral illnesses, the answer to this question does not tell us if eating these foods causes a reduction in the number of viral illnesses, which would be the case for a causal question. A causal question asks whether changing one factor will change another factor, on average, in a population. Sometimes the underlying design of the data collection, by default, allows for the question that you ask to be causal. An example of this would be data collected in the context of a randomized trial, in which people were randomly assigned to eat a diet high in fresh fruits and vegetables or one that was low in fresh fruits and vegetables. In other instances, even if your data are not from a randomized trial, you can take an analytic approach designed to answer a causal question.
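For the randomized-trial case, a simple sketch is to compare average outcomes between the two assigned groups and ask how often a difference that large would arise if the diet labels were shuffled at random. The permutation test used here is one common choice, not something prescribed by the source, and the trial outcomes, group names, and function name are all made up for illustration.

```python
import random

# Hypothetical trial outcomes (made-up values): viral illnesses per year.
high_diet = [1, 2, 1, 0, 2, 1, 1, 2]  # randomly assigned to the high fruit/veg diet
low_diet = [3, 4, 2, 3, 5, 3, 4, 3]   # randomly assigned to the low fruit/veg diet

def average(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Observed difference in means (a - b) and a two-sided permutation p-value."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = average(group_a) - average(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # pretend the diet labels were assigned differently
        shuffled_diff = average(pooled[:n_a]) - average(pooled[n_a:])
        if abs(shuffled_diff) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value

effect, p_value = permutation_test(high_diet, low_diet)
print(f"estimated effect = {effect:.3f} illnesses per year, p = {p_value:.4f}")
```

Because assignment to diet was random, the difference in means can be read as an estimate of the average causal effect; the same arithmetic on observational data would not support that reading.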
Finally, none of the questions described so far will lead to an answer that will tell us, if the diet does, indeed, cause a reduction in the number of viral illnesses, how the diet leads to a reduction in the number of viral illnesses. A question that asks how a diet high in fresh fruits and vegetables leads to a reduction in the number of viral illnesses would be a mechanistic question.
“The Art of Data Science” by Roger Peng, Elizabeth Matsui