Why Do We Need Software Engineering for Data Science Projects?


Software is the generalization of a specific aspect of a data analysis.

If specific parts of your analysis project require implementing or applying a number of procedures or tools together, then software is helpful for encompassing all of those tools in a single module or procedure. That module or procedure can then be applied repeatedly in a variety of scenarios. Software allows us to do something systematically, and it helps to standardize a procedure so that different people can use it and understand what it is going to do at any given time. Software is always useful because it formalizes things. It also abstracts the functionality of a set of procedures or tools by developing a well-defined interface to the analysis.

Software provides an easy interface. It makes things simple: you give it a set of inputs and get back a set of outputs that are well understood. People can use the software just by providing the inputs and getting the outputs, without knowing the details underneath. Some users might be interested in those underlying details, but applying the software in a given setting will not necessarily depend on knowing them. Rather, knowledge of the interface to the software is what matters for using it in any given situation.

For example, most statistical packages will have a linear regression function. This linear regression function has a very well-defined interface. Typically, you'll have to input things like the outcome and the set of predictors, and maybe there will be some other inputs like the data set or weights. Most linear regression functions work in roughly that way. Importantly, the user does not have to know exactly how the linear regression calculation is done under the hood. Rather, they only need to know that they must specify the outcome, the predictors, and a couple of other things. The linear regression function abstracts away all the details required to implement linear regression, so that the user can apply the tool in a variety of settings.
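To make that interface idea concrete, here is a minimal sketch in Python. The `lin_reg` function and its example data are hypothetical, invented for illustration rather than taken from any real package, but the shape of the interface (outcome, predictors, optional weights) mirrors the description above:

```python
import numpy as np

def lin_reg(outcome, predictors, weights=None):
    """Fit a linear model by least squares.

    The caller supplies the outcome, the predictors, and optional
    weights; how the coefficients are computed stays hidden.
    """
    y = np.asarray(outcome, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(predictors, dtype=float)])
    if weights is not None:
        w = np.sqrt(np.asarray(weights, dtype=float))
        X, y = X * w[:, None], y * w
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef  # [intercept, slope(s)]

# A user only needs the interface, not the calculation underneath:
coef = lin_reg(outcome=[1.0, 2.0, 3.0, 4.0], predictors=[0.0, 1.0, 2.0, 3.0])
```

The calculation inside could later be swapped (for a QR decomposition, say) without any caller noticing, which is exactly the benefit of a well-defined interface.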

There are three levels of software that are important to consider, going from the simplest to the most abstract.

Level 1

At the first level you might just have some code that you wrote, and you might want to encapsulate the automation of a set of procedures using a loop (or something similar) that repeats an operation multiple times.
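A sketch of this first level in Python (the site names and measurements are made up for illustration), where a loop repeats one operation over several inputs instead of the code being copied for each:

```python
# Made-up measurements from three sites; the loop applies
# the same summary operation to each one.
samples = {
    "site_a": [2.1, 2.5, 1.9],
    "site_b": [3.0, 3.2, 2.8],
    "site_c": [1.5, 1.7, 1.6],
}

means = {}
for name, values in samples.items():
    means[name] = sum(values) / len(values)

print(means)
```

This is still just raw code, not an interface: anyone reusing it has to read and edit it directly.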

Level 2

The next step might be some sort of function. Regardless of what language you may be using, there will often be some notion of a function, which is used to encapsulate a set of instructions. The key thing about a function is that you'll have to define some sort of interface: the inputs to the function. The function may also have a set of outputs, or it may have some side effect, for example if it's a plotting function. Now the user only needs to know those inputs and what the outputs will be. This is the first level of abstraction that you might encounter, where you have to actually define an interface to that function.
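Continuing the sketch at this second level, the repeated operation moves behind a function with a declared interface; `summarize` and its inputs are invented for illustration:

```python
def summarize(values, digits=2):
    """The interface: input is a sequence of numbers plus an optional
    rounding precision; output is a (mean, min, max) tuple."""
    mean = round(sum(values) / len(values), digits)
    return mean, min(values), max(values)

# Callers depend only on the inputs and outputs, not the body:
stats = summarize([2.1, 2.5, 1.9])
```

Once the interface is fixed, the body can change freely without breaking any caller.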

Level 3

The highest level is an actual software package, which will often be a collection of functions and other things. This will be a little more formal, because there will be a very specific interface, or API, that a user has to understand. A software package will often come with a number of convenience features for users, like documentation, examples, or tutorials, to help the user apply the software in many different settings. A full-on software package is the most general level, in the sense that it should be applicable to more than one setting.
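As a rough sketch, a small Python package at this level might be laid out as follows (every name here is illustrative, not a prescribed standard):

```
mytools/
    pyproject.toml         # install metadata and dependencies
    README.md              # what the package does, with examples
    mytools/
        __init__.py        # the public API users import
        regression.py      # the functions themselves
    tests/
        test_regression.py # checks that the interface behaves as documented
    docs/
        tutorial.md        # worked examples for new users
```

The layout bundles the functions together with the documentation, examples, and tests that make the package usable beyond the original analysis.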

“Is the data analysis you’re going to do something that you are going to build upon for future work, or is it just going to be a one shot deal?”

Often you will have to do a certain analysis more than once, twice, or even three times, at which point you’ve reached the threshold where you want to write some code, write some software, or at least a function. The point at which you need to systematize a given set of procedures is going to be sooner than you think it is. The initial investment for developing more formal software will be higher, of course, but that will likely pay off in time savings down the road.

A basic rule of thumb

This rule of thumb was discussed by Dr. Roger D. Peng in his post.

Rule 1:

If you’re going to do something once (that does happen on occasion), just write some code and document it very well. The important thing is to make sure that you understand what the code does, and that requires both writing the code well and writing good documentation. You want to be able to reproduce it later on if you ever come back to it, or if someone else does.

Rule 2:

If you’re going to do something twice, write a function. This allows you to abstract a small piece of code,
and it forces you to define an interface, so you have well defined inputs and outputs.

Rule 3:

If you’re going to do something three times or more, you should think about writing a small package. It doesn’t have to be commercial-level software, just a small package that encapsulates the set of operations you’re going to be doing in a given analysis. It’s also important to write some real documentation so that people can understand what’s supposed to be going on, and can apply the software to a different situation if they have to.