The Data Scientist’s Toolbox


The data scientist’s toolbox is the collection of tools used to store, process, analyze, and communicate the results of data science experiments. Data are typically stored in a database. For a smaller organization that might be a single, small MySQL database, and for a large organization it might be a large database distributed across many servers.
Usually, most of the analysis and most of the production work doesn’t actually happen in the database; it happens elsewhere. You usually have to use another programming language to pull the data out of that database and analyze it. There are two common languages for analyzing data.

The first is the R programming language. R is a statistical programming language that allows you to pull data out of a database, analyze it, and produce visualizations quickly. The other major programming language used for this type of analysis is Python, another language that allows you to pull data out of databases, analyze and manipulate it, visualize it, and connect it to production.

To use either of these languages you also need some kind of computing infrastructure to run them on. You have the database, which stores the data, and then you have the servers you will use to analyze the data. One useful example is Amazon Web Services, a set of computing resources that you can rent from Amazon. Many organizations that do data analysis simply rent their computing resources rather than buy and manage their own, and this is particularly true for small organizations that don’t have a large IT budget.

Once you’ve done some low-level analysis, maybe made some discoveries or run some experiments, and decided how you’re going to use data to make decisions for your organization, you might want to scale those solutions up. There are a large number of analysis tools that can provide more scalable analyses of data sets, whether that happens in the database or by pulling the data out of it. Two of the most popular right now are the Hadoop framework and the Spark framework, both of which are ways to analyze data sets at a very large scale. It is possible to do interactive analysis with both, particularly with Spark, but it is a little more complicated and a little more expensive, especially when applied to large sets of data. It is therefore very typical in the data science process to take the database, pull out a small sample of the data, process and analyze it in R or Python, and then go back to the engineering team and scale it back up with Hadoop, Spark, or other tools like that.
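For the first half of that workflow, pulling a sample out of a database and exploring it in R, a minimal sketch might look like the following. It assumes the DBI, RMySQL, and ggplot2 packages; the host, database name, table, and column names are placeholders.

```r
# Minimal sketch: connect to a (hypothetical) MySQL database, pull a small
# sample, and take a quick look at it in R. Host, database, table, and column
# names are placeholders; credentials come from the environment, not the script.
library(DBI)
library(ggplot2)

con <- dbConnect(
  RMySQL::MySQL(),
  host     = "db.example.com",          # placeholder host
  dbname   = "sales",                   # placeholder database
  user     = "analyst",
  password = Sys.getenv("DB_PASSWORD")
)

# Pull a small sample rather than the full table
orders <- dbGetQuery(con, "SELECT order_date, amount FROM orders LIMIT 10000")
dbDisconnect(con)

# Quick summary and visualization of the sample
summary(orders$amount)
ggplot(orders, aes(x = as.Date(order_date), y = amount)) +
  geom_point(alpha = 0.3) +
  labs(x = "Order date", y = "Order amount")
```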

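For the scaling-up half, one option on the R side is the sparklyr package, which translates dplyr-style code into Spark jobs. The sketch below assumes sparklyr and a local Spark installation (sparklyr::spark_install() can download one), with the built-in mtcars data standing in for a much larger table that would normally already live in the cluster.

```r
# Minimal sketch: run a dplyr-style aggregation on Spark via sparklyr.
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")   # a real deployment would point at a cluster

# Copy a small demo table into Spark; in practice the data would be read
# from HDFS, S3, or a database rather than copied from R.
cars_tbl <- copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

# The verbs below are translated to Spark SQL and executed by Spark;
# collect() brings only the small summarized result back into R.
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg), n = n()) %>%
  collect()

spark_disconnect(sc)
```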
The next tool in the data scientist’s toolbox is communication. A data scientist or a data engineer has a job that changes quite rapidly as new packages and new tools become available, so the quickest way to keep a team up to speed is to have an open channel of quick communication. A lot of data science teams use tools like Slack to communicate with each other, to post new results and new ideas, and to discuss which packages have just become available. There are also a large number of help websites, like Stack Overflow, where people can search for the questions they need answered. Even when those questions are quite technical and quite detailed, it is possible to get answers relatively quickly, which keeps the process moving even though the technology is changing rapidly.

Once the analysis is done and you want to share it with other people in your organization, you need to do that with reproducible or literate documentation. What does that mean? It basically means a way to integrate the analysis code and the figures and plots created by the data scientist with plain text that explains what is going on. One example is the R Markdown framework; another is iPython notebooks. These make your analysis reproducible, so that if one data scientist runs an analysis and wants to hand it off to someone else, they can do so easily.
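As a rough illustration, a minimal R Markdown file might look like the sketch below, with R’s built-in mtcars data used purely as a placeholder. Rendering it with rmarkdown::render() re-runs the code and regenerates the numbers and figure along with the surrounding text, so a colleague can reproduce the whole analysis from the one file.

````markdown
---
title: "A small reproducible analysis"
output: html_document
---

The narrative, the code, and the figures live in the same file,
so re-rendering the document re-runs the whole analysis.

```{r mpg-by-weight}
summary(mtcars$mpg)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```
````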
You also need to be able to visualize the results of the data science experiment. The end product is often some kind of data visualization or interactive data experience, and there are a large number of tools available for building those sorts of interactive experiences and visualizations, because the end user of a data science product is very often not a data scientist. It is often a manager or an executive who needs to work with that data, understand what is happening with it, and make a decision. One such tool is Shiny, which is a way to build data products that you can share with people who don’t necessarily have a lot of data science experience.
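For instance, a minimal Shiny app, here using R’s built-in faithful dataset as a stand-in for real results, gives the end user a slider and a plot rather than code:

```r
# Minimal sketch of a Shiny app: the end user drags a slider, the plot updates.
library(shiny)

ui <- fluidPage(
  titlePanel("Old Faithful eruptions"),
  sliderInput("bins", "Number of bins", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Eruption durations", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)
```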
Finally, most of the time when you do a data science experiment, you don’t do it in isolation; you want to communicate your results to other people. People frequently give data presentations, whether that is the data science manager, the data engineer, or the data scientist themselves, explaining how they actually performed the data science experiment: what techniques they used, what the caveats are, and how their analysis can be applied to the data to make a decision.


Also read: Why you should learn R for data science?