The Data Scientist’s Toolbox | Data Science Toolkit

Posted in Data Science

Why do data science?

“It is not the critic who counts; not the man who points out how the strong man stumbles, or where the doer of deeds could have done them better. The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood; who strives valiantly; who errs, who comes short again and again, because there is no effort without error and shortcoming; but who does actually strive to do the deeds; who knows great enthusiasms, the great devotions; who spends himself in a worthy cause; who at the best knows in the end the triumph of high achievement, and who at the worst, if he fails, at least fails while daring greatly, so that his place shall never be with those cold and timid souls who neither know victory nor defeat.” ― Theodore Roosevelt

What do data scientists do?

  • Define the question
  • Define the ideal data sets
  • Determine what data you can access
  • Obtain the data
  • Clean the data
  • Exploratory data analysis
  • Statistical prediction / modeling
  • Interpret the results
  • Challenge the results
  • Write up the results
  • Create reproducible code
  • Distribute the results to other people
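
The middle steps of this list (obtain, clean, explore, model) can be sketched in a few lines of code. The example below is only a minimal illustration using the Python standard library. The data set (hours studied versus exam score) and the function names `obtain`, `clean`, `explore`, and `fit_line` are made up for this sketch; in practice you would more likely reach for libraries such as pandas and scikit-learn.

```python
import csv
import io
import statistics

# Hypothetical raw data for illustration: hours studied vs. exam score.
# One row has a missing score, so there is something to clean.
RAW_CSV = """hours,score
1,52
2,55
3,
4,61
5,64
"""


def obtain(text):
    """Obtain the data: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean the data: drop rows with missing fields, convert the rest to floats."""
    cleaned = []
    for row in rows:
        if all(value and value.strip() for value in row.values()):
            cleaned.append({key: float(value) for key, value in row.items()})
    return cleaned


def explore(rows):
    """Exploratory analysis: mean, minimum, and maximum of every column."""
    summary = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        summary[column] = {
            "mean": statistics.mean(values),
            "min": min(values),
            "max": max(values),
        }
    return summary


def fit_line(rows, x, y):
    """Modeling: least-squares fit of y = slope * x + intercept."""
    xs = [row[x] for row in rows]
    ys = [row[y] for row in rows]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    rows = clean(obtain(RAW_CSV))
    print("Summary:", explore(rows))
    slope, intercept = fit_line(rows, "hours", "score")
    print("Model: score = %.2f * hours + %.2f" % (slope, intercept))
```

Keeping each step in its own function is one simple way to make the analysis reproducible: anyone with the same input can rerun the script and challenge the result.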

These are the critical activities a data scientist does every single day. Now the questions are: what is the main workhorse of data science? Where will we write code? How do we share the results? To answer these questions, getting to know the data scientist’s toolkit is often your first step towards becoming a data scientist.

Tools are a really important element of the data science and analytics field. The open source community has been developing and contributing to the data science toolkit for years. The enthusiasm and hard work of this vibrant community have led to major advancements in the field. There has been debate in the data science community about whether the use of open source technology is surpassing proprietary software offered by players such as IBM and Microsoft. In fact, many big enterprises are leading contributors to open source solutions so that they can stay top of mind for users, and the data science toolkit has increasingly become dominated by open source tools.

Since a wide variety of open source tools is available, from data-mining platforms to programming languages, I have put together a mix of technologies that data scientists could add to their toolkit. Here, I have listed a few tools that data analysts and data scientists work with. Though the list is extensive, it does not mean that you have to know all of them.

These are some of the tools that data scientists use for data analysis:
  • Java, R, Python, Clojure, Haskell, Scala…
  • Hadoop, HDFS & MapReduce, Spark, Storm…
  • HBase, Pig, Hive, Shark, Impala, Cascalog…
  • ETL, Webscrapers, Flume, Sqoop, Hume…
  • Knime, Weka, RapidMiner, Scipy, NumPy, scikit-learn, pandas…
  • D3.js, Gephi, ggplot2, Tableau, Flare, Shiny…
  • SPSS, Matlab, SAS…
  • NoSQL, MongoDB, Couchbase, Cassandra…
  • And yes! … MS Excel: the most used, most underrated DS tool.

Besides these tools, data analysts and data scientists often work with version control and authoring tools such as Git, GitHub, Markdown, R, and RStudio. These are also considered part of the data scientist’s toolbox.