Data Mining Toolbox

I've been meaning to put together a list of all the good tools you need to torture data. This is the first version. Below you'll find good tools and resources for data mining and its lower-level scientific brother, "machine learning". I've organized them into groups by application area. Within each group, the free tools come first. If I haven't used a tool, I haven't written about it, so one caveat: I may be missing some really good ones.


Machine Learning / Predictive Analytics

R: I'm not the greatest R user, but I know enough to hand the top position on this list to R, the open source programming language / statistics suite. It's the "go-to" language for statistical computing in scientific circles as well as at a few top Silicon Valley companies.

RStudio: You will find R incomplete without this great IDE.

The R language can be rather cryptic at times, so if a GUI is your thing, go for Rattle.

PyStats is a generic name for the statistics, scientific computing and machine learning libraries developed and available in Python. I'm a great fan of Python and a great fan of doing stats in Python, mainly through the SciPy Stack.

Anaconda, offered by Continuum Analytics, is the greatest collection of scientific Python tools tweaked to play well together.

IPython Notebook is the in-browser IDE for IPython (interactive Python), and a great way of publishing your analyses and code.

A good IDE for scientific Python work is Spyder. I love it because it provides cell execution like in MATLAB.
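To give a feel for that cell workflow, here is a minimal sketch. Spyder treats a comment line starting with `# %%` as a cell boundary, so each block can be run on its own from the editor. The sample numbers and variable names are made up for illustration.

```python
# %% Cell 1: set up some made-up sample data
import statistics

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# %% Cell 2: summarize the data (this cell can be re-run alone after tweaking it)
mean = statistics.mean(samples)
spread = statistics.pstdev(samples)  # population standard deviation

# %% Cell 3: report the result
summary = "mean={:.2f}, pstdev={:.2f}".format(mean, spread)
print(summary)
```

Outside Spyder the `# %%` lines are ordinary comments, so the same file still runs top to bottom as a plain script.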

I’m in love with this great machine learning library: scikit-learn
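Part of the appeal is how little code a first model takes. The sketch below is only an illustration, not a recommended workflow: it assumes a reasonably recent scikit-learn release (one that ships `sklearn.model_selection`), and the choice of the bundled iris dataset, logistic regression and a 25% hold-out split is mine.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the small iris dataset that ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation (split size is an arbitrary choice)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a simple classifier; max_iter is raised so the solver converges cleanly
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Mean accuracy on the held-out rows
accuracy = model.score(X_test, y_test)
print("held-out accuracy: {:.2f}".format(accuracy))
```

Most scikit-learn estimators follow the same `fit` / `predict` / `score` pattern, so swapping in a different algorithm is usually a one-line change.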

If you're willing to work with data but would rather click around with knobs and switches instead of code, I would recommend KNIME. Its interface is simple and intuitive, and it has a lot of machine learning algorithms built in. It can also play nicely with R.

If you're switching over from MATLAB and you really want to get a little intimate with the inner workings of your algorithms, Octave is the tool you're seeking.

You're uneasy about working with open source and you really want to deploy at enterprise scale. Your company wants to work with large international vendors. Well, of course. My picks for machine learning are IBM's SPSS Modeler (formerly Clementine or PASW Modeler) and SAS Enterprise Miner. Both provide the solid foundation needed for scaling up, and good user-friendly GUIs aimed at statisticians.

 

Data Visualization

Data analytics is never complete without some flashy charts!

If you're on R, you're used to seeing charts in your R world. A library that will produce somewhat more beautiful charts is the famous ggplot2.

If you're on Python and you'd like charts a couple of notches nicer than what matplotlib readily offers: Seaborn was born at Stanford.

You have a web front-end that you need to push visualizations out to? D3.js is probably your best bet. However, D3.js may be a little hard to get started with, so NVD3 has some reusable charts for you.

You want a little more animation or life in your data visualizations? You'd even be OK with learning a new programming language? Look into Paper.js and Processing.

Again, you'd like good visualizations but without the coding part? I find Tableau Public great. It is free to use, though Tableau itself is commercial software.

Would you like to delve deeper? Datavisualization.ch has a great curated list of tools dedicated to data visualization.

 

Database

The database I'm most experienced with, and one that provides maximum flexibility, is MySQL, the relational database management system the world depends on. MySQL Workbench is the GUI tool where, in addition to admin and querying functions, you'll find a great E-R database design feature.

If non-relational is your thing, you have a simple schema with lots of key-value type arrangements, and you're somewhat acquainted with JavaScript, go for MongoDB. It's good for both your CS101 project and your complex system.

 

Miscellaneous

If you've honed your skills as a data scientist and want to compete with others in the craft: Kaggle is the great community you're looking for.

If you'd like to be on the cutting edge and are willing to learn a new syntax, one of the most promising projects out there seems to be the Wolfram Language.


About Caner Turkmen
