One of the common debates for data analysts and data scientists in their early days is which of the two to start with: R or Python. Many other tools are available. Among the paid ones there are SAS, MATLAB, etc., and among the open-source ones R and Python dominate the market, but another language gaining a foothold, mainly because of Apache Spark, is Scala.
While a lot can be written on the different aspects, this article focuses on one specific purpose: the usage of these languages for Data Science.
Data Exploration & Visualization:
In the traditional sense, operations are carried out on structured tables, typically organized in rows and columns. From here on, when I refer to a Dataframe, I simply mean such a table.
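As a rough illustration of a table of rows and columns, here is a minimal sketch in plain Python. The column names, the values, and the helper functions `column` and `filter_rows` are all invented for this example; real Dataframe libraries offer far more than this.

```python
# A tiny table held as a list of rows; each row maps column name -> value.
# The column names and values are invented for illustration.
rows = [
    {"employee": "A", "department": "Sales", "tenure_years": 3},
    {"employee": "B", "department": "HR", "tenure_years": 7},
    {"employee": "C", "department": "Sales", "tenure_years": 1},
]


def column(table, name):
    """Return one column of the table as a list."""
    return [row[name] for row in table]


def filter_rows(table, name, value):
    """Return the rows whose `name` column equals `value`."""
    return [row for row in table if row[name] == value]


tenures = column(rows, "tenure_years")            # one column, all rows
sales = filter_rows(rows, "department", "Sales")  # all columns, some rows
```

Selecting a column and filtering rows are the two basic moves that Dataframe packages in both languages build on.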
Both Python and R provide robust capabilities for handling structured tables through Dataframes. For R, the most famous and comprehensive package for data wrangling, exploration, and visualization of 2D structures is the ‘Tidyverse’, developed by Hadley Wickham. It is more than a single package; it is a set of packages such as:
- dplyr, tibble (a tibble is also a kind of dataframe), tidyr and readr, which cover data wrangling and exploration
- ggplot2: an entire article could be written on this and the concept behind it, called the ‘Grammar of Graphics’. In short, it is one of the gold standards for charting data in both a scientific and an aesthetic manner.
- purrr: an updated take on the ‘apply’ functions. In other words, it provides a way to loop through your tables without writing traditional ‘for’ loops, in the spirit of the technique called ‘Vectorization’ or ‘Broadcasting’.
R, in general, has a lot of other packages which might be faster, and in some cases it will be an advantage to use them. But, as a standard, the Tidyverse is a better option. The problem with using too many packages is that the syntax changes from one to the next, as they are developed by different people. The natural advantage of chaining operations available in the Tidyverse is not available in other packages. While some can be integrated, it is very difficult to bring them together in a seamless manner.
Python has followed in R’s footsteps. It has the ‘pandas’ package, which is the R equivalent for data manipulation and exploration of Dataframes. In fact, the pandas DataFrame was modeled on R’s data.frame. For visualization, Python has matplotlib (modeled on MATLAB’s plotting), Seaborn, bokeh, etc. While they are robust, ggplot2 is often preferred because of the simplicity of its approach. Besides, as far as visualization for general business purposes is concerned, it is better to use a drag-and-drop BI tool. Coding the visualization may give more control, but every visual element in the picture then needs to be defined explicitly, and coding every adjustment can be painful.
Nearly 70% to 75% of the work involved in Data Science is related to cleaning, transforming, normalizing, standardizing, treating missing values and running matrix/array operations. Operations such as cleaning and transforming are more about data preparation, at which both tools are equally adept. The other operations, from normalizing to running calculations, are computational in nature.
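To make a few of these computational steps concrete, here is a minimal sketch using only Python's standard library. It assumes the common convention that "normalizing" means min-max rescaling and "standardizing" means z-scores, and it uses mean imputation as one simple treatment for missing values. The function names and the sample column are made up for illustration.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift and scale to zero mean and unit population standard deviation.

    Assumes the values are not all equal (otherwise sigma would be zero).
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


raw = [10.0, None, 30.0, 20.0]   # made-up column with one missing value
filled = impute_mean(raw)        # None becomes the mean of 10, 30 and 20
normalized = min_max_normalize(filled)
z_scores = standardize(filled)
```

In practice these steps are usually delegated to a preprocessing package rather than written by hand.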
Among the R packages, the ‘Caret’ package by Max Kuhn is a robust tool for handling these computations (also known as preprocessing), except for matrix calculations. Matrix operations, however, involve linear algebra, which is better handled by Python than by R, thanks to the Python packages ‘NumPy’ and ‘SciPy’. They are very extensive in their ability to handle multi-dimensional arrays, scientific/technical computing and handling/converting to different distributions.
The ‘Caret’ package deserves a special mention, for its capabilities go well beyond those mentioned above. It is more like a framework that can handle preprocessing, run various models and check them with a plethora of cross-validation techniques. It can also integrate models from other packages.
Hash Tables (web APIs):
Traditionally, organizations mostly collected, stored and analyzed structured data. By structured data, I mean data which is clearly organized and easily retrievable for various purposes. Everything else is unstructured data, which includes emails, mobile data, social media, business apps, surveys, logs, web APIs, etc.
The current scenario has evolved rapidly, and is still evolving, so much so that nearly 70% to 80% of the data held by companies is unstructured. It’s not that the size of structured data has reduced. It has grown in absolute volume, but its share of the data available has gone down to 20% to 30%. The volumes of unstructured data have been growing exponentially; in some companies, the data generated in a single day runs to trillions of records. The need to store and analyze this massive explosion of data is what gave birth to Big Data and Data Science.
Big Data is a field in itself, which has provided a number of database options such as NoSQL and graph databases. The data structures of these databases use different formats, but mostly they are one or another form of hash table.
If you take a moment to check the data stored by your browser, your mobile or even your logs, you will see that it is some kind of key-value pair. One of the most widely used of such formats is JSON. This hash table is nothing but the ‘dict’ that is available in Python.
Compared to R, Python is far ahead in handling JSON in its own native structure, without having to convert it into a dataframe. This is particularly useful when you are dealing with web data. Much of the work in web scraping and web APIs requires you to be adept in the ‘dict’ format. It is especially useful in a few kinds of analysis, such as network analysis, which require a hierarchical data structure like that of a graph database.
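The point about JSON mapping directly onto Python's dicts can be sketched with the standard `json` module. The payload below is a made-up API response, and its field names (`user`, `follows`, `tags`) are hypothetical.

```python
import json

# A made-up API response; the field names are hypothetical.
payload = """
{
  "user": {"id": 42, "name": "Asha"},
  "follows": [
    {"id": 7, "name": "Ben", "tags": ["python", "r"]},
    {"id": 9, "name": "Chen", "tags": ["scala"]}
  ]
}
"""

# JSON objects become dicts and JSON arrays become lists.
record = json.loads(payload)

# Nested values are reached with ordinary key and index lookups.
user_name = record["user"]["name"]
followed_ids = [person["id"] for person in record["follows"]]

# A simple adjacency mapping (user id -> ids they follow), the kind of
# structure a network analysis might start from.
edges = {record["user"]["id"]: followed_ids}

# .get() avoids a KeyError when an optional key is absent.
email = record["user"].get("email", "unknown")
```

No conversion to a table is needed at any point; the nested shape of the data is preserved as it arrives.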
Production – Efficiency, Deployment & Reproducibility:
Another thing to note is that both languages work in-memory, which can affect memory usage and performance for mid/large-scale datasets. The memory part, however, can be managed by loading the data in chunks, a few hundred or a few thousand records at a time, instead of pushing the entire dataset, which might run to millions of rows, into memory. Traditional ‘for’ loops also come with a performance penalty, so both languages use concepts such as vectorization/broadcasting to handle performance.
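Loading data in chunks can be sketched with the standard library alone. The helper `read_in_chunks` is a hypothetical name, and the in-memory `StringIO` object stands in for a large file on disk; the column names are invented.

```python
import csv
import io
from itertools import islice


def read_in_chunks(file_obj, chunk_size):
    """Yield lists of at most `chunk_size` rows (as dicts) from a CSV file."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# An in-memory stand-in for a large file on disk; the columns are made up.
lines = "\n".join(f"{i},{i * 10}" for i in range(1, 8))
data = io.StringIO("id,amount\n" + lines)

total = 0
chunk_sizes = []
for chunk in read_in_chunks(data, chunk_size=3):
    # Only one chunk is held in memory at a time.
    chunk_sizes.append(len(chunk))
    total += sum(int(row["amount"]) for row in chunk)
```

The running total is updated chunk by chunk, so peak memory depends on the chunk size rather than on the size of the whole file.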
Having said that, this is where I believe Python, being an object-oriented language, provides better options such as list comprehensions and generators (more on this: https://www.oreilly.com/ideas/2-great-benefits-of-python-generators-and-how-they-changed-me-forever). Because it’s an object-oriented language, it has other advantages, such as a standard syntax irrespective of packages, better readability, and techniques like chaining being available across packages.
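A short sketch of the difference between a list comprehension and a generator follows. The size `n` and the function `even_squares` are arbitrary choices for illustration, and the exact byte counts reported by `sys.getsizeof` vary between Python versions.

```python
import sys
from itertools import islice

n = 100_000

# List comprehension: builds every element in memory up front.
squares_list = [i * i for i in range(n)]

# Generator expression: produces one element at a time, on demand.
squares_gen = (i * i for i in range(n))

list_bytes = sys.getsizeof(squares_list)  # grows with n
gen_bytes = sys.getsizeof(squares_gen)    # small, independent of n


def even_squares(limit):
    """Generator function: lazily yield squares of even numbers below limit."""
    for i in range(limit):
        if i % 2 == 0:
            yield i * i


total_from_list = sum(squares_list)
total_from_gen = sum(squares_gen)  # consumes the generator
first_three = list(islice(even_squares(10), 3))
```

Both forms give the same total, but the generator never holds all the values at once, which matters when the data does not fit comfortably in memory.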
Further, it is well integrated, and it is easy to move from web scraping to handling big data to building an ML/AI engine to web development, using interfaces which are proper frameworks in themselves. Examples are Apache Spark, which is a separate tool in itself accessible through Python, deep learning libraries like TensorFlow, and web development frameworks like Django/Flask.
This translates into a capability to deploy code across various disciplines in a standard that is readable and reproducible.
Which one to use?
My professor in analytics used to quip that there is no single solution in analytics; if there were, it would mean that our real-world problems had become mundane. The best answer, in that case, is: it depends.
Yet, if we break it down by roles in a Data Science team, the list below might help as a starting point:
Data Engineer – Structured Data: Hands down, go with R, and especially develop expertise in the packages written by Hadley Wickham.
Data Engineer – Unstructured Data: Given that the data structure will predominantly be key-value pairs, it is better to specialize in Python. One can also assume that this role will involve web scraping; the packages to master are Selenium and BeautifulSoup.
Advanced Reporting/Visualization: Even though there might be other tools, it will still benefit the people in these teams to learn R. The reason is that packages like Tidyverse, R Markdown and Shiny provide robust capabilities to build interactive reports, informative white papers, concept tools and prototypes for dashboards.
Statistician/Statistics Modeler: R is definitely mature in this space, with many packages available for creating white-box/interpretable models; the ‘Caret’ package is especially useful here. R is also very useful if the role includes survey analytics, hypothesis creation, etc.
Machine Learning Engineer: Hands down Python, for the reason that it is an object-oriented language, which can be very useful in developing code for automated parameter tuning, trying out various cross-validation techniques and building engines. The packages best to master are scikit-learn, TensorFlow, Keras and, if need be, Theano; plus, learn to build class objects.
Language, Image, Audio, Video Analytics Specialist: Much of the new development in these fields is led by tech giants, and as programmers they prefer Python. For example, NLP has good options like spaCy (industrial), NLTK (academic) and other packages like Gensim. Further, this role will be an extension of the ML Engineer, who has to work in Deep Learning too. So, building DL architectures in Keras, Caffe, etc. will be necessary.
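To give a flavor of what the Machine Learning Engineer role above means by class objects, parameter tuning and cross-validation, here is a toy sketch in plain Python. Everything in it is hypothetical: `ShrunkenMean` is an invented stand-in for a real model, and `k_fold_indices` and `cross_val_score` are hand-written helpers, not the scikit-learn functions. In practice a package such as scikit-learn would normally do this work.

```python
from statistics import mean


class ShrunkenMean:
    """Toy model: predicts the training mean, shrunk toward zero by `alpha`."""

    def __init__(self, alpha=0.0):
        self.alpha = alpha
        self.value_ = None

    def fit(self, y):
        self.value_ = mean(y) / (1.0 + self.alpha)
        return self

    def mse(self, y):
        """Mean squared error of the constant prediction on `y`."""
        return mean([(v - self.value_) ** 2 for v in y])


def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_val_score(alpha, y, k):
    """Average held-out error of the toy model over k folds."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        model = ShrunkenMean(alpha).fit([y[i] for i in train])
        scores.append(model.mse([y[i] for i in test]))
    return mean(scores)


y = [4.0, 5.0, 6.0, 5.0, 4.0, 6.0]  # made-up target values
grid = [0.0, 0.5, 1.0]              # candidate values for alpha
results = {alpha: cross_val_score(alpha, y, k=3) for alpha in grid}
best_alpha = min(results, key=results.get)
```

The loop over `grid` is the automated parameter tuning, and wrapping the model in a class with `fit` keeps the tuning code independent of the model being tuned.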
About the Author:
Murali is an HR data scientist working with large financial services. His area of interest includes the application of state-of-the-art advanced analytics in solving workforce-related business problems.