R vs. Python: Analysis Based on Data from Stack Overflow

Try this ShinyApp: R vs. Python.

Both R and Python are widely used programming languages for data analysis, ranging from basic statistics and visualization to complex modeling. I was curious which is more popular, R or Python, so I created a Shiny app to visualize the popularity of R versus Python based on data from Stack Overflow over the past six years.

Data Source and Processing

The data came from two Kaggle datasets, R Questions from Stack Overflow and Python Questions from Stack Overflow. Stack Overflow is the largest online community for programmers to learn, share their knowledge, and advance their careers. The table below shows the raw data from Kaggle. The data for each language is organized in three tables: Questions, Answers, and Tags. The datasets contain the full text of the Stack Overflow questions and answers tagged with the corresponding language, which makes them useful for natural language processing and community analysis.

var_raw

Next, since we only cared about the number of questions and answers for each language, we narrowed each table down to the following variables.

var_processing

To separate the “Tags” into “packages” and “topics”, I also scraped a list of R packages from CRAN Packages By Name and a list of Python packages from PyPI Ranking.
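
The tag split described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the app: the function name `split_tags`, the sample tags, and the short package list are all hypothetical, and it assumes a tag counts as a "package" only when it exactly matches a scraped package name (case-insensitive).

```python
from collections import Counter


def split_tags(tags, known_packages):
    """Count tags separately as packages or topics.

    A tag is treated as a package when it matches a name in
    known_packages (case-insensitive); every other tag is a topic.
    """
    known = {name.lower() for name in known_packages}
    packages = Counter()
    topics = Counter()
    for tag in tags:
        key = tag.strip().lower()
        if key in known:
            packages[key] += 1
        else:
            topics[key] += 1
    return packages, topics


# Hypothetical sample data, standing in for the scraped CRAN list
# and the Tags table.
r_packages = ["ggplot2", "shiny", "data.table", "dplyr"]
sample_tags = ["ggplot2", "dataframe", "shiny", "ggplot2", "regex", "plot"]

packages, topics = split_tags(sample_tags, r_packages)
print(packages.most_common(2))
print(topics.most_common(3))
```

Exact matching is the simplest rule; tags that differ from the package name (for example a version suffix) would fall into the topics bucket under this assumption.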

Visualizations

  • Summary of R & Python

    First, let’s look at the aggregated number of users and the number of questions/answers. The bar chart below shows that more users posted questions/answers about Python than about R. Similarly, the total number of Python questions/answers is greater than that of R.

    summary-of-r-python

  • Time Trend: R vs. Python

    The video below shows how to check the time trend of R or Python for different items. For example, we can see how the number of users posting questions and the number of questions about R or Python changed from 1/15/2000 to 9/15/2016. Both R and Python grew over the past six years, and the number of users asking questions about Python (n=9441) is greater than that of R (n=2612).

  • Topics:  R versus Python

    Next, let’s look at the findings of packages and topics. The most popular topics of R are data frame, plot, loops, regex, and function.

    topics_r

    Based on the graph below, we can see the most popular topics of Python are python 2.7, python 3.x, list, dictionary, and tkinter.

    topics_pythons

  • Packages: R versus Python

    The most popular packages of R are ggplot2, shiny, data.table, dplyr, and list.

    packages_r

    The most popular packages of Python are django, pandas, numpy, matplotlib, and regex.

    packages_pythons
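
For readers who want to reproduce something like the time trend view, the monthly aggregation can be sketched as below. This is an illustrative sketch under stated assumptions, not the app's actual code: `monthly_trend` is a hypothetical helper, each record is assumed to be a `(user_id, creation_date)` pair with the date in `YYYY-MM-DD` form, and the sample records are made up.

```python
from collections import defaultdict
from datetime import datetime


def monthly_trend(records):
    """Aggregate questions by calendar month.

    records: iterable of (user_id, creation_date) pairs, where
    creation_date is a "YYYY-MM-DD" string (an assumed format).
    Returns {"YYYY-MM": (n_questions, n_distinct_users)} in month order.
    """
    questions = defaultdict(int)
    users = defaultdict(set)
    for user_id, created in records:
        month = datetime.strptime(created, "%Y-%m-%d").strftime("%Y-%m")
        questions[month] += 1
        users[month].add(user_id)
    return {m: (questions[m], len(users[m])) for m in sorted(questions)}


# Made-up sample rows standing in for the Questions table.
sample = [
    (1, "2016-08-03"),
    (2, "2016-08-20"),
    (1, "2016-08-25"),
    (3, "2016-09-01"),
]
trend = monthly_trend(sample)
for month, (n_questions, n_users) in trend.items():
    print(month, n_questions, n_users)
```

Counting distinct users per month with a set keeps a user who asks several questions in the same month from being counted more than once.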
