Rails Pull Request Process Model

I finally completed the restoration of the GHTorrent database.  It took 2 days.  Here is the process model for the Rails project, using combined data from the pull_request_history and pull_request_comments tables.

Case Summary

The model is generated with the following parameters:

  • 100% Activities
  • 80% Path
  • Primary value is Absolute Frequency
  • Secondary: Median duration (the mean duration would be much higher, since some cases took years to complete).
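
To illustrate why the median is the safer secondary metric here, the following minimal sketch compares the two statistics on a made-up list of case durations. The numbers are hypothetical and are not taken from the Rails data; they only mimic a distribution where a few cases take years.

```python
from statistics import mean, median

# Hypothetical case durations in days (not real Rails data). The last two
# values stand in for pull requests that stayed open for years.
durations_days = [0.5, 1, 1, 2, 3, 4, 7, 900, 1200]


def summarize(durations):
    """Return (mean, median) for a list of case durations."""
    return mean(durations), median(durations)


avg, mid = summarize(durations_days)
print(f"mean = {avg:.1f} days, median = {mid} days")
```

With data shaped like this, two long-running cases are enough to pull the mean far away from what a typical pull request experiences, while the median stays close to the bulk of the cases.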


Data issue with MSR14

In my study, I need the complete history of pull request comments. The MSR14 database has many missing pieces.  For example, there are no comments for this pull request: https://github.com/rails/rails/pull/10673.  The MSR14 database should have this information, since the pull request was made during the data collection period.

After looking at the full database at http://ghtorrent.org/dblite/, I learned that the id columns are generated separately for each database, so the ids in MSR14 and in the full database for the same project, pull request, or commit are not the same.

It is unfortunate that I can’t use the small data set.  I took the latest dump of the GHTorrent MySQL database (2014-08-18). The MySQL dump is 11GB compressed and 43GB uncompressed.  Restoration is still going on.  I was told that it would take a couple of days.  I started on the afternoon of Aug 30.
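
Because the generated ids differ, rows from the two databases can only be matched through values that come from GitHub itself. The sketch below is one possible way to do that, assuming that a repository name plus the pullreq_id column (taken here to be the pull request number shown on GitHub) identifies a pull request in both databases. The row layout, the helper names, and all local ids are hypothetical.

```python
def index_by_natural_key(rows):
    """Map (repository name, pull request number) -> database-local id.

    Each row is (local_id, repo_name, pullreq_id); this layout is an
    assumption made for the example, not the actual table layout.
    """
    return {(repo, number): local_id for local_id, repo, number in rows}


def translate_ids(small_rows, full_rows):
    """Map ids from the small database to ids in the full database."""
    small = index_by_natural_key(small_rows)
    full = index_by_natural_key(full_rows)
    return {small_id: full[key] for key, small_id in small.items() if key in full}


# Hypothetical rows; the local ids are invented for illustration.
msr14_rows = [(10, "rails/rails", 10673), (11, "rails/rails", 10674)]
full_rows = [
    (500010, "rails/rails", 10673),
    (500999, "rails/rails", 10674),
    (7, "other/repo", 1),
]

id_map = translate_ids(msr14_rows, full_rows)
print(id_map)
```

Pull requests that exist in only one of the two databases are simply left out of the mapping.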



Pull Request Process Model

Here is a sample of a basic process model for pull requests.

Data Source: MSR14 Mining Challenge

Project: Rails

Query String: SELECT pull_request_history.created_at, pull_request_history.pull_request_id, pull_request_history.action, pull_request_history.actor_id, pull_requests.pullreq_id FROM pull_request_history INNER JOIN pull_requests ON pull_request_history.pull_request_id = pull_requests.id INNER JOIN projects ON pull_requests.base_repo_id = projects.id WHERE projects.id = 78852
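
The query can be tried out without a MySQL server. The sketch below runs the same joins against an in-memory SQLite database that contains only the columns the query touches. The table contents are hypothetical, the ORDER BY clause is an addition that makes the rows easier to read as an event log, and the project id is passed as a parameter instead of being hard-coded.

```python
import sqlite3

# Same joins as the query above, with table aliases, a parameterised
# project id, and an added ORDER BY (not part of the original query).
QUERY = """
SELECT h.created_at, h.pull_request_id, h.action, h.actor_id, pr.pullreq_id
FROM pull_request_history AS h
INNER JOIN pull_requests AS pr ON h.pull_request_id = pr.id
INNER JOIN projects AS p ON pr.base_repo_id = p.id
WHERE p.id = ?
ORDER BY h.pull_request_id, h.created_at
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE pull_requests (
            id INTEGER PRIMARY KEY, base_repo_id INTEGER, pullreq_id INTEGER);
        CREATE TABLE pull_request_history (
            id INTEGER PRIMARY KEY, pull_request_id INTEGER,
            created_at TEXT, action TEXT, actor_id INTEGER);

        INSERT INTO projects VALUES (78852, 'rails'), (1, 'other');
        INSERT INTO pull_requests VALUES
            (10, 78852, 101), (11, 78852, 102), (20, 1, 5);
        INSERT INTO pull_request_history VALUES
            (1, 10, '2013-05-01 10:00:00', 'opened', 7),
            (2, 10, '2013-05-01 12:00:00', 'merged', 8),
            (3, 11, '2013-05-02 09:00:00', 'opened', 9),
            (4, 11, '2013-05-09 09:00:00', 'closed', 8),
            (5, 20, '2013-05-03 08:00:00', 'opened', 3);
        """
    )
    return conn


def event_log(conn, project_id):
    """Return the pull request events of one project as a list of tuples."""
    return conn.execute(QUERY, (project_id,)).fetchall()


conn = build_demo_db()
rows = event_log(conn, 78852)
for row in rows:
    print(row)
```

Each returned row is one event: pull_request_id acts as the case id, action as the activity, and created_at as the timestamp, which is the shape a process mining tool expects.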

Process Model

This process model doesn’t contain loops because the pull_request_history table doesn’t capture review comments.  However, the model shows the lead time of pull requests. In this model the light grey text represents the median duration.  The median duration for accepted pull requests is much shorter than for those that are not accepted.

Between Open -> Closed or Open -> Merged there are multiple comments and commits.  To capture them, we need data from the pull_request_comments table.
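
The lead time comparison can also be computed directly from the event rows. The following sketch is only an illustration on made-up events: it assumes the action values 'opened', 'merged', and 'closed', treats a pull request as accepted when it has a 'merged' event, and measures the time from the first 'opened' event to the first 'merged' (or otherwise the first 'closed') event.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median

FMT = "%Y-%m-%d %H:%M:%S"

# Hypothetical events: (pull_request_id, action, created_at).
events = [
    (1, "opened", "2013-05-01 10:00:00"),
    (1, "merged", "2013-05-01 12:00:00"),
    (1, "closed", "2013-05-01 12:00:00"),
    (2, "opened", "2013-05-02 09:00:00"),
    (2, "merged", "2013-05-02 13:00:00"),
    (3, "opened", "2013-05-01 00:00:00"),
    (3, "closed", "2013-05-11 00:00:00"),
    (4, "opened", "2013-05-01 00:00:00"),
    (4, "closed", "2013-05-21 00:00:00"),
    (5, "opened", "2013-05-05 00:00:00"),  # still open, ignored below
]


def lead_times_by_outcome(events):
    """Return the median lead time in hours for each outcome."""
    opened, merged, closed = {}, {}, {}
    buckets = {"opened": opened, "merged": merged, "closed": closed}
    for pr_id, action, ts in events:
        if action in buckets:
            when = datetime.strptime(ts, FMT)
            seen = buckets[action]
            # Keep the earliest timestamp per pull request and action.
            if pr_id not in seen or when < seen[pr_id]:
                seen[pr_id] = when

    hours = defaultdict(list)
    for pr_id, start in opened.items():
        if pr_id in merged:
            outcome, end = "merged", merged[pr_id]
        elif pr_id in closed:
            outcome, end = "closed", closed[pr_id]
        else:
            continue  # no final event yet
        hours[outcome].append((end - start).total_seconds() / 3600)
    return {outcome: median(values) for outcome, values in hours.items()}


medians = lead_times_by_outcome(events)
print(medians)
```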
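
One way to bring the comments into the model is to turn every comment into an extra event and interleave it with the history events by timestamp. The sketch below shows the idea on made-up rows; the 'commented' activity name and the reduced row layouts are assumptions for the example, and the real pull_request_comments table has more columns.

```python
from collections import defaultdict

# Hypothetical rows. History: (pull_request_id, action, created_at).
history = [
    (10, "opened", "2013-05-01 10:00:00"),
    (10, "merged", "2013-05-01 15:00:00"),
    (11, "opened", "2013-05-02 09:00:00"),
    (11, "closed", "2013-05-09 09:00:00"),
]
# Comments: (pull_request_id, created_at).
comments = [
    (10, "2013-05-01 11:00:00"),
    (10, "2013-05-01 12:30:00"),
    (11, "2013-05-03 14:00:00"),
]


def merge_events(history, comments):
    """Combine both sources into (pull_request_id, created_at, activity)."""
    merged = [(pr_id, ts, action) for pr_id, action, ts in history]
    merged += [(pr_id, ts, "commented") for pr_id, ts in comments]
    # Timestamps share one fixed format, so string order equals time order.
    merged.sort()
    return merged


def traces(merged):
    """Group the ordered activities by pull request."""
    result = defaultdict(list)
    for pr_id, _ts, activity in merged:
        result[pr_id].append(activity)
    return dict(result)


pr_traces = traces(merge_events(history, comments))
print(pr_traces)
```

Repeated 'commented' activities inside a trace are what would show up as loops in the process model.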

MSR14 GitHub Projects

Projects in MSR14 based on language and number of forks


Create EER Schema for MSR14 Dataset

Use MySQL Workbench

  1. From the MySQL Workbench home screen, go to the Model section
    • Click Create EER Model from Database
  2. Follow the Reverse Engineer Database flow
    • Select Stored Connection
    • Hostname: localhost
    • Username: msr14, Password: msr14
    • Continue
    • Select msr14 schema
    • Continue
    • Continue
    • Execute
    • Continue
    • Close

Rahman2014-MSR: An Insight into the Pull Requests of GitHub

Authors: Mohammad Masudur Rahman and Chanchal K. Roy, University of Saskatchewan, Canada


Given the increasing number of unsuccessful pull requests in GitHub projects, insights into the success and failure of these requests are essential for the developers. In this paper, we provide a comparative study between successful and unsuccessful pull requests made to 78 GitHub base projects by 20,142 developers from 103,192 forked projects. In the study, we analyze pull request discussion texts, project specific information (e.g., domain, maturity), and developer specific information (e.g., experience) in order to report useful insights, and use them to contrast between successful and unsuccessful pull requests. We believe our study will help developers overcome the issues with pull requests in GitHub, and project administrators with informed decision making.


Data Set: MSR 2014 Challenge – GitHub

Techniques: Latent Dirichlet Allocation (LDA) for Topic Modeling
Tools: JGibbLDA, an LDA implementation that uses Gibbs sampling, http://jgibblda.sourceforge.net/

Methodology: Extract 100 topics and select the top 5.


  • Label
  • Programming Languages
  • Project Age & Maturity
  • Project Developers & Experience


Each topic is more prevalent in the discussion of the unsuccessful pull requests than that of the successful pull requests, except a dominant topic: Actor Model.

In case of 24 GitHub base projects using three programming languages (Ruby, Java and JavaScript), the average number of unsuccessful pull requests per month is exceptionally higher than that of successful pull requests.

Exploring MSR 2014 Challenge data with Tableau

Set up Tableau on Mac

  1. Get Tableau Desktop for Students here: http://www.tableausoftware.com/academic/students.  Tableau provides a free license for full-time students.  It can connect to various databases, including the MySQL database that stores the GitHub data.  A list of all available database connections is here: http://www.tableausoftware.com/support/drivers.
  2. Download and install the MySQL ODBC driver.
  3. Open Tableau Desktop and connect to MySQL.
  4. Use the following input:
    • Server: localhost, Port: 3306
    • Username: msr14
    • Password: msr14


The next step is to create a worksheet from an existing table or by joining several tables.