Impact of bots on github communities

I’ve been digging into contributor statistics for various communities on github as part of my work on FOSS Heartbeat, a project to measure the health of open source communities.

It’s fascinating to see bots show up in the contributor statistics. For example, if you look at github users who comment on issues in the Rust community, you’ll quickly notice two contributors who interact a lot:

rust-bots

bors is a bot that runs pull requests through the Rust continuous integration test suite, and automatically merges the code into the master branch if it passes. bors responds to commands issued in pull request comments (of the form ‘@bors r+ [commit ID]’) by community members with permission to merge code into rust-lang/rust.

rust-highfive is a bot that recommends a reviewer based on the contents of the pull request. It then adds a comment that tags the reviewer, who will get a github notification (and possibly an email, if they have that set up).

Both bots have been set up by the Rust community in order to make pull request review smoother. bors is designed to cut down the amount of time developers need to spend running the test suite on code that’s ready to be merged. rust-highfive is designed to make sure the right person is aware of pull requests that may need their experienced eye.

But just how effective are these github bots? Are they really helping the Rust community or are they just causing more noise?

Chances of a successful pull request

bors merged its first pull request on 2013-02-02. The year before bors was introduced, only 330 out of 503 pull requests were merged. The year after, 1574 out of 2311 pull requests were merged. So the Rust community had more than four times as many pull requests to review.

Assuming that the tests bors used were some of the same tests Rust developers were running manually, we would expect pull requests to be rejected at about the same rate (or maybe more often, since the automatic CI system would catch more bugs).

To test that assumption, we turn to a statistical method called the chi-squared test. It helps answer the question, “Is there a difference in the success rates of two samples?” In our case, it helps us answer the question, “After bors was used, did the percentage of accepted pull requests change?”

rust-bors-merged

It looks like there’s no statistical difference in the chances of getting a random pull request merged before or after bors started participating. That’s pretty good, considering the number of pull requests submitted quadrupled.
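The comparison can be roughly reproduced from the counts quoted above. This is only a sketch: it assumes a plain Pearson chi-squared test on a 2x2 table without a continuity correction, which may not be exactly what FOSS Heartbeat does, and the function name `chi_squared_2x2` is made up for illustration.

```python
import math

def chi_squared_2x2(merged_a, total_a, merged_b, total_b):
    """Pearson chi-squared test (no continuity correction) on a 2x2 table."""
    observed = [
        [merged_a, total_a - merged_a],
        [merged_b, total_b - merged_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count if the merge rate were the same in both years.
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts quoted in the post: the year before and the year after bors.
chi2, p_value = chi_squared_2x2(330, 503, 1574, 2311)
print(f"merge rate before: {330 / 503:.1%}, after: {1574 / 2311:.1%}")
print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
print("significant at 99%" if p_value < 0.01 else "not significant at 99%")
```

With these counts the p-value comes out well above 0.01, which matches the conclusion that the merge rate did not change in a statistically detectable way.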

Now, what about rust-highfive? Since the bot is supposed to recommend pull request reviewers, we would hope that pull requests would have a higher chance of getting accepted. Let’s look at the chances of getting a pull request merged for the year before and the year after rust-highfive was introduced (2014-09-18).

rust-highfive-merged

So yes, it does seem like rust-highfive is effective at getting the right developer to notice a pull request they need to review and merge.

Impact on time a pull request is open

One of the hopes of a programmer who designs a bot is that it will cut down on the amount of time that the developer has to spend on simple repetitive tasks. A bot like bors is designed to run the CI suite automatically, leaving the developer more time to do other things, like review other pull requests. Maybe that means pull requests get merged faster?

To test the impact of bors on the amount of time a pull request is open, we turn to the Two-means hypothesis test. It tells you whether there’s a statistical difference between the means of two different data sets. In our case, we compare the length of time a pull request is open. The two populations are the pull requests a year before and a year after bors was introduced.

rust-bors-pr-open

We would hope to see the average open time of a pull request go down after bors was introduced, but that’s not what the data shows. The graph shows the length of time actually increased, with an increase of 1.1 days.
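For readers who want to try this kind of comparison themselves, here is a minimal sketch of a two-means test. It assumes unequal variances and uses a normal approximation, which is reasonable for samples in the hundreds or thousands like the ones here. The open times below are invented for illustration and are not the real Rust data, and `two_means_z_test` is a hypothetical helper name.

```python
import math
import statistics

def two_means_z_test(sample_a, sample_b):
    """Two-sided test for a difference in means (normal approximation)."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    # Standard error of the difference, allowing unequal variances.
    std_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    z = (mean_b - mean_a) / std_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return mean_b - mean_a, z, p_value

# Hypothetical open times in days; NOT the real Rust data.
days_open_before = [0.5, 1.0, 2.0, 0.2, 3.5, 1.2, 0.8, 4.0]
days_open_after = [1.5, 2.5, 0.9, 5.0, 3.2, 2.1, 6.0, 1.8]

diff, z, p_value = two_means_z_test(days_open_before, days_open_after)
print(f"difference in means: {diff:+.2f} days, z = {z:.2f}, p = {p_value:.3f}")
```

A positive difference means pull requests stayed open longer in the second sample. With samples as small as this toy one, a t-distribution would be the safer choice than the normal approximation.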

What about rust-highfive? We would hope that a bot that recommends a reviewer would cause pull requests to get closed sooner.

rust-bors-pr-open

The graph shows there’s no statistical evidence that rust-highfive made a difference in the length of time pull requests were open.

These results seemed odd to me, so I did a little bit of digging to generate a graph of the average time a pull request is open for each month:

rust-pr-open-trend

The length of time pull requests are open has been increasing for most of the Rust project history. That explains why comparing pull request age before and after bors showed an increase in the wait time to get a pull request merged. The second line shows the point that rust-highfive was introduced, and we do see a decline in the wait time. Since the decrease is almost symmetrical with the increase the year before, the average was the same for the two years.
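A monthly trend like this can be computed by bucketing pull requests by month and averaging the time each one was open. The sketch below assumes pull requests are grouped by the month they were opened (grouping by the month they were closed is an equally plausible choice) and uses invented timestamps rather than the real Rust data. The helper `monthly_mean_days_open` is hypothetical.

```python
from collections import defaultdict
from datetime import datetime

def monthly_mean_days_open(pull_requests):
    """Return {'YYYY-MM': mean days open}, keyed by the month each PR was opened."""
    days_by_month = defaultdict(list)
    for opened, closed in pull_requests:
        days_open = (closed - opened).total_seconds() / 86400
        days_by_month[opened.strftime("%Y-%m")].append(days_open)
    return {
        month: sum(days) / len(days)
        for month, days in sorted(days_by_month.items())
    }

# Hypothetical (opened, closed) timestamps; NOT the real Rust data.
sample_prs = [
    (datetime(2014, 8, 1), datetime(2014, 8, 3)),
    (datetime(2014, 8, 10), datetime(2014, 8, 14)),
    (datetime(2014, 9, 5), datetime(2014, 9, 6)),
]
trend = monthly_mean_days_open(sample_prs)
for month, mean_days in trend.items():
    print(month, round(mean_days, 2))
```

Plotting the resulting values month by month shows whether a before-and-after average is hiding a longer-running trend, as it did here.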

Summary

What can we conclude about github bots from all these statistics?

At the 99% confidence level, we found no statistically significant evidence that adding the bors bot, which automatically merges changes after they pass the CI tests, changed the chances of a random pull request getting merged.

At the 99% confidence level, a Rust developer’s chances of getting code merged went up after rust-highfive was introduced, by as much as 11.7%. The bot initially helped lower the amount of time developers had to wait for their pull requests to be merged, but something else changed in May 2015 that caused the wait time to increase again. I’ll note that Rust version 1.0 came out in May 2015. Rust developers may have been more cautious about accepting pull requests after the API was frozen, or the volume of pull requests may have increased. It’s unclear without further study.

This is awesome, can I help?

If you’re interested in metrics analysis for your community, please leave a note in the comments or drop an email to my consulting business, Otter Tech. I could use some help identifying the github usernames for bots in other communities I’m studying:

This blog post is part of a series on open source community metrics analysis:

Part 1: Measuring the Impact of Negative Language on FOSS Participation

You can find the open source FOSS Heartbeat code and FOSS community metrics on github. Thank you to Mozilla, who is sponsoring this research!

4 thoughts on “Impact of bots on github communities”

  1. If bors had permission to automatically merge pull requests that pass tests without human permission, why would the wait time increase over time? Because there are more tests to run or more pull requests to test?

    1. There may have just been more pull requests to review as Rust got more popular. It’s possible that the small number of people who can merge code in was the bottleneck, but I can’t tell without further analysis.

  2. Excellent analysis — it’s great to see real math being applied to some effects that I’d observed anecdotally!

    However, I disagree with your implication that Bors’s primary purpose is to speed up the landing of PRs, which I inferred from your statement “We would hope to see the average open time of a pull request go down after bors was introduced”. Based on Graydon’s writings about Bors’s original design [1], I believe that Bors’s primary purpose is to automatically maintain a repository of code that always passes all the tests.

    It might even follow that an increase in time-to-landing after the introduction of Bors is a side effect of Bors succeeding at his task: If we assume that some percentage of PRs contain errors which result in test failure, and any PR that would have landed before Bors will also eventually land afterward, then every PR with an error must (after Bors) remain open for at least the time it takes to run the test suite twice (once to catch the bug and again to verify that it was fixed). To put it another way, Bors cannot do his job without incidentally adding a lower bound to PR cycle time, which is the time it takes to run the test suite.

    Sadly, I’m not aware of any data on the amount of time for which the master branch had test failures before and after Bors. I am also not aware of any consistent “this commit fixes a test failure introduced by other code that landed” tagging scheme from before Bors was introduced. Without those, I can’t currently think of a way to mathematically calculate whether Bors is succeeding at the task of automatically maintaining a repository of code that always passes all the tests.

    Thank you again for doing science on these robots!

    [1] http://graydon.livejournal.com/186550.html
