Hacker Newsnew | past | comments | ask | show | jobs | submitlogin
Probability of acceptance of pull requests (paulmillr.com)
127 points by paulmillr on May 28, 2013 | hide | past | favorite | 56 comments


The title is a much more cynical take on the data than I have looking at it. I have some familiarity and experience with Django, one of the projects that has a pretty low pull request acceptance rate from his sample. At 29% I was impressed that many pull requests were accepted, frankly! I think the signal to noise is pretty high on pulls to high profile projects.

It's easy, and common I think, to throw a small change into a pull request and fire it off without confirming it's desired or within the objectives of the maintainers. I've done this at least once w/ Django personally. My thought process was "This may be undesired, but if it is desirable the chances of it making it in without bikeshedding in a mailing list is way higher if I just lead with a pull request."

In my case it wasn't desired (I think it was a settings.py var that allowed users to configure auto-import of models in the shell, dropped cause it's implicit over explicit and adds to settings.py).

Add to that all of the pull requests that are effectively proof of concepts because the implementation fails to take some major stuff into account, and an acceptance rate of almost a third starts to sounds pretty good to me.


It's so much better to send these sorts of proposal via e-mail. It's much easier to review the patches via e-mail, and everyone on the development community can make suggestions. Even so, it's not uncommon for patches to require three or four versions, and not just by first-time contributors. There are patches written by folks who have been working on ext4 for many years which still require three or four or more revisions after being subjected to multiple rounds of peer review.

If you show me a git project where most of the pull requests are accepted, I'll show you a git project whose quality assurance is probably completely out of control.


I think that depends a lot on the project. Projects that heavily use github tend to prefer a github pull request.

It makes it easy to comment on the diff and integrates with Travis CI (integration tool) and other tools.

It is easy to make multiple changes to the pull request before it is accepted so nothing about this system implies that pull request are always accepted without changes.


It's easy for the original author to make changes to the PR, but not easy for the maintainer or a third party to make changes. The micro-pedantry associated with spelling errors, tweaking commit messages, and trivial spacing/formatting issues can be more work for everyone when communicated through comments on PRs with the original author expected to apply and re-roll.


That's only partially true.

Pull requests on GitHub have to have an associated branch or fork in another repo. Even if you're not the original author, if you have access to it, you can do "git checkout" or "git clone", then you make the required changes and then on push the commits are automatically included in the pull request.

The only problem is that the maintainers do not have write access to the forked repositories of new or casual contributors. So if you're in a rush to change something, you have to fork that branch, make changes and then merge it, while ignoring the original pull request (although if you place the issue number in the commit messages you push, they'll get referenced on the associated page).

However, the commits logs made by the wannabe contributor are still in the history, so proper credit/blame is given where it is due, unless you rebase of course. And this isn't something you can say about diffs sent by email.

Also, GitHub sends you emails on pull requests. And you can reply by email too.

What I like about GitHub Pull requests is that you have a link you can give to people for review or that you can post wherever, a link that gives you a descriptive diff of what happened. Plus, the discussions around that pull request are left there, included in that link, left for everybody to see, instead of turning into a private conversation that nobody else will know about (incidentally, it's also the reason why I think native apps will never beat web apps, as long as native apps don't have URLs for referring to content or state within the app). This to me is useful enough to prefer it over diffs by email.

EDIT: there is one thing about GitHub I don't like - you cannot give write-access to people, while preventing them from pushing to master. If this where possible, you could give everybody access to push, except for master, in which case the above mentioned issue with pull requests would be a non-issue!


The problems with keeping the original commit and then putting the fixes on top are, (a) it makes the history harder to follow, (b) it makes it easy for someone to screw up a cherry-pick of a bug fix to a maintenance brance, since now you have to cherry-pick the bug fix plus the fix(es) to the bug fix, and (c) if the original patch had a potential bug that might cause the system to crash or otherwise malfunction, even if you fix it in the subsequent commit, it makes "git bisects" more likely to yield false positives, or at least make things more confusing and more difficult than it has to be.


Even when sending patches via email, I've always been asked by maintainers to do the trivialities myself and re-submit. I suspect at least part of the reason was education.


Does this hold true for you on smaller projects too? Ruby gems, for instance...?

I personally strongly prefer to see a diff via Github, especially for smaller features on a small library / gem -- although on larger & more popular projects (lots of pull requests) it definitely makes some sense that you'd prefer to have a 'heads up' of sorts via email first to confirm that it's even a desirable feature.

Really just wondering here, I recently started semi-blindly submitting pull requests with feature additions or small bug fixes (with very verbose pull request messages + solid but concise commit messages) to various library-type projects so I'm legitimately curious as far as how far to take this logic...


I think email is a better forum if everyone involved is disciplined about factoring commits well, writing good messages, using an email client that formats properly, and is fluent with the tools. PRs are more intuitive, more tolerant of workflows, and more accessible to newcomers. At this point, nothing matches the archival quality of mailing lists. When PRs are rebased/refactored as a result of reviews, the original versions generally become orphaned, so the discussion vanishes.

When there are multiple independent discussions spawned from different parts of single patch, as is common in less disciplined communities, the granularity of email is too coarse, and threaded line-comments are much better. On the other hand, patches on mailing lists frequently spin off into higher level design/philosophy discussions, which isn't appropriate in line comments.


well, I guess adding anything to settings.py will get red flags instantly because everyone is sick of settings, that is common knowledge when you follow django-dev and con vids.

my almost first contribution to django was recently hijacked by a better one, bummer :)


Yeah, it was a long shot. I thought maybe the fact that it was entirely optional and not on by default might help my case, but the reasons for it being turned down were totally reasonable and consistent with the project & python idioms.

Still, I ended up building our own shell command that included the change because fuck typing that much.


I can't speak for the other projects, but the data is completely wrong and misleading for jQuery. It looks like this code only counts a pull request as merged if it's landed with a merge commit. It's much more common for us to touch up the pull request (for example, fixing small style or commit message issues), then rebase, possibly squash some commits, and land via a fast-forward commit.

So this is "accepted": https://github.com/scalatra/scalatra/pull/267

But this is not: https://github.com/jquery/jquery/pull/1255


The hosting sites need to start providing a way for a maintainer to update or supersede a PR, as well as a way to track prior versions of commits in a PR after the PR has been updated post-review.


Vaguely related anecdote:

I had a really bad experience trying to use node.js for a project. Eventually I figured out that the module I needed was just broken, so I decided to at least update the documentation. As it turned out, not only was it not documented how to update the documentation (...) but I had to go back and forth on undocumented style guide rules (for documentation that gets displayed in the browser anyway!) and then sign the CLA, all to update a single line of text in the docs to say 'don't use this module, it's broken'. It literally would have taken less time for them to make the change themselves.

This soured me even more on node.js than actually using it had. It's bizarre that project maintainers would actively put so many walls between themselves and valuable (or not-valuable, for that matter) code contributions.

I can only assume that the legal downside to not having a CLA signed for every single-line commit is enormous. Are there previous court cases in the US where not having a CLA got someone destroyed?


> I can only assume that the legal downside to not having a CLA signed for every single-line commit is enormous. Are there previous court cases in the US where not having a CLA got someone destroyed?

I don't know of any court cases, but I do know of several projects that were not able to change their license because they didn't have a CLA and couldn't contact all previous contributors. Thats the right that section 1 of the nodejs CLA gives them.

Fun fact: nodejs reserves the right to use your contribution under "(b) binary, proprietary, or commercial licenses".


The vertical labels on the charts reduce readability greatly! Perhaps angle them a bit and sort from highest to lowest so you can just scan down the labels and know where they stand without having to look back up at the bar height. Or simply flip the axis.


I'm surprised the acceptance rate is so high.


thats what I was thinking! I suspected many more of them to be junk


Dear Mods: This time you've edited the title to something other than the post title. Can you clarify what the rule is? Should the title be informative, or the title of the post?


What is a CLA and why would a project want one / not want one? They seem like another hurdle for contributors?


It's a contributor license agreement. "Contributor Licence Agreements (CLAs) can be used to enable vendors to easily pursue legal resolution in the case of copyright disputes"[1].

Licensing/copyright can be a messy thing - especially with multiple authors involved. I assume the legal clarity it provides is seen as worthy of the extra hurdle to some of the more risky and/or cautious projects. Plus, depending on the agreement, it allows more freedom for change of license terms without getting all existing contributors to agree - the agreements I've signed have often contained a clause declaring implicit permission (or waiving rights completely).

[1]http://en.wikipedia.org/wiki/Contributor_License_Agreement


We require contributors to sign a CLA before accepting pull requests to RethinkDB (http://github.com/rethinkdb/rethinkdb) for several (excellent) reasons, explained here: http://www.rethinkdb.com/community/cla/

We've found that it's a low hurdle for contributors-- it's a common attribute of most open source projects and helps avoid headaches of ownership later on down the road.

Essentially, no project wants the question of who owns contributed code to come up. The SCO-Linux controversy would have been avoided if a CLA had been in place, and since then they've become standard practice.


Searching for "sign cla", it appears to be a Contributor's License Agreement.



That is such a lazy and non-helpful answer, especially considering that the poster didn't just ask what it was.


The curious question is why certain projects have significantly lower pull-rates; I doubt the quality of the pull requests vary that widely across projects.

Is it that project leadership has a clear vision for a product and feel pull requests are a distraction? If so, share the vision and enlist willing developers to help achieve that vision.

Is it that project leadership has a high-standard for the codebase? If so, explicitly state the standard and provide constructive feedback to pulls that don't quite measure up.

I guess my point is: project leadership has the ability to leverage a tremendous pool of talent that is interested in their project using the pull request mechanism. It does take effort and planning but the net result is so worth it to all parties involved.


This is a complicated problem, and one that I'm excited to think could be improved in the coming years as Github et al have the time and resources to invest in it.

I have a few thoughts on the topic and on your comments:

1. I actually DO think that the quality of pull requests varies widely from project to project. You dismiss this in your first sentence but I don't think you should. I think there are a lot of reasons for this, but many can be reduced to: Some projects attract more inexperienced developers than others.

2. I don't think you can simply indict the project maintainers for not optimally leveraging the talent pool of volunteer contributors. While Open Source has been validated a million times over as a truly successful movement/concept/etc, nobody has ever accused it of being efficient. Specifically, most pull request submitters act on their own, without coordination between submitters or with project maintainers. It's not just an issue of coding standards, it's also things like the maintainers having a roadmap that includes bug fixes, refactoring, new features, etc, and that may conflict with Joe in Idaho and his pull request that he did to scratch his own specific itch.

In many ways community submitted pull requests are a "million monkeys at a million typewriters."

3. The most successful way of getting a pull request accepted, I think, is to first raise a discussion on the issue on the projects mailing list or issues forum. Link to your in-progress patch, say "I have this fix here for problem X, can anybody take a look and tell me if I'm on the wrong track with this?" So many of us are code-first, conversation-later types and that constantly conflicts with the social nature of our job.

I guess my takeaway here is that there is a lot of room for interesting technology solutions to this problem, in the areas of:

1. Tools that help maintainers cultivate the submitters talent pool on their project.

2. Tools and workflows that help submitters through coordination. Totally a half baked idea but suppose if I'm fixing a bug in a function somewhere, Github can help me discover what is happening to that function elsewhere. Maybe it's been radically changed in a feature branch owned by the project maintainers, maybe it's already been fixed nearly the same way by another submitter who has an open pull request, maybe it was already submitted by somebody and rejected.

3. Static analysis that can power those tools and more. Would love a meaningful quantative quality score on the pull request submission page. Like an additional tab beside the "Commits" and "Diff" tabs. Maybe it could run various analysis on the code and show formatting issues, known vulnerabilities, known bugs, etc. Obviously the abilities here vary from language to language.


> Some projects attract more inexperienced developers than others.

Fair.

> In many ways community submitted pull requests are a "million monkeys at a million typewriters."

I strongly disagree with this assertion. To argue that community cannot be lead or directed toward a common goal is very pessimistic. Sure, Joe in Idaho has a specific itch, but there's a good chance there is an opportunity to capture Joe's enthusiasm and willingness to do work on the project by working with him instead of treating him like a code-monkey at a keyboard. Community relations is hard because dealing with people is hard.

> 1. Tools that help maintainers cultivate the submitters talent pool on their project.

Yes, I agree that this is someplace where some tools would help tremendously.

> Github can help me discover what is happening to that function elsewhere.

Great idea. I'd love to see a "related diffs" tool for a given piece of code.

> So many of us are code-first, conversation-later types and that constantly conflicts with the social nature of our job.

I totally agree. Ultimately, tools will only get us so far; people skills must fill in the rest. Making "Joe" feel welcome and a part of the creative process requires measures of patience, kindness, gentleness and mentorship.


> The curious question is why certain projects have significantly lower pull-rates; I doubt the quality of the pull requests vary that widely across projects.

The reverse is more prevalent. The quality of projects varies wildly. The higher quality the project, the less likely that you can accept a pull request as-is.

> Is it that project leadership has a clear vision for a product and feel pull requests are a distraction? If so, share the vision and enlist willing developers to help achieve that vision.

Most developers seem to expect to throw code over the wall in a pull request, and do not follow up to requests for improvement to meet the project's requirements, eg:

- Testing requirements

- Documentation requirements

- Code style

- Ensuring that the implementation fits into the broader project road map

- Code quality

- Avoiding code duplication

- Avoiding code-to-the-goal solutions (see also: project road map).

In my experience, most pull requests take more time to review and correct than it would take to write the code in question from scratch.

> It does take effort and planning but the net result is so worth it to all parties involved.

I don't think the economy is there, actually. Either people are going to read your contribution guidelines and provide quality material, or they're not. Most of the time they're not.

Given my (15 years of) experience in open-source, I'm highly suspect of code quality those projects that have acceptance rates above 50% from people other than core developers.


Is it possible to switch the axis and set the maxima at 100%? I have to re-calibrate for each graph, or 30% almost looks like 50%. And it's hard to read the rotated labels, but the percent on the horizontal axis would not need to be rotated.


As for maxima — yeah, definitely. Done.


Great.

Just to expand on the switching of axis. It's harder to compare data between graphs, but upon second inspection, it appears the colors from the first graph are not used in the secondary graphs.

Second, trying to get to the projects in the first graph, it's easier to exit the graph from the bottom. Otherwise, short bars (eg. node) are hard to "select" for the "Current" line, and in that case, I have to make a longer trip around the graph to get to the link. Maybe link the labels? Move the "Current" line underneath?

I enjoyed it nonetheless. If you're considering expanding or having a follow-up piece, I'd enjoy combining the three graphs with the format / filtering at http://www.techempower.com/benchmarks/.


The conclusions don't seem reasonable to me. Higher acceptance rate is not necessarily better; it can also be that the quality of pull requests is too low: that the developers are instead discriminating in what they let into their project, and that the project benefits as a result of this.

Using it as a project quality metric is in this way similar to lines of code: higher doesn't necessarily mean better.


What would be interesting to see would be merged vs. open vs. closed (as in rejected), and also the typical delay between the opening of the PR and the merging/closing.

Some projects are fast to close PRs if they don't fit their goals, some don't attend to the PRs at all and you have PRs lingering in the queue forever.

Two projects could have a similar acceptance rate with very different attitude: one could be on top of things but picky about what they merge in, while another is just not attending to the list except for the odd PR now and then. It seems that a project like the former would still be more interesting to participate in just because you could expect more feedback on what is wrong.

Regarding the presented data, I'm also surprised that it's that high for some project. With the way GitHub works, it's easy to create a pull-request without consulting anybody in the original project, so it's surprising that so many would match the projects' intentions.


What would increase the accept rate greatly would be if there was a function to accept-pull-into-new-branch, which creates a new branch and applies. The maintainer can then apply the necessary changes/corrections/documentation updates, get the branch tested, then merge it back if all goes OK.


    $ git fetch url-or-remote their-branch-name:my/better-branch-name
    $ git checkout my/better-branch-name
    $ hack, commit, etc


I'm aware you can do that with native git, of course - but that leaves the original pull request on github appearing "unapplied", doesn't it?

I meant that it would be useful to add a function for that workflow in the github UI.


If you edit commits in the branch, then yes, it appears as declined in the UI, which looks hostile when they are notified. I write a friendly message every time I do this, but it sucks to have to give that explanation and the UI metadata is still wrong. I asked for this on bitbucket, though it's currently on hold [1].

If you don't edit the existing commits, then it will show up in the github/bitbucket UI as merged when you eventually merge, regardless of what games you play with branches.

[1] https://bitbucket.org/site/master/issue/6704/supersede-pull-...


One way I've worked around this is to merge the original pull request, then immediately force-push the fixed version. Still doesn't fix all of the UI metadata, but at least the pull request gets marked as merged.


Only works when you have push access to the repository that is the source of the PR. Normally maintainers don't have access to the repositories for all contributors.


Er, no? You merge the unfixed version via the web UI (to get the PR marked as merged), then immediately force-push the fixed version to your own repo, not their's.


Oh, I thought you meant updating the pull request by force-pushing on their branch. Your solution is racy, thus only acceptable for projects small enough that the likelihood of someone pulling during that window is sufficiently low.


It's definitely not an option for very large projects, but projects small enough to have a very low chance of someone pulling in a several second window describes the vast majority of projects (and in general by the time it gets unsafe I'd expect fixing up incoming pull requests to be impractical time-wise anyway).


You can do that. Just see what branch the pull request is on, then git pull it, followed by a merge of FETCH_HEAD.


Very great analysis! I remember reading an article about an experiment someone took in bringing up the contribution in his project - And that was to add users as collaborators any time someone made a successful contribution.

I've been doing this with all of my projects so far (Granted, I do not lead any interesting projects) but a small one I re-did recently has already gotten additional activity from one contributor since I added him on.


The question this raises for me is what the acceptance rate is as a function of the number of pull requests, or of the number of stars. A scatter chart would be cool here. I strongly suspect the acceptance rate goes down as the project gets bigger. That's some combination of the developers having less time to field all the requests and the more popular repos getting a lot of junk of PRs (anecdotal observation).


There's nothing bad about about a high rejection rate. Most likely there is a strong correlation between the quality of code and a high rejection rate.


Great analysis. Just a couple of data visualisation niggles:

1. Please sort the bars in your barplots by descending y-value. This makes it much easier to see ranking and to compare any two bars. In plots where bars are coloured by group, it also gives an intuitive impression of group performance.

2. Please include error bars (sem) on the plots where bars represent the mean, so we can understand how representative the mean is of the group.


It would also be interesting to track the distribution of those pull-requests, with respect to user who submitted it. With the current setup, projects that are only willing to accept pull-requests from known committers, and those who are more open to new comers look identical.


great analysis. I thought big projects have a lower acceptance rate than smaller ones in general, but when you take a look at django vs flask to rails vs sinatra, it doesn't seem to be the case.

one nitpick: next time rotate the diagrams +90' please :)


> one nitpick: next time rotate the diagrams +90' please :)

What? No. The x-axis is the independent variable and the y-axis is the dependent variable.


The x-axis data is not numerically ordered, so there's no good reason to keep it oriented that way.


tell that to my neck.


Now you can tilt the other way --> :)


switch X and Y? I can't read this without turning my head 90 degrees


that's a fantastic write up, thanks!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: