Thought Experiment: GitHub Community View

June 10, 2015, by Giles Bowkett

There’s a major error in GitHub’s conceptual model for repos: a presumption that the initial authorship of a project is inherently superior.

Say I’ve got a project and I want to see which fork is most active. I can’t just click and get a graph of which fork has the fastest time between receiving a PR and closing or merging it.
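To make that missing graph concrete, here's a rough Python sketch of the number behind it: the median time between a pull request being opened and being closed or merged, computed per fork. The input shape (fork name mapped to a list of PR records with ISO 8601 `created_at` and `closed_at` strings), the function names, and the fork names are all assumptions for illustration. The data is made up, and nothing here talks to GitHub.

```python
from datetime import datetime
from statistics import median


def median_hours_to_close(pull_requests):
    """Median hours from a PR being opened to being closed or merged.

    Assumes each PR is a dict with ISO 8601 'created_at' and 'closed_at'
    strings. PRs that are still open (closed_at is None) are skipped.
    """
    durations = []
    for pr in pull_requests:
        if pr.get("closed_at") is None:
            continue
        opened = datetime.fromisoformat(pr["created_at"])
        closed = datetime.fromisoformat(pr["closed_at"])
        durations.append((closed - opened).total_seconds() / 3600)
    return median(durations) if durations else None


def rank_forks_by_responsiveness(prs_by_fork):
    """Return (fork, median_hours) pairs, fastest fork first.

    Forks with no closed PRs are left out rather than guessed at.
    """
    scored = {fork: median_hours_to_close(prs) for fork, prs in prs_by_fork.items()}
    responsive = [(fork, hours) for fork, hours in scored.items() if hours is not None]
    return sorted(responsive, key=lambda pair: pair[1])


# Hypothetical data: two forks of the same project.
prs_by_fork = {
    "original/project": [
        {"created_at": "2015-06-01T09:00:00", "closed_at": "2015-06-11T09:00:00"},
        {"created_at": "2015-06-02T09:00:00", "closed_at": None},
    ],
    "active/project": [
        {"created_at": "2015-06-01T09:00:00", "closed_at": "2015-06-01T15:00:00"},
        {"created_at": "2015-06-03T09:00:00", "closed_at": "2015-06-03T19:00:00"},
    ],
}

ranking = rank_forks_by_responsiveness(prs_by_fork)
```

With this made-up data, the hypothetical `active/project` fork comes out ahead of the original. A median is used rather than a mean so that one ancient PR doesn't swamp the result, but that's a design choice, not the only reasonable one.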

Say that I have time to manage a fork, and more time for it than the project’s creator or core team has. I could do all that work, yet the project’s progress overall would still be a matter of when the creator and/or core team had time for it.

For my fork to really matter, the canonical repo would either have to take a bunch of PRs from it, or manually redirect people to it with a line of text in the README. Otherwise the world of the project is splintered, and the only way around its fragmentation is word of mouth.

Why This Matters

Justin Searls did a good presentation about how open source projects decay over time. A community view like the one I’m describing would, I think, mitigate that effect.

GitHub’s understanding of the canonical repo is never a problem at the start of a project, but it can be a problem later on. You don’t want to make your engineering decisions around rumors and word of mouth, but when a project’s community is fragmented across a few different forks, you kind of have to.

The Fundamental Error

GitHub is centralized even though git is distributed. A completely flat, sprawling web of commits back and forth between forks is entirely possible within git. It would need no discernible center. master is just a name.

Panda Strike’s CEO, Dan Yoder, told me a story about a startup where he served as CTO. GitHub went down, and the CEO came into his office, saying, “We have to move everything off GitHub!” He was upset because no programmer could do any work that day. Except, of course, they could. You choose one person’s repo as the new, temporary master version of master, fire up sshd, and replace GitHub with your laptop until their servers come back online.

GitHub’s illusory centrality is a semantic mismatch with git’s decentralized nature, but it’s usually not a problem. Because GitHub’s a great site. It adds a lot of value to git. You get a lot of benefits by adding a center back to the model.

But GitHub is also hierarchical even though git is flat. If GitHub only added a center to git, it would have commits flow to the center of a web of repos. But that’s not how it works. On GitHub, pull requests have to flow upwards to attain lasting impact.

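It’s this: [image: repos arranged in a hierarchy, with pull requests flowing up to the canonical repo at the top]

Not this: [image: a web of repos, with commits flowing to a canonical repo at the center]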

A canonical repo is not intrinsically necessary to the git model. But if you want a canonical repo, and you’re adding a center back to a decentralized technology, then it would make sense to put a canonical repo at a center. GitHub puts it at a top.

The Center vs. The Top

This blog post started as a rant on a ticket for the hubot project. At the time, GitHubber Brandon Keepers said:

How would you handle distribution of releases in this world where there isn’t a canonical repository? NPM and almost every other package manager also has the concept of a canonical source. There is value in distribution of stable releases. It would be very difficult for something like hubot to exist in a world where package authors couldn’t rely on a canonical version of the core.

I understand your point and agree that a canonical repo is not necessary in git’s model. However, source code is only one part of running a successful project and I think those other aspects would get very complicated without a canonical source.

I agree with every word of that, except this sentence:

It would be very difficult for something like hubot to exist in a world where package authors couldn’t rely on a canonical version of the core.

I think we’re going to see that happen one day, and it’ll be very interesting. I don’t think it’s actually going to be as problematic as it might sound. But I could be wrong, and it’s really a completely different (and much more speculative) discussion. TLDR: Yes, adding the center back to a decentralized model can definitely have very useful benefits. Making dependencies easier to deal with is always a good thing.

But let me illustrate why I think establishing the first repo as the canonical repo — irreversibly, permanently, and structurally — is a huge mistake.

To see the other forks of hubot, you click its fork count:

This is what you get:

But here’s how it looks for a project with a more reasonable number of forks:

This graph assumes that the most interesting thing about a fork of the project is how many commits it’s contributed to the project’s original version, and when. But that criterion only makes sense when a project hasn’t been abandoned.

In real life, projects get abandoned all the time. And often, when a useful project gets abandoned, its downstream forks turn out to be better at merging pull requests than its original version.

In this graph, the canonical repo is at the top. And the canonical repo can only ever be that original repo, even though original and canonical are not intrinsically synonymous in real life.

But here’s a simple sketch of a forks explorer which puts different repos at the center, depending on different criteria. The data’s made up, and obviously the design’s flawed. It’s just to illustrate the idea.

In this screen, the user’s seeing the default: a chronological ordering. This allows you to prioritize the originator, the same way GitHub does now. The original repo’s at the center, with new repos added over time. The size of the circles could represent commits merged upstream, so you preserve everything the existing graph does.

But in the next screen, the user’s selected “most active.” This allows you to prioritize active development.

Now the user can see that the samaaron fork is more active than the overtone fork. (In real life, Sam Aaron wrote Overtone, but again, this is just an example.) Obviously, in this one, the center belongs to the single most active repo, and if there are other forks seeing a lot of activity, those go near the center too.

A “most current” or “most recent” ordering would probably be quite similar, but not identical.
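The orderings described above boil down to sorting the same set of forks by different keys and putting the first result at the center. Here's a minimal Python sketch of that, assuming a hypothetical `Fork` record with a creation date, a count of recent commits, and a last-commit date. The field names, the third fork, and all the numbers are invented for illustration, just like the data in the mockups.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Fork:
    name: str
    created: date        # when the repo or fork was created
    recent_commits: int  # commits in some recent window (window is an assumption)
    last_commit: date    # date of the newest commit


# Each ordering is just a sort key; smaller keys sit closer to the center.
ORDERINGS = {
    "chronological": lambda f: f.created.toordinal(),
    "most active": lambda f: -f.recent_commits,
    "most recent": lambda f: -f.last_commit.toordinal(),
}


def center_outward(forks, criterion):
    """Return fork names from the center of the view outward."""
    return [f.name for f in sorted(forks, key=ORDERINGS[criterion])]


# Made-up data, echoing the made-up data in the mockups.
forks = [
    Fork("overtone/overtone", date(2010, 1, 1), 3, date(2015, 3, 1)),
    Fork("samaaron/overtone", date(2011, 5, 1), 40, date(2015, 6, 1)),
    Fork("example-user/overtone", date(2013, 2, 1), 12, date(2015, 6, 9)),
]

default_view = center_outward(forks, "chronological")
active_view = center_outward(forks, "most active")
recent_view = center_outward(forks, "most recent")
```

In this sketch the original repo only holds the center under the chronological key. Under "most active" the center goes to whichever fork has the most recent commits, and "most recent" gives a similar but not identical ordering.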

In the screen below, the user wants to see repos which have merged a particular pull request.

Maybe a dropdown isn’t the best way to do it. And it assumes a kind of alternate universe, where pull requests live in a pool shared by a community, rather than belonging to any single given fork. But that’s completely feasible within git.
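As for why that's feasible within git: a fork has merged a pull request if the PR's head commit is reachable from the fork's branch tip, and commits keep the same SHA wherever they travel. Here's a small Python sketch of that check over a toy commit graph. The graph representation (a dict mapping each commit to its parents), the function names, and the fork names are assumptions for illustration. On a real clone, `git merge-base --is-ancestor` answers the same question.

```python
def reachable(parents, tip):
    """Return every commit reachable from `tip` by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(parents.get(sha, ()))
    return seen


def forks_containing(parents, fork_tips, pr_head):
    """Names of forks whose branch tip has `pr_head` in its history."""
    return sorted(
        fork for fork, tip in fork_tips.items()
        if pr_head in reachable(parents, tip)
    )


# Toy history: "pr1" branches off "a"; only one fork has merged it (commit "m").
parents = {
    "a": [],
    "b": ["a"],
    "pr1": ["a"],
    "m": ["b", "pr1"],  # merge commit bringing in the pull request
    "c": ["b"],         # ordinary commit on the other fork
}
fork_tips = {
    "original/project": "c",
    "active/project": "m",
}

merged_in = forks_containing(parents, fork_tips, "pr1")
```

One caveat with this approach: it only spots PRs that were merged with their original commits intact. A squash or rebase merge produces new SHAs, so it would need a different kind of matching.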

In day-to-day work, a lot of programmers spend a lot of time doing this kind of analysis and investigation manually. But these analyses, fundamentally, are about counting things and comparing them. The work is repetitive and easily quantified. That’s what computers are for.

Also, from a design perspective, this approach emphasizes the community, and the user’s goals, over the originator’s prestige, while still allowing you to place the originator at the center (not the top) by default. And keep in mind: if something hands you prestige, it hands you responsibility as well.

I think this error in GitHub’s fundamental design contributes to open source developer burnout, and that’s a serious problem. Recent research suggests that burnout might even be [just another word for clinical depression](http://en.wikipedia.org/wiki/Burnout_(psychology)).

But Who’s Going To Build It?

I totally get that adding new features to GitHub requires time and energy, while GitHub provides a terrific API, which might support this use case. So maybe we at Panda Strike will build this out, as an API-based experiment — but I’m not making any promises. For now, this is just an idea.

And I’m not saying that GitHub’s failure to exist in an ideal, perfect state is an unforgivable crime. Please don’t get your Twitters in a knot. Obviously GitHub’s a very successful business and a very useful system. I’m saying that while GitHub is good as-is, this would be an improvement.

Open source is about communities more than originators. It makes perfect sense for the original repo to also be the canonical one, by default, on day one of a project. But it doesn’t make sense for the original repo to be the only repo which can ever be canonical. And there are times when the original repo is not actually the most useful one, or the logical one to consider canonical.

I think you can find a lot of validation for what I’m saying here if you consider that there already exist Stack Overflow questions where people are asking which fork to use, and an app designed to figure out which forks of a given project are actually still alive.