HTTP Made Simple, Part 3: URLs Are Identifiers

In part 1, we said that HTTP views the Internet as a giant, distributed key-value store. In part 2, we reviewed the semantics of HTTP's methods, with GET, PUT, and DELETE acting as the main interface, and POST acting as a fallback for things that don't fit the key-value store abstraction. In this article, we're going to explore the utility of the ubiquitous URL.

Recall that URLs are the keys in our global key-value store and are therefore obviously pretty crucial. However, like much of the world's most viral protocol, they're often misunderstood.

URLs Are Opaque

A lot of virtual ink has been spilled on the design of URLs: how to make them "pretty," or human-readable. While this is a reasonable objective in many cases, it tends to obscure a fundamental property of URLs: opacity. (Opacity is a fancy variation of information hiding and encapsulation, which, as software engineers, we know are Good Things. Right? Right?!)

An easy way to think about this is to imagine you're reading a Web page. You see an interesting link. You click on it. You don't think too much about the URL—it's the link, and the content it refers to, that matters to you.

Without this property, things like search engines wouldn't be possible (or, at least, they'd be considerably more difficult). Neither you nor a search engine needs to care about the structure of each site's URLs. A site's pages can be crawled, in a sense, blindly. And then a search results page, consisting of URLs from many different sites, each with a different structure, can be presented to you without you ever knowing, or even thinking about, their structure.

Put another way, without the opacity property of URLs, the Web as we know it would be impossible. In short, the structure of a URL is an implementation detail, just like, say, a database identifier. (At least, it can be considered an implementation detail.) And generally, the less the client and server need to know about each other, the better. That is, there's no reason for the client to depend on the structure of the URLs. That doesn't mean you can't make them pretty. It just means the client should not depend on precisely how they're pretty.

Parameterized URLs

Developers tend to overlook the opacity thing, or dismiss it as impractical, because they think of the URL as something along the lines of a function name. They want to call a function on a server. The URL itself serves as both the name of the function and, in many cases, a way to encode arguments to the function.

Let's consider an example of updating an employee record. Suppose we've read parts 1 and 2, and so we use PUT to send the modified record to our server:

PUT /employees/1234

with the body being the JSON describing the employee. That's pretty reasonable, right? In fact, there's nothing at all wrong with this. But…there's a better way to do this.

Let's try the same thing, but using query parameters. By the way, query parameters aren't some second-class citizen in the world of URLs. They are absolutely part of the URL. A resource is identified by a URL, including the query parameters. If the query parameters have different values, you may well be referring to a different resource.

PUT /employees?id=1234

This is somewhat easier to understand, but more importantly, it's extensible, and I can create new URLs dynamically from existing ones. That is, I don't need to know the URL structure to put this together. I just need to know the names of the query parameters. In terms of our function call analogy, it's like having keyword arguments and first-class functions.

Suppose we get a new requirement to allow us to find an employee by email. It's pretty easy to use this same approach, right?

GET /employees?email=dan@pandastrike.com

What if we want all employees born on a particular date?

GET /employees?dob=2014-01-28

And we can combine parameters without having to think about structure at all—order doesn't matter anymore:

GET /employees?hired-after=2013-01-01&status=exempt

That's the extensibility part. But where this is really useful is that we don't even need to know the URL we're building from because the query parameters are appended to another base URL.

Think about how you browse Web pages again. You don't think about the URL. We can do that as developers, too.

Let's imagine we have an object in a variable named employees. That variable has a property named url. That's the URL for the employees resource. In CoffeeScript, we can construct the URL above like so:

"#{employees.url}?hired-after=2013-01-01&status=exempt"

At this point, we no longer have anything in our code that "knows" about the structure of the URLs. We do know the names of some arguments that can help us construct new URLs, but that's it.
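To make that concrete, here's a minimal sketch in plain JavaScript (which is what CoffeeScript compiles to anyway). The helper name withQuery is made up for illustration, and the sketch assumes the server understands the parameter names used earlier in this article. It leans on the standard URLSearchParams class to handle encoding.

```javascript
// Hypothetical helper: append query parameters to a URL we were handed,
// without inspecting or depending on how that URL is structured.
function withQuery(baseUrl, params) {
  // URLSearchParams percent-encodes names and values for us.
  const query = new URLSearchParams(params).toString();
  if (query === "") return baseUrl;
  // If the base URL already carries a query string, extend it.
  const separator = baseUrl.includes("?") ? "&" : "?";
  return baseUrl + separator + query;
}

// Stand-in for the `employees` object described above.
const employees = { url: "/employees" };

const recentExempt = withQuery(employees.url, {
  "hired-after": "2013-01-01",
  status: "exempt",
});
console.log(recentExempt);
// /employees?hired-after=2013-01-01&status=exempt
```

The only things this code knows are a URL it was given and a couple of parameter names. If the server later moves the employees resource somewhere else, nothing here has to change.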

But how did we get that employees variable in the first place?

Discoverability

We can discover useful URLs the same way we use search engines—with a well-known starting point. For example, you could just start with a GET request to /. This root resource can then return a description of the available resources, each with URLs. In CoffeeScript, the resulting object might look like this:

resources:
  employees:
    url: '/employees'
  # ...

Now, the API client no longer needs to know anything about the URL structure of the API except /. Again, this is all basically just encapsulation, applied to the world of distributed computing. Along with opacity, distributed computing geeks will sometimes talk about loose coupling, but these are just specific variations of encapsulation.
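Here's a rough sketch of what such a client might look like in JavaScript. The response shape matches the object above. The function names and the injected fetchJson parameter are assumptions made for the sake of the example, not part of any real library.

```javascript
// Hypothetical discovery client. The only URL it knows up front is the root.
// `fetchJson` is any function that takes a URL and resolves to parsed JSON;
// it is injected so the sketch does not depend on a particular HTTP library.
async function discover(fetchJson, rootUrl = "/") {
  const description = await fetchJson(rootUrl);
  return description.resources;
}

// Look up a resource's URL by name instead of hardcoding its path.
function urlFor(resources, name) {
  const resource = resources[name];
  if (!resource) throw new Error(`unknown resource: ${name}`);
  return resource.url;
}

// A canned response standing in for a real server, for illustration only.
const fakeFetchJson = async (url) => {
  if (url !== "/") throw new Error(`unexpected URL: ${url}`);
  return { resources: { employees: { url: "/employees" } } };
};

discover(fakeFetchJson).then((resources) => {
  console.log(urlFor(resources, "employees")); // /employees
});
```

In a real client, fetchJson would wrap whatever HTTP library you already use. The point is that the path /employees appears only in the server's response, never in the client's code.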

URLs Are Persistent

One misconception about discoverability is that it's slower because you have to construct a chain of requests to get to the URL you want instead of just hardcoding it and making a single request. If this were true, it would be a big problem, but, fortunately, it's not. URLs can be stored on the client once they're discovered. You don't need to keep rediscovering them.

If a URL does change, the server can issue a 301 Moved Permanently response, which includes a Location header giving the new URL, so we can update our local reference. Even if we, as the API designers, completely change our URL structure, no one using the API will know the difference so long as we issue 301s, provided they used discovery to get ahold of the URLs in the first place.
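As a sketch of how a client might keep its stored URLs current, assume it holds discovered URLs in a Map and sees responses as plain objects with a status and lowercase header names. Both of those shapes, and the name rememberMove, are invented for this example.

```javascript
// Hypothetical store of discovered URLs, keyed by resource name.
const knownUrls = new Map([["employees", "/employees"]]);

// Inspect a response to a request for `name`. On a 301, replace the stored
// URL with the one from the Location header and report that a retry is due.
function rememberMove(store, name, response) {
  const location = response.headers["location"];
  if (response.status === 301 && location) {
    store.set(name, location);
    return true;
  }
  return false;
}

// Simulated response after the API designer restructures the URLs.
const moved = { status: 301, headers: { location: "/v2/staff" } };

if (rememberMove(knownUrls, "employees", moved)) {
  console.log(knownUrls.get("employees")); // /v2/staff
}
```

The client pays for the redirect once. Every later request goes straight to the new URL, and no code had to be edited.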

URLs Scale

HTTP URLs also include the name or address of the server that can find the resource being requested. This mirrors what distributed hash tables (DHTs) do when they hash the key to find the server that has the key's value. The "hash" in HTTP is simply parsing the name or address of the server out of the URL. HTTP URLs harness the existing DNS to accomplish the hashing step. This means we can resolve any resource in just two steps: one DNS call and one call to the server. In practice, of course, the server may use additional steps to resolve the resource. But conceptually, you can already see how this can scale. IPv4 allows for about 4.3 billion addresses, which DNS typically resolves in milliseconds. Each of these hosts, in turn, can resolve millions or even billions of resources. That's on the order of 10^18 resources that can be found, typically in less than a second.
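Both halves of that argument are easy to check. The sketch below uses JavaScript's standard URL class to pull the server name out of a URL (the "hash" step), then redoes the arithmetic. The host example.com and the figure of one billion resources per host are placeholders, not measurements.

```javascript
// The "hash" step: the server to ask is written into the key itself.
function serverFor(url) {
  return new URL(url).hostname;
}

console.log(serverFor("http://example.com/employees?id=1234"));
// example.com

// Back-of-the-envelope capacity estimate.
const ipv4Addresses = 2 ** 32;   // about 4.3 billion
const resourcesPerHost = 1e9;    // assumed: a billion resources per host
const total = ipv4Addresses * resourcesPerHost;
console.log(total.toExponential(1)); // 4.3e+18
```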

Until Next Time…

In a nutshell, URLs provide a highly scalable, well-defined, and robust key scheme, featuring opacity, discovery, and persistence. You don't have to take full advantage of these features, but you can, and they're a big part of what makes the Web scale so well. Later on, we'll talk about the trade-offs associated with using them. For now, suffice it to say that URLs are very cleverly designed and it's useful to know that you can take advantage of that.

Again, URLs are the keys in our key-value store. But, by themselves, they don't get us a value. They merely reference a resource. In our next installment, we'll talk about how to go from a resource to a representation, which is the actual value we want, and why that distinction is so useful.
