HTTP/REST Data APIs

October 25, 2015 · Dan Yoder

Data-oriented API frameworks, like Relay and Falcor, are trying to solve problems for which HTTP APIs appear, even to experienced developers and architects, to be poorly suited. Let’s put aside spurious claims about HTTP and REST, and explore these concerns.

The most obvious difficulty is dealing with ad hoc queries. Semantically, a query ought to be a GET request. But HTTP caches key responses first and foremost on the URL. And mapping the semantics of a sufficiently rich query language into a URL seems both contrary to the notion of opaque URLs (and thus REST) and just plain awkward. We can, of course, use POST and send the query in the body of a request, but we lose the benefits of HTTP caching in the process. (You can, in certain circumstances, cache POST responses, but that won’t help us here.) At first glance, then, if we want ad hoc queries with cacheable responses, it appears we cannot rely on HTTP.

Let’s try a little harder, though.

Query Resources

Databases allow you to compile a query before running it. Let’s borrow this idea and adapt it to our needs. We’ll introduce a new kind of resource, a query resource. A query resource encapsulates a query. As before, we can send a POST request with the query in the body. This will create a new query resource. The location header in the response will be its URL.

Request

POST <query-url>
<query-body>

Response

201 Created
location: <query-resource-url>

To “run” the query, we simply GET the query resource.

Request

GET <query-resource-url>

Since we’re now using GET requests to run the query, we can fully utilize HTTP caching.

Response

200 OK
etag: <query-result-etag>
cache-control: max-age=<query-result-max-age>
<query-result-body>
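To make the two requests concrete, here is a minimal in-memory sketch of the server side in Python. Everything named here is an assumption for illustration: the QueryResources class, the /queries/ URL scheme, the run_query callable that executes a query, and the 60-second max-age are all hypothetical, and a real implementation would sit behind an actual HTTP server.

```python
import hashlib
import json
import uuid


class QueryResources:
    """In-memory stand-in for the server side of the query resource pattern."""

    def __init__(self, run_query, max_age=60):
        self._run_query = run_query  # hypothetical callable: query text -> result
        self._queries = {}           # query resource URL -> stored query text
        self._max_age = max_age      # freshness lifetime in seconds (assumed)

    def post(self, query_body):
        """POST <query-url>: store the query and answer 201 with its location."""
        url = f"/queries/{uuid.uuid4().hex}"
        self._queries[url] = query_body
        return 201, {"location": url}, b""

    def get(self, url):
        """GET <query-resource-url>: run the stored query, add caching headers."""
        if url not in self._queries:
            return 404, {}, b""
        body = json.dumps(self._run_query(self._queries[url])).encode("utf-8")
        headers = {
            # The ETag is derived from the body, so it changes only when
            # the query result changes.
            "etag": '"' + hashlib.sha256(body).hexdigest() + '"',
            "cache-control": f"max-age={self._max_age}",
        }
        return 200, headers, body
```

A client would call post once, keep the URL from the location header, and then call get on that URL as often as it likes.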

Of course, the drawback is that we have two round trips. One to create the query resource, and one to actually run it. Consequently, it’s tempting to discard this approach. However, as it turns out, we rarely need to make two round trips.

Storing And Reusing Query Resources

To see this, it’s helpful to divide our ad hoc query scenario into three variations. First, we have the multitude of clients scenario, where each client knows its queries ahead of time, but there are simply too many clients to anticipate each of their needs. For example, we might have a popular API used by hundreds, or even thousands, of applications.

We also have the multitude of queries variation. In this scenario, even the clients cannot anticipate the queries ahead of time. However, the client can potentially reuse queries. For example, we might have an application that allows people to save frequently run searches.

Finally, we have the truly ad hoc queries variation. In this scenario, queries are rarely, if ever, repeated. However, we can discard this scenario because, by definition, caching is of no use since we never (or rarely) repeat the same query. So we can simply use POST to make the query.

This leaves us with the multitude of clients and the multitude of queries scenarios. In the multitude of clients scenario, we create query resources at build time and load the query resource URLs from a database or configuration file. In effect, we’re allowing each client to construct custom endpoints for their application. Yet we’ve introduced no unnecessary coupling in our API. In the multitude of queries scenario, the client can still store the URLs of query resources, either in local storage or along with session or identity information.

In both scenarios, we store the query resource URLs. When we store the URLs, we don’t need to recreate query resources every time we run a query. And the cost of the extra round-trip approaches zero, since we only do it once. In fact, in the multitude of clients scenario, we only do it at build time.
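For the multitude of clients scenario, the build step might look something like the following Python sketch. The function names, the JSON file format, and the create_query_resource callable (which stands in for the POST request and returns the URL from the location header) are assumptions, not part of any particular framework.

```python
import json
from pathlib import Path


def register_queries(queries, create_query_resource, path):
    """Build step: create one query resource per named query, save the URLs."""
    # create_query_resource is a hypothetical callable that performs the POST
    # and returns the query resource URL.
    urls = {name: create_query_resource(body) for name, body in queries.items()}
    Path(path).write_text(json.dumps(urls, indent=2, sort_keys=True))
    return urls


def load_query_urls(path):
    """Run time: read the stored URLs, so no POST round trip is needed."""
    return json.loads(Path(path).read_text())
```

At run time the client only calls load_query_urls and issues GET requests, so the extra round trip is paid once, during the build.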

Parameterized Queries

We can convert some multitude of queries scenarios into multitude of clients scenarios by borrowing another idea from database systems, the idea of parameterized queries. Parameterized queries are like compiled queries, except they accept parameters. As with conventional parameterized queries, we can indicate parameters with a syntactic convention in our query language. For example, we might prefix parameters in our query language with a $.

Creating the query resource works the same as before, except, of course, we have parameters in the query body.

Request

POST <query-url>
<query-body-with-parameters>

Response

201 Created
location: <query-resource-url>

When we make GET requests to the query resource, we provide values for the parameters, just like we would when using parameterized queries with a database. This fits HTTP like a glove, since we already have a way to parameterize an otherwise opaque URL.

Request

GET <query-resource-url>?<parameter-name>=<parameter-value>&...

And, of course, we can take advantage of HTTP caching, just as before.

Response

200 OK
etag: <query-result-etag>
cache-control: max-age=<query-result-max-age>
<query-result-body>
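One detail matters for caching here: the same parameter values should always produce the same URL. A small Python sketch of a client-side helper follows. The query_url function is hypothetical, and it assumes the $ prefix is dropped from parameter names when they appear in the URL.

```python
from urllib.parse import urlencode


def query_url(query_resource_url, **params):
    """Append parameter values to a query resource URL in a stable order."""
    if not params:
        return query_resource_url
    # Sorting by parameter name means the same values always yield the same
    # URL, and therefore the same cache key.
    return query_resource_url + "?" + urlencode(sorted(params.items()))
```

For example, query_url("/queries/42", year=2015, genre="sci fi") gives /queries/42?genre=sci+fi&year=2015, regardless of the order in which the arguments were passed.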

Query Resource Limitations

Two problems remain. The first is mainly a problem in the multitude of queries scenario. We may not always know that a given query already has a corresponding query resource. We can solve this problem if we always construct our queries in such a way that the same query is always represented the same way. We should be able to perform a simple string comparison to determine whether two queries are identical. That is, we need to be able to normalize our queries. If we can do this, we can simply hash them and maintain a dictionary mapping queries to query resource URLs.

Pseudo-Code For Query Lookups

key = normalize(query)
queryResourceURL = queryResourceDictionary[key]
if !queryResourceURL?
  # no query resource yet: create one by POSTing the query
  result = http.POST queryURL, query
  # save the query resource URL for later reuse
  queryResourceURL = result.headers.location
  queryResourceDictionary[key] = queryResourceURL
http.GET queryResourceURL
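How we normalize depends entirely on the query language. As one possible sketch in Python, suppose queries are expressed as JSON-compatible structures (an assumption, since this post deliberately doesn't pick a query language). Serializing with sorted keys then gives us a canonical string that we can hash. The QueryResourceCache class and the create_query_resource callable, which stands in for the POST request, are hypothetical names.

```python
import hashlib
import json


def normalize(query):
    """Serialize a JSON-compatible query so equal queries give equal strings."""
    return json.dumps(query, sort_keys=True, separators=(",", ":"))


def query_key(query):
    """Hash the normalized query to get a compact dictionary key."""
    return hashlib.sha256(normalize(query).encode("utf-8")).hexdigest()


class QueryResourceCache:
    """Maps normalized queries to the URLs of their query resources."""

    def __init__(self, create_query_resource):
        # Hypothetical callable: POSTs the normalized query and returns the
        # URL from the location header of the response.
        self._create = create_query_resource
        self._urls = {}

    def url_for(self, query):
        key = query_key(query)
        if key not in self._urls:
            self._urls[key] = self._create(normalize(query))
        return self._urls[key]
```

This only handles differences in key order. Queries that are semantically equivalent but structurally different, such as the same predicates listed in a different order, would need normalization rules specific to the query language.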

The second problem is cache invalidation. However, any API supporting queries will have this problem. That is, it isn’t a limitation of REST. So we won’t address it in this post, just like we’re not discussing which query language we’re using. We can use any number of query languages or libraries. Cache invalidation requires some analysis of the queries and a way to determine which database updates affect which query resources. Of course, if it isn’t critical to have realtime updates, we may be able to simply stamp responses with freshness information and avoid dealing with cache invalidation altogether.

Data Synchronization

We can also take advantage of the push feature in HTTP/2 to push query responses before we request them. This allows us to keep our queries simple, increasing the likelihood of a cache hit, without incurring extra round trips. For example, we might have a list of films, each with ratings and reviews. We can create separate queries, one to retrieve the list, and one to retrieve a film’s ratings and reviews. By the time we make a GET request to retrieve the ratings and reviews, the server has already pushed them to the client.

We can take this even further by using server-sent events to effectively subscribe to the film listings and ratings and reviews data. Our client can now simply access local data structures synchronized from the server. By layering this synchronization on top of our query resources, each client is free to get exactly the data it needs without introducing additional coupling between the client and server. This wasn’t possible until a couple of years ago, but that’s kind of the point. The Web, and HTTP along with it, are constantly improving.
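As a rough illustration of the client side, the Python sketch below parses lines in the text/event-stream format and uses each event to replace a local copy of a query result. It is a simplified parser that ignores the id and retry fields. The convention that the event name identifies the query result and that the data is JSON is an assumption made for this example.

```python
import json


def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of text/event-stream lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the event collected so far.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # one leading space after the colon is dropped
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
            # "id" and "retry" fields are ignored in this sketch.


def apply_events(store, lines):
    """Replace the local copy of each query result named by an event."""
    # Assumption: the event name identifies the query result, the data is JSON.
    for event, data in parse_sse(lines):
        store[event] = json.loads(data)
    return store
```

The rest of the client can then read from store as an ordinary dictionary, without knowing how or when it was updated.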

Hyperlinked Queries

Of course, it has always been possible to link queries together, without push or server-sent events, by using hyperlinks. This imposes some non-trivial requirements on the query language and I’m not aware of any query language that has done this. At the same time, adapting an existing query language to support hyperlinks would perhaps have been a reasonable alternative to abandoning REST.

HTTP/REST Can Handle Ad Hoc Queries

The query resource pattern supports ad hoc queries in HTTP APIs:

for arbitrary query languages
with minimal overhead
fully utilizing HTTP caching
consistent with REST constraints
with only two additional endpoints

We’ve even described how we can build on this foundation to support real-time data synchronization.

Of course, data-oriented API frameworks do other things. But we can confidently say that the need to support these features is not, by itself, a reason to recommend them.