Dynamic HTTP Routing in Go With Travel

One thing I found interesting while using the Pyramid web framework was the idea of “traversal” request routing. I’ve implemented an HTTP router in Go, called travel, that offers similar functionality.

If you read that traversal link above it may seem complex—it is, sort of—but let me break it down a bit and explain why I think it’s cool:

Using travel, your job as the author is to provide the router with a “root tree” object[1]. This is a nested map[string]interface{} object that represents a tree of routable resources. Travel tokenizes the request URL and does a recursive lookup on the root tree. If it fails anywhere[2], travel automatically returns an appropriate error to the client (404, etc). If it succeeds, it calls one of your named handlers according to a “handler map” you provide[3].
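
To make the mechanics concrete, here’s a minimal sketch of the kind of lookup traversal performs. This is the idea only, not travel’s actual API: the request path is tokenized and each token is used as a key into the nested map.

package main

import (
	"fmt"
	"strings"
)

// lookup walks a nested map[string]interface{} using the tokens of the
// request path. A failure at any token is what the router turns into a 404.
func lookup(root map[string]interface{}, path string) (interface{}, error) {
	cur := interface{}(root)
	for _, tok := range strings.Split(strings.Trim(path, "/"), "/") {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return nil, fmt.Errorf("cannot traverse into %q", tok)
		}
		if cur, ok = m[tok]; !ok {
			return nil, fmt.Errorf("not found: %q", tok)
		}
	}
	return cur, nil
}

func main() {
	root := map[string]interface{}{
		"mary": map[string]interface{}{
			"phone": map[string]interface{}{"home": "999-999-9999"},
		},
	}
	fmt.Println(lookup(root, "/mary/phone/home")) // 999-999-9999 <nil>
}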

The nice things about this approach (vs. static route patterns) are:

  • Routing is dynamic: arbitrarily nested structures only known at runtime can be expressed in the URL and routed appropriately.
  • Less boilerplate code for RESTful endpoints: if your handler is called, the resources in question must exist. No need for existence checks to verify that, e.g., a given user_id is valid.
  • Separation of data concerns: since routing is done with a tree structure, you could separate out the root tree from the data it references (putting the root tree in a fast RAM-backed store for example) and only load data from the database for successful requests.

To demonstrate some of these benefits I’ve put together a couple of example applications. The first is a nested HTTP key/value store—implemented in ~150 lines of Go—whose datastore backend is a single JSON file. There’s a Vagrantfile in the root of the repository which makes it easy to run and test these applications.

JSON Key-Value

This program supports GET, PUT and DELETE verbs which retrieve, create/modify and remove keys, respectively. It expects JSON in the request body of a PUT request.

When you first start it up the store is empty:

$ http GET http://192.168.10.10:8000
HTTP/1.1 200 OK
Content-Length: 2
Content-Type: application/json
Date: Fri, 20 Mar 2015 15:34:26 GMT

{}

We can create a simple key:

$ echo '"bar"' |http --body PUT http://192.168.10.10:8000/foo
{
    "success": "value written"
}
$ http --body GET http://192.168.10.10:8000
{
    "foo": "bar"
}

(Note that we have to wrap a string in double quotes for it to be valid JSON.)

We can also create a complex object:

$ echo '{"phone": "111-222-3333", "age": 32}' |http --body PUT http://192.168.10.10:8000/mary
{
    "success": "value written"
}
$ http --body GET http://192.168.10.10:8000
{
    "foo": "bar",
    "mary": {
        "age": 32,
        "phone": "111-222-3333"
    }
}

We can create new nested values:

$ echo '"123 Main St."' |http --body PUT http://192.168.10.10:8000/mary/address
{
    "success": "value written"
}
$ http --body GET http://192.168.10.10:8000
{
    "foo": "bar",
    "mary": {
        "address": "123 Main St.",
        "age": 32,
        "phone": "111-222-3333"
    }
}

We can alter existing nested values:

$ echo '{"home": "999-999-9999"}' |http --body PUT http://192.168.10.10:8000/mary/phone
{
    "success": "value written"
}
$ http --body GET http://192.168.10.10:8000
{
    "foo": "bar",
    "mary": {
        "address": "123 Main St.",
        "age": 32,
        "phone": {
            "home": "999-999-9999"
        }
    }
}

Of course we can access any nested value directly:

$ http --body GET http://192.168.10.10:8000/mary/phone/home
"999-999-9999"

…and we can delete any value directly:

$ http --body DELETE http://192.168.10.10:8000/mary/phone/home
{
    "success": "value deleted"
}
$ http --body GET http://192.168.10.10:8000
{
    "foo": "bar",
    "mary": {
        "address": "123 Main St.",
        "age": 32,
        "phone": {}
    }
}

We can’t do this:

$ http --body GET http://192.168.10.10:8000/mary/phone/home
404 Not Found: [mary phone home]
$ echo '"bob"' |http --body PUT http://192.168.10.10:8000/this/path/does/not/exist/yet
404 Not Found: [this path does not exist yet]

Looking at the code:

  • Every request is processed by a single handler.
  • Nowhere does it check if a key exists, nor does it recurse through the data structure at all. That is all handled by travel.

It initializes the travel router with the “strict traversal” option. Essentially this means that the resulting handler name can be either “” (the empty string) if traversal fully succeeds, or the first failed token if it does not. It also sets the DefaultHandler to “” (again, the empty string) and sets a maximum subpath (any tokens remaining if traversal does not fully succeed) of one, and only for PUT requests, since that is the only case where we want a request on a path that does not exist yet to succeed: key creation. Basically we’re forcing all requests to either fail or call the default handler (“”).
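
In code, the handler-selection decision just described looks roughly like this. This is a conceptual sketch; travel’s real option names and types differ.

package main

import "fmt"

// selectHandler captures the decision logic: tokens are the path tokens and
// failedAt is the index of the first token where traversal failed (-1 if
// traversal fully succeeded).
func selectHandler(tokens []string, failedAt int, method string) (handler string, ok bool) {
	if failedAt < 0 {
		return "", true // full success: route to the "" (default) handler
	}
	subpath := tokens[failedAt:] // tokens left over after the failure
	if method == "PUT" && len(subpath) == 1 {
		// One missing token on a PUT is allowed: this is key creation.
		// Strict traversal names the handler after the first failed token;
		// since that name is not in the handler map, the request falls
		// through to the DefaultHandler ("").
		return "", true
	}
	return "", false // anything else is a 404
}

func main() {
	fmt.Println(selectHandler([]string{"mary", "address"}, 1, "PUT")) // key creation: ok
	fmt.Println(selectHandler([]string{"a", "b", "c"}, 0, "PUT"))     // too deep: 404
}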

It passes travel a simple function that reads and deserializes the JSON file. Travel calls this at the start of every request. Using the result returned from that function as the root tree, it performs traversal and either calls the appropriate handler if the request succeeds or the error handler if not.
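
Such a fetch function might look something like this (a sketch: the file name and the exact callback signature travel expects are illustrative).

package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
)

// getRootTree reads and deserializes the JSON file backing the store; the
// router calls a function like this at the start of every request.
func getRootTree(path string) (map[string]interface{}, error) {
	b, err := ioutil.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var rt map[string]interface{}
	if err := json.Unmarshal(b, &rt); err != nil {
		return nil, err
	}
	return rt, nil
}

func main() {
	rt, err := getRootTree("data.json")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(rt)
}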

Pretty cool that we can do this in ~150 lines. One big issue though is that this program is not safe for concurrent requests. Multiple simultaneous requests could overwrite each other’s data or potentially corrupt the JSON file. How could we address this?

One way is to use PostgreSQL’s JSON storage type instead of a simple file. The next example application is essentially identical to the one described above except for the functions that deal with root tree retrieval and storage. Everything else works exactly the same, but we make these modifications:

  • Instead of a JSON file, we have a table called “root_tree” with three columns: sequential primary key, created timestamp, jsonb tree.
  • Rather than mutate a single row in that table on every request, if the request modifies the tree we simply insert a new row. The table serves as a history of all modifications going back in time.
  • The function that fetches the root tree at the start of each request does a simple SELECT on the table in descending order by primary key (which is sequential) with LIMIT 1. That way we always get the latest version of the root tree, as sketched below.
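
Concretely, the fetch could be written like this (a sketch: the connection string and the column names id and tree are assumptions based on the description above).

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// fetchLatestTree returns the most recent root tree as raw JSON bytes by
// taking the row with the highest (sequential) primary key.
func fetchLatestTree(db *sql.DB) ([]byte, error) {
	var tree []byte
	err := db.QueryRow(`SELECT tree FROM root_tree ORDER BY id DESC LIMIT 1`).Scan(&tree)
	return tree, err
}

func main() {
	db, err := sql.Open("postgres", "dbname=kvstore sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	tree, err := fetchLatestTree(db)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("latest tree: %s\n", tree)
}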

The root tree is fetched at the start of every request, so within the body of the request it should always be considered ‘stale’ with respect to modification. To address this, we use an advisory lock after the modification request has been validated. We then use the context Refresh() method to re-run traversal and give us the updated root tree. Since this thread holds the lock for modification, we are free to modify and write back the root tree to a new row. After we’re done we release the lock (in theory it would be implicitly released anyway when the transaction is committed or rolled back, but I like to be explicit).
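
Here is a sketch of that write path against the same root_tree table. The lock key and the modify callback are illustrative, the re-read is inlined where the example app uses travel’s Refresh(), and this version uses a transaction-scoped advisory lock (released implicitly) where the example app releases its lock explicitly.

package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// writeTree takes an advisory lock, re-reads the latest tree (the caller's
// copy is stale by now), applies the modification and inserts a new row.
func writeTree(db *sql.DB, modify func([]byte) []byte) error {
	tx, err := db.Begin()
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the commit below succeeds

	// pg_advisory_xact_lock blocks until the lock is free and is released
	// automatically when the transaction commits or rolls back.
	if _, err := tx.Exec(`SELECT pg_advisory_xact_lock(1)`); err != nil {
		return err
	}
	var tree []byte
	if err := tx.QueryRow(`SELECT tree FROM root_tree ORDER BY id DESC LIMIT 1`).Scan(&tree); err != nil {
		return err
	}
	if _, err := tx.Exec(`INSERT INTO root_tree (tree) VALUES ($1::jsonb)`, string(modify(tree))); err != nil {
		return err
	}
	return tx.Commit()
}

func main() {
	db, err := sql.Open("postgres", "dbname=kvstore sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	// no-op modification, just to exercise the locked write path
	if err := writeTree(db, func(t []byte) []byte { return t }); err != nil {
		log.Fatal(err)
	}
}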

This program is now safe for concurrent requests, and this can be demonstrated by running GOMAXPROCS=2 go test, which runs 50 concurrent requests and makes sure they all succeed without data loss.

There are probably more clever ways of constructing the root tree that would not involve explicit locking—perhaps by building a tree structure on the fly from a set of resources. To me the interesting part of traversal is dynamic request routing without annoying boilerplate. Travel is a more-or-less direct translation of the idea into Go, but it’s certainly possible that it can be improved.

I’m open to any feature ideas or—preferably—simplifications. Feel free to open a GitHub issue and thanks for reading!

  1. Actually you provide a callback function that fetches the root tree object.
  2. This isn’t strictly true—using traditional traversal semantics the lookup can fail in certain instances but still invoke a handler. The details are not terribly intuitive but can be found in the original documentation.
  3. In the Pyramid/Pylons implementation, the author would provide a nested dictionary and would use decorators on “view” (handler) functions/classes to indicate their “name”. We obviously cannot do this in Go, so the “handler map” is a map[string]func that provides similar functionality.

Git Is Actually OK for (Some) Binaries

In hacker circles there’s a common urban legend that it is somehow “bad” or inefficient to version binary files with git (“DVCS systems are not good candidates for storing binary files”, “Don’t ever commit binary files to git”). In one recent Hacker News post, a commenter went so far as to write “Anyone checking in 500MB artifacts into git is almost certainly refusing to use git correctly”. While 500MB may be a little on the big side for my taste, as long as they are the right type of binaries it should not be an issue.

So let’s discuss the types of binaries you might consider versioning within git[1].

Binary files that fit these constraints are good candidates for git versioning:

  • Localized, relatively small changes between revisions (this mostly implies compressibility).
  • Proportionality between source changes and resulting binary changes (i.e., if the asset is produced by some process—compilation, etc.—a small change in the source results in a proportionately small change in the binary).
  • ≤ 150MB in size per file[2]

The reason these types of files work well in git is that git can store revisions as binary deltas: objects are zlib-compressed on disk, and similar objects are delta-compressed against each other when the repository is packed, which is precisely what we want. In fact that’s exactly how Amazon CodeDeploy and my similar open source project Elita manage binary code deployments — by committing compiled files into a git repository, pushing to a central repository and then git pulling in the changes on the target servers.

Let’s do a little test.

I have a simple Go program that I want to compile and version in git:

package main

import "fmt"

func main() {
  fmt.Printf("Hello, world!\n")
}

Compile it:

$ cd ~/go/src/gittest
$ go build
$ ls -lh
total 3776
-rwxr-xr-x  1 bk  staff   1.8M Mar  4 08:17 gittest
-rw-rw-r--  1 bk  staff    75B Mar  4 08:16 test.go

Now we’ll create a git repo and check these files in. Go programs compile to native machine code, so this would be similar if the program were written in C or C++.

$ git init
Initialized empty Git repository in /Users/bk/go/src/gittest/.git/
$ git add -A
$ git commit -m "initial version"
[master (root-commit) 9a381bd] initial version
 2 files changed, 7 insertions(+)
 create mode 100755 gittest
 create mode 100644 test.go

Now let’s make a small change to our source code:

package main

import "fmt"

func main() {
  fmt.Printf("Hello, Arbre!\n")
}

Compile and commit it:

$ go build
$ git add -A && git commit -m "second version"
[master a00b091] second version
 2 files changed, 1 insertion(+), 1 deletion(-)

Cool. Now let’s generate a binary patch for the executable and see how big it is:

$ git diff --binary master~1:gittest gittest > ./diff.patch
$ ls -lh
total 3784
drwxr-xr-x  12 bk  staff   408B Mar  4 08:34 .git
-rw-r--r--   1 bk  staff   404B Mar  4 08:39 diff.patch
-rwxr-xr-x   1 bk  staff   1.8M Mar  4 08:34 gittest
-rw-rw-r--   1 bk  staff    75B Mar  4 08:34 test.go

404 bytes! Nice. Here’s what it looks like:

diff --git a/gittest b/gittest
index 77923533b0c042ef0c45dda307bbc9c938070a87..ec7b9ebaf7c0e486a2faca8c55143824a1473016 100755
GIT binary patch
delta 107
zcmWN=w+Vni00mIYIp?s9s|YS|VsG)m)?ow}xmnKU3jc$5%j1^E2?Hi9*l^&&gO30q
mBE([email protected]*e<OSPC4V83og0hnj3Dp<DLf|d3v>)>-`7IJUHI~

delta 107
zcmWN=xe0(k00cn%|Ns0g&rlE)F*R5)wHU!hu9h<~a0lj++a<RH8gv*iVZnw27an{B
l2=C}s);=%Ocz!D4m=jJp<D3f?Tyn)VH{5c^{qw0otUue?IU4`~

That looks like a nice clean binary delta to me. Certainly it isn’t an entire copy of the second version of the file.

Now let’s check out the original version of the binary:

$ git checkout master~1 2> /dev/null
$ ./gittest
Hello, world!

…and apply the patch:

$ git apply ./diff.patch
$ ./gittest
Hello, Arbre!

As you can see the patched binary runs perfectly!

One final thing. Let’s check the size of the git metadata:

$ ls -lh
total 3784
drwxr-xr-x  12 bk  staff   408B Mar  4 08:41 .git
-rw-r--r--   1 bk  staff   404B Mar  4 08:39 diff.patch
-rwxr-xr-x   1 bk  staff   1.8M Mar  4 08:42 gittest
-rw-r--r--   1 bk  staff    75B Mar  4 08:41 test.go
$ du -hs ./.git/
1.3M  ./.git/

1.3MB is less than the size of a single copy of the original binary. This shows that git is not storing our binary revisions as entire uncompressed copies of the file, but in a much more efficient form (zlib compression, plus delta-compression when objects are packed).

For which types of binary files would git not be appropriate?

Anything where changes between revisions are scattered throughout the file. For example, anything encrypted with a modern cipher will likely be almost completely different after even the tiniest change to the original plaintext. Compressed files behave similarly: a small change early in the stream can alter nearly all of the bytes that follow it.
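
You can see the effect with a quick experiment (a sketch using Go’s standard library; exact numbers will vary): compress two buffers that differ by a single byte and count how many output bytes still line up.

package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
)

// gz returns the gzip-compressed form of b.
func gz(b []byte) []byte {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(b)
	w.Close()
	return buf.Bytes()
}

func main() {
	a := bytes.Repeat([]byte("all work and no play. "), 1000)
	b := append([]byte{}, a...)
	b[0] = 'A' // a one-byte change in the input

	ca, cb := gz(a), gz(b)
	same := 0
	for i := range ca {
		if i < len(cb) && ca[i] == cb[i] {
			same++
		}
	}
	// Typically little beyond the fixed gzip header matches: the one-byte
	// change perturbs the Huffman coding and everything after it shifts.
	fmt.Printf("%d of %d compressed bytes identical\n", same, len(ca))
}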

It will take some experimentation to see if your particular binary assets will work efficiently. Compiled executable binaries work particularly well. I’ve used git for both native machine code binaries and .NET CLR binaries—I haven’t tested JVM binaries specifically but imagine they would be similar. Images, music, etc. may also be worth experimenting with.

Thanks for reading and feel free to leave a comment with any corrections or additional info!


  1. I say git specifically because it is relatively smart about versioning binary assets as shown in this post, versus something like Mercurial which—last I knew—simply stored binary file revisions as flat blobs of the full file.

  2. This may actually be an urban legend as well. I’m not familiar with git internals but my understanding is that the binary diffing algorithm gets inefficient above this threshold. I would love to hear if that is incorrect.

Elita 0.59 Released

I’m proud to announce the first public release of Elita, the continuous deployment engine I wrote in Python for ScoreBig.

There are still a bunch of features I want to implement, but we are using it routinely on a daily basis and it’s pretty stable so far.

Find it on PyPI: http://pypi.python.org/pypi/elita

The docs are here: http://elita.readthedocs.org (a little outdated but nothing major)

Feel free to ask any questions!

Runtime Type Verification in Python

In this post I advocate for a particular style of Python coding which I call “Runtime Type Verification” (RTV), in order to help you write code that has clearer intent, fewer implicit assumptions, and—hopefully—fewer bugs.

Just to be clear: Python doesn’t need type checking like a statically-typed language. If you are coming to Python from another language with static typing, please don’t try to force those idioms on Python. However I think it’s useful to deal with types explicitly when they matter, which, as we will see, is a lot of the time.

The Problem

In a nutshell: most (or all) of the methods you write have implicit assumptions about the parameters they accept.

For example, function/method parameters (I’ll use the term “function” to mean both functions and class methods) by default will happily accept NoneType objects (as would be expected). However in a lot of cases—probably the majority—functions aren’t designed to deal with None, resulting in the familiar “AttributeError: ‘NoneType’ object has no attribute ‘foo’” exceptions. This is sort of Python’s version of a null reference exception.

Typically people ignore the possibility of None with the rationale that the code will break anyway and some exception will be thrown somewhere down the line. However we want to fail as early as possible, and RTV helps to make sure that parameter assumptions are enforced.

Another example might be a function that expects a dictionary with a specific set of keys, or a list with a length between X and Y. The possibilities go on. It’s quite unusual to write a function that has zero knowledge of the arguments passed to it.

You might have a function or method like the following:

Example (rtc_func_ex1.py)
def NewUser(name, categories, attributes):
    # User and save_to_database are defined elsewhere
    user_obj = User(name, categories, attributes)
    return save_to_database(user_obj)

What implicit assumptions does this function make?

  • ‘name’ exists (is not None) and is a string (or string-like)
  • ‘categories’ exists (is not None) and is an iterable like list or tuple. (You might extend the assumption to say that the iterable contains objects of type str or even valid categories that exist)
  • ‘attributes’ is also a container type of some sort (in this case we will say that the function expects it to be a dict) but may be empty or None.

A Solution

Let’s encode all these assumptions in the preamble to the function (and add a docstring while we’re at it):

Typed Example (rtc_typing_example.py)
def NewUser(name, categories, attributes=None):
    '''
    Create a new user.

    :param name: username
    :param categories: iterable with categories the user belongs to
    :param attributes: optional dictionary of attributes
    :return: boolean indicating database write success
    '''
    assert name and categories
    assert isinstance(name, str)
    assert hasattr(categories, '__iter__')
    assert not attributes or (hasattr(attributes, '__getitem__') and hasattr(attributes, '__setitem__'))

    user_obj = User(name, categories, attributes)
    return save_to_database(user_obj)

Notice the following:

  • We assert that the mandatory arguments exist: the first assert guarantees that neither argument is None, and it also catches empty strings/iterables.
  • We assert that they have the interface/methods that we expect (more on that below).
  • We allow an optional argument which can be None or a dictionary-like object but nothing else.

Notice in the above example, I did not do either of the following:

assert isinstance(categories, list)    #BAD
assert isinstance(attributes, dict)    #BAD

Why not?

Assert Behavior (or Interface), Not Identity

One of the many beautiful things about Python is that we don’t usually need to care what exact class a given object is, as long as it exposes the methods/behavior (aka interface) that we need. If I wrote the bad example above and a subsequent user of the function passed in a duck-typed dict-like object, the function would fail. That would suck and is unnecessarily restrictive.

Instead, assert the presence of the methods we require. For most uses, the minimum interface of a dictionary-like object is the ‘__getitem__’ and ‘__setitem__’ methods, so we’ll make sure they exist and nothing else. Similarly, the minimum interface of an iterable is the ‘__iter__’ method. We assert the existence of both of those above.

You could create helper functions to make the code a bit more concise:

Typed Example 2 (rtc_typing_example2.py)
def is_dict_like(obj):
    return hasattr(obj, '__getitem__') and hasattr(obj, '__setitem__')

def is_iterable(obj):
    return hasattr(obj, '__iter__')

def is_optional_dict(obj):
    return not obj or is_dict_like(obj)

def NewUser(name, categories, attributes=None):
    '''
    Create a new user.

    :param name: username
    :param categories: iterable with categories the user belongs to
    :param attributes: optional dictionary of attributes
    :return: boolean indicating database write success
    '''
    assert name and categories
    assert isinstance(name, str)
    assert is_iterable(categories)
    assert is_optional_dict(attributes)

    user_obj = User(name, categories, attributes)
    return save_to_database(user_obj)

Arguably if you want to be more correct you could use the abstract base classes defined in the collections module:

import collections  # in Python 3.3+ these ABCs live in collections.abc
#[...]
assert isinstance(categories, collections.Sequence)
assert attributes is None or isinstance(attributes, collections.Mapping)  # attributes is optional

You’ll notice that in the earlier example I am explicitly testing that “name” is of class str, contradicting the rule. For the base types str, int, float, etc., I don’t see a problem with testing the class directly. There could be instances where this would be wrong (if you’re doing something funky with integer methods, for example). YMMV.

Redundant Verification

Some might argue that if you follow the RTV pattern religiously you will have a lot of redundant verification going on. If Foo() calls Bar() which calls Baz(), passing certain common parameters down, why bother to check them in all three functions?

The reason is that you want the failure to be caught as early in the call stack as possible after the data error occurs. Bad data will always cause a failure somewhere even with no verification at all. The whole point of RTV is to surface the cause more easily by failing fast.

The other reason is that maybe Foo() and Bar() will be decoupled at some point in the future. You want to make sure those parameters are always verified for all users of the functions.

Taking It Further

Since we are using asserts to verify function parameter types, why not also use them inside function bodies (or at the end, before returning values)?

Let’s modify our example function slightly:

Typed Example 3 (rtc_typing_example3.py)
def is_dict_like(obj):
    return hasattr(obj, '__getitem__') and hasattr(obj, '__setitem__')

def is_iterable(obj):
    return hasattr(obj, '__iter__')

def is_optional_dict(obj):
    return not obj or is_dict_like(obj)

def NewUser(name, categories, attributes=None):
    '''
    Create a new user.

    :param name: username
    :param categories: iterable with categories the user belongs to
    :param attributes: optional dictionary of attributes
    :return: dictionary containing the user object fields if successful or None
    '''
    assert name and categories
    assert isinstance(name, str)
    assert is_iterable(categories)
    assert is_optional_dict(attributes)

    user_obj = User(name, categories, attributes)
    result = save_to_database(user_obj)
    assert not result or (is_dict_like(result) and 'name' in result and 'categories' in result and 'attributes' in result)
    return result

Notice we’ve changed the return type of save_to_database() to now return a dictionary of user object values if successful or None if there was a failure. Rather than return the value without interrogation, we assert that the return value fits the structure we are expecting.

Note that I’m not saying to follow this exact pattern in every circumstance; there are certainly places where it would be redundant and verbose:

#stupid and unnecessary
list_of_stuff = ["foo", "bar", "baz"]
assert "foo" in list_of_stuff and "bar" in list_of_stuff and "baz" in list_of_stuff

I do think it is worth verifying the results of certain functions/methods if the results are structured, at least moderately complex, and failure is a possibility. This is especially true for third-party functions, where the return type might change unexpectedly.

Other Solutions

Some Pythonistas might point out that optional type checking already exists in Python 3 in the form of function annotations, which let you specify parameter and return types directly in function and method definitions (e.g., def new_user(name: str) -> bool). With them you could use a module like typeannotations which raises a TypeError exception in the event of a type mismatch.

There’s also MyPy, but it’s not really Python per se; it’s “an experimental Python variant” which supports optional static typing.

I have no problem with these solutions, but I like RTV better.

  • Explicit asserts are more flexible. We don’t just care about class type; we also care about things like “is integer in valid range”, “is string of length N”, “is iterable > N items”, etc. All these assumptions should be asserted.
  • See Assert Behavior section above. Most of the time we don’t want to lock parameters to just one explicit class.
  • No need for third party modules.
  • Works in Python 2.x.
  • Explicit asserts double as documentation and make code intent more clear. They are right there underneath the docstring and not off in some decorator definition somewhere.

Too Slow?

I don’t think this argument holds much water. If asserts are too slow, you are using the wrong language for your project. That said, you can turn asserts into no-ops by passing the -O flag to the Python interpreter. I think this defeats the purpose of writing the type verification in the first place, but it’s an option.