Tuesday, February 08, 2005

From Nine to Five...

...microseconds, that is.

Just when I thought I couldn't possibly pull any more optimization tricks in pure Python, I found a couple more tonight. First, it occurred to me that because generic function wrappers use a dispatcher instance's __getitem__, I could pre-compute the __getitem__ access, accomplishing two useful things. The first of course is saving the attribute access, but I had actually forgotten that Dispatcher was a classic class, so I wasn't thinking about the attribute access when I got the idea.

Instead, I had remembered that CALL opcodes in the Python interpreter have some special fast-path handling for instance methods and for functions written in Python. None of that fast-path code applies when doing __getitem__, because it doesn't pass through a CALL opcode. So I figured, why not change the generated wrapper to call __getitem__ directly, so the Python interpreter's fast path would apply?

Between the attribute access and the fast path, that took off almost a microsecond, just by making very minor changes to two lines. I was now down around 8.2 microseconds execution time. Whee!

The next thing I did was a data structure change, in order to drop another pair of attribute accesses. The original dispatch tree nodes were instances of a dictionary subclass (DispatchNode) with a lazy loading facility that populates their contents when first used. The dispatch loop checked whether each node was a DispatchNode instance, and continued the loop if so. It also did an attribute access to check whether the current node was initialized, so that it could populate it if not. And another attribute access, to identify the current argument/expression to be examined and the indexing technique being used.

So, I hacked on that a bit, and replaced the whole thing with a list that contained all the data needed, so that the dispatch loop could just do a sequence unpack to get at all the attributes, which is pretty darn fast. By using a list, I could also populate the node when initialization is required. I ended up making DispatchNode not lazy anymore; instead the list element is None until initialization is required, at which point the dispatch loop creates a node instance and slaps it into the list. This also has the interesting side benefit of reducing the memory usage for unexpanded nodes, because a small list uses a lot less memory than even an empty dictionary.

This didn't require too much hacking, although tests were failing all over the place at first. It was certainly a lot more work than the first trick, but in the end it got the runtime down to about 7.5 microseconds.

Oddly enough, though, once the dust had cleared, I actually ended up with a cleaner-looking algorithm, apart from the occasional presence of "magic numbers" as list offsets. Specifically, I was able to use a special list configuration to indicate a terminal node in the dispatch tree, which got rid of the type checking I was doing. Before, a terminal node was any object that wasn't a DispatchNode, but now, all nodes are lists. It's just that a terminal node has None in one of the list elements that would otherwise be filled with a needed value, and another element contains the terminal value. That is, the value to be returned from the dispatcher, or the method(s) to be executed.

This fixed a previous limitation of the implementation that you couldn't actually have a dispatcher return a value of None, because None meant "no matches found". You also couldn't store DispatchNode instances in a dispatcher, because the dispatcher would have tried to dispatch on them. Now, there's no more level confusion, due to the extra level of indirection. It's most unusual to get an actual improvement out of a performance tune-up like this, so it's another pleasant surprise.

At this point, the algorithm now looked low-level enough to be worth another crack at a Pyrex port, so I did the grunt work of doing so. I was then treated to a runtime of over 25 microseconds, almost four times slower!

I was flabbergasted. I went through all the usual list and tuple tricks, and it was still taking over 20 microseconds. After considerable debugging, I finally discovered the problem: I wasn't using the C version of some other routines!

Yep, in the process of splitting out a new extension module specifically for the generic function stuff, I had managed to break an import in one of the Python modules, which was now falling back to using the Python versions of the inequality dispatcher and the Min/Max objects. D'oh! That took two seconds to fix, and the runtime dropped to around 12.5 microseconds, a little less than twice as slow as the Python version.

It took some more debugging to find that the other problem was that the Pyrex version never used its fast-path version of the code, due to an attribute mixup between the C base class and the Python subclass. A quick tweak and bam, we were down to just over 6 microseconds -- the first run that was actually faster than the pure-Python version.

There were still several tweaks left to go; at one point I even tried removing the fast-path code from the Pyrex version, and just using a single all-purpose path. With some tuning, it was just as fast as the pure-Python version, which means that even the "slow path" in C is pretty darn fast.

When I wrapped work for the night, the runtime of the torture test function was 6.01 microseconds, which is just under 5.25 microseconds of overhead compared to the hand-tuned pure Python version of the same function.

It's been a long road, but I think a worthwhile one. There are other aspects of the code that will probably still need some performance tuning, but I'm going to wait until I have more demanding generic function examples to test. At some point, I plan to refactor peak.web to use generic functions for views and traversals, and there should be some interesting and complex examples arising out of that work that I can use for benchmarking.

In the meantime, I'll turn my attention to adding other features and cleanups I'd like to do before PyCon, like getting some examples of simple method combination together, and improving the diagnostic output that shows up when an error occurs. I'd also still like to add pattern matching, and the ability to generate a GraphViz rendering of a dispatch tree, as well as an option to remove rules from an existing generic function. I just don't expect to get to any of those things before PyCon.

One thing I did get to this past weekend, though, was an indexer for 'is' tests. I got the idea from the other dispatch package out there, that's used to registering listeners for arbitrary events from arbitrary senders. So, I thought it would be cool to actually allow generic functions to also dispatch on the basis of an object's identity, but without forcing the object to continue to exist. So now, as long as an object is weak-referenceable, PyProtocols' generic functions can contain rules with 'anArg is someOb' conditions, and the generic function won't hold a reference to 'someOb'. (If someOb isn't weak-referenceable, the rule will still work, but it'll keep the object alive as long as the generic function lives.)

It turns out that this is a cool feature for peak.web and other PEAK facilities that might want to use generic functions to determine rules for things, because it means you don't need to worry as much about garbage collection. Right now, however, the actual rules still stay lodged in the generic function, so you don't want to just keep adding rules indefinitely. In fact, there's no way right now to remove rules at all, which I do want to fix. It would also be nice if the invalidation of a required condition (like an 'is' criterion for an object that no longer exists) would cause the rule to be removed automatically. But those features are a bit further down the road, I think.