Sunday, February 06, 2005

Optimization Surprises

This weekend, I took another crack at trimming microseconds off the common-case path for generic function execution, and succeeded in dropping the execution time from 13.2 microseconds to just over 9.8. (Which is about 9 microseconds overhead added relative to a hand-optimized Python version of the same trivial function.) Along the way, however, I realized a couple of surprising things about Python performance tuning.

The first thing I did was to move thread locking out of the critical path, by using double-checked locking (which is another one of those things that doesn't work in Java, but does in Python). Then, I moved a closure definition so it didn't occur on the critical path, and finally I simplified the while condition in the dispatch loop, reducing the time for each argument in exchange for a slightly slower exit case. These three things trimmed about 2.5 microseconds off the critical path, about half from moving the locking, and half from the other things.
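For concreteness, here's a minimal sketch of what that kind of double-checked locking can look like in Python; the class and attribute names are made up for illustration, not taken from the actual dispatcher code:

```python
import threading

class Dispatcher:
    # Illustrative sketch only; these names are hypothetical, not the
    # real RuleDispatch code.

    def __init__(self):
        self._cache = {}
        self._lock = threading.Lock()

    def lookup(self, key):
        # Fast path: read the cache with no locking at all.
        try:
            return self._cache[key]
        except KeyError:
            pass
        # Slow path: take the lock, then check again in case another
        # thread filled in the entry while we were waiting.
        self._lock.acquire()
        try:
            try:
                return self._cache[key]
            except KeyError:
                value = self._cache[key] = self._build(key)
                return value
        finally:
            self._lock.release()

    def _build(self, key):
        # Stand-in for whatever expensive work the slow path does.
        return ("built", key)
```

Only a cache miss pays for the lock; the re-check inside the lock covers the race where another thread built the entry first. Under CPython's GIL the unlocked dictionary read can't see a half-built entry, which is why the idiom is safe here even though it's notoriously unreliable in Java.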

At this point, I was absolutely certain that there was nothing more to be gained by shuffling the Python parts of this code, and that the only remaining gains would come from writing C. So I began doing some refactoring to prepare for that. My plan was to move Dispatcher.__getitem__ into a base class that would be written in Python if the C version wasn't available, and in C otherwise. In order to do that, I removed some private name mangling from the locking stuff, so that the base class would be able to access it, and I began the process of lifting out the inner closure, converting it to a class, because Pyrex doesn't support closures.
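The base-class arrangement I had in mind follows a familiar pattern; this is just the general shape, with an invented module name rather than the real code:

```python
# General shape of the "use the C version if it's there" arrangement.
# The module name _speedups and the method bodies are invented for
# illustration; they are not the real RuleDispatch names.
try:
    from _speedups import DispatcherBase        # compiled (Pyrex/C) version
except ImportError:
    class DispatcherBase(object):
        # Pure-Python fallback: __getitem__ holds the critical-path
        # dispatch logic when no compiled version is available.
        def __getitem__(self, args):
            raise NotImplementedError("dispatch logic goes here")

class Dispatcher(DispatcherBase):
    # The rest of the dispatcher stays in Python either way.
    pass
```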

After I made these changes, I ran my performance test in order to see whether moving the __getitem__ method was detrimental to performance. To my extreme surprise, the code was faster, by over .8 microseconds.

It took me quite some time to figure out what was going on. At first, I just put in and backed out the current patch, rerunning my timings, because my first reaction was that it was some kind of timing anomaly. But after rerunning both versions several times, I knew that something was really different, and of course it was in the last place I looked. I tried just doing the name mangling, and just the method move, but these had no effect on the timing. It turned out to be the removal of the closure that did the trick.

Now the really puzzling thing about that was, I had already moved the closure definition off the critical path. So in theory, it shouldn't have mattered what code was in that block; the block was never executed by the performance test! (Which is why it took me so long to see what was happening.)

Except... the difference between the function as a whole, before and after, is that a function with a closure has "cell variables", and a function with no closures has only "fast local" variables. Could this be the difference? It turned out that the closure shared four variables with the enclosing function, and on the critical path those variables were set four times and read twice, which added up to almost 9/10ths of a microsecond of additional overhead, compared to fast local variables. (Cell variables involve an extra indirection step to read or write.) So, even though the closure was never created or executed on the critical path, the outer function was still paying for it!
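You can see the distinction directly with the dis module. The toy functions below aren't the dispatch code; they just show that merely mentioning a variable in a nested function turns it into a cell variable for the whole enclosing function:

```python
import dis

def no_closure(x):
    # 'total' is an ordinary fast local: STORE_FAST / LOAD_FAST.
    total = x + 1
    return total

def has_closure(x, build_handler=False):
    # Because the nested function below refers to 'total', the compiler
    # makes it a cell variable for this entire function.  Every access
    # goes through STORE_DEREF / LOAD_DEREF -- an extra indirection --
    # even on calls where 'handler' is never created at all.
    total = x + 1
    if build_handler:
        def handler():
            return total
        return handler
    return total

dis.dis(no_closure)    # shows STORE_FAST / LOAD_FAST for 'total'
dis.dis(has_closure)   # shows STORE_DEREF / LOAD_DEREF for 'total'
```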

Now, you shouldn't get the idea from this that closures are slow. 99% of the time when I use a closure, it's effectively to clone a function from a template, producing a specialized version of the function, and the outer function's performance is irrelevant. About the only time the issue I found would be relevant is in a case like this one, where you're trying to optimize a fast path through the outer function at the expense of less-common paths. Creating an object instead of a closure adds significant overhead to the paths that previously would have used the closure, so this is a net loss along that execution path. (I hope to eventually make up for that in C by replacing the object with a stack-allocated object array, but that's a long way away right now.)
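The harmless, common case is the template use, where the outer function runs once per specialization and its own speed doesn't matter:

```python
# Typical "clone a function from a template" use of a closure: the outer
# function runs once per specialization, so the cell-variable overhead
# inside it is irrelevant.
def make_adder(n):
    def add(x):
        return x + n    # 'n' is captured from the enclosing scope
    return add

add3 = make_adder(3)
assert add3(4) == 7
```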

But this was only my first surprise. My next step was to take the new base class and the closure-replacement class, and port them to Pyrex, since they no longer had any unsupported constructs. I then ran the performance test, and performance got worse by almost as much as I had gained by removing the closure!

Now, I already knew that simply compiling a Python function or class with Pyrex does not make it faster. Pyrex doesn't really generate very efficient code; its emphasis is on interfacing with C code, not on being an optimizing Python compiler. Pyrex generates lots of extra incref/decref code, and it actually performs worse than the Python interpreter at looking up globals or builtins. It also doesn't know anything about the types of non-C variables, so it uses generic API calls where specific ones might be more efficient. For example, if you're writing straight C code, and you know you have a tuple, there are tuple-specific macros you can use to access items quickly, using integer offsets. (For that matter, you can use PySequence_GetItem, which doesn't have to translate the integer into a Python object first.)

But Pyrex doesn't know any of this, so it just uses the safest (and slowest) API calls available, which requires frequent creation of new integer objects, for example, when you're working with numbers. Also, the Python interpreter eval loop has lots of type-checking optimizations built into it. For example, when you add two numbers, the interpreter doesn't just call PyNumber_Add, it first checks to see if the numbers are both integers, and inlines that special case to speed it up. Pyrex would generate huge amounts of code if it tried to do this by typechecking, so it can't really optimize this case.

Having already encountered these issues in previous use of Pyrex, I was not surprised that the time went up. But I expected it to be concentrated in a few relatively obvious code generation gaffes, easily replaced by swapping out generic Python with specific Python/C API calls. One of the first things I noticed was that a type check that occurred on the critical path was doing a global variable lookup, so I tweaked that, expecting it to be a big part of the overhead. But it barely changed a thing.
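The pure-Python analogue of that tweak is the old trick of binding a global or builtin to a local name, since local lookups are cheaper than global ones; this is only an illustration of the idea, not the change I made to the generated C:

```python
# Pure-Python analogue of the tweak (not the Pyrex change itself):
# bind globals/builtins to local names so the hot loop does LOAD_FAST
# lookups instead of LOAD_GLOBAL.
def all_tuples_slow(items):
    for item in items:
        if not isinstance(item, tuple):   # looks up 'isinstance' and 'tuple'
            return False                  # as globals on every iteration
    return True

def all_tuples_fast(items, _isinstance=isinstance, _tuple=tuple):
    # Default arguments are evaluated once at definition time, so inside
    # the loop these names are ordinary fast locals.
    for item in items:
        if not _isinstance(item, _tuple):
            return False
    return True
```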

So I started looking through the generated C code, trying to see what else might be up. Pyrex was doing a comparison that I knew was numeric, using PyObject_Cmp. So I declared both variables to be integers, but that didn't really do anything noticeable. After more struggling with the generated code, I gave up. I know that this code could be made to work much faster in C, but Pyrex wasn't yielding to my usual techniques.

Of course, the problem with those techniques is that they basically amount to writing C code, but with Python syntax, and with a bunch of extra stuff going on behind your back. Usually, the extra stuff is helpful, because it's making the reference counts correct for you, handling garbage collection support, and all that other good stuff. But it gets in the way when you are trying to call Python/C API functions in order to use more efficient type-specific APIs. So, it may be that if I want to tackle optimizing this one, I'll need to leave the safety of Pyrex behind and work directly in C.

It's a pity, though, because Pyrex would be really cool if it were actually a Python compiler, and you could declare Python types like list and tuple, and when you used len() or aList[offset] it would translate those into the right API calls. And it would be even cooler if it used Guido's type declaration syntax, which to me is less awkward than Pyrex's ugly cdef declarations and casting.

Of course, Pyrex is an extremely valuable piece of work that makes it easy to create Pythonic wrappers for C code. But at this point what I'm wishing for is an actual Python compiler, with type-specializing code generation. I suppose Psyco might help a lot, and it would probably speed up lots of other parts of the system, but somehow relying on it seems unfair for comparison purposes. If I ran Psyco on the generic function, I'd have to also run it on the hand-tuned version, so wouldn't the relative speeds remain the same? Actually, Psyco might be able to do more with the hand-tuned version, so the comparison might turn out worse. On the other hand, decreased overhead is decreased overhead.

In the long run, I'd like to write a Python compiler, ideally with backends for both C and Java, and options to include debugger calls (so you could step through the generated code using Python's debugger), and full support for closures and generators as well as type-specialized C API usage and support for foreign types. I've done a lot of thinking about the architecture of such a beast, and have concluded that its implementation would mostly consist of the tedious business of implementing the front end and the back end. Which sounds sort of obvious, but what I mean is that there isn't really much of an architecture. You need back-end objects that understand the structure of a C module definition, class definitions, functions, and so forth. And the front-end needs generic functions to match AST node patterns and tell the back-end what to generate. The tedious part is defining all those back-end code generation patterns, and all the front-end matching patterns.

The only real "architecture" part is the type system, which is to say keeping track of what types expressions are, managing imported types, and possibly some limited type inference, although for my purposes I'd be fine with peppering a routine with type declarations as long as it would make the routine faster. I don't intend to routinely compile Python code, just performance-critical code and code that needs to interface with other languages.

But right now I don't have time to do either the tedious parts or the architecture parts before this year's PyCon, so I suspect I'm going to rest on my laurels at this point and settle for 9.8 microseconds. More specifically, the overhead is really only about 9 microseconds, because the hand-tuned version takes just under .8 microseconds to execute. 9 microseconds is not a lot of execution overhead to pay for a trivial generic function that was three times easier to write than its hand-tuned counterpart.

The main point of generic functions is to save you from having to think about how dozens or hundreds of rules interact, and to allow you to compose functions modularly from rules spread out across your code base. And as with an optimizing compiler, it may not be faster than what you could write by hand, but it's probably faster than what you actually would write by hand. Fast enough, in other words, to let you focus your attention elsewhere.

(P.S. Is there any more performance I could wring out of the critical path in Python? Actually, probably yes, but maybe not very much compared to what I'd have to do to get the speedup. I could clone the routine's contents within the routine, letting the program counter stand in for a state variable and getting rid of that variable entirely. This would slash two local variable assignments and a try/finally block setup off the fast path. I'm just not sure the gain would be worth duplicating 14 lines of code within a function that's already 31 lines long. On the other hand, the Pyrex version would have been even more duplication, so maybe I shouldn't care. Update: I tried it, the duplication's not too bad, and it took another .4 microseconds off the fast path. Lesson: there's nearly always something you can do in Python to speed things up!)
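For anyone curious what that cloning trick looks like in the abstract, here's a made-up sketch of the shape of the change; these functions are invented for illustration and are not the actual dispatch loop:

```python
def handle(item):
    pass    # stand-in for the real per-item work

# Before: one loop body plus a flag that must be assigned and tested on
# every pass.
def process(items):
    in_header = True
    for item in items:
        if in_header:
            if item == "---":
                in_header = False
            continue
        handle(item)

# After: the loop is duplicated, one copy per state, so which loop you're
# in *is* the state -- the program counter replaces the flag.
def process_unrolled(items):
    it = iter(items)
    for item in it:          # "header" state
        if item == "---":
            break
    for item in it:          # "body" state
        handle(item)
```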