dirtSimple.org: what stands in the way, becomes the way

Optimization Surprises

This weekend, I took another crack at trimming microseconds off the common-case path for generic function execution, and succeeded in dropping the execution time from 13.2 microseconds to just over 9.8. (Which is about 9 microseconds of overhead relative to a hand-optimized Python version of the same trivial function.) Along the way, however, I realized a couple of surprising things about Python performance tuning.

The first thing I did was to move thread locking out of the critical path, by using double-checked locking (which is another one of those things that doesn’t work in Java, but does in Python). Then, I moved a closure definition so it didn’t occur on the critical path, and finally I simplified the while condition in the dispatch loop, reducing the time for each argument in exchange for a slightly slower exit case. These three things trimmed about 2.5 microseconds off the critical path, about half from moving the locking, and half from the other things.
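The double-checked locking idiom described above can be sketched roughly like this (hypothetical names; the real Dispatcher code differs). The fast path reads the cache without taking the lock at all; the lock is only acquired on a miss, and the check is repeated inside the lock in case another thread filled the cache in the meantime. This is safe in CPython because the GIL makes the unlocked dict read atomic, unlike the classic (broken) Java version of the idiom:

```python
import threading

class CachingDispatcher:
    """Illustrative double-checked locking, not the actual PyProtocols code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def lookup(self, key):
        try:
            return self._cache[key]            # fast path: no lock taken
        except KeyError:
            with self._lock:
                if key not in self._cache:     # re-check under the lock
                    self._cache[key] = self._build(key)
                return self._cache[key]

    def _build(self, key):
        # stand-in for the expensive dispatch-tree expansion
        return ("built", key)
```

The payoff is exactly the one described: the common case pays for one dictionary lookup instead of a lock acquire/release pair.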

At this point, I was absolutely certain that there was nothing more to be gained by shuffling the Python parts of this code, and that the only remaining gains would come from writing C. So I began doing some refactoring to prepare for that. My plan was to move Dispatcher.__getitem__ into a base class that would be in Python if the C version wasn’t available, and which would be in C otherwise. In order to do that, I removed some private name mangling from the locking stuff, so that the base class would be able to access it, and I began the process of lifting out the inner closure, converting it to a class, because Pyrex doesn’t support closures.

After I made these changes, I ran my performance test in order to see whether moving the __getitem__ method was detrimental to performance. To my extreme surprise, the code was faster, by over .8 microseconds.
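For reference, a micro-benchmark of this kind can be run with the standard timeit module (my actual test harness differs; the function timed here is just a stand-in):

```python
import timeit

def bench(stmt, setup, number=100000):
    # time `stmt` repeatedly and report the average cost per call
    # in microseconds, the unit used throughout this post
    total = timeit.timeit(stmt, setup=setup, number=number)
    return total / number * 1e6

usecs = bench("f(1)", "f = lambda x: x + 1")
```

Re-running a benchmark like this several times, as described above, is the only way to distinguish a real change from timing noise.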

It took me quite some time to figure out what was going on. At first, I just put in and backed out the current patch, rerunning my timings, because my first reaction was that it was some kind of timing anomaly. But after rerunning both versions several times, I knew that something was really different, and of course it was in the last place I looked. I tried just doing the name mangling, and just the method move, but these had no effect on the timing. It turned out to be the removal of the closure that did the trick.

Now the really puzzling thing about that was, I had already moved the closure definition off the critical path. So in theory, it shouldn’t have mattered what code was in that block; the block was never executed by the performance test! (Which is why it took me so long to see what was happening.)

Except… the difference between the function as a whole, before and after, is that a function with a closure has “cell variables”, and a function with no closures has only “fast local” variables. Could this be the difference? It turned out that the closure shared four variables with the enclosing function, and on the critical path those variables were set four times and read twice, which added up to almost 9/10ths of a microsecond of additional overhead, compared to fast local variables. (Cell variables involve an extra indirection step to read or write.) So, even though the closure was never created or executed on the critical path, the outer function was still paying for it!
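The effect is easy to verify: merely sharing a variable with a nested function turns it into a cell variable for the entire enclosing function, even along paths where the nested function is never created. A minimal demonstration (not the actual dispatch code):

```python
def without_closure(n):
    x = 0
    for i in range(n):
        x += i                     # x is a fast local: STORE_FAST
    return x

def with_closure(n):
    x = 0
    if n < 0:                      # never taken on the "fast path"
        def helper():
            return x               # sharing x forces it into a cell
    for i in range(n):
        x += i                     # now STORE_DEREF: an extra indirection
    return x

# the mere presence of the closure changes the variable's kind:
assert without_closure.__code__.co_cellvars == ()
assert 'x' in with_closure.__code__.co_cellvars
```

Both functions compute the same result, but every read and write of x in the second one pays the cell-variable indirection, whether or not helper is ever defined.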

Now, you shouldn’t get the idea from this that closures are slow. 99% of the time when I use a closure, it’s effectively to clone a function from a template, producing a specialized version of the function, and the outer function’s performance is irrelevant. About the only time the issue I found would be relevant is in a case like this one, where you’re trying to optimize a fast path through the outer function at the expense of less-common paths. Creating an object instead of a closure adds significant overhead to the paths that previously would have used the closure, so this is a net loss along that execution path. (I hope to eventually make up for that in C by replacing the object with a stack-allocated object array, but that’s a long way away right now.)
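The "clone a function from a template" pattern mentioned above looks like this in miniature (an illustrative example, not code from PyProtocols):

```python
def make_adder(n):
    # the outer function is a template; each call stamps out a
    # specialized inner function closed over a particular n
    def add(x):
        return x + n
    return add

add5 = make_adder(5)
```

Here the outer function runs once per specialization and then gets out of the way, so its own speed is irrelevant; only the generated inner function is ever on a hot path.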

But this was only my first surprise. My next step was to take the new base class and the closure-replacement class, and port them to Pyrex, since they no longer had any unsupported constructs. I then ran the performance test, and performance got worse by almost as much as I had gained by removing the closure!

Now, I already knew that simply compiling a Python function or class with Pyrex does not make it faster. Pyrex doesn’t really generate very efficient code; its emphasis is on interfacing with C code, not on being an optimizing Python compiler. Pyrex generates lots of extra incref/decref code, and it actually performs worse than the Python interpreter at looking up globals or builtins. It also doesn’t know anything about the types of non-C variables, so it uses generic API calls where specific ones might be more efficient. For example, if you’re writing straight C code, and you know you have a tuple, there are tuple-specific macros you can use to access items quickly, using integer offsets. (For that matter, you can use PySequence_GetItem, which doesn’t have to translate the integer into a Python object first.)

But Pyrex doesn’t know any of this, so it just uses the safest (and slowest) API calls available, which requires frequent creation of new integer objects, for example, when you’re working with numbers. Also, the Python interpreter eval loop has lots of type-checking optimizations built into it. For example, when you add two numbers, the interpreter doesn’t just call PyNumber_Add, it first checks to see if the numbers are both integers, and inlines that special case to speed it up. Pyrex would generate huge amounts of code if it tried to do this by typechecking, so it can’t really optimize this case.

Having already encountered these issues in previous use of Pyrex, I was not surprised that the time went up. But I expected it to be concentrated in a few relatively obvious code generation gaffes, easily replaced by swapping out generic Python with specific Python/C API calls. One of the first things I noticed was that a type check that occurred on the critical path was doing a global variable lookup, so I tweaked that, expecting it to be a big part of the overhead. But it barely changed a thing.

So I started looking through the generated C code, trying to see what else might be up. Pyrex was doing a comparison that I knew was numeric, using PyObject_Cmp. So I declared both variables to be integers, but that didn’t really do anything noticeable. After more struggling with the generated code, I gave up. I know that this code could be made to work much faster in C, but Pyrex wasn’t yielding to my usual techniques.

Of course, the problem with those techniques is that they basically amount to writing C code, but with Python syntax, and with a bunch of extra stuff going on behind your back. Usually, the extra stuff is helpful, because it’s making the reference counts correct for you, handling garbage collection support, and all that other good stuff. But it gets in the way when you are trying to call Python/C API functions in order to use more efficient type-specific APIs. So, it may be that if I want to tackle optimizing this one, I’ll need to leave the safety of Pyrex behind and work directly in C.

It’s a pity, though, because Pyrex would be really cool if it were actually a Python compiler, and you could declare Python types like list and tuple and when you used len() or aList[offset] it would translate those into the right API calls. And it would be even cooler if it used Guido’s type declaration syntax, which to me is less awkward than Pyrex’s ugly cdef declarations and angle-bracket casts.

Of course, Pyrex is an extremely valuable piece of work that makes it easy to create Pythonic wrappers for C code. But at this point what I’m wishing for is an actual Python compiler, with type-specializing code generation. I suppose Psyco might help a lot, and it would probably speed up lots of other parts of the system, but somehow relying on it seems unfair for comparison purposes. If I ran Psyco on the generic function, I’d have to also run it on the hand-tuned version, so wouldn’t the relative speeds remain the same? Actually, Psyco might be able to do more with the hand-tuned version, so the comparison might turn out worse. On the other hand, decreased overhead is decreased overhead.

In the long run, I’d like to write a Python compiler, ideally with backends for both C and Java, and options to include debugger calls (so you could step through the generated code using Python’s debugger), and full support for closures and generators as well as type-specialized C API usage and support for foreign types. I’ve done a lot of thinking about the architecture of such a beast, and have concluded that its implementation would mostly consist of the tedious business of implementing the front end and the back end. Which sounds sort of obvious, but what I mean is that there isn’t really much of an architecture. You need back-end objects that understand the structure of a C module definition, class definitions, functions, and so forth. And the front-end needs generic functions to match AST node patterns and tell the back-end what to generate. The tedious part is defining all those back-end code generation patterns, and all the front-end matching patterns.

The only real “architecture” part is the type system, which is to say keeping track of what types expressions are, managing imported types, and possibly some limited type inference, although for my purposes I’d be fine with peppering a routine with type declarations as long as it would make the routine faster. I don’t intend to routinely compile Python code, just performance-critical code and code that needs to interface with other languages.

But right now I don’t have time to do either the tedious parts or the architecture parts before this year’s PyCon, so I suspect I’m going to rest on my laurels at this point and settle for 9.8 microseconds. More specifically, the overhead is really only about 9 microseconds, because the hand-tuned version takes just under .8 microseconds to execute. 9 microseconds is not a lot of execution overhead to pay for a trivial generic function that was three times easier to write than its hand-tuned counterpart.

The main point of generic functions is to save you from having to think about how dozens or hundreds of rules interact, and to allow you to compose functions modularly from rules spread out across your code base. And as with an optimizing compiler, it may not be faster than what you could write by hand, but it’s probably faster than what you actually would write by hand. Fast enough, in other words, to let you focus your attention elsewhere.

(P.S. Is there any more performance I could wring out of the critical path in Python? Actually, probably yes, but maybe not very much compared to what I’d have to do to get the speedup. I could clone the routine’s contents within the routine, thus getting rid of a state variable using the program counter. This would slash two local variable assignments and a try/finally block setup off the fast path. I’m just not sure the gain would be worth duplicating 14 lines of code within a function that’s already 31 lines long. On the other hand, the Pyrex version would have been even more duplication, so maybe I shouldn’t care. Update: I tried it, the duplication’s not too bad, and it took another .4 microseconds off the fast path. Lesson: there’s nearly always something you can do in Python to speed things up!)
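The trick in the P.S., letting the program counter stand in for a state variable by duplicating the fast-path code, can be sketched like this (a hypothetical lookup routine, not the actual Dispatcher.__getitem__):

```python
def expand(table):
    # stand-in for the slow path that fills in more of the dispatch tree
    return dict(table, b=2)

def lookup_with_flag(table, key):
    # state-variable version: the flag is assigned and tested
    # on every call, even when the fast path succeeds immediately
    expanded = False
    while True:
        if key in table:
            return table[key]
        if expanded:
            raise KeyError(key)
        table = expand(table)
        expanded = True

def lookup_cloned(table, key):
    # cloned version: the program counter itself records whether
    # we've expanded, so the flag vanishes from the fast path
    if key in table:
        return table[key]
    table = expand(table)
    if key in table:
        return table[key]
    raise KeyError(key)
```

The duplicated check is the price; the fast path of lookup_cloned does nothing but the lookup itself.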

Join the discussion
  • “””Could you generate if else code from your rule definitions?”””

    In principle, but in the general case it could take a very long time, because the decision tree would have to be completely expanded. Right now, only the used parts of the tree are expanded. Also, the mechanism for incorporating new kinds of indexes would have to be changed in order to support generating code for each kind of indexing. Almost all of the current index types use hash tables, and Python doesn’t have a way to branch within a single routine based on hash tables. And if I include extra function calls so that hash tables can be used in generated code, that’s going to eat up the gains in places that don’t do calls now. (I.e. walking the dispatch tree is iterative right now, not recursive.)

    “””Do you have code around for others to play the optimizing game with?”””

    The routine I’ve been hacking is Dispatcher.__getitem__ in the dispatch.functions module of PyProtocols CVS.

    “””I wonder if boost::python would allow straighter porting from the original python code, or if it would prove a cure worse than the disease…”””

    I don’t really want to be dependent on C++; it’s hard enough to get C extensions built on Windows, without bringing C++ into it.

    What I’m more likely to do is dust off a copy of FunnelWeb, a literate programming tool with parameterized macro support. Before I discovered Python, it was my favorite programming tool, and I’ve used it once before to write a C extension for Python, as well as using it for a few years prior to that for other things. I plan to use FunnelWeb to create an extension or two, building up a library of macros for doing Python extensions, then at some point in the future when I get time to work on the compiler, I’ll convert the macros into a compiler back-end.

    I thought about trying Leo, which is a Python literate programming tool with its own outliner/editor, but it doesn’t appear to support parameterized macros or anything equivalent. If I’m wrong about that, maybe somebody can give me a pointer as to how, though. Maybe it allows extension in Python that would do what I want, I don’t know. If it can be extended from within a literate document, that could be pretty cool.

  • the pypy project (implement python in pure python) is structured
    as a translation framework, that can convert python into other
    forms. they have backends for pyrex, c (i think) and common lisp (this is cool), and more coming …


    armin rigo of psyco is one of the pypy leads and they have
    recently obtained funding in europe so active development
    is happening.

    it’s a great opportunity for all python compilation/translation
    effort got concentrated in one project for maximum impact.

  • inside of an hour I have no doubt you will be able to do
    in Leo anything you can imagine. grab a cvs or latest v4.3
    install PMW, enable a few plugins and look at the code.
    templates or macro plugin might be a starting point.
    post/search on the forums for any questions.
    freenode #Leo, leo.sf.net

    there are simple scripts to call timeit or profile from buttons,
    see test.leo
    I look forward to your first and latest impressions.


  • I played around with it a bit, but quickly found that the features I thought I could hack to make parameterized macros were too far off in form.

    Specifically, Leo doesn’t seem to be able to relax the requirement that chunks are only looked up in children of the current node. That makes it impossible to do any serious tangling of the kind I’d do with FunnelWeb, which seems kind of broken for an LP tool.

    Maybe there are plugins that solve this, but I don’t even know where to look for ’em. So, it’s probably simplest for me to go with FunnelWeb if I do go the route of pure C. In any case, if you see my later post I did in fact finally get the Pyrex code to be faster than the Python code, and I don’t think there’s much more I could strip out now by going to C along that particular path, so I probably don’t need to pull out the FunnelWeb just yet.

  • “””it’s a great opportunity for all python compilation/translation
    effort got concentrated in one project for maximum impact.”””

    I’m certainly looking forward to being able to steal stuff from their type annotation work at the very least. 🙂 I’ve also glanced at Python2C and noticed that it seems to have a more sophisticated approach to reference counting than Pyrex, among other things that it does differently. In general, Python2C looks more performance-oriented than Pyrex, but it also looks rather out-of-date with respect to the current (Python 2.2+) extension model. Anyway, there are lots of interesting Python-related projects afoot these days, and it’ll be interesting to see how they go.

  • “”Specifically, Leo doesn’t seem to be able to relax the requirement that chunks are only looked up in children of the current node. “”
    the idea there would be to clone the node you want included.
    I’ve lost all perspective on what it is to program outside Leo.
    as one example: the ability to run doctest, timeit whatever on the current node w/o generating a file and running it as you have to do elsewhere.
    but ok, fair enough.
    inside of 2 hours you will be able to do anything in Leo you can imagine given the right plugin or script.


  • “””the idea there would be to clone the node you want included.”””

    No, it wouldn’t. 🙂 Cloning makes copies that are the same, and that’s exactly what I *don’t* want.

    What I’m trying to do is things like this… Suppose my code needs a string constant. In C, I need to define this at the top of the file, but add Python initialization for it into the module init function. I may have dozens of such string constants in a complex module, so I don’t want to have to keep jumping to the top and bottom to add them, and I definitely don’t want to forget the initialization. Also, I’d ideally like to be able to say, “here’s the constant and here’s the name, you do the rest”.

    Oh, and the kicker is that I don’t want to do it by copying the data or using a keyboard macro/script that copies it for me. I want it to be declarative, so if I edit the place where I said, “here’s the constant and the name”, stuff just gets updated. Oh, and I shouldn’t have to go anywhere near the declarations or initializations in order to define a new item. I ought to be able to do it in the spot where I need the constant.

    With FunnelWeb, I could do this with two parameterized macros, and anywhere I needed to add a new string constant I could invoke them and add them to the top and bottom sections. It’s not perfect, because I can’t just use one macro, but it’s a lot closer than anything I’ve figured out how to do with Leo.

    here’s some comments of mine from yesterday
    about basically the same topic from someone
    who is probably trying to do something part of the way there.

    you can’t use the template plugin? or see how to extend it?
    that will popup a menu of nodes you’ve marked as templates
    and replace %s in the body with user supplied text.
    I get the difference between manual and automatic. believe it.
    nobody has described the exact use case or produced the code yet.
    I will have to think more about it in terms of creating a c translation of pyrex or of python. the code colorizer code
    could probably be hooked to supply the nodes based on the macro.

    Leo can hook on node selection or exit and at that
    time replace the parametrized section name w/custom node.
    it could recognize while you are typing the macro name then act.
    custom tangling when producing the output file might work.
    there may be other ways to accomplish this I can’t think of now.
    it’s python. anything Leo can do a plugin can tell Leo to do
    including changing where it looks for matches to section references and what it does when it finds them.
    I don’t have enough of the details about how Leo works in this area to say how difficult it would be to modify in this way.
    and this is another reason to go with what you know already works.

    I doubt anyone has the market cornered on what’s the best way to literate code.
    much of the time it’s praised where it seems to me it is mixing implementation details with documentation to no good end.
    is the idea to have no code duplication even when you’re describing code in an overview?
    thanks for the funnelweb link.
    always room to learn a few more tricks.

    “”I shouldn’t have to go anywhere near the declarations or initializations in order to define a new item.””
    this gets a bit trickier. you have to persist the new information somewhere so you aren’t forced to reparse the file at every moment or even on every entry/exit to gather information of what the current state of the translation is.

    I knew I was going to get into trouble basing Leo’s literate performance on your imagination.
    feel free to pile on a few more use cases where this parametric ability to produce code or documentation would be useful and or how they could work. some of it is bound to stick.


