Tuesday, January 25, 2005

Torture-testing Generic Function Performance

Since I'm claiming that my PyCon presentation will be on "efficient" rules in Python, I thought this weekend that it might be a good idea to actually measure the performance of the current generic function implementation in PyProtocols.

For the purpose, I devised a true torture test. Instead of benchmarking what generic functions are best at (lots of complex rules), I decided to benchmark a scenario where if-then rules are easy to use. Specifically, I wrote a trivial generic function with a mere seven rules, dispatching on single argument: a person's age. Then, I wrote the same function as a hand-tuned, binary tree of if-then statements.

Of course, it took me three times as long to write the hand-tuned version as it did to write the generic function version, and the multiplier would've been much worse if there had been more rules or if more complex conditions than simple range checking were involved. But the point was to make it an obscenely unfair test, akin to comparing the output of an non-optimizing C++ compiler with a hand-tuned routine in assembly code.

Nonetheless, I was shocked and horrified when my first timing run showed the generic function taking about 50 times as long to run as the hand-tuned version. The hand-tuned version ran in about 8 tenths of a microsecond, and the generic function ran in about 40 microseconds.

However, I quickly realized what the first problem was: I was measuring pure dispatch here. That is, these functions did nothing but return a value; they had nothing at all to do, except to decide what to do! So, the correct way to read this result was that the generic function machinery was adding about 40 microseconds of more-or-less constant overhead, which if you look at it that way sounds quite reasonable.

But, 40 microseconds is way too much overhead for my taste, if you want to be able to use generic functions everywhere. For example, I'd like to use them as part of peak.web's traversal mechanism, which itself is only a part of value rendering, which needs to be able to run in fractions of a millisecond. 40 microseconds per call can add up much faster than you think.

Interestingly, for certain input values, the generic function was much, much faster, dropping to less than 20 microseconds. This was because the values were a range test, which meant that if you passed an input value that was an exact match for a constant in one of the rules, it would find the result in a hash table and avoid binary searching. So, the actual dispatch operation in that case was about as fast as that of the hand-written Python function, but there was still another 18 or so microseconds of other overhead.

A little bit of experimentation with other input values revealed the problem: if you were using a value at the extreme ends of the ranges being tested by the rules, an extra function call was taking place, to the __cmp__ method of a Min or Max object. Aha! The first direct indication of something that could use a translation to C. I ported the relevant classes and functions using Pyrex, and the overhead for all input values dropped to just 21 microseconds. Yay!

I also took a look at the flow through the dispatch algorithm, and realized that a big piece of the overhead was coming from argument retrieval. Generic functions let you have rules that test arbitrary expressions, and I had coded for the general case, where any expression you work with might involve arbitrary calculation. So, for testing purposes I hacked up a fast-path version of the same routine, and ran the timing tests on it, getting further improvements but not yet being sure how to implement them in the general case.

So tonight, I took another look at it, and realized that what I needed to do was rearrange the dispatch logic a little so that if your rules only end up looking at arguments (rather than arbitrary expressions involving arguments), the generic function takes the values right from the input argument tuple, rather than passing through the "look up an expression" code. This also allows it to skip allocating an expression cache for each invocation, unless non-argument expressions are actually used. Net result: from 21 microseconds overhead to a mere 13. Not bad at all, considering that I only rearranged some Python code a bit and added no new C.

Those remaining 13 microseconds are probably going to take progressively more and more work to get rid of, and it'll be more and more likely that only going to C will remove the rest. Simple timing tests and past experience have given me a rough idea of how much time each statement in the remaining code is taking, and a lot of those microseconds are just attribute accesses or unavoidable function calls adding up, a fraction of a microsecond at a time. The only way to get rid of them is to move the functionality into C, where attribute lookups can be replaced with pointer dereferences. (Unfortunately, the extra calls won't go away; the abstract generic function still has to invoke the selected rule once it's been chosen, so there's always going to be one extra function call of overhead.)

Well, there are a couple of attribute accesses I could trim (while still in Python), if I could make at least the dispatch algorithm completely lock-free. Then I wouldn't need to acquire and release a thread-safety lock around each dispatch. But that will amount to maybe 1.5 microseconds, a mere 10% of the remaining overhead. And it's going to take some thought to create a provably correct lock-free algorithm.

And when all is said and done, there'll still be some overhead; I'm not going to be able to eliminate it altogether for a torture-test scenario like this. Where generic functions really shine is in optimizing rules too complex for us poor humans to code correctly, or which we're too lazy to hand-tune. For such circumstances, even 20 microseconds of overhead are normally going to be lost in the noise, compared to whatever work it is that the rules actually do. (Remember, these torture tests are using rules that just return a value, with no actual computation after the dispatch.)

It would be interesting, however, to see what the overhead is like in situations where there are lots of complex rules that a human would rather not deal with hand-tuning. On the other hand, actually doing the comparison would suck, because by definition it'd be something you wouldn't want to write by hand. :) So, maybe somebody has something they've already written involving complex rules, that we could then recode as a generic function and see how well it works.