Monday, December 27, 2004

More forward-chaining twists

My idea of how the fact-base thing will work keeps evolving in interesting ways. For example, it seems clearer to me now that it's not necessary to restrict facts to tuples, or to require their state to be externalized. Facts can be arbitrary objects as long as they are hashed on immutable aspects of themselves, and compare equal according to those aspects. Fact types can then just be ordinary Python types, and PEAK's "struct" types (tuple subclasses with named fields) would work fine for simple fact types. The generic functions for adding and removing facts can then recognize type by type, and reference facts' attributes instead of tuple items.
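To make that concrete, here's a minimal sketch in modern Python, using `typing.NamedTuple` as a stand-in for PEAK's "struct" types (tuple subclasses with named fields), and a plain per-type bucket where PEAK would use generic functions to recognize fact types one by one. All the names here (`FactBase`, `Employee`, etc.) are hypothetical, just for illustration:

```python
from typing import NamedTuple

# Hypothetical fact types: ordinary Python types whose instances hash and
# compare equal on their immutable field values, like PEAK's "struct" types.
class Employee(NamedTuple):
    name: str
    dept: str

class Dept(NamedTuple):
    name: str

class FactBase:
    """Minimal sketch: facts stored in per-type sets, keyed by Python type."""
    def __init__(self):
        self.by_type = {}

    def add(self, fact):
        # In PEAK this would be a generic function dispatching type by type;
        # here it's just a bucket of hashable facts per type.
        self.by_type.setdefault(type(fact), set()).add(fact)

    def remove(self, fact):
        self.by_type.get(type(fact), set()).discard(fact)

    def facts_of(self, fact_type):
        return self.by_type.get(fact_type, set())

fb = FactBase()
fb.add(Employee("alice", "eng"))
fb.add(Employee("alice", "eng"))   # equal fields -> same fact, no duplicate
fb.add(Dept("eng"))
```

Because the facts hash and compare on their field values, adding the "same" fact twice is a no-op, which is exactly the set semantics a fact base wants.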

An added benefit of using Python types to denote fact types is that types can have additional methods, metadata, or adapters, or any number of other associated Python goodies. This makes it possible, for example, to have a fact base with a rule that notices previously unrecognized fact types, and automatically creates tables in an underlying database for them. Heck, for that matter, you could define fact types for representing a database schema as facts, and then modify the underlying database schema simply by adding or removing schema facts!
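The schema-as-facts idea can be sketched directly against an in-memory SQLite database; the trigger machinery is faked here with plain functions, but the shape is the point:

```python
import sqlite3

# Sketch: "schema facts" whose add/remove actions create and drop tables in
# an underlying database (an in-memory SQLite one, for illustration).
db = sqlite3.connect(":memory:")

def add_schema_fact(table, columns):
    # In the real system this would be a rule firing on a new schema fact.
    db.execute("CREATE TABLE %s (%s)" % (table, ", ".join(columns)))

def remove_schema_fact(table):
    db.execute("DROP TABLE %s" % table)

add_schema_fact("employees", ["name", "dept"])
tables = [row[0] for row in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
```

(String-formatted DDL is fine for a sketch; a real implementation would validate the identifiers first.)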

Yes, that's right, a fact base can reflect its own structure, and because it allows "triggers" of arbitrary complexity, that means you can do things like automatically generating transitive hierarchies as a result of inserting a single parent-child link. Indeed, if fact bases had the equivalent of INSERT-SELECT or DELETE WHERE, they'd be "complete" relationally.
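Here's what the transitive-hierarchy trigger might look like as a sketch, with `Link` and `Ancestor` as hypothetical fact types and a module-level set standing in for the fact base:

```python
from typing import NamedTuple

class Link(NamedTuple):      # a direct parent -> child fact
    parent: str
    child: str

class Ancestor(NamedTuple):  # derived fact: the transitive closure
    ancestor: str
    descendant: str

ancestors = set()

def add_link(link):
    """Hypothetical trigger: inserting one parent-child link extends the
    transitive closure with every newly implied ancestor pair."""
    new = {Ancestor(link.parent, link.child)}
    ups = [a for a in ancestors if a.descendant == link.parent]
    downs = [a for a in ancestors if a.ancestor == link.child]
    new |= {Ancestor(u.ancestor, link.child) for u in ups}
    new |= {Ancestor(link.parent, d.descendant) for d in downs}
    # everything above the parent links to everything below the child
    new |= {Ancestor(u.ancestor, d.descendant) for u in ups for d in downs}
    ancestors.update(new)

add_link(Link("a", "b"))
add_link(Link("b", "c"))
```

After those two inserts the derived `Ancestor("a", "c")` fact appears automatically, with no query ever having been run.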

There are a few interesting odds and ends to sort out, though: specifying criteria for "find" operations, handling "modification" of facts as well as adding or removing them, and dealing with transactions, if applicable.


I was actually wondering whether this could be done using SQLite to implement a fact base directly, possibly using a "memory" database, but the problem there is that SQLite doesn't really have an obvious way to run outside-the-database actions in response to a trigger event. Also, using SQLite would mean that all facts' data would have to be reducible to strings, and we'd go through object<->string conversions at many layers of the system. For it to work well, you'd have to register tons of custom functions. The only compensating gains are the presence of a query language, and automatic query optimization.

But, a query language is something I need anyway, and Python already has a simple one: list comprehensions and generator expressions already have the full generality required of a query language. By extending the current dispatch.ast_builder module to add them to the syntax my parser supports, I can easily build a representation of such queries. Using a greedy join algorithm similar to that of Gadfly 2, but using a generic function to detect when the fact types being joined are in the same underlying database, I could "push down" joins into SQL.
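A generator expression really does have the whole shape of a query already: the `for` clauses are the iteration sources (i.e., joins) and the `if` clauses are the filters. A trivial illustration, with made-up data:

```python
# Two "fact tables" as plain tuples
employees = [("alice", "eng"), ("bob", "sales")]
depts = [("eng", "bldg-1"), ("sales", "bldg-2")]

# Roughly: SELECT name, bldg FROM employees JOIN depts ON dept = dname
query = (
    (name, bldg)
    for (name, dept) in employees
    for (dname, bldg) in depts
    if dept == dname
)
result = sorted(query)
```

The point is that `dispatch.ast_builder` only needs to capture this structure as data, instead of executing it eagerly, to have a query representation worth optimizing.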

In fact, a lot of these algorithms are like a prototype I created in the peak.query package, except that I wrote it before I had generic functions, so a lot of things were far more awkward to implement than they'd be now.

Still, I don't need all that sophistication for the initial fact base implementation. It really only needs to be able to do "flat fact" queries, which is to say just filters and no joins or summarization. The nature of the fact base is such that it's trivial to create more sophisticated queries as "materialized views", i.e., that contain the appropriate summaries or totals stored under new fact types. Only for operations that would create huge materialized views would it be necessary to use a query-only approach. And even then, if a database underlies the relevant fact types, the fact type could be queried using the underlying database. This approach could be kind of annoying to deal with, but it's really only a stopgap anyway until we have full queriability.
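A materialized summary view can be sketched as derived facts kept up to date by the add/remove triggers, so that read time only ever needs a flat filter. Here the "summary fact type" is just a `Counter`, which is an assumption of this sketch rather than anything in the design:

```python
from collections import Counter

# Derived facts: employee count per department, maintained incrementally
dept_counts = Counter()

def add_employee(name, dept):
    dept_counts[dept] += 1      # trigger fired on fact insertion

def remove_employee(name, dept):
    dept_counts[dept] -= 1      # trigger fired on fact removal
    if not dept_counts[dept]:
        del dept_counts[dept]   # drop the derived fact entirely

add_employee("alice", "eng")
add_employee("bob", "eng")
add_employee("carol", "sales")
remove_employee("bob", "eng")
```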

Interestingly, as soon as we have the ability to parse comprehension queries, we can use that representation to do view materialization, and in principle it should be about as easy to do that as to do queries. In both cases, the abstract representation of the query indicates items to be iterated over, and may need to know about indexes and perform joins. The only real difference is that for view materialization, you have a "table" with only one "row", which you then join to the other "tables" in the query. A view materialization joining N tables is thus roughly equivalent to a query joining N-1 tables, so for all practical purposes, to implement one is to implement the other. A query language of sufficient power is a query language of sufficient power!

And, once that query language exists, it is also of sufficient power to be a format for expressing fact types in the first place. That is, it would no longer be necessary to plan out all the intermediate fact types and materialization rules, you could just specify rules as queries to be matched, and actions to take when items are added, changed, or removed from the result set of that query.

Hm. There's still a lot I don't know about here, and I'm not sure all of it should go into dispatch.factbase. Actually, I'm pretty sure most of it doesn't belong there, and would properly be part of peak.query. The basic fact base system should just handle simple things, although I can see that it's probably possible to implement the core parts of the query language in the dispatch package as well.

The key is (again) generic functions, which can implement some quick-and-dirty approaches in the base system, and then get much fancier in an extension that supports SQL mapping. Indeed, the base system might just use cartesian products for joins, with no pre-indexing, just doing on-the-fly hash-joins. Actually, once such a hash join is done once, it might make sense to save the index thus created, and reuse it. But again, this can be an extension to the base facility, specifically for in-memory fact types. We could even track how often the indexes are used, and drop indexes that aren't pulling their weight. Such features would be very useful in terms of not having to think about indexing. I don't know if they could be applied directly to SQL mapping, but it's probably not that important.
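The "do the hash join once, then keep the index around" idea might look like this sketch, where the cache key (the identity of the right-hand fact set plus the join column) is just an assumption made for illustration:

```python
def hash_join(left, left_col, right, right_col, _index_cache={}):
    """Sketch: an on-the-fly hash join whose index is saved and reused
    on later calls against the same right-hand fact set."""
    key = (id(right), right_col)
    index = _index_cache.get(key)
    if index is None:
        # build the hash index once...
        index = {}
        for row in right:
            index.setdefault(row[right_col], []).append(row)
        _index_cache[key] = index  # ...and keep it for next time
    for row in left:
        for match in index.get(row[left_col], []):
            yield row, match

emps = [("alice", "eng"), ("bob", "sales")]
deps = [("eng", "bldg-1")]
joined = list(hash_join(emps, 1, deps, 0))
```

An `id()`-keyed cache is obviously fragile against mutation; a real version would invalidate (or incrementally update) the index when facts are added or removed, and could track hit counts to drop indexes that aren't pulling their weight.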

Hm, indexes could actually be facts themselves. This is really blowing my mind. It's "facts all the way down", really. And when you run a query, it may be that some fact types are virtual, anyway... That is, they could actually be computed in response to a query, the way Gadfly 2 allows "infinite tables" to exist, so long as you join them with something that finite-izes them.

This can be used to represent functions and operators as well as fact types... ooh... I think I see now how to handle relationship inversion. That is, if you have an expression like "x,y for x in foo for y in x.bar", then you can logically treat this the same as "x,y for x in foo for y in foobar_type if x.bar is y". Which seems like a stupid thing to do, except that it's now reversible to "if y.baz is x", if there is an inverse relationship "baz".

Anyway, the trick is to look at it as something like "if bar(x,y)", where "bar" is a virtual table linking x->y. The system can then "join" the tables "foo(x)" and "bar(x,y)" in order to produce the result. (Unless "bar(x,y)" implies "foo(x)", in which case it suffices to query "bar(x,y)"!) Note that if another condition in the query specifies a further restriction on "y", we might first query "bar(x,y)", and then join it to "foo(x)", instead of the other way around.
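A tiny sketch of why treating the relationship as a virtual table of pairs makes it reversible: the same set of (x, y) links answers the query in either direction, which is what makes "x.bar is y" interchangeable with "y.baz is x". The data here is made up:

```python
foo = {"a", "b"}                               # the foo(x) "table"
bar = {("a", 1), ("a", 2), ("b", 3), ("c", 4)}  # virtual table linking x -> y

# forward: start from foo(x), then join bar(x, y)
forward = sorted((x, y) for x in foo for (bx, y) in bar if bx == x)

# reverse: a restriction on y makes it cheaper to start from bar(x, y)
# and join back to foo(x) afterwards
small_y = {3}
reverse = sorted((x, y) for (x, y) in bar if y in small_y and x in foo)
```

Note that ("c", 4) drops out in both directions, because "c" isn't in foo; the virtual table alone doesn't imply foo(x) here.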

So, we quite literally reduce the object query to the domain relational calculus, turning every part of every expression in the 'for' and 'if' clauses of the query into fact types and variables that get merged into the overall logical expression. We also have to turn 'in', 'not in', and 'exists()/not exists()' and other summarizing subexpressions into existential quantifiers. And I'm not at all sure how to express outer joins. But the basics are looking pretty straightforward right about now. Anyway, once we have transformed to the domain relational calculus, we map each fact type to an implementation and we do the Gadfly greedy join algorithm to assemble smart joins, possibly augmented by some data about what fact types live in a system that can do the joins itself.
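The greedy join-ordering step might be sketched like this. The size estimates, the fixed 10x "filtering factor" for connected joins, and the `joinable` predicate are all assumptions of the sketch, not Gadfly's actual cost model:

```python
def greedy_join_order(sizes, joinable):
    """Sketch of a Gadfly-2-style greedy heuristic: start from the
    smallest relation, then repeatedly pick whichever remaining relation
    keeps the estimated intermediate result smallest."""
    remaining = dict(sizes)                 # name -> estimated row count
    first = min(remaining, key=remaining.get)
    order, est = [first], remaining.pop(first)
    while remaining:
        def cost(name):
            if any(joinable(name, done) for done in order):
                return est * remaining[name] // 10   # shared-variable join
            return est * remaining[name]             # cartesian product
        nxt = min(remaining, key=cost)
        est = cost(nxt)
        order.append(nxt)
        remaining.pop(nxt)
    return order

pairs = {frozenset(p) for p in [("orders", "customers"), ("orders", "items")]}
plan = greedy_join_order(
    {"orders": 1000, "customers": 100, "items": 5000},
    lambda a, b: frozenset((a, b)) in pairs,
)
```

The heuristic correctly delays "items" until "orders" is in the plan, because joining it to "customers" alone would be a cartesian product.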

Ideally, the resulting structure should be able to be reused for other queries later, and further filtered, although it's not clear to me that such filtering wouldn't necessarily lead to a different ideal join order, so it's not certain that you'd actually save any time that way. As long as you can cache a specific query's plan, that's probably sufficient.

The domain relational calculus form of the query would also be usable for view materialization, since you'd just go through looking for the non-virtual fact types in the query, and for each such fact type, create a clone of the query that has that fact type removed from it, and instead takes parameters for a single fact of the removed fact type. This query can then be used when the removed fact type has a fact added or removed, to compute the derived fact(s) that should be added or removed. Of course, if there is only a single concrete fact type present, then there is no need to query for other items, and the subscription simply creates a one-to-one fact type mapping.
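The clone-the-query trick can be sketched with a two-type view: the "employee is in building" view over hypothetical Employee and Dept facts. Each handler below is the original join query with one fact type replaced by a single parameter fact:

```python
employees = set()    # (name, dept) facts
depts = set()        # (dept, bldg) facts
view = set()         # materialized: (name, bldg)

def on_employee(name, dept):
    # the view query, with the Employee "table" replaced by one parameter fact
    view.update((name, bldg) for (d, bldg) in depts if d == dept)
    employees.add((name, dept))

def on_dept(dept, bldg):
    # the same query, this time parameterized on a single Dept fact
    view.update((n, bldg) for (n, d) in employees if d == dept)
    depts.add((dept, bldg))

on_employee("alice", "eng")
on_dept("eng", "bldg-1")
on_employee("bob", "eng")
```

Whichever side a fact arrives on, the cloned query joins it against the other concrete fact type to compute exactly the derived facts to add.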

Whew. That's a lot of stuff to implement, and doing it incrementally is going to be an interesting challenge. I'm going to have to remember to start out very simply. The hard part of simplicity, though, is in doing things simply without making simplifying assumptions. That is, if I made simplifying assumptions like "all fact types are known a priori" or "all fact types derive from some base or implement a common interface", or "there is some object that implements a given fact type", etc., then it would be easier to write in some ways, but it would not necessarily be as simple a system.

If, however, I can reduce the system to operations, instead, then these can be implemented by generic functions, and the system will retain full generality and maximum clarity. (Because code that invokes generic functions tends to be exquisitely simple; almost all of the complexity moves to the selection of an appropriate method that actually performs the behavior.) This flexibility is especially important for e.g. query parsing, where different mappings from subexpressions to fact types may be important for different application domains.
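The stdlib's `functools.singledispatch` makes a (much less capable) stand-in for PEAK's generic functions, but it shows the shape of the idea: the calling code stays exquisitely simple, and per-fact-type behavior lives in separately registered methods. The fact types and the `log` list here are purely illustrative:

```python
from functools import singledispatch

@singledispatch
def store(fact):
    # default method: no registered storage for this fact type
    raise NotImplementedError("no storage for %s" % type(fact).__name__)

log = []

@store.register
def _(fact: tuple):
    log.append(("tuple-store", fact))

@store.register
def _(fact: str):
    log.append(("string-store", fact))

store(("a", "b"))     # dispatches on type: tuple handler
store("hello")        # string handler
```

New fact types get supported by registering new methods, with no change to any code that calls `store()`; PEAK's dispatch package generalizes this to predicates and multiple arguments.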

Designing in terms of operations is something I still find a bit harder than more concrete designs, but I think this is an artifact of being used to thinking OO thoughts, not functional-programming ones. However, so far I've found that the gains in ease of programming usually more than outweigh the minor discomfort of thinking about things differently.

Well, it's getting late, so I'll have to pick this up again later. Probably what I'll do to start is just create a fact base engine with no built-in query facilities apart from exact matches. Then, I'll see if I can "evolve" more sophisticated query facilities, perhaps by creating "index" facts and seeing what I can build up from there.