Tuesday, December 28, 2004

Fact Types, Fact Sets, and Change Events

After thinking through the fact base design a bit more, I think I have a clearer picture: a fact base is a mapping from fact types to fact sets, that supports rules that are invoked in response to change events. I'm here using "fact type" as effectively the same term used in ORM (Object Role Modelling, not Object-Relational Mapping), although in ORM the term normally implies an elementary fact type, and here I'm using it for any fact type, elementary or compound.

But if you're not familiar with ORM, it's probably easiest to think of a fact type as being like the schema information for a table in a relational DB: the design of the table, not its actual contents. The contents would be a fact set. So, in essence, you ask a fact base for a table of a given design, and it hands you back an implementation, which is a nice trick in and of itself.

However, that sounds almost as though you're going to be specifying a table design, which isn't really the case. The notion of a fact type is completely abstract, and defined operationally. That is, you could use strings or numbers or URLs or a tuple of field names, or whatever the heck you like, as long as you add the implementation for it, which consists (of course) of adding a method to a generic function. So, the FactBase class will have a method like 'implementationFor(factType)', and it will be a generic function, so you could then do something like '@FactBase.implementationFor.when("isinstance(factType, tuple)")' to define how you'll interpret a tuple of fields as a fact type, or '@FactBase.implementationFor.when("isinstance(factType, type)")' to define how you'll interpret a new-style class as a fact type.

This is a pretty good example of the sort of flexibility that adaptation and generic functions lend to an architecture -- the FactBase class itself doesn't really care what a fact type "is", and so is reusable for any number of scenarios, maybe even simultaneously. The only real requirement is that a fact type be a unique dictionary key, so the fact base can cache its implementation (fact set).

Fact sets, on the other hand, implement storage, indexing, and querying on their contained facts. A given fact set class should be applicable to a wide variety of fact types, within certain constraints. However, the fact set itself doesn't need to be tailored for any particular kind of fact type, and ordinarily shouldn't be. Mostly, fact set classes should be created for different kinds of storage, like database tables vs. in-memory, or different locking or index policies, or whether the fact set is a virtual or materialized view.

The glue between the complete abstraction a fact type, and the absolute concreteness of a fact set, is the generic function 'FactBase.implementationFor'. In a given subclass of FactBase, one can add methods to specify how to translate fact types into fact sets in the general case, but also in the specific case. For example, one could say, "In this fact base, fact types are table names in an existing database, so I will write an implementationFor() method that retrieves the named table's schema, and then creates a TableFactSet for the corresponding table." But, you could then also have a rule for, "but if it's fact type 'Foo', then create a MemoryFactSet instead".

I expect, however, that most of the time fact types will be expressed as classes, because that's the most natural thing to do in Python, and it lends itself to potentially using this factbase thing as an object-relational mapper. (Though I'm not yet sure how that aspect is going to play out.)

So, on to rules and change events. The basic idea of these is one I spelled out before, but it's refined a bit more now. FactBase will have another generic function, something like 'factChanged(event,factType,old,new,derived=False)'. You'll add methods to this to respond to facts being added (old=None, new=newFact) , changed (old=oldFact,new=newFact), or deleted (old=oldFact, new=None). The 'derived' flag is set if the change is in response to some other fact changing, because some fact sets will need to know whether they should do anything about it in that case. Materialized views will need to update themselves, but non-materialized views won't, for example.

Oh, forgot to mention. Normally, you'll make changes by performing operations directly on a fact set: add/change/delete stuff. Also, fact sets will support iteration (to retrieve all facts of the type), and __contains__ (so you can check for e.g. 'someFact in factBase[factType]'). Anyway, when you make a change to a fact set, it's responsible for calling factBase.factChanged(), if the change actually resulted in a change. (E.g. if you add a fact that already exists, this shouldn't result in a change event.) And, that call will take place with 'derived=False'.

So, if there are derivation rules on the fact base, like "fact type X = fact type Y where some condition", and an "add" event occurs, then factChanged() will be called recursively by that rule, with derived=True. But, this will actually pass by way of the derived fact type's fact set. That is, if X is the original fact type, and Y is the derived fact type (a subset of X), then the rule will actually call 'factBase[Y].factChanged()' with derived=True. The fact set for fact type Y will then do anything it needs to with this information, and then passes the call back to 'factBase.factChanged()', so that any further derivation rules will be triggered.

This is actually the murkiest area of the design, because there are so many possible ways of deriving one fact type from another, given a few basic concepts like summarizing, calculating, filtering, joining, and merging. And, there are so many ways of actually implementing these concepts, via database tables, database views, in-memory materialized views, and query transformers.

So, I'm not positive that my current idea about how to propagate events is right. What I do know is that derivation rules need to be independent of the actual fact set implementations. Otherwise, your implementation schema constrains your conceptual schema, and that will interfere with both reusability and the design process.

Anyway, the fundamental problem is one of direction. By that, I mean, the direction of change event propagation. You can see this in event driven systems generally, in that when say, a form widget changes an object's field, you don't want the change to cause an event to loop back to the form widget to redisplay the value. Or, I guess I should say, you don't want the redisplay of the value to cause the field to get changed again. Anyway, you don't want an infinite loop, and in our case you don't want multiple copies of the same event to occur, attempting to add or remove a fact more than once.

The reason this is tricky in our case is that we don't know (at the fact type/schema definition level) what direction the fact sets implement changes in, and we don't want to prescribe that "thou shalt only modify such-and-such fact set directly". In principle, we might want to allow a change to a derived fact type to propagate backwards to the original, non-derived fact type. In fact, who's to say what's "real", and what's "derived"? For a given application schema, there could conceivably be implementation circumstances where the relationship should be different.

Given that the whole purpose of this thing is to "do stuff" when we "notice things", we must be able to specify "outward facing" rules that notice a change in facts and do something about it. These are distinct from "inward facing" rules whose purpose is to propagate changes in one fact type to another fact type. And these are potentially distinct from the actual storing or updating of back-end data in response to changes.

So, if derivation rules are inherently bidirectional, this means that any event leaving a given fact type will (in a naive system) return back to the point of origin, resulting in an infinite loop.

So, maybe the solution can be found in USENET techniques: a "path" parameter listing all the fact types that the event has passed through on its way to the current fact type. Or, better yet, since the topology of fact type relationships is more peer-to-peer, we should use the EchoMail "seen-by" technique: a collection of all the fact types to which notification has been sent.

So, initially, a modification operation would invoke something like 'factChanged(event, factType, old, new, set([factType]))', and derivation rules could be conditioned on the target fact type not being in the "seen by" set, and when invoked they would add the current fact type to the set.

In practice it's not quite that easy, because this isn't exactly what EchoMail does. In EchoMail, a node adds all the target nodes to the seen-by set before sending it to any of them. This prevents unnecessary message sends in any sub-network where all nodes have a direct connection to all other nodes. In order to actually duplicate this behavior, we'll need our 'factChanged()' generic function to have a custom method combiner that figures out how the seen-by set should be changed before actually invoking any of the derivation rules. Or, we could get tricky and use rules like 'factType not in seenBy and seenBy.add(factType) is None', which would change the set while the applicable methods were being computed. Because all test expressions used by generic function rules are computed at most once, this would still invoke all the rules that depended on the fact type not being in 'seenBy', but by the time the first rule is invoked, the fact type will be in the seen-by set, and so any recursive calls will not loop back to any fact type that is going to be sent the event directly.

Hm. I'm not sure I like it, unless the trickery can be encapsulated in some way. But I guess that could be done with a function like 'haveSeen(factType,seenBy)' to be used in the actual expression, or perhaps there will just be an API to define derivation rules that will know what fact type you're deriving from what, and handle this bit for you.

Still, this solution is very nice in that it allows us to define all derivations at the schema level, and avoid having to have the fact sets deal in any propagation themselves. So, if you have fact types X and Y that can be derived from each other, you don't need to care at the API level whether you make a change to X or Y, and whether you listen to changes from X or Y. You just do whatever's convenient. And, at the implementation level of X and Y, you can make X a database view of Y or vice versa, according to what makes sense for the application.

But now there's a new trade-off: propagation to all derived fact types, even ones for which no "listeners" exist. (That is, which have no outward-facing rules, and no listeners registered with a fact type). For example, you might have derived fact types that represent weekly, monthly, and quarterly reports of some sort. It's a waste of time to keep updating those totals if you aren't currently displaying a GUI providing real-time updates, or something else that needs the data.

So, there are some additional complexities to be explored here, to distinguish between the abstract concept of "Y can be derived from X", and the practical issue of, "Update Y when you update X". So, in addition to the "seen-by" concept, there may also have to be a (recursive) concept of "cares-about" or "has listeners", such that a fact type only cares about updates if it has listeners currently, or if some fact type that derives from it is listening. And, fact sets that need to know about all changes could count as a listener, too.

But I think this is going to need some more thought. The architecture is starting to sound a bit more like an event-driven system with dynamic routing, and less like a system that's statically routable by generic function(s). At the very least, there are some interesting questions about how to know whether static listeners ("outward-facing rules") exist and therefore that they should be sent events, for example. I think these issues can be resolved, though, it's just going to take some thought.

Another thing that needs further thought is the structure of events themselves. So far, I'm focusing only on "row-level" changes, and not dealing in the potential for mass updates, insertions, or deletions. Propagating these as individual per-row events through a system could be quite costly. In principle, derivations can be applied to set specifications as well as to individual items, but the logic is considerably more complex, and it seems YAGNI-ish to me at best. There's probably nothing that would stop it from being added at a later time, if it turned out to be necessary.

Anyway, I think the fact base system is definitely going to need some sort of API to explicitly deal with derivation and listening, so that you can say, "here's a static rule that wants to know about events regarding X, but it's dynamically enabled/disabled depending on something else", or conversely, "it's statically enabled and we always want it to be notified". In the latter case, it would be flagged as a listener when the fact base is created. The derivation API would take care of the 'caresAbout' and 'seenBy' conditions when it adds methods for you; you would not directly add methods to 'factChanged'. It would also need to use the existence of derivation rules to determine whether a fact types 'caresAbout' another fact type recursively.

This is all a little more complex than what I originally envisioned, as I was thinking that a derivation API would emerge somewhat later in the design/development process. But, my previous vision didn't address all the composability and scalability issues, so that's understandable. Anyway, fact bases are going to have to grow 'listen/unlisten' methods, or something of that sort, and track a "listener count" for each fact type. The count would start at 0 or 1, depending on whether there's a static listener. A derivation graph is then used to recursively propagate the count to indicate that there is also a listener for all the fact types from which the current fact type is derived. (And this needs to use a seenBy map, too.) Anyway, "unlistening" to a particular fact type does the same thing, just decreasing the count instead of increasing it. The derivation API would maintain the derivation graph, and the information about what fact types have initially-enabled listeners, for a given FactBase subclass.

Hm. That's not so bad. It's mostly a lot of stuff to write tests for; the code itself shouldn't be too complicated. The bulk of the implementation should be just a few simple generic functions and some recursion. The base derivation API will really be dealing more in dependency management than the mechanics of fact derivation, which is good because we want the very nature of the facts and schemas to be opaque to the fact base system! That is, we should be able to equally use this API for relational schemas, RDF, XML, or whatever. The point is just to have fact types, fact sets, and change events.

So, this first layer will work with anything, but that means of course that it implements nothing. Our initial tests will use hand-coded derivations, rather than ones derived from queries. Oh, and speaking of queries, I forgot to mention: queries can be fact types, and I indeed expect a significant use of the system in practice might involve things like:
for x,y in factBase[
"(x,y for x in SomeClass"
" for y in OtherClass"
" if x.foo>y.bar)"
print x,y
Interesting, eh? When I've got this whole thing done, that query should be usable as-is whether SomeClass is in memory and OtherClass is stored in an LDAP database; from the point of view of the application, it just won't matter. But in the back-end, the query will be represented efficiently, as long as all the necessary data is in the back-end. (E.g. joining across two databases just isn't going to be efficient, even if you weren't using the fact base mechanism.)

But, that vision is still a ways off. The fact base system I'm describing today will form the basis for it, but first some rules for automatic derivation will have to be implemented, which will mean fleshing out some concrete ways of defining fact types. These things will become part of peak.schema and peak.query, rather than the generic dispatch package, because they will use other parts of PEAK (like the attribute metadata framework). The dispatch package should remain generic, so the raw FactBase class, and some simple in-memory FactSet classes, plus the derivation API, need to be designed in a way that will allow them to be used with any program's notion of what constitutes a fact type or query language.