Saturday, November 06, 2010

The simplest(?) way to do tree-based queries in SQL

The other day I was looking around at "tree" model libraries for Django, and I noticed that there are an awful lot of algorithms people use to manage hierarchies in SQL out there, but the one that is simplest (at least in my view) seems to not be very widely known at all.

There's the obvious way of simply giving records a "pointer" (foreign key) referring to their parent.  The upside is, it's easy and it's normalized: the data means exactly what it says and there's no duplication.  The downside is, you can't really retrieve an entire tree from the database, without doing lots of SQL queries.  Likewise if you need to be able to find all the children of X at any moment, or be able check within another query whether any of X's ancestors are Y.  It really doesn't work for that sort of thing at all.

That's when people start coming up with all sorts of tricks, like storing a path string and using prefix searches (with BETWEEN or LIKE), or storing a beginning and end number (aka the "nested sets" approach), or having a fixed maximum hierarchy depth and representing the materialized path using individual fields (used in Drupal's menu system).

The advantage to these systems is that they don't take a lot of space, but the disadvantage is that nearly all the maintenance has to be done in the application -- often involving quite a few SQL queries!  So, in essence, these approaches trade lots of queries at retrieval time, for lots of queries at update and insert/delete time, and often some additional programming complexity.

But I don't think I've ever seen an article on my preferred way of managing hierarchies in SQL, and that's using what I call a closure table.

Closure Tables

A closure table gives you the ability to do all the same sorts of "find me all children of X, to depth N" queries as any of the other methods, just by doing a join against the closure table.

But the killer feature of the closure table, is that to add or remove a parent-child relationship, you only need to run *one* SQL query -- a query so simple that even the least-powerful SQL databases can be configured to run it as a trigger on the main table!

Let's take a look at how this works.  A closure table is simply a table that maintains the "transitive closure" of the parent-child relationships in the base table.  So, let's say you're modelling a directory structure, and you have a "directory" table, with a foreign key "parent_dir" pointing to each row's parent directory.

With this structure, you can only query direct (depth 1) relationships, but by adding a "closure" table with fields for "parent", "child", and "depth", you can represent the hierarchy to whatever depth is present.  So, if directory C is a child of directory B, and directory B is a child of A, then the base table would look like this:

idparent_dirname
10A
21B
32C

And the closure table would look like this:

parentchilddepth
110
220
330
121
231
132

In other words, the closure table says, "A is a child of itself at depth 0", "B is a child of A at depth 1", and "C is a child of A at depth 2".  The total number of records in the closure table is equal to the number of records in the base table, times the average depth of the tree(s) in the base table.  (That is, each base table row has a row in the closure table, plus one row for each of its parents.)

Inserting Links

Now wait a minute, you may say.  This looks way more complicated and tricky than a materialized path or even the nested sets thingy.  I mean, how would you even begin to write the code to maintain such a thing?

Well, to make PARENT_ITEM a parent of CHILD_ITEM, the code looks like this:

insert into closure(parent, child, depth)
select p.parent, c.child, p.depth+c.depth+1
  from closure p, closure c
 where p.child=PARENT_ITEM and c.parent=CHILD_ITEM

In other words, it's something your average SQL database can do without breaking a sweat. I mean, it's like baby SQL, for crying out loud. Okay, so maybe deliberately doing a cartesian product join isn't something you everyday, but every database can still do it.

And it's not even a particularly performance intensive query! It basically just says, "make every parent of PARENT_ITEM (implicitly including itself) a parent of every child of CHILD_ITEM (again including itself), at the appropriate depth.

The cost of doing this operation is directly proportional to the size of the subtree under CHILD_ITEM, so if you're adding a leaf node to the tree, this will simply insert parent entries for the new leaf, and nothing else. If you're adding a new parent to an existing tree, on the other hand, then it will insert a new row for every node already in the subtree being "adopted".

Of course, in other "materialized path" strategies, you're still paying the same kinds of proportionate maintenance costs to update a deep tree, except that you're doing string manipulation or resetting a bunch of fields -- usually in code that then makes a bunch of SQL queries... and gods help you if your tree is actually too big to fit in memory.

(Even the nested sets model has similar update costs: you can certainly delegate most of the bulk operations to the database, but if you have a frequently-updated, very large tree, the nested sets approach will end up shuffling position pointers a lot more frequently than the closure table will end up with new rows.)

So how does this query actually work? And how do you delete or update relationships?

How This Works

Well, the real secret to making the closure table work is the rows with depth 0. Without that little innovation, you'd actually have to do four separate inserts, or a giant union query (yuck!) to add everything you need for a new link.

Let's start out with our tables in a "flat" structure, with nobody being a parent of anybody else, but with the (self,self,0) entries already in the closure table:

idparent_dirname
10A
20B
30C
parentchilddepth
110
220
330

Now, if we change the base table to make B into C's parent:

idparent_dirname
10A
20B
32C

And then run the SELECT part of the link-insertion query with a parent of 2 and a child of 3, we get:

p.parentc.childp.depth+c.depth+1
231

Resulting in a new closure table entry for the added link:

parentchilddepth
110
220
330
231

Okay, that seems easy enough. But what if we now make B a child of A?

idparent_dirname
10A
21B
32C

Now our SELECT (using PARENT_ITEM=1 and CHILD_ITEM=2) returns:

p.parentc.childp.depth+c.depth+1
121
132

And our closure table now looks like this:

parentchilddepth
110
220
330
121
231
132

Voila! Because the closure table contains correct information for every part of the tree that already exists, our query can make use of that information to fill in the rest of the new information that should exist.

Removing Links

Removing a link works similarly, using a query like this:

delete link
  from closure p, closure link, closure c
 where p.parent = link.parent and c.child = link.child
   and p.child=PARENT_ITEM    and c.parent=CHILD_ITEM

This is the exact same query, just flipped so as to delete the rows instead of inserting them.

Now, imagine if you hooked these two queries up to triggers: you could simply do whatever operations you want on the base table, and let the database do the rest!

Of course, you'd need a couple of extra actions when a row was inserted or deleted in the base table: at insert time, you need to insert the (self,self,0) entry in the closure table (followed by an optional link addition if the new row was inserted with a parent), and at deletion time, you need to delete both the self-link, and all the links that depend on it:

delete link
  from closure p, closure link, closure c, closure to_delete
 where p.parent = link.parent      and c.child = link.child
   and p.child  = to_delete.parent and c.parent= to_delete.child
   and (to_delete.parent=DELETE_ITEM or to_delete.child=DELETE_ITEM)
   and to_delete.depth<2 

Essentially, this is just a fancy way of doing the earlier link-removal operation for each of the target item's neighboring links -- i.e., ones with depth 0 or 1. (The links with a greater depth are taken care of by the first few lines of the query.)

So now, you have just a little bit of logic to do in three triggers for your base table, to get this set up, and presto! No additional code needed. Just work on the base table normally, and run any hiearchical queries through the closure table. (And make sure you've got unique indexes on (parent, depth, child) and (child, parent, depth), so those queries can really fly.)

So Why Aren't You Doing This Already?

Sure, I know, not every database supports triggers... or do they? Sure, maybe if you're on MySQL.... and it's still the 1990's! SQLite has triggers these days, for crying out loud.

Your database library doesn't support them? Okay, fine. Write the equivalent of triggers in whatever your library does support -- like a post-save signal in Django's case -- and enjoy. (You can probably even set up your closure table as a many-to-many "through" table in Django's case, though I haven't actually tried that yet.)

Really, there are only a few reasons not to use a closure table when hierarchical queries are required, and most of them boil down to, "I have a small tree (i.e.human-readable in its entirety) and I need to display the whole thing in hierarchical order using SORT BY".

And even in the application I first used closure tables for (which had tens of thousands of items in various parent-child relationships), there were also some smaller hierarchies that could've used a materialized path or nested sets approach, and if I were doing that application over, I'd have probably used them.

But, if you have a huge set of hierarchical relationships (a large "forest" of large trees), and need to be able to query over the transitive closure (i.e., do "recursive" parent/child lookups from SQL), IMO a closure table is the only way to go. (And if you're in an "enterprisey" environment where lots of programs touch the data, using triggers or stored procedures to maintain that closure table is a must!)