fbpx
dirtSimple.orgwhat stands in the way, becomes the way
The simplest(?) way to do tree-based queries in SQL

The simplest(?) way to do tree-based queries in SQL

The other day I was looking around at “tree” model libraries for Django, and I noticed that there are an awful lot of algorithms people use to manage hierarchies in SQL out there, but the one that is simplest (at least in my view) seems to not be very widely known at all.

There’s the obvious way of simply giving records a “pointer” (foreign key) referring to their parent.  The upside is, it’s easy and it’s normalized: the data means exactly what it says and there’s no duplication.  The downside is, you can’t really retrieve an entire tree from the database, without doing lots of SQL queries.  Likewise if you need to be able to find all the children of X at any moment, or be able check within another query whether any of X’s ancestors are Y.  It really doesn’t work for that sort of thing at all.

That’s when people start coming up with all sorts of tricks, like storing a path string and using prefix searches (with BETWEEN or LIKE), or storing a beginning and end number (aka the “nested sets” approach), or having a fixed maximum hierarchy depth and representing the materialized path using individual fields (used in Drupal’s menu system).

The advantage to these systems is that they don’t take a lot of space, but the disadvantage is that nearly all the maintenance has to be done in the application – often involving quite a few SQL queries!  So, in essence, these approaches trade lots of queries at retrieval time, for lots of queries at update and insert/delete time, and often some additional programming complexity.

But I don’t think I’ve ever seen an article on my preferred way of managing hierarchies in SQL, and that’s using what I call a closure table.

Closure Tables

A closure table gives you the ability to do all the same sorts of “find me all children of X, to depth N” queries as any of the other methods, just by doing a join against the closure table.

But the killer feature of the closure table, is that to add or remove a parent-child relationship, you only need to run *one* SQL query – a query so simple that even the least-powerful SQL databases can be configured to run it as a trigger on the main table!

Let’s take a look at how this works.  A closure table is simply a table that maintains the “transitive closure” of the parent-child relationships in the base table.  So, let’s say you’re modelling a directory structure, and you have a “directory” table, with a foreign key “parent_dir” pointing to each row’s parent directory.

With this structure, you can only query direct (depth 1) relationships, but by adding a “closure” table with fields for “parent”, “child”, and “depth”, you can represent the hierarchy to whatever depth is present.  So, if directory C is a child of directory B, and directory B is a child of A, then the base table would look like this:

idparent_dirname
10A
21B
32C

And the closure table would look like this:

parentchilddepth
110
220
330
121
231
132

In other words, the closure table says, “A is a child of itself at depth 0”, “B is a child of A at depth 1”, and “C is a child of A at depth 2”.  The total number of records in the closure table is equal to the number of records in the base table, times the average depth of the tree(s) in the base table.  (That is, each base table row has a row in the closure table, plus one row for each of its parents.)

Inserting Links

Now wait a minute, you may say.  This looks way more complicated and tricky than a materialized path or even the nested sets thingy.  I mean, how would you even begin to write the code to maintain such a thing?

Well, to make PARENT_ITEM a parent of CHILD_ITEM, the code looks like this:

insert into closure(parent, child, depth)
select p.parent, c.child, p.depth+c.depth+1
  from closure p, closure c
 where p.child=PARENT_ITEM and c.parent=CHILD_ITEM

In other words, it’s something your average SQL database can do without breaking a sweat. I mean, it’s like baby SQL, for crying out loud. Okay, so maybe deliberately doing a cartesian product join isn’t something you everyday, but every database can still do it.

And it’s not even a particularly performance intensive query! It basically just says, “make every parent of PARENT_ITEM (implicitly including itself) a parent of every child of CHILD_ITEM (again including itself), at the appropriate depth.

The cost of doing this operation is directly proportional to the size of the subtree under CHILD_ITEM, so if you’re adding a leaf node to the tree, this will simply insert parent entries for the new leaf, and nothing else. If you’re adding a new parent to an existing tree, on the other hand, then it will insert a new row for every node already in the subtree being “adopted”.

Of course, in other “materialized path” strategies, you’re still paying the same kinds of proportionate maintenance costs to update a deep tree, except that you’re doing string manipulation or resetting a bunch of fields – usually in code that then makes a bunch of SQL queries… and gods help you if your tree is actually too big to fit in memory.

(Even the nested sets model has similar update costs: you can certainly delegate most of the bulk operations to the database, but if you have a frequently-updated, very large tree, the nested sets approach will end up shuffling position pointers a lot more frequently than the closure table will end up with new rows.)

So how does this query actually work? And how do you delete or update relationships?

How This Works

Well, the real secret to making the closure table work is the rows with depth 0. Without that little innovation, you’d actually have to do four separate inserts, or a giant union query (yuck!) to add everything you need for a new link.

Let’s start out with our tables in a “flat” structure, with nobody being a parent of anybody else, but with the (self,self,0) entries already in the closure table:

idparent_dirname
10A
20B
30C
parentchilddepth
110
220
330

Now, if we change the base table to make B into C’s parent:

idparent_dirname
10A
20B
32C

And then run the SELECT part of the link-insertion query with a parent of 2 and a child of 3, we get:

p.parentc.childp.depth+c.depth+1
231

Resulting in a new closure table entry for the added link:

parentchilddepth
110
220
330
231

Okay, that seems easy enough. But what if we now make B a child of A?

idparent_dirname
10A
21B
32C

Now our SELECT (using PARENT_ITEM=1 and CHILD_ITEM=2) returns:

p.parentc.childp.depth+c.depth+1
121
132

And our closure table now looks like this:

parentchilddepth
110
220
330
121
231
132

Voila! Because the closure table contains correct information for every part of the tree that already exists, our query can make use of that information to fill in the rest of the new information that should exist.

Removing Links

Removing a link works similarly, using a query like this:

delete link
  from closure p, closure link, closure c
 where p.parent = link.parent and c.child = link.child
   and p.child=PARENT_ITEM    and c.parent=CHILD_ITEM

This is the exact same query, just flipped so as to delete the rows instead of inserting them.

Now, imagine if you hooked these two queries up to triggers: you could simply do whatever operations you want on the base table, and let the database do the rest!

Of course, you’d need a couple of extra actions when a row was inserted or deleted in the base table: at insert time, you need to insert the (self,self,0) entry in the closure table (followed by an optional link addition if the new row was inserted with a parent), and at deletion time, you need to delete both the self-link, and all the links that depend on it:

delete link
  from closure p, closure link, closure c, closure to_delete
 where p.parent = link.parent      and c.child = link.child
   and p.child  = to_delete.parent and c.parent= to_delete.child
   and (to_delete.parent=TO_DELETE or to_delete.child=TO_DELETE)
   and to_delete.depth<2 

Essentially, this is just a fancy way of doing the earlier link-removal operation for each of the target item’s neighboring links – i.e., ones with depth 0 or 1. (The links with a greater depth are taken care of by the first few lines of the query.)

So now, you have just a little bit of logic to do in three triggers for your base table, to get this set up, and presto! No additional code needed. Just work on the base table normally, and run any hiearchical queries through the closure table. (And make sure you’ve got unique indexes on (parent, depth, child) and (child, parent, depth), so those queries can really fly.)

So Why Aren’t You Doing This Already?

Sure, I know, not every database supports triggers… or do they? Sure, maybe if you’re on MySQL…. and it’s still the 1990’s! SQLite has triggers these days, for crying out loud.

Your database library doesn’t support them? Okay, fine. Write the equivalent of triggers in whatever your library does support – like a post-save signal in Django’s case – and enjoy. (You can probably even set up your closure table as a many-to-many “through” table in Django’s case, though I haven’t actually tried that yet.)

Really, there are only a few reasons not to use a closure table when hierarchical queries are required, and most of them boil down to, “I have a small tree (i.e.human-readable in its entirety) and I need to display the whole thing in hierarchical order using SORT BY”.

And even in the application I first used closure tables for (which had tens of thousands of items in various parent-child relationships), there were also some smaller hierarchies that could’ve used a materialized path or nested sets approach, and if I were doing that application over, I’d have probably used them.

But, if you have a huge set of hierarchical relationships (a large “forest” of large trees), and need to be able to query over the transitive closure (i.e., do “recursive” parent/child lookups from SQL), IMO a closure table is the only way to go. (And if you’re in an “enterprisey” environment where lots of programs touch the data, using triggers or stored procedures to maintain that closure table is a must!)

Join the discussion
33 comments
  • Oracle has had a 'connect-by' clause for ages. I have always assumed that this feature's existence was due to Oracle's business software being configuration driven: hierarchies of interdependent users, user menus, forms and functions. Definitely "enterprise"!!!

    In the 'nested sets' discussion document, the example query gets ugly prompting the author to conclude in his closing sentence:

    "Although this procedure works, you might want to use a language that has arrays in it, instead of trying to stick to pure SQL"

    I wonder if he was thinking python?

    For those working with data residing in Oracle, the connect by syntax is perhaps easier to read. Here is a simple example that demonstrates its use and also a pseudo column 'level' that has been computed and made available.


    select name, emp_id, supervisor_id, level
    from employees
    connect by prior emp_id = supervisor_id
    start with emp_id = 1;

    Unfortunately this is proprietary and not a feature that made it into PostgreSQL. Postgress does nonetheless support nested queries. The following link provides a dynamic solution to using a static closure table.

    http://communities.bmc.com/communities/docs/DOC-9902

  • If your DBMS has support for hierarchical queries, either through recursive CTEs or the CONNECT BY clause, I'd use it in preference to a set of potentially buggy triggers.

    That said, your technique works; I've done it using triggered procedures and it keeps everything up to date quite nicely. It's problematic for a busy DBMS as altering a top node can cause a long running transaction, but there's the same issue with any cascading constraints.

    If someone told me they were doing this as an optimization technique, I'd certainly want to profile and would point out that file systems don't do this.

    "Okay, so maybe deliberately doing a cartesian product join isn't something you everyday, but every database can still do it."

    As a matter of fact, all (non-outer) joins and the set intersection are logically derived from the cartesian product. I've got a pretty detailed explanation here, part II, section 6.

  • I've seen closure table discussions too; what I *haven't* seen are the single-query inserts and deletes, without using procedural code (either in SQL or some other language).

  • Actually, if you think I said I invented this technique or claimed it was original, your reading comprehension needs some work. Nowhere in the article did I say whose it was or who invented it.

    If I had, though, I would have said that I co-invented it with a fellow by the name of Coranth Gryphon, back in early 1998. He had learned SQL roughly three weeks earlier, and I wasn't that terribly much more experienced with it myself.

    (I skimmed a Celko book a few years after and saw the nested sets approach, but not this one.)

    Thanks for pointing out the first instance I've seen on the 'net (or anywhere else) of somebody actually coming *close* to showing how to do this with single SQL statements.

    (Of course, the slides you linked don't actually show any SQL for making or breaking subtree links, (vs. just adding and removing *leaf* nodes), so I hardly could've "copied" something that's you know, not actually there. Just sayin'.)

  • Apart from the connect by clause, hierarchical queries are available using the WITH recursive subquery factoring clause in Oracle. http://download.oracle.com/docs/cd/E11882_01/server.112/e17118/statements_10002.htm#BABCDJDB

    Postgres also has hierarchical queries. Currently they are built in the SQL dialect it uses. http://www.postgresql.org/docs/8.4/static/queries-with.html
    They used to be provided by an add on module that is still available.
    http://www.postgresql.org/docs/8.3/static/tablefunc.html
    Even MS SQL server has WITH clause factoring: http://msdn.microsoft.com/en-us/library/ms186243.aspx
    The closure table approach presupposes some kind of trigger interface which usually comes with some kind of procedural language. Thus I am unable to understand when it is best to use this solution over other straightforwardly procedural solutions. Apart from that it seems that the "closure" table is simply a means of storing explicitly the reachability of vertices in a directed graph. So why stop simply at a tree. Why not a forest or more complicated graph?

  • As long as you don't have any diamonds (i.e., more than one path to get from node A to node B), you can actually represent any sort of DAG you like.

    To put it another way, a node can have more than one parent (belong to more than one tree), as long as none of its depth-1 parents share a common ancestor.

    As for this approach being non-procedural, I mean that it is not row-oriented in the way that many other approaches to structure maintenance are. That is, there is no "looping" or recursion needed to incrementally maintain a closure table, vs. the complexity of some other approaches.

  • Interesting idea but how large would it scale? If the original table of parents and children was several billion nodes and paths were on average dozens of levels deep, the resulting closure table would be a monster.

    Also, what would be the best method, in your opinion, if you had simple relationships (simple meaning nothing belongs to more than one tree) and once inserted none of the relationships ever change, only new parents with their children are added? Would you still do this, or another approach? In other words, you have the difficulty of traversing the tree in the various ways described, but no maintenance issues.

  • If you only ever added children to parents, never the reverse, and never changed anything, a materialized path solution would be superior.

    As for the size of the closure table, it'd simply be D entries times N rows in the base table. If the average depth is "dozens", and the base table is "billions", then the closure table will be "dozens of billions" — but of course there will be less data per row than the base table.

    Really, a closure table is the same as a materialized path turned sideways, so the space used is just a constant multiple of the space that would be used by a materialized path.

    The difference is that both queries and modifications are simpler to do in SQL for a closure table than for a materialized path, and may execute faster in some cases.

  • It looks like one of your sentences was chopped off?

    "…at retrieval time, for lots of queries at update time, and [??]
    But I don't think I've ever seen …"

    Seems like something is missing here?

  • "The total number of records in the closure table is equal to the number of records in the base table, times the average depth of the tree(s) in the base table." – It's (the average depth+1) times (the number of records in the base table).

  • The reason this structure is a good idea is that many real world trees are wide rather than deep (e.g. millions or billions of records but only a dozen or so hierarchical levels). So rather than being O(n^2), the complexity is effectively O(n).

    On a deep tree (e.g. one where the max depth is O(n) as the data grows), this structure becomes inefficient.

    This is not a criticism of the structure – Rather, I'm noting the real world wisdom encoded in the solution choice.

  • Hello, excellent article.

    I am working in a project that I need a similar hierarchy and I think your solution might be the best I 've seen yet. There is small difference, however, and I would like your help:

    the levels I have right now are SuperAdmin -> Admin -> SubAdmin -> Salesman -> Customer. My problem is that the SuperAdmins must be able to see all the users (which means they are the root nodes, but there is not just one SuperAdmin) and the SubAdmins must be able to see all the Salesmen that belong to their Admin (I hope it is clear!).

    Regarding the first, should I create a "tree" with all users for each SuperAdmin as their root? Should I do the same for the second case?

    Waiting for your reply!

    Thanks

  • This approach isn't limited to one root, so you can have as many roots (SuperAdmins) as you want.

    Of course, since you have a fixed hierarchy, you could just have each customer record contain a field for each of the four parent levels, and just run queries that way. This approach is more for if your hierarchies vary in depth.

  • PJE, just reading your article.

    I've got to manage television (series, season, episode tree) and movie programming (flat, though arguably you could put some movies under a "franchise" node).

    I would need to retrieve seasons and episode in order, (season 1, episode 1; season 1, episode 2; season 1, episode n; season 2, episode 1, etc.)

    There are tens of thousands to hundreds of thousands of rows.

    Would you recommend the closure table approach?

  • You don't really need a closure table when you're working with a fixed hierarchy. If you know the levels in advance, you can just include fields for each level, or use a string, like S01E04, and alpha sort it.

  • Great article. It must be noted though that in the last algorithm for the removal of links, the expression "AND to_delete.Depth < 2" seems to be redundant. The operation produces the same results with or without it.

    If the algorithm is being used to delete links in the closure table after the deletion of a node that is the root of a branch of the base table (i.e. DELETE_ITEM), it will only delete those links whose [Parent] is that root node or one of its ancestors. All links whose [Parent] is that of any of the root node's descendants must be deleted separately using something like:

    Additional operation:

    DELETE ClosureTable
    WHERE [Parent] IN (SELECT [Id] FROM DeletedItems WHERE [Id] <> DELETE_ITEM)

    The author does allude to this when he says "you need to delete both the self-link, and all the links that depend on it." The branch's root self-link is taken care of in the author's original algorithm, however all of the root descendants' links, including the self-links are not.

    Given the base table hierarchy below…

    [Id] — [ParentId]
    1 — 1
    2 — 1
    3 — 1
    4 — 1
    5 — 2
    6 — 2
    7 — 3
    8 — 4
    9 — 4
    10 — 4
    11 — 4
    12 — 10
    13 — 10
    14 — 12
    15 — 12

    Let's say we wish to delete the branch whose root is 10 (i.e. 10,12,13,14 & 15 are deleted). The following links in the closure table must be deleted.

    (1,10,2), (1,12,3), (1,13,3), (1,14,4), (1,15,4)
    (4,10,1), (4,12,2), (4,13,2), (4,14,3), (4,15,3)
    (10,10,0), (10,12,1), (10,13,1), (10,14,2), (10,15,2)
    (12,12,0), (12,14,1), (12,15,1)
    (13,13,0),
    (14,14,0),
    (15,15,0)

    The algorithm given by the author will only delete those links in the first three rows. The rest will be taken care of by the additional operation provided above, where 'DeletedItems' is a temporary table that captures the IDs of the base table records that were deleted.

    All of this is not a problem of course if the operation is to move a branch to a new location in the tree, rather than deleting it. In that case the descendants of the branch root will retain their original links to the root and its descendants, and only the links to the root's ancestors need be deleted from the closure table.

    Cheers.

  • Hi there, person above me!

    The issue you found is a non-issue if you were to apply this article's suggestions by refactoring a "naive tree" table into one with a companion closure table.

    You add three triggers to the "naive tree" table (a table with a self-referencing "parent" id):
    Insert trigger with snippet 1
    Update trigger with snippet 1 and 2
    Delete trigger with snippet 3.

    Thus, deleting only the first three rows is preferable in that case, because there are references to node 12, 13, 14, and 15 in the original "naive" table when you delete node 10.

    So, if you wanted to delete the entire subtree, you'd have done the query:

    DELETE FROM naiveTree WHERE id in (SELECT child from closureTable WHERE parent=10);

  • Great article … Would be nice to have a query to produce the full closure table from the (id,parent_id) origin table. I've been trying to produce one and it is elusive. Most attempts produce only one occurrence of each item and either the root or immediate parent only.

    For example:

    SELECT CONNECT_BY_ROOT parentId, id as childId, LEVEL AS depth, name
    FROM Organization
    START WITH parentId is NULL CONNECT BY prior id = parentId

    (DB2 w/Oracle compat)

  • The whole idea here is to use triggers to maintain the closure table from the origin table. If you need to initialize the closure table, create a second table with the triggers, then copy the origin table into the second table so it fires the triggers.

    (Don't forget, too, that the triggers will need to insert 0-depth entries on origin row insertion, and remove them origin row deletion.)

  • Thanks – I broken down and write a brute-force proc to rebuild the closure table (a useful thing to keep if you need to disable triggers for some reason).

    Anyway, the delete trigger has issues in DB2 since it does not allow the "delete link from" syntax. Do you think this is basically the same:

    CREATE OR REPLACE TRIGGER ODIN_Org_Delete_Trigger
    AFTER DELETE ON Organization
    REFERENCING OLD AS OLD
    FOR EACH ROW
    BEGIN
    delete from OrgClosure as O where exists (
    SELECT 1
    FROM OrgClosure p, OrgClosure c
    WHERE p.parentId = O.parentId AND c.childId = O.childId
    AND p.childId = OLD.parentOrgId
    AND c.parentId = OLD.id
    );
    END

    ?

  • That delete trigger looks suspicious to me: it's only joining three copies of the table and the original joins *four*.

    I think the problem is that you're confusing deleting a *link* with deleting a *node*. When you delete a node, you have to do BOTH: first delete the link, then delete the node.

    Likewise, when you insert, you have to insert the node first, then the link. (Your gist currently has that backwards.)

  • Like I said, your example above is only one of the two queries, and it looks like the one for deleting a link, not the one for deleting a node. The one for deleting a node has to be run after deleting the link.

    To put it another way, if your main table has a parent ID in it, then when you delete that row, you are first deleting "child -> parent @ depth 1", and then you must ALSO delete "child -> child @ depth 0", because that link will no longer exist.

dirtSimple.org

Menu

Stay In Touch

Follow our feeds or subscribe to get new articles by email on these topics:

  • RSS
  • RSS
  • RSS

 

Get Unstuck, FAST

Cover photo of "A Minute To Unlimit You" by PJ Eby
Skip to toolbar