twitter
    musings about technology and software development..

Algorithms for storing and querying hierchical trees

I've often found myself needing to represent hierarchical data in my database -- navigation trees, forum threads, organizational charts, taxonomies, etc.  I've been trying different approaches to maintaining a hierarchy, and thought others might be interested in my findings.  For purposes of illustration, our sample tree is the following:

        1
      /   \
     2     4
   /  \
  3    5

Approach #1: Adjacency list
The idea here is simple, you store each node's parent in a table:

table: nodes
  id   parent_id 
1null
21
32
41
52

This is trivial to implement, but hierarchical queries become hard. In order to query for all nodes under a given branch, you have to recurse through its children. If you don't have too many nodes, you can just read the entire table into memory and cache it -- which is sufficient for most web site navigation structures, for example.

Approach #2: Store the Path as a string
Here, the idea is that each node stores its path as string. For example, a node might have a path of "1_8_13". Thus, you could find the children of node "8" by querying for all nodes with a path of "1_8_%".

table: nodes
       id        path 
1
 "1"
2
 "1_2"
3
 "1_2_3" 
4
 "1_4"
5
 "1_2_5" 

This gives you the benefit of hierarchical queries, but only if you add an index on the "path" column, forcing SQL to do the heavy lifting. And, since it's a string column, your performance will not be as fast as if it were integer-based.

Approach #3: Nested subsets
The idea here is that each subtree is kept within a range of IDs, and the range of its subtree is stored in the node. In the example, the subtree of 1 is (obviously) within the range of 1..5. However, you'll notice the subtree of 2 is NOT within the range of 3..5 because node 4 violates that rule. As a result, we need a mutable ID in order to maintain the subset.

table: nodes
       id        mutable_id   min_mutable_id    max_mutable_id  
1115
2234
3333
4555
5444

Note how we had to swap the IDs of 4 and 5, so that node 2 could have a valid nested subset range of 3..4. This can easily happen on insertions as well and force us to recompute large parts of the table if shifting is required. However, hierarchical reads are fairly inexpensive, as they just become numerical range queries.

Approach #4: Expanded tree
The idea here is that you store the normal adjacency list, but maintain another table of the tree already recursively expanded-out:

table: nodes
  id   parent_id 
1null
21
32
41
52

table: nodes_expanded
  id   expanded_parent_id 
11
21
22
31
32
33
41
44
51
52
55
Essentially, the expanded table acts as a hierarchy cache.  For example, to get all nodes under the "2" subtree, just find all nodes with (expanded_parent_id == 2), which will return matches on 2, 3, and 5 as expected.  The main benefit of this approach is that all your SQL queries are based on exact match, whereas the last two approaches use range queries. Likewise, while an insertion will require you to futz with the "nodes_expanded" table, the data in the "nodes" table stays intact. With the nested subsets approach, you may find your main "nodes" table locked on reads while all the IDs get shuffled around.

So, to summarize:
Pros
Cons
Adjacency list
  • Easy to implement
  • Minimum storage
  • Slow calculation of subtrees (can mitigate with in-memory caching)
Path substrings
  • Easy to implement
  • Handles hierarchical queries
  • Relies on SQL index on a string column
  • Inefficient storage (only using 0-9 and "_" in the char range)
Nested subsets
  • Handles hierarchical queries
  • Insertions can be expensive
  • Insertions can result in lock contention
Expanded tree
  • Handles hierarchical queries
  • Hierarchy is pre-cached as a simple "equality" join
  • Requires maintaining separate "nodes_expanded" table
  • Insertions can be expensive, but not against the main "nodes" table

Later, I hope to implement and benchmark each approach against each other.  Any other algorithms worth investigating?

Windows 7 Shortcuts

Just thought I'd share some shortcut keys I use all the time:

Windows + D: Show Desktop
Windows + Tab: 3D Flip
Windows + #: Runs the #'th program on your Quick Launch

And in Explorer:

Shift+Right-Click on a folder/file: Additional options like "Open command window here"
Alt+Up: Goes up a folder level in Windows Explorer (plus Alt+Left/Right for Back/Forward)

My love for dependencies ...

Once upon a time, we had a developer whose full-time job was debugging random issues in some particular feature.  That feature had a dependency on an external team who had no vested interest in this feature, and therefore using their library was a bit like using chopsticks (their library) to eat steak (of course our feature is the delicious steak).  Sure, you can use the chopsticks, but every time you do you question whether you'd be better off without them and just eating the steak with your hands.

Couldn't get any worse, right?

So, when a different team approached us with a product that was a perfect fork and knife that they used to eat steak every day for the last three years, we chomped at the bit to get a hold of it.  Long story short, their utensils were made of plastic and were constantly breaking, and now we have two developers whose full-time jobs are debugging random issues in this feature. 

We long for the days of having chopsticks to eat our steak.  Do not take dependencies lightly.

Hard Drive Backup with Live Mesh

I hope everyone out there is backing up their data.  Up until now, I've used the tried-and-true method of copying my files periodically to another drive.  Of course, in the event of data catastrophy, I would lose all my changes since the last xcopy .. which was .. about 9 months ago.  A file backup gestation period, if you will.

In any case, I'm now using Live Mesh.  It's cross-platform and you get 5gb of online storage for free (you can sync unlimited data between machines).  I've synchronized my musics, videos, and documents between all my machines which is pretty fantastic.  In case you want to try it, here's what I would have liked to know beforehand:
  • You cannot synchronize your Desktop folder.
  • Your first 5gb of synchronized files ends up in the cloud. Choose wisely.
  • You have to login with a LiveID, but it doesn't share cookies with the browser.  So, if you will ever want to sync with a friend, create a new LiveID to share.
  • When you add a folder to be sync'd, it will show up on every other machine as a virtual folder.  This can be very confusing when you've named them all "Documents" -- prefix folder names with the computer name.
My next step is to set up a sync with my a friend in another state, in case my home with all my computers burns down.  Overall, it was pretty easy to setup, although I now have a paranoia that one node will decide to delete something, and spontaneously trigger all my files to be deleted on every machine simultaneously.