twitter
    musings about technology and software development..

Algorithms for storing and querying hierchical trees

I've often found myself needing to represent hierarchical data in my database -- navigation trees, forum threads, organizational charts, taxonomies, etc.  I've been trying different approaches to maintaining a hierarchy, and thought others might be interested in my findings.  For purposes of illustration, our sample tree is the following:

        1
      /   \
     2     4
   /  \
  3    5

Approach #1: Adjacency list
The idea here is simple, you store each node's parent in a table:

table: nodes
  id   parent_id 
1null
21
32
41
52

This is trivial to implement, but hierarchical queries become hard. In order to query for all nodes under a given branch, you have to recurse through its children. If you don't have too many nodes, you can just read the entire table into memory and cache it -- which is sufficient for most web site navigation structures, for example.

Approach #2: Store the Path as a string
Here, the idea is that each node stores its path as string. For example, a node might have a path of "1_8_13". Thus, you could find the children of node "8" by querying for all nodes with a path of "1_8_%".

table: nodes
       id        path 
1
 "1"
2
 "1_2"
3
 "1_2_3" 
4
 "1_4"
5
 "1_2_5" 

This gives you the benefit of hierarchical queries, but only if you add an index on the "path" column, forcing SQL to do the heavy lifting. And, since it's a string column, your performance will not be as fast as if it were integer-based.

Approach #3: Nested subsets
The idea here is that each subtree is kept within a range of IDs, and the range of its subtree is stored in the node. In the example, the subtree of 1 is (obviously) within the range of 1..5. However, you'll notice the subtree of 2 is NOT within the range of 3..5 because node 4 violates that rule. As a result, we need a mutable ID in order to maintain the subset.

table: nodes
       id        mutable_id   min_mutable_id    max_mutable_id  
1115
2234
3333
4555
5444

Note how we had to swap the IDs of 4 and 5, so that node 2 could have a valid nested subset range of 3..4. This can easily happen on insertions as well and force us to recompute large parts of the table if shifting is required. However, hierarchical reads are fairly inexpensive, as they just become numerical range queries.

Approach #4: Expanded tree
The idea here is that you store the normal adjacency list, but maintain another table of the tree already recursively expanded-out:

table: nodes
  id   parent_id 
1null
21
32
41
52

table: nodes_expanded
  id   expanded_parent_id 
11
21
22
31
32
33
41
44
51
52
55
Essentially, the expanded table acts as a hierarchy cache.  For example, to get all nodes under the "2" subtree, just find all nodes with (expanded_parent_id == 2), which will return matches on 2, 3, and 5 as expected.  The main benefit of this approach is that all your SQL queries are based on exact match, whereas the last two approaches use range queries. Likewise, while an insertion will require you to futz with the "nodes_expanded" table, the data in the "nodes" table stays intact. With the nested subsets approach, you may find your main "nodes" table locked on reads while all the IDs get shuffled around.

So, to summarize:
Pros
Cons
Adjacency list
  • Easy to implement
  • Minimum storage
  • Slow calculation of subtrees (can mitigate with in-memory caching)
Path substrings
  • Easy to implement
  • Handles hierarchical queries
  • Relies on SQL index on a string column
  • Inefficient storage (only using 0-9 and "_" in the char range)
Nested subsets
  • Handles hierarchical queries
  • Insertions can be expensive
  • Insertions can result in lock contention
Expanded tree
  • Handles hierarchical queries
  • Hierarchy is pre-cached as a simple "equality" join
  • Requires maintaining separate "nodes_expanded" table
  • Insertions can be expensive, but not against the main "nodes" table

Later, I hope to implement and benchmark each approach against each other.  Any other algorithms worth investigating?

0 comments:

Post a Comment