null or empty: Algorithms for storing and querying hierchical trees

I've often found myself needing to represent hierarchical data in my database -- navigation trees, forum threads, organizational charts, taxonomies, etc. I've been trying different approaches to maintaining a hierarchy, and thought others might be interested in my findings. For purposes of illustration, our sample tree is the following:

Approach #1: Adjacency list
The idea here is simple, you store each node's parent in a table:

table: nodes
id	parent_id
1	null
2	1
3	2
4	1
5	2

This is trivial to implement, but hierarchical queries become hard. In order to query for all nodes under a given branch, you have to recurse through its children. If you don't have too many nodes, you can just read the entire table into memory and cache it -- which is sufficient for most web site navigation structures, for example.

Approach #2: Store the Path as a string
Here, the idea is that each node stores its path as string. For example, a node might have a path of "1_8_13". Thus, you could find the children of node "8" by querying for all nodes with a path of "1_8_%".

table: nodes
id	path
1	"1"
2	"1_2"
3	"1_2_3"
4	"1_4"
5	"1_2_5"

This gives you the benefit of hierarchical queries, but only if you add an index on the "path" column, forcing SQL to do the heavy lifting. And, since it's a string column, your performance will not be as fast as if it were integer-based.

Approach #3: Nested subsets
The idea here is that each subtree is kept within a range of IDs, and the range of its subtree is stored in the node. In the example, the subtree of 1 is (obviously) within the range of 1..5. However, you'll notice the subtree of 2 is NOT within the range of 3..5 because node 4 violates that rule. As a result, we need a mutable ID in order to maintain the subset.

table: nodes
id	mutable_id	min_mutable_id	max_mutable_id
1	1	1	5
2	2	3	4
3	3	3	3
4	5	5	5
5	4	4	4

Note how we had to swap the IDs of 4 and 5, so that node 2 could have a valid nested subset range of 3..4. This can easily happen on insertions as well and force us to recompute large parts of the table if shifting is required. However, hierarchical reads are fairly inexpensive, as they just become numerical range queries.

Approach #4: Expanded tree
The idea here is that you store the normal adjacency list, but maintain another table of the tree already recursively expanded-out:

table: nodes
id	parent_id
1	null
2	1
3	2
4	1
5	2

table: nodes_expanded
id	expanded_parent_id
1	1
2	1
2	2
3	1
3	2
3	3
4	1
4	4
5	1
5	2
5	5

Essentially, the expanded table acts as a hierarchy cache. For example, to get all nodes under the "2" subtree, just find all nodes with (expanded_parent_id == 2), which will return matches on 2, 3, and 5 as expected. The main benefit of this approach is that all your SQL queries are based on exact match, whereas the last two approaches use range queries. Likewise, while an insertion will require you to futz with the "nodes_expanded" table, the data in the "nodes" table stays intact. With the nested subsets approach, you may find your main "nodes" table locked on reads while all the IDs get shuffled around.

So, to summarize:

	Pros	Cons
Adjacency list	Easy to implement Minimum storage	Slow calculation of subtrees (can mitigate with in-memory caching)
Path substrings	Easy to implement Handles hierarchical queries	Relies on SQL index on a string column Inefficient storage (only using 0-9 and "_" in the char range)
Nested subsets	Handles hierarchical queries	Insertions can be expensive Insertions can result in lock contention
Expanded tree	Handles hierarchical queries Hierarchy is pre-cached as a simple "equality" join	Requires maintaining separate "nodes_expanded" table Insertions can be expensive, but not against the main "nodes" table

Later, I hope to implement and benchmark each approach against each other. Any other algorithms worth investigating?

null or empty

Algorithms for storing and querying hierchical trees

0 comments:

Post a Comment

About Me

Labels

Blog Archive

Subscribe via email