Relational databases are universally conceived of as an advance over their predecessors, the network and hierarchical models. Superior in every querying respect, they turned out to be surprisingly incomplete when modeling transitive dependencies. Almost every couple of months a question about how to model a tree in the database pops up at the comp.databases.theory newsgroup. In this article I'll investigate two out of four well-known approaches to accomplishing this and show a connection between them. We'll discover a new method that could be considered a "mix-in" between materialized path and nested sets.
Adjacency List
A tree structure is a special case of a Directed Acyclic Graph (DAG). One way to represent a DAG structure is:

create table emp (
  ename   varchar2(100),
  mgrname varchar2(100)
);
Each record of the emp table, identified by ename, refers to its parent mgrname. For example, if JONES reports to KING, then the emp table contains the record with ename = 'JONES' and mgrname = 'KING'.

A typical hierarchical query would ask if SCOTT indirectly reports to KING. Since we do not know the number of levels between the two, we cannot tell how many times to self-join emp, so the task cannot be solved in traditional SQL. If the transitive closure tcemp of the emp table is known, then the query is trivial:

select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING'

The ease of querying comes at the expense of transitive closure maintenance.
Alternatively, hierarchical queries can be answered with SQL extensions: either an SQL3/DB2 recursive query

with tcemp as (
  select ename, mgrname from emp
  union
  select tcemp.ename, emp.mgrname from tcemp, emp
  where tcemp.mgrname = emp.ename
)
select 'TRUE' from tcemp
where ename = 'SCOTT' and mgrname = 'KING';

that calculates tcemp as an intermediate relation, or the Oracle proprietary connect-by syntax

select 'TRUE' from (
  select ename from emp
  connect by prior mgrname = ename
  start with ename = 'SCOTT'
) where ename = 'KING'

in which the inner query "chases the pointers" from the SCOTT node to the root of the tree, and the outer query checks whether the KING node is on the path.
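The recursive query can be tried out without an Oracle or DB2 installation. The following is a minimal sketch, assuming SQLite (through Python's standard sqlite3 module) as a stand-in database; the four sample rows and the helper name reports_to are illustrative assumptions, not part of the article's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text, mgrname text)")
conn.executemany(
    "insert into emp values (?, ?)",
    [("KING", None), ("JONES", "KING"), ("SCOTT", "JONES"), ("BLAKE", "KING")],
)

# Recursive common table expression: start from the direct links stored in
# emp, then repeatedly extend every known (ename, mgrname) pair one level up.
REPORTS_TO = """
with recursive tcemp(ename, mgrname) as (
    select ename, mgrname from emp
    union
    select tcemp.ename, emp.mgrname
    from tcemp join emp on tcemp.mgrname = emp.ename
)
select count(*) from tcemp where ename = ? and mgrname = ?
"""


def reports_to(employee, manager):
    """Return True if employee reports to manager, directly or indirectly."""
    (matches,) = conn.execute(REPORTS_TO, (employee, manager)).fetchone()
    return matches > 0


print(reports_to("SCOTT", "KING"))   # SCOTT -> JONES -> KING
print(reports_to("BLAKE", "JONES"))  # BLAKE is not under JONES
```

The transitive closure is recomputed on every call here, which is exactly the cost that a stored tcemp table trades for maintenance effort.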
Adjacency list is arguably the most intuitive tree model. Our main focus, however, will be the following two methods.
Materialized Path
In this approach each record stores the whole path to the root. In our previous example, let's assume that KING is the root node. Then the record with ename = 'SCOTT' is connected to the root via the path SCOTT->JONES->KING. Modern databases allow representing a list of nodes as a single value, but since the materialized path was invented long before then, the convention stuck to a plain character string of nodes concatenated with some separator, most often '.' or '/'. In the latter case, the analogy to pathnames in the UNIX file system is especially pronounced.

In a more compact variation of the method, we use sibling numerators instead of the nodes' primary keys within the path string. Extending our example:

ENAME    PATH
KING     1
JONES    1.1
SCOTT    1.1.1
ADAMS    1.1.1.1
FORD     1.1.2
SMITH    1.1.2.1
BLAKE    1.2
ALLEN    1.2.1
WARD     1.2.2
CLARK    1.3
MILLER   1.3.1

Path 1.1.2 indicates that FORD is the second child of the parent JONES.
Let's write queries.
1. An employee FORD and the chain of his supervisors:

select e1.ename from emp e1, emp e2
where e2.path like e1.path || '%'
and e2.ename = 'FORD'

2. An employee JONES and all his (indirect) subordinates:

select e1.ename from emp e1, emp e2
where e1.path like e2.path || '%'
and e2.ename = 'JONES'

Although both queries look symmetrical, there is a fundamental difference in their respective performance. If the subtree of subordinates is small compared to the size of the whole hierarchy, then the database can fetch the e2 record by the ename primary key and then perform a range scan of e1.path, which is guaranteed to be quick.
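Both queries can be run against the sample hierarchy. The sketch below assumes SQLite via Python's sqlite3 module rather than Oracle, and the wrapper names supervisors and subordinates are hypothetical; the SQL itself follows the two queries above, with an order by added so that the result order is deterministic.

```python
import sqlite3

ROWS = [
    ("KING", "1"), ("JONES", "1.1"), ("SCOTT", "1.1.1"), ("ADAMS", "1.1.1.1"),
    ("FORD", "1.1.2"), ("SMITH", "1.1.2.1"), ("BLAKE", "1.2"),
    ("ALLEN", "1.2.1"), ("WARD", "1.2.2"), ("CLARK", "1.3"), ("MILLER", "1.3.1"),
]

conn = sqlite3.connect(":memory:")
conn.execute("create table emp (ename text primary key, path text)")
conn.executemany("insert into emp values (?, ?)", ROWS)

# Query 1: every e1 whose path is a prefix of the given employee's path.
SUPERVISORS = """
select e1.ename from emp e1, emp e2
where e2.path like e1.path || '%' and e2.ename = ?
order by e1.path
"""

# Query 2: every e1 whose path starts with the given employee's path.
SUBORDINATES = """
select e1.ename from emp e1, emp e2
where e1.path like e2.path || '%' and e2.ename = ?
order by e1.path
"""


def supervisors(name):
    return [row[0] for row in conn.execute(SUPERVISORS, (name,))]


def subordinates(name):
    return [row[0] for row in conn.execute(SUBORDINATES, (name,))]


print(supervisors("FORD"))    # ['KING', 'JONES', 'FORD']
print(subordinates("JONES"))  # ['JONES', 'SCOTT', 'ADAMS', 'FORD', 'SMITH']
```

Note that a bare prefix match would also treat a path such as 1.10 as a descendant of 1.1; the sample data has no node with ten or more siblings, so the issue does not show up here.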
On the other hand, the "supervisors" query is roughly equivalent to

select e1.ename from emp e1, emp e2
where e2.path > e1.path and e2.path < e1.path || 'Z'
and e2.ename = 'FORD'

Or, noticing that we essentially know e2.path, it can be further reduced to

select e1.ename from emp e1
where e2path > e1.path and e2path < e1.path || 'Z'

Here, it is clear that indexing on path doesn't work (except for "accidental" cases in which e2path happens to be near the domain boundary, so that the predicate e2path > e1.path is selective).

The obvious solution is that we do not have to refer to the database to figure out all the supervisor paths! For example, the supervisors of 1.1.2 are 1.1 and 1. A simple recursive string parsing function can extract those paths, and then the supervisor names can be answered by

select e1.ename from emp e1 where e1.path in ('1.1', '1')

which should be executed as a fast concatenated plan.

Nested Sets

Both the materialized path and Joe Celko's nested sets provide the capability to answer hierarchical queries with standard SQL syntax. In both models, the global position of the node in the hierarchy is "encoded", as opposed to an adjacency list, in which each link is a local connection between immediate neighbors only. Similar to the materialized path, the nested sets model suffers from the supervisors query performance problem:

select p2.emp from Personnel p1, Personnel p2
where p1.lft between p2.lft and p2.rgt
and p1.emp = 'Chuck'

(Note: this query is borrowed from the previously cited Celko article.) Here, the problem is even more explicit than in the case of a materialized path: we need to find all the intervals that cover a given point. This problem is known to be difficult. Although there are specialized indexing schemes like R-Tree, none of them is as universally accepted as B-Tree. For example, if the supervisor's path contains just 10 nodes and the size of the whole tree is 1000000, none of the indexing techniques could provide a 1000000/10 = 100000 times performance increase. (Such a performance improvement factor is typically associated with an index range scan in a similar, very selective, data volume condition.)
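By contrast, the materialized path shortcut described earlier needs no database access at all to find the supervisor paths. A minimal Python sketch of that string parsing step follows; the function names supervisor_paths and supervisors_query are hypothetical, and the '.' separator and the emp(ename, path) layout are taken from the example table.

```python
def supervisor_paths(path, sep="."):
    """Return the paths of all proper ancestors of `path`, nearest first."""
    parts = path.split(sep)
    # Every proper prefix of the path is the path of one supervisor.
    return [sep.join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


def supervisors_query(path):
    """Build a parameterized IN-list query for the supervisors of `path`."""
    paths = supervisor_paths(path)
    placeholders = ", ".join("?" for _ in paths)
    return f"select ename from emp where path in ({placeholders})", paths


print(supervisor_paths("1.1.2"))   # ['1.1', '1']
print(supervisors_query("1.1.2"))
```

For a root path such as '1' the list of supervisor paths is empty, so a caller would normally skip the query in that case instead of sending an empty IN list.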
Unlike a materialized path, the trick by which we computed all the nodes without querying the database doesn't work for nested sets.

Another, more fundamental, disadvantage of nested sets is that the nested sets coding is volatile. If we insert a node into the middle of the hierarchy, all the intervals with boundaries above the insertion point have to be recomputed. In other words, when we insert a record into the database, roughly half of the other records need to be updated. This is why the nested sets model has found acceptance only for static hierarchies.

Nested sets are intervals of integers. In an attempt to make the nested sets model more tolerant to insertions, Celko suggested we give up the property that the subtree of each node always contains exactly (rgt-lft+1)/2 nodes, the node itself included. In my opinion, this is a half-step towards a solution: any gap in a nested set model with large gaps and spreads in the numbering could still be covered with intervals, leaving no space for adding more children, if those intervals are allowed to have boundaries at discrete points (i.e., integers) only. One needs to use a dense domain like rational or real numbers instead.

Nested Intervals

Nested intervals generalize nested sets. A node [clft, crgt] is an (indirect) descendant of [plft, prgt] if:

plft <= clft and crgt <= prgt

The domain for interval boundaries is not limited to integers anymore: we admit rational or even real numbers, if necessary. Now, with a reasonable policy, adding a child node is never a problem. One example of such a policy would be finding an unoccupied segment [lft1, rgt1] within a parent interval [plft, prgt] and inserting a child node [(2*lft1+rgt1)/3, (lft1+2*rgt1)/3]. After insertion, we still have two more unoccupied segments, [lft1, (2*lft1+rgt1)/3] and [(lft1+2*rgt1)/3, rgt1], to add more children to the parent node. We are going to amend this naive policy in the following sections.
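The naive policy is easy to try with exact rational arithmetic. The sketch below is an illustration only, assuming the child takes the middle third of the unoccupied segment, that is [(2*lft1+rgt1)/3, (lft1+2*rgt1)/3]; the function names insert_child and is_descendant are hypothetical.

```python
from fractions import Fraction


def insert_child(lft1, rgt1):
    """Place a child in the middle third of the free segment [lft1, rgt1].

    Returns the child interval and the two segments that remain free.
    """
    lft1, rgt1 = Fraction(lft1), Fraction(rgt1)
    child = ((2 * lft1 + rgt1) / 3, (lft1 + 2 * rgt1) / 3)
    free_segments = [(lft1, child[0]), (child[1], rgt1)]
    return child, free_segments


def is_descendant(child, parent):
    """Interval containment test: plft <= clft and crgt <= prgt."""
    (clft, crgt), (plft, prgt) = child, parent
    return plft <= clft and crgt <= prgt


first, free = insert_child(0, 1)        # child [1/3, 2/3] of the root [0, 1]
second, _ = insert_child(*free[0])      # another child in the segment [0, 1/3]
print(first, second)
print(is_descendant(second, (0, 1)), is_descendant(second, first))
```

Because the boundaries come from a dense domain, a free segment can always be split again, so no existing interval ever has to be renumbered.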
Partial Order

Let's look at a two-dimensional picture of nested intervals. Let's assume that rgt is the horizontal axis x and lft is the vertical one, y. Then every node [lft, rgt] of the nested intervals tree becomes the point x = rgt, y = lft on the plane.

Each node [lft, rgt] has its descendants bounded within the two-dimensional cone y >= lft & x <= rgt. Since the left interval boundary is always less than the right one, none of the nodes are allowed above the diagonal y = x.

The other way to look at this picture is to notice that a child node is a descendant of the parent node whenever the set of all points defined by the child cone y >= clft & x <= crgt is a subset of the parent cone y >= plft & x <= prgt. A subset relationship between the cones on the plane is a partial order.

Now that we know the two constraints to which tree nodes conform, I'll describe exactly how to place them at the xy plane.

The Mapping

The choice of the tree root is arbitrary: we'll take the interval [0, 1], that is the point x = 1, y = 0, as the root, which is consistent with the encodings used below.

We'll describe further details of the mapping by induction. For each node of the tree, let's first define two important points at the xy plane. The depth-first convergence point is the intersection between the diagonal and the vertical line through the node. For example, the depth-first convergence point for the node x = 1, y = 1/2 is x = 1, y = 1. The breadth-first convergence point is the intersection between the diagonal and the horizontal line through the node. For example, the breadth-first convergence point for the node x = 1, y = 1/2 is x = 1/2, y = 1/2.

Now, for each parent node, we define the position of the first child as the midpoint halfway between the parent point and the depth-first convergence point. Then, each sibling is defined as the midpoint halfway between the previous sibling point and the parent's breadth-first convergence point. For example, node 2.1 is positioned at x = 1/2, y = 3/8.

Now that the mapping is defined, it is clear which dense domain we are using: it's the binary fractions, that is, rationals whose denominators are powers of two.

Interestingly, the descendant subtree for the parent node "1.2" is a scaled-down replica of the subtree at node "1.1." Similarly, the subtree at node "1.1" is a scaled-down replica of the tree at node "1." A structure with self-similarities is called a fractal.

Normalization

Next, we notice that x and y are not completely independent. We can tell what both x and y are if we know their sum.
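Before turning to the encoding, the midpoint construction can be checked with exact fractions. This is a sketch under stated assumptions: the root is taken to be the interval [0, 1] (the point x = 1, y = 0), both convergence points are those of the parent node, and the names midpoint, child_point and locate are hypothetical.

```python
from fractions import Fraction

ROOT = (Fraction(1), Fraction(0))  # (x, y) = (rgt, lft) of the interval [0, 1]


def midpoint(p, q):
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def child_point(parent, n):
    """Position of the n-th child (n >= 1) of the node located at `parent`."""
    x, y = parent
    depth_first = (x, x)    # diagonal point on the vertical line through parent
    breadth_first = (y, y)  # diagonal point on the horizontal line through parent
    point = midpoint(parent, depth_first)        # first child
    for _ in range(n - 1):
        point = midpoint(point, breadth_first)   # each further sibling
    return point


def locate(path):
    """Map a materialized path such as '2.1' to its (x, y) point."""
    point = ROOT
    for sibling in path.split("."):
        point = child_point(point, int(sibling))
    return point


x, y = locate("2.1")
print(x, y)        # 1/2 3/8
x, y = locate("1.3.1")
print(x, y, x + y)  # 5/8 19/32 39/32
```

The last line shows the sum x + y for node 1.3.1, the single rational number that the normalization step uses as the node's encoding.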
Given the numerator and denominator of the rational number representing the sum of the node coordinates, we can calculate the x and y coordinates back as:

function x_numer( numer integer, denom integer )
RETURN integer IS
  ret_num integer;
  ret_den integer;
BEGIN
  ret_num := numer+1;
  ret_den := denom*2;
  -- reduce the fraction while the numerator is even
  while floor(ret_num/2) = ret_num/2 loop
    ret_num := ret_num/2;
    ret_den := ret_den/2;
  end loop;
  RETURN ret_num;
END;

function x_denom( numer integer, denom integer )
...
  RETURN ret_den;
END;

in which the function x_denom body differs from x_numer in the return variable only. Informally, the numer+1 increment moves the ret_num/ret_den point vertically up to the diagonal, and the x coordinate is then half of that value, so we just multiply the denominator by two. Next, we reduce both numerator and denominator by their common power of two.

Naturally, the y coordinate is defined as the complement to the sum:

function y_numer( numer integer, denom integer )
RETURN integer IS
  num integer;
  den integer;
BEGIN
  num := x_numer(numer, denom);
  den := x_denom(numer, denom);
  -- bring x to the same denominator as the sum
  while den < denom loop
    num := num*2;
    den := den*2;
  end loop;
  num := numer - num;
  -- reduce the result
  while floor(num/2) = num/2 loop
    num := num/2;
    den := den/2;
  end loop;
  RETURN num;
END;

function y_denom( numer integer, denom integer )
...
  RETURN den;
END;

Now, the test (where 39/32 is the node 1.3.1):

select x_numer(39,32)||'/'||x_denom(39,32),
       y_numer(39,32)||'/'||y_denom(39,32) from dual

5/8     19/32

select 5/8+19/32, 39/32 from dual

1.21875     1.21875

I do not use floating point to represent rational numbers, and wrote all the functions with integer arithmetic instead. To put it bluntly, the floating point number concept in general, and the IEEE standard in particular, is useful for rendering 3D-game graphics only. In the last test, however, we used floating point just to verify that 5/8 and 19/32, returned by the previous query, do indeed add up to 39/32.

We'll store two integer numbers, the numerator and denominator of the sum of the coordinates x and y, as an encoded node path. Incidentally, Celko's nested sets use two integers as well. Unlike nested sets, our mapping is stable: each node has a predefined placement at the xy plane, so that queries involving node position in the hierarchy can be answered without reference to the database. In this respect, our hierarchy model is essentially a materialized path encoded as a rational number.

Finding Parent Encoding and Sibling Number

Given a child node with numer/denom encoding, we find the node's parent like this:

function parent_numer( numer integer, denom integer )
RETURN integer IS
  ret_num integer;
  ret_den integer;
BEGIN
  if numer = 3 then
    return NULL;
  end if;
  ret_num := (numer-1)/2;
  ret_den := denom/2;
  while floor((ret_num-1)/4) = (ret_num-1)/4 loop
    ret_num := (ret_num+1)/2;
    ret_den := ret_den/2;
  end loop;
  RETURN ret_num;
END;

function parent_denom( numer integer, denom integer )
...
  RETURN ret_den;
END;

The idea behind the algorithm is the following. If the node is on the very top level, and all these nodes have a numerator equal to 3, then the node has no parent. Otherwise, we must move vertically down the xy plane by a distance equal to the distance from the depth-first convergence point. If the node happens to be the first child, then that is the answer. Otherwise, we must move horizontally by a distance equal to the distance from the breadth-first convergence point until we meet the parent node.
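For readers without a PL/SQL environment, the coordinate and parent calculations can be followed in Python. This is a hedged port, not the article's code: each pair of *_numer/*_denom functions is merged into one function returning a (numerator, denominator) tuple, the names x_coord, y_coord and parent are hypothetical, and a guard against a zero numerator is added in y_coord, which matters only for the root 1/1.

```python
def x_coord(numer, denom):
    """x as (numerator, denominator), given the sum x + y = numer/denom."""
    num, den = numer + 1, denom * 2
    while num % 2 == 0:          # reduce while the numerator is even
        num //= 2
        den //= 2
    return num, den


def y_coord(numer, denom):
    """y as (numerator, denominator): the complement of x to the sum."""
    num, den = x_coord(numer, denom)
    while den < denom:           # bring x to the denominator of the sum
        num *= 2
        den *= 2
    num = numer - num
    while num != 0 and num % 2 == 0:   # reduce; the zero guard is an addition
        num //= 2
        den //= 2
    return num, den


def parent(numer, denom):
    """Encoding of the parent node, or None for a top-level node."""
    if numer == 3:
        return None
    num, den = (numer - 1) // 2, denom // 2    # move down to the first child slot
    while (num - 1) % 4 == 0:                  # move across earlier siblings
        num, den = (num + 1) // 2, den // 2
    return num, den


print(x_coord(39, 32), y_coord(39, 32))  # (5, 8) (19, 32)
print(parent(27, 32))                    # (7, 8)
```

The integer-only arithmetic mirrors the PL/SQL version, so the results can be compared directly with the tests in the text.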
Here is the test of the method (in which 27/32 is the node 2.1.2, while 7/8 is 2.1):

select parent_numer(27,32)||'/'||parent_denom(27,32) from dual

7/8

In the previous method, counting the steps when navigating horizontally gives the sibling number:

function sibling_number( numer integer, denom integer )
RETURN integer IS
  ret_num integer;
  ret_den integer;
  ret integer;
BEGIN
  -- no early exit for numer = 3: top-level nodes are
  -- handled by the stop condition inside the loop
  ret_num := (numer-1)/2;
  ret_den := denom/2;
  ret := 1;
  while floor((ret_num-1)/4) = (ret_num-1)/4 loop
    if ret_num = 1 and ret_den = 1 then
      return ret;
    end if;
    ret_num := (ret_num+1)/2;
    ret_den := ret_den/2;
    ret := ret+1;
  end loop;
  RETURN ret;
END;

For a node at the very first level, a special stop condition, ret_num = 1 and ret_den = 1, is needed.

The test:

select sibling_number(7,8) from dual

1

Calculating Materialized Path and Distance between Nodes

Strictly speaking, we do not have to use a materialized path, since our encoding is an alternative. On the other hand, a materialized path provides a much more intuitive visualization of the node position in the hierarchy, so we can use the materialized path for input and output of the data if we provide the mapping to our model.

The implementation is a simple application of the methods from the previous section. We print the sibling number, jump to the parent, and repeat until we reach the root:

function path( numer integer, denom integer )
RETURN varchar2 IS
BEGIN
  if numer is NULL then
    return '';
  end if;
  RETURN path(parent_numer(numer, denom),
              parent_denom(numer, denom))
         || '.' || sibling_number(numer, denom);
END;

select path(15,16) from dual

.2.1.1

Now we are ready to write the main query: given two nodes, P and C, when is P the parent of C?
A more general query would return the number of levels between P and C if C is reachable from P, and some exception indicator otherwise:

function distance( num1 integer, den1 integer,
                   num2 integer, den2 integer )
RETURN integer IS
BEGIN
  if num1 is NULL then
    return -999999;
  end if;
  if num1 = num2 and den1 = den2 then
    return 0;
  end if;
  RETURN 1+distance(parent_numer(num1, den1),
                    parent_denom(num1, den1),
                    num2, den2);
END;

select distance(27,32,3,4) from dual

2

Negative numbers are interpreted as exceptions. If the num1/den1 node is not reachable from num2/den2, then the navigation converges to the root, and level(num1/den1)-999999 would be returned (readers are advised to find a less clumsy solution).

The alternative way to answer whether two nodes are connected is by simply calculating the x and y coordinates and checking if the parent interval encloses the child. Although none of the methods refer to disk, checking whether the partial order exists between the points seems much less expensive! On the other hand, it is just a computer architecture artifact that comparing two integers is an atomic operation. A more thorough implementation of the method would involve a domain of integers with unlimited range (those kinds of numbers are supported by computer algebra systems), so that a comparison operation would be iterative as well.

Our system would not be complete without a function inverse to the path, which returns a node's numer/denom value once the path is provided. Let's introduce two auxiliary functions first:

function child_numer
( num integer, den integer, child integer )
RETURN integer IS
BEGIN
  RETURN num*power(2, child)+3-power(2, child);
END;

function child_denom
( num integer, den integer, child integer )
RETURN integer IS
BEGIN
  RETURN den*power(2, child);
END;

select child_numer(3,2,3) || '/' ||
       child_denom(3,2,3) from dual

19/16

For example, the third child of the node 1 (encoded as 3/2) is the node 1.3 (encoded as 19/16).
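Both ways of testing reachability, and the child encoding, can be compared side by side in a short Python sketch. It is an illustration under assumptions: the function names are hypothetical, distance returns None instead of the -999999 marker (one possible "less clumsy" convention), and encloses expects encodings in lowest terms, which is what the mapping produces.

```python
from fractions import Fraction


def parent(numer, denom):
    """Encoding of the parent node, or None for a top-level node."""
    if numer == 3:
        return None
    num, den = (numer - 1) // 2, denom // 2
    while (num - 1) % 4 == 0:
        num, den = (num + 1) // 2, den // 2
    return num, den


def distance(num1, den1, num2, den2):
    """Levels from node 1 up to node 2, or None if node 2 is not above it."""
    node, levels = (num1, den1), 0
    while node is not None:
        if node == (num2, den2):
            return levels
        node = parent(*node)
        levels += 1
    return None


def interval(numer, denom):
    """(lft, rgt) of a node from its encoding numer/denom = lft + rgt."""
    rgt = Fraction(numer + 1, 2 * denom)
    return Fraction(numer, denom) - rgt, rgt


def encloses(p, c):
    """True if the interval of node p contains the interval of node c."""
    (plft, prgt), (clft, crgt) = interval(*p), interval(*c)
    return plft <= clft and crgt <= prgt


def child(num, den, n):
    """Encoding of the n-th child of the node num/den."""
    return num * 2**n + 3 - 2**n, den * 2**n


print(distance(27, 32, 3, 4))       # 2
print(encloses((3, 4), (27, 32)))   # True
print(child(3, 2, 3))               # (19, 16)
```

The enclosure test needs a fixed number of arithmetic operations, while distance walks up one level at a time, which is the cost difference the text alludes to.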
The path encoding function is:

function path_numer( path varchar2 )
RETURN integer IS
  num integer;
  den integer;
  postfix varchar2(1000);
  sibling varchar2(100);
BEGIN
  num := 1;
  den := 1;
  postfix := '.' || path || '.';
  while length(postfix) > 1 loop
    -- take the next sibling number off the front of the path
    sibling := substr(postfix, 2,
                      instr(postfix,'.',2)-2);
    postfix := substr(postfix,
                      instr(postfix,'.',2),
                      length(postfix)
                      -instr(postfix,'.',2)+1);
    num := child_numer(num, den, to_number(sibling));
    den := child_denom(num, den, to_number(sibling));
  end loop;
  RETURN num;
END;

function path_denom( path varchar2 )
...
  RETURN den;
END;

select path_numer('2.1.3') || '/' ||
       path_denom('2.1.3') from dual

51/64

The Final Test

Now that the infrastructure is completed, we can test it. Let's create the hierarchy

create table emps (
  name  varchar2(30),
  numer integer,
  denom integer
)

alter table emps
ADD CONSTRAINT uk_name UNIQUE (name) USING INDEX
  (CREATE UNIQUE INDEX name_idx on emps(name))
ADD CONSTRAINT uk_node UNIQUE (numer, denom) USING INDEX
  (CREATE UNIQUE INDEX node_idx on emps(numer, denom))

and fill it with some data:

insert into emps values ('KING', path_numer('1'), path_denom('1'));
insert into emps values ('JONES', path_numer('1.1'), path_denom('1.1'));
insert into emps values ('SCOTT', path_numer('1.1.1'), path_denom('1.1.1'));
insert into emps values ('ADAMS', path_numer('1.1.1.1'), path_denom('1.1.1.1'));
insert into emps values ('FORD', path_numer('1.1.2'), path_denom('1.1.2'));
insert into emps values ('SMITH', path_numer('1.1.2.1'), path_denom('1.1.2.1'));
insert into emps values ('BLAKE', path_numer('1.2'), path_denom('1.2'));
insert into emps values ('ALLEN', path_numer('1.2.1'), path_denom('1.2.1'));
insert into emps values ('WARD', path_numer('1.2.2'), path_denom('1.2.2'));
insert into emps values ('MARTIN', path_numer('1.2.3'), path_denom('1.2.3'));
insert into emps values ('TURNER', path_numer('1.2.4'), path_denom('1.2.4'));
insert into emps values ('CLARK', path_numer('1.3'), path_denom('1.3'));
insert into emps values ('MILLER', path_numer('1.3.1'), path_denom('1.3.1'));

All the functions written in the previous sections are conveniently combined in a single view:

create or replace view hierarchy as
select name, numer, denom,
       y_numer(numer,denom) numer_left,
       y_denom(numer,denom) denom_left,
       x_numer(numer,denom) numer_right,
       x_denom(numer,denom) denom_right,
       path(numer,denom) path,
       distance(numer,denom,3,2) depth
from emps

And, finally, we can create the hierarchical reports.

1. Depth-first enumeration, ordering by left interval boundary:

select lpad(' ',3*depth)||name
from hierarchy order by numer_left/denom_left

LPAD(' ',3*DEPTH)||NAME
-----------------------
KING
   CLARK
      MILLER
   BLAKE
      TURNER
      MARTIN
      WARD
      ALLEN
   JONES
      FORD
         SMITH
      SCOTT
         ADAMS

2. Depth-first enumeration, ordering by right interval boundary:

select lpad(' ',3*depth)||name
from hierarchy order by numer_right/denom_right desc

LPAD(' ',3*DEPTH)||NAME
-----------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER

3. Depth-first enumeration, ordering by path (output identical to #2):

select lpad(' ',3*depth)||name
from hierarchy order by path

LPAD(' ',3*DEPTH)||NAME
-----------------------
KING
   JONES
      SCOTT
         ADAMS
      FORD
         SMITH
   BLAKE
      ALLEN
      WARD
      MARTIN
      TURNER
   CLARK
      MILLER

4. All the descendants of JONES, excluding himself:

select h1.name from hierarchy h1, hierarchy h2
where h2.name = 'JONES'
and distance(h1.numer, h1.denom,
             h2.numer, h2.denom) > 0;

NAME
------
SCOTT
ADAMS
FORD
SMITH

5. All the ancestors of FORD, excluding himself:

select h2.name from hierarchy h1, hierarchy h2
where h1.name = 'FORD'
and distance(h1.numer, h1.denom,
             h2.numer, h2.denom) > 0;

NAME
------
KING
JONES

- Vadim Tropashko works for the Real World Performance group at Oracle Corp. In a prior life he was an application programmer and translated "The C++ Programming Language" by B. Stroustrup, 2nd edition, into Russian. His current interests include SQL optimization, constraint databases, and computer algebra systems.