PostgreSQL: Recursive queries

I recently needed to work with a hierarchical data structure in a PostgreSQL database. I had seen a lot of solutions previously for managing the recursion needed with this kind of structure however each seemed to have some form of limitation or another. For example, I had seen some queries which use a join for each level of depth.

SELECT level1.name AS level1, level2.name as level2, level3.name as level3, level4.name as level4  
FROM employees AS level1  
LEFT JOIN employees AS level2 ON level2.superior = level1.id  
LEFT JOIN employees AS level3 ON level3.superior = level2.id  
LEFT JOIN employees AS level4 ON level4.superior = level3.id  
WHERE level1.id = 4;  

With this approach, every time the depth of the hierarchy increases, the query has to be updated. Another approach I had seen is to use the programming language to handle the recursion.

$cto = $this->getEmployee(2);

$this->getSubordinates($cto);

...

/**
  * Get the subordinate employees.
  *
  * @param Employee $employee The employee to grab subordinates for.
  *
  * @return Employee[]
  */
  private function getSubordinates(Employee $employee)
  {
       $subordinates = $this->query('SELECT * FROM employees WHERE superior = (?) ', $employee->getId());

       foreach ($subordinates as $subordinate) {
          $subordinates = array_merge($subordinates, $this->getSubordinates($subordinate));
       }

       return $subordinates;
   }

With this approach, multiple queries need to be executed which isn't the most efficient.

I wanted to find a solution that could support an infinite level of depth within one, efficient, SQL query. After some research I discovered common table expressions (CTE) which were introduced in SQL:1999 and have been supported in PostgreSQL since version 8.4.

A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.

The ability for CTE's to be self-referencing is what allows for the recursive querying of a hierarchical data structure.

To demo this approach I have created a simple database with an employees table and populated it with dummy data to represent the hierarchical staff structure of a fictitious company.

-- Create database

CREATE DATABASE foo;

-- Create table

CREATE TABLE IF NOT EXISTS employees(  
  id SERIAL,
  name varchar NOT NULL,
  job_title varchar NOT NULL,
  superior INTEGER CONSTRAINT parent_employee_id REFERENCES employees(id) ON DELETE RESTRICT,
  CONSTRAINT employee_id PRIMARY KEY (id)
);

-- Populate with dummy data

INSERT INTO employees (id, name, job_title) VALUES (1, 'Janet Martinez', 'CEO');  
INSERT INTO employees (id, name, job_title, superior) VALUES (2, 'Carol Green', 'CTO', 1);  
INSERT INTO employees (id, name, job_title, superior) VALUES (3, 'Joyce Henderson', 'Development Manager', 2);  
INSERT INTO employees (id, name, job_title, superior) VALUES (4, 'Anthony Coleman', 'Development Manager', 2);  
INSERT INTO employees (id, name, job_title, superior) VALUES (5, 'Jimmy Daniels', 'Development Manager', 2);  
INSERT INTO employees (id, name, job_title, superior) VALUES (6, 'Todd Parker', 'Lead Software Engineer', 3);  
INSERT INTO employees (id, name, job_title, superior) VALUES (7, 'Edward Long', 'Lead Software Engineer', 4);  
INSERT INTO employees (id, name, job_title, superior) VALUES (8, 'Terry Day', 'Lead Software Engineer', 5);  
INSERT INTO employees (id, name, job_title, superior) VALUES (9, 'Justin Crawford', 'Software Engineer', 6);  
INSERT INTO employees (id, name, job_title, superior) VALUES (10, 'Ryan Brown', 'Software Engineer', 6);  
INSERT INTO employees (id, name, job_title, superior) VALUES (11, 'Pamela Frazier', 'Software Engineer', 7);  
INSERT INTO employees (id, name, job_title, superior) VALUES (12, 'Shawn Lewis', 'Software Engineer', 7);  
INSERT INTO employees (id, name, job_title, superior) VALUES (13, 'James Arnold', 'Software Engineer', 8);  
INSERT INTO employees (id, name, job_title, superior) VALUES (14, 'Jeremy Thomas', 'Software Engineer', 8);  
INSERT INTO employees (id, name, job_title, superior) VALUES (15, 'Andrea Hansen', 'Developer', 9);  
INSERT INTO employees (id, name, job_title, superior) VALUES (16, 'Mark Kelly', 'Developer', 10);  
INSERT INTO employees (id, name, job_title, superior) VALUES (17, 'Donna Brown', 'Developer', 11);  
INSERT INTO employees (id, name, job_title, superior) VALUES (18, 'Shawn Fields', 'Developer', 12);  
INSERT INTO employees (id, name, job_title, superior) VALUES (19, 'Victor Lane', 'Developer', 13);  
INSERT INTO employees (id, name, job_title, superior) VALUES (20, 'Jean Kennedy', 'Developer', 14);

To select all the subordinate employees of the developer manager Anthony Coleman I created the following query using a recursive CTE.

WITH RECURSIVE employee_hierarchy AS (  
  SELECT * FROM employees
  WHERE id = 4
  UNION
  SELECT employees.* FROM employee_hierarchy, employees
  WHERE employees.superior = employee_hierarchy.id
) 
SELECT * FROM employee_hierarchy;  

Running this query returns a row for each employee that sits underneath the development manager by following the chain of command expressed through the superior column all the way down to the two Developers.

 id |      name       |       job_title        | superior
----+-----------------+------------------------+----------
  4 | Anthony Coleman | Development Manager    |        2
  7 | Edward Long     | Lead Software Engineer |        4
 11 | Pamela Frazier  | Software Engineer      |        7
 12 | Shawn Lewis     | Software Engineer      |        7
 17 | Donna Brown     | Developer              |       11
 18 | Shawn Fields    | Developer              |       12
(6 rows)

Time: 1.320 ms  

Breaking down the query, the first part is the recursive CTE declaration which is given the name employee_hierarchy.

WITH RECURSIVE employee_hierarchy AS (  
  ...
)

The next part of the expression is known as the Anchor member. The result of this query is used as the base of the recursion. In this case, it selects the development manager who we want to get all subordinate employees of. The Union operator is used to join the results with those returned from the recursive member.

SELECT * FROM employees  
WHERE id = 4  
UNION  

The next part of the expression is known as the Recursive member. It runs iteratively grabbing all the direct subordinates of the employee(s) returned from the last iteration starting with the development manager returned from the Anchor member. The iteration only stops when no more results are returned i.e it reaches the bottom of the hierarchy. The self-referencing can be seen in the WHERE part of this member which uses the expression name employee_hierarchy to get the results of the previous iteration.

SELECT employees.*  
FROM employee_hierarchy, employees  
WHERE employees.superior = employee_hierarchy.id  

The final part of the query is to use the expression which acts like a virtual table containing all the subordinates employees of the development manager. Here we simply select all of these results using the name of the expression just as we would from a real table.

SELECT * FROM employee_hierarchy;  

The query as a whole works using a top-down approach, taking a root employee and working its way down but it's trivial to adapt the query to work in reverse, working up from a selected employee to the top.

WITH RECURSIVE employee_hierarchy AS (  
  SELECT * FROM employees
  WHERE id = 20
  UNION
  SELECT employees.* FROM employee_hierarchy, employees
  WHERE employees.id = employee_hierarchy.superior
) 
SELECT * FROM employee_hierarchy;  

Using the developer Jean Kennedy as the anchor and reversing the WHERE clause in the recursive member we can find all of Jean Kennedy's superiors right the way up to the CEO.

 id |      name      |       job_title        | superior
----+----------------+------------------------+----------
 20 | Jean Kennedy   | Developer              |       14
 14 | Jeremy Thomas  | Software Engineer      |        8
  8 | Terry Day      | Lead Software Engineer |        5
  5 | Jimmy Daniels  | Development Manager    |        2
  2 | Carol Green    | CTO                    |        1
  1 | Janet Martinez | CEO                    |
(6 rows)

Time: 50.892 ms  

It's worth noting the use of UNION over UNION ALL within these expressions which in this case, ensures duplicate results are not returned, protecting the expression from circular references which would otherwise cause an infinite loop. For example, if the developers at the bottom the hierarchy were erroneously linked as superiors of their development manager Union will remove the duplication of the development manager returning an empty result instead and stopping the iteration.

To sum up, the recursive CTE provides a relatively simple and concise way to work with hierarchical data in a database. It supports an infinite depth level in a single SQL query and a relatively simple syntax.

WITH RECURSIVE [expression_name] AS (  
  [anchor_member]
  [recursive_member]
)