Topological Search

This week I spent some time working on a CS problem that required me to perform topological search of a graph. The problem didn’t seem so complicated at first glance:

Design a program which takes as input two parameters: an array of tasks and an array containing pairs of dependency relationships. Each dependency relationship consists of a pair of tasks wherein the first task must be completed before the second task is completed. Your program must must return an array consisting of a valid order in which each of the tasks may be completed or return an error if no valid order is possible.

Solving this problem, however, isn’t quite so straightforward as many of the string and array manipulation questions I typically get confronted with on a day to day basis. The question really requires diving into the topic of topological sorting of graphs. This question really excited me so I decided to sit down, put in the time, and work my way through a solution.

What captured my attention about this problem is that its practical application was immediately clear: this is dependency resolution. In other words, the problem solved by this exercise is the same fundamental problem that many of the tools I rely on every single day address such as Bundler to install Ruby gem dependencies, Apt to manage package dependencies on my Debian GNU/Linux laptop, and Systemd to manage process initialization on my servers.

Although the inputs in this exercise are arrays, they represent the edges of a directed graph, i.e. a graph in which nodes have a one-way or directional relationship to adjacent nodes. In this case the directionality is the dependency -> dependant relationship between some of the task nodes.

According to Wikipedia, topological sorting of a directed graph can be described as

[…] a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering. For instance, the vertices of the graph may represent tasks to be performed, and the edges may represent constraints that one task must be performed before another; in this application, a topological ordering is just a valid sequence for the tasks. A topological ordering is possible if and only if the graph has no directed cycles, that is, if it is a directed acyclic graph (DAG). Any DAG has at least one topological ordering, and algorithms are known for constructing a topological ordering of any DAG in linear time.

Simple right? So how do we actually implement topological sorting and solve the problem at hand? Lets say that we are given a list of projects and their respective dependency relationships in the following form:

tasks = ["a", "b", "c", "d", "e", "f", "g"]

dependencies = [["b", "e"], ["a", "e"], ["f", "b"],
                ["f", "c"], ["c", "a"], ["f", "a"],
                ["b", "a"], ["d", "g"]]

As a first step, let’s see if we can interpret this data to more naturally visualize it as a graph. Each element in the dependencies array represents a directional connection from, for example, task b to task e. While we want to capture this representation programattically in our code let’s start out by using the DOT graph description language to generate a graph diagram so we can better comprehend the data. The graph in this sample input can be described (with some tweaks for aesthetics and clarity) by the following DOT file:

digraph G {
	rankdir=tb
	d -> g;
	f -> b;
	f -> c;
	f -> a;
	b -> a;
	c -> a;
	a -> e;
	b -> e;
	{ rank=same "d" "f" }
	{ rank=same "e" "g" }
	{ rank=same "c" "b" }
}

Once compiled, the above file produces this chart:

chart of sample input

From just a quick look at this image we can ascertain a few basic facts:

This graph is disconnected and therefore there is not a path between each node and all other nodes.
This graph contains no cycles and therefore should have at least one valid topological sorting order.
This graph contains nodes with multiple distinct dependencies and therefore likely has multiple valid topological sorting orders because the order of dependencies is not strictly prescribed.

Here is the first version of the code I came up with to solve this problem:

# Main method, takes specified input arrays
# returns array of tasks in valid order
# unless error is raised
def build_order(tasks, dependencies)
  # Converts graph data to hash representation
  deps = create_deps(tasks, dependencies)
  completion_order = []
  # Loops through hash removing tasks which do not have
  # dependencies and adding them to `completion_order`
  while !deps.empty? do
    processing = add_no_dependents(deps)
    completion_order << processing
    # decrement the dependencies key for all dependents
    processing.each do |t|
      if !deps[t][:dependents].empty?
        deps[t][:dependents].each do |d|
          deps[d][:dependencies] -= 1
        end
      end
      # delete the task
      deps.delete(t)
    end
  end
  # return the completion order array
  return completion_order.flatten
end

# Parse graph and store in Hash datastructure
def create_deps(tasks, dependencies)
  deps = Hash.new
  # set the default deps hash value for each task
  tasks.each do |t|
    deps[t] = {dependencies: 0, dependents: []}
  end

  dependencies.each do |d|
      # add each dependant to the dependency's :dependents array
      deps[d[0]][:dependents] << d[1]
      # increment the dependant task's :dependencies counter
      deps[d[1]][:dependencies] += 1
  end

  return deps
end

# Iterate through the `deps` hash to find tasks with
# no dependencies
def add_no_dependents(deps)
  processing = []
  # Find all tasks which have no dependencies
  deps.keys.each do |t|
    if deps[t][:dependencies] == 0
      processing << t
    end
  end

  return processing if !processing.empty?
  # if processing is empty than we have reached an irreconcilable cycle
  raise "An error has occurred. No valid order detected."
end

Here is a quick overview of how this code functions:

This solution starts out by parsing the input graph and transforming it to a hash where each task is a key and each value is a nested hash with a :dependents key containing an array of dependant tasks and a :dependencies key containing a counter of that task’s dependencies.
The solution then performs a loop through the graph hash looking for root tasks which have no dependencies.
The root tasks are then added to the completion_order array and removed from the graph hash.
The removed task’s dependents also have their :dependencies counter decremented.
On the next loop around any new root tasks are identified and the cycle continues until either the each task in the graph has been added to the completion_order array or no new root task can be found and an error is raised.

Given the sample input, the program returns:

=> ["d", "f", "b", "c", "g", "a", "e"]

Checking our diagram this appears to be one of the many possible valid orders of task completion for the specified graph. What if we tweak the sample input just a bit by adding an element ["b", "f"] to the dependencies array input? This graph now looks like this:

chart of invalid sample input

When we run the program on this modified input we get the following output:

An error has occurred. No valid order detected.

As you can see, this small change creates a big problem! The newly added b -> f dependency creates a bidirectional connection. In other words, without f there can be no b and without b there can be no f. Accordingly there is no valid order in which the tasks can be completed and the program correctly raises an error.

This code a solid solution and although I imagine there are some improvements I could make in my approach, I must say that I’m quite pleased with it. I did, however, discover something interesting: Topological sort is actually an included module in the Ruby standard library! Using the TSort module and a custom class, we can accomplish the same outcome with far fewer lines of code:

require 'tsort'

def build_order(tasks, dependencies)
  deps = create_hash(projects, dependencies)
  graph = G.new(deps)
  graph.tsort
end

def create_hash(tasks, dependencies)
  deps = Hash.new

  tasks.each do |t|
    deps[t] = []
  end

  dependencies.each do |d|
    deps[d[1]] << d[0]
  end

  return deps
end

class G
  include TSort
  def initialize(g)
    @g = g
  end

  def tsort_each_child(n, &b)
    @g[n].each(&b)
  end

  def tsort_each_node(&b)
    @g.each_key(&b)
  end
end

This is one of the first graph problems I’ve had to deal with and I’m really excited to further explore how I can use graphs to solve challenging problems.