A recent client request had Expero working with a relative newcomer in the graph provider space, TigerGraph. We tend to fancy ourselves as graph experts and have worked with a variety of graph vendors, so we always welcome a chance to work with new technologies.
What made this experience different? All graph databases emphasize relationships, which help find associations that are not always immediately obvious. For example, answering the questions “How are two objects related?” or “How closely are two objects related?”. The answer to these questions provides new insights and patterns to explore based only on the results. What if there was a mechanism that could provide insights to how the results were derived that may provoke more questions to ask that are obscured by simply looking at the results?
What is TigerGraph?
TigerGraph is a native parallel graph database. Tiger defines this as storing data as a graph as opposed to an underlying RDBMS or NOSQL database engine. TigerGraph’s storage and processing engine are implemented in C++, and support distributed querying across the cluster to tackle tasks faster and parallelize processing.
There is no universally accepted query language in the land of graph, so TigerGraph has implemented their own:GSQL. GSQL is a schema-enforced SQL-like query language offering traditional control structures and parameterized queries.
Accumulators a powerful feature of GSQL
What sets GSQL apart from other graph query languages, such as Cypher or Gremlin? One key differentiator is the concept of Accumulators. Accumulators can be local or global in scope. Accumulators give a way to compute or check values as vertices and edges are traversed during a selection, or post traversal on the result set. This is a very powerful feature that allows for the straightforward implementation of various algorithms or data analysis.
Accumulators come in many different flavors. For example some are logical in nature like OrAccum and AndAccum, which are boolean and store the logical result of all conditionals along the way. Some provide ways to capture numerical results along the traversal, such as MinAccum, MaxAccum, and AvgAccum, which record the minimum, maximum and average value respectively. Finally accumulators can also be used simply for storage of values or objects like SetAccum, ListAccum, and MapAccum, which offer basic container functionality, such as contains, remove and clear.
The Path Best Traveled
Start With a Simple Schema
Dijkstra’s algorithm is used for graphs to find the shortest path, if one exists, between a given vertex and all other vertices in the graph. Let’s take a look at an implementation of Dijkstra in GSQL, using some of these nifty accumulators. Before we begin we need a schema to operate on and for the purposes of this exercise a truly simple schema is all that is needed.
There is one vertex type, VSimple, which has nothing except for a string id. There are two edge types, each with an attribute of Weight. The edge type ESimpleUndirected is an example of an undirected edge, while ESimpleUndirected is an example of a directed edge. Notice the orange ESimpleUndirected has an arrow on one end on the reference to VSimple.
Create a Couple of Vertices and Edges
Since there is now a schema we can begin to define a graph to traverse. The following graph will be weighted and directed and edges will have the following weight:
Review of Dijkstra’s Algorithm
First choose a vertex to be the be the root, or initial vertex. In our case this will be vertex S, and the distance from vertex S to S will simply be 0.
- Mark all vertices unvisited, except for the initial vertex S.
- Assign a tentative distance value to all vertices marked unvisited. The tentative value is generally set to infinity, but for our case we will assume all distances are positive and -1 will act as infinity. The vertex S will be set as the current vertex.
- For the current vertex consider all unvisited neighbor vertices and calculate the tentative distance through the current vertex to each neighbor. The newly calculated distance is compared to the current distance value for that vertex. If the newly calculated distance is less than the current distance, then the new path through the current vertex is the best path. Therefore update the distance value, otherwise if the newly calculated distance is greater than what has been recorded, then no changes need to be made.
- After all neighbors of the current vertex have been checked, mark the current vertex as visited. A visited vertex will never need to be checked again.
- If there are still unvisited vertices and there is a distance value that is not set to infinity then select the smallest distance value from the unvisited vertex set and go back to step 3. Otherwise if the unvisited set is empty, or if the only vertices in the unvisited set are all infinity, then the algorithm has finished.
Overview of the GSQL Query
Now let’s look at an implementation of Dijkstra’s algorithm that will be run on this particular graph. Don’t worry we will step through the code in piecewise fashion reviewing each section in detail.
<p> CODE: https://gist.github.com/experoblogs/c9f036c079b64597ac2d3bb75c372e2d.js</p>
GSQL in Detail
First is the outer query body, which simply defines a query with the name DijkstraExample that takes a parameter of type String.
<p> CODE: https://gist.github.com/experoblogs/186604ae22c94b6e32307739cc56e51e.js</p>
One feature to note is that queries can also be defined to return values or objects. It is also possible to reference queries from inside a query allowing users to build easily build more complex queries, while keeping code segregated into logical constructs.
Let’s take a look at some of the accumulators that this query will be using.
<p> CODE: https://gist.github.com/experoblogs/08722fdb5aea1716137bfa496a3e4300.js</p>
Here we have declared a couple different types of accumulators. First let’s look at the two accumulators, @@minDistance and @@maxDistance, which have been declared as a MinAccum and MaxAccum respectively. Note how both also have a type of INT declared and both are preceded by @@. The use of INT informs the typing of the two accumulators, while the @ and @@ dictates a local or global scope. A single @ acts as a local, or vertex attached accumulator. That is to say that each vertex has an instance of that accumulator, while @@ indicates a globally shared accumulator not attached to any one specific vertex.
Next there is an OrAccum, @visited that has been declared as a vertex attached accumulator. Notice how no data type has been specified. This is due to the fact that both AndAccum and OrAccum are logical accumulators and will always results in a true or false value. It is important to realize in the example that we have also specified a default value of FALSE for the accumulator.
Then there are two SumAccum, @distance and @path. The SumAccum will allow for us to keep a running total for each vertex. Accumulator @distance is of type INT is set to our default value of -1. However @path has been given no default value and is of type STRING. For @distance the sum accumulation works as expected and for @path acts as string concatenation.
Finally the last type of accumulator is a container accumulator, SetAccum, which will keep and unordered collection of objects, where no duplicates are allowed. Both @@visitedSet and @@unvisitedSet will hold VERTEX objects.
Initializing and Simple Selects
<p> CODE: https://gist.github.com/experoblogs/f3a4762e70badb0d0ab66de3a68e0f6f.js</p>
The first line to look at is that there is a vertex set, vertexList, which is being instantiated with all vertices of type VSimple.
<p> CODE: https://gist.github.com/experoblogs/753b81c7986f6d63eafc861657b355d7.js</p>
Next we have the first SELECT statement using the vertexList defined above.
<p> CODE: https://gist.github.com/experoblogs/b73c5bfd26846c6698c5e6c9103c37f0.js</p>
Looking closer at the FROM statement we see a reference to the set vertexList followed by :p, where p is vertex alias for the vertices from vertexList. Next the -()-> is used to designate an edge connection. The -> is indicating a direction, while the value in between the parentheses indicates the allowed edge types and edge alias. In our case we are limiting edges to only ESimpleDirected and e is the alias that is used for edges in this context. The last little bit is defining the destination vertex for a given edge. Again in this simplified example we are limiting the vertice destinations to only be of type VSimple and these will have an alias of t.
Lastly before moving into accumulators there is a defined WHERE clause in the above query
<p> CODE: https://gist.github.com/experoblogs/c372bff22006e28327c0371e35fefd21.js</p>
The variable origin is a parameter and is the id of the vertex that will act as the root vertex. Since this value is a Primary Key, no two vertices will have the same id value, otherwise this would be a violation constraint.
The Accum Clause
Now for the first foray into using accumulators.
<p> CODE: https://gist.github.com/experoblogs/325e964ef8c6bc3f2f1cb70657d79100.js</p>
Notice that the keyword ACCUM is used to denote an accum block. There are three operations, which are currently being done in the ACCUM block, t.@distance = e.Weight, ‘t.@path = p.id + -> and @@maxDistance += e.Weight. Remember that t is the alias of the vertex destination of the edge and that e is the alias of the edge itself. The statement
<p> CODE: https://gist.github.com/experoblogs/429ab268d7ab9908d516ac92217398b8.js</p>
is setting the value of the vertex attached accumulator @distance to the edge weight that connects p to t, for each t that p is connected. Similarly we are setting the @path string value to the id of the starting vertex and concatenating that value with ->. Finally we are adding the weight value to the global accumulator @@maxDistance using the operator +=. Because @@maxDistance is a MaxAccum the += will only update the value if it is greater than value currently recorded.
The Post-Accum Clause
The last bit of the first SELECT is the POST-ACCUM clause. The Post-Accum clause takes place after the result set has been established and is performed only on the result set. This allows a post check of values that were computed in the Accum phase.
<p> CODE: https://gist.github.com/experoblogs/19db44cb104a2811bfbcdee063492337.js</p>
Now this is setting the value of the origin vertex @visited to TRUE, the @distance value to 0 and notice that we are adding the vertex p to the set of visited vertices.
What are we left with?
The results of the above where stored in a temporary vertex set, currentVertex. In our case because of the defined where clause the size of this set will either be 1 or 0 depending on whether the value passed in when calling the query is legitimate VSimple id or not.
Two More Examples of Sets
Now let’s take a look at the next two statements
<p> CODE: https://gist.github.com/experoblogs/9e6bd999e2d2f0dbc8b66b41ac6e139e.js</p>
The top statement should select all vertices where the vertex attached value @visited is FALSE, which should be very vertex is not the initial vertex. These vertices are also added to the unvisited set. Finally we can demonstrate that vertex sets can be copied by declaring a new vertex set choiceSet, which contains the same contents as unvisitedSet.
Another nifty feature built into GSQL, is the traditional use of control structures. For example the support of a conditional while loop is as simple as follows
<p> CODE: https://gist.github.com/experoblogs/32a0ee1fd21aee995964f259065920e1.js</p>
The above code block also has an example of some of the built in functionality for vertex set, namely the use of size() to determine how many vertices are in the set. As with traditional programming be aware that infinite loops can occur if the terminating condition is not guaranteed. Additionally these control structures can be used in Accum and Post-Accum blocks as well as DML blocks that may exist.
Dijkstra in GSQL
Now let’s review the actual implementation of the algorithm, which is the code that exists in that WHILE block. Remember before entering the loop, steps 1 and 2 of Dijkstra have been satisfied. We have marked all vertices with a tentative distance and have marked each vertex as visited or not visited. The vertex set currentVertex should always contain at most one vertex
<p> CODE: https://gist.github.com/experoblogs/6f129a55ad695322eb7a555b408422d1.js</p>
For the current vertex all that happens is it is marked as visited and the @path is concatenated with the it’s own id to make the @path value more aesthetically pleasing. Next using the current vertex we calculate a potential value for all of the current vertex’s neighbors, which corresponds to step 3 in the Dijkstra algorithm.
<p> CODE: https://gist.github.com/experoblogs/dfa0cb1b80fc8e4f0f3274c64cf18a30.js</p>
Notice in the above code block we have an example of an IF and ELSE IF statement, which demonstrates using control structures in a query body statement and is also an example of a nested conditional. Walking through the logic for any vertices that have not been visited then perform a check on the distance value. If @distance is less than 0, which in our case means that no distance has been calculated for the target vertex n. Notice the only subtle difference between the IF and the ELSE IF block is how the @path value is being set. The results are stored in the unvisitedNeighbor vertex set. The following snippet also covers Step 4.
<p> CODE: https://gist.github.com/experoblogs/5c1e74a6c7eb2737fee9f1970e5f6636.js</p>
The following code block once again shows off some features of GSQL, mainly that two vertex sets have set operations defined, such as UNION and INTERSECT.
<p> CODE: https://gist.github.com/experoblogs/b6887caccfe163d685dbd130f332d82f.js</p>
Recall that in our version of Dijkstra we are assuming a value of -1 is acting as infinity, so utilizing the where clause, we specify that we want to consider all non-negative distances.
<p> CODE: https://gist.github.com/experoblogs/5f992c06c3b6462939190afa8d1115bc.js</p>
The above is simply establishing an upper bound of greatest distance that exist within the vertex set. A sentinel value could have been used to simplify the above, but I wanted to highlight the use of a MaxAccum, as well as being able to use conditional IF, ELSE block to manipulate global accumulators outside of a vertex select statement. After the above we are guaranteed that @@minDistnace will be non-negative. Now to perform the final step of Dijkstra:
<p> CODE: https://gist.github.com/experoblogs/abc955bf137fab98ea46b61d6cb38a32.js</p>
Notice that the above also makes use of a HAVING clause, where we specify the result set should contain only those with minimum distance. Furthermore there is also use of LIMIT, which guarantees that the currentVertex set contains at most one vertex. For good measure let’s review the terminating condition for the while loop
<p> CODE: https://gist.github.com/experoblogs/9a32a10a4f69c6e8e50154e81488d2c1.js</p>
We are going to repeat the WHILE loop until we have no more valid connected vertices to choose. This is the end of Dijkstra’s algorithm and the rest of the query is again demonstrating some concepts and functionality of GSQL.
Getting There from Here
Let’s take a look at the last bit of the query.
<p> CODE: https://gist.github.com/experoblogs/40b10d89c23c8477508744dfde6eb5f0.js</p>
The above demonstrates container accumulators as well as the FOREACH control structure, which will iterate through the objects in the @@visitedSet. Each vertex is then tested to see if it is in the @@unvisitedSet by using the contains function. If the v instance exists then it is removed from the @@unvisitedSet.
<p> CODE: https://gist.github.com/experoblogs/e7826829dd82aeb05b0f25781bcb869f.js</p>
The only new functionality to review is the PRINT command. This can be used to write output to the JSON response when the query is run. Both accumulators and vertex sets can be printed.
Running the Query
When running the query, we are first prompted to provide a value, so we can pick our initial vertex. For our purposes we will use S.
Understanding the Results
Running the query produces the following results in JSON.
Notice how both the contents of the container accumulators, @@visitedSet and @@unvisitedSet, are in the response. In this particular example everything was reachable, so @@visitedSet contains all vertices in the graph, while @@unvisitedSet contains no elements.
Next notice the vertex set, shortestPath that was output. For each vertex there are the attributes that exist on the Vertex. First are the system attributes, v_id and v_type, where v_id is the primary id of the vertex and v_type denotes the vertex type. Next there is an attribute map of the vertex. There are is an actual attribute called id and then there is an output of all the vertex attached accumulators defined for the query. We can see that the vertex has been visited, the distance to the vertex from S to E has a distance of 8 and the actual path to take is S->A->D->E.
Making a Slight Change
Let’s add one more vertex to the graph, vertex G. Let’s make this vertex by itself with no edges connected it to the existing graph.
Rerunning the exact same query with the same parameter now produces a slightly different result.
Taming the TigerGraph
Accumulators provide a powerful tool to gather additional information during a graph traversal. The implementation provided for Dijkstra was done more to highlight the features present and provided for TigerGraph as opposed to making the most efficient implementation in GSQL. Indeed many decisions for the algorithm were made to simply highlight what can be done in GSQL.