Fitting the Data: AllegroGraph & SPARQL
Josh Perryman

Fast, Commercial Software……on Linux Only

Now four posts into this series we look at AllegroGraph, a commercial triple-store database.

(Be sure to see the first post to understand the problem we’re trying to solve, the second post for the Neo4j evaluation, and the third post for the Titan evaluation.)

Turning from the open-source Titan to the commercial AllegroGraph was like stepping out of my 1998 4Runner and taking a spin in my boss's BMW. It was fast. It was sleek. It had all of the modern thingamajiggies that come with new, well-engineered solutions built by companies with the resources to do it well.

Get Working Fast

There was no setup. I simply downloaded an Ubuntu virtual machine that was pre-configured, complete with working examples. The APIs were clean and coherent and came with extensively documented tutorials that worked right out of the box. The tutorials themselves were well thought out and progressively developed your understanding of the API and the server capabilities. Getting to work in Python for a bit was pretty fun as well.

In addition to the APIs, there are a few different clients supported by AllegroGraph, including the WebView web interface to the server and the Gruff application. And finally, since the server supports the recursively named SPARQL query language, which is defined by the W3C, there was a host of additional examples out on the Internet.

About Triple-stores

A triple-store isn't specifically a graph database, though it does allow one to store graph data. In a triple-store, everything is stored as subject, predicate, object assertions. AllegroGraph is technically a quint-store, since it also stores a graph name and an index number for every triple. But focusing only on the triples, here's an example of how some basic assertions would look:
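A handful of assertions about one of our cost entities might look something like this (the entity identifier and attribute names here are illustrative, chosen to match the query shown later, not taken from the actual data set):

```turtle
# subject        predicate            object
<entity:42>      <rdf:type>           "1" .
<entity:42>      <attr:name>          "Payroll System" .
<entity:42>      <attr:hardwareCost>  "12000" .
<entity:42>      <attr:laborCost>     "48000" .
```

Each line is one complete, independent assertion; the graph emerges from triples that share subjects or objects.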

Several relationships are so common, like the Type or Class designations, that there are standard definitions for them available on the W3C web site, under acronyms like RDF and OWL. I should also mention that it is possible to designate a data type for the object field. This ability, if thoughtfully implemented, can improve indexing and query performance.
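Typing the object field is done by tagging the literal with a datatype IRI, typically one of the standard XML Schema datatypes. A sketch, reusing the illustrative names from above:

```turtle
# untyped: the store sees an opaque string
<entity:42> <attr:hardwareCost> "12000" .

# typed: the store can index and compare this as an integer
<entity:42> <attr:hardwareCost> "12000"^^<http://www.w3.org/2001/XMLSchema#integer> .
```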

So learning the technology was a breeze. Getting the data in wasn’t that challenging. Then it came time to do some querying. Let’s just say that SPARQL has all of the elegance of XML.

SPARQL, Like XML: Verbose

The simple task of asking for the cost properties of one type of entity in our data and then adding together the cost values required a query like this:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX attr: <>
SELECT ?name ?hardwareCost ?softwareCost ?laborCost ?totalCost WHERE {
  ?entity rdf:type "1" ;
          attr:name ?name .
  OPTIONAL { ?entity attr:hardwareCost ?hardwareCost }
  OPTIONAL { ?entity attr:softwareCost ?softwareCost }
  OPTIONAL { ?entity attr:laborCost ?laborCost }
  BIND(?hardwareCost + ?softwareCost + ?laborCost AS ?totalCost)
}
ORDER BY ?name
Those PREFIX statements serve as aliases so that we can make the query more readable. The same query in Cypher, though, is much more succinct:
MATCH (E1 { type : 1 })
RETURN E1.name, E1.hardwareCost, E1.softwareCost, E1.laborCost,
       E1.hardwareCost + E1.softwareCost + E1.laborCost AS `totalCost`

Then I ran into the challenge that relationships really only go one way in a triple-store (though I’m sure that this can be addressed by duplicating each relationship in the opposite direction). This made following one-way paths, like the following, very easy:

[Figure: one-way path through 5 entities]
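A uniform one-way chain like this maps cleanly onto a SPARQL property path. A sketch, assuming a hypothetical attr:feeds predicate links each entity to the next:

```sparql
# four hops forward from the first entity to the fifth
SELECT ?end WHERE {
  <entity:1> attr:feeds/attr:feeds/attr:feeds/attr:feeds ?end .
}
```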

But let’s say that our same entities didn’t have one clear path through them, though they are still connected, like this:

[Figure: multiple-direction path through 5 entities]

In this case, for some parts of the query we’ll follow the usual subject, predicate, object pattern, but for other parts we’re looking at using an object, predicate, subject pattern.
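Concretely, that means swapping the subject and object positions in some triple patterns (SPARQL 1.1 also offers the ^ inverse-path operator for this). A sketch, again with the hypothetical attr:feeds predicate:

```sparql
SELECT ?a ?e WHERE {
  ?a attr:feeds ?b .   # forward: subject, predicate, object
  ?c attr:feeds ?b .   # reversed: we reach ?c from ?b via the object position
  ?c attr:feeds ?d .   # forward again
  ?e attr:feeds ?d .   # reversed again
}
```

The query still works, but the reader now has to track which direction each pattern is being traversed, which is where the kludginess creeps in.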

Some of the sophisticated queries that Cypher makes easy and Gremlin makes possible don't lend themselves to elegant expression in SPARQL. What was already inelegant started to get positively kludgy with more complex graph pattern queries.

It was becoming clear to me that triple-stores and SPARQL, while a capable and even compelling technology, were not the right fit for the requirements of this data set. At least they weren't as good a fit as Neo4j & Cypher.

So we had looked at three alternatives to the “graph in SQL tall tables” approach, found all usable and performant to one extent or another, and in general, preferable to the current implementation. But there was one thing that we hadn’t tried yet, a normalized implementation in SQL, one that tried to take advantage of the strengths of the RDBMS query engine and not fight against it. Next week we’ll look at a “wide tables” implementation.