The Challenge of Open-source Tools
In this third post we look at how the application’s graph data fit into another property graph database: Titan. Getting the data in wasn’t too easy, and getting it out proved a challenge as well.
(Postscript: since this evaluation was finished, the developers of Titan were hired by DataStax, the Cassandra people. We’d expect ease of use to advance dramatically.)
After the Neo4j evaluation, we turned our guns on Titan. Titan was of particular interest because it boasts the ability to scale to hundreds of billions of nodes and relationships (or, using the Titan nomenclature, vertices and edges). It has the advantage of being open-source (Apache License 2.0) and supporting a variety of back-end databases (Cassandra, HBase, BerkeleyDB, or Persistit). Finally, Titan makes extensive use of the TinkerPop stack, a family of open-source, graph-oriented projects.
The real power of Titan is its choice of storage back-ends, each of which is optimized for a different use case. If you need ACID-compliant local storage, then BerkeleyDB is a great choice. If you need extreme scalability and are OK with an eventually consistent graph, then Cassandra’s your choice.
Gremlin, the graph traversal language that’s part of the TinkerPop stack, is also a pretty nifty tool. Gremlin is essentially the Groovy dynamic language with additional graph-specific features. The ability to incorporate Closures in the Gremlin statements can be extremely powerful.
But Titan is an open-source project with limited resources, and this becomes noticeable in areas like documentation. It was very difficult to get a performant data load completed. Loading data isn’t hard in itself, but as the graph gets bigger, loading tens of thousands of vertices and edges creates costly transactions. I eventually pieced together the requirements for a batch load by scouring samples and the Google Groups discussions, and only then learned that all vertex property types and all possible edge labels must be defined beforehand. In normal transactional use, vertex property types and edge labels can just be declared to exist and Titan will create them as necessary.
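The difference between the two loading modes can be modeled with a small Python sketch. This is a toy stand-in, not Titan’s API: the `ToyGraph` class, its `batch` flag, and its method names are all hypothetical. It only illustrates the rule described above, that transactional use creates property keys and edge labels on demand, while a batch load expects them to be declared first.

```python
class ToyGraph:
    """Hypothetical in-memory stand-in for a graph store; not Titan's API."""

    def __init__(self, batch=False):
        self.batch = batch          # True mimics a bulk-load mode
        self.property_keys = set()  # declared vertex property names
        self.edge_labels = set()    # declared edge label names
        self.vertices = {}          # vertex id -> property dict
        self.edges = []             # (source id, label, target id)

    def declare(self, property_keys=(), edge_labels=()):
        """Register schema elements up front."""
        self.property_keys.update(property_keys)
        self.edge_labels.update(edge_labels)

    def _check(self, name, declared, kind):
        """Accept a declared name; otherwise auto-create it or fail in batch mode."""
        if name in declared:
            return
        if self.batch:
            raise ValueError(f"{kind} {name!r} must be declared before a batch load")
        declared.add(name)  # transactional mode: create on first use

    def add_vertex(self, vid, **props):
        for key in props:
            self._check(key, self.property_keys, "property key")
        self.vertices[vid] = props

    def add_edge(self, src, label, dst):
        self._check(label, self.edge_labels, "edge label")
        self.edges.append((src, label, dst))


if __name__ == "__main__":
    loader = ToyGraph(batch=True)
    # Declare the whole schema first, then load.
    loader.declare(property_keys=["name"], edge_labels=["knows"])
    loader.add_vertex(1, name="alice")
    loader.add_vertex(2, name="bob")
    loader.add_edge(1, "knows", 2)
    print(len(loader.vertices), len(loader.edges))
```

In this model, forgetting to declare a label fails on the first offending edge instead of partway through a long load, which is the kind of surprise the schema-first rule otherwise produces.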
Another challenge in working with Titan is that Gremlin is a better traversal language than a query language. If you know where you are starting in the graph and need to quickly move around, traversing relationships here and there, perhaps even using a Closure to transform or iterate over the vertices, then Gremlin is great. If you just want to declare a pattern to be found in the graph, as is so easy with Cypher, then Gremlin is not the best tool for the job.
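To make the distinction concrete, here is a minimal Python sketch over a toy in-memory graph. It is neither Gremlin nor Cypher, and every name in it (`EDGES`, `out`, `traverse_from`, `match_pattern`) is invented for illustration. The first function walks outward from a known starting vertex and applies a closure to the results, which is the style Gremlin favors. The second has no starting point and has to consider every edge to find a two-hop shape, which is the kind of question a declarative pattern language lets you state directly.

```python
from collections import defaultdict

# Toy graph as (source, label, target) triples. All data here is hypothetical.
EDGES = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("carol", "likes", "titan"),
    ("alice", "likes", "neo4j"),
    ("dave", "knows", "carol"),
]

# Index outgoing edges by (vertex, label) so each hop is a dictionary lookup.
OUT = defaultdict(list)
for src, label, dst in EDGES:
    OUT[(src, label)].append(dst)


def out(vertex, label):
    """Return the vertices reachable from `vertex` over edges named `label`."""
    return OUT.get((vertex, label), [])


def traverse_from(start, labels, transform=lambda v: v):
    """Traversal style: begin at a known vertex and follow `labels` hop by hop.

    `transform` is a closure applied to each vertex the walk ends on.
    """
    frontier = [start]
    for label in labels:
        frontier = [nxt for v in frontier for nxt in out(v, label)]
    return [transform(v) for v in frontier]


def match_pattern(first_label, second_label):
    """Pattern style: find every (a, b, c) with a -first-> b -second-> c.

    No start vertex is given, so every edge in the graph is a candidate.
    """
    matches = []
    for a, label, b in EDGES:
        if label != first_label:
            continue
        for c in out(b, second_label):
            matches.append((a, b, c))
    return matches


if __name__ == "__main__":
    print(traverse_from("alice", ["knows", "knows", "likes"], str.upper))
    print(match_pattern("knows", "likes"))
```

The traversal touches only the vertices reachable from `alice`, while the pattern match scans the full edge list. On a large graph that difference is why a traversal language wants a known entry point.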
These were all solved or solvable problems, but the bottom line was that the customer’s data simply wasn’t big. It was a few gigabytes. Titan will really shine, and be worth the extra setup trouble, if you’ve got terabytes. We had one more graph database solution to evaluate: AllegroGraph, a commercial triple-store.