As part of the work we’re doing to refresh our graph database evaluation for a couple of clients (and our upcoming talk at Graph Day!) we took Titan 1.0 out for a spin last week. We’ll be doing more in-depth explorations on some in-house and public datasets over the next few weeks, but here are some preliminary impressions based on a contrast with the Titan we came to know a year or so ago. The goodness is mostly distributed between useful data modeling features (largely v0.5) and the upgrade to TinkerPop 3 (v1.0).
(New with Titan 1.0) Getting to know Titan involves getting to know a lot of cute code names: TinkerPop, Giraph, Faunus, Rexster, Gremlin. Every complex offering has subcomponents, and we enjoy when they have some personality. But it’s also nice when things get simplified, and that’s what happened with Titan 1.0. The old standalone project “Rexster” (which did a lot of different things) has been folded into TinkerPop 3 and is now called “Gremlin Server”. It speaks a WebSocket protocol and focuses only on executing Gremlin queries remotely. If you aren’t working in a JVM language, Gremlin Server is how you connect to Titan.
(New with Titan 0.5) If you want a distributed database, you’ve got tons of data, and you more than likely have that much because of a high rate of influx. How do you know when you need to recalculate a key analysis, e.g. a clustering or network flow? You could just run it every 24 hours, or you could emit certain kinds of transactions into Titan’s transaction log and trigger the analysis from there. For instance, any time a new device with a network connection appeared, you could queue up a network connectivity analysis, which would run on demand but no more often than every X minutes.
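Here is a minimal sketch of that log-and-trigger idea, based on our reading of the transaction log API in the Titan 1.0 documentation. It assumes an open Titan graph is already bound to `graph` (as in the Gremlin console). The log name `newDevice`, the `device` label, the `hostname` property and the counter are all hypothetical names picked for illustration; a real processor would queue the connectivity analysis instead of counting.

[code language="groovy"]
import com.thinkaurelius.titan.core.*
import com.thinkaurelius.titan.core.log.*
import org.apache.tinkerpop.gremlin.structure.T
import java.time.Instant
import java.util.concurrent.atomic.AtomicInteger

// Assumption: `graph` is an already-open TitanGraph.
newDeviceCount = new AtomicInteger(0)

// Consumer side: register a processor on the (hypothetical) 'newDevice' log.
logProcessor = TitanFactory.openTransactionLog(graph)
logProcessor.addLogProcessor('newDevice').
    setProcessorIdentifier('connectivityTrigger').
    setStartTime(Instant.now()).
    addProcessor(new ChangeProcessor() {
        @Override
        void process(TitanTransaction tx, TransactionId txId, ChangeState changeState) {
            // Count added 'device' vertices; a real processor would queue the analysis here.
            changeState.getVertices(Change.ADDED).each { v ->
                if (v.label() == 'device') newDeviceCount.incrementAndGet()
            }
        }
    }).
    build()

// Producer side: tag the transaction so its changes are written to the 'newDevice' log.
tx = graph.buildTransaction().logIdentifier('newDevice').start()
tx.addVertex(T.label, 'device', 'hostname', 'edge-router-17')
tx.commit()
[/code]

The processor runs asynchronously, so throttling ("no more than every X minutes") would live in whatever consumes the trigger, not in the transaction that wrote the device.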
Another great use case is made possible by
(Supported properly in Gremlin/TinkerPop in 1.0, though introduced in Titan 0.5) A vertex’s property assignment (e.g. rating = 4.8) can itself have properties associated with it. (Basically the property behaves like another edge, which can carry its own properties.) Why go down this path? It’s a nice fit for something that’s often handled by triggers in a traditional RDBMS: attribution and auditing. Who set the rating? If that’s a significant part of the domain, it should be captured in the data model, and perhaps the rating itself should be a vertex. But often it’s enough to tag the property with the time it was updated (last-modified=2010.05.21T14.12.33), or even with the identity that updated it (last-modified-id=fr44322).
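A small sketch of what that looks like in TinkerPop 3 terms, using a throwaway in-memory Titan graph. The `product` label and `name` property are hypothetical; the rating and its two audit properties are the ones from the paragraph above.

[code language="groovy"]
import com.thinkaurelius.titan.core.TitanFactory
import org.apache.tinkerpop.gremlin.structure.T

// Assumption: an in-memory Titan graph is enough for illustration.
graph = TitanFactory.build().set('storage.backend', 'inmemory').open()
g = graph.traversal()

// Hypothetical 'product' vertex; the rating carries its own audit properties.
product = graph.addVertex(T.label, 'product', 'name', 'widget')
product.property('rating', 4.8d,
    'last-modified', '2010.05.21T14.12.33',
    'last-modified-id', 'fr44322')
graph.tx().commit()

// Read back who set the rating, and when.
audit = g.V(product).properties('rating').valueMap().next()
println audit
[/code]

The audit trail travels with the property itself, so no extra vertex or edge is needed just to answer "who set this?".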
(New with Titan 0.5) Once your graph is big and important enough, you’ll look for support in keeping your schema straight, and likely in limiting your queries to certain kinds of objects or relationships. It’s great that you have a huge graph that combines knowledge of your people, software, hardware, processes, social links and the like, but when a piece of hardware fails, it’s unlikely you want to gum up your analysis by following social links. With labels, you can restrict the query to follow only certain kinds of relationships (edges) in your graph, for instance looking only at network connections.
[code language="groovy"] g.V(failedComponentId).outE('network-connection')... [/code]
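Labels are declared through Titan’s management API. The following is a sketch under stated assumptions: it uses a throwaway in-memory graph, and the `server` and `person` vertex labels, the `administered-by` edge label and the sample hostnames are hypothetical. It shows a label-restricted traversal ignoring edges from another part of the domain.

[code language="groovy"]
import com.thinkaurelius.titan.core.Multiplicity
import com.thinkaurelius.titan.core.TitanFactory
import org.apache.tinkerpop.gremlin.structure.T

// Assumption: an in-memory Titan graph; point this at your own backend instead.
graph = TitanFactory.build().set('storage.backend', 'inmemory').open()

// Declare the schema up front rather than relying on automatic creation.
mgmt = graph.openManagement()
mgmt.makeVertexLabel('server').make()
mgmt.makeVertexLabel('person').make()
mgmt.makeEdgeLabel('network-connection').multiplicity(Multiplicity.MULTI).make()
mgmt.makeEdgeLabel('administered-by').make()
mgmt.commit()

// Hypothetical data: two servers and the person who administers one of them.
db  = graph.addVertex(T.label, 'server', 'hostname', 'db01')
web = graph.addVertex(T.label, 'server', 'hostname', 'web01')
pat = graph.addVertex(T.label, 'person', 'name', 'pat')
db.addEdge('network-connection', web)
db.addEdge('administered-by', pat)
graph.tx().commit()

// Follow only network connections from db01; the administered-by edge is never touched.
g = graph.traversal()
reachable = g.V(db).out('network-connection').values('hostname').toList()
[/code]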
If your graph is big enough to need Titan, you’ll care deeply about your indexes. Labels give you a way to restrict an index to only the labels relevant to the work in question, for instance building an index on the medium property only for network connections, so you can efficiently find, say, just the ethernet connections.
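[code language="groovy"]
// Assumes mgmt = graph.openManagement(); finish with mgmt.commit() to register the index.
medium = mgmt.makePropertyKey('medium').dataType(String.class).make()
networkCxn = mgmt.getEdgeLabel('network-connection')
// Composite index on 'medium', built only for network-connection edges
mgmt.buildIndex('byConnectionMedium', Edge.class).addKey(medium).indexOnly(networkCxn).buildCompositeIndex()
[/code]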
(Note that using labels as an index in and of themselves is a separate topic for another day. It may sound useful, but to make an analogy with relational database systems, think of labels as table names. They don’t have to be used this way, but the comparison is pretty reasonable: all the entities of one kind should fit in one ‘label’, or one ‘table’. Having a label name in hand is about as useful as having a table name in hand. It’s a start, but no one ever started a great RDBMS query plan with TABLE SCAN FULL. You’re going to want a more discriminating index if your dataset is at all large.)
Another nifty feature labels allow, if you’re using a Cassandra backend, is TTL (‘time to live’) on vertices or edges. Let’s say you’re using Titan to ingest massive amounts of log or sensor data, but don’t need to keep more than 30 days of data in the system. Because the TTL clock starts when each element is written, you could define a single label for that data, give it a TTL of 30 days, and then watch the database clean up after you.
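A sketch of setting a TTL through the management API, as we understand it from the Titan 1.0 documentation. It assumes `graph` is an open Titan graph on a backend that supports cell-level TTL (such as Cassandra); the `sensor-reading` and `log-event` label names are hypothetical.

[code language="groovy"]
import java.time.Duration

// Assumption: `graph` is an open TitanGraph backed by Cassandra (TTL needs backend support).
mgmt = graph.openManagement()

// Hypothetical edge label for ingested sensor readings; each edge expires 30 days after it is written.
reading = mgmt.makeEdgeLabel('sensor-reading').make()
mgmt.setTTL(reading, Duration.ofDays(30))

// Vertex labels need to be declared static before a TTL can be set on them.
logEvent = mgmt.makeVertexLabel('log-event').setStatic().make()
mgmt.setTTL(logEvent, Duration.ofDays(30))

mgmt.commit()
[/code]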
(New with Titan 1.0/TinkerPop 3) TinkerPop 3 brings several changes to the raw Gremlin query language for Titan. Where before it was really a query plan laid bare (follow this edge to this vertex), it now has more declarative operators like match(). This makes for more readable code, particularly for those used to working in SQL:
[code language="groovy"]
g.V().match(
    __.as('vmHost').out('hosts').as('guest'),
    __.as('guest').out('hasOS').as('guestOS'),
    __.as('guestOS').has('distribution', 'CentOS').has('version', 6.3)).
  select('vmHost','guest').by('hostname')
[/code]
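For contrast, here is a sketch of the same question written in the step-by-step style, where the traversal order is spelled out by hand. The sample data is hypothetical (one host, one CentOS 6.3 guest, one Ubuntu guest) and lives in a throwaway in-memory graph. Note the assumption that versions are stored as doubles, hence the explicit `6.3d`.

[code language="groovy"]
import com.thinkaurelius.titan.core.TitanFactory
import org.apache.tinkerpop.gremlin.structure.T

// Assumption: an in-memory Titan graph with automatic schema creation.
graph = TitanFactory.build().set('storage.backend', 'inmemory').open()
g = graph.traversal()

// Hypothetical data: host hv01 runs vm-a (CentOS 6.3) and vm-b (Ubuntu 14.04).
host   = graph.addVertex(T.label, 'host', 'hostname', 'hv01')
guestA = graph.addVertex(T.label, 'guest', 'hostname', 'vm-a')
guestB = graph.addVertex(T.label, 'guest', 'hostname', 'vm-b')
centos = graph.addVertex(T.label, 'os', 'distribution', 'CentOS', 'version', 6.3d)
ubuntu = graph.addVertex(T.label, 'os', 'distribution', 'Ubuntu', 'version', 14.04d)
host.addEdge('hosts', guestA)
host.addEdge('hosts', guestB)
guestA.addEdge('hasOS', centos)
guestB.addEdge('hasOS', ubuntu)
graph.tx().commit()

// Walk host -> guest -> OS in a fixed order, filter on the OS, then project the labeled steps.
pairs = g.V().as('vmHost').out('hosts').as('guest').
          out('hasOS').has('distribution', 'CentOS').has('version', 6.3d).
          select('vmHost', 'guest').by('hostname').toList()
[/code]

Here the author of the query picks the traversal order; with match(), the patterns are declared and the ordering is left to the engine.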
The match() expression has a lot more capabilities beyond this simple example. We’re looking forward to using it to interrogate some gnarly data sets. Set-based instincts from SQL can still apply to the world of graph-traversal. Recall that while the mental model of a graph traversal is of an arbitrary number of parallel traversing agents, in practice these traversers are merged when they meet at a common place so as to avoid combinatorial explosion. In other words, they travel in sets if you let them.
(New with Titan 1.0/TinkerPop 3) (Much of) the Gremlin language is now supported for OLAP, i.e. entire-graph analyses. Where previously you had to break out Giraph or Faunus as separate tools, there is now a GraphComputer abstraction that combines Gremlin traversal with a Map/Reduce step to do the work more compactly. We’ll be working an example of this in the coming weeks, but suffice it to say that whole-database analytics are one of the most compelling reasons to go with a graph representation of your data. When you have a business question that can be posed as a clustering, network flow or matching model, a graph model gives you considerable power over a relational one.