...that is the question.
We get asked that question a lot given our early customer work with Titan evaluations, participation in the JanusGraph project and usage of Apache TinkerPop while concurrently being a premier DataStax Graph partner.
Why should you ever pay for something if you can get it for free? The conversations usually go something like “Isn’t DSE Graph just a port of the open source graph version that you already worked on?” or “How does DSE Graph perform relative to the way Titan did/Janus does?”
We tell them that there are many well-documented business reasons that should drive the buy vs. build decision. All the classic ones that apply to open source vs. commercially supported flavors are germane in this case. For many customers, building and running the open source stack themselves is simply not practical. This is still a relatively young field, and acquiring the talent to run production-grade graph ecosystems is not trivial.
We then tell them “if you have capable team members who can maintain complex composite environments of graph engines sitting on top of NoSQL backends with integrations to text search and analytics engines, then at least you can consider the option”. In that case, we ask them to consider a few subsequent questions:
- How much of this (top-grade) talent’s time is spent on this infrastructure? And what core business problem of yours would they be working on if they didn’t have to create and maintain this plumbing?
- What happens if that talent leaves?
- Are you willing to bet your mission critical business function on this team? Who carries the beeper and what’s your internal SLA back to your line of business?
After we get past those reasonably classic talking points, we get into some of the specifics of this case. Titan and JanusGraph work over a number of different storage backends, including Apache Cassandra, Apache HBase, Google Cloud Bigtable, and BerkeleyDB. This is made possible through an abstraction layer that assumes the underlying storage engine conforms to the model described in Google’s Bigtable paper. This flexibility is great, but it can come at the cost of performance because it’s more difficult to take advantage of each storage engine’s specific strengths when operating through the additional layer of abstraction. DSE Graph has intentionally made the choice to forgo this storage flexibility in exchange for certain performance and operational benefits.
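To make the abstraction-layer idea concrete, here is a minimal Python sketch. It is not JanusGraph's or Titan's actual code: the class names, the `edge:`/`prop:` column prefixes and the in-memory backend are all hypothetical, chosen only to show a graph layer written against a generic "row key, sorted columns, values" contract in the spirit of the Bigtable model.

```python
from abc import ABC, abstractmethod


class KeyColumnValueStore(ABC):
    """Hypothetical minimal contract: row key -> sorted columns -> values."""

    @abstractmethod
    def put(self, row_key, column, value):
        """Store one value under (row_key, column)."""

    @abstractmethod
    def get_slice(self, row_key, start, end):
        """Return (column, value) pairs with start <= column < end, in column order."""


class InMemoryStore(KeyColumnValueStore):
    """Toy backend; a real one would wrap Cassandra, HBase, and so on."""

    def __init__(self):
        self._rows = {}

    def put(self, row_key, column, value):
        self._rows.setdefault(row_key, {})[column] = value

    def get_slice(self, row_key, start, end):
        cols = self._rows.get(row_key, {})
        return [(c, cols[c]) for c in sorted(cols) if start <= c < end]


def edges_of(store, vertex_id):
    # The graph layer only knows the generic contract, so it cannot use any
    # backend-specific feature. b";" is the byte right after b":", which makes
    # this range a prefix scan over every column starting with b"edge:".
    return [value for _, value in store.get_slice(vertex_id, b"edge:", b"edge;")]


store = InMemoryStore()
store.put(b"v1", b"prop:name", b"alice")
store.put(b"v1", b"edge:knows:2", b"v2")
store.put(b"v1", b"edge:knows:3", b"v3")
print(edges_of(store, b"v1"))
```

Any backend that implements those two methods can be swapped in, which is where the flexibility comes from. The cost is that `edges_of` can never ask a particular backend to do anything smarter than a slice read.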
DSE Graph, “pretty much an entire rewrite” of Titan, is tightly coupled with Apache Cassandra and DSE’s other integrated components. DataStax has undertaken a rewrite of legacy Titan code with the intention of improving performance, scalability, operations and data integrity. While performance and scaling claims tend to be highly use case/query specific and should always be read with healthy skepticism, there are indeed operational and data integrity benefits to DSE Graph over Titan. Data synchronization across components is a particularly thorny problem that DataStax has put a lot of engineering time into, and it shows in more user-friendly operation.
Data Integrity - JanusGraph and Titan are commonly deployed with a third-party indexing solution such as Elasticsearch or Apache Solr. These provide robust full-text and geospatial search capabilities, but JanusGraph and Titan are responsible for keeping the raw source graph data and the indexes in sync. Though much work has been done to keep these systems synchronized, there are scenarios where a divergence can occur and a potentially expensive index rebuild becomes necessary. DSE Graph (DSEG) manages much of this internally, and while it’s not ACID, it is eventually consistent without the customer having to intervene.
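The divergence scenario is easier to picture with a toy example. The Python sketch below is an illustration only, not how JanusGraph, Titan or DSE Graph actually write data: `FlakyIndex`, `dual_write`, `find_divergence` and `rebuild_index` are hypothetical names, and the "graph" is just a dictionary. It shows how writing to two systems in sequence can leave them out of step when the second write fails, and why the repair means walking the source data again.

```python
class FlakyIndex:
    """Stand-in for an external search index whose writes can fail."""

    def __init__(self, fail_on=()):
        self.docs = {}
        self._fail_on = set(fail_on)

    def put(self, key, text):
        if key in self._fail_on:
            raise ConnectionError(f"index write failed for {key}")
        self.docs[key] = text


def dual_write(graph, index, key, text):
    """Write to the source of truth first, then to the index."""
    graph[key] = text
    try:
        index.put(key, text)
    except ConnectionError:
        # Deliberately swallowed here to show the drift; a real system
        # would log, retry or queue the failed write.
        pass


def find_divergence(graph, index):
    """Keys whose indexed text is missing or differs from the graph."""
    return sorted(k for k, v in graph.items() if index.docs.get(k) != v)


def rebuild_index(graph, index):
    """Full rebuild: re-read every record from the source of truth."""
    index.docs = dict(graph)


graph = {}
index = FlakyIndex(fail_on={"v2"})
for key, text in [("v1", "alice"), ("v2", "bob"), ("v3", "carol")]:
    dual_write(graph, index, key, text)

print(find_divergence(graph, index))  # the record the index never received
rebuild_index(graph, index)
print(find_divergence(graph, index))  # empty once the rebuild completes
```

On three records the rebuild is trivial. On a production graph it is a full scan of the source data, which is the expense the paragraph above refers to.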
Cassandra Storage and Parallel Traversal Step Execution - JanusGraph and Titan store data in Cassandra as byte arrays, whereas DSEG stores data unpacked in CQL tables. With the byte-array approach, many common access patterns such as vertex and edge retrieval can be performed quickly and efficiently. The downside is that the storage engine (Cassandra, for the sake of argument) can’t “see” into the data and is therefore limited in what processing it can perform on it without passing it back to JanusGraph/Titan for further processing. By contrast, the DSEG execution engine can make better use of C* (the underlying Cassandra), permitting more of the traversal work to be done in parallel by Cassandra on the nodes where the relevant data is already stored.
Example - Say you’re storing a social network, where people have friends. These people and their relationships have various properties on them. We can easily imagine many queries filtering on one property or another (e.g., “purchases that include a particular brand of peanut butter” or “friends added in the last 30 days”). With Janus/Titan, unless a property is indexed, that filtering cannot be pushed down into the storage layer and must instead happen inside JanusGraph/Titan. For queries that hit a small number of vertices and edges, this difference will be negligible, but it can quickly cause problems if you’re having to pull large result sets off your C* cluster for further filtering.
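A rough Python sketch of the “friends added in the last 30 days” case shows where the cost comes from. Everything here is a hypothetical stand-in: `OpaqueStore` and `ColumnStore` are toy classes, the 1,000 edges and the reference date are made up, and neither class reflects how Cassandra, JanusGraph/Titan or DSE Graph are actually implemented. The only point is to count how many records have to leave the storage layer in each case.

```python
import json
from datetime import date, timedelta

# Hypothetical data: 1,000 friend edges for one person, one added per day.
TODAY = date(2017, 6, 1)  # arbitrary reference date for the sketch
EDGES = [
    {"friend_id": i, "added": (TODAY - timedelta(days=i)).isoformat()}
    for i in range(1000)
]
CUTOFF = (TODAY - timedelta(days=30)).isoformat()


class OpaqueStore:
    """Holds each edge as a byte array the store cannot interpret."""

    def __init__(self, edges):
        self._blobs = [json.dumps(e).encode("utf-8") for e in edges]

    def scan(self):
        # No visibility into the blobs, so every one of them is returned.
        return list(self._blobs)


class ColumnStore:
    """Holds each edge as named columns the store can filter on."""

    def __init__(self, edges):
        self._rows = list(edges)

    def scan_where(self, column, predicate):
        # The filter runs where the data lives; only matches are returned.
        return [r for r in self._rows if predicate(r[column])]


def recent_friends_opaque(store):
    blobs = store.scan()
    decoded = [json.loads(b.decode("utf-8")) for b in blobs]
    # ISO dates sort lexicographically, so string comparison is enough.
    matches = [e for e in decoded if e["added"] >= CUTOFF]
    return matches, len(blobs)  # (results, records transferred)


def recent_friends_pushdown(store):
    rows = store.scan_where("added", lambda v: v >= CUTOFF)
    return rows, len(rows)  # (results, records transferred)


same_a, moved_a = recent_friends_opaque(OpaqueStore(EDGES))
same_b, moved_b = recent_friends_pushdown(ColumnStore(EDGES))
print(len(same_a), moved_a)
print(len(same_b), moved_b)
```

Both functions return the same friends. The difference is that the opaque version ships every edge to the graph layer before discarding most of them, and that gap widens as the number of edges per vertex grows.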
Tooling and DSE Studio
Common administrative tasks must be done: backups, restores, profiling and performance monitoring. In both cases, there are approaches to handle these sometimes mundane but mission critical tasks. DSE OpsCenter is a known but frequently underappreciated value.
Lastly, a commonly overlooked step of a project is data exploration. As part of the Expero Insights process, we counsel our customers to really understand their data at a very early stage. A key tactic that we use is visual exploration of the data as it stands right now. Without it, you may discover too late that some use cases cannot be supported by your data, or you may miss the art of the possible. DSE Graph includes DSE Studio, which is a terrific tool for seeing what’s there and understanding the relationships. If you are to do any meaningful graph work, you’ll need to know what’s in your data.
How would you do this critical step without something like DSE Studio? To be sure, there are freely available tools but that’s yet another tool that needs to be acquired, connected and maintained.
So while there’s no “one-size-fits-all” answer to this question, we do believe that customers should take a deeper look at their own ROI model, coupled with a deeper understanding of the apples vs. oranges comparison they are making.
In future segments in this blog series, we’ll explore scenarios where open source options are attractive, sample ROI models and specific items to evaluate as you decide on your backend graph engine.