We are meeting more people who are interested in looking into the world of graph databases. Palladium has executed proofs-of-concept for clients to help them explore this world. In this post we summarize the sorts of questions we believe a proof-of-concept project can answer, and how we typically tackle them. For our presentation at Graph Day, we'll be walking through one in particular, but there are a variety of answers you may want:
- An idea of risks and timelines to expect for using a graph database in a larger project.
- Whether your data is inherently suited or unsuited to a graph database.
- If the query & analysis capabilities peculiar to graph databases enable you to ask & answer questions of high value to your business.
- Suitability of a graph database as a replacement for a homegrown system.
- Ability of a proposed solution to scale to your ingress, data size, and query speed requirements.
We typically tackle an evaluation project like this in four phases. It can take a few weeks or a few months, depending on the depth of the questions and the maturity of the answer required.
Phase 1: Frame the Problem
- What business problem are you solving? Understanding this helps us stay focused on the right questions as we explore options during data modeling and benchmarking.
- Summarize the opportunity, compelling event, or pain point that has necessitated this PoC.
- Establish clear access to stakeholders who can define the high-value business problems, and to technical resources who understand the data and the existing compute facilities & applications.
- What are the objectives of the PoC?
- Are you kicking the tires, de-risking some ideas, or working to size an implementation?
- Are you looking for better performance on existing data, modeling new data, or researching entirely new capabilities?
- Are you comparing vendors in the graph space? Comparing with non-graph representations?
- Does success look like a faster answer to a known question? Previously unasked questions? Progress against a larger project roadmap?
- Where will the data come from?
- Do you have existing datasets we can use?
- Will we combine data from multiple systems?
- If no data exists for a speculative idea, can we create a plan to synthesize it based on natural domain properties?
- What questions will we ask of the data?
- Transactional queries (e.g. all downstream components affected by this failed network switch)
- Whole-graph analysis (e.g. most-connected components, clustering, etc.)
- What is the expected data ingress rate?
- What is the expected geographical distribution of the database?
- What applications will access this data and how?
- Provision appropriate hardware
- Can we use cloud instances?
- Do you have onsite compute capacity that is more appropriate?
- What scale are we testing at? For a capability PoC, we may be able to get reasonable results from a small number of users or small data. For a performance PoC, we should model high concurrency and/or large datasets.
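The two query styles above stress a database very differently: a transactional query touches one neighborhood of the graph, while a whole-graph analysis touches everything. A minimal in-memory sketch of the distinction, using a toy network topology (all names hypothetical):

```python
from collections import deque

# Toy topology (hypothetical data): each node maps to its downstream components.
downstream = {
    "switch-A": ["rack-1", "rack-2"],
    "rack-1": ["host-1", "host-2"],
    "rack-2": ["host-3"],
    "host-1": [], "host-2": [], "host-3": [],
}

def affected_by_failure(start):
    """Transactional-style query: BFS for everything downstream of a failed node."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for child in downstream.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Whole-graph-style analysis: scan every node to find the most-connected one.
most_connected = max(downstream, key=lambda n: len(downstream[n]))

print(sorted(affected_by_failure("switch-A")))
# → ['host-1', 'host-2', 'host-3', 'rack-1', 'rack-2']
```

Knowing which of these styles dominates your workload shapes everything downstream: the data model, the indexes, and the hardware the PoC should be sized against.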
Phase 2: Data modeling & confirmation
- We present several data models with different theoretical tradeoffs for performance and expressiveness. Depending on the scope of the PoC, these may include non-graph models (e.g. relational or document store) and your existing database if available: we want a control set.
- We require interaction with your domain experts to confirm assumptions made during modeling so that business rules are not violated.
- It helps to study data models for existing systems if they are available, or systems which use the existing data.
- A large graph database typically has several different graphs hiding in it: each will be modeled separately.
- For large data sets, suggest partitioning strategies in line with expected business, scale and geographic requirements.
- If data does not exist, then a rigorous modeling of the shape of the data is required.
- What is the distribution of data in different categories?
- Is the graph fundamentally like physical-world data, like a road network, or highly connected like a social network?
- What size and growth rate is expected?
- Is there base data we should multiply & permute to achieve a large enough dataset?
- Can we define the behavior of pseudo-random agents who will construct a synthetic graph for us?
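The agent idea in the last bullet can be sketched in a few lines. Here, each new "agent" joins the graph and connects to existing nodes, preferring well-connected ones (preferential attachment), which yields the skewed degree distribution typical of social-style graphs; a uniform choice instead would look more like sparse physical-world data. The function name and parameters are illustrative, not from any particular tool:

```python
import random

def synthesize_graph(n_agents=1000, edges_per_agent=3, seed=42):
    """Agent-based synthesis sketch: grow a graph one node at a time."""
    rng = random.Random(seed)          # fixed seed so trials are repeatable
    edges = [(0, 1)]                   # seed graph: two connected nodes
    targets = [0, 1]                   # each node appears once per incident edge
    for new_node in range(2, n_agents):
        for _ in range(edges_per_agent):
            # Sampling from `targets` weights choices by current degree.
            neighbor = rng.choice(targets)
            edges.append((new_node, neighbor))
            targets += [new_node, neighbor]
    return edges
```

Before loading, we would check the synthetic graph's degree distribution and size against the natural domain properties agreed on in Phase 1, and adjust the agent behavior until they match.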
Phase 3: Data Loading & Trials
- Load or synthesize the data into different data models, looking at design tradeoffs such as whether objects should be represented as vertices or edges, whether to represent metadata in the graph or let it be implied in queries, and how to deal with divergent requirements, such as large table-like requirements embedded in a graph representation.
- Measure speed of agreed queries individually and under concurrent load. These form an objective set of answers to the performance or capability questions.
- Evaluate queries with domain experts for correctness, subjective usefulness, and ease of programming.
- Evaluate data representation strategies so as to best utilize index structures available in the underlying database. This can also involve denormalization.
- Correct mistakes or mis-models and repeat, attempting to approach optimal I/O patterns and speed as suggested by design of the underlying system and physical hardware limits.
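The measurement step above can be sketched with standard-library tools. This harness times a query both one-at-a-time and under concurrent load; `query_fn` is a stand-in for whatever client call the PoC actually exercises (the function and its parameters are illustrative):

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def benchmark(query_fn, n_calls=100, concurrency=8):
    """Median latency of `query_fn`, solo and under `concurrency` threads."""
    def timed():
        start = time.perf_counter()
        query_fn()
        return time.perf_counter() - start

    solo = [timed() for _ in range(n_calls)]            # individual latency
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        loaded = list(pool.map(lambda _: timed(), range(n_calls)))  # under load

    return {
        "solo_median_ms": statistics.median(solo) * 1000,
        "loaded_median_ms": statistics.median(loaded) * 1000,
    }
```

Comparing the solo and loaded medians, run against each candidate data model, is what lets us say whether a mis-model or the underlying system is the bottleneck before the next iteration.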
Phase 4: PoC wrap up
- Document questions asked, data models proposed, and raw test results.
- Recommend whether a graph database is appropriate, measured against your business objectives and the alternatives.
- Work with project champion to document potential ROI achieved from performance or capability improvements.
- Detailed technical presentation to technical staff, high-level summary to executive staff, followed by Q&A.
We’ve seen recommendations for and against graph: it’s good to know whether it’s right for your project. If you think an engagement like this would be useful to you, drop us a line.