As covered extensively in previous blogs here, here and here, representing enterprise data in graph database models offers several advantages for certain use cases. But what's rarely discussed is: "Where does this data come from?", "How does it originate?", and "How can I be sure I don't lose any of it before it reaches my graph?"
Over the last several years, one of the most common pieces of infrastructure we encounter is Kafka, or its commercially offered sibling, Confluent. Take for instance the supply chain problem described in our case study of a multinational equipment manufacturer. Graph technology and machine learning techniques are required, to be sure, but what does it take for the data to be collected in the disparate facilities, organized and annotated by source, and interpreted and processed as it goes into the graph?
Bringing this data from a variety of input formats into a cohesive form, so that it can be processed and mined for trends, requires making good technology choices. Solutions are often arrived at by stitching together different technology stacks into a data pipeline, which in the long run tends to require significant maintenance and refactoring to keep the pipeline consistent with evolving business needs.
Ideally, a unified approach built on a single technology stack is preferable. Kafka provides the means of doing so in a flexible, reliable and consistent fashion.
In our client's case, they had scores of warehouses and fulfillment centers, each with thousands of sensors at different points in the building. IoT messages are sent as components pass different junctures, either to be packaged for shipment or selected for inclusion in a larger assembly.
To feed the target graph, it was also necessary to build out the means to coalesce the data into a consistent format that the graph database could consume for processing.
By using Kafka, the amount of proprietary code needed to feed the graph is minimized, and in some cases eliminated.
Immutability and Data Durability
To reliably process information in a graph format, the data that sources the graph process needs to adhere to two principles: immutability, meaning the data collected cannot be changed in transit, and durability, meaning it will still be there when it is time to consume it.
Data submitted to Kafka sticks around. Kafka employs a commit log metaphor: as data is posted to a Kafka broker, it is recorded in an immutable form. Posting to a conventional database scheme to coalesce the information for processing necessitates stringent schema management to avoid system reliability issues. Moving the data to Kafka instead allows for dynamic, ad-hoc representation of the data stream, the ability to process it in real time, and the ability to evolve the underlying data schema without impacting existing production implementations.
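The commit log idea can be illustrated with a minimal sketch in plain Python (this is a conceptual stand-in, not the real Kafka client): records are only ever appended, each at a monotonically increasing offset, and an "update" is simply a newer record.

```python
# Conceptual sketch of an append-only commit log (not the Kafka client).
class CommitLog:
    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record; returns its offset. Records are never mutated."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Committed records remain re-readable at any time."""
        return self._records[offset]

log = CommitLog()
offset = log.append({"part_id": "A-100", "status": "SHIPPED"})
# "Changing" the data means appending a superseding record, not editing in place.
log.append({"part_id": "A-100", "status": "DELIVERED"})
```

The part IDs and fields here are hypothetical; the point is that consumers can always re-read offset 0 and see the original record unchanged.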
Data posted to the Kafka commit log is maintained according to a specified and dynamically configurable retention policy. Durability is an important feature of Kafka: once data is committed it cannot be changed, except by publishing another message that in effect supersedes the prior one. For data that must be available, such as a notification that something has shipped, durability is key to ensuring it will be consumed by the system that analyzes it, such as a graph analysis system.
Data can be retrieved at will, as soon as it is committed to the underlying log. Through replication, committed data is guaranteed to be retained until it is consumed and posted to a managed persistent store, where it can be analyzed and mined for information pertinent to supply chain distribution patterns. And as in many industrial use cases, order matters in the supply chain: composite parts leaving the facility before their component parts arrive is nonsensical, so we need to process this information in the order it was received. Kafka provides the means to reliably preserve ordering within a topic partition, even with concurrent consumer clients processing the stream in real time; keying related messages to the same partition ensures they are consumed in the order they were produced.
A stream processor can then establish how and where the data is transformed and published to a relevant topic for subsequent consumption. The Streams API allows topics to be inspected and acted upon. For example, as new data is published to a general topic, a filter can inspect the contents, decide what needs more structure, and push that data to another topic for further analysis, where the schema may be updated so it can be incorporated into the data pipeline flow. Data can be transformed and merged into new, consistent data streams for consumption by a graph system.
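The filter-and-route step can be sketched as follows, in plain Python standing in for the Kafka Streams DSL (topic names, fields and statuses are hypothetical): records on a general topic are inspected and the ones of interest are pushed to a more specific topic.

```python
# Sketch of filtering a general topic into a more structured one.
general_topic = [
    {"source": "sensor-7", "status": "SHIPPED", "weight_kg": 12.5},
    {"source": "sensor-7", "status": "HEARTBEAT"},
    {"source": "sensor-9", "status": "SHIPPED", "weight_kg": 3.1},
]

# Only shipment events need further analysis; route them onward.
shipment_topic = [r for r in general_topic if r["status"] == "SHIPPED"]
```

In Kafka Streams proper this is a `filter` followed by a `to(...)` writing to the target topic; the effect is the same continuous routing, applied to records as they arrive rather than to a finished list.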
It is possible to scale stream processors in tandem while ensuring each record is processed only once. Furthermore, there are mechanisms to process the topic contents in "chunks" by carving up the stream in a variety of ways; a couple of these are described below.
Suppose, for instance, you wish to process the contents in even chunks over time (throughput per day or per minute, say). You can use what's known as a tumbling window: data is processed in sequential, non-overlapping chunks as it becomes available in the topic stream. This is the simplest and most common form of processing data from a topic.
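A tumbling window can be sketched in a few lines of plain Python (timestamps and the window size are arbitrary illustrative units): each event falls into exactly one fixed-size bucket.

```python
# Sketch of a tumbling window: fixed-size, non-overlapping buckets.
def tumbling_windows(events, window_size):
    """Group (timestamp, value) pairs into buckets of width `window_size`."""
    windows = {}
    for ts, value in events:
        start = (ts // window_size) * window_size  # bucket start time
        windows.setdefault(start, []).append(value)
    return windows

events = [(0, "a"), (3, "b"), (5, "c"), (9, "d"), (11, "e")]
buckets = tumbling_windows(events, window_size=5)
# buckets == {0: ["a", "b"], 5: ["c", "d"], 10: ["e"]}
```

In Kafka Streams this corresponds to a windowed aggregation with a tumbling `TimeWindows` definition; the sketch just shows the bucketing arithmetic.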
Alternatively, perhaps you'd like to watch a trailing window of time to see if your running average throughput exceeds some threshold; for this there is the notion of the hopping window. (Special thanks to Frank Rosner for the windowing image animations.) In this form windows overlap, and it is possible to process data concurrently, utilizing multiple topic consumers in tandem to maximize data throughput.
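The hopping window differs from the tumbling one in that windows of size `window_size` advance by a smaller `hop`, so they overlap and an event can be counted in several windows. A plain-Python sketch (units again arbitrary):

```python
# Sketch of a hopping window: overlapping windows advancing by `hop`.
def hopping_windows(events, window_size, hop):
    windows = {}
    max_ts = max(ts for ts, _ in events)
    start = 0
    while start <= max_ts:
        # An event belongs to every window whose range covers its timestamp.
        windows[start] = [v for ts, v in events
                          if start <= ts < start + window_size]
        start += hop
    return windows

events = [(0, 1), (3, 1), (5, 1), (9, 1)]
counts = {s: len(vals)
          for s, vals in hopping_windows(events, window_size=10, hop=5).items()}
# counts == {0: 4, 5: 2}: the window starting at 5 sees a trailing view.
```

Comparing successive window counts is exactly the "running average over a trailing window" check described above.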
How Would This Work in Practice?
Imagine a series of Confluent topics, one for each major aspect of the supply chain puzzle: supply, regulatory and demand. This paradigm can evolve as your model gets more sophisticated.
To feed the graph process, it is desirable to bring data from the various disparate systems into a common format. By making use of existing Confluent connectors, which are supported for a variety of popular languages and databases, all points of input can easily be retrofitted to submit messages to common, consistent data topics for consumption and analysis, regardless of the source of input, moving everything into a single consistent transport metaphor via Kafka.
Instead of tying message input from the source directly to the supply/fulfillment application, an existing application that already accepts messages from IoT devices, say, can be minimally augmented to use a Confluent connector to send data to a general supply topic. Alternatively, a simple agent specific to the source application can be written in the language of your choice, using a connector to submit data to the supply topic.
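The agent pattern can be sketched as a small adapter hook, in plain Python with an in-memory list standing in for the real producer/connector (the message fields and `send_to_topic` helper are hypothetical):

```python
# Sketch of a source-side agent: normalize app messages and forward
# them to a common supply topic. `send_to_topic` stands in for a real
# Confluent producer or connector.
supply_topic = []

def send_to_topic(topic, message):
    topic.append(message)  # a real agent would call producer.produce(...)

def on_iot_message(raw):
    """Hook added to the existing application: normalize, then forward."""
    normalized = {"facility": raw["site"],
                  "sensor": raw["id"],
                  "event": raw["evt"]}
    send_to_topic(supply_topic, normalized)

on_iot_message({"site": "warehouse-3", "id": "s-101", "evt": "PACKAGED"})
```

The point is how little the source application changes: one hook that maps its native message shape onto the common topic format.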
The alternatives, a proprietary data protocol or a common metaphor such as a REST endpoint, require tedious management over time as new requirements are established, often juggling different languages and different means of feeding the graph database. By decoupling the points of input from those mechanisms, the data flow moves into a common, singular format, with inherent cluster and scalability support, that makes the various and sundry inputs and sources feeding a complex graph system easy to manage. This allows the focus to be put on solving the graph problem instead of on how the data is fed into it.
Additionally, the real-time aspect of Kafka can now be leveraged to process information as soon as it becomes available within the supply chain topic.
Kafka also affords flexibility through the ability to add new data sources, even in differing formats, in an ad-hoc fashion. Kafka, along with the additional management topology provided by Confluent services, allows for dynamic schema management: schemas can evolve without compatibility conflicts, at a pace dictated by business demand.
Say you have an existing Supply Topic (A) and another type of input source becomes available, feeding a new Supply Topic (B) that employs a different schema format. By making use of the Kafka Streams API, the two topics can be merged into a single common supply topic using the schema defined by the original Supply Topic (A).
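The merge amounts to mapping Topic B's records into Topic A's schema and combining the streams. A plain-Python sketch (field names are hypothetical; in Kafka Streams this would be a `mapValues` followed by a `merge`):

```python
# Sketch of merging two supply topics into one common schema.
topic_a = [{"part_id": "A-100", "qty": 4}]        # original schema (A)
topic_b = [{"partNumber": "B-200", "count": 7}]   # new source's schema (B)

def to_schema_a(record_b):
    """Map a Topic B record into Topic A's field names."""
    return {"part_id": record_b["partNumber"], "qty": record_b["count"]}

merged_supply = topic_a + [to_schema_a(r) for r in topic_b]
```

Downstream consumers of the common topic see one schema, regardless of which source produced a given record.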
The Streams API makes it easy to alter and transform your data pipeline into an existing singular system process flow, without changing any behavior in the target application and without having to juggle a variety of language implementations. By leveraging the distributed schema management provided by Confluent, it is also possible to change the underlying schema that governs the common supply topic while the system is running, allowing backwards compatibility with existing data flows and no impact to the overall system.
Confluent also provides the means to generate new data streams via KSQL: topics can be queried and data moved into other topics, producing real-time data streams for other processing systems to consume, all without writing custom proprietary code.
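As a hypothetical example of the idea (stream and field names are invented for illustration), a single declarative statement can derive a continuously updated stream of shipped events from the general supply stream:

```sql
-- Hypothetical KSQL/ksqlDB statement: a derived real-time stream,
-- with no custom application code.
CREATE STREAM shipped_events AS
  SELECT part_id, facility, event_time
  FROM supply_events
  WHERE status = 'SHIPPED'
  EMIT CHANGES;
```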
Emphasis is now placed on designing how you want data to flow in your system, without the need to figure out how the data is to be moved from one location to another. Kafka along with available Confluent services does the heavy lifting on your behalf.
By using Kafka as the common data management topology, any type of system can be integrated into the supply/fulfillment information flow to feed a graph analysis implementation in a common, stable and durable format. The Confluent ecosystem provides the scalability and management infrastructure for you, without your having to write it yourself, and it is a proven topology for moving data between systems consistently, reliably and in real time.
Stay tuned for subsequent entries in the blog series as we piece this together.