For decades now, role-based access control (RBAC) and access control lists (ACLs) have been managed using the same technology stacks. With the advent of graph databases and streaming services, we can now model arbitrarily deep organizations with different departmental security systems, all maintained in real time.
“What is the new hire even doing?”
Imagine a scenario where you are just starting off at a company. You receive your new credentials for internal operations and are ready to get to work, only to find out that you won’t have access to the VPN service until next Monday, the data lake until Wednesday, and expense processing until Friday. You are at a complete standstill until you have access to the required resources. We have all been there, and it is an awful place to be: it has now taken you two weeks to really get going at your new job. In that time, you probably went so far down the YouTube recommended section that Google knows you better than you know yourself.
Some reasons as to why it would take days to get you access to necessary resources:
- Each resource’s manager must manually add you to the access control list, and one of those lists is managed in Asia, 12 time zones away, where you can’t find the person who needs to approve it
- Automated processes for granting access don’t operate in real-time
In this blog post I’ll propose a new solution to these issues.
In an enterprise environment, users are typically managed in a service like Microsoft’s Active Directory or, more generally, behind some Lightweight Directory Access Protocol (LDAP) interface. Imagine a simple piece of LDAP software sitting on top of a relational database modeled like the following:
In this schema a party is akin to a user, and a role is akin to a group or identity. A party can have multiple roles. Roles is typically a pre-populated table like “Developer”, “Manager”, etc. System administrators are typically the ones within a business who add a new employee and the relevant roles to that new party.
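To make the relational side concrete, here is a minimal sketch of such a schema using Python's built-in `sqlite3` module. The table names (`party`, `role`, `party_role`), the `id` columns, and the sample role assignments are assumptions for illustration; only the `full_name` field and the example role names appear in this post, and the real LDAP data store may be laid out differently.

```python
import sqlite3

# Hypothetical table and column names, chosen for illustration only.
SCHEMA = """
CREATE TABLE party (
    id        INTEGER PRIMARY KEY,
    full_name TEXT NOT NULL
);
CREATE TABLE role (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Join table: one row per (party, role) assignment.
CREATE TABLE party_role (
    party_id INTEGER NOT NULL REFERENCES party(id),
    role_id  INTEGER NOT NULL REFERENCES role(id),
    PRIMARY KEY (party_id, role_id)
);
"""


def build_directory() -> sqlite3.Connection:
    """Create an in-memory database with the schema and a little sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Roles are pre-populated, as described above.
    conn.executemany(
        "INSERT INTO role (id, name) VALUES (?, ?)",
        [(1, "Developer"), (2, "Manager")],
    )
    # A system administrator adds a party and its roles (sample assignment).
    conn.execute(
        "INSERT INTO party (id, full_name) VALUES (?, ?)",
        (1, "Fran Farmington"),
    )
    conn.executemany(
        "INSERT INTO party_role (party_id, role_id) VALUES (?, ?)",
        [(1, 1), (1, 2)],
    )
    conn.commit()
    return conn


def roles_for(conn: sqlite3.Connection, full_name: str) -> list[str]:
    """Return the role names attached to a party, sorted alphabetically."""
    rows = conn.execute(
        """
        SELECT r.name
        FROM party p
        JOIN party_role pr ON pr.party_id = p.id
        JOIN role r ON r.id = pr.role_id
        WHERE p.full_name = ?
        ORDER BY r.name
        """,
        (full_name,),
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    print(roles_for(build_directory(), "Fran Farmington"))
```

Note that answering "which roles does this party have?" already takes two joins here; adding resources and nested groups would add more, which is part of the motivation for moving entitlement queries to a graph.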
Now I want to also introduce Master Data Management (MDM). This is an idea that an organization has a master copy of all their data, which can be manipulated to be application-specific. Enter my colleague Rick Paiste’s blog post entitled “Next Generation Authorization and Entitlements Services”. Within his blog post, he calls out the specific benefits of using a graph database over various alternatives for MDM. He proposes that using a graph database can simplify the complexity of resource authorization queries and decrease execution times quite drastically. I recommend reading it to get the entire background for his thoughts. Within it he proposes the following graph schema for entitlements within an MDM approach.
Let’s say we have an organization using our simple LDAP interface with the relational database modeled above as well as the entitlements graph above for their master data model. For the purpose of this blog, identity and role are interchangeable. The largest problem master data models face is staying in sync with their sources. With real-time streaming software, that issue is a thing of the past.
Software that operates in real time has been on the rise for a while now. Enter Apache Kafka. Kafka was developed internally at LinkedIn, open-sourced in 2011, and donated to the Apache Software Foundation. Kafka is a stream-processing platform for handling data in real time. To get a summary of Kafka, check out my lightning talk Apache Kafka 101. One company that has built an enterprise product on top of Kafka is Confluent. It was founded by the same people who wrote Kafka. What you get with Confluent over Apache Kafka includes a web-based tool called Control Center for managing and viewing details of your Kafka cluster, security including LDAP-based authorization for access to certain Kafka topics, a schema registry, enterprise support, and a batteries-included KSQL server.
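The part of Apache Kafka that helps us achieve real-time access control is Kafka Connect. Kafka Connect allows us to speedily develop pipelines for real-time data. A pipeline is built from source connectors, which read data from an external system into Kafka topics, and sink connectors, which write data from topics out to another system.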
We can use Confluent with Kafka Connect to easily add parties and roles into our graph in real time. Confluent provides a first-party JDBC connector, which can be read about here. The JDBC connector watches the database tables we specify and passes new rows of data through the Kafka pipeline. The relational database that serves as the data store for our LDAP interface will be the data source, while a TigerGraph instance will be the data sink. TigerGraph provides a Kafka Loader to pull data from Kafka into the graph. I recommend reading TigerGraph’s documentation to get up to speed on developing for it.
The repository with the example implementation can be found here.
Let’s take Confluent and Apache Kafka with the aforementioned database schemas. For a relational database, I’ll pick MariaDB and for the graph database, I’ll pick an up-and-coming database called TigerGraph. Both of these will be connected through Kafka. On Confluent’s website, they have a hub of Kafka connectors which can attach to a wide range of databases, storage layers, etc. For this implementation, I will keep the relational database connector rather generic and use the Kafka Connect JDBC connector. In theory, any JDBC-compliant database could work here, maybe SQL Server or PostgreSQL. For the graph database side, I can’t be as generic, so I will be using TigerGraph’s built-in Kafka Loader.
The Kafka pipeline will look something like this:
We have seen the models for our LDAP data store and the graph database. The Kafka flow of data will look like the following:
With the setup from the previous two images, what we end up with is three JDBC source connectors, one watching each of our three tables and passing records to its respective topic. On the flip side, we will need to set up a Kafka loading job in GSQL (TigerGraph’s query language), which will pull data off of all three topics.
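As a rough sketch of what the three source connectors could look like, the Python snippet below builds one JDBC source connector configuration per table and prints them as JSON, the shape Kafka Connect's REST API accepts. The property keys are standard JDBC source connector settings, but every value is an assumption for illustration: the table names, the `ldap-` topic prefix, the connection URL and credentials, and the `id` and `updated_at` columns used for change detection. The example repository may configure things differently.

```python
import json

# Hypothetical table names for the party, role, and party-role tables.
TABLES = ["party", "role", "party_role"]


def jdbc_source_config(table: str) -> dict:
    """Build a JDBC source connector definition for a single table."""
    return {
        "name": f"ldap-{table}-source",
        "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            # Assumed MariaDB location and credentials.
            "connection.url": "jdbc:mariadb://localhost:3306/ldap",
            "connection.user": "connect",
            "connection.password": "change-me",
            # Watch only this table.
            "table.whitelist": table,
            # Detect new rows via an incrementing id and changed rows via a
            # timestamp column; both column names are assumptions.
            "mode": "timestamp+incrementing",
            "incrementing.column.name": "id",
            "timestamp.column.name": "updated_at",
            # Records land on the topic "<topic.prefix><table name>".
            "topic.prefix": "ldap-",
            "poll.interval.ms": "1000",
            "tasks.max": "1",
        },
    }


CONNECTORS = [jdbc_source_config(table) for table in TABLES]

if __name__ == "__main__":
    print(json.dumps(CONNECTORS, indent=2))
```

With this naming, the topics would be `ldap-party`, `ldap-role`, and `ldap-party_role`, and the GSQL loading job would subscribe to those three. The `timestamp+incrementing` mode is chosen here because an incrementing id alone only catches inserts, not later updates to a row.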
Let’s walk through an example.
We just set up our LDAP software and want to include a couple of roles. Our LDAP software inserts our new roles and the roles table looks like so:
Then we add a few parties into the LDAP system.
Those parties then get roles associated with them.
As these rows are added, the JDBC connectors pick them up and they fly across the Kafka cluster as records. The TigerGraph Kafka loader, which is watching the topics, then does its job: mapping the data in the Kafka topics into the corresponding graph objects.
"full_name": "Fran Farmington",
All this culminates into the following graph. The schema is a reduced version of the schema above. I have gone ahead and added three resources to the graph including a VPN, time sheets, and financial reports.
You can see that all employees can access the VPN, managers can access time sheets, and the president and vice-president can access the organization’s financials.
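The access rule described here is a two-hop traversal: party to role, then role to resource. The sketch below models it with plain Python dictionaries rather than TigerGraph, so it illustrates the idea and is not the GSQL from the example repository. The role names, the second party, and the specific role assignments are hypothetical sample data.

```python
from collections import defaultdict


class EntitlementGraph:
    """Tiny in-memory stand-in for the party -> role -> resource graph."""

    def __init__(self) -> None:
        self.party_roles: dict[str, set[str]] = defaultdict(set)
        self.role_resources: dict[str, set[str]] = defaultdict(set)

    def assign_role(self, party: str, role: str) -> None:
        """Add a party-to-role edge."""
        self.party_roles[party].add(role)

    def grant(self, role: str, resource: str) -> None:
        """Add a role-to-resource edge."""
        self.role_resources[role].add(resource)

    def can_access(self, party: str, resource: str) -> bool:
        """True if any of the party's roles reaches the resource."""
        roles = self.party_roles.get(party, set())
        return any(resource in self.role_resources.get(r, set()) for r in roles)


def build_example() -> EntitlementGraph:
    graph = EntitlementGraph()
    # Resource grants mirroring the rules described above.
    graph.grant("Employee", "VPN")
    graph.grant("Manager", "Time Sheets")
    graph.grant("President", "Financial Reports")
    graph.grant("Vice President", "Financial Reports")
    # Hypothetical role assignments.
    graph.assign_role("Fran Farmington", "Employee")
    graph.assign_role("Fran Farmington", "Manager")
    graph.assign_role("Pat Example", "Employee")
    return graph


if __name__ == "__main__":
    graph = build_example()
    for party in ("Fran Farmington", "Pat Example"):
        for resource in ("VPN", "Time Sheets", "Financial Reports"):
            print(party, resource, graph.can_access(party, resource))
```

Because access is derived from edges, adding a single party-to-role edge is enough to grant a new hire everything that role can reach.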
For those of you asking if this implementation handles updates, it does. This is thanks to the implementation of the JDBC Connector and the way TigerGraph handles upserts. TigerGraph will check whether a vertex with the given primary ID already exists and update it, or insert a new vertex if it does not. For edges, TigerGraph will look for an existing edge between the two vertices and update it, or insert a new edge if none exists.
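The upsert behavior described above can be mimicked in a few lines of Python. This is only a sketch of the semantics (update if the key exists, insert otherwise), not TigerGraph's actual implementation; the class, method names, and the `has_role` edge type are hypothetical.

```python
class UpsertGraph:
    """In-memory graph whose writes are always upserts."""

    def __init__(self) -> None:
        # (vertex_type, primary_id) -> attribute dict
        self.vertices: dict[tuple, dict] = {}
        # (edge_type, source_key, target_key) -> attribute dict
        self.edges: dict[tuple, dict] = {}

    def upsert_vertex(self, vertex_type: str, primary_id, **attrs) -> tuple:
        """Update the vertex with this primary ID, or insert it if missing."""
        key = (vertex_type, primary_id)
        self.vertices.setdefault(key, {}).update(attrs)
        return key

    def upsert_edge(self, edge_type: str, source: tuple, target: tuple, **attrs) -> tuple:
        """Update the edge between two vertices, or insert it if missing."""
        key = (edge_type, source, target)
        self.edges.setdefault(key, {}).update(attrs)
        return key


if __name__ == "__main__":
    graph = UpsertGraph()
    fran = graph.upsert_vertex("party", 1, full_name="Fran Farmington")
    manager = graph.upsert_vertex("role", 2, name="Manager")
    graph.upsert_edge("has_role", fran, manager)

    # A changed row for the same primary ID arrives later: the existing
    # vertex is updated in place rather than duplicated.
    graph.upsert_vertex("party", 1, full_name="Fran F. Farmington")
    print(len(graph.vertices), graph.vertices[fran])
```

The practical consequence is that replaying the same record, or receiving a modified version of a row, leaves the graph with one vertex or edge per key instead of duplicates.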
Changing out TigerGraph for some other graph technology shouldn’t be too difficult. Once you find the right Kafka connector all you have to do is map data to a schema. Currently there is a fairly standard graph query language called Gremlin. A few graph databases implement a Gremlin interface for executing graph queries and traversals. Be on the lookout for a Gremlin Kafka Connector if you are looking for a generic solution similar to the JDBC Connector.
We live in a world where resource authorization doesn’t have to take days or even hours to happen. Resource authorization should happen at the click of a button. As soon as a new employee, user, or party is added to an LDAP provider, using the Confluent/Apache Kafka and TigerGraph solution posted above or a similar approach, the master data model will be updated automatically, so that downstream applications can make use of new information. Entitlements aren’t easy, but hopefully this blog can help you build a solution for your organization.