Sometimes great (corporate) minds think alike. When we saw the blog from Dr. Yu Xu, the CEO of our partner TigerGraph, we were struck by how closely TigerGraph’s vision of the future aligns with ours. In his blog, he outlines five forward-looking initiatives for data analytics in 2020, and for each of them we have assets or offerings already available or well underway.
Automation of Data Analysis
It’s no longer enough to have a plethora of data scientists rendering interesting, and sometimes competing, answers to your business questions in various Python notebooks. This science needs to make its way into your corporate compute environment in a way that’s predictable, repeatable, and auditable. You wouldn’t have one developer compile your application on her laptop in order to redeploy your corporate application; that’s what DevOps pipelines and configuration management are for. Data science needs the same software engineering principles applied. We call that “Data Products.”
Enabling data products across an organization takes both coordinated organizational process and additional technological tooling. Companies must integrate their data science teams with their engineering and product teams in a way that encourages the alignment of user needs with analytics tools. Data science and engineering teams need additional tools and techniques for releasing analytics processes into production at scale. We see these tools falling into four major categories:
- Machine learning CI/CD pipelines, including data versioning and taskgraph-based computing environments (see our webinar for more on this).
- Data- and model-monitoring feeds and/or dashboards that verify the accuracy and applicability of data science outputs for users.
- Model interpretability tools to mitigate bias in data science outputs and expose model reasoning.
- Active learning systems that automatically retrain machine learning systems based on changing characteristics of users or business operations.
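The first category above, a taskgraph-based pipeline, can be sketched in a few lines: each step declares its upstream dependencies, and a runner executes steps in dependency order. Everything here (the step names, the toy state dictionary) is hypothetical and illustrative, not any specific CI/CD product’s API.

```python
# Minimal taskgraph sketch: tasks declare dependencies, the runner
# executes them in topological (dependency) order.
from graphlib import TopologicalSorter

def ingest():      return {"rows": 100}
def validate(d):   return {**d, "valid": True}
def train(d):      return {**d, "model": "v1"}
def evaluate(d):   return {**d, "accuracy": 0.9}

# Task graph: task -> set of tasks it depends on.
graph = {
    "ingest": set(),
    "validate": {"ingest"},
    "train": {"validate"},
    "evaluate": {"train"},
}
funcs = {"ingest": ingest, "validate": validate,
         "train": train, "evaluate": evaluate}

def run(graph, funcs):
    state = {}
    for task in TopologicalSorter(graph).static_order():
        # First task seeds the state; later tasks transform it.
        state = funcs[task](state) if state else funcs[task]()
    return state
```

Because the dependency structure is explicit, the same graph definition can drive retries, caching, and auditing — the properties that make a data product predictable and repeatable.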
Better Data Lineage
Tracing the origin of your corporate data is mandatory as compliance and regulatory requirements continue to take hold. Our project history has led us to a security and audit solution that we call “Entitlements.”
Considered in a broader context, even if you do not need data provenance for compliance reasons, you will oftentimes need to know where a particular piece of data came from and when it was last updated. Business decisions can be made off of information like this: how much to discount or elevate a piece of information based on how much you trust its origin, as well as the elapsed time since it was last updated or confirmed. For example, consider the case in which an alternative-assets portfolio manager is evaluating two instruments for a portfolio of CLOs. If she is confronted with two conflicting CLO risk scores on the instruments, she’ll be forced to guess at her strategy. If, however, she knows one risk score came from Moody’s and one from Morningstar, she knows from her years of domain experience that Morningstar tends to be more bullish during a volatile Euro cycle, and that in this instance she should use Moody’s score to choose the instrument to buy.
There are many ways that “Data Lineage” can be implemented such that downstream line-of-business decisions can be made. An interesting new development in the world of data lineage is to use versioned data repositories like Pachyderm or Quilt to record not only lineage but the current state as a function of data changes. These tools enable downstream systems to make decisions about which records to use at which times. For instance, a data store ingesting from multiple autonomous car manufacturers can version each of the nightly batches with a separate commit from each manufacturer. A downstream application containing a machine learning model for deciding how fast to accelerate from stop signs can choose to train off of Volvo’s data which is more conservative, or off of Tesla’s data which is more aggressive.
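The version-aware selection described above can be illustrated with a toy in-memory store. Real tools like Pachyderm and Quilt expose commits and packages through their own APIs; the class below is only a stand-in for the idea, and the manufacturer names and fields are hypothetical.

```python
# Toy versioned data store: each source commits batches over time, and a
# downstream consumer picks data by provenance and commit.
from collections import defaultdict

class VersionedStore:
    def __init__(self):
        # source -> list of (commit_id, batch), in commit order
        self._commits = defaultdict(list)

    def commit(self, source, commit_id, batch):
        self._commits[source].append((commit_id, batch))

    def latest(self, source):
        """Return the most recent (commit_id, batch) for a source."""
        return self._commits[source][-1]

store = VersionedStore()
store.commit("volvo", "2020-01-06", {"max_accel": 1.8})  # conservative
store.commit("tesla", "2020-01-06", {"max_accel": 3.2})  # aggressive

# A downstream training job chooses whose data to train on:
commit_id, batch = store.latest("volvo")
```

The point is that lineage (the source and commit) travels with the data, so the downstream model’s choice between conservative and aggressive training data is explicit and auditable rather than accidental.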
Natural Language Queries

Since our original post on this topic back in 2017, users’ expectations of the art of the possible have exploded with the adoption of Amazon Alexa, the Google Assistant, and Apple’s Siri. Why are we not seeing these advancements in our corporate applications? We will be soon.
To be sure, NLP is part of the answer. After all, the systems you build have to understand the questions that are posed. But that is not enough. These systems have to provide non-linear paths through the data, giving expert analysts the flexibility to pursue whatever avenue of inquiry their experience and intuition require. Linear, wizard-driven query composers will not suffice. Nor will a page full of dashboard components.
Google reports that 16-20% of its searches are new every single year and that over a third are four words or more. If Google hasn’t seen a search before, how in the world could you expect your corporate application developer to anticipate it with a dashboard or search panel? It must be open-ended. We’re entering “Turing test” territory.
Data Analytics on IoT
Demystifying computation on IoT has been a focus of ours for many years now - in no small part because there’s been no simple best practice or “right answer” for concurrently modeling time series and relationships in a single data store.
Data and the digital twin demands a hybrid approach in most cases and does not fit neatly in a tidy marketing message. The answer to this problem continues to be a mixture of well chosen technologies and a healthy dose of intentional data model design based on the specifics of the customer problem domain. Beware of “one-size-fits-all” solutions here - the frequency of the data coming in from remote devices can significantly affect the optimal way to model this problem.
By combining best-of-breed tools that perform computation on the IoT streams as they come in with computation on the consolidated data once it’s been ingested, you get the best of both worlds: 1) communicating only the relevant state-based changes on the streams and 2) correlating the events in relation to each other.
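The two halves of that hybrid approach can be sketched as plain functions, with made-up sensor readings: at the edge, forward only meaningful state changes from a stream; centrally, correlate events from different devices that land close together in time. The device names, thresholds, and windows below are all hypothetical.

```python
# (1) Edge-side: emit a reading only when it moves more than `threshold`
# from the last emitted value, so only relevant state changes are sent.
def state_changes(readings, threshold=0.5):
    last = None
    for t, value in readings:
        if last is None or abs(value - last) > threshold:
            yield t, value
            last = value

# (2) Central-side: pair events from two devices whose timestamps fall
# within `window` of each other, i.e. correlate events in relation to
# each other on the consolidated data.
def correlate(events_a, events_b, window=1):
    return [(a, b) for a in events_a for b in events_b
            if abs(a[0] - b[0]) <= window]

pump  = [(0, 20.0), (1, 20.1), (2, 25.0), (3, 25.1)]  # (time, temperature)
valve = [(0, 1.0), (2, 0.0), (3, 0.0)]                # (time, open/closed)

pump_events  = list(state_changes(pump))   # [(0, 20.0), (2, 25.0)]
valve_events = list(state_changes(valve))  # [(0, 1.0), (2, 0.0)]
pairs = correlate(pump_events, valve_events)
```

Note how the noisy readings at t=1 and t=3 never leave the edge, while the central correlation still sees that the temperature jump and the valve closing happened together.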
Graph Analytics

This may be the most obvious and prolific area of exploration for most corporate customers and consequently where we’ve focused. Our current areas of interest are in fraud detection, supply chain planning, and customer journey mapping (aka Know Your Customer), to name just a few.
Thankfully, the graph database vendor community continues to provide us many tools of the trade so that we can focus on what we do best - solve complex problems for our customers. Some of our favorite sources of algorithmic building blocks can be found at:
- TigerGraph’s GSQL algorithm library
- TinkerPop’s recipes, which can be leveraged by any of the Gremlin-based graph engines such as JanusGraph, Amazon Neptune, Microsoft Cosmos DB, or DataStax Enterprise Graph
- Neo4j’s Graph Algorithm library
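The libraries above provide these building blocks off the shelf, tuned for their respective engines. To give a flavor of what such a building block does, here is a minimal, self-contained PageRank over a plain adjacency list — a toy illustration, not code from any of those libraries, on a made-up payments graph.

```python
def pagerank(adj, damping=0.85, iters=50):
    """Minimal PageRank on an adjacency list {node: [out-neighbors]}."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if not outs:
                # Dangling node: spread its rank evenly over all nodes.
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                share = damping * rank[v] / len(outs)
                for u in outs:
                    new[u] += share
        rank = new
    return rank

# Tiny fraud-style graph: accounts sending payments to each other.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
# "c" receives the most incoming edges, so it ends up ranked highest.
```

In a fraud-detection setting, a score like this is just one feature: the vendor libraries combine many such algorithms (centrality, community detection, similarity) at a scale a hand-rolled loop cannot.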
Graph convolutional networks, which use characteristics of the graph in conjunction with traditional machine learning techniques, have rendered very attractive results in our trials and are an area of great promise and further investigation.
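At its core, a graph convolution mixes each node’s features with those of its neighbors before a learned transform. The sketch below shows a single layer in NumPy, following the widely used propagation rule from Kipf & Welling; the graph, features, and weights are random toy data, not a trained model.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution layer: add self-loops, symmetrically
    normalize the adjacency, average neighbor features, then apply a
    linear transform and ReLU."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_norm @ X @ W, 0)       # ReLU activation

# Toy 4-node path graph, 2 input features per node, 3 output features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.rand(4, 2)   # node features
W = np.random.rand(2, 3)   # layer weights (would be learned in practice)
H = gcn_layer(A, X, W)     # node embeddings, shape (4, 3)
```

Stacking a few such layers lets each node’s embedding reflect its multi-hop neighborhood, which is exactly the graph-plus-ML combination discussed above.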
Lastly, with recent hardware advances like FPGAs, what has historically been theoretically possible to compute is now becoming practical without breaking the bank. Advancement in science requires vetting your ideas and sometimes being wrong. Experimentation can now be done without waiting weeks to see the results.
This area is exploding with other use cases drawing us to cyber security, entity resolution and natural language processing. Stay tuned for more on these and other domain topics.