Implementing Data Products
In the mid-2010s, Twitter promised that when we hired thirty data scientists, money would start shooting out of our MacBook Pros so fast we wouldn’t be able to buy enough in-office 24-karat gold ping pong tables to keep up with the interest. Well, now it’s 2020; how’s that working out for you? Many industries are currently falling off the back of the data science hype cycle because organizations aren’t seeing the promised ROI. Here’s what you can do about it.
Data Projects vs. Data Products
The misconception that data science equals increased revenue stems from a misuse of the data science workflow and the technologies involved. Our extensive work across many industries and applications shows that isolation is the most common killer of data science value. It takes the integration of data teams with dev and product teams to build scalable, reproducible data science insight. Iteratively producing artifacts from data science R&D that power data-driven products takes both coordinated organizational process and technology shared across teams. We call these artifacts Data Products.
In some limited cases, one data scientist performing a one-off analysis on her laptop can provide a revolutionary insight. A single data scientist, or a small team of data scientists, working to solve a single problem is a traditional data project. A data project often concludes with a report or a Jupyter notebook containing a static visualization. That format keeps the insight from propagating beyond the data science team that created it, and it often requires the same team for updates and reproduction.
Figure 1: data products depend on both code and data, requiring additional organizational process and technological tooling for successful deployment and maintenance.
However, in general, magical discoveries and the delivery of continuous value to users are the result of extended operational use of analytics tools built by mixed teams over many delivery iterations. Sound familiar? Of course! I’m describing software development here. The difference between software product development and data product development is that software development depends only on software artifacts, while data product development depends on both software artifacts and data artifacts. And data characteristics change over time as the operational conditions of whatever is being measured change. As a result, data product development must include some non-zero amount of iterative R&D. Including the R in R&D is non-trivial and requires more process and tooling than traditional software engineering does.
Implementation Requires Technology and Process
I’ll attempt to give as much detail here as possible, but this is a vast (and dynamic) subject, and we have tons of opinions. Check back on the blog in the coming weeks for pieces by other authors with detailed descriptions of our best practices in implementation through a product lens and tech-stack specs, both based on case studies from our work in the space.
As mentioned above, software products rely on code, but data products rely on both code and data. This necessitates additional tooling in the stack to build and maintain a data product. Here’s an example: building a core product feature, like a filtering mechanism on a web app showing a table of the weather conditions of all the cities in the world, depends on a branch in version control where the feature lives until deployment time. By contrast, a data product feature, like a forecasting mechanism to predict where a heatwave will propagate tomorrow, depends not only on a feature branch for the code but also on a versioned snapshot of the data on which the model was trained. If we change the training data, we change the artifacts that drive the forecast, just as we would by changing the source code.
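To make the code-plus-data dependency concrete, here is a minimal sketch of pairing a code reference with a content hash of the training snapshot. The names (`snapshot_digest`, `FeatureVersion`, the branch name, and the sample rows) are hypothetical and only illustrate the idea; a real setup would likely lean on a dedicated data versioning tool rather than hashing rows by hand.

```python
import hashlib
from dataclasses import dataclass


def snapshot_digest(rows: list[str]) -> str:
    """Return a content hash of a training data snapshot (order-sensitive)."""
    h = hashlib.sha256()
    for row in rows:
        h.update(row.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()


@dataclass(frozen=True)
class FeatureVersion:
    code_ref: str     # hypothetical branch or commit identifier
    data_digest: str  # hash of the snapshot the model was trained on


def feature_version(code_ref: str, rows: list[str]) -> FeatureVersion:
    return FeatureVersion(code_ref=code_ref, data_digest=snapshot_digest(rows))


# Made-up sample rows: same code reference, slightly different training data.
v1 = feature_version("heatwave-forecast", ["paris,31.0", "rome,35.5"])
v2 = feature_version("heatwave-forecast", ["paris,31.0", "rome,36.0"])

# The code did not change, but the feature version did.
print(v1 == v2)
```

Treating the pair as the unit of versioning means a change to either half shows up as a new version of the feature.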
Data versioning is a newish phenomenon, and readers may be interested to learn about our recommended best practices using tools like Quit, Pachyderm, or Molecula to create fully reproducible data pipelines that drive data products. This exercise is not an academic one. Imagine a data product deployment goes awry or causes a subset of users to experience inaccurate results. We’d want to roll that deployment back to a previous version, and in some cases, this requires retraining machine learning models. We can’t reliably perform that retraining without the original data the model was trained on in the working deployment version; ergo, we need data versioning. I know what you’re thinking: just serialize the trained model and version it alongside the deployment. I’ll leave it to the reader to join us for our many data products webinars, lightning talks, and (upcoming) blogs to understand the conditions under which that is appropriate or inappropriate.
A related idea is that of micro versioning. Depending on the scope of the workloads serviced by your data products, breaking data pipelines up into individual processing steps and treating each step as a first-class citizen enables fully versioned components for CI/CD. This is especially helpful for machine learning pipelines, as you’re then able to roll out new pieces of processing as data or requirements change. For example, let’s go back to our heatwave movement model. If we want to change the input source of our humidity feature, we’ll have to update our ETL pipeline and possibly a small piece of preprocessing code, say, because the new source reports in millibars instead of pascals.
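The rollback scenario can be sketched as a release history in which every deployment records both a code reference and a data snapshot identifier. Everything here (`Deployment`, `DeploymentHistory`, and the identifier strings) is a hypothetical illustration, not the API of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Deployment:
    version: str
    code_ref: str       # hypothetical commit identifier
    data_snapshot: str  # hypothetical data snapshot identifier


@dataclass
class DeploymentHistory:
    releases: list[Deployment] = field(default_factory=list)

    def deploy(self, release: Deployment) -> None:
        self.releases.append(release)

    def rollback(self) -> Deployment:
        """Drop the latest release and return the one to restore."""
        if len(self.releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self.releases.pop()
        return self.releases[-1]


history = DeploymentHistory()
history.deploy(Deployment("v1", code_ref="a1b2c3", data_snapshot="weather-2019-11"))
history.deploy(Deployment("v2", code_ref="d4e5f6", data_snapshot="weather-2019-12"))

# v2 misbehaves, so we restore v1. Retraining for v1 needs both identifiers.
target = history.rollback()
print(target.version, target.code_ref, target.data_snapshot)
```

If the release only recorded `code_ref`, the retraining step would have no way to know which data to train on.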
Figure 2: simple computational task graph. Nodes represent either results or processes, which can be versioned independently and deployed on varying compute architectures.
When we build a computational task graph for the entire data pipeline, we can update just the steps related to the curl statement that grabs the data and the unit conversion processing step. Rolling those changes out is simply less dangerous than rolling out an entirely new preprocessing script, a common practice in productized machine learning workflows. Beautifully, this also allows each processing step to run on the compute that suits it, rather than unnecessarily running the entire pipeline on the same compute. Again, by way of example, let’s say your unit conversion processing step can run single-threaded, but your humidity interpolation processing step needs to run distributed on a YARN cluster with Dask. That’s no problem. The framework of computational task graphs gives you the flexibility to do both things, and to update one between versions without updating the other.
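Here is a minimal sketch of that idea using only the Python standard library. The step names, version strings, and humidity values are made up, and the sketch assumes the new source reports in millibars (1 mbar = 100 Pa). A production pipeline would use a real task graph framework, with each step dispatched to the compute it needs.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Callable


@dataclass(frozen=True)
class Step:
    version: str
    func: Callable[..., list[float]]
    depends_on: tuple[str, ...] = ()


def fetch_pascals() -> list[float]:
    return [1000.0, 1250.0]  # stand-in for the original source (pascals)


def fetch_millibars() -> list[float]:
    return [10.0, 12.5]  # stand-in for the new source (assumed millibars)


def passthrough(values: list[float]) -> list[float]:
    return list(values)


def millibars_to_pascals(values: list[float]) -> list[float]:
    return [v * 100.0 for v in values]


def interpolate(values: list[float]) -> list[float]:
    """Insert midpoints; placeholder for a step that might run on a cluster."""
    if not values:
        return []
    out: list[float] = []
    for a, b in zip(values, values[1:]):
        out.extend([a, (a + b) / 2])
    out.append(values[-1])
    return out


def run(steps: dict[str, Step]) -> dict[str, list[float]]:
    """Execute steps in dependency order, feeding each its inputs."""
    graph = {name: set(step.depends_on) for name, step in steps.items()}
    results: dict[str, list[float]] = {}
    for name in TopologicalSorter(graph).static_order():
        step = steps[name]
        results[name] = step.func(*(results[d] for d in step.depends_on))
    return results


pipeline = {
    "fetch": Step("1.0.0", fetch_pascals),
    "convert": Step("1.0.0", passthrough, ("fetch",)),
    "interpolate": Step("1.0.0", interpolate, ("convert",)),
}
before = run(pipeline)

# New humidity source: bump only the fetch and convert steps.
updated = dict(pipeline)
updated["fetch"] = Step("1.1.0", fetch_millibars)
updated["convert"] = Step("1.1.0", millibars_to_pascals, ("fetch",))
after = run(updated)

print(before["interpolate"] == after["interpolate"])
print(updated["interpolate"].version)
```

The interpolation step keeps its version and its definition untouched, which is the point: only the two steps that depend on the source format are redeployed.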
More on this in an upcoming webinar and blog post (both in January 2020).
Figure 3: many open source or proprietary tools can be used to build a successful machine learning CI/CD architecture, and it’s key to modularize the components which make up a data pipeline for reproducible and revocable deployments.
One of the most common pain points we hear across industries is that organizations don’t know how to measure where they are in the data science lifecycle. For instance, imagine that a company that matches retail customers to home services (such as cleaning, lawn care, and plumbing) is building a services recommender system for logged-in users. They currently have a product in the form of a web app, in which a customer can select and schedule available services, but they want to decrease friction by anticipating what users will need next. This company tasks its data science team with building, but not integrating, the recommender system. The recommender system feature integration is on the product roadmap for Q3 of 2020 (it’s currently Q4 2019).
This company, and many others like it, cannot tell how close they are to staging integration for testing because they don’t know how much longer it will take the data science team to “finish” the research and development of the technology required to drive the product feature. This is a failure of organizational process, not technology. Solving it requires breaking the data science R&D effort into discrete buckets whose status can be communicated to the product team and the engineering team. The company in question is now integrating these buckets into their data science task management flow to gain visibility over this and other R&D efforts. In their case, the buckets include data extraction and conditioning, baseline model benchmarking, alternative model comparison, model and architecture selection, data scale-out validation, and dev environment integration.
R&D effort bucketization enables data science process visibility for the product and dev teams, but it doesn’t change the way R&D is integrated into the core product. For that, additional process engineering is needed. Specifically, it requires allocating cross-functional team members who sit between the data science and engineering teams to move product features from the data science dev environment into the engineering staging environment. These employees combine knowledge of the intricacies of data science with a deep understanding of deployment pipelines. They work full time to meticulously migrate R&D hack jobs into a hardened, extensible format, ready for deployment. Their skills include the construction of machine learning CI/CD pipelines using modern ML tools such as TensorFlow Serving and deployment tools like Kubernetes and GKE.
Figure 4: marrying data science to product thinking is accomplished through the software engineering pipeline. Define bucketized data science R&D work stages to apply agile methodologies to data product releases.
Possibly the most critical aspect of data science organizational integration is the need for a data science-to-product “translator.” In our example home services company from above, this person is the manager of the data engineering team that sits between data science and software engineering. She spends a non-trivial, but essential, amount of time translating product requirements to the data science team, and translating data science requests and progress back to the product team. We almost always see organizations’ product and data science teams speaking different languages, with each having only a partial understanding of the other’s workflows. Because of this, we always recommend selecting one person per data science team to act as a “data products translator.”
There you have a summary of our DNA, folks. Data products guidance and implementation are our craft, and we’d be happy to brainstorm about them with you. Please poke around the website for more in-depth information, or reach out directly with specific questions.