Netflix recently open-sourced its in-house data science tool Metaflow. Here are my thoughts:
I’m excited. Look at all this value
Versioning as a first class citizen, right out of the box.
So many times I’ve needed to reproduce a result from the past for verification purposes. But it may surprise the average reader how many of the components are ephemeral: git commits, package version mismatches, and sometimes data sources sensitive to the second.
The Metaflow client looks fantastic for accessing artifacts and results from a previous run. Want to compare the model trained with new data vs the previous model? No problem. Want to serve your results with a lightweight endpoint? That should work.
I have some reservations about the tight integration with AWS but I’ll get to that later.
They get so many things right
Look at this perfect figure included in their release blog:
This aligns with my experience both as an engineer and doing data science work myself.
And these problems are not easy to address.
A data scientist really doesn’t care if data comes from Hive, Cassandra or some god-forsaken csv. But each of these sources has very real and very important costs in terms of size, performance, security and volatility.
Compute resources? “We ran a job once that required a 128gb stack size, so we run everything with 128gb now”. Because we don’t care about the stack size, we care about the job running.
Job scheduler? “I’m using a 2-year old version of pandas because I don’t understand the Java errors I get when running pyspark with the latest version.”
The latest version is the only version.
So where does Metaflow fit in?
As mentioned earlier, I do have some reservations about Metaflow’s tight integration with AWS.
If you are already on AWS, then no problem. Metaflow has put in a lot of engineering to make S3 data transfer fast. And the integration with AWS batch looks fantastic - with a few decorators you can size individual components as necessary.
But if you are on Azure, GCP or even on-prim, you won’t be able to take advantage of these features. But you can still take advantage of Metaflow’s other useful features.
Metaflow is a new product in a field of growing data science orchestration products. Other examples might be Apache’s Airflow or Kubeflow from Google. Metaflow seems to be more developer friendly than the others, but lacks some of the redundancy features of airflow or the requirements rigor of kubeflow. As a developer, I think it’s a great trade-off.
All of these products (including metaflow) will work on any cloud service provider (or on-prem), and one could potentially combine components - for example using metaflow within kubeflow containers.
The future's looking brighter and brighter with these products bringing rigorous engineering to the growing field of data science.
I’m excited to use Metaflow in combination with other packages to create repeatable, versioned data flows.