Metaflow: Rapid Reaction

What Netflix’s newly open-sourced data science toolkit has to offer.


Netflix recently open-sourced its in-house data science tool Metaflow.  Here are my thoughts:

I’m excited.  Look at all this value

Versioning as a first class citizen, right out of the box.


So many times I’ve needed to reproduce a past result for verification purposes.  It may surprise the average reader how many of the components involved are ephemeral: git commits, installed package versions, and sometimes data sources that change by the second.
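To make those ephemeral pieces concrete, here is a minimal sketch of recording them by hand with only the Python standard library.  This is not how Metaflow does it; the function names (`snapshot_run`, `git_commit`, `package_versions`) and the sample data are hypothetical, chosen for illustration.

```python
import hashlib
import json
import platform
import subprocess
from datetime import datetime, timezone
from importlib import metadata


def git_commit():
    """Return the current git commit hash, or None if git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def package_versions(names):
    """Map each package name to its installed version (None if not installed)."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def snapshot_run(packages, raw_bytes):
    """Collect the details a later reproduction attempt would need (hypothetical helper)."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "python": platform.python_version(),
        "git_commit": git_commit(),
        "packages": package_versions(packages),
        # Hash the input so a silently changed data source can be detected later.
        "data_sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }


if __name__ == "__main__":
    record = snapshot_run(["pip"], b"id,value\n1,42\n")
    print(json.dumps(record, indent=2))
```

Doing this manually for every run is exactly the kind of bookkeeping that tends to get skipped, which is why having versioning built into the tool is appealing.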


The Metaflow client looks fantastic for accessing artifacts and results from a previous run.  Want to compare the model trained with new data vs the previous model?  No problem.  Want to serve your results with a lightweight endpoint?  That should work.


I have some reservations about the tight integration with AWS but I’ll get to that later.

They get so many things right

Look at this perfect figure included in their release blog:

This aligns with my experience both as an engineer and as someone who does data science work.


And these problems are not easy to address.


A data scientist really doesn’t care if data comes from Hive, Cassandra or some god-forsaken CSV.  But each of these sources has very real and very important costs in terms of size, performance, security and volatility.


Compute resources?  “We ran a job once that required a 128 GB stack size, so we run everything with 128 GB now.”  We don’t care about the stack size; we care about the job running.


Job scheduler?  “I’m using a two-year-old version of pandas because I don’t understand the Java errors I get when running PySpark with the latest version.”


The latest version is the only version.

So where does Metaflow fit in?

As mentioned earlier, I do have some reservations about Metaflow’s tight integration with AWS.


If you are already on AWS, then no problem.  The Metaflow team has put in a lot of engineering to make S3 data transfer fast.  And the integration with AWS Batch looks fantastic - with a few decorators you can size individual components as necessary.
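The idea of sizing each component with a decorator can be sketched in plain Python.  The `resources` decorator and `plan` helper below are hypothetical stand-ins written for this post, not Metaflow’s API (Metaflow ships its own `@batch` and `@resources` decorators, which this sketch does not use).  It only shows the pattern: each step declares what it needs, so one hungry step doesn’t force every step to run with 128 GB.

```python
def resources(cpu=1, memory_mb=4096):
    """Attach a resource request to a step function (hypothetical helper)."""
    def decorate(func):
        # Store the request on the function so a scheduler could read it later.
        func.resources = {"cpu": cpu, "memory_mb": memory_mb}
        return func
    return decorate


@resources(cpu=1, memory_mb=2048)
def load_data():
    """A small step that only needs modest resources."""
    return list(range(10))


@resources(cpu=8, memory_mb=128_000)
def train(rows):
    """The one heavy step; only this step asks for the big machine."""
    return sum(rows) / len(rows)


def plan(steps):
    """Collect each step's declared resource request, keyed by step name."""
    return {step.__name__: step.resources for step in steps}


if __name__ == "__main__":
    for name, request in plan([load_data, train]).items():
        print(name, request)
    print(train(load_data()))
```

A real orchestrator would hand each request to its compute backend; here the request is simply stored as metadata next to the step that needs it.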


But if you are on Azure, GCP or even on-prem, you won’t be able to take advantage of these features.  You can still use Metaflow’s other useful features, though.


Metaflow is a new product in the growing field of data science orchestration tools.  Other examples include Apache Airflow and Kubeflow from Google.  Metaflow seems to be more developer friendly than the others, but lacks some of the redundancy features of Airflow or the requirements rigor of Kubeflow.  As a developer, I think it’s a great trade-off.


All of these products (including Metaflow) will run on any cloud service provider (or on-prem), and one could potentially combine components - for example, using Metaflow within Kubeflow containers.

Closing Statements

The future's looking brighter and brighter with these products bringing rigorous engineering to the growing field of data science.


I’m excited to use Metaflow in combination with other packages to create repeatable, versioned data flows.


Ryan Brady

February 19, 2020
