Serverless ML

Classify your dog for less than a penny!

Serverless hosting has revolutionized how projects are delivered to the world at large, with savings in maintenance, infrastructure costs, and the hours spent managing environments.

Some teams have struggled with package sizes and worried about performance, but I'm here to tell you that you can work within the deployment size limits, and you can absolutely get real-time results instead of waiting for the next batch run.

Size Matters.  

The first roadblock a team will inevitably face comes from the hard limits of the providers themselves.  Google Cloud Functions are limited to 500MB uncompressed deployments.  AWS Lambda functions are limited to 250MB uncompressed deployments (though watch out for the newly released container support on AWS, with an eye-popping 10GB limit).

Python packages add up quickly: the base installations of TensorFlow and PyTorch alone already exceed the size limits of our most popular cloud choices.

However, we can work within these limitations. By using smaller runtimes, we reduce the need for the full suite of packages used by data scientists, keeping only what's needed.  ONNX in particular provides a path to convert models created in TensorFlow, scikit-learn, PyTorch - essentially all of the most commonly used machine learning frameworks - into a model compatible with the relatively small ONNX Runtime package (~13MB).

Furthermore, we can work with cloud storage (such as AWS S3 or GCS Storage buckets) to store model weights and artifacts for just-in-time consumption.
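A minimal sketch of that just-in-time pattern (the key and the injected `download` callable are placeholders; in a real Lambda, `download` would wrap boto3's `s3.download_file`):

```python
import os

MODEL_DIR = "/tmp"  # the only writable scratch path inside a Lambda container

def cached_model_path(key, download, model_dir=MODEL_DIR):
    """Fetch a model artifact once per warm container, then reuse it.

    `download(key, local_path)` is any callable that writes the object to
    disk -- e.g. a wrapper around boto3's s3.download_file.
    """
    local_path = os.path.join(model_dir, os.path.basename(key))
    if not os.path.exists(local_path):
        download(key, local_path)  # only paid for on a cold start
    return local_path
```

Warm invocations skip the download entirely, which is why only the first call to a function feels slow.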

For context, the image classification model used in this blog requires 50MB.  A fraud detection model used in another project required < 30MB of space.  A toy reinforcement learning routing model used in another blog of mine required < 10MB of space.

GPT-2 - the popular transformer-based language model - requires 635MB of space in ONNX format, and thus cannot fit within an AWS deployment package.  However, with S3 and an increased memory limit (1.5GB), we can still load the model directly from S3 and serve generated text.

What about compatibility?

Works on my box?  Works on the cloud box too.  By building and assembling our custom models and dependencies inside a cloud-provider-supplied Docker image, we can ensure that our results will remain consistent regardless of the deployment location.  And that means we can use the absolute latest version (or the last unbroken older version) of whatever dependencies we need - our choice.
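A sketch of that build step, assuming AWS's public SAM build image and an ONNX Runtime layer (the image tag and paths here are illustrative, not the only valid choices):

```shell
# Install dependencies inside AWS's own build image, so any native
# extensions are compiled against the Lambda runtime, then zip as a layer.
docker run --rm -v "$PWD":/work -w /work \
  public.ecr.aws/sam/build-python3.9 \
  pip install onnxruntime numpy -t layer/python

cd layer && zip -r ../onnx-layer.zip python
```

The zip becomes a Lambda layer, so the function code itself stays tiny.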


As an initial demonstration, here is a simple deployment built around the classic iris dataset from scikit-learn:

  1. On my laptop, I created a simple classifier for the iris problem based on petal and sepal dimensions (a classic machine learning toy problem).
  2. I converted the model to ONNX format and saved it to S3.
  3. I prepared an AWS Lambda layer with the ONNX Runtime installed.
  4. I authored an AWS Lambda function to load the model from S3 and classify flowers on demand.
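The steps above can be sketched as a handler like the following (the request field names and the injected `predict` callable are illustrative; in the deployed function, `predict` wraps an ONNX Runtime session over the model fetched from S3):

```python
import json

IRIS_CLASSES = ["setosa", "versicolor", "virginica"]

def make_handler(predict):
    """Build a Lambda-style handler around any predict callable.

    `predict` takes a list of 4 floats and returns a class index --
    in production, a wrapper around onnxruntime's InferenceSession.
    """
    def handler(event, context=None):
        body = json.loads(event["body"])
        features = [float(body[k]) for k in
                    ("sepal_length", "sepal_width",
                     "petal_length", "petal_width")]
        idx = predict(features)
        return {"statusCode": 200,
                "body": json.dumps({"species": IRIS_CLASSES[idx]})}
    return handler
```

Keeping the model behind a plain callable makes the handler trivial to test locally before wiring it to S3.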

But what about something bigger?

The iris problem and its solution are not particularly big.  What about something larger?

For the next proof, I will use the EfficientNet-Lite4 image classification model.  This model is a bit bigger, weighing in at 50MB.  Can the model correctly classify my dog Rocky?
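Before invoking it, the image has to be shaped the way the network expects. A hedged sketch of the pixel preprocessing, assuming the common EfficientNet-Lite convention of scaling to [-1, 1] (check the specific model's documented input resolution before resizing):

```python
import numpy as np

def preprocess(pixels):
    """Scale uint8 RGB pixels to [-1, 1] and add a batch dimension.

    `pixels` is an HxWx3 array already resized to the model's input
    resolution (resizing Rocky's portrait happens upstream).
    """
    x = pixels.astype(np.float32) / 127.5 - 1.0
    return x[np.newaxis, ...]
```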

Success!  Note that the first invocation of a cloud function typically takes longer due to the provider's internals for loading layers and artifacts from storage.  For this test, I needed to increase the timeout and memory limits for the function.

You said something about realtime?

Yes, absolutely.  We can classify an image in under 3 seconds, and frankly, an ocean of models can run even faster than this.  Product recommendations, routing decisions, fraud detection, chat bots - run your model and deliver outcomes right now.

What about the cost?

I ran this particular test on AWS Lambda.  Here's the bill:

2616 ms (rounded up) time with <512 MB of memory: $0.00002171
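That bill is easy to sanity-check against Lambda's GB-second pricing. The rates below are assumptions (the us-east-1 x86 prices at the time of writing), so treat this as a rough estimator rather than a billing tool:

```python
GB_SECOND_PRICE = 0.0000166667  # assumed per-GB-second rate (us-east-1, x86)
REQUEST_PRICE = 0.0000002       # assumed per-invocation fee

def lambda_cost(duration_ms, memory_mb):
    """Rough Lambda bill: billed duration times allocated memory, plus request fee."""
    gb_seconds = (duration_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * GB_SECOND_PRICE + REQUEST_PRICE

print(round(lambda_cost(2616, 512), 8))
```

Running it for 2616 ms at 512MB lands within rounding distance of the ~$0.00002 figure above.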


It is possible to run machine learning inference on serverless cloud compute.  For fast, affordable, and scalable consumer-facing models, that makes a lot of sense.

What's next?

Getting the model running does most of the work, but we may still need transformations and feature engineering that don't fit into our simple inference function - resizing Rocky's portrait, for example.  We can still perform those transformations in the serverless environment and weld them all together to create a true machine learning pipeline.
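One way to weld those steps together - a minimal sketch where the lambdas are stand-ins for real resize, preprocess, and classify functions:

```python
from functools import reduce

def pipeline(*steps):
    """Compose steps into one callable; each step could equally be its own
    serverless function invoked in sequence (e.g. via Step Functions)."""
    return lambda x: reduce(lambda acc, step: step(acc), steps, x)

# Illustrative stand-ins for resize -> normalize -> classify:
classify_image = pipeline(
    lambda img: img + "|resized",
    lambda img: img + "|normalized",
    lambda img: img + "|classified",
)
```

Each step stays independently testable and deployable, which is exactly what the serverless model rewards.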

What other problems can Expero help solve?
