finally plugs that gap. Now you can mount an arbitrarily large file system to any of your Linux EC2 instances. Like other “cloud-native” solutions before it, you don’t have to provision or manage your own server capacity to run this file system; you just start using it and AWS scales it up or down as necessary. The service is currently in preview and we were lucky enough to get our hands on an account, so we decided to kick the tires a bit. (As a side note, Azure has a similar program under preview as well.)
Why use it?
There are a lot of great reasons a file system is preferable to more traditional cloud buckets, like S3 or Azure Blob Storage.
- All The Libraries. All your existing libraries already know how to read and write files! Especially in scientific computing where you can just HDF5 and done, why would you re-write your existing software to deal with cloud-native buckets?
- Strongly Consistent. A file system makes guarantees about consistency that many blob systems don’t, like that you can write to a file, then turn around and read what you’ve just written. Yes, we can build robust systems by dealing with eventual consistency, but do we always need to?
- Why Do It Yourself? There is a cottage industry of people managing their own cluster-based filesystems in their data centers, from HDFS to Gluster. It seems reasonable to bet that the Amazon offering will at first be less performant than these, but it sure will be a lot easier to manage.
The downside: cost
But it’s not all sunshine and puppies. Let’s compare storage costs (at least today, in the Oregon (US-West-2) region)
Concrete Use Cases
What’s worth doing for a factor of ten in cost? Well, it’s probably not your first choice if you’re handling petabytes of user data in a typical consumer-facing application. But there are a few scenarios where I’d consider it:
- Home Directories. If you want to provision a classic Unix working environment for data scientists, let’s say, you could fire up an EFS and administer it just like a university or government lab unix file system. Compared to provisioning the hardware on-premise, you only pay for what you need, but all your existing tools continue to work.
- Software Distribution for Immutable System Images. In the cloud, you should make your system images immutable. One dies? You need to add more scale? Just launch a new one, no lengthy configuration required. Whether launching from an AMI or using a Docker container, it happens fast if you don’t have to worry about what might be on the system disk of the machines. But it’s still wasteful having to construct these machine images in the first place, every single one of them with their own copy of, say, libgfortran3, or your latest processing algorithms. Why not take a cue from the shared Unix world and use a shared file system to hold your binaries? Your binary images and configuration files probably don’t add up to more than a handful of gigabytes at any one time. Why use CodeDeploy to push new code to your servers when you could just fire up a new set of instances that point at /global/bin/v0.2/bananas instead of /global/bin/v0.1/bananas? Or have your images read their paths via a Consul daemon and automatically re-run themselves when your target configuration changes? Pretty simple when the binaries are on a shared filesystem. In many cases, changing configurations is as simple as source-ing a setup script.
- mmap. OK, strictly speaking this is just an offshoot of ‘All the Libraries’, but memory-mapping is an effective technique when you need semi-random access to small parts of extremely large files. This is common in scientific computing, and especially visualization. (It’s no panacea, but until a file system like this existed, it wasn’t even an option in native AWS for files meant to be shared between multiple machines.)
- Classic HPC. You know, using MPI and parallel HDF5. If you have an existing application and the patience to try it in the cloud, this may be a quick way to treat AWS as a provider of elastic scale without having to re-write any of it. Unfortunately with machines being relatively unreliable, and most MPI applications not dealing gracefully with hardware failure, this only takes you so far in a messy cloud like Amazon’s.
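The versioned-path idea from the software distribution scenario above can be sketched in a few lines. This is only an illustration under stated assumptions: the /global/bin layout comes from the example paths in the text, while the APP_VERSION environment variable and the function names are hypothetical.

```python
import os


def binary_path(name, version, root="/global/bin"):
    """Build the path of one versioned binary on the shared file system."""
    return os.path.join(root, version, name)


def resolve_binary(name, env=None, root="/global/bin", default_version="v0.1"):
    """Pick the binary for the version named in APP_VERSION (hypothetical variable).

    Raises FileNotFoundError if that version is not present and executable,
    so a misconfigured instance fails at startup rather than midway through a job.
    """
    env = os.environ if env is None else env
    version = env.get("APP_VERSION", default_version)
    path = binary_path(name, version, root)
    if not os.access(path, os.X_OK):
        raise FileNotFoundError(f"no executable {name!r} for version {version!r} at {path}")
    return path
```

Rolling out a new release then amounts to launching instances with a different APP_VERSION; the machine image itself does not change.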
Some Read Benchmarks
I spent a few minutes doing some simple benchmarks on large files I had lying around (6GB and 9GB seismic datasets). Without lots of measurements under different usage patterns, especially heavy ones, these need to be taken with a grain of salt. But without any particular effort, it appears that throughput of 100MB/s and over is easily achievable with EFS. We did a few quick and dirty measurements on read performance.
S3 -> EFS
Copying 9GB from S3 to EFS on a single machine with 10GigE connections sustained at 47MB/s. My guess is this was about 50MB/s coming in from S3, and 50MB/s going out, i.e. neither S3 nor EFS was connecting to my machine with more than 1GigE connections. This was using the AWS SDK, so should automatically be doing multi-threaded fetching and the usual tricks to get maximum speed from S3.
Reading 9GB from EFS using that age-old file system benchmark cat file > /dev/null clocked in at 105MB/s from one machine with “high” network performance (an m4.xlarge).
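For anyone who wants the same sequential-read number from inside a program rather than from cat, a minimal sketch might look like the following. The function name and the 8MB chunk size are my own choices, not part of the original test, and a cold-cache figure still requires flushing the local page cache first.

```python
import time


def read_throughput(path, chunk_size=8 * 1024 * 1024):
    """Read a file start to finish and return (bytes_read, seconds, MB per second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 reads straight from the OS without Python's extra buffer layer.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes means end of file
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_per_s = total / 1e6 / elapsed if elapsed > 0 else float("inf")
    return total, elapsed, mb_per_s
```

Calling read_throughput on a file under the EFS mount point gives a figure comparable to the cat measurement, with MB taken as 10^6 bytes.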
multi-machine cat EFS
With three of those instances all reading the same file from EFS (all in the same availability zone), they each averaged 42MB/s for a total bandwidth of over 125MB/s. In the cases where I read from multiple machines, though, the first machine that started the read seemed to have a 10-20% throughput advantage over the others. This certainly suggests that multiple machines could stream data at high speeds if data were spread out properly on EFS, but there are no configuration switches for affecting this yourself. Some further testing comparing one file versus many files would be revealing.
A piece of code which mmaps a 6.1GB file and reads values spaced every 8.4k took 61 seconds to complete, which again hits the magic number of 100MB/s, the basic sequential reading throughput. (The identical test on a 2013 MacBook Pro with an SSD took only 37 seconds, or more like 165MB/s.) This is expected for a typical SSD-based system whose block size is sure to be at least 4k, if not 8k or more. Each read of just 4 bytes pulls in a 4k page, so you don’t do any better than just reading the whole file.
Reading 10,000 values at random locations in the 6.1GB file took only 455ms after flushing local file caches. (That is a seek time of just 45µs, which is at or better than the optimal RTT for a GigE connection. Opportunistic reading or lucky random numbers may have necessitated slightly fewer reads.) Clearly the file was still cached on the EFS server, which is a powerful benefit. (The same test run a second time of course produces a 204ns seek time, which is on the order of RAM seek times, indicating it’s cached on the local machine.)
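The two mmap tests described above can be approximated with Python's standard mmap module. This is a sketch, not the code that produced the numbers: it assumes the file holds 4-byte little-endian integers, reads "8.4k" as a stride of 8400 bytes, and uses function names of my own.

```python
import mmap
import random
import struct
import time

VALUE = struct.Struct("<i")  # assumed sample format: 4-byte little-endian int


def strided_read(path, stride=8400):
    """Read one value every `stride` bytes; return (values_read, seconds)."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = time.perf_counter()
        count = 0
        for offset in range(0, len(mm) - VALUE.size + 1, stride):
            VALUE.unpack_from(mm, offset)  # touching the page forces it to be fetched
            count += 1
        return count, time.perf_counter() - start


def random_read(path, n=10_000, seed=0):
    """Read `n` values at random offsets; return (values_read, seconds)."""
    rng = random.Random(seed)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        last_offset = len(mm) - VALUE.size
        offsets = [rng.randrange(0, last_offset + 1) for _ in range(n)]
        start = time.perf_counter()
        for offset in offsets:
            VALUE.unpack_from(mm, offset)
        return n, time.perf_counter() - start
```

Dividing the elapsed time from random_read by n gives the per-read "seek time" quoted above. As in the original test, the local page cache has to be flushed between runs for the first-read number to mean anything.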
Testing heavy random access from multiple nodes will be an interesting test for another day.
Setting up an elastic file system is extremely easy. To start playing, just use the point-and-click web interface in the AWS console. There was really only one point that was poorly documented. When setting up an elastic file system, you are told to choose a security group for the file system. Recall that this security group basically sets up a firewall for the system. It’s unlikely your default security group allows inbound traffic on the NFS port (2049), so you’ll want to set up a security policy allowing this from whatever machines you intend to use it from. Likewise, the machines on which you’d like to mount the EFS need to enable outbound traffic on port 2049.
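If a mount hangs, a quick way to tell whether the security groups are to blame is to test whether the NFS port accepts a TCP connection at all. The sketch below uses only the standard socket module; the function name is hypothetical, and the host you pass in is whatever mount target hostname the console gives you.

```python
import socket


def nfs_port_reachable(host, port=2049, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

A False result from an instance that should have access usually points at a missing inbound rule on the file system's security group or a missing outbound rule on the instance's group.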
Finally, it’s no fun having the EFS disappear when a machine reboots, so consider adding the EFS mount to the /etc/fstab of your machine image. Unfortunately, you have different mount points per availability zone, so it’s a little harder to bake into a per-region AMI as one usually does. It may be wiser to configure the mount in a provisioning script that runs on boot. The magic line for your /etc/fstab is
us-west-2a.fs-12345678.efs.us-west-2.amazonaws.com:/ /efs nfs4 defaults,nofail,nobootwait 0 2
(obviously replacing with whatever your actual availability zone and filesystem id are!)
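Because the hostname differs per availability zone, a provisioning script can assemble the line at boot instead of baking it into the image. Here is a minimal sketch that follows the hostname pattern shown above; it assumes the region is the availability zone name minus its trailing letter, and the function names are my own.

```python
def efs_fstab_line(availability_zone, filesystem_id, mount_point="/efs"):
    """Build the /etc/fstab entry for one availability zone."""
    # Assumption: region = AZ name minus its trailing letter (us-west-2a -> us-west-2).
    region = availability_zone[:-1]
    host = f"{availability_zone}.{filesystem_id}.efs.{region}.amazonaws.com"
    return f"{host}:/ {mount_point} nfs4 defaults,nofail,nobootwait 0 2"


def ensure_line(path, line):
    """Append `line` to the file at `path` unless it is already there.

    Returns True if the file was modified, so the script is safe to re-run.
    """
    try:
        with open(path) as f:
            content = f.read()
    except FileNotFoundError:
        content = ""
    if line in content.splitlines():
        return False
    with open(path, "a") as f:
        if content and not content.endswith("\n"):
            f.write("\n")  # do not glue the entry onto an unterminated last line
        f.write(line + "\n")
    return True
```

On an instance, the availability zone can be read from the instance metadata service at boot and passed in, followed by something like ensure_line("/etc/fstab", efs_fstab_line(az, fs_id)) run as root.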
I’m looking forward to letting some more applications rip on this file system to see just which ones are worth the price tag. Let us know if you have an interesting use case!