At Rescale, we spend a lot of time optimizing cloud computing for HPC workloads. With the advent of cloud-enabled GPU systems, it is now practical to train deep learning models with high performance. In this article, we look at a variety of cloud GPU systems and evaluate the performance of a deep learning workload on each.

When training on recent generation GPUs like P100s and V100s, it is not sufficient to just have high-performance accelerators in isolation. They must be connected to storage that can supply training data at high throughput. Picking the “default” storage options for some cloud providers will likely lead to sub-optimal performance.

TensorFlow ecosystem

The landscape of TensorFlow compatible data sources allows quite a bit of flexibility in connecting your data, wherever it lives, directly into TensorFlow. This is great for reducing the need to copy very large datasets between different data stores. However, there is little guidance on recommended configurations or the available performance for the various choices.

TensorFlow published a nice, detailed survey of CNN training performance on the NVIDIA DGX-1 system, as well as a variety of cloud providers. Unfortunately, this survey is over a year out of date. It lacks any results for V100 GPUs and the only P100 results are in the context of the DGX-1. The only cloud GPU evaluated is a K80.

GPU availability in the cloud has improved dramatically since this survey. All the major cloud providers now offer the latest-generation GPU (the V100), and support for new GPUs has been arriving more quickly after release. As the performance of GPU-based deep learning training improves, access to high-throughput data sources becomes more significant. Modern cloud GPU systems can host up to 8 V100s with NVLink interconnect. This is a lot of computational power! To get your money’s worth given the high hourly cost of these systems, more thought must go into the data layer.

Looking closer at the choices of benchmarks from the TensorFlow survey, we can see that cloud data sources probably didn’t matter that much at that time. For the AWS GPU systems, the data source used was EFS, which is a useful file system for redundant cross-zone sharing of data, but it is known to not be very performant for high-throughput applications. We see an inkling of EFS’s drawbacks by noting that the “Training real data” performance metric is listed as “N/A” with a disclaimer:

“Training AlexNet with real data on 8 GPUs was excluded from the graph and table above due to our EFS setup not providing enough throughput.”

The goal of this study is to provide more up-to-date guidance on architecting a high performance training cluster for large datasets.

Experimental Setup

Storage overview

These are the training systems we evaluate here (the top 3 are shown in the above diagram):

  1. Single node (multiple GPUs) connected to “local” or attached disks
  2. Single node directly connected to object storage
  3. Single node connected to a distributed filesystem
  4. Multi-node scale-out

System 4, multi-node, consists of the storage configurations in 2 and 3, except with multiple training nodes.

We evaluate the different configurations across these public clouds:

  • Amazon Web Services
  • Microsoft Azure
  • Google Cloud Platform

TensorFlow Benchmarks

All the results in this article are based on the code in the TensorFlow benchmark GitHub repo. We have made a few minor modifications in the benchmark code detailed below, which reside in this fork.

Multi-ImageNet

We are using the ILSVRC2012 dataset that the TensorFlow benchmarks are designed to work with, but with a twist: we replicated the dataset 5 times so it comes out to approximately 700GB of JPEGs. The reason for augmenting the dataset is that the original 140GB dataset is small enough to easily fit in the RAM of all the nodes that we tested. We wanted to ensure that we were adequately exercising secondary storage rather than just reading records from RAM. For these benchmarks, we are measuring the training data ingestion rate and not model accuracy, so the fact that the data is redundant should not adversely affect our results.

We generated the data by preparing the dataset with this script from the TensorFlow models repo, modified in the following ways (a sketch of the replication step follows the list):

  • Make 4 additional copies of each JPEG file with unique IDs (by suffixing with 00000, 00001, 00002, 00003)
  • Make matching copies of each XML metadata file with the same IDs
  • Run the postprocess portion of the script with 5120 TFRecord shards instead of 1024
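
For illustration, the replication step amounts to something like the following sketch. The directory layout and file naming here are hypothetical; the fork linked above contains the actual modifications:

```python
import glob
import os
import shutil

# Hypothetical locations; the real script operates on the directories
# produced by the ILSVRC2012 preprocessing script.
RAW_DIR = "ILSVRC2012/raw-data/train"
BBOX_DIR = "ILSVRC2012/raw-data/bounding_boxes"
SUFFIXES = ["00000", "00001", "00002", "00003"]

# Make 4 extra copies of every JPEG, e.g.
# n01440764_10026.JPEG -> n01440764_10026_00000.JPEG, ...
for path in glob.glob(os.path.join(RAW_DIR, "*", "*.JPEG")):
    base, ext = os.path.splitext(path)
    for suffix in SUFFIXES:
        shutil.copyfile(path, "{}_{}{}".format(base, suffix, ext))

# The XML bounding-box metadata gets the same treatment so each JPEG
# copy still has a matching annotation file.
for path in glob.glob(os.path.join(BBOX_DIR, "*", "*.xml")):
    base, ext = os.path.splitext(path)
    for suffix in SUFFIXES:
        shutil.copyfile(path, "{}_{}{}".format(base, suffix, ext))
```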

These TFRecord sets are all stored in object storage for the 3 clouds under investigation.

From the benchmarks repo, we chose these 3 supported models in our experiments. These choices were made to demonstrate performance bottlenecks under light, medium, and heavy compute workloads:

“Trivial”: Single input and output layers with nothing in between. It mostly just tests pushing images into the model and pulling classifications out

ResNet-50v2: Microsoft Research’s “residual network” deep learning model used for many performance comparisons. Characterized as having a good balance between accuracy and model complexity

ResNet-152v2: A more complex ResNet, more computation per image

All these experiments were run with TensorFlow 1.6, built from source on the r1.6 branch into a Docker image with this Dockerfile.

Docker images were run via Singularity on the Rescale platform. Our platform makes it easy to run batch computing jobs over a wide variety of cloud hardware with minimal configuration changes.

Training from Attached Disks

Attached Disk Experiment

Starting with the simplest system to test, we look at times to sync data from each cloud’s native object store onto a GPU node’s local storage and then how long it takes to train on a single epoch of our augmented ILSVRC dataset.

We look at how the performance of different disk types on different providers affect overall training throughput. The 2 figures of merit when evaluating these disks are:

  • Read throughput: the rate at which we can pull sequential bytes off of the disk
  • Read IOPS: how many distinct file accesses (seeks) we can make in a second

Since our dataset consists of only 5120 separate TFRecord files of relatively large size, if we were pushing data through the GPUs in a purely sequential fashion, we would expect throughput to be the dominant factor. For our workload, however, we also shuffle the order in which images are provided for training. This is standard practice when training machine learning models to prevent overfitting to the data ordering. The random access of images within TFRecords places strain on the storage system to perform more IOPS, in addition to the throughput necessary to move the image bytes. We will see the impact of these factors in the experiments below.
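
To make the two access patterns concrete, here is a rough sketch of a shuffled TFRecord input pipeline of the kind the benchmarks use. This is illustrative rather than the exact benchmark code; the data path and parser are placeholders:

```python
import tensorflow as tf

DATA_DIR = "/data/imagenet-5x"  # placeholder path to the 5,120 TFRecord shards

def parse_example(serialized):
    # Simplified parser; the benchmark code decodes the full ImageNet
    # Example proto (JPEG bytes, label, bounding boxes, ...).
    features = tf.parse_single_example(
        serialized,
        {"image/encoded": tf.FixedLenFeature([], tf.string),
         "image/class/label": tf.FixedLenFeature([], tf.int64)})
    return features["image/encoded"], features["image/class/label"]

# Shuffle the order of the shard files themselves ...
files = tf.data.Dataset.list_files(DATA_DIR + "/train-*").shuffle(5120)
# ... read several shards concurrently, which turns one sequential stream
# into many interleaved reads (more IOPS) ...
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=10)
# ... and shuffle individual records in a buffer before batching.
dataset = dataset.shuffle(buffer_size=10000).map(parse_example).batch(256)
```

Shuffling at the file level and again at the record level is what turns a mostly sequential read workload into one that also needs a healthy IOPS budget.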

| Provider | GPUs Available | Max GPU Count | Storage Type    | Max Read Throughput | Max Read IOPS |
|----------|----------------|---------------|-----------------|---------------------|---------------|
| AWS      | V100           | 8             | gp2 EBS         | 160 MB/s            | 10000         |
|          |                |               | st1 EBS         | 500 MB/s            | 500           |
|          |                |               | io1 EBS         | 500 MB/s            | 32000         |
| Azure    | V100 & P100    | 4             | premium storage | 250 MB/s            | 7500          |
| Google   | P100           | 4             | persistent SSD  | 800 MB/s            | 40000         |
|          |                |               | local SSD       | 1500 MB/s           | 400000        |

Here we list out the different disks that we tested among the different clouds. AWS has the most varied disk offering with “general” (gp2), high-throughput (st1), and high-IOPS (io1) options. In the case of io1, we allocate disks big enough to get the maximum throughput allowed and with the maximum number of provisioned IO operations (PIOPs) listed in the table.
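
As a point of reference, provisioning an io1 volume with a specific number of PIOPs looks roughly like the boto3 call below. The size, IOPS, and zone shown are illustrative values, not necessarily the exact ones we used:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Illustrative values: io1 throughput and IOPS scale with the provisioned
# settings, so the volume is sized to hold the ~700GB dataset and requests
# the maximum provisioned IOPS from the table above.
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    VolumeType="io1",
    Size=1000,      # GiB
    Iops=32000,     # provisioned IOPS (PIOPs)
)
print(volume["VolumeId"])
```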

For the Google Cloud disks, we only tested the persistent SSDs, since we found the additional performance of the local SSDs did not make much of a difference when feeding the previous-generation P100 GPUs.

The provided TensorFlow benchmarks measure images per second, so this is the primary figure of merit in all the results in this study.

Attached Disk Performance - Trivial

We start with the trivial model, which mostly just tests throughput of training data into GPU. This model is the most demanding on the data layer. The io1 EBS volume, with maxed out IOPs, provides best training image throughput.

The GCP bars are absent from the V100 metrics since we did not have results on those GPUs yet. The Azure bar is missing from the 8xV100s bar group because the Azure NCv3 series VMs only support up to 4 V100s.

While the AWS io1 disks coupled with V100s offer superior performance, we see that the other AWS disk types perform much worse. The next step is to look at how much these disks impact the system when the GPUs are actually doing some work.

Attached Disk Performance - ResNet50

Moving on to a real model, we see Azure disks can deliver the necessary throughput to match io1 disks in the case of 4 V100s. On AWS, once again, the st1 (blue) and gp2 (red) disks cannot keep up with 4 V100s the way io1 disks can. We see pretty good scale-up from 4 to 8 V100s when using an io1 disk (1.75x).

Attached Disk Performance - ResNet152

Looking at a more complicated model where the compute to data ratio is quite a bit higher, even lower-performing gp2 disks (red) are able to provide some scaling from 4 to 8 V100s. The io1 disk systems scale similarly compared to the ResNet-50 case (1.73x).

Let’s drill into some more detailed disk metrics for the V100 systems to see how heavily utilized these different disks are and confirm that utilization matches up with the images/second measurements.

4xV100 Disk Utilization

Looking at the disk utilization supplying 4 V100 GPUs, we see the gp2 and st1 disks are already maxed out at 100%, which matches up with the lack of scaling from 4 to 8 GPUs in the previous graphs. io1 and Azure premium disks still have plenty of capacity.

8xV100 Disk Utilization

Moving up to disk utilization on the AWS 8xV100 system, we see the io1 disks still have a bit of headroom and are unlikely to be a bottleneck.

Returning to the gp2 and st1 disks, to drill further into their low performance we can compare metrics from actual disk usage to the listed specifications:

| Provider | GPUs Available | Max GPU Count | Storage Type    | Max Read Throughput | Max Read IOPS |
|----------|----------------|---------------|-----------------|---------------------|---------------|
| AWS      | V100           | 8             | gp2 EBS         | 160 MB/s            | 10000         |
|          |                |               | st1 EBS         | 500 MB/s            | 500           |
|          |                |               | io1 EBS         | 500 MB/s            | 32000         |
| Azure    | V100 & P100    | 4             | premium storage | 250 MB/s            | 7500          |
| Google   | P100           | 4             | persistent SSD  | 800 MB/s            | 40000         |
|          |                |               | local SSD       | 1500 MB/s           | 400000        |

8xV100 Disk Throughput

First, comparing gp2 throughput to the specifications, we see that under load we reach just about the maximum read throughput (~160 MB/s), so that appears to be the limiting factor for these disks. However, st1 disks are nowhere near their 500 MB/s read throughput limit, so those disks must be getting throttled for some other reason.

8xV100 Disk IOPS

Plotting read IOPS shows where st1 disks fall down. When shuffling inputs from TFRecords this workload maxes out the read IOPS. So even though st1 disks can match io1 disks in terms of data throughput, the IOPS required to provide shuffled data is the limiting factor.

Object Storage Sync Times

Before we close this section, we take a quick look at the second piece of training with data on attached disks: getting the data there in the first place. We measure the time to sync our 700GB dataset to each node from that cloud’s native object storage in the same region. We see that for all 3 clouds, sync times can be significant if your training job is short, but no cloud provider is better or worse by a large margin.
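
The syncs themselves just use each cloud’s own CLI tooling; a rough sketch of how such a sync can be timed is below. Bucket names and local paths are placeholders, and azcopy plays the equivalent role on Azure:

```python
import subprocess
import time

# Placeholder bucket/prefix and destination paths.
SYNC_COMMANDS = {
    "aws": ["aws", "s3", "sync",
            "s3://my-training-data/imagenet-5x", "/scratch/imagenet-5x"],
    "gcp": ["gsutil", "-m", "rsync", "-r",
            "gs://my-training-data/imagenet-5x", "/scratch/imagenet-5x"],
}

start = time.time()
subprocess.check_call(SYNC_COMMANDS["aws"])
print("sync took {:.0f} seconds".format(time.time() - start))
```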

To wrap up the section: each cloud provider has locally attached storage capable of keeping up with the most data-intensive TensorFlow benchmarks, but not all disk types are suitable. Most notably, the general-purpose and throughput-optimized disks end up not being good choices for systems with many V100s. Operationally, moving training data around for every batch job just to be close to the compute is impractical in many situations, so we move on to schemes where we can read data from a shared repository.

Training Direct from Object Storage

Direct Object Storage Training

For our next evaluation, instead of syncing data from object storage to an attached disk first, we stream data directly into the GPUs. We see that while attached storage still provides much better performance than streaming from object storage, object storage can deliver performance that is good enough when training more complex models.

We make use of the native TensorFlow integration with AWS S3 detailed here. Google Cloud Storage (GCS) integration is available via a similar native interface with the URI prefix gs:// rather than s3://.
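
Because these integrations plug into TensorFlow’s filesystem layer, pointing the input pipeline at a bucket is mostly a matter of swapping the path prefix, roughly as below. Bucket names are placeholders, and the S3 reader also expects the usual AWS credential and region environment variables to be set:

```python
import tensorflow as tf

# Placeholder bucket and prefix names.
s3_files = tf.data.Dataset.list_files("s3://my-training-data/imagenet-5x/train-*")
gcs_files = tf.data.Dataset.list_files("gs://my-training-data/imagenet-5x/train-*")

# The rest of the pipeline is unchanged: the TFRecord reader streams
# records straight from the object store.
dataset = s3_files.interleave(tf.data.TFRecordDataset, cycle_length=10)
```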

In the above results, we leave in the Google persistent SSD results to compare against image throughput from Google Cloud Storage (GCS, Google’s object storage). GCS does quite well, matching performance for the compute-heavy model (ResNet-152) and about 70% of the SSD performance on ResNet-50v2.

The surprising result here is what happens when reading data directly from S3 to AWS V100s: the performance is very poor. Upon seeing this, we took a closer look at what might be happening in CloudWatch metrics for the S3 bucket:

4xx Errors from S3

From this, we see why the performance is relatively poor. The CloudWatch metrics for the S3 bucket holding the dataset show a good number of throttling errors coming from S3. This limitation in blob GETs severely impacts performance.

In a future post, we will investigate caching mechanisms to get around these throttling issues and get better performance when pulling data from object storage.

For now, let’s move on and look at some other options for hosting training data.

Training from a Distributed File System

Distributed File System - Single Node

Next we host our training data on a distributed parallel file system to see if we can provide training data with higher throughput to our GPUs. For this experiment, we survey 2 separate filesystems:

  • Lustre: high-performance, POSIX filesystem from the HPC community
  • HDFS: big-data focused virtual filesystem

In the case of Lustre, TensorFlow can pull files directly from the POSIX filesystem. The only constraint on the training node is that the Lustre client kernel module needs to be compatible with the Linux kernel in use.

For HDFS, TensorFlow comes with a custom reader. TensorFlow references HDFS files when you specify a file or directory with the hdfs:// prefix and set up the associated environment to reference Java, the Hadoop libraries, and your HDFS cluster, as noted here.
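
Concretely, the setup amounts to exporting a few environment variables before launching training and then addressing TFRecord shards by hdfs:// URI. A minimal sketch, with illustrative paths and Namenode address:

```python
import tensorflow as tf

# The shell that launches this script needs (paths illustrative):
#   export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
#   export HADOOP_HDFS_HOME=/opt/hadoop-3.0.0
#   export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server
#   export CLASSPATH=$($HADOOP_HDFS_HOME/bin/hadoop classpath --glob)
#
# With that in place, TFRecord shards on the cluster are addressed by URI;
# the Namenode host/port and dataset path here are placeholders.
files = tf.data.Dataset.list_files(
    "hdfs://namenode:9000/datasets/imagenet-5x/train-*")
dataset = files.interleave(tf.data.TFRecordDataset, cycle_length=10)
```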

In terms of ease of deployment, even though Lustre has the simplicity of presenting a POSIX filesystem, the kernel module requirement makes it a bit more restrictive than connecting to HDFS, which just requires a JVM.

Distributed File System - Single Node Detail

To limit the scope of this distributed file system study, we limit ourselves to AWS infrastructure. We start with filesystem clusters with a single metadata node (MDS for Lustre, Namenode for HDFS), and a single data node (OSS for Lustre, Datanode for HDFS).

The storage cluster is deployed within a single placement group to minimize latency between nodes. The storage cluster and training node are deployed within the same availability zone to maximize throughput.

The metadata instance type (m4.16xlarge) is selected to provide a large amount of memory (256 GB) and have high network bandwidth (25 Gb/s). Metadata operations tend to be more memory-limited than storage-limited, but we provisioned the metadata instances with high-performing io1 disks anyway, so we were not limited by metadata performance. We found storage to be under-utilized on these systems.

The data node (i3.16xlarge) is selected to provide fast storage (local NVMe) as well as high network bandwidth (25 Gb/s). The actual specs from AWS are vague: “up to” 3.3 million IOPS and 16 GB/s throughput across 8 disks, which is much better than even the highest-performing io1 EBS disks. In practice, we also found the i3 local NVMe disks to provide substantially better performance than any EBS volume.

Lustre version 2.10.2 was used with a ZFS backend, TCP transport and default options for the most part otherwise. For HDFS, we used the Hadoop 3.0.0 distribution. We ran the same set of TensorFlow models using the same augmented dataset and the results are as follows:

Lustre vs. HDFS

Running all three benchmark models on our 8xV100 AWS node, we see that HDFS (red) beats out Lustre (blue) in TensorFlow performance on AWS infrastructure, especially as we move to simpler models where data throughput is more critical. This could be due both to HDFS being better suited to the Ethernet fabric and to better-optimized file access patterns in the TensorFlow HDFS connector.

Attached Disk Performance - ResNet50

Referring back to the previous attached-storage io1 EBS results (purple), we see that HDFS is capable of matching the required data throughput (4000+ images/second) for ResNet-50 and more complex models.

Distributed Training

For the last set of benchmarks, we scale up the training cluster to have multiple nodes. We choose the Horovod framework to handle the parameter updates. This was pretty easy to deploy on a Rescale cluster since:

  1. Clusters all ship with some MPI flavor
  2. The intra-cluster network is configured for MPI traffic
  3. Machinefiles are always generated for our clusters

For these experiments, we use Horovod version 0.11 with Intel MPI. We opt not to test distributed training performance in a sync-to-local-first scenario, since our distributed filesystem was able to match that performance and is a much more convenient mode of deployment across many nodes.
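
For context, the Horovod changes to a TensorFlow training script are small; a minimal sketch of the pattern (using a stand-in model rather than the benchmark code) looks like this:

```python
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Pin each MPI rank to a single GPU on its node.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Stand-in model; the real input comes from the shuffled TFRecord pipeline.
images = tf.random_normal([64, 224 * 224 * 3])
labels = tf.random_uniform([64], maxval=1000, dtype=tf.int64)
logits = tf.layers.dense(images, 1000)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

# Wrap the optimizer so gradients are averaged across ranks via MPI allreduce.
opt = tf.train.MomentumOptimizer(learning_rate=0.01 * hvd.size(), momentum=0.9)
opt = hvd.DistributedOptimizer(opt)
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so every worker starts in sync.
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]
with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        sess.run(train_op)
```

Each copy of the script runs as one MPI rank per GPU, launched via mpirun across the hosts in the cluster machinefile.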

Direct from object storage

Horovod from GCS

We first revisit our tests where TensorFlow is pulling data directly from some object storage. Google Cloud Storage provided good performance here so we choose to focus on that case. Each node in the cluster pulls TFRecords from the same GCS bucket.

Horovod performance GCS

Surprisingly, we are able to achieve linear scaling for all 3 models up to 32 P100 GPUs. The average image size in the ILSVRC dataset is approximately 100KB, so we estimate we were able to read up to 3GB/s (aggregate) directly from our dataset bucket into the nodes in a single zone. There seemed to be additional headroom available in pulling data directly from GCS. This is pretty impressive. Future work will include more investigation of these limits.

HDFS Served

Horovod from HDFS

Since HDFS proved to be an effective way of providing high training data bandwidth to even the hungriest 8xV100 nodes, we now see how it scales in the multi-node case. In anticipation of increased demand, we deployed a second i3 Datanode and set up HDFS to fully replicate all data across the 2 nodes.

Horovod performance HDFS

HDFS storage scales up to at least 64 V100s (8 nodes) for the complex model (ResNet-152v2), and up to 32 V100s for simpler models (ResNet-50v2 and Trivial). More data-intensive workloads on 64 V100s hit the limit of what our HDFS setup can service. From this view, it looks like our 2 Datanode cluster can service up to approximately 2.5GB/s (almost 25000 images/second at ~100KB per image).

Horovod HDFS BW

We can further validate this estimate by looking at data transfer out of the Datanodes. For the Trivial model servicing 64 V100s, the maximum bandwidth out of a single Datanode is approximately 11Gbit/s. With 2 Datanodes in this HDFS cluster, our aggregate bandwidth is ~22Gbit/s, which comes out to about 27000 images/second (again, 100KB per image), which matches the above results of 25000 images/second plus some overhead.

The i3.16xlarges used as Datanodes host data on 6 RAIDed NVMe drives behind a network connection specified as 25 Gbit/s, which suggests there may be room for further tuning of our network settings and/or HDFS configuration to extract additional performance.

…but for now, let’s just throw more hardware at it:

Bigger HDFS cluster

We double the number of Datanodes (still fully replicated) and see how much further we can scale our training cluster. Assuming linear scaling, this new storage cluster should be able to provide 44Gbit/s aggregate bandwidth, which comes out to 55000 images/second (5.5GB/s, 100KB per image).

Bigger HDFS performance

From the new purple bar, it turns out that 4 Datanodes is more than adequate here. In the ResNet-50v2 case, the per-node throughput aggregated across 8 p3.16xlarge instances is, surprisingly, slightly above the best performance of a single 8xV100 node. We seem to have reached the limit of overall system performance and are no longer storage-bandwidth constrained.

In the case of the Trivial model, we are once again able to double throughput, reaching 48000 images/second. This is below the 55000 images/second estimated storage limit, which suggests that training the simplest model is now constrained by something other than storage.

While we were able to get good performance from our V100+HDFS configuration, the real surprise here is the direct-from-GCS configuration. Looking at training performance/$, it seems like a very promising option.

Future Work

Based on the initial findings here, there are a bunch of avenues that we plan to explore:

New TensorFlow Releases

TensorFlow is still rapidly evolving, with many additional performance improvements added since the version used for these experiments (1.6). We will continue periodic benchmarking of new versions as they are released.

GCP V100s

Google Cloud recently added support for V100s. We are particularly interested in seeing how the Google Object Storage benchmarks scale with the updated GPUs.

TPUs

With recent advances in training performance on Google’s Tensor Processing Unit, we are interested in reproducing recent results and seeing how they stack up against “traditional” GPU-accelerated workloads.

FUSE Caching for Object Storage

We can access object storage as the primary data source while using local storage to cache already accessed TFRecords. All 3 cloud providers surveyed here have FUSE filesystems to mount an object storage bucket to the local filesystem. s3fs (for AWS S3) and blobfuse (for Azure Storage) allow use of local storage to cache objects and save future round-trips to object storage.

The caching mechanism in s3fs is especially interesting because locally caching TFRecord objects could reduce the chance of hitting the throttling conditions seen in our previous S3 experiment.

Tuning Distributed Filesystems

We would like to better understand the performance gap between our Lustre and HDFS storage clusters. Is this a difference in efficiency between TensorFlow’s POSIX and HDFS readers? Or, perhaps more likely, is there room for improvement in our Lustre configuration? Furthermore, we were not able to saturate the network connections for either storage cluster, suggesting there is additional performance left to extract from both.

In future experiments, we will see how close we can get to the single 8xV100 node limit of 4200 images/second training ResNet-50v2 in aggregate across multiple nodes. In the case of the 8 node configuration, if storage bandwidth is the limiting factor, we should be able to handle up to about 33000 images/second.

Performance/$

In trying to stretch the limits of cloud data storage, we paid little attention to how much each configuration costs per hour. For example, our io1 EBS disks were provisioned with the maximum number of PIOPs and thus ended up costing ~$15/hour per disk. For all our distributed file system studies, we picked i3.16xlarge instances to get one of the best-performing local NVMe disk arrays, but the bottleneck in these systems is more likely the network link to the client training nodes, so picking a cheaper instance type with similar network performance might do just as well.

We would like to look at the cheapest way to hit certain images/second targets or the configurations with the highest images/second/$.

Cross Zone/Region Considerations

In the above studies, we assumed our GPU training clusters were either: a) in the same region as the dataset in object storage, or b) in the same availability zone as the DFS storage cluster.

We plan to look at costs and performance considerations of making data available across zones and regions, including both redundant storage costs necessary, as well as replication delay and transit costs.

Performance IS possible with cloud training data

That concludes the first in a series of posts on high-performance deep learning in the cloud. In this study, we found several training data configurations that yielded good TensorFlow performance, even when servicing a node full of the latest GPUs:

  • AWS io1 disks offer great attached storage performance even when training data-intensive models on all 8 V100s
  • Google Cloud Storage scales very well for multi-node training jobs. We have yet to fully test the throughput capabilities.
  • An HDFS cluster within the same availability zone as the training cluster is an effective way to share training data across multi-node GPU clusters.

If you are a software developer and are interested in high-performance computing in the cloud, we are hiring!