At Rescale, we spend a lot of time optimizing cloud computing for HPC workloads. With the advent of cloud-enabled GPU systems, it is now practical to train deep learning models in the cloud at high performance. In this article, we look at a variety of cloud GPU systems and evaluate the performance of a deep learning workload on each of them.
When training on recent-generation GPUs like P100s and V100s, high-performance accelerators alone are not sufficient. They must be connected to storage that can supply training data at high throughput. Picking the “default” storage options on some cloud providers will likely lead to suboptimal performance.
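One rough way to check whether a storage volume can keep up is to time a sequential read of a training shard. The sketch below is only an illustration, not the method used in the article: the helper names `measure_read_throughput` and `make_sample_file` are hypothetical, and a temporary file of random bytes stands in for real training data.

```python
import os
import tempfile
import time


def measure_read_throughput(path, chunk_size=1 << 20):
    """Read the file at `path` sequentially; return (bytes_read, MB per second)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python-level buffering; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length interval on very small files.
    mb_per_s = (total / 1e6) / elapsed if elapsed > 0 else float("inf")
    return total, mb_per_s


def make_sample_file(size_bytes):
    """Create a temporary file of random bytes (a stand-in for a training shard)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size_bytes))
    return path


if __name__ == "__main__":
    sample = make_sample_file(64 * 1024 * 1024)
    try:
        nbytes, rate = measure_read_throughput(sample)
        print(f"read {nbytes} bytes at {rate:.1f} MB/s")
    finally:
        os.remove(sample)
```

A freshly written file is usually still in the operating system's page cache, so this number will tend to overstate what the underlying volume delivers. For a meaningful comparison, point `measure_read_throughput` at a real dataset file larger than memory, or one that has not been read recently.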
When software engineers look at a piece of code, the first question they ask themselves is “what is it doing?” The next question is “why is it doing it?” In fact, there is a deep connection between these two – why becomes what at the next level of abstraction. That is, what and why are mirrors of each other across an abstraction boundary. Understanding this can help engineers write more maintainable, readable software.
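A small sketch can make the mirroring concrete. The session-expiry scenario and every name in it (`Session`, `drop_sessions_older_than`, `expire_idle_sessions`) are hypothetical and chosen only for illustration; the point is that the outer function's "what" is exactly the inner call's "why".

```python
from dataclasses import dataclass


@dataclass
class Session:
    user: str
    last_seen: float  # seconds since the epoch


def drop_sessions_older_than(sessions, cutoff):
    """WHAT (lower level): keep only sessions seen at or after `cutoff`."""
    return [s for s in sessions if s.last_seen >= cutoff]


def expire_idle_sessions(sessions, now, max_idle_seconds=1800):
    """WHAT (higher level): expire sessions that have been idle too long.

    This is also the WHY of the call below: we filter by timestamp
    because idle sessions must expire.
    """
    return drop_sessions_older_than(sessions, now - max_idle_seconds)


if __name__ == "__main__":
    sessions = [Session("alice", 100.0), Session("bob", 5000.0)]
    active = expire_idle_sessions(sessions, now=5100.0)
    print([s.user for s in active])
```

Read from the bottom up, each function name answers "why" for the code it wraps. Read from the top down, each body answers "what" for the name above it.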
Over the years, we have published a number of blog posts in which we ran MPI microbenchmarks against the offerings from the major public cloud providers. All of these providers have made networking improvements during this time, so we thought it would be useful to rerun these microbenchmarks against the latest generation of VMs. In particular, AWS has released a new version of “Enhanced Networking” that supports up to 20 Gbps, and Azure has released the H-series family of VMs, which offers virtualized FDR InfiniBand.