HPC 101 - Parallelization

Introduction

New users are often confronted by a bevy of hardware specifications when first introduced to an HPC system. However, an often overlooked factor that determines the actual performance gain observed on HPC infrastructure is how well the software in question is optimized to take advantage of multiple nodes, threads, and special accelerators. This, in essence, is the concept of parallelization.

For example, if you take a piece of software written in the 1980s, when most CPUs were single processing units, and run it on a billion-core HPC infrastructure, you will still only see one core being utilized (it will probably run faster regardless, because the performance of a modern CPU core is orders of magnitude above that of a 1980s CPU core).

Software frameworks for parallelization

Parallelization schemes come in many shapes and sizes; however, most parallelized applications make use of common software frameworks to distribute workloads on an HPC system. The MPI (Message Passing Interface) framework is generally used to distribute workloads across nodes, while the OpenMP framework is often used to distribute workloads across multiple cores (threads) within a node. To accelerate workloads on and across GPUs, the CUDA framework provided by NVIDIA is commonly used. These frameworks can be, and often are, used in combination, as in the case of the biomolecular simulation software suite GROMACS.
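
To make the MPI/OpenMP division of labour concrete, below is a minimal sketch of a hybrid "hello world" in C: MPI launches one process (rank) per allocation slot, typically spread across nodes, and each rank spawns a team of OpenMP threads on its local cores. The program name, compiler wrapper, and launch command are illustrative assumptions; the exact commands depend on the MPI installation and scheduler at your site.

    /* hybrid_hello.c -- minimal hybrid MPI + OpenMP sketch.
     * Assumed build command (varies per MPI installation):
     *   mpicc -fopenmp hybrid_hello.c -o hybrid_hello
     */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided;
        /* Request thread support, since OpenMP threads run inside each MPI rank. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's ID */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of MPI processes */

        /* Each rank opens a parallel region; OpenMP fans the work
         * out over the cores available to that rank. */
        #pragma omp parallel
        {
            printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
                   rank, size, omp_get_thread_num(), omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }

Launched with something like "mpirun -np 4 ./hybrid_hello" and OMP_NUM_THREADS=8 (the launcher may instead be mpiexec or srun at your site), this would print 32 lines: 4 ranks, each running 8 threads, which mirrors how a hybrid application such as GROMACS spreads work across nodes and cores.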

Parallelization is software dependent

However, results will vary from one piece of software to the next, and the procedure for breaking up a research workload is often rooted in the research field the user is operating in. Thus, the user is in the best position to understand how their workload will be broken up and distributed across parallel computing resources. The take-home message is that if users wish to utilize any HPC infrastructure to its fullest potential, they will need an in-depth understanding of how the software they use deconstructs and distributes a workload in a parallel computing environment.