A complete and highly customizable HPC software stack
Sponsored Feature. There are many things that the HPC centers and hyperscalers of the world have in common, and one of them is their attitude towards software. They like to control as much of their system software as they can because this allows them to squeeze as much performance out of their systems as possible. However, the time, money, and expertise required to create what amounts to custom operating systems, middleware, and runtimes are too onerous for most other organizations that stand to benefit from HPC in its many forms.
With the number and variety of compute engines in the data center growing rapidly, and with a widening set of HPC applications to run on them, building and maintaining a full HPC software stack is a tall order. Those applications include traditional simulation and modeling as well as data analytics and machine learning and, increasingly, a hodgepodge of these techniques piled into a workflow that constitutes a new type of application.
What if it could be more of a group effort? What if there was a way to create a complete HPC software stack that was still optimizable for very specific use cases? Wouldn’t that be a benefit to the wider HPC community, and especially to academic, government, and corporate centers that don’t have the resources to build and maintain their own HPC stack?
It’s hard to argue against customization and optimization in HPC, so don’t think that’s what we’re doing here. Quite the contrary. But we’re thinking of a sort of curated mass customization that benefits more HPC users on more diverse architectures, which matters because system architectures are becoming more heterogeneous over time, not less.
Each maker of CPUs, GPUs, or FPGA accelerators, not to mention custom ASIC suppliers, creates its own compilers and often its own application development and runtime environments in the never-ending task of extracting more performance from the expensive HPC clusters that organizations build out of their compute engines and networks. (After all, it’s hard to separate compute performance and network performance in a clustered system. That’s one of the reasons Nvidia paid $6.9 billion for Mellanox.)
The list of important HPC compilers and runtimes is not long, but it is varied.
Intel had its historical Parallel Studio XE stack, which included C++ and Fortran compilers and a Python interpreter plus the Math Kernel Library, the Data Analytics Acceleration Library, Integrated Performance Primitives (for algorithm acceleration in specific domains), and Threading Building Blocks (for shared-memory parallel programming), plus an MPI library for implementing message-passing scale-out clustering and optimizations for the TensorFlow and PyTorch machine learning frameworks. All of this is now included in Intel’s oneAPI toolkits.
Nvidia created its Compute Unified Device Architecture, or CUDA, to make it easier to move computing tasks from CPUs to GPUs instead of having to resort to OpenGL. Over time, the CUDA development environment and runtime added support for the OpenMP, OpenACC, and OpenCL programming models. In 2013, Nvidia bought the venerable PGI C, C++ and Fortran compilers, which came out of mini-supercomputer maker Floating Point Systems decades ago, and for more than a year the PGI compilers have been distributed as part of the Nvidia HPC SDK stack.
AMD has the Radeon Open Compute Platform, or ROCm for short, which heavily leverages the Heterogeneous System Architecture runtime and has a compiler front end that can generate hybrid code to run on both CPUs and GPU accelerators. Most importantly, the tools that make up the ROCm environment are open source. ROCm supports the OpenMP and OpenCL programming models and adds another, the Heterogeneous-Compute Interface for Portability (HIP), which is a C++ kernel language and runtime for GPU offloading that can generate code to run on either AMD or Nvidia GPUs. It can also convert code created for Nvidia’s CUDA environment to run atop HIP, which gives that code a degree of portability.
The Cray Linux Environment and compiler set, now sold by Hewlett Packard Enterprise as the Cray Programming Environment suite, comes to mind immediately. It can be used on HPE’s own Cray EX systems with Intel or AMD processors and Nvidia, AMD or Intel GPUs (incorporating each vendor’s tools) as well as on Apollo 80 machines using Fujitsu’s heavily vectorized A64FX Arm server processor. Arm has its Allinea compiler set, which is important for the A64FX processor as well as for the Neoverse Arm processor designs that will come out with vector extensions in the coming years. Fujitsu also has its own C++ and Fortran compilers, which can run on the A64FX chip, and of course there is also the open source GCC compiler set.
There are other important HPC compiler and runtime stacks with acceleration libraries for all sorts of important algorithms in various fields of simulation, modeling, financial services, and analytics. The more, the merrier. But here’s the important lesson exemplified by HPE’s launch of the Apollo 80 system with the A64FX processor: Not all compilers are good at compiling all types of code. This is something that all university and government supercomputing centers, especially those that change architectures a lot, know very well. Diverse computing will mean diverse compilation.
And, therefore, it is better to have many different compilers and libraries in the toolkit to choose from. In fact, what the HPC market really needs is a hyper-compiler that can examine code and determine which compiler should be used, across a wide range and possibly a diverse mix of compute engines, to get the best performance. We don’t think the HPC industry needs many different comprehensive HPC SDKs, each developed by its own vendor advocate, as much as it needs compilers and libraries from many different experts, all of which can be integrated into a single, broad, and complete SDK framework for HPC workloads.
Stepping up to the next level in the HPC software stack, and further complicating the situation, is the fact that each HPC system manufacturer has its own Linux environment, or one that has been anointed as the chosen one from IBM’s Red Hat unit, SUSE Linux, or Scientific Linux, or one that is cobbled together by the HPC center itself.
In an HPC world where security and efficiency are paramount, we need a stack of operating systems, middleware, compilers, and libraries that is designed as a whole, with options that can be slipped into and out of the stack as needed, and that offers the widest range of choices. This software does not need to be open source, but it must be able to be integrated consistently via APIs. For inspiration for this HPC stack, we look to the OpenHPC effort led by Intel six years ago and to the Tri-Lab Operating System Stack (TOSS) platform developed by the US Department of Energy, specifically by Lawrence Livermore National Laboratory, Sandia National Laboratories and Los Alamos National Laboratory. The TOSS platform is used on the commodity clusters shared by these HPC centers.
The OpenHPC initiative seemed to be gaining ground a year later, but a few more years passed, and by then no one was talking about OpenHPC. Instead, Red Hat was creating its own Linux distribution tailored to run traditional HPC simulation and modeling programs, and the world’s two largest supercomputers, “Summit” at Oak Ridge National Laboratory and “Sierra” at Lawrence Livermore, were running Red Hat Enterprise Linux 7. The OpenHPC effort was a little too Intel-centric for many, but that focus was understandable to some extent at a time when neither AMD CPUs or GPUs nor Arm CPUs were in the HPC hunt. The mix-and-match nature of the stack, however, was the right idea.
Our thought experiment on an HPC stack goes further than just allowing anything to plug into OpenHPC. What we want is something more like TOSS, which was presented four years ago at SC17. With TOSS, the labs created a derivative of Red Hat Enterprise Linux that used consistent source code across X86, Power, and Arm architectures and a build system to eliminate elements of RHEL that were extraneous to HPC clusters and to add other software that was needed.
In a conversation about exascale systems in 2019, Bronis de Supinski, CTO of Livermore Computing, said that Lawrence Livermore extracted 4,345 packages from the more than 40,000 packages that make up Red Hat Enterprise Linux, patched and repackaged 37 of them, and then added another 253 packages that the Tri-Lab systems require, building a TOSS platform with 4,598 packages. The software surface area is obviously greatly reduced, while still supporting various CPUs and GPUs for computing, various networks, various types of middleware abstractions, and the Lustre parallel file system.
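The package arithmetic quoted above can be checked in a few lines. This is only a sketch: the variable names are ours, it treats "more than 40,000" as a lower bound of 40,000, and it assumes the 37 patched packages are counted among the 4,345 extracted from RHEL, which is the reading under which the quoted total adds up.

```python
# Package counts as quoted in the article.
rhel_total = 40_000  # "more than 40,000", so treated here as a lower bound
extracted = 4_345    # packages pulled from RHEL
patched = 37         # assumed to be a subset of the extracted packages
added = 253          # extra packages the Tri-Lab systems require

# Patched packages are already counted in `extracted`, so they add nothing.
toss_total = extracted + added

# Because rhel_total is a lower bound, this share is an upper bound.
share = toss_total / rhel_total

print(toss_total)       # 4598
print(f"{share:.1%}")   # 11.5%
```

On those assumptions, TOSS carries at most about 11.5 percent as many packages as a full RHEL distribution.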
What is also interesting about the TOSS platform is that it has a complementary development environment, called the Tri-Lab Compute Environment, that layers compilers, libraries, and related tools on top of it.
If three of the major HPC labs in the United States can create a Linux HPC variant and development toolset that provides consistency across architectures, enables application portability, and lowers the total cost of ownership of the commodity clusters that they use, what additional effect might a unified HPC stack, with all current vendors of compilers, libraries, middleware, and other participants, have on the HPC industry as a whole? Imagine a build system shared by the whole community that could kick out only the components needed for a particular set of HPC application use cases, limiting the security exposure of the entire stack in use. Imagine if math libraries and other algorithmic accelerations were more portable across architectures. (That’s a topic for another day.)
It’s good that each HPC compute engine or operating system vendor has its own complete and highly optimized stack. We applaud this, and for many customers in many cases this will be enough to design, develop and maintain HPC applications appropriately. But it most likely won’t be enough to support a diverse set of applications on a diverse set of hardware. Ultimately, what you want is a consistent framework across compiler and library vendors, one that allows any math library to be used with any compiler on a tunable Linux platform.
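One way to picture a build system that emits only what a given use case needs is as a dependency closure over a package catalog. The sketch below is purely illustrative and is not how TOSS or any real distribution is built: the catalog, the package names, and the `minimal_stack` function are all hypothetical.

```python
from collections import deque

# Hypothetical catalog: package -> direct dependencies (illustrative names only).
CATALOG = {
    "cfd-app": ["mpi", "math-lib"],
    "ml-app": ["python", "math-lib"],
    "mpi": ["network-driver"],
    "math-lib": ["fortran-runtime"],
    "python": [],
    "network-driver": [],
    "fortran-runtime": [],
    "desktop-gui": ["python"],
    "print-server": [],
}

def minimal_stack(use_cases, catalog):
    """Return every package reachable from the requested use cases."""
    needed = set()
    queue = deque(use_cases)
    while queue:
        pkg = queue.popleft()
        if pkg in needed:
            continue  # already handled via another dependency path
        if pkg not in catalog:
            raise KeyError(f"unknown package: {pkg}")
        needed.add(pkg)
        queue.extend(catalog[pkg])
    return needed

# Build a stack for a single (hypothetical) simulation workload.
stack = minimal_stack(["cfd-app"], CATALOG)
excluded = set(CATALOG) - stack

print(sorted(stack))     # packages that ship
print(sorted(excluded))  # packages left out, shrinking the exposed surface
```

Everything outside the closure is simply never installed, which is the sense in which a smaller stack limits security exposure.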
Sponsored by Intel.