Relief for the Solution Architect: Pushing Back on HPC Cluster Complexity with Warewulf and Apptainer – High-Performance Computing News Analysis

[SPONSORED CONTENT] You are, at heart and by training, a research scientist, financial analyst, or product design engineer doing multi-physics CAE. So how did you end up as a… system administrator? You started out as one thing and became something else entirely. You finished school and started working with some large HPC-class clusters. One day, there’s a system problem and you, poor soul, step forward and fix it. Someone – probably someone more senior – pays you a compliment: “Wow, that’s impressive. Man, I could never have figured that out myself…,” that sort of thing.

Word gets around, and before long you’re the go-to person when something goes wrong with the cluster, which happens often enough. Soon, there you are, sitting in front of a bank of monitors controlling the system while everyone else is doing science, balancing hedge fund portfolios or simulating cool new product designs. And you may ask yourself, “Well, how did I get here?”**

Organizations that rely on clusters – whether they are 100 nodes or 1,000 – would be nowhere without system administrators, aka solution architects. It’s head-splitting, arduous work that lacks the glamor of the work actually done on the clusters. But everyone from the CEO on down knows that without good solution architects their organizations would grind to a halt.

And there aren’t nearly enough of them. Clusters are bigger, more complicated, more powerful and more heterogeneous than ever, and they become harder to manage as they take on bigger, more complex jobs.

“You don’t start out thinking, ‘I’m going to get into cluster management,’” said Glen Otero, Ph.D., Director of Scientific Computing, Genomics AI and Machine Learning at CIQ, a technology company with expertise in HPC-class clusters. “You start out as someone who is going to do something huge in science. But you end up in this space because – we joke about this – you volunteered to set up the system. And then when you did, it’s like, ‘Hey, can you do this too? Can you do that too?’ And then you wake up one day and you say, ‘Where did my life go? I had research to do.’”

CIQ at SC22

For as long as clusters have existed, cluster provisioning and management have required solutions that smooth and automate – or at least partially automate – those processes. Three prominent open source projects have taken on cluster complexity, and all three are the brainchild of Greg Kurtzer, the founder and CEO of CIQ. The three projects are:

– Rocky Linux, an operating system based on the CentOS Linux distribution, which Kurtzer started and for which Red Hat withdrew support in December 2020 (see related story in HPC). It is widely adopted by organizations that build large, complex, HPC-class clusters.

– Warewulf, a cluster provisioning solution developed by Kurtzer starting in 2001 when he ran Linux clusters at Lawrence Berkeley National Laboratory for the Department of Energy.

– Apptainer, also created at Berkeley Lab by Kurtzer, is a secure, efficient application container system that started life as “Singularity”, an HPC-tailored answer to Docker.

Kurtzer started CIQ to provide support, services, tools and other value additions for Rocky Linux, Warewulf and Apptainer, and the company is a driving force behind the open source communities contributing to the three projects. CIQ provides traditional HPC-related solutions and support, and it is behind HPC-2.0, a computing paradigm leading the way to cloud-native, hybrid, federated computing (to be discussed in a later article on this site).

Greg Kurtzer

“Building and running clusters is hard, there’s no getting around that,” Brock Taylor, CIQ’s vice president of high performance computing and strategic partners, told us. “A cluster has thousands of components when you add up all the hardware and software, and the operating system alone has a lot going on in it. It takes a lot of effort to get there, a lot of expertise.”

When Beowulf clusters began in the early 1990s, provisioning was script-based, hands-on and build-it-yourself. Open source tools such as OSCAR, Rocks and Warewulf soon became available.

“So you have these provisioning systems that help make it easier to deploy clusters,” Taylor said, “but over time, the complexity keeps increasing. It’s like entropy, right? With clusters, it never gets simpler, it gets harder. The complexity always trumps the solution.”

Commercial products have also come to market, such as those from Platform Computing, whose offering was based largely on Rocks and which was later acquired by IBM, and from Bright Computing, which NVIDIA added to its enterprise stack last January.

But for advocates of the open source movement, there is value in Warewulf and Apptainer remaining community-supported and vendor neutral. That said, they are not panaceas – cluster entropy always remains, and there is the problem of not enough system architects to meet the demand, especially those who can successfully wade into the HPC cluster alligator pit.

“This is a big problem in HPC,” Taylor said. “Finding people who can stay on top of all the technology, and retaining them, it’s a shrinking pool. And as they gain more expertise in managing HPC systems, their price can go up and they have a lot of opportunities to go elsewhere.”

Warewulf helps with cluster management in part by simplifying the addition of new cluster nodes through the use of “images,” which, as Taylor said, “is where all the magic happens.” An image contains a complete software stack, a “golden snapshot” of the software that drives a node’s compute, memory, network and everything else. Images allow new nodes to be added as exact copies of the nodes they will work with, making sure all the “piping and wiring is connected correctly and consistently,” Taylor said, “which is pretty hard work.”

In shops that run Rocky Linux, Warewulf and Apptainer, Warewulf images are delivered as containers to spawn compute nodes on the cluster. These images can also contain variations on existing cluster nodes – for example, a node with GPUs and CPUs, while the other nodes are CPU-only – and the resulting nodes can still function as part of the cluster.

Jonathan Anderson, CIQ’s chief HPC solution architect, explains why Apptainer and Warewulf make a powerful combination.

“Apptainer brings scientific computing end users into the container ecosystem, giving them full control over the operating system their applications run on,” he said. “Warewulf 4 brings cluster administrators into that same container ecosystem by basing compute node images on standard operating system containers. Bringing both users and administrators together in the same ecosystem allows them to better collaborate and build on each other’s work.”

This is where CIQ can play an invaluable role at HPC shops. The company has expertise not only at the fundamental level of the operating system but also with Warewulf and Apptainer.

“Warewulf helps you keep your computing software consistent while all your individual users run different ‘snowflake’ applications in containers,” Otero said. Combined into an integrated whole, the three (Rocky, Apptainer, Warewulf) mean that organizations can build and expand clusters at scale, quickly and in a lightweight way.

“Applications run in containers, and because those are platform-independent – because everything is wrapped in a container – it allows the administrator to manage these nodes as all the same,” said Otero. “Snowflake applications come up, some nodes have GPUs in them, for example, and the administrator might want to use Warewulf to create a slightly different Linux image that will run on those nodes. Warewulf allows them to push that container out to the node with the GPUs, and then Warewulf can just as easily reinstall that node back to its previous state.”
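
The swap-and-revert workflow Otero describes can be pictured with a toy model. The sketch below is plain Python, not Warewulf's actual interface: the `Cluster` class, the node names and the image names are all hypothetical. It only illustrates the idea that each node points at a named image and can later be pointed back at the one it had before.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model (hypothetical): each node boots whichever named image it is assigned."""
    default_image: str
    assignments: dict = field(default_factory=dict)  # node name -> current image name
    history: dict = field(default_factory=dict)      # node name -> list of earlier images

    def add_node(self, node: str) -> None:
        # New nodes start from the shared "golden" image, so they match their peers.
        self.assignments[node] = self.default_image
        self.history[node] = []

    def set_image(self, node: str, image: str) -> None:
        # Remember the current image before switching, so the change can be undone.
        self.history[node].append(self.assignments[node])
        self.assignments[node] = image

    def revert(self, node: str) -> None:
        # Restore the most recent earlier image; do nothing if there is none.
        if self.history[node]:
            self.assignments[node] = self.history[node].pop()


cluster = Cluster(default_image="rocky-base")
for name in ("n01", "n02", "gpu01"):
    cluster.add_node(name)

cluster.set_image("gpu01", "rocky-gpu")  # GPU node gets a slightly different image
print(cluster.assignments)
cluster.revert("gpu01")                  # and can be put back just as easily
print(cluster.assignments)
```

In this model the CPU-only nodes never change, which is the point of the quote: the administrator manages one shared image plus a small number of named variations, rather than hand-tuning individual machines.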

Node flexibility, scalability, easier provisioning and expansion of clusters, a lighter system management workload – all this comes within the grasp of organizations that rely on HPC clusters to perform their work.

And who knows, maybe some of those researchers, analysts, and designers who morphed into system administrators can spend more time doing what they were always meant to do in the first place.

** Talking Heads, “Once in a Lifetime”

