
HPC Days 2024 – Abstracts

Tuesday

Session 13:00-15:00

A tale as old as time: the challenges of porting performant code to new hardware

Iain Stenson

In the last 5-10 years, there has been an explosion in demand for dedicated machine learning hardware. As the incumbent, with as much as 70% of the market share, NVIDIA’s fortunes have soared in this time. However, they are not the only horse in the race, with AMD, Apple and Intel all offering GPUs with deep learning functionality and the major deep learning frameworks are keen to support them. With huge investments of time and effort from the hardware manufacturers and the framework developers, are we in a halcyon age of write-once-run-anywhere machine learning models or do the historic trade-offs of performance and portability still apply?
As a case study, we will focus on a recent effort at the Alan Turing Institute to take neural networks originally developed on cloud-hosted NVIDIA hardware and run them on an HPC system equipped with Intel GPUs. Despite the neural networks being written with PyTorch and huggingface/accelerate, which are large, mature frameworks with comprehensive documentation that promise convenience and usability, we encountered many hurdles. The problems ranged from the depressingly familiar to the new and unusual, requiring a range of debugging techniques, code adjustments and a good dose of trial and error.
This experience raises questions for users and makers of machine learning libraries and of computing systems: will porting high-performance code always be this difficult, is there anything we can do to improve matters and, ultimately, are the performance gains worth the effort?
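
A recurring, mundane part of such porting work is simply selecting the right device backend at runtime rather than hard-coding CUDA. A minimal sketch of a device-agnostic helper is shown below; it assumes a PyTorch build with Intel XPU support (e.g. via intel_extension_for_pytorch or a recent PyTorch release) and is illustrative rather than the code used in the case study.

    import torch

    def pick_device() -> torch.device:
        # Prefer NVIDIA/AMD GPUs exposed through the CUDA/ROCm backend.
        if torch.cuda.is_available():
            return torch.device("cuda")
        # Intel GPUs appear as the "xpu" backend when XPU support is present.
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            return torch.device("xpu")
        return torch.device("cpu")

    device = pick_device()
    model = torch.nn.Linear(16, 4).to(device)
    x = torch.randn(8, 16, device=device)
    print(device, model(x).shape)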

Leveraging high-performance computing for bioimage analysis

Anita Karsa, Matt Archer, Jerome Boulanger and Leila Muresan

This work aims to illustrate the impact of high-performance computing on bioimaging and bioimage analysis. The focus is on a data-rich technique, lightsheet imaging (also called Selective Plane Illumination Microscopy or SPIM). Lightsheet microscopy has experienced a boom in the last decade, being designated Method of the Year by Nature Methods in 2014 [1]. Due to its advantages, such as fast imaging of large volumes and low phototoxicity, the role of the technique in, for example, developmental biology or fast calcium and membrane-potential indicator imaging cannot be overstated. However, the factor blocking lightsheet microscopy from reaching its full potential is a computational one: typical datasets consist of time sequences of multi-tile, multi-angle, multi-colour 3D data stacks totalling terabytes of data that need complex processing.
The processing steps we focus on can be categorised into pre-processing (denoising, deskewing, destriping, registration, stitching, deconvolution) and downstream analysis. We discuss how HPC enabled the design of a deconvolution algorithm, based on a new spatially varying image formation model, that would have been computationally prohibitive even on high-end workstations [2]. Subsequent image analysis tasks are typically segmentation and tracking, followed by aggregating results across experiments (e.g. atlas creation). We exemplify our pipeline on a mouse embryo test case trained on the Exascale Data Testbed (Cambridge Data Accelerator).

  1. Method of the Year 2014. Nat Methods 12, 1 (2015). https://doi.org/10.1038/nmeth.3251
  2. Toader, B., Boulanger, J., Korolev, Y. et al. Image Reconstruction in Light-Sheet Microscopy: Spatially Varying Deconvolution and Mixed Noise. J Math Imaging Vis 64, 968–992 (2022). https://doi.org/10.1007/s10851-022-01100-3
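
For orientation, the classical shift-invariant Richardson–Lucy deconvolution that such pipelines generalise fits in a few lines; the NumPy/SciPy sketch below assumes a known, spatially invariant PSF, unlike the spatially varying model of [2], and is purely illustrative.

    import numpy as np
    from scipy.signal import fftconvolve

    def richardson_lucy(image, psf, n_iter=30, eps=1e-12):
        """Classical Richardson-Lucy deconvolution with a shift-invariant PSF."""
        estimate = np.full_like(image, image.mean())
        psf_flipped = psf[::-1, ::-1]          # adjoint of the blur operator
        for _ in range(n_iter):
            blurred = fftconvolve(estimate, psf, mode="same")
            ratio = image / (blurred + eps)
            estimate *= fftconvolve(ratio, psf_flipped, mode="same")
        return estimate

    # Toy example: blur a random image and recover it.
    rng = np.random.default_rng(0)
    truth = rng.random((128, 128))
    psf = np.outer(np.hanning(9), np.hanning(9))
    psf /= psf.sum()
    observed = fftconvolve(truth, psf, mode="same")
    restored = richardson_lucy(observed, psf)
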
Stronger Scaling Plane-Wave Density Functional Theory

Matthew Smith

Plane-wave density functional theory (DFT) codes (i.e. those using a Fourier basis) consistently account for around a third of the total software usage on the U.K.’s Tier 1 HPC (ARCHER2). Such codes typically decompose their Fourier domain by distributing sticks of plane-waves (i.e. Fourier wave-vectors) over MPI processes.
3D Fourier transforms are then typically performed as sets of FFTs of lower dimensionality; the inherently quadratic-scaling communications commonly determine the strong scaling limit of large simulations.
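
As a point of reference for readers less familiar with this decomposition, the transpose-based pattern behind distributed 3D FFTs can be sketched with mpi4py and NumPy. The sketch below uses a simple slab decomposition rather than CASTEP's stick-based scheme, and the grid size is chosen divisible by the process count purely for brevity; the all-to-all exchange is the communication step whose cost limits strong scaling.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    P, rank = comm.Get_size(), comm.Get_rank()
    N = 8 * P                      # global grid, divisible by P for simplicity

    # Slab decomposition: each rank owns N/P planes along x (shape N/P x N x N).
    rng = np.random.default_rng(rank)
    local = rng.random((N // P, N, N)).astype(np.complex128)

    # Step 1: FFT over the two locally complete axes (y and z).
    local = np.fft.fftn(local, axes=(1, 2))

    # Step 2: global transpose via all-to-all so that x becomes local.
    send = np.ascontiguousarray(
        local.reshape(N // P, P, N // P, N).transpose(1, 0, 2, 3))
    recv = np.empty_like(send)
    comm.Alltoall(send, recv)                 # the costly data exchange
    transposed = recv.reshape(N, N // P, N)   # x now local, y distributed

    # Step 3: 1D FFTs along the newly local x axis complete the 3D transform.
    result = np.fft.fft(transposed, axis=0)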

We present our development of a domain decomposition, implemented in CASTEP (www.castep.org), which, to our knowledge, constitutes a novel solution in the DFT community. We describe how we approach the decomposition and load-balancing as a coupled two-dimensional optimisation problem. We explain how a logical 2D process grid effectively achieves linear-scaling communications, and how the consequent constraints on load-balancing are ameliorated by adopting and adapting several task-scheduling algorithms.

We illustrate how the development performs in real-world simulations, dramatically reducing communication costs and significantly increasing the strong scaling limit. We demonstrate efficient scaling on ARCHER2 to 4x as many processes as previously possible, with 6x speed-ups regularly achieved.

Accelerating Fortran Codes: Merging Intel Coarray Fortran with CUDA and OpenMP

James McKevitt and Eduard Vorobyov

Fortran’s prominence in scientific computing requires strategies to ensure both that legacy codes are efficient on high-performance computing systems, and that the language remains attractive for the development of new high-performance codes. We demonstrate a novel and robust integration of Intel Coarray Fortran (CAF) — for distributed memory parallelism with familiar syntax — with Nvidia CUDA Fortran for GPU acceleration and OpenMP (OMP) for shared memory parallelism. We applied this to a nested grid potential solver [1], used in a protoplanetary disk code [2], showcasing significant performance improvements, comparable to those achieved with the Message Passing Interface (MPI) but while retaining Fortran syntax [3].

We consider three main aspects: how to best manage pageable and pinned memory to speed up transfers between CPU and GPU memory; how to optimise CPU-GPU affinity, considering the VSC’s NUMA architecture; and how C-pointers and C-bound subroutines can robustly interface the two compilers. We also discuss the limitations of our approach, and compare its performance with MPI through weak and strong scaling tests.

[1] Vorobyov E.I., McKevitt J.E., Kulikov I. and Elbakran V., A&A 671, A18 (2023).
[2] Vorobyov E.I., Kulikov I., Elbakran V. and McKevitt J.E., A&A, Accepted.
[3] McKevitt, J.E., Vorobyov E.I., and Kulikov I., J. of Parallel and Distributed Computing (Submitted).

Multigrid for ExaHyPE

Sean Baccas, Alexander Belozerov, Eike Mueller, Dmitry Nikolaenko and Tobias Weinzierl

The goal of this ExCALIBUR project is the development and implementation of multigrid solvers to extend the successful ExaHyPE solver engine to a wider class of computational problems. While ExaHyPE was originally designed for hyperbolic problems, many applications in science and engineering require the solution of elliptic problems: these arise, for example, from constraint equations in computational astrophysics or from semi-implicit time-stepping methods in fluid dynamics. This development will allow hyperbolic and elliptic solvers to be managed in a uniform computational framework naturally designed to operate on hierarchical meshes and HPC infrastructure, leveraging the advantages of the multigrid approach.

To make optimal use of modern computer hardware, we develop novel high-order discretisation methods in space and time. Mathematically, the goal is to use the hybridisable Discontinuous Galerkin (HDG) method to extend the ADER-DG space-time discretisation to problems with elliptic constraints. Multigrid is the elliptic solver of choice since it is the only method that scales algorithmically (computational cost grows linearly with the number of unknowns) to large problems on exascale machines. Over the last year, we have explored different variants of the DG method for elliptic problems and integrated the corresponding solvers into the Peano framework. For this we designed a suitable abstraction that separates the mathematical details of the (local) discretisation from the (global) matrix assembly and solution of the resulting large sparse linear system. Since both the discretisation and the solver algorithm are highly sophisticated, this approach is crucial to guarantee the productivity of mathematicians and software engineers in interdisciplinary projects like this.
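
To make the "linear cost in the number of unknowns" claim concrete, a minimal geometric multigrid V-cycle for a 1D Poisson problem is sketched below (weighted Jacobi smoothing, full-weighting restriction, linear interpolation). This is a generic textbook illustration and deliberately unrelated to the HDG/ADER-DG machinery of the project itself.

    import numpy as np

    def apply_A(u, h):
        """Matrix-free 1D Poisson operator -u'' with zero Dirichlet boundaries."""
        Au = 2.0 * u
        Au[1:] -= u[:-1]
        Au[:-1] -= u[1:]
        return Au / h**2

    def jacobi(u, f, h, sweeps=3, omega=2.0 / 3.0):
        for _ in range(sweeps):
            u = u + omega * (h**2 / 2.0) * (f - apply_A(u, h))
        return u

    def restrict(r):                      # full weighting, fine -> coarse
        return 0.25 * (r[0:-2:2] + 2.0 * r[1:-1:2] + r[2::2])

    def prolong(e, n_fine):               # linear interpolation, coarse -> fine
        ef = np.zeros(n_fine)
        ef[1:-1:2] = e
        ef[0:-2:2] += 0.5 * e
        ef[2::2] += 0.5 * e
        return ef

    def v_cycle(u, f, h):
        if u.size <= 3:                   # coarsest grid: smooth heavily
            return jacobi(u, f, h, sweeps=50)
        u = jacobi(u, f, h)               # pre-smoothing
        r_coarse = restrict(f - apply_A(u, h))
        e_coarse = v_cycle(np.zeros_like(r_coarse), r_coarse, 2.0 * h)
        u = u + prolong(e_coarse, u.size) # coarse-grid correction
        return jacobi(u, f, h)            # post-smoothing

    n = 2**7 - 1                          # interior points, h = 1/(n+1)
    h = 1.0 / (n + 1)
    x = np.linspace(h, 1.0 - h, n)
    f = np.pi**2 * np.sin(np.pi * x)      # exact solution is sin(pi x)
    u = np.zeros(n)
    for _ in range(10):
        u = v_cycle(u, f, h)
    print("max error:", np.abs(u - np.sin(np.pi * x)).max())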

We aim at optimal performance on exascale architectures by using bespoke implementations based on matrix-free methods, exploring efficient matrix storage formats and using reduced precision arithmetic.


Session 16:30-18:30

SiMLInt: Simulation and Machine Learning Integration

Anna Roubickova, Elena Breitmoser, Amrey Krause, Dave McKay, Moritz Linkmann and Jacob Page

Artificial intelligence (AI), and Machine Learning (ML) in particular, has proven effective at tackling a wide range of even very complex problems, leading an increasing number of computational scientists to embed AI/ML into their workflows. The resulting group of AI/ML users in the context of HPC simulations is incredibly diverse in terms of background as well as use cases, giving rise to a range of approaches to the embedding of AI/ML itself. Designing and executing these mixed HPC+AI workloads efficiently and without creating additional bottlenecks requires a broad skillset, including expert knowledge of the domain and of the AI modules used, as well as system and software engineering.
Funded by the ExCALIBUR programme (https://excalibur.ac.uk), the SiMLInt (https://epcced.github.io/SiMLInt/) team has been investigating ways to build pipelines for the mixed HPC+ML workloads.
The work is inspired by the successful application of ML to computational fluid dynamics by Kochkov et al. [2021]; however, in their case both the numerical solver and the ML model are implemented in JAX and deployed together. We summarise our experiences with state-of-the-art technologies that support the deployment of truly mixed HPC and AI workloads, where the different parts of the ML-supported HPC simulation are implemented using different tools (and programming languages) commonly used in the different communities.

We will demonstrate the practical steps needed to modify an existing HPC simulation so that its run is orchestrated together with the deployment of an ML engine, and to facilitate the communication between the two, as this approach is more likely to be useful for embedding ML into current simulation codes. We will also discuss the overall feasibility of such mixed pipelines. We will present data on the orchestration and communication overheads among the different tools, as well as the time and resources needed to generate suitable data and to train the ML model. We will compare these with the speed-up facilitated by the ML embedding, consider its implications for the workflow's uncertainty and reliability, and outline questions and challenges that remain open.
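
The coupling pattern at the heart of these workloads is easy to state even when deploying it across separate tools is not: a cheap numerical solver advances the state, and a trained ML model supplies a correction at each step, in the spirit of Kochkov et al. A schematic, single-process NumPy version is sketched below; in SiMLInt the solver and the ML engine are separate programs, so the correction call stands in for inter-process communication, and the stub model is purely a placeholder.

    import numpy as np

    def coarse_step(u, dt=1e-3, nu=0.1):
        """One explicit diffusion step on a periodic 1D grid (stand-in solver)."""
        lap = np.roll(u, -1) - 2.0 * u + np.roll(u, 1)
        return u + dt * nu * lap

    def ml_correction(u):
        """Placeholder for a trained model: in a real pipeline this would be an
        inference call to a separately deployed ML engine."""
        return np.zeros_like(u)

    u = np.sin(2.0 * np.pi * np.linspace(0.0, 1.0, 256, endpoint=False))
    for step in range(1000):
        u = coarse_step(u)           # cheap, low-resolution numerical update
        u = u + ml_correction(u)     # learned correction applied each step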

D. Kochkov, J. A. Smith, A. Alieva, Q. Wang, M. P. Brenner and S. Hoyer: Machine learning–accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 2021. https://doi.org/10.1073/pnas.2101784118

AI accelerated computational imaging at exascale with uncertainty quantification

Kevin Mulder

The objective of computational imaging is the reconstruction of images from data measured by an observational instrument. Whilst this naturally covers a large variety of applications, the leading edge is driven by advances in both instrumentation and exascale computing, resulting in an environment of novel and large-scale datasets. At these scales, traditional computational imaging techniques fail to produce interpretable, high-fidelity and timely image reconstructions, so novel methodological approaches are required. Within the Learned EXascale Computational Imaging (LEXCI) project, the methods, algorithms and software implementations which comprise these approaches are being developed. Leveraging knowledge of the physical instrument model and advances in machine learning, more effective image priors can be learned, resulting in improved reconstruction accuracy, generalizability and uncertainty quantification, whilst simultaneously supporting parallelized and distributed implementations in professional research software that can make full use of modern HPC architectures at reasonable computational, memory and storage cost.

An example of one such approach is the QuantifAI method, which provides uncertainty quantification in radio-interferometric image reconstruction with learned (data-driven) priors for high-dimensional settings, such as next-generation radio telescopes, e.g. the Square Kilometre Array (SKA). It achieves this by exploiting a learned convex data-driven prior, which allows information to be obtained from the posterior without the need for costly and poorly scaling MCMC sampling techniques. Instead, convex optimization is employed to compute the Maximum-A-Posteriori (MAP) estimate and, by extension, MAP-based uncertainty quantification. Simulated radio-interferometric images reconstructed using this method are shown to possess improved image quality and more meaningful uncertainty quantification.
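
In outline, MAP-based imaging of this kind amounts to minimising a data-fidelity term plus a prior by first-order optimisation. The toy sketch below uses plain gradient descent with a random linear measurement operator and a simple quadratic prior; QuantifAI itself substitutes the radio-interferometric measurement operator and a learned convex prior, so this is only meant to show the shape of the computation.

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 64, 48
    A = rng.standard_normal((m, n)) / np.sqrt(m)   # toy measurement operator
    x_true = np.clip(rng.standard_normal(n), 0, None)
    y = A @ x_true + 0.01 * rng.standard_normal(m)

    lam = 0.1                                      # prior weight
    step = 0.5 / np.linalg.norm(A, 2) ** 2         # safe gradient step size

    def grad(x):
        # Gradient of 0.5*||Ax - y||^2 + 0.5*lam*||x||^2 (quadratic prior).
        return A.T @ (A @ x - y) + lam * x

    x_map = np.zeros(n)
    for _ in range(500):
        x_map -= step * grad(x_map)                # converges to the MAP estimate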

Blueprinting AI for Science at Exascale (BASE-II)

The BASE-II project makes use of large benchmarks, unit tests and mini-apps, i.e. traditional HPC profiling methodology, in order to undertake and drive the following:

1. Optimisation of HPC, AI and ML workloads to deliver increased efficiency and scaling on HPC machines

2. The convergence of AI/ML and HPC applications and algorithms

3. Sharing hardware designed for HPC applications with AI/ML applications

This will be done by taking six or seven benchmarks supplied by the AI community, developed by the Scientific Machine Learning (SciML) Group based in Scientific Computing at STFC, and running them on the ExCALIBUR H&ES performance testbed at Cambridge. The profiling data for both the applications and the hardware response will be used both to optimise the source code and libraries and to devise more optimised system architecture designs. We will report on this work, in particular the profiling methodology, the performance testbed architecture and early results of our analysis.

Matching AI Research to HPC Resource through Benchmarking and Processes

David Llewellyn-Jones and Tomas Lazauskas

There are many good performance benchmarking tests for HPC systems, but in practice researchers find it challenging to judge how well performance on one system translates to performance on another.

This is especially true for AI workloads and model training, where hardware capabilities map across multiple axes, and where data size, data structure, data types, model size and hyperparameters all feed into a complex performance picture.

As a result of this, our experience at the Turing has been that researchers find it challenging to understand which systems are best suited for their needs and how to judge the resources they’re likely to require for their work.

Despite this, in a world where HPC platforms are in constant flux and undergoing rapid advancement, as a community we still often expect researchers to preemptively specify their research needs.

In this talk we’ll present our work providing researchers with a more solid basis for their HPC allocation requests, approached across multiple dimensions, from providing first-hand information and advice to offering trial access.

As part of this work we’ve benchmarked multiple HPC systems using an open source transformer training model that reflects the workload of our researchers. We’ll present our results and experiences across a range of Tier 2 (regional/specialist hubs) and Tier 3 (local/institutional) systems available in the UK. We’ll also talk about how we share this information with our users, guide their applications and attempt to provide a friction-free on-ramp for access to HPC systems.
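
The measurements behind such comparisons can be as simple as timing a representative training step on each system. The sketch below is a minimal, single-device, synthetic-data illustration of that shape of test; it is not the open-source model used in our benchmarking work.

    import time
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
    model = torch.nn.TransformerEncoder(layer, num_layers=6).to(device)
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
    x = torch.randn(32, 128, 512, device=device)   # batch, sequence, features

    def train_step():
        opt.zero_grad()
        loss = model(x).pow(2).mean()              # dummy objective
        loss.backward()
        opt.step()

    for _ in range(3):                             # warm-up iterations
        train_step()
    if device == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(20):
        train_step()
    if device == "cuda":
        torch.cuda.synchronize()
    print("tokens/s:", 20 * 32 * 128 / (time.perf_counter() - t0))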

Model calibration using ML and ML-derived code simplification of computationally expensive algorithms

Matthieu Schaller

Over the last 2 years, we have run a series of cosmological simulations on the DiRAC facility at Durham, including the largest (by resolution elements) run ever performed. Besides the large HPC endeavours, one of the key developments that led to this generation of simulations is the use of ML techniques both in the design and in the analysis of the runs. In this talk, I will describe the use of ML models to calibrate out the nuisance parameters (our imprecise understanding of galaxy evolution) blurring the interpretation of cosmological data, which we then use as input to the HPC calculations. I’ll also present the AI-based analysis of these simulations, working at the field level when comparing results to data without the need for summary statistics. I will conclude by highlighting how, in our workflow, AI techniques are already a key component surrounding the classical numerical approaches used on large HPC systems. If time permits, I’ll also show some results where key costly numerical kernels are replaced by neural network interpolators as a way to reduce the overall time-to-solution of the simulations themselves.


Wednesday

Session 9:00-9:45

Session 10:00-12:00

Integrating a High Performance Burst Buffer Filesystem with the Slurm Resource Scheduler

Bob Cregan, Yiannos Stathopoulos and Joshua Reed (Cambridge)

The introduction of high-performance solid-state storage technologies has weakened the link between the capacity of a storage system and its performance. It is no longer necessary to have a large number of high-capacity spinning disk devices to provide the performance needed to service large-scale parallel applications.
However, these small-capacity, high-performance filesystems may not be suitable for the permanent storage of research data, and may therefore only be used during the computation phase of a research workflow. User-level management of data transfer between permanent storage and the burst buffer increases wait times before active computation starts, introduces issues with managing high-performance storage capacity, and may complicate management for certain categories of research data.
We use the Lua burst buffer plugin present in the Slurm resource scheduler to automate data flows into and out of a high-performance solid-state parallel filesystem. The same mechanism is used to control data management on this filesystem.
This produces a highly managed high performance namespace that is tightly coupled to the resource scheduler, where data is only transferred and held when it is needed for active computation and is removed when this is not the case.

Blending Machine Learning and Numerical Simulation, with Applications to Climate Modelling

Jack Atkinson

The rise of machine learning (ML) has seen many scientists seeking to incorporate these techniques into numerical models. Doing so presents a number of challenges, however. The ICCS has explored this problem in the context of coupling ML components/parameterisations into climate models. In this talk I will give an overview of some challenges in this area, how ICCS has tackled them, and what has been learnt in the process. I will present FTorch, a library developed by ICCS to bridge the gap between Fortran (in which many large physics models are written) and PyTorch (in which much ML is performed) and lower the technical barrier to scientists seeking to leverage ML in their work. We will reflect on the process of creating hybrid models and a framework to aid in the process of coupling neural nets to large models following software design principles. Finally we will discuss ongoing work using the Community Earth System Model (CESM) to re-deploy a neural net trained using a high-resolution model with a different grid and variables into a new setting.
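
On the Python side, the main step such coupling requires is saving the trained network in a form that libtorch-based libraries can load from Fortran. A minimal sketch of that export step is shown below, assuming the TorchScript route used by FTorch; the tiny network is only a stand-in for a real parameterisation.

    import torch

    class Parameterisation(torch.nn.Module):
        """Toy stand-in for an ML parameterisation of sub-grid physics."""
        def __init__(self):
            super().__init__()
            self.net = torch.nn.Sequential(
                torch.nn.Linear(10, 64), torch.nn.Tanh(), torch.nn.Linear(64, 10))

        def forward(self, x):
            return self.net(x)

    model = Parameterisation().eval()
    example = torch.randn(1, 10)
    scripted = torch.jit.trace(model, example)   # TorchScript via tracing
    scripted.save("parameterisation.pt")         # file later loaded from Fortran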

Computational Storage for Scientific Workflows

Grenvile Lister, Bryan Lawrence, Jean-Thomas Acquaviva, Konstantinos Chasapis, Mark Goddard, Scott Davidson, David Hassell, Valeriu Predoi, Matt Pryor and Stig Telfer

For over a decade, data movement has been known to be more costly, in time and energy, than the actual computation involved in scientific analysis of that data. The research community has proposed multiple ways to mitigate the cost of data movement by bringing the processing to the data. Here we describe a practical deployment of “active storage”, i.e. remote storage which is capable of carrying out computational tasks, thereby removing the need to copy the data to the client so that it can do those same tasks. This concept has a long history, but previous attempts have not managed to address the problem of how the client can use active storage without changing to a complicated and bespoke workflow. We present a new Dask-based API that integrates active storage functionality into cf-python, a domain-specific analysis language for weather and climate. With this new API, the user does not need to make any changes to their code in order to make use of active storage: if the data they need to operate on is stored in an active-storage-enabled environment then active storage will be used, otherwise not. The new API demonstrates that active storage could be brought more easily to the user than before, but at this stage it is a proof of concept, since only a limited set of operations is possible: calculating the global maximum, minimum, sum and mean (the last two unweighted, and then only if no other operations have been applied by the client beforehand). However, a wider range of useful tasks is within reach now that we have a framework in which they can be implemented.
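
The reason these particular reductions are feasible is that they decompose into small per-chunk partial results that the storage layer can compute and the client can cheaply combine. A schematic of that decomposition for the mean, in pure NumPy with no active storage involved, is shown below.

    import numpy as np

    data = np.random.default_rng(2).random(10_000)
    chunks = np.array_split(data, 8)        # stand-ins for storage-side chunks

    # Each "storage node" returns only (sum, count) for its chunk...
    partials = [(chunk.sum(), chunk.size) for chunk in chunks]

    # ...and the client combines the tiny partial results.
    total, count = map(sum, zip(*partials))
    assert np.isclose(total / count, data.mean())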

Introducing mdb – a debugger for parallel MPI applications

Tom Meltzer (Cambridge)

The current landscape of Free and Open-Source (FOS) MPI debuggers is bleak. There are proprietary options available (e.g. Linaro DDT, TotalView), but these require paid-for licenses. As a research software engineer, I work with my collaborators on whichever systems they have access to — meaning that I frequently move from one HPC system to another. Some HPC service providers simply cannot afford the license costs and may provide one tool, the other, or none. As a result, relying on these tools can leave you in the dark when they are no longer available. Furthermore, each of these proprietary tools has a different interface and commands that need to be re-learnt when you move to a different system.

mdb is a new FOS debugger for MPI applications, supporting C, C++ and Fortran. It is essentially a wrapper around gdb (the GNU debugger), which is itself the de facto standard for serial program debugging. Due to the ubiquity of gdb, mdb allows users to re-use their existing knowledge of debugging commands to debug MPI code on pretty much any system they want to run on. mdb is a superset of gdb, providing all of the core functionality with some parallel-specific additions that help when debugging MPI applications.

GitHub link: https://github.com/TomMelt/mdb

The HPC+AI Cloud: flexible and performant infrastructure for HPC and AI workloads

Matt Pryor and John Garbutt

In recent years, in particular with the rise of AI, the diversity of workloads that need to be supported by research infrastructures has exploded. Many of these workloads take advantage of new technologies, such as Kubernetes, that need to be run alongside the traditional workhorse of the large batch cluster. Some require access to specialist hardware, such as GPUs or network accelerators. Others, such as Trusted Research Environments, have to be executed in a secure sandbox.

Here, we show how a flexible and dynamic research computing cloud infrastructure can be achieved, without sacrificing performance, using OpenStack. By having OpenStack manage the hardware, we get access to APIs for reconfiguring that hardware, allowing the deployment of platforms to be automated with full control over the levels of isolation. Optimisations like CPU-pinning and SR-IOV allow us to take advantage of the efficiency gains from virtualisation without sacrificing performance where it matters.

The HPC+AI Cloud becomes even more powerful when combined with Azimuth, an open-source self-service portal for HPC and AI workloads. Using the Azimuth interface, users can self-service from a curated set of optimised platforms from web desktops through to Kubernetes apps such as Jupyter notebooks. Those applications are accessed securely, with SSO, via the open-source Zenith application proxy. Self-service platforms provisioned via Azimuth can co-exist with large bare-metal batch clusters on the same OpenStack cloud, allowing users to use the environments and tools that best suit their workflow.

Delivering performance and programmer productivity on energy efficient hardware

Gabriel Rodríguez Canal and Nick Brown (EPCC, Edinburgh)

From FPGAs to CGRAs, there is a range of novel architectures for HPC that promise to deliver a step change in energy efficiency, which is crucially important as we look to move towards NetZero and decarbonise our workloads. Recent advances in hardware mean that technologies such as FPGAs are now capable of delivering high performance, and CGRA technologies such as AMD's AI engines are being integrated into latest-generation AMD CPUs as accelerators. However, a major challenge with all these technologies is that programming them is currently extremely time consuming and the preserve of a few experts, and optimised algorithms bear little resemblance to their original CPU counterparts. Put simply, for such technologies to become accepted by the HPC community, the barrier to entry must be lowered.

In this talk we will describe our work that leverages MLIR to undertake automatic compiler transformation and optimisation of code. Taking as input CPU-based HPC code, such as those written in Fortran or Python, our approach identifies and extracts algorithmic patterns which are then transformed into their target-architecture specific highly tuned counterpart. Leveraging the ExCALIBUR xDSL project as the underlying framework, we will describe the HLS MLIR dialect that we have developed and AMD’s AIE dialect that we have ported to xDSL, how these interoperate with LLVM frontend technologies such as Flang and AMD’s backend compilers, before demonstrating that our approach is able to deliver algorithms that perform an order of magnitude faster than the state of the art based upon unmodified user code.


Session 13:00-15:00

Establishing the Accessible Computational Regimes for Biomolecular Simulations at Exascale

Daniel Cole, Robert Welch, James Gebbie-Rayet and Sarah Harris

The ExaBioSim (https://excalibur.ac.uk/projects/exabiosim/) community project is part of the ExCALIBUR UK research programme, and aims to establish the accessible computational regimes for biomolecular simulations at exascale. Computer simulations are frequently the only tool capable of providing dynamical information essential to understanding how biomacromolecules function. As we move into the exascale, the size and complexity of what we can simulate will greatly increase. These new systems will utilise a wider array of methods, including ML, code coupling and enhanced sampling, and a more diverse range of hardware, including GPUs. Next-generation HPC systems must also be run differently, incorporating virtualisation and containerisation, requiring a new generation of DevOps engineer with a broader set of technical skills. To address these problems, we are creating a selection of showcase simulations that are representative of the community’s HPC needs (https://github.com/HECBioSim/hpcbench). We are quantifying the performance, scaling and energy consumption of common simulation software, and using this information to help HPC users configure their simulations optimally. We will explore the spatial and temporal limits of current parallel computing and see if these limits can be increased using multiscale modelling and code coupling. We will use a variety of methods across different length scales to investigate how exascale computing can help us model DNA, and we will work to integrate cryo-EM structures into our existing simulation tools. Finally, we will explore the use of next-generation HPC systems for ensemble computing techniques, such as free energy perturbation for drug discovery​.

In this talk, I will give an overview of the ExaBioSim community project, as well as demonstrating an example HPC use case, coupling active learning with database searching to identify potential inhibitors of the main protease of SARS-CoV-2, the virus responsible for the COVID-19 pandemic.

Building the HPC RSE Community

Andrew Turner, Marion Weinzierl, Ed Hone and Nick Brown

There is a continuing explosion in the use of HPC and AI for research across many disciplines and a commensurate increase in the demand for RSE expertise with an HPC focus. This demand has recently been turbo-charged in the UK by the Government investment in new national services for both HPC and AI. Within the UK there are a number of HPC-focussed community groups (e.g. HPC-SIG, ExCALIBUR, CoSeC, HPC channel on RSE Slack) as well as international community groups (e.g. SIGHPC, hpc.social, Cray User Group) but there is no specific, well-defined community for RSEs working in HPC, although some efforts towards this goal have been made recently at international conferences and in online spaces (e.g. hpc.social). This lack of an obvious community initiative leads to problems for the many new RSEs entering the world of HPC. In particular, it is difficult for new entrants into the HPC sphere to identify how they can access peer support, how they can build their contact network, what HPC events to potentially attend, and where to find the best sources of information on different aspects of HPC. In this presentation we will describe recent efforts to instantiate a HPC RSE community in the UK and beyond, what the community could potentially look like and how to get involved in building and shaping this community to benefit RSEs working in HPC.

Benchmarking for the exascale

Ilektra Christidi, Tuomas Koskela, Mose Giordano, Emily Dubrovska, Jamie Quinn, Tom Deakin, Kaan Olgu, Chris Maynard and David Case

Benchmarking has always been a crucial activity for HPC. Being able to systematically assess the performance of important algorithms on different HPC machines is needed for planning the next steps in algorithm development, evaluating available and upcoming hardware technologies, and procuring new HPC systems with confidence. The benchmarking process, though, has been something of a dark art, usually involving convoluted, machine-dependent scripts and configurations that only a few people can run and re-run, and whose results only they can reproduce. The situation is only becoming worse with the quest for exascale, as the available hardware technologies become more heterogeneous.

We will present how our project, part of the ExCALIBUR program, is set to address the problem of automation and reproducibility in benchmarking by creating an Open Source, user-friendly, automated framework based on Spack and ReFrame, for building and running benchmarks and collecting and visualising their results. This framework includes a still-growing suite of benchmark codes, configurations to run them on popular HPC systems, as well as a package and library for post-processing benchmarking results. Preliminary results of performance portability studies, as well as of new benchmarks for algorithm and implementation variations developed as part of this project, obtained using our framework, will also be presented.
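
For readers unfamiliar with ReFrame, a test in the framework is a small Python class that declares how to build and run a code and how to validate its output and extract performance figures. A minimal sketch in the style of the ReFrame tutorials is shown below (STREAM-like output and a recent ReFrame version assumed; real tests in our suite carry system- and environment-specific configuration).

    import reframe as rfm
    import reframe.utility.sanity as sn

    @rfm.simple_test
    class StreamTriadTest(rfm.RegressionTest):
        valid_systems = ['*']            # restricted to named systems in practice
        valid_prog_environs = ['*']
        sourcepath = 'stream.c'          # compiled by ReFrame's build stage

        @sanity_function
        def output_validates(self):
            return sn.assert_found(r'Solution Validates', self.stdout)

        @performance_function('MB/s')
        def triad_bandwidth(self):
            return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)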

As part of this project, we aim to create a community of practice for benchmarking, which spans across ExCALIBUR, DiRAC, and other HPC developers and providers in the UK. We are making benchmarking straightforward for all by advertising our work in various fora, mailing lists and events, running training workshops, and inviting and collaborating with application and benchmark developers to add their benchmarks to our suite.

Testing and Benchmarking Machine Learning Accelerators using ReFrame

Chris Rae, Joseph Lee and James Richings

With the rapid increase in machine learning workloads performed on HPC systems, it is beneficial to regularly perform machine-learning-specific benchmarks to monitor performance and identify issues. Furthermore, as part of the Edinburgh International Data Facility, EPCC currently hosts a wide range of machine learning accelerators, including the Cerebras CS-2, which are managed via Slurm and Kubernetes. We use the ReFrame framework to perform machine learning benchmarks, and we will discuss the results collected and the challenges involved in integrating ReFrame across multiple platforms and architectures.

Developing pathways and structures to support HPC training

Neil Chue Hong, Jeremy Cohen, Weronika Filinger, Martin Robinson and Steve Crouch

The ever-increasing demand for AI, data-intensive research processes and large-scale computation are putting a focus on the need for HPC skills. They’re also highlighting the extensive skills shortage that exists in this space within both the research and industrial environments.

One of the challenges in addressing this skills shortage is that developing specialist High Performance Computing skills takes a significant investment of time. Experts in this field often develop their skill sets in an ad hoc manner over a number of years, building expertise through on-the-job experience. If we are to address the lack of skills that we face right now, as well as supporting future generations of experts in building the necessary competencies to support and create the next stage of our AI-enabled world, we need better training structures and clear learning pathways.

The UNIVERSE-HPC project, funded as part of the ExCALIBUR programme, brings together the universities of Edinburgh, Southampton, Oxford and Imperial College London. The team are looking to better understand the training challenges in this space and working to develop pathways that are applicable to learners with a range of different existing skill levels. Alongside the pathways, we are capturing details of a variety of existing open-source training materials and filling in gaps by developing our own. These materials and the associated learning pathways are now accessible via a web-based tool developed by project team members at the University of Oxford.

In this talk we’ll provide an overview of the challenges that UNIVERSE-HPC is looking to address, highlight our learning framework development work and give an update on the extensive community activities being used to support this. We’ll also highlight our aims for the next stage of the project, including novel approaches to reviewing and developing new training materials through the use of community hackathons.


Session 16:30-18:30

Going from GRChombo to GRTeclyn: An exascale journey

Juliana Kwan (DAMTP, Cambridge)

The numerical solution of Einstein’s equations for General Relativity presented a significant advancement to the scientific community almost 20 years ago, allowing extreme gravitational events such as binary black hole mergers to be simulated for the first time. Unfortunately, these simulations continue to be incredibly expensive, thus limiting the number of scenarios that can be explored and predictions that can be made for observations of gravitational waves. In order to take advantage of exascale computing facilities as well as general improvements in performance, we found it necessary to port our existing numerical relativity code, GRChombo, to one based on the AMReX framework, which has been renamed GRTeclyn. AMReX offers impressive scaling and GPU offloading capabilities and supports parallelization via MPI, OpenMP, hybrid MPI/OpenMP or hybrid MPI/(CUDA, HIP or SYCL). GRTeclyn uses all of these methods as we optimize our code for heterogeneous systems such as the Dawn supercomputer and the ExCALIBUR Hardware and Enabling Software testbed (Swirles) located at Cambridge.

Additionally, while the flop rate has grown by a factor of ~80 over the past 10 years, the I/O performance has only grown by about a factor of 5, thus necessitating the use of on-the-fly analysis pipelines in order to reduce both storage requirements and I/O overheads. Our implementation of in-situ visualization in AMReX involves using the ParaView Catalyst and Conduit APIs to interpret AMReX AMR data structures as Conduit Blueprint nodes, which are then passed to ParaView for visualization using the ray-tracing toolkit OSPRay.

Initial GPU vs CPU performance statistics for GRTeclyn are promising, and I will be outlining the steps taken to refactor our code for GPU performance and also develop our in-situ visualization capabilities.

On the impact of MPI on the performance portability of heterogeneous parallel programming models

Wei-Chen Lin, Tom Deakin and Simon McIntosh-Smith (Bristol)

Most shared-memory heterogeneous programming models focus only on single-node performance, as scaling across nodes is beyond their design scope. However, a major concern when implementing large-scale HPC applications is scalability beyond a single node with the MPI+X model. As such, it is important to understand how each model's design constraints interact with cross-node communication concerns.

This study provides a comprehensive evaluation of shared-memory models in single-node and multi-node HPC environments. We select representative memory-bandwidth-bound HPC mini-apps that are ported to the following models: CUDA, HIP, OpenMP (target), Kokkos, SYCL, and C++ PSTL/StdPar. The mini-apps use several different memory access patterns while maintaining identical MPI communication schemes to facilitate comparison. We evaluate the benchmarks on a wide range of heterogeneous HPC systems, including the latest CPUs and GPUs from Intel, NVIDIA, AMD, and AWS. Finally, we analyse and discuss performance portability and productivity for each of the programming models, and how this interacts with MPI.

SYCL compute kernels for ExaHyPE

Chung Ming Loi, Heinrich Bockhorst and Tobias Weinzierl (Durham & Intel)

We discuss three SYCL realisations of a simple Finite Volume scheme over multiple Cartesian patches. The realisation flavours differ in the way they map the compute steps onto loops and tasks: we compare an implementation that exclusively uses a sequence of for-loops to a version that uses nested parallelism, and finally benchmark these against a version that models the calculations as a task graph. Our work proposes idioms to realise these flavours within SYCL. The results suggest that a mixture of classic task and data parallelism performs well if we map this hybrid onto a solely data-parallel SYCL implementation, taking into account SYCL specifics and the problem size.

Portable SYCL Math Libraries for pre- and post-exascale

Rod Burns (Codeplay)

With the recent emergence of numerous accelerators promising large-scale computational speed-ups over traditional CPU-based applications, heterogeneous computing has become of increasing interest to the scientific computing community. However, producing portable code without sacrificing the performance available from vendor-specific libraries remains a challenge. In this talk we will describe how the UXL Foundation projects based on the oneAPI specification can be used to target multi-vendor and multi-architecture accelerators from a single code base. We will talk about the GROMACS and NWChem projects that are benefitting from using the oneMKL library to target Intel, Nvidia and AMD GPUs, with discrete Fourier transforms as an example. The oneMKL interface library makes this possible with minimal overhead using SYCL backend interoperability.


Thursday

Session 13:00-15:00

Optimising SPH kernels via compiler-driven AoS-SoA conversions

Pawel Radtke and Tobias Weinzierl

Array-of-Structures (AoS) is often the preferred memory layout for HPC codes, as it aligns closely with the Object-Oriented nature of C++, simplifies data management, and enables more efficient data movement compared to Structure-of-Arrays (SoA). At the same time, compute kernels that process AoS data often achieve subpar performance compared to their SoA analogues. In this work we propose a C++ language extension based upon C++ attributes that allows for localised AoS-SoA data conversions: for and for-range loops are extended with prologue and epilogue passes that perform the transformations while minimising data movement. The transformation allows existing codes to continue using AoS global buffers, and benefit from more optimised compute kernels.

Our proposals are realised via a compiler-based approach by extending the Clang/LLVM compiler toolchain. In the presentation we demonstrate the capabilities of our extension, as well as show the performance impact on a set of smoothed-particle hydrodynamics (SPH) compute kernels available in the Peano framework.
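
The layout difference itself is easy to illustrate outside C++: the NumPy sketch below contrasts an array-of-structures (a structured dtype) with a structure-of-arrays holding the same particle data, which is the kind of transformation the proposed prologue/epilogue passes perform around a loop. It is a conceptual illustration only, not related to the Clang/LLVM implementation.

    import numpy as np

    n = 1_000_000
    # AoS: one record per particle, fields interleaved in memory.
    aos = np.zeros(n, dtype=[('x', 'f8'), ('y', 'f8'), ('z', 'f8'), ('m', 'f8')])
    rng = np.random.default_rng(0)
    for name in aos.dtype.names:
        aos[name] = rng.random(n)

    # SoA: one contiguous array per field (the "converted" layout).
    soa = {name: np.ascontiguousarray(aos[name]) for name in aos.dtype.names}

    # The same kernel (a kinetic-energy-like reduction) on both layouts.
    e_aos = 0.5 * (aos['m'] * (aos['x']**2 + aos['y']**2 + aos['z']**2)).sum()
    e_soa = 0.5 * (soa['m'] * (soa['x']**2 + soa['y']**2 + soa['z']**2)).sum()
    assert np.isclose(e_aos, e_soa)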

Towards obtaining a quantum advantage in practice

John Buckeridge, Omer Rathore, Rhonda Au Yeung and Viv Kendon

The QEVEC project assesses current and near-future quantum computing paradigms as potential disrupters in HPC, with an emphasis on scalability and how best to leverage quantum advantages in practice. Even in early stages of development, quantum computers may be used as co-processors to remove bottlenecks in existing and future exascale code [1]. We are developing quantum algorithms for exascale subroutines in computational fluid dynamics (CFD) and materials simulations. In our CFD work, we have repurposed the Harrow-Hassidim-Lloyd (HHL) algorithm, a powerful method for solving linear systems that offers exponential speed-up in theory, into a predictor-corrector rather than a solver, using it to complement classical simulations instead of supplanting them. This approach has immense relevance across a range of disciplines, including smoothed particle hydrodynamics, plasma simulations and chemistry calculations. We have also considered the use of quantum computing to address the load balancing challenge for massively parallel software, which is the problem of distributing work between processors. In our materials modelling work, we have shown that quantum annealing methods can be applied successfully to the configurational analysis of chemical structures [2]. We also perform verification, validation and uncertainty quantification [3]. A framework for harnessing these potential advantages in practice is presented, drawing parallels with current GPU/CPU synergies but with a QPU/CPU setup instead. Our work coordinates well with the CCP-QC network and the National Quantum Computing Centre.

References:

[1] R. Au-Yeung et al., Quantum algorithms for scientific applications (2023). arXiv:2312.14904

[2] B. Camino et al., Quantum computing and materials science: A practical guide to applying quantum annealing to the configurational analysis of materials. J. Appl. Phys. 133:221102 (2023). doi: 10.1063/5.0151346

[3] Th. Kapourniotis et al., Unifying Quantum Verification and Error-Detection: Theory and Tools for Optimisations (2022). arXiv:2206.00631

SWIFT 2: Keeping the Good, Discussing the Bad, Removing the Ugly

Mladen Ivkovic

In the past year, the SWIFT simulation code has been able to celebrate remarkable successes and cement itself as state-of-the-art simulation software in the astrophysical and cosmological landscape. In this talk, I will discuss some of the key aspects that led to SWIFT’s success as HPC software, but also some current caveats and weaknesses, and how we plan to address them using the Peano framework. More precisely, a core aspect of SWIFT’s success is the underlying task-based parallelism framework. The current implementation of that framework, however, is also a significant caveat with regard to future work: it is a) opaque and difficult to work with; b) deeply embedded into SWIFT; and c) restricted to CPU architectures. By replacing SWIFT’s engine with the Peano framework, we aim to address all of these issues while retaining the benefits SWIFT draws from the task-based parallelism paradigm.

Using HPC to model policy scenarios with agent-based models

Alison Heppenstall, Gary Polhill, Mike Batty, Doug Salt, Richard Milton, Hatt Hare and Ric Colsanti

Agent-based modelling is computer simulation that represents the dynamic interactions of heterogeneous individuals (people, households etc.) over space and time. Each agent can have specific data and simple rules determining its behaviour, or more sophisticated algorithms that implement cognitive architectures with memory, planning, reasoning and/or rules. Algorithms can have context-sensitive degrees of computational complexity. Interactions can be mediated through space, fixed social networks, or on an ad hoc basis (e.g. based on proximity); and scheduling can be synchronized or asynchronous. These are all features that make agent-based models interesting and useful for studying societal and socio-environmental complexity, but they also make them ‘awkward customers’ for HPC because their resource needs are unpredictable.
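
As a deliberately tiny illustration of those ingredients (heterogeneous agents, proximity-mediated interaction, synchronous scheduling), a NumPy random-walk model is sketched below; real models of the kind described here are, of course, far richer.

    import numpy as np

    rng = np.random.default_rng(0)
    n_agents, n_steps = 500, 100
    pos = rng.random((n_agents, 2))        # agents scattered over a unit square
    state = rng.integers(0, 2, n_agents)   # heterogeneous binary attribute

    for _ in range(n_steps):               # synchronous scheduling
        pos = (pos + 0.01 * rng.normal(size=pos.shape)) % 1.0  # movement rule
        # Proximity-mediated interaction: adopt state 1 if any neighbour
        # within radius 0.02 already has it.
        d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
        near_adopter = ((d < 0.02) & (state[None, :] == 1)).any(axis=1)
        state = np.where(near_adopter, 1, state)

    print("adopters:", int(state.sum()), "of", n_agents)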

Over the past 10-20 years, agent-based models have transitioned from being more typically theoretical studies based on ‘stylized facts’ and general principles, to being more typically fitted to empirical case studies — using GIS, census, questionnaire and other data to initialize synthetic populations of agents and see how they respond to scenarios. This kind of use case is most common in land use, planning, and epidemiology, but any scenario where there are ‘cascading consequences’ from trying to address ‘wicked problems’ is potentially relevant.

Much of this work is still done on personal computing devices rather than using HPC, which is a pity, as HPC can be used to give more saturated samples of model parameter spaces, upscale case studies from the local to the regional or national, simulate larger populations of agents, and implement more cognitively plausible decision-making algorithms. Finding ways to make HPC easier to access for social scientists would open HPC resources up to a wider range of applications and potential use cases, have knock-on benefits for those who cannot predict their resource needs, and increase the potential for HPC use to have societal impact.

Efficient Experimental Design using Multi-fidelity Bayesian Optimisation

Andrew Mole and Sylvain Laizet

High-fidelity simulations on HPC can be useful in capturing the physics involved when solving optimisation problems; however, they can be computationally expensive. When conducting an optimisation, an iterative approach is needed, which can lead to running many simulations across the parameter space. Bayesian optimisation (BO) presents a useful strategy for efficiently choosing the design of experiments whilst searching for an optimum solution. A multi-fidelity (MF) BO approach will be presented that allows a computationally cheaper, but less accurate, lower-fidelity model of the system to guide the optimisation procedure and reduce the number of accurate high-fidelity simulations required to find optimal solutions. The MF-BO is implemented by constructing an MF surrogate model using the nonlinear auto-regressive Gaussian process approach to capture the relationship between the model fidelities. An MF acquisition function is constructed to determine the configuration and fidelity of successive experiments that learn more information about the optimum solution.
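
For context, the single-fidelity loop that MF-BO extends looks as follows: fit a Gaussian process to the evaluations made so far, maximise an expected-improvement acquisition over candidate designs, and evaluate the chosen design. The sketch below uses scikit-learn on a toy one-dimensional objective; the multi-fidelity version additionally chooses which fidelity to query at each step.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def objective(x):                      # stand-in for an expensive simulation
        return np.sin(3.0 * x) + 0.5 * x**2

    rng = np.random.default_rng(0)
    X = rng.uniform(-2, 2, size=(4, 1))    # initial design of experiments
    y = objective(X).ravel()
    candidates = np.linspace(-2, 2, 400).reshape(-1, 1)

    for _ in range(15):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X, y)
        mu, sigma = gp.predict(candidates, return_std=True)
        best = y.min()                                        # minimisation
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
        x_next = candidates[np.argmax(ei)].reshape(1, 1)
        X = np.vstack([X, x_next])
        y = np.append(y, objective(x_next).ravel())

    print("best design:", X[np.argmin(y)].item(), "value:", y.min())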

This methodology is tested on the wake steering optimisation problem for power maximisation of wind farms. Low-fidelity approximations, in the form of analytical steady-state wake models using FLORIS, are used to guide high-fidelity large eddy simulations (LES) using the finite-difference-based XCompact3d. Discovering the optimum yaw configuration of a wind farm has typically been addressed using analytical wake models, whose computational efficiency allows many different configurations to be tested. By capturing the complexities of non-linear and unsteady fluid dynamics, LES delivers solutions closer to the true optimum. Using MF-BO allows improvements in wind farm power output to be found whilst using a limited number of LES evaluations, reducing the computational cost and power consumption.


Friday

Session 9:00-10:00

Alison Kennedy

Alison Kennedy served as Director of the Science and Technology Facilities Council (STFC) Hartree Centre, based at the UK’s STFC Daresbury Laboratory. The Hartree Centre is backed by over £170 million of government funding and significant strategic partnerships, with a remit to work with companies of all sizes to improve the global competitiveness of UK industry by enabling the adoption of High Performance Computing (HPC), data analytics and artificial intelligence (AI) technologies and expertise. Before that, she was the Executive Director of EPCC, the national HPC Centre based at the University of Edinburgh. Alison began her working life as a real-time systems programmer in industry and has now worked in HPC for almost 25 years, managing large collaborative projects in HPC, data and AI. Alison may be retired from permanent employment, but is still active on a range of projects and committees.


Session 10:00-12:00

Experiences and Challenges in achieving Green AI with HPC systems

Joseph Lee and Eleanor Broadway

The use of Artificial Intelligence (AI) and Machine Learning (ML) in scientific workloads is increasingly widespread. It is imperative to understand the environmental impact of running these applications on HPC systems, and to optimise their performance and energy efficiency. In many respects, typical machine learning workloads differ significantly from traditional HPC applications, including in the software libraries used, parallelisation strategies, I/O requirements, and power consumption signatures. We therefore need to investigate opportunities for optimising energy efficiency specifically for AI/ML workloads. However, this comes with a unique set of challenges, from defining appropriate accuracy metrics to acquiring measurement data from novel accelerator hardware. The aim of this talk is to share our experience and discuss some of the difficulties encountered in achieving greener and more efficient AI across a range of HPC systems at EPCC.
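
One of those measurement challenges, getting a power trace at all, can at least be prototyped on NVIDIA hardware with NVML. A minimal sampling sketch is shown below, assuming the pynvml bindings and a single GPU; novel accelerators typically need vendor-specific equivalents, which is part of the difficulty.

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)

    samples = []
    t_end = time.time() + 10.0             # sample for 10 s while a job runs
    while time.time() < t_end:
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        samples.append(power_w)
        time.sleep(0.1)

    print(f"mean power: {sum(samples) / len(samples):.1f} W "
          f"over {len(samples)} samples")
    pynvml.nvmlShutdown()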

RISC-V: A potential game changer for delivering green HPC?

Maurice Jamieson and Nick Brown

RISC-V is an open, community-led Instruction Set Architecture (ISA). With over 1 billion RISC-V devices shipped by Qualcomm alone, this technology has seen phenomenal growth since it was first proposed just over a decade ago. Whilst RISC-V has been most popular in embedded computing and has yet to gain ubiquity in HPC, we are now seeing hardware more capable of HPC workloads (e.g. the 64-core SG2042), and a variety of further hardware is planned by vendors for release in 2024.

Whilst many associate RISC-V with CPUs, and indeed this has been where it has demonstrated most success to date, using aspects of the standard such as vectorisation it is also possible to develop accelerators and program these in a unified manner. For instance, Esperanto have released the ET-SoC-1 which is an accelerator chip that contains 1000 RISC-V cores, and they have a demonstrator machine that packages these to provide over 80,000 RISC-V cores.

The ability that RISC-V provides to specialise the microarchitecture to the workload is crucially important, and has the potential to deliver significant performance and energy efficiency benefits. Ultimately, it means that vendors are able to develop CPUs and/or accelerators that are tuned for specific workloads, and save chip area and energy on facets which are less important. For instance, the Esperanto ET-SoC-1 delivers 250 GFlops/Watt under real world workloads.

A major benefit of RISC-V is that the ISA standard enables a common software ecosystem to be developed, with a large range of participants (e.g. vendors, researchers, individuals) actively contributing to tools and optimised libraries. These can then be run across the range of RISC-V hardware, and indeed many HPC libraries have already been ported to the architecture and work “out of the box”, enabling effective use of these technologies often by simply recompiling code.

In this talk we will describe the potential benefits that RISC-V can deliver for the HPC community, explore some of the related success stories, highlight current areas of focus that the HPC community can get involved in, and describe the ExCALIBUR RISC-V testbed that attendees can use to gain free access to experiment with RISC-V for their workloads.


ClusterCockpit and EE-HPC: A way to more energy efficiency on HPC systems?

Thomas Gruber

Users and support personnel greatly appreciate job-specific monitoring data, which gives an impression of how well an application makes use of the compute nodes. By including hardware performance counter measurements, resource usage and power consumption can be visualised. This data can also be used to tune compute nodes individually in order to increase energy efficiency. The German-funded EE-HPC project extends the ClusterCockpit framework with components that analyse the monitoring data and manipulate hardware settings of the compute nodes, such as CPU/uncore frequencies and power capping, while maintaining high system performance. First tests on a production-level system show energy savings of around 5%, approximately 20 GWh/year.