NSF Awards SDSC for Innovative Cosmos System to Democratize the Accelerator Ecosystem
Published July 14, 2024
The San Diego Supercomputer Center (SDSC) at the University of California San Diego (UCSD) has received a U.S. National Science Foundation (NSF) award to develop the Cosmos supercomputer, built on the HPE Cray Supercomputing EX2500 and incorporating innovative AMD Instinct™ MI300A accelerated processing units (APUs), the HPE Slingshot interconnect and a flash-based filesystem. The initial NSF award is for $5 million to acquire and deploy the system; a subsequent award of about the same amount is expected to fund five years of operations.
Accelerated computing uses specialized hardware to significantly speed up the computationally intensive parts of simulations. Typical designs feature discrete accelerators and central processing units (CPUs) connected via data movement interfaces. The new Cosmos system will be built by Hewlett Packard Enterprise (HPE) and features the AMD Instinct MI300A APU, which provides both CPU and graphics processing unit (GPU) capabilities with unified memory and is the first datacenter-class stacked-die, multi-chip APU, according to the project team. The team is led by SDSC’s User Support Lead Mahidhar Tatineni, who is joined by Co-PIs Subhashini Sivagnanam, Andreas Goetz, Igor Sfiligoi and Christopher Irving.
The APU uniquely features an in-chip memory layout that is integrated and shared between CPU and GPU resources. This type of memory architecture facilitates an incremental programming approach, which enables many communities to adopt GPUs and eases the process of porting and optimizing a range of applications. The directive-based programming approach was demonstrated in a case study on the widely used OpenFOAM computational fluid dynamics code.
“The APU advantage provides AMD Instinct MI300A APUs with unified memory and cache resources to deliver an easily programmable platform, highly performant compute, and impressive energy efficiency for workloads sitting at the convergence of HPC and AI,” said Brent Gorda, senior director, High Performance Computing, Data Center Accelerator Business, AMD. “We look forward to SDSC – along with the collaborative research community at large – demonstrating the full potential of our MI300A APUs as they deploy the Cosmos supercomputer.”
“Cosmos enables researchers to exploit this innovative and powerful accelerator technology via an incremental programming approach, democratizing access to accelerated computing and significantly increasing the range of applications that can effectively use accelerators,” said Tatineni, principal investigator (PI) on the project. “The benefits of accelerating applications will aid discoveries in AI, astrophysics, genomics, large language models, materials science and many other domains.”
Cosmos integrates the AMD Instinct APUs into four-socket nodes with all-to-all connectivity over the AMD high-speed interconnect, which provides a socket-to-socket global memory interface. The system is a 100 percent direct liquid-cooled HPE Cray EX.
“HPE is proud to build Cosmos for SDSC to help advance the research center’s exploration in science and engineering fields,” said Trish Damkroger, senior vice president and general manager, HPC & AI Infrastructure Solutions, HPE. “Working in collaboration with AMD, we’ll be building one of our HPE Cray EX2500 systems featuring our industry-leading liquid cooling technology and HPE Slingshot Interconnect, enabling us to deliver a system based on open architecture that is not only powerful, but space-saving and energy efficient.”
High-performance, flash-based storage provides the high IOPS and bandwidth needed for the anticipated mixed-application workload. “The flash filesystem can be cross-mounted to other SDSC systems to facilitate data sharing, software development and benchmarking. Capacity storage is provided via a Ceph filesystem,” said Tatineni.
The project is structured as a three-year testbed phase followed by a two-year allocations phase. During the testbed phase, Cosmos project staff will collaborate with research teams covering several model science and engineering applications, including those from astronomy, neuroscience, molecular biology and machine learning. In the same phase, project staff will also work with the National Artificial Intelligence Research Resource (NAIRR) Pilot program to identify and support projects that are good fits for the architecture and capabilities of the machine.
Collaborations specifically target community codes, science gateways and enabling middleware, where success in porting a single application brings along many users and institutions. Integration with the Open Science Grid aims to further extend the benefits of the APU to thousands of users in the high-throughput computing community.
“Many scientific communities face hurdles with programming complexity and need for people time/resources to re-engineer their large legacy software bases to benefit from accelerators. This inhibits their scientific progress because most of the hardware performance increases in recent years have come in the form of accelerators. The motto of Cosmos, ‘no software left behind,’ expresses the project goal of focusing on these communities, including but not limited to the high-throughput computing communities on OSG,” said SDSC Director Frank Würthwein.
Lessons learned and best practices developed from the research collaborations will be shared with the wider user community through project workshops, user training events, and participation in the AMD User Forum. The allocations phase will incorporate insights gained from the testbed phase regarding application porting to the APU, leading to software development resources, training materials and publications that allow others to migrate their applications to realize the benefits of accelerated computing. During the allocations phase, Cosmos will be available to researchers through an NSF-approved allocation process.
Cosmos is supported by the NSF Office of Advanced Cyberinfrastructure (grant award no. 2404323).