User Guide

Technical Summary

The National Research Platform (NRP) is a heterogeneous, nationally distributed, open system featuring CPUs, FP32- and FP64-optimized GPUs, and FPGAs. These resources are arranged into two types of subsystems at three sites (one high-performance subsystem and two FP32-optimized subsystems), specialized for a wide range of data science, simulation, and machine learning/artificial intelligence workloads, with data access provided through a federated, national-scale content delivery network (CDN).

NRP is a Category II NSF-funded system operated jointly by the San Diego Supercomputer Center (SDSC) at UC San Diego, the Massachusetts Green High Performance Computing Center (MGHPCC), and the University of Nebraska–Lincoln (UNL). The system features a novel, extremely low-latency fabric from GigaIO that allows dynamic composition of hardware, including FPGAs, GPUs, and NVMe storage. Each of the three sites (SDSC, UNL, and MGHPCC) includes ~1 PB of usable disk space. The three storage systems function as data origins for the CDN, providing data access anywhere in the country within a round-trip delay of ~10 ms through network caches at the three sites and five Internet2 network colocation facilities.

The NRP architecture includes research and education networks, compute accelerators, heterogeneous computing resources (e.g., edge computing near instrument locations), a CDN that provides easy access to any data, anytime, from anywhere, and an overall systems software stack that allows central management of the system while remaining open for growth. Innovative technical features of the HPC subsystem include a mix of field-programmable gate array (FPGA) chips, graphics processing units (GPUs) with memory and storage, and the fully integrated low-latency fabric.

UNL will lead NRP’s infrastructure operations, using Kubernetes (K8S) to manage the remote systems at the three primary sites and the Internet2 facilities, as well as the bring-your-own-resource (BYOR) locations.

Resource Allocation Policies

  • 3-year prototype phase
  • Followed by a 2-year allocation phase
  • Interested parties may contact the SDSC Help Desk

Technical Details

NRP HPC System (located at SDSC)

System component configuration:

NVIDIA A100 HGX GPU Servers
  • HGX A100 servers: 1
  • NVIDIA GPUs per server: 8
  • HBM2 memory per GPU: 80 GB
  • Host CPUs (2 per server): AMD EPYC 7742
  • Host CPU memory (per server): 512 GB @ 3200 MHz
  • FabreX Gen4 network adapters (per server): 8
  • Solid state disks (2 per server): 1 TB

XILINX Alveo FPGA Servers
  • GigaIO Gen4 pooling appliances: 4
  • FPGAs per appliance: 8
  • FPGA: Alveo U55C

High Core Count CPU Servers
  • Number of servers: 2
  • Processors (2 per server): AMD EPYC 7742
  • Memory (per server): 1 TB @ 3200 MHz
  • FabreX network adapter (per server): 1
  • Mellanox ConnectX-6 network adapter (per server): 1

Low Core Count CPU Servers
  • Number of servers: 2
  • Processors (2 per server): AMD EPYC 7F72
  • Memory (per server): 1 TB @ 3200 MHz
  • FabreX network adapter (per server): 1
  • Mellanox ConnectX network adapter (per server): 1

Network Infrastructure
  • GigaIO FabreX 24-port Gen4 PCIe switches: 18
  • GigaIO FabreX network adapters: 36
  • Mellanox ConnectX-6 network adapters: 10

FabreX-Connected NVMe Resources
  • GigaIO Gen3 NVMe pooling appliances: 4
  • Capacity per NVMe resource: 122 TB

Ancillary Systems
  • Home file system: 1.6 PB
  • Service nodes: 2
  • Data caches (1 at SDSC, 5 at Internet2 sites): 50 TB each

NRP FP32 Subsystem (1 each at UNL and MGHPCC)

System component configuration:

NVIDIA A10 GPUs (one each at UNL and MGHPCC)
  • GPU servers: 18
  • NVIDIA A10 GPUs per node: 8
  • Host CPUs (2 per server): AMD EPYC 7502
  • Host CPU memory: 512 GB @ 3200 MHz
  • Node-local NVMe: 8 TB
  • Network adapters: 1x 1 Gbps; 2x 10 Gbps
  • Data cache (1): 50 TB each

Ancillary Systems
  • Service nodes (2 per site): AMD EPYC 7402P
  • Home file system: 1.6 PB

System Access and Project Management

NRP is accessible via Nautilus, a HyperCluster for running containerized big data applications using K8S, which facilitates system access and project management. Users can access Nautilus with their institutional credentials via federated login (CILogon). Instructions for accessing NRP are available in the Nautilus documentation: Get Access to the PRP Nautilus cluster.

Job Scheduling, Monitoring and Management

NRP uses Kubernetes (K8S) for job scheduling, monitoring and management. K8S is an open-source platform for managing containerized workloads and services. A K8S cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic K8S commands and examples of running jobs are available via the Nautilus documentation: Basic K8S tutorial.
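
For illustration only, a minimal Pod specification might look like the following sketch; the namespace, image, and resource amounts here are placeholders rather than NRP-specific values, and the Nautilus documentation governs the actual namespaces and limits.

    # Hypothetical Pod spec (sketch); names, image, and amounts are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod              # placeholder Pod name
      namespace: my-namespace        # replace with your assigned namespace
    spec:
      containers:
      - name: worker
        image: ubuntu:22.04          # any container image, e.g. from Docker Hub
        command: ["sh", "-c", "echo hello from NRP && sleep 60"]
        resources:
          requests:                  # resources the scheduler reserves for the Pod
            cpu: "1"
            memory: 2Gi
          limits:                    # hard caps enforced on the container
            cpu: "1"
            memory: 2Gi
      restartPolicy: Never

A file like this is typically submitted with kubectl create -f pod.yaml and cleaned up with kubectl delete -f pod.yaml.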

The K8S command-line tool, kubectl, allows you to run commands against K8S clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
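
For illustration, a kubeconfig file (the config file in $HOME/.kube) has roughly the following YAML structure; the cluster name, server URL, token, and namespace below are placeholders, and Nautilus users typically download a pre-generated config file after logging in rather than writing one by hand.

    # Sketch of a kubeconfig file; all names, URLs, and credentials are placeholders.
    apiVersion: v1
    kind: Config
    clusters:
    - name: nautilus                         # placeholder cluster name
      cluster:
        server: https://example.invalid:6443 # placeholder API server URL
    users:
    - name: my-user                          # placeholder user entry
      user:
        token: <redacted-token>              # credential issued at login
    contexts:
    - name: nautilus
      context:
        cluster: nautilus
        user: my-user
        namespace: my-namespace              # default namespace for kubectl commands
    current-context: nautilus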

Storage

Users are responsible for backing up all important data.

NRP is designed to develop innovative data infrastructure. NRP plans to build on the OSG Data Federation and the PRP regional Ceph clusters, as well as to experiment with BYOB storage as part of the infrastructure, including replication. NRP includes three Ceph clusters and eight Data Federation caches, with automated replication between the Ceph clusters, allowing users to specify which parts of their namespace should be replicated where.
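
As a sketch of how this looks from the Kubernetes side: persistent storage on a Ceph-backed cluster is typically requested through a PersistentVolumeClaim. The storage class name and capacity below are placeholders; the classes actually offered are listed in the Nautilus documentation.

    # Hypothetical PersistentVolumeClaim (sketch); storageClassName is a placeholder.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data
      namespace: my-namespace        # replace with your assigned namespace
    spec:
      accessModes:
      - ReadWriteOnce                # volume mounted read-write by a single node
      resources:
        requests:
          storage: 100Gi             # requested capacity
      storageClassName: ceph-block   # placeholder; use a class advertised by the cluster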

Composable Systems

NRP also supports composable systems, allowing researchers to create a virtual "tool set" of resources for a specific project and then recompose it as needed. Composable systems are enabled by NRP's interconnect, which is based on GigaIO's FabreX PCIe switch technology. This technology dynamically disaggregates system components, such as GPUs, FPGAs, and NVMe storage, and composes them into customized configurations designed for a specific workflow, so researchers on the NRP HPC partition can build hardware configurations that match their project needs.

Software

All resources on NRP are available via Kubernetes. The software stack will be containerized, and users can use their own custom containers hosted on external repositories (such as Docker Hub). NRP staff can provide a limited number of baseline containers with the right drivers and software for the NRP hardware (for example, for the FPGAs).
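
For example, a custom image hosted on Docker Hub can be referenced directly in a Pod spec, and a GPU can be requested through the nvidia.com/gpu resource name exposed by the NVIDIA device plugin; the image, namespace, and amounts below are placeholders, not NRP defaults.

    # Sketch: custom Docker Hub image plus a GPU request; names are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
      namespace: my-namespace                      # replace with your assigned namespace
    spec:
      containers:
      - name: train
        image: docker.io/myuser/my-ml-image:latest # placeholder custom image
        command: ["python", "train.py"]            # placeholder entrypoint
        resources:
          limits:
            nvidia.com/gpu: 1                      # one GPU via the NVIDIA device plugin
            cpu: "4"
            memory: 16Gi
      restartPolicy: Never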
