User Guide

Technical Summary

The National Research Platform (NRP) is a heterogeneous, nationally distributed, open system featuring CPUs, FP32- and FP64-optimized GPUs, and FPGAs. These resources are arranged into two types of subsystems at three sites (one high-performance subsystem and two FP32-optimized subsystems), specialized for a wide range of data science, simulation, and machine learning/artificial intelligence workloads, with data access provided through a federated, national-scale content delivery network (CDN).

NRP is a Category II NSF-funded system operated jointly by the San Diego Supercomputer Center (SDSC) at UC San Diego, the Massachusetts Green High Performance Computing Center (MGHPCC), and the University of Nebraska–Lincoln (UNL). The system features a novel, extremely low-latency fabric from GigaIO that allows dynamic composition of hardware, including FPGAs, GPUs, and NVMe storage. Each of the three sites (SDSC, UNL, and MGHPCC) includes ~1 PB of usable disk space. The three storage systems function as data origins for the CDN, providing data access anywhere in the country within a round-trip delay of ~10 ms through network caches at the three sites and five Internet2 network colocation facilities.

The NRP architecture includes research and education networks, compute accelerators, heterogeneous computing resources (e.g., edge computing near instrument locations), a CDN that provides easy access to any data, anytime, from anywhere, and an overall systems software stack that allows central management of the system while remaining open for growth. Innovative technical features of the HPC subsystem include a mix of field-programmable gate array (FPGA) chips, graphics processing units (GPUs) with memory and storage, and the fully integrated low-latency fabric.

UNL will lead NRP’s infrastructure operations, using Kubernetes (K8S) to manage the remote systems at the three primary sites and the Internet2 facilities, as well as the bring-your-own-resource (BYOR) locations.

Resource Allocation Policies

  • 3-year prototype phase
  • Followed by a 2-year allocation phase
  • Interested parties may contact the SDSC Help Desk

Technical Details

NRP HPC System (located at SDSC)

System component configuration:

NVIDIA A100 HGX GPU Servers
  • HGX A100 servers: 1
  • NVIDIA GPUs per server: 8
  • HBM2 memory per GPU: 80 GB
  • Host CPUs (2 per server): AMD EPYC 7742
  • Host CPU memory (per server): 512 GB @ 3200 MHz
  • FabreX Gen4 network adapters (per server): 8
  • Solid state disks (2 per server): 1 TB

XILINX Alveo FPGA Servers
  • GigaIO Gen4 pooling appliances: 4
  • FPGAs per appliance: 8
  • FPGA: Alveo U55C

High Core Count CPU Servers
  • Number of servers: 2
  • Processors (2 per server): AMD EPYC 7742
  • Memory (per server): 1 TB @ 3200 MHz
  • FabreX network adapter (per server): 1
  • Mellanox ConnectX-6 network adapter (per server): 1

Low Core Count CPU Servers
  • Number of servers: 2
  • Processors (2 per server): AMD EPYC 7F72
  • Memory (per server): 1 TB @ 3200 MHz
  • FabreX network adapter (per server): 1
  • Mellanox ConnectX network adapter (per server): 1

Network Infrastructure
  • GigaIO FabreX 24-port Gen4 PCIe switches: 18
  • GigaIO FabreX network adapters: 36
  • Mellanox ConnectX-6 network adapters: 10

FabreX-Connected NVMe Resources
  • GigaIO Gen3 NVMe pooling appliances: 4
  • Capacity per NVMe resource: 122 TB

Ancillary Systems
  • Home file system: 1.6 PB
  • Service nodes: 2
  • Data caches (1 at SDSC, 5 at Internet2 sites): 50 TB each

NRP FP32 Subsystem (1 each at UNL and MGHPCC)

System component configuration:

NVIDIA A10 GPUs (one each at UNL and MGHPCC)
  • GPU servers: 18
  • NVIDIA A10 GPUs per node: 8
  • Host CPUs (2 per server): AMD EPYC 7502
  • Host CPU memory: 512 GB @ 3200 MHz
  • Node-local NVMe: 8 TB
  • Network adapters: 1x 1 Gbps; 2x 10 Gbps
  • Data cache (1): 50 TB each

Ancillary Systems
  • Service nodes (2 per site): AMD EPYC 7402P
  • Home file system: 1.6 PB

System Access and Project Management

NRP is accessible via Nautilus, a HyperCluster for running containerized big data applications using K8S, which facilitates system access and project management. Users can access Nautilus with their institutional credentials via federated login (CILogon). Instructions for accessing NRP are available in the Nautilus documentation: Get Access to the PRP Nautilus cluster.

Job Scheduling, Monitoring and Management

NRP uses Kubernetes (K8S) for job scheduling, monitoring and management. K8S is an open-source platform for managing containerized workloads and services. A K8S cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic K8S commands and examples of running jobs are available via the Nautilus documentation: Basic K8S tutorial.
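
For illustration only, a minimal Pod specification might look like the following sketch; the namespace, image, and resource amounts here are placeholders rather than NRP-specific values, and the Nautilus documentation governs the actual namespaces and limits.

    # Hypothetical Pod spec (sketch); names, image, and amounts are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: example-pod              # placeholder Pod name
      namespace: my-namespace        # replace with your assigned namespace
    spec:
      containers:
      - name: worker
        image: ubuntu:22.04          # any container image, e.g. from Docker Hub
        command: ["sh", "-c", "echo hello from NRP && sleep 60"]
        resources:
          requests:                  # resources the scheduler reserves for the Pod
            cpu: "1"
            memory: 2Gi
          limits:                    # hard caps enforced on the container
            cpu: "1"
            memory: 2Gi
      restartPolicy: Never

A file like this is typically submitted with kubectl create -f pod.yaml and cleaned up with kubectl delete -f pod.yaml.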

The K8S command-line tool, kubectl, allows you to run commands against K8S clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
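
For illustration, a kubeconfig file (the config file in $HOME/.kube) has roughly the following YAML structure; the cluster name, server URL, token, and namespace below are placeholders, and Nautilus users typically download a pre-generated config file after logging in rather than writing one by hand.

    # Sketch of a kubeconfig file; all names, URLs, and credentials are placeholders.
    apiVersion: v1
    kind: Config
    clusters:
    - name: nautilus                         # placeholder cluster name
      cluster:
        server: https://example.invalid:6443 # placeholder API server URL
    users:
    - name: my-user                          # placeholder user entry
      user:
        token: <redacted-token>              # credential issued at login
    contexts:
    - name: nautilus
      context:
        cluster: nautilus
        user: my-user
        namespace: my-namespace              # default namespace for kubectl commands
    current-context: nautilus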

Storage

Users are responsible for backing up all important data.

NRP is designed to develop innovative data infrastructure. NRP plans to build on the OSG Data Federation and the PRP regional Ceph clusters, as well as to experiment with BYOB storage as part of the infrastructure, including replication. NRP includes three Ceph clusters and eight Data Federation caches, with automated replication between the Ceph clusters, allowing users to specify which parts of their namespace should be replicated where.
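
As a sketch of how this looks from the Kubernetes side: persistent storage on a Ceph-backed cluster is typically requested through a PersistentVolumeClaim. The storage class name and capacity below are placeholders; the classes actually offered are listed in the Nautilus documentation.

    # Hypothetical PersistentVolumeClaim (sketch); storageClassName is a placeholder.
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data
      namespace: my-namespace        # replace with your assigned namespace
    spec:
      accessModes:
      - ReadWriteOnce                # volume mounted read-write by a single node
      resources:
        requests:
          storage: 100Gi             # requested capacity
      storageClassName: ceph-block   # placeholder; use a class advertised by the cluster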

Composable Systems

NRP also supports composable systems, allowing researchers to create a virtual "tool set" of resources for a specific project and then recompose it as needed. Composable systems are enabled by NRP's interconnect, which is based on GigaIO's FabreX PCIe switch technology. This technology dynamically disaggregates system components, such as GPUs, FPGAs, and NVMe storage, and composes them into customized configurations designed for a specific workflow, so researchers on the NRP HPC partition can build hardware configurations that match their project needs.

Software

All resources on NRP are available via Kubernetes. The software stack will be containerized, and users can use their own custom containers hosted on external repositories (such as Docker Hub). NRP staff can provide a limited number of baseline containers with the right drivers and software for the NRP hardware (for example, for the FPGAs).
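
For example, a custom image hosted on Docker Hub can be referenced directly in a Pod spec, and a GPU can be requested through the nvidia.com/gpu resource name exposed by the NVIDIA device plugin; the image, namespace, and amounts below are placeholders, not NRP defaults.

    # Sketch: custom Docker Hub image plus a GPU request; names are placeholders.
    apiVersion: v1
    kind: Pod
    metadata:
      name: gpu-example
      namespace: my-namespace                      # replace with your assigned namespace
    spec:
      containers:
      - name: train
        image: docker.io/myuser/my-ml-image:latest # placeholder custom image
        command: ["python", "train.py"]            # placeholder entrypoint
        resources:
          limits:
            nvidia.com/gpu: 1                      # one GPU via the NVIDIA device plugin
            cpu: "4"
            memory: 16Gi
      restartPolicy: Never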
