User Guide
Technical Summary
The National Research Platform (NRP) is a heterogeneous, nationally distributed, open system that features CPUs, FP32- and FP64-optimized GPUs, and FPGAs. These resources are arranged into two types of subsystems at three sites (one high-performance and two FP32-optimized), specialized for a wide range of data science, simulation, and machine learning/artificial intelligence workloads, with data access provided through a federated, national-scale content delivery network (CDN).
NRP is a Category II NSF-funded system operated jointly by the San Diego Supercomputer Center (SDSC) at UC San Diego, the Massachusetts Green High Performance Computing Center (MGHPCC), and the University of Nebraska–Lincoln (UNL). The system features a novel, extremely low-latency fabric from GigaIO that allows dynamic composition of hardware, including FPGAs, GPUs, and NVMe storage. Each of the three sites (SDSC, UNL, and MGHPCC) includes ~1 PB of usable disk space. The three storage systems serve as data origins for the CDN, providing data access anywhere in the country within a round-trip delay of ~10 ms through network caches at the three sites and at five Internet2 network colocation facilities.
The NRP architecture includes research and education networks, compute accelerators, heterogeneous computing resources (e.g., edge computing near the instrument location), a CDN that provides easy access to any data, anytime, from anywhere, and an overall systems software stack that allows central management of the system while remaining open for growth. Innovative technical features of the HPC subsystem include a mix of field-programmable gate array (FPGA) chips, graphics processing units (GPUs) with memory and storage, and the fully integrated low-latency fabric.
UNL will lead NRP’s infrastructure operations, using Kubernetes (K8S) to manage the remote systems at the three primary sites and the Internet2 locations, as well as the bring-your-own-resource (BYOR) locations.
Resource Allocation Policies
- 3-year prototype phase
- Followed by a 2-year allocation phase
- Interested parties may contact the SDSC Help Desk
Technical Details
NRP HPC System (located at SDSC)
| System Component | Configuration |
|---|---|
| NVIDIA A100 HGX GPU Servers | |
| HGX A100 servers | 1 |
| NVIDIA GPUs/server | 8 |
| HBM2 memory per GPU | 80 GB |
| Host CPU (2 per server) | AMD EPYC 7742 |
| Host CPU memory (per server) | 512 GB @ 3200 MHz |
| FabreX Gen4 network adapter (per server) | 8 |
| Solid-state disk (2 per server) | 1 TB |
| Xilinx Alveo FPGA Servers | |
| GigaIO Gen4 pooling appliances | 4 |
| FPGAs/appliance | 8 |
| FPGA | Alveo U55C |
| High Core Count CPU Servers | |
| Number of servers | 2 |
| Processor (2 per server) | AMD EPYC 7742 |
| Memory (per server) | 1 TB @ 3200 MHz |
| FabreX network adapter (per server) | 1 |
| Mellanox ConnectX-6 network adapter (per server) | 1 |
| Low Core Count CPU Servers | |
| Number of servers | 2 |
| Processors (2 per server) | AMD EPYC 7F72 |
| Memory (per server) | 1 TB @ 3200 MHz |
| FabreX network adapter (per server) | 1 |
| Mellanox ConnectX network adapter (per server) | 1 |
| Network Infrastructure | |
| GigaIO FabreX 24-port Gen4 PCIe switches | 18 |
| GigaIO FabreX network adapters | 36 |
| Mellanox ConnectX-6 network adapters | 10 |
| FabreX-Connected NVMe Resources | |
| GigaIO Gen3 NVMe pooling appliances | 4 |
| Capacity per NVMe resource | 122 TB |
| Ancillary Systems | |
| Home file system | 1.6 PB |
| Service nodes | 2 |
| Data cache (1 at SDSC, 5 at Internet2 sites) | 50 TB each |
NRP FP32 Subsystem (1 each at UNL and MGHPCC)
| System Component | Configuration |
|---|---|
| NVIDIA A10 GPUs (one each at UNL and MGHPCC) | |
| GPU servers | 18 |
| NVIDIA A10 GPUs/node | 8 |
| Host CPU (2 per server) | AMD EPYC 7502 |
| Host CPU memory | 512 GB @ 3200 MHz |
| Node-local NVMe | 8 TB |
| Network adapters | 1x 1 Gbps; 2x 10 Gbps |
| Data cache (1) | 50 TB each |
| Ancillary Systems | |
| Service nodes (2 per site) | AMD EPYC 7402P |
| Home file system | 1.6 PB |
System Access and Project Management
NRP is accessible via Nautilus, a HyperCluster for running containerized big data applications using K8S, which facilitates system access and project management. Users can log in to Nautilus with their institutional credentials via federated login (CILogon). Instructions for accessing NRP are available in the Nautilus documentation: Get Access to the PRP Nautilus cluster.
Job Scheduling, Monitoring and Management
NRP uses Kubernetes (K8S) for job scheduling, monitoring and management. K8S is an open-source platform for managing containerized workloads and services. A K8S cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic K8S commands and examples of running jobs are available via the Nautilus documentation: Basic K8S tutorial.
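As an illustration only, the minimal Pod manifest below sketches how resources (CPU, memory, and a GPU) can be requested in YAML; the pod name, namespace, and image are hypothetical placeholders, and the resource limits and policies that actually apply on NRP should be taken from the Nautilus documentation.

```yaml
# Minimal sketch of a Pod manifest; names, namespace, and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod          # hypothetical pod name
  namespace: my-namespace     # replace with the namespace assigned to your project
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:12.2.0-runtime-ubuntu22.04   # example public CUDA image
    command: ["nvidia-smi"]                          # prints the visible GPUs, then exits
    resources:
      requests:
        cpu: "2"
        memory: 8Gi
        nvidia.com/gpu: 1      # request one GPU
      limits:
        cpu: "2"
        memory: 8Gi
        nvidia.com/gpu: 1
  restartPolicy: Never
```

Such a Pod could then be submitted with `kubectl create -f gpu-test-pod.yaml` and its output inspected with `kubectl logs gpu-test-pod`, using the kubectl tool described below.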
The K8S command-line tool, kubectl, allows you to run commands against K8S clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
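For orientation only, the sketch below shows the general shape of a kubeconfig file; the cluster name, server URL, credentials, and namespace are placeholders, and the actual configuration file for Nautilus is obtained as described in its documentation.

```yaml
# General shape of $HOME/.kube/config; all names, URLs, and credentials are placeholders.
apiVersion: v1
kind: Config
clusters:
- name: nautilus                                  # hypothetical cluster entry
  cluster:
    server: https://k8s.example.org:6443          # placeholder API server URL
users:
- name: my-user                                   # placeholder user entry
  user:
    token: <token-from-federated-login>           # placeholder credential
contexts:
- name: nautilus
  context:
    cluster: nautilus
    user: my-user
    namespace: my-namespace                       # default namespace for kubectl commands
current-context: nautilus
```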
Storage
Users are responsible for backing up all important data.
NRP is designed to develop innovative data infrastructure. NRP plans to build on the OSG Data Federation and the PRP regional Ceph clusters, as well as to experiment with BYOB storage as part of the infrastructure, including replication. NRP includes three Ceph clusters and eight Data Federation caches, with automated replication between the Ceph clusters, and allows users to specify which parts of their namespace should be replicated where.
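As a hedged sketch, one common way to obtain Ceph-backed space through Kubernetes is a PersistentVolumeClaim such as the one below; the claim name, namespace, size, and especially the storageClassName are assumptions, and the storage classes actually offered on NRP should be taken from the Nautilus documentation.

```yaml
# Sketch of a PersistentVolumeClaim for Ceph-backed storage; names and the
# storage class are hypothetical placeholders.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data-volume        # hypothetical claim name
  namespace: my-namespace     # replace with your project namespace
spec:
  accessModes:
  - ReadWriteOnce             # single-node read/write access
  resources:
    requests:
      storage: 100Gi          # requested capacity
  storageClassName: rook-ceph-block   # assumed class name; check the classes available on the cluster
```

The claim can then be mounted into a Pod through a `volumes` entry that references `persistentVolumeClaim.claimName`.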
Composable Systems
NRP also supports Composable Systems, allowing researchers to create a virtual "tool set" of resources for a specific project and then recompose it as needed. Composable systems are enabled by NRP's interconnect, based on GigaIO's FabreX PCIe switch technology. This unique technology enables dynamic disaggregation of system components, such as GPUs and FPGAs, and their composition into customized configurations designed for a specific workflow. The Composable Systems feature allows researchers on the NRP HPC partition to create customized hardware configurations, combining FPGAs, GPUs, and NVMe storage, that match their project needs.
Software
All resources on NRP are available via Kubernetes. The software stack will be containerized, and users can use their own custom containers hosted on external repositories (such as Docker Hub). NRP staff can provide a limited number of baseline containers with the right drivers and software matching the NRP hardware (for example, for FPGAs).
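To illustrate the custom-container workflow (a sketch only; the image path, job name, and namespace are hypothetical), a workload simply references the externally hosted image in its manifest:

```yaml
# Sketch of a Job that pulls a user-built image from an external registry;
# the image path, job name, namespace, and entry point are hypothetical.
apiVersion: batch/v1
kind: Job
metadata:
  name: custom-image-job
  namespace: my-namespace
spec:
  template:
    spec:
      containers:
      - name: analysis
        image: docker.io/myuser/my-analysis:latest   # hypothetical image on Docker Hub
        command: ["python", "/app/run.py"]           # hypothetical entry point inside the image
        resources:
          limits:
            cpu: "4"
            memory: 16Gi
      restartPolicy: Never
  backoffLimit: 1
```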