User Guide
Technical Summary
Voyager is an innovative AI system within the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, designed specifically for science and engineering research at scale. Voyager is focused on supporting research in science and engineering that is increasingly dependent on artificial intelligence and deep learning as a critical element of the experimental and/or computational work. Featuring Habana Gaudi training processors and first-generation Habana inference processors, along with a high-performance, low-latency 400 gigabit-per-second interconnect from Arista, Voyager gives researchers the ability to work with extremely large data sets using standard AI tools, like TensorFlow and PyTorch, or to develop their own deep learning models using developer tools and libraries from Habana Labs.
Voyager is an NSF-funded system, developed in collaboration with Supermicro and Intel's Habana Labs, and operated by the San Diego Supercomputer Center at UC San Diego. It began a 3-year testbed phase in early 2022.
Resource Allocation Policies
- Current Status: Testbed Phase
- The 3-year testbed phase will be available to select focused projects, as well as workshops and industry interactions.
- The testbed phase will be followed by a 2-year allocation phase open to the broader NSF community, along with user workshops.
- To get access to Voyager, please send a request to HPC Consulting.
Job Scheduling Policies
- Currently no policies are set.
- Kubernetes is used for job scheduling.
Technical Details
System Component | Configuration |
---|---|
Supermicro X12 Gaudi Training Nodes | |
CPU Type | Intel Xeon Gold 6336 |
Habana Gaudi processors | 336 |
Nodes | 42 |
Training processors/node | 8 |
Host x86 processors/node | 2 |
Sockets | 2 |
Memory capacity | 512 GB DDR4 DRAM |
Memory/training processor | 32 GB HBM2 |
Local Storage | 6.4 TB local NVMe |
Max CPU memory bandwidth | ** GB/s |
Intel First-Generation Habana Inference Nodes | |
CPU Type | Intel Xeon Gold 6240 |
First-generation Habana inference processors | 16 |
Nodes | 2 |
First-generation Habana inference cards/node | 8 |
Cores/socket | 20 |
Sockets | 2 |
Clock speed | 2.5 GHz |
Flop speed | 34.4 TFlop/s |
Memory capacity | 384 GB DDR4 DRAM |
Local Storage | 1.6 TB Samsung PM1745b NVMe PCIe SSD |
Max CPU memory bandwidth | 281.6 GB/s |
Standard Compute Nodes | |
CPU Type | Intel x86 |
Nodes | 36 |
x86 processors/node | 2 |
Memory capacity | 384 GB |
Local NVMe | 3.2 TB |
Interconnect | |
Topology | Full bisection bandwidth switch |
Per-node bandwidth | 6 × 400 Gb/s (bidirectional) |
Disk I/O Subsystem | |
File Systems | Ceph |
Ceph Storage | 1 PB |
Systems Software Environment
Software Function | Description |
---|---|
Cluster Management | Bright Cluster Manager |
Operating System | Ubuntu 20.04 LTS |
File Systems | Ceph |
Scheduler and Resource Manager | Kubernetes |
User Environment | Lmod, Containers |
System Access
Logging in to Voyager
Voyager uses ssh key pairs for access. Approved users will need to send their ssh public key to consult@sdsc.edu to gain access to the system.
To log in to Voyager from the command line, use the hostname:
login.voyager.sdsc.edu
The following are examples of Secure Shell (ssh) commands that may be used to log in to Voyager:
ssh <your_username>@login.voyager.sdsc.edu
ssh -l <your_username> login.voyager.sdsc.edu
Notes and hints
- Voyager will not maintain local passwords; your public key will need to be appended to your ~/.ssh/authorized_keys file to enable access from authorized hosts. We accept RSA, ECDSA, and ed25519 keys. Make sure you have a strong passphrase on the private key on your local machine (see the sketch after this list).
- You can use ssh-agent or keychain to avoid repeatedly typing the private-key passphrase.
- Hosts that connect via SSH more frequently than ten times per minute may be blocked for a short period of time.
- Do not use the login node for computationally intensive processes, as a host for running workflow management tools, as a primary data transfer node for large or numerous data transfers, or as a server providing other services accessible to the Internet. The login nodes are meant for file editing, simple data analysis, and other tasks that use minimal compute resources. All computationally demanding jobs should be run using Kubernetes.
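For reference, here is a minimal sketch of generating an ed25519 key pair with a passphrase and loading it into ssh-agent (the key file name is illustrative; adjust it to your own conventions):

# Generate an ed25519 key pair; choose a strong passphrase when prompted.
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_voyager

# Send the PUBLIC key (~/.ssh/id_ed25519_voyager.pub) to consult@sdsc.edu; never share the private key.

# Start ssh-agent and add the key so the passphrase is typed only once per session.
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519_voyager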
Voyager cnvrg.io
Voyager will feature cnvrg.io, a machine learning (ML) platform that helps manage, build, and automate ML workflows. cnvrg.io will provide a quick and easy way for Voyager users to collaborate, integrate, manage files, and submit and monitor jobs.
Adding Users to a Project
Approved Voyager project PIs and co-PIs can add/remove users (accounts) to/from a Voyager project. Please submit a support ticket to consult@sdsc.edu to add/remove users.
Modules
Environment Modules provide for dynamic modification of your shell environment. Module commands set, change, or delete environment variables, typically in support of a particular application. They also let the user choose between different versions of the same software or different combinations of related codes.
Voyager uses Lmod, a Lua-based module system. Users will ONLY need the kubernetes module, which is loaded by default.
Running Jobs on Voyager
Voyager runs Kubernetes. Kubernetes is an open-source platform for managing containerized workloads and services. A Kubernetes cluster consists of a set of worker machines, called nodes, that run containerized applications. The application workloads are executed by placing containers into Pods to run on nodes. The resources required by the Pods are specified in YAML files. Some basic Kubernetes commands and examples of running jobs are provided below.
The Kubernetes command-line tool, kubectl, allows you to run commands against Kubernetes clusters. You can use kubectl to deploy applications, inspect and manage cluster resources, and view logs. For configuration, kubectl looks for a file named config in the $HOME/.kube directory. For more information including a complete list of kubectl operations, see the kubectl reference documentation.
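As a quick reference, a few common kubectl operations (all standard kubectl subcommands; the file and pod names are illustrative):

kubectl apply -f hpu-test-pod.yaml    # create the resources described in a YAML file
kubectl get pods                      # list your pods and their status
kubectl describe pod hpu-test-pod     # show detailed state and recent events for a pod
kubectl logs hpu-test-pod             # print a pod's container output
kubectl delete pod hpu-test-pod       # remove a pod when you are done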
Set up Kubectl environment
On login.voyager.sdsc.edu, set up the kubectl environment by loading the kubernetes module:
vgr-1-20:~$ module load kubernetes/voyager
Review your currently loaded modules:
vgr-1-20:~$ module list

Currently Loaded Modules:
  1) shared   2) dot   3) default-environment   4) DefaultModules   5) kubernetes/voyager/1.18.15
Usage Guidelines
There are currently no limits set on Voyager resources. The limits noted for each partition in the table below are the maximums available. Resource limits will be set based on the Early User Period evaluation.
Resource Name | Max Walltime | Max Nodes/Job | Nodes | Notes |
---|---|---|---|---|
First-Generation Habana Inference | 48 hrs | 2 | vgr-10-01, vgr-10-02 | Inference |
Gaudi | 48 hrs | 42 | vgr-[2-4]-[01-06], vgr-[6-9]-[01-06] | Training |
Compute | 48 hrs | 36 | vgr-10-[02-38] | Preprocessing/Postprocessing |
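If a run must land on a particular node class, one option is to pin a pod to a specific node from the table above using the standard spec.nodeName field (a minimal sketch; the pod details are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: node-pinned-test-pod
spec:
  nodeName: vgr-10-01        # a first-generation Habana inference node from the table above
  restartPolicy: Never
  containers:
    - name: test-container
      image: ubuntu:20.04
      command: ["echo", "running on a pinned node"]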
Kubernetes objects are persistent entities in the Kubernetes system. Objects describe:
- The containerized applications to run (the software stack lives inside the container).
- The resources needed by the applications, including CPUs, memory, Gaudi processors, first-generation Habana inference processors, etc.
- The policies that control how the applications behave.
The most commonly used objects on Voyager are pods and jobs.
kind: pod
- A pod is the smallest deployable unit of computing; it bundles resources, containers, storage, and run policies.
kind: job
- A Job creates one or more pods and will continue to retry execution of the pods until a specified number of them successfully terminate. In the event of a node failure, pods that are managed by a Job will be rescheduled on other nodes. In addition, Jobs allow users to run multiple instances using the completions and parallelism features. A minimal sketch is shown below.
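A minimal Job sketch, reusing the Habana container image from the Basic Gaudi YAML example later in this guide (the Job name and retry settings are illustrative):

apiVersion: batch/v1
kind: Job
metadata:
  name: hpu-test-job
spec:
  completions: 1        # number of pods that must finish successfully
  parallelism: 1        # number of pods allowed to run at once
  backoffLimit: 2       # retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: gaudi-container
          image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
          command: ["hl-smi"]
          resources:
            limits:
              habana.ai/gaudi: 1
              memory: 32G
              cpu: 1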
kind: MPIJob
- An MPIJob is currently required for running multi-node jobs; a skeleton sketch follows.
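The MPIJob kind is provided by the Kubeflow MPI Operator. A skeleton sketch, assuming the operator's kubeflow.org/v1 API; the image is the Habana container used elsewhere in this guide, while the script path and replica counts are placeholders:

apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: gaudi-mpijob-test
spec:
  slotsPerWorker: 8                  # one MPI slot per Gaudi on each node
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
            - name: launcher
              image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
              command: ["mpirun", "-np", "16", "python3", "/workdir/train.py"]   # placeholder training script
    Worker:
      replicas: 2                    # two training nodes
      template:
        spec:
          containers:
            - name: worker
              image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
              resources:
                limits:
                  habana.ai/gaudi: 8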
Other Kubernetes objects are available but not recommended for routine application runs on Voyager: Deployment, ReplicaSet, DaemonSet, CronJob, ReplicationController.
YAML ("YAML Ain't Markup Language", originally "Yet Another Markup Language") is a human-readable data serialization language supported across programming languages and often used as a format for configuration files. YAML uses colon-based syntax to express key-value pairs. The official recommended filename extension for YAML files is .yaml.
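For example, a few YAML key-value pairs (illustrative values only):

name: hpu-test-pod        # a string value
cpu: 1                    # a number
metadata:                 # a nested mapping, expressed through indentation
  label: voyager-example
containers:               # a list; items are introduced with "-"
  - gaudi-container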
Example YAML file and available containers
SDSC User Services staff have developed sample run scripts for common applications available on Voyager in directory:
/cm/shared/examples/sdsc
YAML formatting is very sensitive to outline indentation and whitespace, so please do not copy and paste examples from this user guide.
Basic Gaudi YAML
This example pod requests a single Gaudi processor and runs the hl-smi command to report the status of the Habana devices on the node.
apiVersion: v1
kind: Pod
metadata:
  name: hpu-test-pod
spec:
  restartPolicy: Never
  containers:
    - name: gaudi-container
      image: vault.habana.ai/gaudi-docker/1.8.0/ubuntu22.04/habanalabs/tensorflow-installer-tf-cpu-2.8.4:1.8.0-690-20230214
      command: ["hl-smi"]
      resources:
        limits:
          habana.ai/gaudi: 1
          hugepages-2Mi: 3800Mi
          memory: 32G
          cpu: 1
        requests:
          memory: 32G
          cpu: 1
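To submit this pod and inspect its output, the standard kubectl workflow looks something like this (assuming the YAML above is saved as hpu-test-pod.yaml, an illustrative filename):

kubectl apply -f hpu-test-pod.yaml   # create the pod
kubectl get pods                     # watch for STATUS to reach Completed
kubectl logs hpu-test-pod            # view the hl-smi output
kubectl delete pod hpu-test-pod      # clean up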