High-performance computing (HPC) systems are specialized resources shared by many researchers across science, engineering, and other fields. To distribute these advanced computing resources in an efficient, fair, and organized way, most computational workloads on these systems run as batch jobs: prescripted sets of commands that execute on a subset of an HPC system’s compute resources for a given amount of time. Researchers submit these batch jobs as scripts to a batch job scheduler, the software that controls and tracks where and when each submitted job will run. If this is your first time using an HPC system and interacting with a batch job scheduler like Slurm, writing and submitting your first batch job scripts may be somewhat intimidating because of the inherent complexity of these systems. Moreover, schedulers can be configured in many different ways and often have features and options that vary from system to system, which you will also need to consider when writing and submitting your batch jobs.
In this second part of our series on Batch Computing, we will introduce you to distributed batch job schedulers (what they are, why they exist, and how they work), using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC system running Slurm as its scheduler. We will also discuss best practices for structuring your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster.
To complete the exercises covered in the Part II webinar session, you will need access to an HPC system running the Slurm Workload Manager as its batch job scheduler.
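As a rough illustration of the kind of job script described above, here is a minimal sketch of a Slurm batch script. It is not taken from the webinar materials. The job name, output file name, resource values, and the `describe_job` helper are hypothetical choices made for this example, and many systems also require site-specific options such as a partition or an account, which are omitted here.

```shell
#!/usr/bin/env bash
# Lines starting with "#SBATCH" are read by Slurm as resource requests;
# to the shell they are ordinary comments.
#SBATCH --job-name=hello-slurm          # hypothetical job name
#SBATCH --nodes=1                       # request one node
#SBATCH --ntasks=1                      # request one task
#SBATCH --cpus-per-task=1               # one CPU core for that task
#SBATCH --time=00:05:00                 # wall-clock limit of five minutes
#SBATCH --output=hello-slurm.%j.out     # %j expands to the job ID

# describe_job is a hypothetical helper for this sketch. It reports a few
# environment variables that Slurm sets inside a running job. The fallback
# values are an assumption of this example so that the script can also be
# tried outside of Slurm.
describe_job() {
    printf 'job id: %s\n' "${SLURM_JOB_ID:-none}"
    printf 'tasks: %s\n' "${SLURM_NTASKS:-1}"
    printf 'nodes: %s\n' "${SLURM_JOB_NODELIST:-${HOSTNAME:-localhost}}"
}

describe_job
```

Assuming the script is saved as `hello-slurm.sh` (a hypothetical file name), it would typically be submitted with `sbatch hello-slurm.sh` and monitored with `squeue`. The right resource values and any additional required options depend on how the scheduler is configured on your system.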
----
COMPLECS (COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure) is a new SDSC training program that covers the non-programming skills needed to use supercomputers effectively. Topics include parallel computing concepts, Linux tools and bash scripting, security, batch computing, how to get help, data management, and interactive computing. Each session offers 1 hour of instruction followed by a 30-minute Q&A. COMPLECS is supported by NSF award 2320934.