Regular expressions (regexes) provide a way to identify strings that match a specified pattern. They are extremely useful for preprocessing text and for extracting results from the output of high-performance computing and data science workloads. Working primarily with the Linux grep utility, we introduce the main features of regexes step by step: string literals, specifying multiple characters, quantifiers, wildcards, anchors, character classes, grouping, and alternation. We also explore more advanced topics such as word boundaries, lazy and greedy matching, regex flavors (basic, extended, and Perl compatible), regexes with awk and sed, searching compressed files, and using large language models (LLMs) to create regexes.
A working knowledge of grep, and optionally awk and sed, will help you make the most of this session. We recommend attending the COMPLECS Linux tools for file processing webinar or reviewing the associated materials (recording, slides, GitHub repo) if you are not familiar with these topics.
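As a rough preview of a few of the features listed above, the following sketch runs grep with extended regexes (the -E option) against a small, made-up log file. The log lines, the variable names, and the choice of patterns are illustrative assumptions and are not taken from the session materials.

```shell
#!/usr/bin/env bash
# Hypothetical sample data: made-up log lines used only for illustration.
sample=$(mktemp)
cat > "$sample" <<'EOF'
job 1042 finished in 37.5 s
job 1043 failed: out of memory
step 7 completed in 120 s
JOB 1044 finished in 8.25 s
EOF

# String literal: lines containing the exact text "failed".
literal=$(grep 'failed' "$sample")

# Anchor, character class, and quantifier: lines that start with
# "job", a space, and one or more digits (matching is case sensitive).
anchored=$(grep -E '^job [0-9]+' "$sample")

# Grouping and alternation: count lines with "finished in" or "completed in".
alternation=$(grep -cE '(finished|completed) in' "$sample")

# Optional group and end anchor: print only the trailing timing,
# an integer or decimal number followed by " s" at the end of the line.
timings=$(grep -oE '[0-9]+(\.[0-9]+)? s$' "$sample")

rm -f "$sample"

printf 'literal:\n%s\n\n' "$literal"
printf 'anchored:\n%s\n\n' "$anchored"
printf 'alternation count: %s\n\n' "$alternation"
printf 'timings:\n%s\n' "$timings"
```

The sketch uses only extended-regex syntax, so it does not depend on Perl-compatible matching being available in the installed grep.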
---
COMPLECS (COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure) is a new SDSC program that provides training in the non-programming skills needed to use supercomputers effectively. Topics include parallel computing concepts, Linux tools and bash scripting, security, batch computing, how to get help, data management, and interactive computing. Each session offers 1 hour of instruction followed by a 30-minute Q&A. COMPLECS is supported by NSF award 2320934.