Using Regular Expressions with Linux Tools (COMPLECS)

Remote event

Regular expressions (regexes) provide a way to identify strings that match a specified pattern. They are extremely useful for preprocessing text and extracting results from high-performance computing and data science workloads. Working primarily with the Linux grep utility, we introduce the main features of regexes incrementally: string literals, specifying multiple characters, quantifiers, wildcards, anchors, character classes, grouping, and alternation. We also explore more advanced topics, such as word boundaries, lazy and greedy matching, regex flavors (basic, extended, and Perl-compatible), regexes with awk and sed, searching compressed files, and using large language models (LLMs) to create regexes.

A working knowledge of grep, and optionally awk and sed, will help you make the most of this session. If you are not familiar with these tools, we recommend attending the COMPLECS "Linux tools for file processing" webinar or reviewing the associated materials (recording, slides, GitHub repo).

--- 
COMPLECS (COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure) is a new SDSC training program that covers the non-programming skills needed to use supercomputers effectively. Topics include parallel computing concepts, Linux tools and bash scripting, security, batch computing, how to get help, data management, and interactive computing. Each session offers 1 hour of instruction followed by a 30-minute Q&A. COMPLECS is supported by NSF award 2320934.

Instructor

Fernando Garzon

Computational and Data Science Research Specialist, SDSC

Fernando Garzon has a background in experimental physics and later moved into software development and operations. During an internship at Fermilab, he worked on the transfer team for the CMS experiment at the LHC. Fernando is currently with TSCC, where he contributes to user support and the new TSCC Open OnDemand portal. He is also involved in software development for the Open Science Chain project, focusing on developing new functionality and enhancing the CI pipeline for quality assurance. Fernando is completing his master's program in Software Engineering. His professional interests include DevOps, automation, cloud computing, and Infrastructure as Code (IaC).