COMPLECS: Using Regular Expressions with Linux Tools

Regular expressions (regexes) provide a way to identify strings that match a specified pattern. They are extremely useful for preprocessing text and extracting results from high-performance computing and data science workloads. Working primarily with the Linux grep utility, we incrementally introduce the main features of regexes: string literals, specifying multiple characters, quantifiers, wildcards, anchors, character classes, grouping, and alternation. We also explore more advanced topics such as word boundaries, lazy and greedy matching, regex flavors (basic, extended, and Perl-compatible), regexes with awk and sed, searching compressed files, and using large language models (LLMs) to create regexes.
A working knowledge of grep, and optionally awk and sed, will help you make the most of this session. We recommend attending the COMPLECS Linux tools for file processing webinar or reviewing the associated materials (recording, slides, GitHub repo) if you are not familiar with these topics.
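
To give a flavor of the features listed above, here is a minimal sketch using grep with extended regular expressions (the -E option). The log contents and variable names are hypothetical, made up for illustration, and are not taken from the session materials.

```shell
# Hypothetical sample data, invented for this illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
job 101 completed in 42 s
job 102 FAILED: out of memory
step 7 energy = -1.25e+03
job 103 completed in 7 s
EOF

# Anchor (^), string literals, and a quantifier (+) applied to a
# character class ([0-9]): lines that start with a completed job.
completed=$(grep -E '^job [0-9]+ completed' "$log")

# Grouping and alternation: count lines containing FAILED or ERROR.
failed_count=$(grep -c -E '(FAILED|ERROR)' "$log")

# -o prints only the matched text; -e marks the pattern explicitly
# because it begins with a hyphen. Extracts a number in scientific notation.
energy=$(grep -o -E -e '-?[0-9]+\.[0-9]+e[+-][0-9]+' "$log")

echo "$completed"
echo "failed: $failed_count"
echo "energy: $energy"

rm -f "$log"
```

The sketch stays within extended regexes so that it does not depend on Perl-compatible support (grep -P), which is not available in every grep build.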

---
COMPLECS (COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure) is a new SDSC training program covering the non-programming skills needed to use supercomputers effectively. Topics include parallel computing concepts, Linux tools and bash scripting, security, batch computing, how to get help, data management, and interactive computing. Each session offers 1 hour of instruction followed by a 30-minute Q&A. COMPLECS is supported by NSF award 2320934.

Instructor

Robert Sinkovits

Director - Scientific Computing Applications, SDSC

Robert Sinkovits, Ph.D., leads the scientific applications efforts at the San Diego Supercomputer Center. He has collaborated with researchers spanning many fields, including physics, chemistry, astronomy, structural biology, finance, ecology, climate, immunology, and the social sciences, always with an emphasis on making the most effective use of high-end computing resources. Before returning to SDSC, he was the primary developer of the AUTO3DEM and IHRSR++ software packages, used for solving icosahedral and helical macromolecular structures, respectively. He is co-PI for the NSF Comet and Voyager supercomputer awards, co-PI of the XSEDE project, and co-lead of the Human Vaccines Project’s Bioinformatics Hub. He’s also an avid cyclist and mountain climber, having summited nearly 400 peaks.

Questions?

Contact SDSC Events Coordinator