Regular Expression

A regular expression (shortened regex or regexp) is a kind of mathematical expression, very often used in programming, that can be used to define simple patterns in strings of characters (usually text). Regular expressions are typically used for searching for patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values), substitution (replacement) of such patterns, describing the syntax of computer languages, parsing them etc. (though they may also be used in wilder ways, e.g. for generating strings). A regular expression is itself a string of symbols which however describes potentially many (even infinitely many) other strings thanks to containing special symbols that may stand for repetition, alternatives etc. For example a.*.b is a regular expression describing a string that starts with the letter a, which is followed by a sequence of at least one character and then ends with b (so e.g. aab, abbbb, acaccb etc.).
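The a.*.b example can be tried out with a short sketch. Python's re module is assumed here purely for illustration (the article does not prescribe a language); its syntax is closer to the Perl flavor, but this particular pattern reads the same in Posix tools. The helper name describes is made up for this example.

```python
import re

# The pattern from the text: a, then any characters, then one more character, then b.
PATTERN = re.compile(r"a.*.b")

def describes(s):
    """Return True if the whole string s is described by PATTERN."""
    return PATTERN.fullmatch(s) is not None

# The first three strings are the examples from the text, the rest are counterexamples.
for s in ["aab", "abbbb", "acaccb", "ab", "b", "aabc"]:
    print(s, describes(s))
```

Note that fullmatch requires the whole string to fit the pattern, which is why ab (nothing between a and b) and aabc (does not end with b) are rejected.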

WATCH OUT: do not confuse regular expressions with Unix wildcards used in file names (e.g. source/*.c is a wildcard, not a regexp).
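To make the difference concrete, here is a minimal sketch that compares the two notations, assuming Python's standard fnmatch module as a stand-in for shell wildcards and re for regular expressions. The file names are invented for the example.

```python
import fnmatch
import re

names = ["main.c", "util.c", "main.h", "c"]

# Wildcard: * alone stands for any sequence of characters.
by_wildcard = [n for n in names if fnmatch.fnmatchcase(n, "*.c")]

# Regular expression with the same meaning: .* is "any characters",
# and the dot before c has to be escaped to mean a literal dot.
by_regex = [n for n in names if re.fullmatch(r".*\.c", n)]

# Feeding the wildcard to the regex engine does not work: in Python's
# dialect a leading * has nothing to repeat and is rejected.
try:
    re.compile("*.c")
    wildcard_is_valid_regex = True
except re.error:
    wildcard_is_valid_regex = False

print(by_wildcard, by_regex, wildcard_is_valid_regex)
```

Other regex implementations may treat a leading * differently (for example as a literal character), but in no case does it mean what the wildcard means.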

{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }

Regular expressions are widely used in Unix tools, programming languages, editors etc. Especially notable are grep (searches for patterns in files), sed (text processor, often used for search and replacement of patterns), awk, Perl, Vim etc.
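The two most typical uses, searching and substitution, can be sketched as follows. This assumes Python's re module rather than grep or sed themselves, and the input lines are made up; the comments only point out which Unix tool each step loosely resembles.

```python
import re

lines = ["width = 640", "title = demo", "height = 480"]

# grep-like search: keep the lines that contain a run of digits.
matching = [ln for ln in lines if re.search(r"[0-9]+", ln)]

# grep -o like: keep only the matched parts.
numbers = [m.group(0) for ln in lines for m in re.finditer(r"[0-9]+", ln)]

# sed-like substitution: replace every run of digits with N.
replaced = [re.sub(r"[0-9]+", "N", ln) for ln in lines]

print(matching)
print(numbers)
print(replaced)
```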

From the point of view of theoretical computer science and formal languages regular expressions are computationally weak: they are equivalent to the weakest models of computation such as regular grammars or finite state machines -- in fact regular expressions are often implemented as finite state machines. This means that regular expressions can NOT describe every possible pattern, only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a good enough tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
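The equivalence with finite state machines can be illustrated by a small hand-written example. The sketch below is not how any particular regex library is implemented; it only shows a deterministic finite state machine, written by hand, that accepts exactly the strings described by the regular expression ab*c, and compares it against Python's re module. The names TRANSITIONS, ACCEPTING and accepts are chosen for this example.

```python
import re

# States: 0 = start, 1 = seen a (and any number of b), 2 = seen the final c.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "b"): 1,
    (1, "c"): 2,
}
ACCEPTING = {2}

def accepts(s):
    """Run the state machine over s and report whether it ends in an accepting state."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no transition defined: the machine rejects
            return False
    return state in ACCEPTING

# The machine and the regular expression should agree on every input.
samples = ["ac", "abc", "abbbc", "", "a", "abca", "bc"]
agrees = all(accepts(s) == (re.fullmatch(r"ab*c", s) is not None) for s in samples)
print(agrees)
```

The machine needs only a fixed, finite number of states and no additional memory, which is exactly why patterns that require counting without bound (e.g. properly nested parentheses) are out of reach.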

Details

WIP

There exist different standards and de-facto standards for regular expressions, some using different symbols, some having extra syntactic sugar (which however usually only makes the syntax more comfortable, NOT more computationally powerful) and features (typically e.g. so called capture groups that allow extracting specific subparts of a given matched pattern). There are cases where a feature does make regexes more computationally powerful, namely the backreference \n (source: Backreferences in practical regular expressions, 2020); Posix specifies it for basic regular expressions and e.g. GNU tools also accept it in extended ones. The most relevant standards are probably Posix and Perl (with specific implementations sometimes adding their own flavor, e.g. GNU, Vim etc.): Posix specifies basic and extended regular expressions (extended ones are usually turned on with the -E CLI flag). The following table sums up the most common constructs used in regular expressions:

construct    matches                                       availability                  example
---------    -------                                       ------------                  -------
char         this exact character                          everywhere                    a matches a
.            any single character                          everywhere                    . matches a, b, 1 etc.
expr*        any number (even 0) of repetitions of expr    everywhere                    a* matches empty, a, aa, aaa, ...
^            start of expression (usually start of line)   everywhere                    ^a matches a at the start of line
$            end of expression (usually end of line)       everywhere                    a$ matches a at the end of line
expr+        1 or more repetitions of expr                 escape (\+) in basic          a+ matches a, aa, aaa, ...
expr?        0 or 1 occurrences of expr                    escape (\?) in basic          a? matches either empty or a
[S]          any single character from set S               everywhere                    [abc] matches a, b or c
(expr)       marks a group (for capture groups etc.)       escape (\(, \)) in basic      a(bc)d matches abcd with group bc
[A-B]        like [S] but specifies a range                everywhere                    [3-5] matches 3, 4 and 5
[^S]         any single character NOT from set S           everywhere                    [^abc] matches d, e, A, 1 etc.
{M,N}        M to N repetitions of expr                    escape (\{, \}) in basic      a{2,4} matches aa, aaa, aaaa
e1|e2        e1 or e2                                      escape (\|) in basic          ab|cd matches ab or cd
\n           backreference, nth matched group (from 1)     basic, GNU also in extended   (..).*\1 matches e.g. ABcdefAB
[:alpha:]    alphabetic, a to z, A to Z                    Posix, inside [ ]             [[:alpha:]]* matches e.g. abcDEF
[:alnum:]    alphanumeric, letters and digits              same as above
[:digit:]    digit, 0 to 9                                 same as above
[:blank:]    space or tab                                  same as above
[:lower:]    lowercase letter, a to z                      same as above
[:space:]    whitespace (space, tab, newline etc.)         same as above
\w           like [:alnum:] plus also the _ character      Perl
\d           digit, 0 to 9                                 Perl
\s           like [:space:]                                Perl
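
Some rows of the table can be checked mechanically. The sketch below assumes Python's re module, whose syntax is close to the Perl flavor: +, ?, ( ), { } and | are written without escaping, and the Posix classes such as [:alpha:] are not supported, so they are left out. CASES and check are illustrative names, not part of any standard.

```python
import re

# (pattern, strings that should match fully, strings that should not)
CASES = [
    (r"a*",       ["", "a", "aaa"],      ["b"]),
    (r"a+",       ["a", "aa"],           [""]),
    (r"a?",       ["", "a"],             ["aa"]),
    (r"[abc]",    ["a", "b", "c"],       ["d"]),
    (r"[3-5]",    ["3", "4", "5"],       ["2", "6"]),
    (r"[^abc]",   ["d", "A", "1"],       ["a"]),
    (r"a{2,4}",   ["aa", "aaa", "aaaa"], ["a", "aaaaa"]),
    (r"ab|cd",    ["ab", "cd"],          ["ad"]),
    (r"(..).*\1", ["ABcdefAB"],          ["ABcdefBA"]),
    (r"\d\w\s",   ["1_ ", "7a\t"],       ["a1 "]),
]

def check(cases):
    """Return a list of (pattern, string, problem) for every expectation that fails."""
    failures = []
    for pattern, good, bad in cases:
        for s in good:
            if re.fullmatch(pattern, s) is None:
                failures.append((pattern, s, "should match"))
        for s in bad:
            if re.fullmatch(pattern, s) is not None:
                failures.append((pattern, s, "should not match"))
    return failures

# Capture group: extract the part matched by the parentheses.
group = re.fullmatch(r"a(bc)d", "abcd").group(1)

# Anchors are checked with search, which looks anywhere in the string.
starts_with_a = re.search(r"^a", "ab") is not None
ends_with_a = re.search(r"a$", "ba") is not None

print(check(CASES), group, starts_with_a, ends_with_a)
```

An empty list from check means every expectation held; with grep the same patterns would need the escaping listed in the availability column unless -E is used.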

TODO