less_retarded_wiki/regex.md

46 lines
6.9 KiB
Markdown
Raw Normal View History

2023-11-05 22:10:49 +01:00
# Regular Expression
Regular expression (shortened *regex* or *regexp*) is a kind of mathematical [expression](expression.md), very often used in [programming](programming.md), that can be used to define simple patterns in [strings](string.md) of characters (usually text). Regular expressions are typically used for searching patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values), substitutions (replacement) of such patterns, describing [syntax](syntax.md) of computer languages, their [parsing](parsing.md) etc. (though they may also be used in more wild ways, e.g. for generating strings). Regular expression is itself a string of symbols which however describes potentially many (even [infinitely](infinite.md) many) other strings thanks to containing special symbols that may stand for repetition, alternative etc. For example `a.*.b` is a regular expression describing a string that starts with letter `a`, which is followed by a sequence of at least one character and then ends with `b` (so e.g. `aab`, `abbbb`, `acaccb` etc.).
2023-11-06 22:05:02 +01:00
WATCH OUT: do not confuse regular expressions with Unix [wildcards](wildcard.md) used in file names (e.g. `sourse/*.c` is a wildcard, not a regexp).
2023-11-05 22:10:49 +01:00
2023-11-06 22:05:02 +01:00
{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }
Regular expressions are widely used in [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially notable are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.
From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**, they are equivalent to the weakest models of computations such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe any possible pattern**, only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
## Details
WIP
There exist different standards and de-facto standards for regular expressions, some using different symbols, some having extra [syntactic sugar](syntactic_sugar.md) (which however usually only make the syntax more comfortable, NOT more computationally powerful) and features (typically e.g. so called *capture groups* that allow to extract specific subparts of given matched pattern). There are cases where a feature makes regexes more computationally powerful, namely the backreference `\n` present in extended regular expressions (source: *Backreferences in practical regular expressions, 2020*). Most relevant standards are probably [Posix](posix.md) and [Perl](perl.md) (with specific implementations sometimes adding their own flavor, e.g. [GNU](gnu.md), [Vim](vim.md) etc.): Posix specifies **basic** and **extended** regular expression (extended usually turned on with the `-E` CLI flag). The following table sums up the most common constructs used in regular expressions:
| construct | matches | availability | example |
| ------------- | ------------------------------------------- | -------------------------- | ----------------------------------------- |
| *char* | this exact character | everywhere | `a` matches `a` |
| `.` | any single character | everywhere | `.` matches `a`, `b`, `1` etc. |
| *expr*`*` | any number (even 0) of repeating *expr* | everywhere |`a*` matches *empty*, `a`, `aa`, `aaa`, ...|
| `^` | start of expression (usually start of line) | everywhere | `^a` matches `a` at the start of line |
| `$` | end of expression (usually end of line) | everywhere | `a$` matches `a` at the end of line |
| *expr*`+` | matches 1 or more repeating *expr* |escape (`\+`) in basic | `a+` matches `a`, `aa`, `aaa`, ... |
| *expr*`?` | matches 0 or 1 *expr* |escape (`\?`) in basic | `a?` matches either *empty* or `a` |
| `[`*S*`]` | matches anything character from set *S* | everywhere | `[abc]` matches `a`, `b` or `c` |
|`(`*expr*`)` | marks group (for capt. groups etc.) |escape (`\(`, `\)`) in basic| `a(bc)d` matches `abcd` with group `bc` |
|`[`*A*`-`*B*`]`| like `[` `]` but specifies a range | everywhere | `[3-5]` matches `3`, `4` and `5` |
| `[^`*S*`]` | matches any char. NOT from set *S* | everywhere | `[^abc]` matches `d`, `e`, `A`, `1` etc. |
|`{`*M*`,`*N*`}`| *M* to *N* repetitions of *expr* |escape (`\{`, `\}`) in basic| `a{2,4}` matches `aa`, `aaa`, `aaaa` |
|*exp1*`|`*exp2*| *exp1* or *exp2* | escape (`\|`) in basic | `abc|def` matches `abc` or `def` |
| `\`*n* |backref., *n*th matched group (starts with 1)| extended only | `(..).*\1` matches e.g. `ABcdefAB` |
|`[:alpha:]` | alphabetic, *a* to *z*, *A* to *Z* | Posix (GNU has `[[` `]]` | `[:alpha:]*` matches e.g. `abcDEF` |
|`[:alnum:]` | | same as above | |
|`[:digit:]` | | same as above | |
|`[:blank:]` | | same as above | |
|`[:lower:]` | | same as above | |
|`[:space:]` | | same as above | |
| `\w` | like `[:alnum:]` plus also `_` char. | Perl | |
| `\d` | digit, *0* to *9* | Perl | |
| `\s` | like `[:space:]` | Perl | |
TODO