Update

2025-04-19 23:59:35 +02:00 · 2025-04-19 23:59:35 +02:00 · 165d7890e6
commit 165d7890e6
parent d666a3d2b5
23 changed files with 2150 additions and 1985 deletions
--- a/regex.md
+++ b/regex.md
@@ -8,13 +8,13 @@ WATCH OUT: be careful not to confuse regular expressions with [Unix](unix.md) [w

 Regular expressions are encountered in many [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially worthy of mention are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.

-From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**, they are equivalent to the weakest models of computations such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** (both [deterministic](deterministic.md) and nondeterministic) -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe any possible pattern** (for example they can't capture a math expression with nested brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
+From the viewpoint of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**: they are equivalent to the weakest models of computation such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** (both [deterministic](deterministic.md) and nondeterministic) -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe every possible pattern** (for example they can't capture a math expression with arbitrarily deeply nested brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
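To make the nested-bracket limitation concrete, here is a minimal Python sketch (not part of the original article; the names `brackets_balanced` and `DEPTH_ONE` are made up for illustration). Checking arbitrarily deep nesting needs an unbounded counter, which a finite state machine does not have, whereas a regular expression can only be written for some fixed maximum depth:

```python
import re

def brackets_balanced(text):
    """Check bracket nesting with a counter (hypothetical helper)."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closing bracket with no matching opener
                return False
    return depth == 0

# A regex written for nesting depth of at most 1: bracket-free text
# optionally interleaved with "(...)" groups containing no further brackets.
DEPTH_ONE = re.compile(r"[^()]*(\([^()]*\)[^()]*)*")

print(brackets_balanced("(1+(2*3))"))          # True, the counter handles any depth
print(bool(DEPTH_ONE.fullmatch("(1+2)*3")))    # True, depth 1 is covered
print(bool(DEPTH_ONE.fullmatch("(1+(2*3))")))  # False, depth 2 is not
```

The regex could be extended to depth 2, 3 and so on, but each extra level has to be spelled out by hand, so no single pattern of this kind covers all depths.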

 **Are there yet simpler pattern describers than regular expressions?** Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a *fixed string*. Next we can think of a case-insensitive pattern, so "abc" would match "abc", "ABC", "AbC" etc. A notable subclass of regular expressions are the so-called *star-free* languages/expressions, which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a [simpler](kiss.md) variant of regular expressions; they may still describe many patterns and are easier to implement.

 ## Details

-WIP
+OK, let's now dive into how exactly regular expressions work, shall we? Imagine a regexp as an extension of a fixed string pattern (a string describing exactly itself, i.e. "abc" describes just the string "abc"). The extension lies in giving certain characters a special meaning -- the most common are `.` (dot) and `*` (asterisk). Dot means "any character is allowed here", so "a.c" will describe strings "aac", "abc", "acc" etc. Asterisk means "the previous character repeated any number of times (even zero)", so "abc*" describes strings "ab", "abc", "abcc", "abccc" etc. If we want a regex to contain a special character as such, without its special meaning, we have to escape it -- for example "a\.c" will describe just the string "a.c". This is a super swift high-altitude introduction, more detail will follow.
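The three rules above (dot, asterisk, escaping) can be tried out with a short Python sketch. This is an illustration only, not part of the original article: it assumes Python's built-in `re` module, which uses a Perl-style flavor, and the helper name `describes` is made up here. `re.fullmatch` is used so that the pattern has to describe the whole string, not just a part of it.

```python
import re

def describes(pattern, text):
    """True if the whole text is described by the pattern (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Dot: any single character is allowed in that position.
print(describes(r"a.c", "abc"))     # True
print(describes(r"a.c", "ac"))      # False, the dot needs exactly one character

# Asterisk: the previous character repeated any number of times, even zero.
print(describes(r"abc*", "ab"))     # True
print(describes(r"abc*", "abccc"))  # True

# Escaping: the backslash removes the special meaning of the dot.
print(describes(r"a\.c", "a.c"))    # True
print(describes(r"a\.c", "abc"))    # False
```

Note that in Python the dot does not match a newline character unless the `re.DOTALL` flag is given.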

 There exist different standards and de-facto standards for regular expressions, some using different symbols, some having extra [syntactic sugar](syntactic_sugar.md) (which however usually only makes the syntax more comfortable, NOT more computationally powerful) and features (typically e.g. so called *capture groups* that allow extracting specific subparts of a given matched pattern). There are cases where a feature makes regexes more computationally powerful, namely the backreference `\n` present in extended regular expressions (source: *Backreferences in practical regular expressions, 2020*). The most relevant standards are probably [Posix](posix.md) and [Perl](perl.md) (with specific implementations sometimes adding their own flavor, e.g. [GNU](gnu.md), [Vim](vim.md) etc.): Posix specifies **basic** and **extended** regular expressions (extended usually turned on with the `-E` CLI flag). The following table sums up the most common constructs used in regular expressions: