This commit is contained in:
Miloslav Ciz 2025-04-03 21:49:43 +02:00
parent 490ffab10e
commit ee83d8a6b6
38 changed files with 2053 additions and 2030 deletions


@ -1,16 +1,16 @@
# Regular Expression
Regular expression (shortened *regex* or *regexp*) is a kind of [mathematical](math.md) [expression](expression.md), very often used in [programming](programming.md), that can be used to define simple patterns in [strings](string.md) of characters (usually text). Regular expressions are typically used for searching patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values), substitutions (replacement) of such patterns, describing [syntax](syntax.md) of computer languages, their [parsing](parsing.md) etc. (though they may also be used in more wild ways, e.g. for generating strings). Regular expression is itself a string of symbols which however describes potentially many (even [infinitely](infinite.md) many) other strings thanks to containing special symbols that may stand for repetition, alternative etc. For example `a.*.b` is a regular expression describing a string that starts with letter `a`, which is followed by a sequence of at least one character and then ends with `b` (so e.g. `aab`, `abbbb`, `acaccb` etc.).
A regular expression (shortened *regex* or *regexp*) is a kind of [mathematical](math.md) [expression](expression.md), plentifully used in [programming](programming.md), that defines simple patterns in [strings](string.md) of characters (usually text). Regular expressions are typically used for searching patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values or web [URLs](url.md)), substitutions (replacement) of such patterns, describing [syntax](syntax.md) of computer languages, their [parsing](parsing.md) etc. (though more creative uses aren't out of the question either, e.g. generating [random](randomness.md) strings). A regular expression is itself a string of symbols which however describes potentially many (even [infinitely](infinite.md) many) other strings thanks to containing special symbols that may stand for repetition, alternative etc. For example `a.*.b` is a regular expression describing a string that starts with the letter `a`, which is followed by a sequence of at least one character and then ends with `b` (so e.g. `aab`, `abbbb`, `acaccb` etc.).
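To try the `a.*.b` example in practice, here is a minimal sketch in C that relies on the POSIX `regcomp`/`regexec` functions from `<regex.h>` (so it assumes a POSIX system). The helper name `matches_whole` is made up for this illustration, and it expects the caller to add the `^` and `$` anchors so that the whole string has to match rather than just some part of it.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if the text matches the given POSIX basic
   regular expression, 0 if it doesn't and -1 if the pattern can't be
   compiled. The pattern should be anchored with ^ and $ by the caller if
   the whole text is supposed to match. */
int matches_whole(const char *pattern, const char *text)
{
  regex_t compiled;
  int result;

  /* REG_NOSUB: we only want a yes/no answer, not match positions */
  if (regcomp(&compiled, pattern, REG_NOSUB) != 0)
    return -1;

  result = regexec(&compiled, text, 0, NULL, 0) == 0;
  regfree(&compiled);

  return result;
}
```

With this, `matches_whole("^a.*.b$","aab")` should return 1, while `matches_whole("^a.*.b$","ab")` should return 0, because nothing is left to stand between `a` and `b`.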
WATCH OUT: do not confuse regular expressions with [Unix](unix.md) [wildcards](wildcard.md) used in file names (e.g. `sourse/*.c` is a wildcard, not a regexp).
WATCH OUT: be careful not to confuse regular expressions with [Unix](unix.md) [wildcards](wildcard.md) used in file names (e.g. `sourse/*.c` is a wildcard, not a regexp).
{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }
Regular expressions are widely used in [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially notable are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.
Regular expressions are encountered in many [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially worthy of mention are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.
From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**: they are equivalent to the weakest models of computation such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** (both [deterministic](deterministic.md) and nondeterministic) -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe every possible pattern** (for example they can't capture a math expression with arbitrarily deeply nested brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
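To illustrate the limitation: checking nested brackets means remembering how deep we currently are, and since the depth has no upper bound, a fixed finite number of states can't keep track of it. A minimal sketch in C of what is needed instead follows; the function name `brackets_balanced` is made up here, and in practice the counter is of course a finite integer, so take it only as a demonstration of the idea.

```c
/* Hypothetical helper: returns 1 if every '(' in the text has a matching ')'
   and no ')' appears without a preceding unmatched '(', otherwise returns 0.
   All other characters are ignored. */
int brackets_balanced(const char *text)
{
  unsigned long depth = 0; /* current nesting depth, unbounded in theory */

  for (; *text != 0; ++text)
  {
    if (*text == '(')
      depth++;
    else if (*text == ')')
    {
      if (depth == 0)
        return 0; /* closing bracket with nothing open */

      depth--;
    }
  }

  return depth == 0; /* everything opened must have been closed */
}
```

The `depth` variable is exactly the piece of memory a finite state machine lacks: each possible depth would need its own state, and there are infinitely many of them.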
**Are there yet simpler pattern describers than regular expressions?** Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a *fixed string*. Notable subclass of regular expressions are so called *star-free* languages/expressions which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a [simpler](kiss.md) variant to regular expressions, they may still describe many patterns and are easier to implement.
**Are there yet simpler pattern describers than regular expressions?** Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a *fixed string*. Next we can think of a case-insensitive pattern, so "abc" would match "abc", "ABC", "AbC" etc. A notable subclass of regular expressions are the so-called *star-free* languages/expressions, which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a [simpler](kiss.md) variant of regular expressions; they may still describe many patterns and are easier to implement.
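To show how little code these simplest matchers need: a fixed string can be checked with the standard `strcmp`, and a case-insensitive variant may look like the following sketch in C. The name `equals_ignore_case` is made up for this illustration, and only letters handled by the standard `tolower` (in the default locale basically ASCII) are assumed.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if both strings are equal when letter case
   is ignored, otherwise returns 0. */
int equals_ignore_case(const char *a, const char *b)
{
  while (*a != 0 && *b != 0)
  {
    /* cast to unsigned char, tolower is undefined for negative values */
    if (tolower((unsigned char) *a) != tolower((unsigned char) *b))
      return 0;

    a++;
    b++;
  }

  return *a == 0 && *b == 0; /* both strings must end at the same place */
}
```

So `equals_ignore_case("abc","AbC")` should return 1, while `equals_ignore_case("abc","abcd")` should return 0.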
## Details
@ -155,7 +155,7 @@ start --->| outside_tag |------>| inside_tag |
'-->|______|
```
Here we start in the `outside_tag` state and move between states depending on what characters we read from the input string we are checking (indicated next to the arrows). If we end up in the `outside_tag` state state again (marked as *accepting* state) when all is read, the input string matched the regular expression, otherwise it didn't. We'll translate this automaton to a C program:
Here we start in the `outside_tag` state and move between states depending on what characters we read from the input string we are analyzing (indicated next to the arrows). If we end up in the `outside_tag` state again (marked as *accepting* state) when all is read, the input string matched the regular expression, otherwise it didn't. We'll translate this automaton to a C program:
```
#include <stdio.h>
@ -208,3 +208,8 @@ Just compile this and pass a string to the standard input (e.g. `echo "<testing>
Maybe it seems a bit overcomplicated -- you could say you could program the above even without regular expressions and state machines. That's true, however imagine dealing with a more complex regex, one that matches a quite complex real world file format. Consider that in [HTML](html.md) for example there are pair tags, non-pair tags, attributes inside tags, entities, comments and many more things, so here you'd have great difficulties creating such a parser intuitively -- the approach we have shown can be completely automated and will work as long as you can describe the format with a regular expression.
TODO: regexes in some langs. like Python
## See Also
- [wildcard](wildcard.md)
- [formal language](formal_language.md)