less_retarded_wiki/regex.md

211 lines
17 KiB
Markdown
Raw Normal View History

2023-11-05 22:10:49 +01:00
# Regular Expression
Regular expression (shortened *regex* or *regexp*) is a kind of mathematical [expression](expression.md), very often used in [programming](programming.md), that can be used to define simple patterns in [strings](string.md) of characters (usually text). Regular expressions are typically used for searching patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values), substitutions (replacement) of such patterns, describing [syntax](syntax.md) of computer languages, their [parsing](parsing.md) etc. (though they may also be used in more wild ways, e.g. for generating strings). Regular expression is itself a string of symbols which however describes potentially many (even [infinitely](infinite.md) many) other strings thanks to containing special symbols that may stand for repetition, alternative etc. For example `a.*.b` is a regular expression describing a string that starts with letter `a`, which is followed by a sequence of at least one character and then ends with `b` (so e.g. `aab`, `abbbb`, `acaccb` etc.).
2023-11-06 22:05:02 +01:00
WATCH OUT: do not confuse regular expressions with Unix [wildcards](wildcard.md) used in file names (e.g. `sourse/*.c` is a wildcard, not a regexp).
2023-11-05 22:10:49 +01:00
2023-11-06 22:05:02 +01:00
{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }
Regular expressions are widely used in [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially notable are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.
2024-03-10 02:02:07 +01:00
From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**, they are equivalent to the weakest models of computations such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe any possible pattern** (for example they can't capture a math expression with brackets in which start brackets have to match end brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
2023-11-06 22:05:02 +01:00
2024-03-23 20:01:25 +01:00
**Are there yet simpler pattern describers than regular expressions?** Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a *fixed string*. Notable subclass of regular expressions are so called *star-free* languages/expressions which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a [simpler](kiss.md) variant to regular expressions, they may still describe many patterns and are easier to implement.
2023-11-06 22:05:02 +01:00
## Details
WIP
There exist different standards and de-facto standards for regular expressions, some using different symbols, some having extra [syntactic sugar](syntactic_sugar.md) (which however usually only make the syntax more comfortable, NOT more computationally powerful) and features (typically e.g. so called *capture groups* that allow to extract specific subparts of given matched pattern). There are cases where a feature makes regexes more computationally powerful, namely the backreference `\n` present in extended regular expressions (source: *Backreferences in practical regular expressions, 2020*). Most relevant standards are probably [Posix](posix.md) and [Perl](perl.md) (with specific implementations sometimes adding their own flavor, e.g. [GNU](gnu.md), [Vim](vim.md) etc.): Posix specifies **basic** and **extended** regular expression (extended usually turned on with the `-E` CLI flag). The following table sums up the most common constructs used in regular expressions:
| construct | matches | availability | example |
| ------------- | ------------------------------------------- | -------------------------- | ----------------------------------------- |
| *char* | this exact character | everywhere | `a` matches `a` |
| `.` | any single character | everywhere | `.` matches `a`, `b`, `1` etc. |
| *expr*`*` | any number (even 0) of repeating *expr* | everywhere |`a*` matches *empty*, `a`, `aa`, `aaa`, ...|
| `^` | start of expression (usually start of line) | everywhere | `^a` matches `a` at the start of line |
| `$` | end of expression (usually end of line) | everywhere | `a$` matches `a` at the end of line |
| *expr*`+` | matches 1 or more repeating *expr* |escape (`\+`) in basic | `a+` matches `a`, `aa`, `aaa`, ... |
| *expr*`?` | matches 0 or 1 *expr* |escape (`\?`) in basic | `a?` matches either *empty* or `a` |
| `[`*S*`]` | matches anything character from set *S* | everywhere | `[abc]` matches `a`, `b` or `c` |
|`(`*expr*`)` | marks group (for capt. groups etc.) |escape (`\(`, `\)`) in basic| `a(bc)d` matches `abcd` with group `bc` |
|`[`*A*`-`*B*`]`| like `[` `]` but specifies a range | everywhere | `[3-5]` matches `3`, `4` and `5` |
| `[^`*S*`]` | matches any char. NOT from set *S* | everywhere | `[^abc]` matches `d`, `e`, `A`, `1` etc. |
|`{`*M*`,`*N*`}`| *M* to *N* repetitions of *expr* |escape (`\{`, `\}`) in basic| `a{2,4}` matches `aa`, `aaa`, `aaaa` |
2023-11-07 21:39:59 +01:00
|*e1*<code>&#124;</code>*e2*| *e1* or *e2* | escape in basic |<code>ab&#124;cd</code> match. `ab` or `cd`|
2023-11-06 22:05:02 +01:00
| `\`*n* |backref., *n*th matched group (starts with 1)| extended only | `(..).*\1` matches e.g. `ABcdefAB` |
2024-01-07 02:43:35 +01:00
|`[:alpha:]` | alphabetic, *a* to *z*, *A* to *Z* | Posix (GNU has `[[` `]]`) | `[:alpha:]*` matches e.g. `abcDEF` |
2023-11-06 22:05:02 +01:00
|`[:alnum:]` | | same as above | |
|`[:digit:]` | | same as above | |
|`[:blank:]` | | same as above | |
|`[:lower:]` | | same as above | |
|`[:space:]` | | same as above | |
| `\w` | like `[:alnum:]` plus also `_` char. | Perl | |
| `\d` | digit, *0* to *9* | Perl | |
| `\s` | like `[:space:]` | Perl | |
2024-03-04 23:14:06 +01:00
## Examples And Fun
Here we'll demonstrate some practical uses of regular expressions. Most common [Unix](unix.md) tools associated with regular expressions are probably **[grep](grep.md)** (for searching) and **[sed](sed.md)** (for replacing).
The most basic use case is you just wanting to **search** for some pattern in a file, i.e. for example you are looking for all [IP addresses](ip_address.md) in a file, for a certain exact word inside source code comment etc.
The following uses grep to find and count all occurences of the word `capitalism` or `capitalist` (disregarding case with the `-i` flag) in a plain text version of [this wiki](lrs_wiki.md) and passes them to be counted with *wc*.
`grep -i -o "capitalis[mt]" ~/Downloads/lrs_wiki.txt | wc -l`
We find out there are `829` such occurrences.
Of course, quite frequently you may want to see the lines that match along with files and line numbers, try also e.g. `grep -m 10 -H -n "Jesus" ~/Downloads/lrs_wiki.txt`.
Now let's search for things that suck with (`-o` prints out just the matches instead of whole line, `-m 10` limits the output to 10 results at most):
```
grep -o -m 10 "[^ ]* \(sucks\|is shit\)" ~/Downloads/lrs_wiki.txt
```
Currently we get the following output:
```
body sucks
OS sucks
everything sucks
Everything is shit
language sucks
it sucks
D&D sucks
writing sucks
Fediverse sucks
This sucks
```
Now let's try **replacing** stuff with *sed* -- this is done with a very common format (which you should remember as it's often used in common speech) `s/PATTERN/REPLACE/g` where `PATTERN` is the regular expression to match, `REPLACE` is a string with which to replace the pattern (`s` stands for substitute and `g` for global, i.e. "replace all"). Let's say we are retarded and obsessed with [muh privacy](privacy.md) and want to censor all names in a portion of the wiki we want to print, so we'll just replace all words composed of letters that start with uppercase letter (and continue with lowercase letters) -- this will also censor other words than names but let's [keep it simple](kiss.md) for now. The command may look something like:
```
cat ~/Downloads/lrs_wiki.txt | tail -n +5003 | head -n 10 | sed "s/[A-Z][a-z]\+/<BEEP>/g"
```
Here we may get e.g.:
```
they typically have a personal and not easy to describe faith. [19]<BEEP>
was a <BEEP>. [20]<BEEP> often used the word "[21]<BEEP>" instead of
"nature" or "universe"; even though he said he didn't believe in the
traditional personal <BEEP>, he also said that the laws of physics were like
books in a library which must have obviously been written by someone or
something we can't comprehend. [22]<BEEP> <BEEP> said he was "deeply
religious, though not in the orthodox sense". <BEEP> are also very hardcore
religious people such as [23]<BEEP> <BEEP>, the inventor of [24]<BEEP>
language, who even planned to be a <BEEP> missionary. <BEEP> "true
atheists" are mostly second grade "scientists" who make career out of the
```
A more advanced feature is what we call **capture groups** that allow us to reuse parts of the matched pattern in the replacement string -- this is needed in some cases, for example if you just want to insert some extra characters in the pattern. Capture groups are parts of the pattern inside brackets (`(` and `)` which sometimes have to be escaped to `\(` and `\)`); in the replacement string we then reference them with `\1` (first group), `\2` (second group) etc. Let's demonstrate this on the following example that will highlight all four letter words:
```
cat ~/Downloads/lrs_wiki.txt | tail -n +8080 | head -n 7 | sed "s/ \([^ ]\)\([^ ]\)\([^ ]\)\([^ ]\) / \!\!\!\1-\2-\3-FUCKING-\4\!\!\! /g"
```
The result may look something like this:
```
Bootstrapping as a general principle can aid us in creation of extremely
!!!f-r-e-FUCKING-e!!! technology by greatly minimizing all its [7]dependencies, we are able
to create a small amount of !!!c-o-d-FUCKING-e!!! that !!!w-i-l-FUCKING-l!!! self-establish our whole
computing environment !!!w-i-t-FUCKING-h!!! only !!!v-e-r-FUCKING-y!!! small effort during the process. The
topic mostly revolves around designing [8]programming language
[9]compilers, but in general we may be talking about bootstrapping whole
computing environments, operating systems etc.
```
Now a pretty rare use case is **generating [random](randomness.md) patterns** -- we can imagine we have a regular expression describing for example a valid username in a game and for some reason we want to generate 1000 random strings that match this pattern (e.g. for bots). Now going into depth on this topic could take a long time because e.g. considering [probability distributions](probability_distribution.md) we may get into some mathematical rabbit holes (considering that for example the regex `.*` matches an arbitrarily long string, what will be the average length of a string randomly generated by this pattern? 10? 1000? 1 billion?). Anyway let's leave this aside -- if we do, there is actually a quite simple and natural way of generating random patterns from regular expressions. We can convert the regexp into [nondeterministic finite automaton](nfa.md), the make a [random traverse](random_walk.md) of the graph and generate the string along the way. There don't even seem to be many Unix tools for doing this -- at the time of writing this one of the simplest way seems to be with [Perl](perl.md) and one of its libraries, which currently still has some great limitations (no groups, no special characters in square brackets, ...), but it's better than nothing. The command we'll be using is:
```
perl -e "use String::Random; for (1..20) { print String::Random->new->randregex('REGEX') . \"\n\"; }" 2>/dev/null
```
Here are some strings generated with different `REGEX`es:
- `....*`: `\n\"; }"`, `/dB|^hRR/3za~`, `![5q-%ONK`, `"oJT.UzSa}0`, `t"}Yq]sWZjIv`, `Fq<xs~_e`, `H=5yt9q<>29XW`, `<EoVf)ORH{m`, ...
- `[A-Z][a-z]{3,8}`: `Ponic`, `Xwawo`, `Hgrtky`, `Brmcitpsw`, `Qrogdze`, `Olhyb`, `Gqeelz`, `Igppehljz`, `Azrdava`, ...
- `[:;=8B][-o]?[)({}/|PS]`: `;o}`, `:-)`, `;(`, `B(`, `8o)`, `=/`, `:o|`, `=}`, `=-(`, ...
- `[lL][oue]+lz?`: `Loeeolz`, `Luel`, `luuuolz`, `lol`, `Leelz`, `Leoeuoeoueulz`, `luueeoolz`, ...
- ...
2024-03-10 02:02:07 +01:00
Let's now try to **[program](programming.md)** a very simple regular expression in [C](c.md). You can do this in quite fancy ways, serious regex libraries will typically let you specify arbitrary regular expression with a string at runtime (for example `char *myRegex = "(abc|ABC).*d+";`), then compile it to some fast, efficient representation like the mentioned state machine and use that for matching and replacing patterns. We'll do nothing like that here as that's too complex, we will simply make a program that has one hard wired regular expression and it will just say if given input string matches or not. Let's consider the following regular expression:
```
(<[^<>]*>|[^<>]*)*
```
It describes an "[XML](xml.md)"-like text; the text can contain tags that start with `<` and end with `>`, but there mustn't e.g. be a tag inside another tag. For example `<hello> what <world>` will match, but `hello > world << bruh` won't match. OK, so the first thing to do is to convert the regular expression to a [finite state automaton](finite_state_automaton.md) -- this can be done intuitively but there is also an exact algorithm that can do this with any regular expression (look it up if you need it). Our automaton will look like this:
```
.---. .---.
| | else | | else
____V___|____ '>' ____V___|___
| (accept) |<------| |
start --->| outside_tag |------>| inside_tag |
|_____________| '<' |____________|
| |
| '>' '<' |
__V___ |
.---| | |
any | | fail |<------------'
'-->|______|
```
Here we start in the `outside_tag` state and move between states depending on what characters we read from the input string we are checking (indicated next to the arrows). If we end up in the `outside_tag` state state again (marked as *accepting* state) when all is read, the input string matched the regular expression, otherwise it didn't. We'll translate this automaton to a C program:
```
#include <stdio.h>
#define STATE_OUTSIDE_TAG 0
#define STATE_INSIDE_TAG 1
#define STATE_FAIL 2
int main(void)
{
int state = STATE_OUTSIDE_TAG;
while (1)
{
int c = getchar();
if (c == EOF)
break;
switch (state)
{
case STATE_OUTSIDE_TAG:
if (c == '<')
state = STATE_INSIDE_TAG;
else if (c == '>')
state = STATE_FAIL;
break;
case STATE_INSIDE_TAG:
if (c == '>')
state = STATE_OUTSIDE_TAG;
else if (c == '<')
state = STATE_FAIL;
break;
case STATE_FAIL:
break;
}
}
puts(state == STATE_OUTSIDE_TAG ? "matches!" : "string didn't match :(");
return 0;
}
```
2024-03-10 17:17:47 +01:00
Just compile this and pass a string to the standard input (e.g. `echo "<testing> string" | ./program`), it will write out if it matches or not.
2024-03-10 02:02:07 +01:00
Maybe it seems a bit overcomplicated -- you could say you could program the above even without regular expressions and state machines. That's true, however imagine dealing with a more complex regex, one that matches a quite complex real world file format. Consider that in [HTML](html.md) for example there are pair tags, non-pair tags, attributes inside tags, entities, comments and many more things, so here you'd have great difficulties creating such parser intuitively -- the approach we have shown can be completely automatized and will work as long as you can describe the format with regular expression.
TODO: regexes in some langs. like Python