Update

2024-03-10 02:02:07 +01:00 · 2024-03-10 02:02:07 +01:00 · 66fe12a69c
commit 66fe12a69c
parent 6a169e2f14
16 changed files with 1814 additions and 1711 deletions
--- a/regex.md
+++ b/regex.md
@@ -8,7 +8,7 @@ WATCH OUT: do not confuse regular expressions with Unix [wildcards](wildcard.md)

 Regular expressions are widely used in [Unix](unix.md) tools, [programming languages](programming_language.md), editors etc. Especially notable are [grep](grep.md) (searches for patterns in files), [sed](sed.md) (text processor, often used for search and replacement of patterns), [awk](awk.md), [Perl](perl.md), [Vim](vim.md) etc.

-From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**, they are equivalent to the weakest models of computations such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe any possible pattern**, only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
+From the point of view of [theoretical computer science](theoretical_compsci.md) and [formal languages](formal_language.md) **regular expressions are computationally weak**: they are equivalent to the weakest models of computation such as regular [grammars](grammar.md) or **[finite state machines](finite_state_machine.md)** -- in fact regular expressions are often implemented as finite state machines. This means that **regular expressions can NOT describe every possible pattern**, only relatively simple ones (for example they can't capture a math expression with brackets in which every start bracket has to be matched by an end bracket); however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a [good enough](good_enough.md) tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.
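
To make the bracket limitation more concrete, here is a minimal sketch in C of what checking matching brackets takes. The function name `bracketsBalanced` is made up for this illustration. The point is the `depth` counter: in principle it has to be able to count arbitrarily high, while a finite state machine only has a fixed number of states, so no regular expression can do the same job for arbitrarily deep nesting.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only. Returns 1 if every '(' in s
   is closed by a later ')' and no ')' appears without an open '(' before
   it, otherwise returns 0. */
int bracketsBalanced(const char *s)
{
  size_t depth = 0; /* number of currently open brackets */

  for (; *s != 0; ++s)
  {
    if (*s == '(')
      depth++;
    else if (*s == ')')
    {
      if (depth == 0)
        return 0; /* closing bracket with nothing to close */

      depth--;
    }
  }

  return depth == 0; /* every opened bracket must have been closed */
}
```

With this sketch `bracketsBalanced("(1 + (2 * 3))")` returns 1, while `bracketsBalanced("(1 + 2")` and `bracketsBalanced(")(")` both return 0.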

 ## Details

@@ -130,4 +130,79 @@ Here are some strings generated with different `REGEX`es:
 - `[lL][oue]+lz?`: `Loeeolz`, `Luel`, `luuuolz`, `lol`, `Leelz`, `Leoeuoeoueulz`, `luueeoolz`, ...
 - ...

-TODO: moar, code
+Let's now try to **[program](programming.md)** a very simple regular expression in [C](c.md). You can do this in quite fancy ways: serious regex libraries will typically let you specify an arbitrary regular expression with a string at runtime (for example `char *myRegex = "(abc|ABC).*d+";`), then compile it to some fast, efficient representation like the mentioned state machine and use that for matching and replacing patterns. We'll do nothing like that here as that's too complex; we will simply make a program that has one hard-wired regular expression and just says whether a given input string matches or not. Let's consider the following regular expression:
+
+```
+(<[^<>]*>|[^<>]*)*
+```
+
+It describes an "[XML](xml.md)"-like text; the text can contain tags that start with `<` and end with `>`, but e.g. a tag mustn't appear inside another tag. For example `<hello> what <world>` will match, but `hello > world << bruh` won't match. OK, so the first thing to do is to convert the regular expression to a [finite state automaton](finite_state_automaton.md) -- this can be done intuitively but there is also an exact algorithm that can do this with any regular expression (look it up if you need it). Our automaton will look like this:
+
+```
+               .---.                 .---.
+               |   | else            |   | else
+           ____V___|____   '>'   ____V___|___
+          |  (accept)   |<------|            |
+start --->| outside_tag |------>| inside_tag |
+          |_____________|  '<'  |____________|
+                |                 |
+                | '>'         '<' |
+              __V___              |
+         .---|      |             |
+     any |   | fail |<------------'
+         '-->|______|
+```
+
+Here we start in the `outside_tag` state and move between states depending on what characters we read from the input string we are checking (indicated next to the arrows). If we end up in the `outside_tag` state again (marked as the *accepting* state) when all input is read, the input string matched the regular expression, otherwise it didn't. We'll translate this automaton to a C program:
+
+```
+#include <stdio.h>
+
+#define STATE_OUTSIDE_TAG 0
+#define STATE_INSIDE_TAG  1
+#define STATE_FAIL        2
+
+int main(void)
+{
+  int state = STATE_OUTSIDE_TAG;
+
+  while (1)
+  {
+    int c = getchar();
+
+    if (c == EOF)
+      break;
+
+    switch (state)
+    {
+      case STATE_OUTSIDE_TAG:
+        if (c == '<')
+          state = STATE_INSIDE_TAG;
+        else if (c == '>')
+          state = STATE_FAIL;
+
+        break;
+
+      case STATE_INSIDE_TAG:
+        if (c == '>')
+          state = STATE_OUTSIDE_TAG;
+        else if (c == '<')
+          state = STATE_FAIL;
+
+        break;
+
+      case STATE_FAIL: // trap state: once here, no input can lead out of it
+        break;
+    }
+  }
+
+  puts(state == STATE_OUTSIDE_TAG ? "matches!" : "string didn't match :(");
+  return 0;
+}
+```
+
+Just compile this and pass a string to the standard input (e.g. `echo "<testing> string" | ./program`); it will write out whether it matches or not.
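
The same automaton can also be written down as a transition table instead of a `switch`, which is closer to what an automatic regex-to-automaton conversion would produce. The following is only a sketch under that assumption; the names `transition` and `matchesTagRegex` and the character classes are invented for this example. It takes the string as a function argument rather than reading standard input.

```c
/* States of the automaton, same meaning as in the diagram. */
enum { OUTSIDE_TAG, INSIDE_TAG, FAIL, STATE_COUNT };

/* Input characters fall into three classes: '<', '>' and anything else. */
enum { CHAR_OPEN, CHAR_CLOSE, CHAR_OTHER, CLASS_COUNT };

/* transition[state][class] gives the next state. */
static const unsigned char transition[STATE_COUNT][CLASS_COUNT] =
{
  /*                   '<'         '>'          else        */
  /* OUTSIDE_TAG */ { INSIDE_TAG, FAIL,        OUTSIDE_TAG },
  /* INSIDE_TAG  */ { FAIL,       OUTSIDE_TAG, INSIDE_TAG  },
  /* FAIL        */ { FAIL,       FAIL,        FAIL        }
};

/* Hypothetical helper: returns 1 if s matches (<[^<>]*>|[^<>]*)*, else 0. */
int matchesTagRegex(const char *s)
{
  int state = OUTSIDE_TAG; /* start state */

  for (; *s != 0; ++s)
  {
    int charClass = *s == '<' ? CHAR_OPEN :
      (*s == '>' ? CHAR_CLOSE : CHAR_OTHER);

    state = transition[state][charClass];
  }

  return state == OUTSIDE_TAG; /* OUTSIDE_TAG is the only accepting state */
}
```

Here `matchesTagRegex("<hello> what <world>")` returns 1 and `matchesTagRegex("hello > world << bruh")` returns 0, same as the program above would report. Changing the regular expression then only means changing the table, the matching loop stays the same.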
+
+Maybe it seems a bit overcomplicated -- you could say you could program the above even without regular expressions and state machines. That's true, however imagine dealing with a more complex regex, one that matches a quite complex real world file format. Consider that in [HTML](html.md) for example there are pair tags, non-pair tags, attributes inside tags, entities, comments and many more things, so here you'd have great difficulties creating such a parser intuitively -- the approach we have shown can be completely automated and will work as long as you can describe the format with a regular expression.
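
For comparison, here is a rough sketch of the automated route using an existing library, namely the regex functions from `<regex.h>`. Note the assumptions: `<regex.h>` comes from POSIX, not from the C standard library, so this will only compile on systems that provide it, and the wrapper name `regexMatchesWhole` is made up for this example. The pattern is given as a string at runtime, `regcomp` compiles it into the library's internal representation and `regexec` runs the match.

```c
#include <stddef.h>
#include <regex.h> /* POSIX, not part of standard C */

/* Hypothetical wrapper: returns 1 if s matches the extended regular
   expression in pattern, 0 if it doesn't and -1 if the pattern can't be
   compiled. To test the whole string, the caller has to anchor the pattern
   with ^ and $, otherwise a match anywhere inside s counts. */
int regexMatchesWhole(const char *pattern, const char *s)
{
  regex_t compiled;
  int result;

  /* REG_EXTENDED: extended syntax (|, +, unescaped brackets),
     REG_NOSUB: we only want a yes/no answer, no match positions */
  if (regcomp(&compiled, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  result = regexec(&compiled, s, 0, NULL, 0) == 0;

  regfree(&compiled); /* release memory held by the compiled regex */

  return result;
}
```

For example `regexMatchesWhole("^(abc|ABC).*d+$", "abcxxdd")` returns 1 and `regexMatchesWhole("^(abc|ABC).*d+$", "abx")` returns 0. Passing `"^(<[^<>]*>|[^<>]*)*$"` as the pattern should give the same answers as the hand-written automaton above.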
+
+TODO: regexes in some langs. like Python