You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

17 KiB

Regular Expression

Regular expression (shortened regex or regexp) is a kind of mathematical expression, very often used in programming, that can be used to define simple patterns in strings of characters (usually text). Regular expressions are typically used for searching patterns (i.e. not just exact matches but rather sequences of characters which follow some rules, e.g. numeric values), substitutions (replacement) of such patterns, describing syntax of computer languages, their parsing etc. (though they may also be used in more wild ways, e.g. for generating strings). Regular expression is itself a string of symbols which however describes potentially many (even infinitely many) other strings thanks to containing special symbols that may stand for repetition, alternative etc. For example a.*.b is a regular expression describing a string that starts with letter a, which is followed by a sequence of at least one character and then ends with b (so e.g. aab, abbbb, acaccb etc.).

WATCH OUT: do not confuse regular expressions with Unix wildcards used in file names (e.g. sourse/*.c is a wildcard, not a regexp).

{ A popular online tool for playing around with regular expressions is https://regexr.com/, though it requires JS and is bloated; if you want to stay with Unix, just grep (possibly with -o to see just the matched string). ~drummyfish }

Regular expressions are widely used in Unix tools, programming languages, editors etc. Especially notable are grep (searches for patterns in files), sed (text processor, often used for search and replacement of patterns), awk, Perl, Vim etc.

From the point of view of theoretical computer science and formal languages regular expressions are computationally weak, they are equivalent to the weakest models of computations such as regular grammars or finite state machines (both deterministic and nondeterministic) -- in fact regular expressions are often implemented as finite state machines. This means that regular expressions can NOT describe any possible pattern (for example they can't capture a math expression with nested brackets), only relatively simple ones; however it turns out that very many commonly encountered patterns are simple enough to be described this way, so we have a good enough tool. The advantage of regular expressions is exactly that they are simple, yet very often sufficient.

Are there yet simpler pattern describers than regular expressions? Yes, of course, the simplest example is just a string directly describing the pattern, e.g. "abc" matching exactly just the string "abc" -- this is called a fixed string. Notable subclass of regular expressions are so called star-free languages/expressions which are regular expressions without the star (repetition) operator. Star-free expressions can be used as a simpler variant to regular expressions, they may still describe many patterns and are easier to implement.

Details

WIP

There exist different standards and de-facto standards for regular expressions, some using different symbols, some having extra syntactic sugar (which however usually only make the syntax more comfortable, NOT more computationally powerful) and features (typically e.g. so called capture groups that allow to extract specific subparts of given matched pattern). There are cases where a feature makes regexes more computationally powerful, namely the backreference \n present in extended regular expressions (source: Backreferences in practical regular expressions, 2020). Most relevant standards are probably Posix and Perl (with specific implementations sometimes adding their own flavor, e.g. GNU, Vim etc.): Posix specifies basic and extended regular expression (extended usually turned on with the -E CLI flag). The following table sums up the most common constructs used in regular expressions:

construct matches availability example
char this exact character everywhere a matches a
. any single character everywhere . matches a, b, 1 etc.
expr* any number (even 0) of repeating expr everywhere a* matches empty, a, aa, aaa, ...
^ start of expression (usually start of line) everywhere ^a matches a at the start of line
$ end of expression (usually end of line) everywhere a$ matches a at the end of line
expr+ matches 1 or more repeating expr escape (\+) in basic a+ matches a, aa, aaa, ...
expr? matches 0 or 1 expr escape (\?) in basic a? matches either empty or a
[S] matches anything character from set S everywhere [abc] matches a, b or c
(expr) marks group (for capt. groups etc.) escape (\(, \)) in basic a(bc)d matches abcd with group bc
[A-B] like [ ] but specifies a range everywhere [3-5] matches 3, 4 and 5
[^S] matches any char. NOT from set S everywhere [^abc] matches d, e, A, 1 etc.
{M,N} M to N repetitions of expr escape (\{, \}) in basic a{2,4} matches aa, aaa, aaaa
e1|e2 e1 or e2 escape in basic ab|cd match. ab or cd
\n backref., nth matched group (starts with 1) extended only (..).*\1 matches e.g. ABcdefAB
[:alpha:] alphabetic, a to z, A to Z Posix (GNU has [[ ]]) [:alpha:]* matches e.g. abcDEF
[:alnum:] same as above
[:digit:] same as above
[:blank:] same as above
[:lower:] same as above
[:space:] same as above
\w like [:alnum:] plus also _ char. Perl
\d digit, 0 to 9 Perl
\s like [:space:] Perl

Examples And Fun

Here we'll demonstrate some practical uses of regular expressions. Most common Unix tools associated with regular expressions are probably grep (for searching) and sed (for replacing).

The most basic use case is you just wanting to search for some pattern in a file, i.e. for example you are looking for all IP addresses in a file, for a certain exact word inside source code comment etc.

The following uses grep to find and count all occurences of the word capitalism or capitalist (disregarding case with the -i flag) in a plain text version of this wiki and passes them to be counted with wc.

grep -i -o "capitalis[mt]" ~/Downloads/lrs_wiki.txt | wc -l

We find out there are 829 such occurrences.

Of course, quite frequently you may want to see the lines that match along with files and line numbers, try also e.g. grep -m 10 -H -n "Jesus" ~/Downloads/lrs_wiki.txt.

Now let's search for things that suck with (-o prints out just the matches instead of whole line, -m 10 limits the output to 10 results at most):

grep -o -m 10 "[^ ]* \(sucks\|is shit\)" ~/Downloads/lrs_wiki.txt

Currently we get the following output:

body sucks
OS sucks
everything sucks
Everything is shit
language sucks
it sucks
D&D sucks
writing sucks
Fediverse sucks
This sucks

Now let's try replacing stuff with sed -- this is done with a very common format (which you should remember as it's often used in common speech) s/PATTERN/REPLACE/g where PATTERN is the regular expression to match, REPLACE is a string with which to replace the pattern (s stands for substitute and g for global, i.e. "replace all"). Let's say we are retarded and obsessed with muh privacy and want to censor all names in a portion of the wiki we want to print, so we'll just replace all words composed of letters that start with uppercase letter (and continue with lowercase letters) -- this will also censor other words than names but let's keep it simple for now. The command may look something like:

cat ~/Downloads/lrs_wiki.txt | tail -n +5003 | head -n 10 | sed "s/[A-Z][a-z]\+/<BEEP>/g"

Here we may get e.g.:

they typically have a personal and not easy to describe faith. [19]<BEEP>
was a <BEEP>. [20]<BEEP> often used the word "[21]<BEEP>" instead of
"nature" or "universe"; even though he said he didn't believe in the
traditional personal <BEEP>, he also said that the laws of physics were like
books in a library which must have obviously been written by someone or
something we can't comprehend. [22]<BEEP> <BEEP> said he was "deeply
religious, though not in the orthodox sense". <BEEP> are also very hardcore
religious people such as [23]<BEEP> <BEEP>, the inventor of [24]<BEEP>
language, who even planned to be a <BEEP> missionary. <BEEP> "true
atheists" are mostly second grade "scientists" who make career out of the

A more advanced feature is what we call capture groups that allow us to reuse parts of the matched pattern in the replacement string -- this is needed in some cases, for example if you just want to insert some extra characters in the pattern. Capture groups are parts of the pattern inside brackets (( and ) which sometimes have to be escaped to \( and \)); in the replacement string we then reference them with \1 (first group), \2 (second group) etc. Let's demonstrate this on the following example that will highlight all four letter words:

cat ~/Downloads/lrs_wiki.txt | tail -n +8080 | head -n 7 | sed "s/ \([^ ]\)\([^ ]\)\([^ ]\)\([^ ]\) / \!\!\!\1-\2-\3-FUCKING-\4\!\!\! /g"

The result may look something like this:

Bootstrapping as a general principle can aid us in creation of extremely
!!!f-r-e-FUCKING-e!!! technology by greatly minimizing all its [7]dependencies, we are able
to create a small amount of !!!c-o-d-FUCKING-e!!! that !!!w-i-l-FUCKING-l!!! self-establish our whole
computing environment !!!w-i-t-FUCKING-h!!! only !!!v-e-r-FUCKING-y!!! small effort during the process. The
topic mostly revolves around designing [8]programming language
[9]compilers, but in general we may be talking about bootstrapping whole
computing environments, operating systems etc.

Now a pretty rare use case is generating random patterns -- we can imagine we have a regular expression describing for example a valid username in a game and for some reason we want to generate 1000 random strings that match this pattern (e.g. for bots). Now going into depth on this topic could take a long time because e.g. considering probability distributions we may get into some mathematical rabbit holes (considering that for example the regex .* matches an arbitrarily long string, what will be the average length of a string randomly generated by this pattern? 10? 1000? 1 billion?). Anyway let's leave this aside -- if we do, there is actually a quite simple and natural way of generating random patterns from regular expressions. We can convert the regexp into nondeterministic finite automaton, the make a random traverse of the graph and generate the string along the way. There don't even seem to be many Unix tools for doing this -- at the time of writing this one of the simplest way seems to be with Perl and one of its libraries, which currently still has some great limitations (no groups, no special characters in square brackets, ...), but it's better than nothing. The command we'll be using is:

perl -e "use String::Random; for (1..20) { print String::Random->new->randregex('REGEX') . \"\n\"; }" 2>/dev/null 

Here are some strings generated with different REGEXes:

  • ....*: \n\"; }", /dB|^hRR/3za~, ![5q-%ONK, "oJT.UzSa}0, t"}Yq]sWZjIv, Fq<xs~_e, H=5yt9q<>29XW, <EoVf)ORH{m, ...
  • [A-Z][a-z]{3,8}: Ponic, Xwawo, Hgrtky, Brmcitpsw, Qrogdze, Olhyb, Gqeelz, Igppehljz, Azrdava, ...
  • [:;=8B][-o]?[)({}/|PS]: ;o}, :-), ;(, B(, 8o), =/, :o|, =}, =-(, ...
  • [lL][oue]+lz?: Loeeolz, Luel, luuuolz, lol, Leelz, Leoeuoeoueulz, luueeoolz, ...
  • ...

Let's now try to program a very simple regular expression in C. You can do this in quite fancy ways, serious regex libraries will typically let you specify arbitrary regular expression with a string at runtime (for example char *myRegex = "(abc|ABC).*d+";), then compile it to some fast, efficient representation like the mentioned state machine and use that for matching and replacing patterns. We'll do nothing like that here as that's too complex, we will simply make a program that has one hard wired regular expression and it will just say if given input string matches or not. Let's consider the following regular expression:

(<[^<>]*>|[^<>]*)*

It describes an "XML"-like text; the text can contain tags that start with < and end with >, but there mustn't e.g. be a tag inside another tag. For example <hello> what <world> will match, but hello > world << bruh won't match. OK, so the first thing to do is to convert the regular expression to a finite state automaton -- this can be done intuitively but there is also an exact algorithm that can do this with any regular expression (look it up if you need it). Our automaton will look like this:

               .---.                 .---.
               |   | else            |   | else
           ____V___|____   '>'   ____V___|___
          |  (accept)   |<------|            |
start --->| outside_tag |------>| inside_tag |
          |_____________|  '<'  |____________|
                |                 |
                | '>'         '<' |
              __V___              |
         .---|      |             |
     any |   | fail |<------------'
         '-->|______|

Here we start in the outside_tag state and move between states depending on what characters we read from the input string we are checking (indicated next to the arrows). If we end up in the outside_tag state state again (marked as accepting state) when all is read, the input string matched the regular expression, otherwise it didn't. We'll translate this automaton to a C program:

#include <stdio.h>

#define STATE_OUTSIDE_TAG 0
#define STATE_INSIDE_TAG  1
#define STATE_FAIL        2

int main(void)
{
  int state = STATE_OUTSIDE_TAG;

  while (1)
  {
    int c = getchar();

    if (c == EOF)
      break;

    switch (state)
    {
      case STATE_OUTSIDE_TAG:
        if (c == '<')
          state = STATE_INSIDE_TAG;
        else if (c == '>')
          state = STATE_FAIL;

        break;

      case STATE_INSIDE_TAG:
        if (c == '>')
          state = STATE_OUTSIDE_TAG;
        else if (c == '<')
          state = STATE_FAIL;

        break;

      case STATE_FAIL:
        break;
    }
  }

  puts(state == STATE_OUTSIDE_TAG ? "matches!" : "string didn't match :(");
  return 0;
}

Just compile this and pass a string to the standard input (e.g. echo "<testing> string" | ./program), it will write out if it matches or not.

Maybe it seems a bit overcomplicated -- you could say you could program the above even without regular expressions and state machines. That's true, however imagine dealing with a more complex regex, one that matches a quite complex real world file format. Consider that in HTML for example there are pair tags, non-pair tags, attributes inside tags, entities, comments and many more things, so here you'd have great difficulties creating such parser intuitively -- the approach we have shown can be completely automatized and will work as long as you can describe the format with regular expression.

TODO: regexes in some langs. like Python