less_retarded_wiki/formal_language.md
2024-03-23 20:01:25 +01:00

Formal Language

The field of formal languages treats problems mathematically and rigorously as languages; this covers most structures we can think of, from human languages and computer languages to visual patterns and other highly abstract structures. Formal languages are at the root of theoretical computer science and are important e.g. for the theory of computability/decidability, computational complexity, security and compilers, but they also find use in linguistics and other fields of science.

A formal language is defined as a (potentially infinite) set of strings (each of which is finite, though there is no upper limit on its length) over some alphabet (which is finite). I.e. a language is a subset of E* where E is a finite alphabet (a set of letters) and * is the Kleene star, so that E* signifies the set of all possible strings over E, including the empty string. A string belonging to a language may be referred to as a word or perhaps even a sentence, but if we think of it in terms of our natural languages, this word/sentence is actually a whole text written in the language. The C programming language can be seen as a formal language: the set of all strings that are valid C programs, i.e. that compile without errors.
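To make the definition concrete, here is a minimal Python sketch that enumerates E* shortest-first and carves a language out of it with a membership predicate. The names `kleene_star` and `language` are made up for this illustration, and the "even length" language is just an arbitrary example, not something from the theory itself.

```python
from itertools import count, islice, product

def kleene_star(alphabet):
    """Yield every string over `alphabet`, shortest first (the set E*)."""
    letters = sorted(alphabet)
    for length in count(0):
        # repeat=0 yields one empty tuple, i.e. the empty string
        for combo in product(letters, repeat=length):
            yield "".join(combo)

def language(alphabet, is_member):
    """Yield those strings of E* for which `is_member` returns True."""
    return (s for s in kleene_star(alphabet) if is_member(s))

# E* is infinite, so we only ever look at a finite prefix of it.
first_seven = list(islice(kleene_star({"a", "b"}), 7))
# first_seven == ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']

# Hypothetical example language: all strings of even length over {a, b}.
even_length = list(islice(language({"a", "b"}, lambda s: len(s) % 2 == 0), 5))
# even_length == ['', 'aa', 'ab', 'ba', 'bb']
```

Both generators are infinite, which mirrors the fact that E* (and most interesting languages) are infinite sets; `islice` is what keeps the example finite.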

For example, given an alphabet [a,b,c], a possible formal language over it is [a,ab,bc,c]. Another, different possible language over this alphabet is the infinite language [b,ab,aab,aaab,aaaab,...], which we can also write with a regular expression as a*b. We can also see e.g. English as a formal language: the set of all texts over the English alphabet (along with symbols like space, dot, comma etc.) that we would consider to be in English as we speak it.
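The two example languages can be tested for membership with a short Python sketch. The function names are made up for this illustration; `re.fullmatch` is used so that the whole string, not just a prefix, has to match a*b.

```python
import re

# The finite language from the text, stored directly as a set of strings.
FINITE_LANGUAGE = {"a", "ab", "bc", "c"}

# The infinite language [b, ab, aab, ...] described by the regular expression a*b.
A_STAR_B = re.compile(r"a*b")

def in_finite_language(word):
    """Membership in a finite language is a plain set lookup."""
    return word in FINITE_LANGUAGE

def in_a_star_b(word):
    """Membership in a*b: the entire word must match the expression."""
    return A_STAR_B.fullmatch(word) is not None

print(in_finite_language("bc"))  # True
print(in_a_star_b("aaab"))       # True
print(in_a_star_b("ba"))         # False
```

Note that the infinite language cannot be stored as a set of its strings, which is why we need a finite description of it (here a regular expression) instead.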

What is this all good for? This mathematical formalization allows us to classify languages and understand their structure, which is necessary e.g. for creating efficient compilers, but also to understand computers as such, their power and limits, as computers can be viewed as machines for processing formal languages. With these tools researchers are able to come up with proofs of different properties of languages, which we can exploit. For example, within formal languages it has been proven that certain languages are uncomputable, i.e. there are some problems which a computer cannot ever solve (a typical example is the halting problem), and so we don't have to waste time trying to create such algorithms as we will never find any. The knowledge of formal languages can also guide us in designing computer languages: e.g. we know that regular languages are extremely simple to implement and so, if we can, we should prefer our languages to be regular.

Classification

We usually classify formal languages according to the Chomsky hierarchy, by their computational "difficulty". Each level of the hierarchy has associated models of computation (grammars, automata, expressions, ...) that are able to compute all languages of that level (remember that a level of the hierarchy is a superset of the levels below it and so also includes all the "simpler" languages). The hierarchy is more or less as follows:

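  • all languages: This includes all possible languages, even those that computers cannot analyze at all (e.g. the complement of the halting problem, i.e. the language of all programs that never halt; the halting problem itself is undecidable but still recursively enumerable, so it belongs to type 0). Languages outside type 0 can only be computed by theoretical computers that cannot physically exist in our universe.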
  • type 0, recursively enumerable languages: The most "difficult"/general languages that computers in our universe can analyze. These languages can be computed e.g. by a Turing machine, lambda calculus or a general unrestricted grammar. Example language: a^n where n is not a prime.
  • type 1, context sensitive languages: Computed e.g. by a linear bounded non-deterministic Turing machine or a context sensitive grammar. Example language: a^(n)b^(n)c^(n), n >= 0 (strings of n as, followed by n bs, followed by n cs).
  • type 2, context free languages: Computed by e.g. non-deterministic pushdown automata or context free grammars. (Deterministic pushdown automata compute a class of languages that is between type 2 and type 3).
  • type 3, regular languages: The easiest, weakest kind of languages, computed e.g. by finite state automata or regular expressions. This class also includes all finite languages.
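The growing amount of memory needed as we go up the hierarchy can be illustrated with a rough Python sketch of three recognizers. This is only an illustration under stated assumptions: the function names are made up, the language a^n b^n is a common textbook example of a context free language that does not appear in the list above, and the check for a^n b^n c^n is a direct comparison rather than a simulation of a linear bounded Turing machine.

```python
def accepts_a_star_b(word):
    """Type 3: finite state automaton for a*b; the only memory is the current state."""
    transitions = {("start", "a"): "start", ("start", "b"): "done"}
    state = "start"
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:  # no transition defined: reject
            return False
    return state == "done"

def accepts_an_bn(word):
    """Type 2: pushdown-style recognizer for a^n b^n (assumed textbook example)."""
    stack = []
    seen_b = False
    for letter in word:
        if letter == "a" and not seen_b:
            stack.append("a")   # remember each a
        elif letter == "b" and stack:
            seen_b = True
            stack.pop()         # match one b against one remembered a
        else:
            return False
    return not stack            # every a must have been matched

def accepts_an_bn_cn(word):
    """Type 1: a^n b^n c^n, checked directly by rebuilding the expected string."""
    n, remainder = divmod(len(word), 3)
    return remainder == 0 and word == "a" * n + "b" * n + "c" * n

print(accepts_a_star_b("aab"))      # True
print(accepts_an_bn("aabb"))        # True
print(accepts_an_bn_cn("aabbcc"))   # True
print(accepts_an_bn_cn("aabbc"))    # False
```

The first function needs only a fixed, finite set of states, the second needs an unbounded stack, and the third language cannot be recognized with a single stack at all, which is why it sits one level higher.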

Note that here we are basically always examining infinite languages, as finite languages are trivial. If a language is finite (i.e. the set of all strings of the language is finite), it can automatically be computed by any type 3 computational model. In real life computers are actually always equivalent to a finite state automaton, i.e. the weakest computational type (because a computer's memory is always finite and so there is always a finite number of states a computer can be in). However this doesn't mean there is no point in studying infinite languages, of course, as we're still interested in the structure, computational methods and approximating the infinite models of computation.
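The claim that every finite language is regular can be sketched in Python by mechanically building a finite state automaton (in the shape of a trie) from the list of words. The names `build_trie_automaton` and `accepts` are made up for this illustration, and this is only one simple construction, not a minimal automaton.

```python
def build_trie_automaton(words):
    """Build a finite automaton accepting exactly the given finite set of words."""
    transitions = {}   # maps (state, letter) -> next state
    accepting = set()  # states in which a word of the language ends
    next_state = 1     # state 0 is the start state
    for word in words:
        state = 0
        for letter in word:
            key = (state, letter)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        accepting.add(state)
    return transitions, accepting

def accepts(automaton, word):
    """Run the automaton on `word`; a missing transition means rejection."""
    transitions, accepting = automaton
    state = 0
    for letter in word:
        state = transitions.get((state, letter))
        if state is None:
            return False
    return state in accepting

automaton = build_trie_automaton(["a", "ab", "bc", "c"])
print(accepts(automaton, "ab"))   # True
print(accepts(automaton, "b"))    # False
```

Since the number of states is bounded by the total number of letters in the word list plus one, the construction always terminates with a finite automaton.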

Also bear in mind that these classes aren't exhaustive: there exist more classes, and there are still undiscovered/unproven classes of languages. The Chomsky hierarchy enumerates just the important ones. For example, regular languages have a further subclass of star-free languages.

NOTE: When trying to classify a programming language, we have to be careful about what we classify: one thing is what a program written in the given language can compute, and another thing is the language's syntax. With respect to the former, all general-purpose programming languages such as C or JavaScript are type 0 (Turing complete). From the syntax point of view it's a bit more complicated and we need to further define what exactly syntax is (where the line between syntax errors and semantic errors lies): it may be (and often is) that syntactically the class will be lower. There is actually a famous meme about Perl syntax being undecidable.