Miloslav Ciz 2024-10-06 20:23:00 +02:00
parent 4725b968bd
commit e113dbfa66
10 changed files with 1839 additions and 1835 deletions

@@ -4,7 +4,7 @@
Unicode is a successful, constantly evolving standard aiming to organize symbols and characters (letters, digits, graphical symbols, [emoji](emoji.md), ...) of all the world's writing systems and to define and standardize ways of encoding them as [digital](digital.md) [data](data.md), i.e. it's a big [project](project.md) promising to unify the encoding of any possible [text](text.md) in [computers](computer.md). As of writing the latest version is 16.0 from 2024, defining over 150000 characters. The effort dates back to the 1980s and was started to do away with the mess and headaches induced by a plethora of existing incompatible text encoding systems -- in this it succeeded: Unicode is nowadays everywhere and it's the standard way of encoding text wherever you look, probably owing a lot to its backwards compatibility with plain [ASCII](ascii.md) encoding, which was the most popular encoding of English back in the day (i.e. any old ASCII text is still a valid Unicode text, provided we use UTF-8 encoding). The standard is made by the Unicode Consortium whose members are basically all the big companies.
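The backwards compatibility with ASCII mentioned above can be checked with a small sketch. The helper name `is_ascii_compatible` is made up for this illustration; only Python's built-in `str.encode` and `bytes.decode` are used.

```python
def is_ascii_compatible(text: str) -> bool:
    """Return True if the ASCII bytes of text equal its UTF-8 bytes.

    Hypothetical helper for illustration; text must be pure ASCII,
    otherwise encode("ascii") raises UnicodeEncodeError.
    """
    return text.encode("ascii") == text.encode("utf-8")


# Bytes as an old ASCII-only program might have written them.
old_ascii_bytes = b"Hello, world!"

# The very same bytes are accepted by a UTF-8 decoder unchanged.
decoded = old_ascii_bytes.decode("utf-8")

print(decoded)
print(is_ascii_compatible(decoded))
```

This only demonstrates the ASCII range (codepoints 0 to 127); characters outside it take more than one byte in UTF-8.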
In Unicode every character is unique like a unicorn.
In Unicode every character is unique like a unicorn. It has all the diverse characters such as the penis (𓂸), ejaculating penis (𓂺), swastika (卐), hammer and sickle (☭), white power sign (👌), middle finger (🖕) etc. **Here is a lulzy part of Unicode**: it's possible to combine some characters together with so-called *combining characters*, so purely IN THEORY one can for example combine the prohibition symbol (U+20Ex) with [LGBT](lgbt.md) propaganda characters and other fascist symbols to create anti-fascist emojis like so: 🏳️‍🌈⃠👨🏿⃠👩⃠. Of course this created some controversies :D
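How combining characters work can be sketched in a few lines of Python. This is only an illustration: the base letter *A* is an arbitrary choice, U+20E0 is taken as one concrete prohibition mark from the U+20Ex range, and the helper `codepoints` is made up for the example. The point is that the "combined" symbol is still stored as separate codepoints, and it's up to the renderer to draw them on top of each other.

```python
import unicodedata

BASE = "A"        # arbitrary base character chosen for the example
MARK = "\u20e0"   # a combining mark from the U+20Ex range

combined = BASE + MARK  # the mark follows the character it modifies


def codepoints(text: str) -> list[str]:
    """List the codepoints of text in the U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(combined)                    # how it looks depends on the font/renderer
print(len(combined))               # 2: still two separate codepoints
print(codepoints(combined))
print(unicodedata.name(MARK))      # official name of the combining mark
print(unicodedata.category(MARK))  # general category (a "mark" category)
```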
**It is important to distinguish between Unicode codepoints (the abstract character codes) and Unicode encodings**, they are two different things. For example the Unicode codepoint for character *A* is 65 (same as in ASCII), or (written the Unicode way) *U+0041* ([hexadecimal](hexadecimal.md) 41 equals decimal 65), but this value of 65 may then be represented in several different ways in the computer file, depending on the Unicode encoding we use (in UTF-8 it will be a single byte while in UTF-16 it will be two bytes). Currently Unicode defines these encodings (additional unofficial encodings exist as well):