less_retarded_wiki/unicode.md
2024-10-03 22:09:52 +02:00

Unicode

".̶̧̦̼̱͙̥̲̝̥̭͍̪̈́̍͌̋͑͒̒̅͂͒̀̊̓̕͠.̴̛̰̯͚͚̳͍̞̯̯͊̒͂̌̃̎̒̐̅͗́̂͠͝.̸̜̀͊̀̒̾̐̔͑̚̕͠a̲̬̪͙̖̬̖ͭͫͦ̀̄̆̍ͦͨͦ͗̅͋ͦͤͯͫ̔̚l̫̹̺̭̳͙̠̦͍̫̝͓͙̟̺͗̊̅ͬ̉͒̏͆͗͒̋ͤ̆̆ͥg̥̳̗͕̫ͪ͛̓̂ͫͮ̔͌̃̈͒̔̏ͭ͋͋ ⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝á́́́́́́́́́́́́́́́́́́́́́́́́́́́́́.̶̢̙̺̖͔̱͎̳̭̖̗̲̻̪̻͑̌͒̊̃̈̾̿̓̅̐́̀̋̔̏.̴̺͖͎͚̠̱̤͂̈́͜.̵̡̡͖̪̱̼͕̘̣̠̮̫͓̯͖̜̚͝͝͝.̷̧̨̥̦̥̱͉̼̗̰̪͍̱͎̑̾Z̳͍̩̻̙̪̲̞̭̦̙ͯ͛̂ͥͣͪͅͅͅl̷̢̛̩̰̹͔̣͗̅̇̍̏͑͐̇̋̑͜ͅǫ̶̢̫̟͙̖̩̽̀͆̽͌͘l̶̩̞̖̹͈͒͊̔̑̆<CCB9≯͎̺̳̄͂̊̒<CCBA>̶̸̵̶̴̸̸̴̶̸̷̶̴̴̡̢̡̢̡̢̧̧̡̧̡̨̡̨̢̧̧̡̢̛̛̛̛̼̻̣̗͔͉̩̪̞͎̖̙͍͚͍̼̰͖̺̤̗̘͕̳̻̖̳̻̗̯̭̙̳̲͕̮͇͕̼͉̞̣̟̖̘̟͕̗̼̙̻͇̝̪̦͚̤̦̣̗̤̪̟̠͖͓̟̬̲͙͇͉̘͙͙͚̜̜̮͈̞͓̰̫͍̙͙͙̱͓͖̠͇̪̭̮̤̺̗̙̘̫̤̥̳͇͔̣̩͕͍̦͈̬̯̗̘͔̻̗̘͔̪̹̬̲͇͕̻͎̣̩̻̖͉̱̝̼̞̪̠̮̤͓̥͊̔̈́̀̋̄̄̇́̋̎͛̓́̔̇̂̒̅͊̎̉͗̓̀͑̋͒͑̍̏̅̋͆̑̈̾͗̽͑̏̉̀͌͋̉̒̋̑̊̂̈́̈́͑̀͂́̈́̆̄̃͆͆̈́̊̿̌̋̍̈̒͂̀̈́͌̽͌̈́͋̈́̃̅͂͆́̍͑̓̎͋̅͂̽̈́̈́͗̆̑̔̎̈́́̆͂̉̀̒͌̿̽͊̍̃̕̚͘̚̕̚͘̕̕̕͘͜͜͝͠͝͝͠͝͝͝͠͝͠͠͠ͅͅͅͅ.̸̷̷̷̴̸̶̵̴̶̵̸̴̴̷̸̷̵̷̵̴̴̷̧̨̢̨̡̨̧̡̨̧̧̨̡̧̢̧̢̧̨̛̛̛̛̛̛̛̤͈̯̤͙̻̫̼̱̦̮̙̤̝̖̗͉̘̫̟̗̹͉͇͖̘͙̻̫̫̫̰̝̭̤͈͓͔̱̭͙͔͔̼̖̬̰̳̗͖͖̯̮͔̝̞̬̳͇͈̥̘͙͇̺̪̞̞̙͈̮͔̞̭͎̩͎̦̞̝͎̗͚͈̖̣͖̹̜̞̤̺̱̱̰͔̼̭̮̰̖͔͔͈̥͎̜̭̪̺̲͔̲̻̰̳̲̖̤̳̙̥̼̩͈̥̗̟͙̥̗̳͍̥̝̫͚̘̱̱̹̺̣̝̳̣͇̹̫̝̫̟̯̺͇̞̳͖̫͔̲̗͔̟̩̦̳͎̳͖̎̓͂̀̀́̌͗̐̅̈́̓̿̓̌́̓́͋͊͛̄͊̂̒͌̀͗̔̀̑̔͒̐̀͌̋̍͗͛̂̆̈́͛͋͆̐̌̓̄͊̑̑̅̑̿̏̈́̀̊̆̈̔̃̽̀̎̐́̎̾͐̀̌̒̑́̇̑̊͑́̓̓̔̆͐́̅̓̔̃̅̂̐͗́̎͌́̊͌͒͒̓́̀͒̍̽̂́̀̉̀̑̉̑̓́͗̓́̍̏̉͆̑͂̔̅̀͊̈́̀͑͛́̿͆͑̀͐̃̋̐̋̈́̉͊̿̌̾͗͛̉́̓̓̏̈́͂̋͌͆̓̑͗͗̍̇̕̚̚͘̕͘̚̚̕͘͘͜͜͜͜͜͜͜͜͠͠͠͝͝͠͝͠͝͠͝ͅͅͅͅͅͅ.̸̷̸̴̸̸̶̶̵̵̸̵̴̡̡̡̡̧̢̢̧̧̧̧̡̢̡̛̛̛̛̬͇̜̘̗̗̲̟̗̤̤̜̹͎̣̹̺͉̯̼̭̟̮̖͕̻̰̬̼̮̮̬̪̥̤̘̣̺̥̪̠̥̳̰͇̫͔̜̫͚͖͔̩̙̪͖̥͍̗͍͉͙̣͔̠̭̞̩̱̠̻̹͎͔̯̻̘͖̦̘͕͉͈͈̞̖̬͔͈̗͓͖͚̤̬̤̘̠̱͆̍̍͆͗͋̇͗̓͐̉͋̈́̀̍̈̇̀̀̎͋̾̇̎͐̌̌̿̽̾̃̑͆̎̾̾̈́̆̐̂̅́̓̔̇̔̑̔͑̓̍͊͌͋̔̐̑͌̓̒̎̍̃͐̀͊̿̓͋̌͐̋̂̽̿̒̋̎́͒̋͘͘͘̕̕͘͝͠͝͝ͅͅa̲̬̪͙̖̬̖ͭͫͦ̀̄̆̍ͦͨͦ͗̅͋ͦͤͯͫ̔̚l̫̹̺̭̳͙̠̦͍̫̝͓͙̟̺͗̊̅ͬ̉͒̏͆͗͒̋ͤ̆̆ͥ𒈙.̴̢̟̩̗͊.̴̹͎̦̘͇͎̩̮̻̾͛̐ͅ𰻞.̷̧̫͙̤̗͇̔̂̀̄͗̍̈͋̈́̕.̷̨̛͈̤͈̲̥̱̹̲͖͗͛͆̓͊̅̈̕͠.̷̻̺͔͍̭͋̾̐̔͑̔̌̂͛͆̽͘͜͠͝͠.̷̧̨͉̝̳̲̫̙̻͎̬͚̒̀̄͒.̶̨͙̩̦̪͋̄͆͌̈́́͐̈̈́̕ͅ.̸̡̠̙̪͔͍̬̘̖̗̙̞̬͇̐͋͊͐̋̚ͅ.̷̢̮̮̖̹̟̖̩̗͙̝̺́̑̈̉͘͘͠ͅ.̴̨̡̧̤̳͖̰̼̺̮͉͖̲̫̳̜̹̄.̵̢̤̦̞͙̝̬͍̞̤͇̽̾̈́̔̋̋̓̌̋̐̓̅͜͝.̷͙͊.̵̠̜̞̭̘͉͓̞̤͍̝̈́̋̃́̈́͐̃̉͆̚͜.̴͉͈͓͈͉͎̺͍͕̥̦̙͙͕̈́̏̿́̏̔.̶͕̟̤͔͑̉̽̓̇̐́̃̿͜.̶̧̨̨̱̪̞̞̯̹̤̘̭̠͓̀̓̐̓́͑͂̉.̴̛̙̮͚̊͗̏̈́͗̅͆̑̂̌̐̃̊̂̓.̴̙͎̔͑̿͗̃̒́̏̏͑͘̕á́́́́́́́́́́́́́́́́́́́́́́́́́́́́́" --creator of 🎮𝕌𝕟ι𝕔𝗼d̢̪̲̬̳̩̟̍ĕ̸͓̼͙͈͐🚀

WORK IN PROGRESS

Unicode is a successful standard that aims to organize symbols and characters (letters, digits, graphical symbols, emoticons, ...) of all the world's writing systems and to define several ways of encoding them as digital data, i.e. it's a big project that wants to unify digitization and encoding of any possible text in computers. The effort dates back to the 1980s and was started to do away with the mess of many existing incompatible text encoding systems. In this it succeeded: Unicode is nowadays everywhere and it's the standard way of encoding text wherever you look, probably owing a lot to its backwards compatibility with plain ASCII encoding, which was the most popular encoding of English back in the day (i.e. any old ASCII text is still a valid Unicode text, provided we use UTF-8 encoding). The standard is made by the Unicode Consortium, whose members are basically all the big companies.

In Unicode every character is unique like a unicorn.

It is important to distinguish between Unicode codepoints (the abstract character codes) and Unicode encodings; they are two different things. For example the Unicode codepoint for the character A is 65 (same as in ASCII), or (written the Unicode way) U+0041 (hexadecimal 41 is decimal 65), but this value of 65 may then be represented in several different ways in a computer file, depending on the Unicode encoding we use (in UTF-8 it will be a single byte while in UTF-16 it will be two bytes). Currently Unicode defines these encodings:

  • UTF-8: Most widely used, backwards compatible with 7-bit ASCII, probably most suckless (you can literally ignore it for ASCII text and it won't inflate plain ASCII text). Character codes have variable width (they obviously have to), i.e. the basic characters take 1 byte but more complex ones may take up to 4 bytes (this may complicate or slow down e.g. counting string length). Generally codepoints are encoded like this, with the x bits holding the codepoint value (notice that not all byte values are valid, which may help detect non-UTF-8 text or corrupted data):
    • first 128 codepoints (U+0000 to U+007F): 0xxxxxxx (same as ASCII)
    • next 1920 codepoints (U+0080 to U+07FF): 110xxxxx 10xxxxxx
    • next 61440 codepoints (U+0800 to U+FFFF, not counting the 2048 surrogate codes): 1110xxxx 10xxxxxx 10xxxxxx
    • the rest (U+10000 to U+10FFFF): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
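  • UTF-16: Quite shitty encoding, uses either 16 or 32 bits to encode each character, i.e. it is variable length like UTF-8 but also wastes space like UTF-32. The encoding is also a bit messy (codepoints above U+FFFF have to be split into so called surrogate pairs, and there are big endian and little endian variants). Probably avoid.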
  • UTF-32: Uses literally 32 bits to encode the exact codepoint with leading bits being 0. Of course this wastes space but may be useful sometimes, for example in quickly finding Nth character or counting string length. Sucks for storage but may be useful for quick processing of text.
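
The UTF-8 bit patterns listed above can be sketched in a few lines of Python. Take this as a rough illustration under assumptions, not a proper encoder: the names `utf8_encode`, `utf8_length` and `looks_like_utf8` are made up for this example, and the encoder doesn't reject the surrogate codes (U+D800 to U+DFFF) as a strict one would. The end of the block also compares how many bytes the same text takes in each of the three encodings.

```python
def utf8_encode(codepoint):
    """Encode one codepoint to UTF-8 bytes using the bit patterns above.

    Hypothetical helper for illustration; it does not reject surrogates.
    """
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("codepoint out of Unicode range")
    if codepoint < 0x80:       # 0xxxxxxx
        return bytes([codepoint])
    if codepoint < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    if codepoint < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])


def utf8_length(data):
    """Count codepoints in UTF-8 bytes: every byte that is not a
    continuation byte (10xxxxxx) starts a new character."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def looks_like_utf8(data):
    """Use the fact that not all byte sequences are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One character of each width (1, 2, 3 and 4 bytes), checked against
# Python's built-in encoder.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for char in samples:
    assert utf8_encode(ord(char)) == char.encode("utf-8")

# Same text (3 characters), three encodings: the byte size differs.
text = "A\u00f1\u20ac"
sizes = {enc: len(text.encode(enc))
         for enc in ("utf-8", "utf-16-le", "utf-32-le")}
print(sizes)  # {'utf-8': 6, 'utf-16-le': 6, 'utf-32-le': 12}

print(utf8_length(text.encode("utf-8")))  # 3
print(looks_like_utf8(b"\xe9t\xe9"))      # False (this is Latin-1 text)
```

Note that `utf8_length` has to walk over every byte, which is the slowdown mentioned above; with UTF-32 the character count is simply the byte count divided by 4.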

More detail: Unicode codepoints go from U+0000 to U+10FFFF (1114111 in decimal), i.e. there is a place for over a million characters (only 1112064 of the codes can actually be characters; the remaining 2048, U+D800 to U+DFFF, are reserved as surrogates for UTF-16). These codes are subdivided into 17 planes of 2^16 (65536) codepoints each, i.e. U+0000 to U+FFFF is plane 0, U+10000 to U+1FFFF is plane 1 etc. Planes are further subdivided into blocks that group related characters. There are even so called "private use areas" (perverts BTFO), for example U+E000 to U+F8FF, which are left for third party use (for example you may use them to add custom emoticons in your game's chat).
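
The arithmetic above is easy to check with a small Python sketch. The function names `plane` and `is_private_use` are made up for this example, and `is_private_use` only checks the plane 0 range mentioned above (planes 15 and 16 are private use as well). The count of valid characters assumes the only excluded codes are the 2048 surrogates U+D800 to U+DFFF.

```python
MAX_CODEPOINT = 0x10FFFF
SURROGATES = range(0xD800, 0xE000)  # 2048 codes reserved for UTF-16


def plane(codepoint):
    """Plane number (0 to 16); each plane holds 2^16 codepoints."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError("codepoint out of Unicode range")
    return codepoint >> 16


def is_private_use(codepoint):
    """True for the plane 0 private range U+E000 to U+F8FF only."""
    return 0xE000 <= codepoint <= 0xF8FF


total = MAX_CODEPOINT + 1         # number of codes, U+0000 included
valid = total - len(SURROGATES)   # codes that can be characters
print(total, valid)               # 1114112 1112064
print(total // 2**16)             # 17 (number of planes)
print(plane(0x41), plane(0x1F600), plane(0x10FFFF))  # 0 1 16
print(is_private_use(0xE000), is_private_use(0x41))  # True False
```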

The Unicode project is indeed highly ambitious, it's extremely difficult to do what they aim to do because many hard to answer questions come up, such as what even IS a character (Do we include every possible emoticon? Icons and pictograms used on road signs? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or different ones? Are the same characters in Chinese and Japanese different if they have a different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...) AND there are many crazy writing systems all over the world (some write right to left, some top to bottom, some compose characters of multiple other characters etc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the Internet and so on and so forth. And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?

Is Unicode crap and bloat? Yes, it inevitably has to be, there's a lot of obscurity and crap in Unicode and many systems infamously can't handle malicious Unicode text and will even crash. However it can also be avoided well, it must be said it seems to be relatively well made for what it's trying to do -- for LRS it's important that we can just still keep using ASCII and we're good, i.e. we aren't forced to use the bloated part of Unicode, and if we get Unicode text, we can easily filter out non-ASCII characters. Full Unicode compliance will be bloated and shouldn't be practiced, but it's possible to partially comply with only minimum added complexity. To a degree Unicode also fucked up many texts because soyboys and bloat fans now try to use the "correct" characters for everything, so they will for example use the correct "multiplication sign" instead of just x or * which won't display well in ASCII readers, but again, this can at least be automatically corrected. Unicode is also controversial because SJWs push it too hard, claiming that ASCII is racist to people who can only write in retarded languages like Chinese -- we say it's better for the Chinese to learn English than to fuck computers up. Unicode also allowed noobs to make what they call "ASCII_art" without having any actual skill at it.
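
Filtering out non-ASCII characters and automatically correcting things like the multiplication sign, as mentioned above, might look roughly like this in Python. It's only a sketch: the `to_ascii` function and the `ASCII_REPLACEMENTS` table are hypothetical, the table is a tiny hand-picked example rather than any standard mapping, and anything it doesn't know gets replaced with a placeholder.

```python
# Hypothetical replacement table, extend as needed.
ASCII_REPLACEMENTS = {
    "\u00d7": "x",   # multiplication sign
    "\u2212": "-",   # minus sign
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
}


def to_ascii(text, placeholder="?"):
    """Keep ASCII characters, swap known non-ASCII characters for an
    ASCII lookalike and replace everything else with the placeholder."""
    out = []
    for char in text:
        if ord(char) < 128:
            out.append(char)
        else:
            out.append(ASCII_REPLACEMENTS.get(char, placeholder))
    return "".join(out)


print(to_ascii("3 \u00d7 4 = 12"))  # 3 x 4 = 12
print(to_ascii("na\u00efve"))       # na?ve
```

Passing `placeholder=""` instead drops the unknown characters completely.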

TODO: Unicode funny characters?