Unicode

".̶̧̦̼̱͙̥̲̝̥̭͍̪̈́̍͌̋͑͒̒̅͂͒̀̊̓̕͠.̴̛̰̯͚͚̳͍̞̯̯͊̒͂̌̃̎̒̐̅͗́̂͠͝.̸̜̀͊̀̒̾̐̔͑̚̕͠a̲̬̪͙̖̬̖ͭͫͦ̀̄̆̍ͦͨͦ͗̅͋ͦͤͯͫ̔̚l̫̹̺̭̳͙̠̦͍̫̝͓͙̟̺͗̊̅ͬ̉͒̏͆͗͒̋ͤ̆̆ͥg̥̳̗͕̫ͪ͛̓̂ͫͮ̔͌̃̈͒̔̏ͭ͋͋ ⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝꙰⃝á́́́́́́́́́́́́́́́́́́́́́́́́́́́́́.̶̢̙̺̖͔̱͎̳̭̖̗̲̻̪̻͑̌͒̊̃̈̾̿̓̅̐́̀̋̔̏.̴̺͖͎͚̠̱̤͂̈́͜.̵̡̡͖̪̱̼͕̘̣̠̮̫͓̯͖̜̚͝͝͝.̷̧̨̥̦̥̱͉̼̗̰̪͍̱͎̑̾Z̳͍̩̻̙̪̲̞̭̦̙ͯ͛̂ͥͣͪͅͅͅl̷̢̛̩̰̹͔̣͗̅̇̍̏͑͐̇̋̑͜ͅǫ̶̢̫̟͙̖̩̽̀͆̽͌͘l̶̩̞̖̹͈͒͊̔̑̆<CCB9≯͎̺̳̄͂̊̒<CCBA>̶̸̵̶̴̸̸̴̶̸̷̶̴̴̡̢̡̢̡̢̧̧̡̧̡̨̡̨̢̧̧̡̢̛̛̛̛̼̻̣̗͔͉̩̪̞͎̖̙͍͚͍̼̰͖̺̤̗̘͕̳̻̖̳̻̗̯̭̙̳̲͕̮͇͕̼͉̞̣̟̖̘̟͕̗̼̙̻͇̝̪̦͚̤̦̣̗̤̪̟̠͖͓̟̬̲͙͇͉̘͙͙͚̜̜̮͈̞͓̰̫͍̙͙͙̱͓͖̠͇̪̭̮̤̺̗̙̘̫̤̥̳͇͔̣̩͕͍̦͈̬̯̗̘͔̻̗̘͔̪̹̬̲͇͕̻͎̣̩̻̖͉̱̝̼̞̪̠̮̤͓̥͊̔̈́̀̋̄̄̇́̋̎͛̓́̔̇̂̒̅͊̎̉͗̓̀͑̋͒͑̍̏̅̋͆̑̈̾͗̽͑̏̉̀͌͋̉̒̋̑̊̂̈́̈́͑̀͂́̈́̆̄̃͆͆̈́̊̿̌̋̍̈̒͂̀̈́͌̽͌̈́͋̈́̃̅͂͆́̍͑̓̎͋̅͂̽̈́̈́͗̆̑̔̎̈́́̆͂̉̀̒͌̿̽͊̍̃̕̚͘̚̕̚͘̕̕̕͘͜͜͝͠͝͝͠͝͝͝͠͝͠͠͠ͅͅͅͅ.̸̷̷̷̴̸̶̵̴̶̵̸̴̴̷̸̷̵̷̵̴̴̷̧̨̢̨̡̨̧̡̨̧̧̨̡̧̢̧̢̧̨̛̛̛̛̛̛̛̤͈̯̤͙̻̫̼̱̦̮̙̤̝̖̗͉̘̫̟̗̹͉͇͖̘͙̻̫̫̫̰̝̭̤͈͓͔̱̭͙͔͔̼̖̬̰̳̗͖͖̯̮͔̝̞̬̳͇͈̥̘͙͇̺̪̞̞̙͈̮͔̞̭͎̩͎̦̞̝͎̗͚͈̖̣͖̹̜̞̤̺̱̱̰͔̼̭̮̰̖͔͔͈̥͎̜̭̪̺̲͔̲̻̰̳̲̖̤̳̙̥̼̩͈̥̗̟͙̥̗̳͍̥̝̫͚̘̱̱̹̺̣̝̳̣͇̹̫̝̫̟̯̺͇̞̳͖̫͔̲̗͔̟̩̦̳͎̳͖̎̓͂̀̀́̌͗̐̅̈́̓̿̓̌́̓́͋͊͛̄͊̂̒͌̀͗̔̀̑̔͒̐̀͌̋̍͗͛̂̆̈́͛͋͆̐̌̓̄͊̑̑̅̑̿̏̈́̀̊̆̈̔̃̽̀̎̐́̎̾͐̀̌̒̑́̇̑̊͑́̓̓̔̆͐́̅̓̔̃̅̂̐͗́̎͌́̊͌͒͒̓́̀͒̍̽̂́̀̉̀̑̉̑̓́͗̓́̍̏̉͆̑͂̔̅̀͊̈́̀͑͛́̿͆͑̀͐̃̋̐̋̈́̉͊̿̌̾͗͛̉́̓̓̏̈́͂̋͌͆̓̑͗͗̍̇̕̚̚͘̕͘̚̚̕͘͘͜͜͜͜͜͜͜͜͠͠͠͝͝͠͝͠͝͠͝ͅͅͅͅͅͅ.̸̷̸̴̸̸̶̶̵̵̸̵̴̡̡̡̡̧̢̢̧̧̧̧̡̢̡̛̛̛̛̬͇̜̘̗̗̲̟̗̤̤̜̹͎̣̹̺͉̯̼̭̟̮̖͕̻̰̬̼̮̮̬̪̥̤̘̣̺̥̪̠̥̳̰͇̫͔̜̫͚͖͔̩̙̪͖̥͍̗͍͉͙̣͔̠̭̞̩̱̠̻̹͎͔̯̻̘͖̦̘͕͉͈͈̞̖̬͔͈̗͓͖͚̤̬̤̘̠̱͆̍̍͆͗͋̇͗̓͐̉͋̈́̀̍̈̇̀̀̎͋̾̇̎͐̌̌̿̽̾̃̑͆̎̾̾̈́̆̐̂̅́̓̔̇̔̑̔͑̓̍͊͌͋̔̐̑͌̓̒̎̍̃͐̀͊̿̓͋̌͐̋̂̽̿̒̋̎́͒̋͘͘͘̕̕͘͝͠͝͝ͅͅa̲̬̪͙̖̬̖ͭͫͦ̀̄̆̍ͦͨͦ͗̅͋ͦͤͯͫ̔̚l̫̹̺̭̳͙̠̦͍̫̝͓͙̟̺͗̊̅ͬ̉͒̏͆͗͒̋ͤ̆̆ͥ𒈙.̴̢̟̩̗͊.̴̹͎̦̘͇͎̩̮̻̾͛̐ͅ𰻞.̷̧̫͙̤̗͇̔̂̀̄͗̍̈͋̈́̕.̷̨̛͈̤͈̲̥̱̹̲͖͗͛͆̓͊̅̈̕͠.̷̻̺͔͍̭͋̾̐̔͑̔̌̂͛͆̽͘͜͠͝͠.̷̧̨͉̝̳̲̫̙̻͎̬͚̒̀̄͒.̶̨͙̩̦̪͋̄͆͌̈́́͐̈̈́̕ͅ.̸̡̠̙̪͔͍̬̘̖̗̙̞̬͇̐͋͊͐̋̚ͅ.̷̢̮̮̖̹̟̖̩̗͙̝̺́̑̈̉͘͘͠ͅ.̴̨̡̧̤̳͖̰̼̺̮͉͖̲̫̳̜̹̄.̵̢̤̦̞͙̝̬͍̞̤͇̽̾̈́̔̋̋̓̌̋̐̓̅͜͝.̷͙͊.̵̠̜̞̭̘͉͓̞̤͍̝̈́̋̃́̈́͐̃̉͆̚͜.̴͉͈͓͈͉͎̺͍͕̥̦̙͙͕̈́̏̿́̏̔.̶͕̟̤͔͑̉̽̓̇̐́̃̿͜.̶̧̨̨̱̪̞̞̯̹̤̘̭̠͓̀̓̐̓́͑͂̉.̴̛̙̮͚̊͗̏̈́͗̅͆̑̂̌̐̃̊̂̓.̴̙͎̔͑̿͗̃̒́̏̏͑͘̕á́́́́́́́́́́́́́́́́́́́́́́́́́́́́́" --creator of 🎮𝕌𝕟ι𝕔𝗼d̢̪̲̬̳̩̟̍ĕ̸͓̼͙͈͐🚀

Unicode is a successful, constantly evolving standard aiming to organize symbols and characters (letters, digits, graphical symbols, emoji, ...) of all the world's writing systems and to define and standardize ways of encoding them as digital data, i.e. it's a big project promising to unify the encoding of any possible text in computers. As of writing this the latest version is 16.0 from 2024, defining over 150000 characters. The effort dates back to the 1980s and was started to do away with the mess and headaches induced by a plethora of existing incompatible text encoding systems -- in this it succeeded, Unicode is nowadays everywhere and it's the standard way of encoding text wherever you look, probably owing a lot to its backwards compatibility with plain ASCII encoding which was the most popular encoding of English back in the day (i.e. any old ASCII text is still a valid Unicode text, provided we use UTF-8 encoding). The standard is made by the Unicode Consortium whose members are basically all the big companies.

In Unicode every character is unique like a unicorn. It has all the diverse characters such as the penis (𓂸), ejaculating penis (𓂺), swastika (卐), hammer and sickle (☭), white power sign (👌), middle finger (🖕), pile of shit (💩) etc. Here is a lulzy part of Unicode: it's possible to combine some characters together with so called combining characters, so purely IN THEORY one can for example combine the prohibition symbol (the combining enclosing circle backslash, U+20E0) with LGBT propaganda characters and other fascist symbols to create interesting emojis like so: 🏳️‍🌈👨🏿👩⃠. Of course this created some controversies :D { It now seems like some systems refuse to render combinations of characters that might go against current official world politics. See also: 1984. ~drummyfish }
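
A combining character really is just another codepoint placed after its base character in the data; for illustration, a minimal C sketch (assuming a UTF-8 terminal; the bytes 0xcc 0x81 are the UTF-8 encoding of U+0301, the combining acute accent):

  #include <stdio.h>

  int main(void)
  {
    /* prints "a" followed by U+0301 (combining acute accent), which a
       Unicode-aware terminal renders as a single "á" glyph */
    printf("a\xcc\x81\n");
    return 0;
  }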

It is important to distinguish between Unicode codepoints (the abstract character codes) and Unicode encodings, they are two different things. For example the Unicode codepoint for character A is 65 (same as in ASCII), or (written the Unicode way) U+0041 (41 being 65 written in hexadecimal), but this value of 65 may then be represented in several different ways in the computer file, depending on the Unicode encoding we use (in UTF-8 it will be a single byte while in UTF-16 it will be two bytes). Currently Unicode defines these encodings (additional unofficial encodings exist as well):
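
To make the difference concrete, here is a small sketch (the byte values follow the standard encodings, the variable names are of course just made up for the example):

  /* one codepoint, three encodings -- the abstract code stays the same,
     only the byte representation differs: */

  /* "A", codepoint U+0041 (65): */
  unsigned char A_utf8[]  = { 0x41 };                   /* 1 byte               */
  unsigned char A_utf16[] = { 0x00, 0x41 };             /* 2 bytes (big endian) */
  unsigned char A_utf32[] = { 0x00, 0x00, 0x00, 0x41 }; /* 4 bytes (big endian) */

  /* pile of shit, codepoint U+1F4A9 (128169): */
  unsigned char poo_utf8[]  = { 0xf0, 0x9f, 0x92, 0xa9 }; /* 4 bytes              */
  unsigned char poo_utf16[] = { 0xd8, 0x3d, 0xdc, 0xa9 }; /* surrogate pair (BE)  */
  unsigned char poo_utf32[] = { 0x00, 0x01, 0xf4, 0xa9 }; /* 4 bytes (big endian) */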

  • UTF-8: Most widely used, backwards compatible with 7-bit ASCII, probably most suckless (you can literally ignore it for ASCII text and it won't inflate plain ASCII text). Character codes have variable width (they obviously have to), i.e. the basic characters take 1 byte but more complex ones may take up to 4 bytes (this may complicate or slow down e.g. counting string length). Generally codepoints are encoded like this (notice that not all values are valid, which may help detect non-UTF-8 text or corrupted data; a sketch of an encoder based on this table follows after the list):
    • first 128 codepoints: 0xxxxxxx (same as ASCII)
    • next 1920 codepoints: 110xxxxx 10xxxxxx
    • next 61440 codepoints: 1110xxxx 10xxxxxx 10xxxxxx
    • the rest: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  • UTF-16: Quite shitty encoding, uses either 16 or 32 bits to encode each character, i.e. it is variable length like UTF-8 but also wastes space like UTF-32. The encoding is also a bit messy (codepoints above 65535 have to be encoded by so called surrogate pairs). Probably avoid.
  • UTF-32: Uses literally 32 bits to encode the exact codepoint with leading bits being 0. Of course this wastes space but may be useful sometimes, for example in quickly finding Nth character or counting string length. Sucks for storage but may be useful for quick processing of text.
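
The UTF-8 table above translates to code pretty much directly; here is a minimal sketch of an encoder in C (the function name utf8Encode is made up, and for simplicity this doesn't reject the invalid surrogate range):

  #include <stdio.h>

  /* Encodes codepoint cp as UTF-8, writes up to 4 bytes to out,
     returns the number of bytes written (0 on invalid codepoint). */
  int utf8Encode(unsigned long cp, unsigned char *out)
  {
    if (cp < 0x80)           /* 0xxxxxxx (plain ASCII) */
    {
      out[0] = cp;
      return 1;
    }
    else if (cp < 0x800)     /* 110xxxxx 10xxxxxx */
    {
      out[0] = 0xc0 | (cp >> 6);
      out[1] = 0x80 | (cp & 0x3f);
      return 2;
    }
    else if (cp < 0x10000)   /* 1110xxxx 10xxxxxx 10xxxxxx */
    {
      out[0] = 0xe0 | (cp >> 12);
      out[1] = 0x80 | ((cp >> 6) & 0x3f);
      out[2] = 0x80 | (cp & 0x3f);
      return 3;
    }
    else if (cp < 0x110000)  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    {
      out[0] = 0xf0 | (cp >> 18);
      out[1] = 0x80 | ((cp >> 12) & 0x3f);
      out[2] = 0x80 | ((cp >> 6) & 0x3f);
      out[3] = 0x80 | (cp & 0x3f);
      return 4;
    }

    return 0; /* beyond U+10FFFF */
  }

  int main(void)
  {
    unsigned char buf[4];
    int len = utf8Encode(0x1f4a9, buf), i; /* pile of shit again */

    for (i = 0; i < len; ++i)
      printf("%02x ", buf[i]); /* prints: f0 9f 92 a9 */

    putchar('\n');
    return 0;
  }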

More detail: Unicode codepoints go from U+0000 to U+10FFFF (1114111 in decimal), i.e. there is a place for over a million characters (only 1112064 are actually valid characters, a few codes are reserved for other purposes). These codes are subdivided into 17 planes of 2^16 (65536) codepoints each, i.e. U+0000 to U+FFFF are plane 0, U+10000 to U+1FFFF are plane 1 etc. Planes are further subdivided into blocks that group together related characters. There are even so called "private use areas" (perverts BTFO), for example U+E000 to U+F8FF, which are left for third party use (for example you may use them to add custom emoji in your game's chat). As mentioned, the first 128 codepoints are equivalent to ASCII; furthermore the first 256 codepoints are equivalent to ISO 8859-1. This is for high backwards compatibility.
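
The plane arithmetic is trivial; a rough sketch (these function names are made up, not any standard API):

  /* which of the 17 planes a codepoint falls into (-1 = out of range) */
  int unicodePlane(unsigned long cp)
  {
    return cp <= 0x10ffff ? (int) (cp >> 16) : -1; /* 65536 codepoints per plane */
  }

  /* is the codepoint in the U+E000..U+F8FF private use area? */
  int isBmpPrivateUse(unsigned long cp)
  {
    return cp >= 0xe000 && cp <= 0xf8ff;
  }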

The Unicode project is indeed highly ambitious, it's extremely difficult to do what they aim to do because, naturally, many hard to answer questions come up, such as what even IS a character (Do we include every possible emoji? Icons and pictograms used on road signs? Their upside down versions? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or a different one? Are the same characters in Chinese and Japanese different if they have different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...), how to deal with things such as accents, AND there are many crazy writing systems all over the world (Some write right to left, some top to bottom, some may utilize color, some compose characters by combining together multiple other characters etcetc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the Internet and so on and so forth. And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?

It's also crucial for Unicode to very clearly state its goals and philosophies so that all the issues and questions that come up may be answered and decided in accordance with them. For example part of the Unicode philosophy is to treat the symbols as abstract entities defined by their usage and meaning rather than their exact graphical representation (this is left to specific typesetting/rendering systems, fonts etc.).

Is Unicode crap and bloat? Yes, it inevitably has to be, there's a lot of obscurity and crap in Unicode and many systems infamously can't handle malicious (or even legit) Unicode text and will possibly even crash (see e.g. the infamous black dot of death). A lot of the mess previously caused by different encodings now just poured over to Unicode itself: for example there are sometimes multiple versions of the exact same character (e.g. those with accents -- one version is a plain character composed with a combining accent character, the other one a single "precomposed" character) and so it's possible to encode exactly the same string in several ways and a non-trivial Unicode normalization is required to fix this. Unicode can be raped and abused in spectacular ways, for example using homoglyphs (characters that graphically look like other characters but are in fact different) one may create text that won't be detected by simple exact-comparison algorithms (for example you may be able to register a username that graphically looks like someone else's already registered username). There are also ways to combine characters in queer ways, e.g. make very tall text by creating chains of exponents or something (see the rabbithole around so called combining characters), which can similarly nuke many naive programs. With Unicode things that were previously simple (such as counting string length or computing the size of a rectangle into which a text will fit) now become hard (and slow) to do.

Still it has to be said that Unicode is designed relatively well (of course minus the fascist political bias in its choice of characters) for what it's trying to do, it's just that the goal is ultimately an untameable beast, a bittersweet topic and a double edged sword -- for LRS it's important especially that we don't have to care much about it, we can just keep using ASCII and we're good, i.e. we aren't forced to use the bloated part of Unicode and if we get Unicode text, we can quite easily filter out non-ASCII characters. Full Unicode compliance is always bloat and shouldn't be practiced, but it's possible to partially comply with only minimum added complexity. On one hand it just werks -- back in the 90s we still had to trial/error different encodings to correctly display non-English texts, nowadays everything just displays correctly, but comfort comes with a price tag.

Unicode has, to some degree, fucked up many texts because soyboys and bloat fans now try to use the "correct" characters for everything, so they will for example use the correct "multiplication sign" instead of just x or * which won't display well in ASCII readers, but again, this can at least be automatically corrected. Terminal emulators now include ugly Unicode bullcrap and have to drag along huge fonts and a constantly updating Unicode library. Unicode is also controversial because SJWs push it too hard, claiming that ASCII is racist to people who can only write in retarded languages like Chinese -- we say it's better for the Chinese to learn English than to fuck computers up. Other controversies revolve around emojis and other political symbols, SJWs push crap like images of pregnant men and want to censor "offensive" symbols. Unicode also allowed noobs to make what they call "ASCII_art" without having any actual skill at it.
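
To illustrate the string length point from above: with UTF-8 you can no longer just count bytes, you have to skip the continuation bytes (those of the form 10xxxxxx), and even then you only get the number of codepoints, NOT the number of characters the user actually sees (combining marks etc. still count separately). A minimal sketch in C (function name made up, assuming the source file itself is saved as UTF-8):

  #include <stdio.h>

  /* counts codepoints (not bytes, not perceived characters) in a
     UTF-8 string by ignoring continuation bytes */
  unsigned int utf8Length(const char *s)
  {
    unsigned int count = 0;

    while (*s)
    {
      if (((unsigned char) *s & 0xc0) != 0x80) /* not a continuation byte? */
        count++;

      s++;
    }

    return count;
  }

  int main(void)
  {
    printf("%u\n", utf8Length("lol")); /* 3 (for ASCII bytes == codepoints) */
    printf("%u\n", utf8Length("čau")); /* 3 codepoints despite 4 bytes      */
    return 0;
  }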

Here are some examples of Unicode characters:

regular ASCII:
  ! " # $ % & ' ( ) * + - . / 0 1 2 3 4 5 6 7
  8 9 : ; < = > ? @ A B C D E F G H I J K L M
  N O P Q R S T U V W X Y Z [ \ ] ^ _ @ ` a b
  c d e f g h i j k l m n o p q r s t u v w x
  y z { | } ~

non-English:
  á Č Ç À à Ã Ë Ô ť Ö ř í ů Ľ É ä ü
  Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ
  α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ
  あ い う え お か き く け こ が ぎ ぐ
  ア イ ウ エ オ カ キ ク ケ コ ガ ギ グ
  А Б В Г Ґ Д Е Ё Є Ж З И Й К Л М Н 
  漢 字 阪 熊 奈 岡 鹿 梨 阜 埼 茨 栃 媛

emoji:
  😀 😁 😂 😃 😄 😅 ☹ ☻

right to left text:
 ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل و ه د ج ب   

math:
  × ± ∓ ÷  √ ¬ ⊕ ⊖ ⊗ ⊙ ⋅ ∥ ∧ 
  ∏ ∑ ∫ ⋀  ≦ ≧ ≤ ≥ = ≟ ≅ ≠ ⇒ ⇐
  ⬄ ≺ ≻ ⋞ ⋟ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ∈ ∉ ∀ ∃ ∄ ∊ ⊢
  ⊣ ∞ % ∅ ∆ ∡ ⋮ ⋯ ⋱ ⋰ ◻ 𝕬 𝕭 𝕮 𝕯 𝕰 𝕱
  ⟪ ⟫ ⧼ ⧽ ⁰ⁱ²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉

graphics:
  ┌─┬─┐       ╭─╮ ┏┓
  │ ┊ │       │ │ ┗┛
  ├─┼─┤   ╲   ╰─╯
  └┈┴┈┘
  ░▒▓█▅▁ ▐▌▙▛▚▞▜▟▗▄▖▝▀▘

other:
  ⭠ ⭡ ⭢ ⭣ ⭤ ⭥ ⭦ ⭧ ⭨ ⭩ ☀ ☆ ★ ☏ ☠
  ☮ Ⓐ 卐 卍 ☢ ☣ ☪ ☯ ☭ ☰ ☾ ♀ ♁ ♂ ⚥
  ♡ ♪ ♫ 𝄞 ♿ ⚅ ⚠ ⚬ ⚽ ⛤ € £ ✂ ✈ ✓
  © ™ ® ⁋ ⏚ ⏎ ℃ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛
  ♜ ♝ ♞ ♟︎ ⒛ ⓯ ⬡ ⛱ ⛺ ⛏ ๛ ✞ 🌈️
  ☒ ⌛  ⌚ ⚡ ௵ ○ ◎ ● ◑ ◐ ◤ ▣ ▤ ▥
  o u ɯ l ʞ ɾ ᴉ ɥ ɓ ɟ ǝ p ɔ q ɐ
  𒐫ﷺ

similar (homoglyphs):
  ΑАA𝙰𝐀ⒶДд⩜𝒜𝗔
  ΒВB𝙱ℬ𝐁𝕭вβß𝗕
  ϹС𝙲𝐂𝕮𝒞𝗖

How to convert UTF-8 to ASCII? Easiest way is to just filter out all bytes with the highest bit set, or, in other words, throw out all bytes with value higher than 127 (or maybe replace such bytes with question marks or something). This will possibly deform the text though, so it may be a last resort solution. Better (but of course still imperfect) results may be achieved by replacing Unicode characters by their ASCII approximations (e.g. the multiplication symbol × by the letter x and so on), but this is non-trivial, a conversion table is needed -- thankfully there exist programs for doing this, e.g.: cat unicodefile.txt | iconv -f utf-8 -t ascii//TRANSLIT.
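
A tiny C filter implementing the crude method from above (a sketch; it replaces each non-ASCII character with a question mark, skipping continuation bytes so that one multibyte character becomes one question mark, not several):

  #include <stdio.h>

  int main(void)
  {
    int c;

    while ((c = getchar()) != EOF)
    {
      if (c < 128)                 /* plain ASCII: keep */
        putchar(c);
      else if ((c & 0xc0) != 0x80) /* leading byte of a multibyte char */
        putchar('?');
      /* continuation bytes (10xxxxxx) are dropped silently */
    }

    return 0;
  }

Compile it and run e.g. as ./utf8toascii < unicodefile.txt (the program name is just an example).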