Update
This commit is contained in:
parent
3d238884b9
commit
c03351e922
15 changed files with 1885 additions and 1825 deletions
61
unicode.md
|
@ -4,7 +4,7 @@
|
|||
|
||||
WORK IN PROGRESS
|
||||
|
||||
Unicode is a successful standard that aims to organize symbols and characters (letters, digits, graphical symbols, [emoticons](emoticon.md), ...) of all world's writing systems and to define several ways of encoding them as [digital](digital.md) [data](data.md), i.e. it's a big [project](project.md) promising to unify digitization and encoding of any possible text in [computers](computer.md). The effort dates back to 1980s and was started to do away with the mess of many existing incompatible text encoding systems -- in this it succeeded, Unicode is nowadays everywhere and it's the standard way of encoding text wherever you look, probably owing a lot to its backwards compatibility with plain [ASCII](ascii.md) encoding which was the most popular encoding of English back in the day (i.e. any old ASCII text is still a valid Unicode text, provided we use UTF-8 encoding). The standard is made by the Unicode Consortium whose members are basically all the big companies.
|
||||
Unicode is a successful standard that aims to organize symbols and characters (letters, digits, graphical symbols, [emoji](emoji.md), ...) of all world's writing systems and to define several ways of encoding them as [digital](digital.md) [data](data.md), i.e. it's a big [project](project.md) promising to unify digitization and encoding of any possible text in [computers](computer.md). Currently there are over 100000 characters defined by the standard. The effort dates back to 1980s and was started to do away with the mess of many existing incompatible text encoding systems -- in this it succeeded, Unicode is nowadays everywhere and it's the standard way of encoding text wherever you look, probably owing a lot to its backwards compatibility with plain [ASCII](ascii.md) encoding which was the most popular encoding of English back in the day (i.e. any old ASCII text is still a valid Unicode text, provided we use UTF-8 encoding). The standard is made by the Unicode Consortium whose members are basically all the big companies.
|
||||
|
||||
In Unicode every character is unique like a unicorn.
|
||||
|
||||
|
@ -18,12 +18,65 @@ In Unicode every character is unique like a unicorn.
|
|||
- **UTF-16**: Quite [shitty](shit.md) encoding, uses either 16 or 32 bits to encode each character, i.e. it is variable length like UTF-8 but also wastes space like UTF-32. The encoding is also a bit messy. Probably avoid.
|
||||
- **UTF-32**: Uses literally 32 bits to encode the exact codepoint with leading bits being 0. Of course this wastes space but may be useful sometimes, for example in quickly finding Nth character or counting string length. Sucks for storage but may be useful for quick processing of text.
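The size trade-offs between the encodings listed above can be checked with a short Python sketch. It relies only on Python's built-in codecs; the helper name `encoded_sizes` and the sample characters are illustrative choices, not part of the article.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return (
        len(ch.encode("utf-8")),
        # The "-be" codec variants are used so that no byte order mark is
        # prepended, which would otherwise inflate the counts.
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

# Sample characters (chosen for illustration): U+0041, U+0159, U+6F22, U+1F600.
for ch in ("A", "\u0159", "\u6f22", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Characters outside plane 0 (such as U+1F600) take 4 bytes in UTF-16, which is the variable-length behaviour criticized above, while UTF-32 always takes 4 bytes.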
|
||||
|
||||
**More detail**: Unicode codepoints go from U+0000 to U+10FFFF (1114111 in decimal), i.e. there is a place for over a million characters (only 1112064 are actually valid characters, a few are used for other purposes). These codes are subdivided into **17 planes** by 2^16 (65536) characters, i.e. U+0000 to U+FFFF are plane 0, U+10000 to U+1FFFF are plane 1 etc. Planes are further subdivided to blocks that group together related characters. There are even so called "private areas" (perverts BTFO), for example U+E000 to U+F8FF, which are left for third party use (for example you may use them to add custom emoticons in your game's chat).
|
||||
**More detail**: Unicode codepoints go from U+0000 to U+10FFFF (1114111 in decimal), i.e. there is a place for over a million characters (only 1112064 are actually valid characters, a few are used for other purposes). These codes are subdivided into **17 planes** by 2^16 (65536) characters, i.e. U+0000 to U+FFFF are plane 0, U+10000 to U+1FFFF are plane 1 etc. Planes are further subdivided to blocks that group together related characters. There are even so called "private areas" (perverts BTFO), for example U+E000 to U+F8FF, which are left for third party use (for example you may use them to add custom emoji in your game's chat). As mentioned, the first 128 codepoints are equivalent to [ASCII](ascii.md); furthermore the first 256 codepoints are equivalent to ISO 8859-1. This is for high backwards compatibility.
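The arithmetic behind the planes can be sketched in a few lines of Python. The function names here (`plane_of`, `is_private_use_bmp`) are hypothetical helpers made up for this example; the one fact added beyond the paragraph above is that the reserved codepoints are the 2048 UTF-16 surrogates, U+D800 to U+DFFF.

```python
MAX_CODEPOINT = 0x10FFFF
PLANE_SIZE = 1 << 16  # 65536 codepoints per plane

def plane_of(codepoint):
    """Return the plane number (0 to 16) that a codepoint falls into."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError("not a Unicode codepoint")
    return codepoint >> 16  # same as integer division by 65536

def is_private_use_bmp(codepoint):
    """True for the plane 0 private use range U+E000 to U+F8FF."""
    return 0xE000 <= codepoint <= 0xF8FF

# 17 planes of 65536 codepoints each; removing the 2048 surrogate
# codepoints (U+D800 to U+DFFF) leaves the 1112064 figure.
total = MAX_CODEPOINT + 1
assert total == 17 * PLANE_SIZE == 1114112
assert total - (0xDFFF - 0xD800 + 1) == 1112064

# The first 256 codepoints match ISO 8859-1 byte values one to one.
assert bytes(range(256)).decode("latin-1") == "".join(chr(i) for i in range(256))

print(plane_of(0x41), plane_of(0x1F600), plane_of(0x10FFFF))  # 0 1 16
```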
|
||||
|
||||
The Unicode [project](project.md) is indeed highly ambitious, it's extremely difficult to do what they aim to do because, naturally, many hard to answer questions come up, such as what even IS a character (Do we include every possible emoticon? Icons and pictograms used on road signs? Their upside down versions? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or a different one? Are the same characters in Chinese and Japanese different if they have different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...), how to deal with things such as accents, AND there are many crazy writing systems all over the world (Some write right to left, some top to bottom, some may utilize color, some compose characters by combining together multiple other characters etcetc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the [Internet](internet.md) and so on and so [forth](forth.md). And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?
|
||||
The Unicode [project](project.md) is indeed highly ambitious, it's extremely difficult to do what they aim to do because, naturally, many hard to answer questions come up, such as what even IS a character (Do we include every possible emoji? Icons and pictograms used on road signs? Their upside down versions? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or a different one? Are the same characters in Chinese and Japanese different if they have different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...), how to deal with things such as accents, AND there are many crazy writing systems all over the world (Some write right to left, some top to bottom, some may utilize color, some compose characters by combining together multiple other characters etcetc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the [Internet](internet.md) and so on and so [forth](forth.md). And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?
|
||||
|
||||
It's also important that Unicode clearly states its goals and philosophy so that all the issues and questions that come up may be answered and decided in accordance with them. For example part of the Unicode philosophy is to treat the symbols as abstract entities defined by their usage and meaning rather than their exact graphical representation (this is left to specific typesetting/rendering systems, [fonts](font.md) etc.).
|
||||
|
||||
**Is Unicode [crap](shit.md) and [bloat](bloat.md)?** Yes, it inevitably has to be, there's a lot of obscurity and crap in Unicode and many systems infamously can't handle malicious (or even legit) Unicode text and will even crash. Unicode can be raped and abused in spectacular ways, for example using homoglyphs (characters that graphically look like other characters but are in fact different) one may create text that won't be detected by simple exact-comparison algorithms (for example you may be able to register a username that graphically looks like someone else's already registered username). There are also some kind of ways to combine characters weirdly, e.g. make very tall text by creating chains of exponents or something, which can just nuke many programs. Still it has to be said that **Unicode is designed relatively well** for what it's trying to do, it's kind of a bittersweet, double edged kind of beast -- for [LRS](lrs.md) it's important especially that we don't have to care much about it, we can just still keep using [ASCII](ascii.md) and we're good, i.e. we aren't forced to use the bloated part of Unicode and if we get Unicode text, we can quite easily filter out non-ASCII characters. Full Unicode compliance is always bloat and shouldn't be practiced, but it's possible to partially comply with only minimum added complexity. Nevertheless Unicode has, to some degree, fucked up many texts because soyboys and bloat fans now try to use the "correct" characters for everything, so they will for example use the correct "multiplication sign" instead of just *x* or * which won't display well in ASCII readers, but again, this can at least be automatically corrected. Unicode is also controversial because [SJWs](sjw.md) push it too hard, claiming that ASCII is [racist](racism.md) to people who can only write in retarded languages like [Chinese](chinese.md) -- we say it's better for the Chinese to learn [English](english.md) than to fuck computers up. 
Unicode also allowed noobs to make what they call "[ASCII_art](ascii_art.md)" without having any actual skill at it.
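As a rough illustration of the points above (homoglyphs slipping past exact comparison, and falling back to plain ASCII), here is a minimal Python sketch. The names `strip_non_ascii`, `asciify` and the `ASCII_FALLBACKS` table are assumptions made for this example; a real substitution table would need to be much larger.

```python
def strip_non_ascii(text, replacement=""):
    """Drop (or replace) every character outside the 7-bit ASCII range."""
    return "".join(c if ord(c) < 128 else replacement for c in text)

# Hypothetical substitution table: multiplication sign, <=, >=, not-equal.
ASCII_FALLBACKS = {"\u00d7": "x", "\u2264": "<=", "\u2265": ">=", "\u2260": "!="}

def asciify(text):
    """Swap known symbols for ASCII stand-ins, then mark whatever is left."""
    for symbol, fallback in ASCII_FALLBACKS.items():
        text = text.replace(symbol, fallback)
    return strip_non_ascii(text, "?")

latin = "ADMIN"
lookalike = "\u0410DMIN"  # starts with CYRILLIC CAPITAL LETTER A (U+0410)

print(latin == lookalike)               # False: exact comparison sees two names
print(lookalike.isascii())              # False: a cheap way to reject such input
print(asciify("3 \u00d7 4 \u2264 12"))  # 3 x 4 <= 12
```

Rejecting or flattening non-ASCII input like this is the crude approach the text describes; it does not attempt full confusable detection.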
|
||||
|
||||
TODO: funny characters?
|
||||
Here are some **examples** of Unicode characters:
|
||||
|
||||
```
|
||||
regular ASCII:
|
||||
! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7
|
||||
8 9 : ; < = > ? @ A B C D E F G H I J K L M
|
||||
N O P Q R S T U V W X Y Z [ \ ] ^ _ ` a b
|
||||
c d e f g h i j k l m n o p q r s t u v w x
|
||||
y z { | } ~
|
||||
|
||||
non-English:
|
||||
á Č Ç À à Ã Ë Ô ť Ö ř í ů Ľ É ä ü
|
||||
Α Β Γ Δ Ε Ζ Η Θ Ι Κ Λ Μ Ν Ξ Ο Π Ρ
|
||||
α β γ δ ε ζ η θ ι κ λ μ ν ξ ο π ρ
|
||||
あ い う え お か き く け こ が ぎ ぐ
|
||||
ア イ ウ エ オ カ キ ク ケ コ ガ ギ
|
||||
А Б В Г Ґ Д Е Ё Є Ж З И Й К Л М Н
|
||||
漢 字 阪 熊 奈 岡 鹿 梨 阜 埼 茨 栃 媛
|
||||
|
||||
emoji:
|
||||
😀 😁 😂 😃 😄 😅 ☹ ☻
|
||||
|
||||
right to left text:
|
||||
ظ ض ذ خ ث ت ش ر ق ص ف ع س ن م ل و ه د ج ب
|
||||
|
||||
math:
|
||||
× ± ∓ ÷ ∪ √ ¬ ⊕ ⊖ ⊗ ⊙ ⋅ ∥ ∧ ∨ ∩ ∪
|
||||
∏ ∑ ∫ ⋀ ⋁ ⋂ ⋃ ≦ ≧ ≤ ≥ = ≟ ≅ ≠ ⇒ ⇐
|
||||
⬄ ≺ ≻ ⋞ ⋟ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ∈ ∉ ∀ ∃ ∄ ∊ ⊢
|
||||
⊣ ∞ % ∅ ∆ ∡ ⋮ ⋯ ⋱ ⋰ ◻ 𝕬 𝕭 𝕮 𝕯 𝕰 𝕱
|
||||
⟪ ⟫ ⧼ ⧽ ⁰ⁱ²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉
|
||||
|
||||
graphics:
|
||||
┌─┬─┐ ╭─╮ ┏┓
|
||||
│ ┊ │ ╳ │ │ ┗┛
|
||||
├─┼─┤ ╱ ╲ ╰─╯
|
||||
└┈┴┈┘
|
||||
░▒▓█▅▁ ▐▌▙▛▚▞▜▟▗▄▖▝▀▘
|
||||
|
||||
other:
|
||||
⭠ ⭡ ⭢ ⭣ ⭤ ⭥ ⭦ ⭧ ⭨ ⭩ ☀ ☆ ★ ☏ ☠
|
||||
☮ Ⓐ 卐 卍 ☢ ☣ ☪ ☯ ☭ ☰ ☾ ♀ ♁ ♂ ⚥
|
||||
♡ ♪ ♫ 𝄞 ♿ ⚅ ⚠ ⚬ ⚽ ⛤ € £ ✂ ✈ ✓
|
||||
© ™ ® ⁋ ⏚ ⏎ ℃ ♔ ♕ ♖ ♗ ♘ ♙ ♚ ♛
|
||||
♜ ♝ ♞ ♟︎ ⒛ ⓯ ⬡ ⛱ ⛺ ⛏ ๛ ✞ 🌈️
|
||||
☒ ⌛ ⌚ ⚡ ௵ ○ ◎ ● ◑ ◐ ◤ ▣ ▤ ▥
|
||||
o u ɯ l ʞ ɾ ᴉ ɥ ɓ ɟ ǝ p ɔ q ɐ
|
||||
𒐫ﷺ
|
||||
|
||||
similar (homoglyphs):
|
||||
ΑАA𝙰𝐀ⒶДд⩜𝒜𝗔
|
||||
ΒВB𝙱ℬ𝐁𝕭вβß𝗕
|
||||
ϹСⅭC𝙲ℂ𝐂𝕮𝒞𝗖
|
||||
```
|