Update

2025-03-17 16:42:36 +01:00 · f69e3a3e4b
commit f69e3a3e4b
parent 6f0a813940
16 changed files with 2006 additions and 1999 deletions
--- a/unicode.md
+++ b/unicode.md
@@ -18,11 +18,11 @@ In Unicode every character is unique like a unicorn. It has all the diverse char

 **More detail**: Unicode codepoints go from U+0000 to U+10FFFF (1114111 in decimal), i.e. there is a place for over a million characters (only 1112064 are actually valid characters, the remaining 2048 are surrogate codepoints reserved for UTF-16 encoding). These codes are subdivided into **17 planes** of 2^16 (65536) codepoints each, i.e. U+0000 to U+FFFF are plane 0, U+10000 to U+1FFFF are plane 1 etc. Planes are further subdivided into blocks that group together related characters. There are even so-called "private areas" (perverts BTFO), for example U+E000 to U+F8FF, which are left for third-party use (for example you may use them to add custom emoji in your game's chat). As mentioned, the first 128 codepoints are equivalent to [ASCII](ascii.md); furthermore the first 256 codepoints are equivalent to ISO 8859-1. This keeps Unicode highly backwards compatible.
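
The arithmetic above is easy to check by hand. Here is a minimal Python sketch; the helper names (`plane_of`, `is_bmp_private_use`) are made up for illustration and are not part of any library:

```python
# Sketch of the codepoint arithmetic from the paragraph above.
# All names here are illustrative assumptions, not a standard API.

PLANE_SIZE = 1 << 16                  # 2^16 = 65536 codepoints per plane
MAX_CODEPOINT = 0x10FFFF              # last Unicode codepoint
SURROGATES = range(0xD800, 0xE000)    # 2048 codepoints reserved for UTF-16

def plane_of(cp):
    """Return the plane number (0 to 16) of a codepoint."""
    if not 0 <= cp <= MAX_CODEPOINT:
        raise ValueError("not a Unicode codepoint")
    return cp >> 16                   # same as cp // PLANE_SIZE

def is_bmp_private_use(cp):
    """True for the private use area U+E000 to U+F8FF in plane 0."""
    return 0xE000 <= cp <= 0xF8FF

total = MAX_CODEPOINT + 1             # number of all codepoints
valid = total - len(SURROGATES)       # codepoints usable for characters

# The first 128 codepoints match ASCII, the first 256 match ISO 8859-1.
ascii_matches = bytes(range(128)).decode("ascii") == "".join(map(chr, range(128)))
latin1_matches = bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
```

Shifting right by 16 bits works because each plane holds exactly 2^16 codepoints, so the plane number is simply the part of the code above the low 16 bits.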

-The Unicode [project](project.md) is indeed highly ambitious, it's extremely difficult to do what they aim to do because, naturally, many hard to answer questions come up, such as what even IS a character (Do we include every possible emoji? Icons and pictograms used on road signs? Their upside down versions? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or a different one? Are the same characters in Chinese and Japanese different if they have different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...), how to deal with things such as accents, AND there are many crazy writing systems all over the world (Some write right to left, some top to bottom, some may utilize color, some compose characters by combining together multiple other characters etcetc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the [Internet](internet.md) and so on and so [forth](forth.md). And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?
+Unlike chad ASCII, the Unicode [project](project.md) reaches biblical proportions and is indeed highly ambitious, it's exceptionally difficult and challenging to do what they aim to do because, naturally, many hard to answer questions come up, such as what even IS a character (Do we include every possible emoji? Icons and pictograms used on road signs? Their upside down versions? Fictional alien language characters from sci-fi movies? ...), which characters to distinguish (Are same looking characters in different scripts the same character or a different one? Are the same characters in Chinese and Japanese different if they have different meaning in each language? ...), how to organize and assign the codes (How much space to allocate? What meaning will the code have? ...), how to deal with things such as accents, AND there are many crazy writing systems all over the world (Some write right to left, some top to bottom, some may utilize color, some compose characters by combining together multiple other characters etcetc.). And, of course, writing systems evolve and change constantly, new ones are being discovered by archaeologists, new ones are invented by the [Internet](internet.md) and so on and so [forth](forth.md). And what if we make a mistake? Do we correct it and break old documents or leave it in for backwards compatibility?

 It's also crucial for Unicode to state its goals and philosophies very clearly, so that all the issues and questions that come up may be answered and decided in accordance with them. For example, part of the Unicode philosophy is to treat the symbols as abstract entities defined by their usage and meaning rather than their exact graphical representation (this is left to specific typesetting/rendering systems, [fonts](font.md) etc.).

-**Is Unicode [crap](shit.md) and [bloat](bloat.md)?** Yes, it inevitably has to be, there's a lot of obscurity and crap in Unicode and many systems infamously can't handle malicious (or even legit) Unicode text and will possibly even crash (see e.g. the infamous *black dot of death*). A lot of that mess previously caused by different encodings now just poured over to Unicode itself: for example there are sometimes multiple versions of the exact same character (e.g. those with accents -- one versions is a composed plain character plus accent character, the other one a single "precomposed" character) and so it's possible to encode exactly the same string in several ways and a non-trivial Unicode [normalization](normalization.md) is required to fix this. Unicode can be raped and abused in spectacular ways, for example using homoglyphs (characters that graphically look like other characters but are in fact different) one may create text that won't be detected by simple exact-comparison algorithms (for example you may be able to register a username that graphically looks like someone else's already registered username). There are also ways to combine characters in queer ways, e.g. make very tall text by creating chains of exponents or something (see the rabbithole around so called *composing characters*), which can just similarly nuke many naive programs. With Unicode things that were previously simple (such as counting string length or computing the size of rectangle into which a text will fit) now become hard (and slow) to do. Still it has to be said that **Unicode is designed relatively well** (of course minus the fascist political bias in its choice of characters) for what it's trying to do, it's just that the goal is ultimately an untameable beast, a bittersweet topic and a double edged sword -- for [LRS](lrs.md) it's important especially that we don't have to care much about it, we can just still keep using [ASCII](ascii.md) and we're good, i.e. 
we aren't forced to use the bloated part of Unicode and if we get Unicode text, we can quite easily filter out non-ASCII characters. Full Unicode compliance is always bloat and shouldn't be practiced, but it's possible to partially comply with only minimum added complexity. On one hand it [just werks](just_werks.md) -- back in the [90s](90s.md) we still had to trial/error different encodings to correctly display non-English texts, nowadays everything just displays correctly, but comfort comes with a price tag. Unicode has, to some degree, fucked up many texts because soyboys and bloat fans now try to use the "correct" characters for everything, so they will for example use the correct "multiplication sign" instead of just *x* or * which won't display well in ASCII readers, but again, this can at least be automatically corrected. Terminal emulators now include ugly Unicode bullcrap and have to drag along huge fonts and a constantly updating Unicode library. Unicode is also controversial because [SJWs](sjw.md) push it too hard, claiming that ASCII is [racist](racism.md) to people who can only write in retarded languages like [Chinese](chinese.md) -- we say it's better for the Chinese to learn [English](english.md) than to fuck computers up. Other controversies revolve around emojis and other political symbols, SJWs push crap like images of pregnant men and want to [censor](censorship.md) "offensive" symbols. Unicode also allowed noobs to make what they call "[ASCII_art](ascii_art.md)" without having any actual skill at it.
+**Is Unicode [crap](shit.md) and [bloat](bloat.md)?** Yes, it inevitably has to be, there's a lot of obscurity and crap in Unicode and many systems infamously can't handle malicious (or even legit) Unicode text and will possibly even crash (see e.g. the infamous *black dot of death*). A lot of that mess previously caused by different encodings now just poured over to Unicode itself: for example there are sometimes multiple versions of the exact same character (e.g. those with accents -- one versions is a composed plain character plus accent character, the other one a single "precomposed" character) and so it's possible to encode exactly the same string in several ways and a non-trivial Unicode [normalization](normalization.md) is required to fix this. Unicode can be raped and abused in spectacular ways, for example using homoglyphs (characters that graphically look like other characters but are in fact different) one may create text that won't be detected by simple exact-comparison algorithms (for example you may be able to register a username that graphically looks like someone else's already registered username). There are also ways to combine characters in queer ways, e.g. make very tall text by creating chains of exponents or something (see the rabbithole around so called *composing characters*), which can just similarly nuke many naive programs. With Unicode things that were previously simple (such as counting string length or computing the size of rectangle into which a text will fit) now become hard (and slow) to do. Still it has to be said that **Unicode is designed relatively well** (of course minus the fascist political bias in its choice of characters) for what it's trying to do, it's just that the goal is ultimately an untameable beast, a bittersweet topic and a double edged sword -- for [LRS](lrs.md) it's important especially that we don't have to care much about it, we can just still keep using [ASCII](ascii.md) and we're good, i.e. 
we aren't forced to use the bloated part of Unicode and if we get Unicode text, we can quite easily filter out non-ASCII characters. Full Unicode compliance is always bloat and shouldn't be practiced, but it's possible to partially comply with only minimum added complexity. On one hand it [just werks](just_werks.md) -- back in the [90s](90s.md) we still had to trial/error different encodings to correctly display non-English texts, nowadays everything just displays correctly, but comfort comes with a price tag. Unicode has, to some degree, fucked up many texts because [soyboys](soydev.md) and bloat fans now tryhard to use the "correct" characters for everything, so they will for example use the correct "multiplication sign" instead of just *x* or * which won't display well in ASCII readers, but again, this can at least be automatically corrected. Terminal emulators now include ugly Unicode bullcrap and have to drag along huge fonts and a constantly updating Unicode library. Unicode is also controversial because [SJWs](sjw.md) push it too hard, claiming that ASCII is [racist](racism.md) to people who can only write in retarded languages like [Chinese](chinese.md) -- we say it's better for the Chinese to learn [English](english.md) than to fuck computers up. Other controversies revolve around emojis and other political symbols, SJWs push crap like images of pregnant men and want to [censor](censorship.md) "offensive" symbols. Unicode also allowed noobs to make what they call "[ASCII_art](ascii_art.md)" without having any actual skill at it.
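
A few of the points above (two encodings of the same accented character, homoglyphs slipping past exact comparison, filtering down to ASCII) can be sketched in Python using only the standard `unicodedata` module. The helper names (`same_text`, `ascii_only`, `to_ascii_lossy`) are assumptions made up for this example, and this is a rough illustration rather than a complete solution:

```python
import unicodedata

composed = "\u00e9"      # precomposed "e with acute" as a single codepoint
decomposed = "e\u0301"   # plain "e" followed by a combining acute accent

def same_text(a, b):
    """Compare two strings after bringing both to NFC normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def ascii_only(text, replacement=""):
    """Keep ASCII characters, swap everything else for `replacement`."""
    return "".join(c if ord(c) < 128 else replacement for c in text)

def to_ascii_lossy(text):
    """Decompose accented letters first, then drop what isn't ASCII."""
    return ascii_only(unicodedata.normalize("NFKD", text))

# Naive comparison and length see two different strings...
naive_equal = composed == decomposed               # False
lengths = (len(composed), len(decomposed))         # (1, 2)
# ...while normalization recognizes them as the same text.
normalized_equal = same_text(composed, decomposed) # True

# Homoglyphs are NOT fixed by normalization: Latin "a" vs Cyrillic "a".
homoglyph_equal = same_text("a", "\u0430")         # False
```

For example `to_ascii_lossy("caf\u00e9")` gives `"cafe"` and `ascii_only("3 \u00d7 4", "x")` gives `"3 x 4"`. Note that detecting homoglyphs needs a separate confusables table, normalization alone doesn't help there.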

 Here are some **examples** of Unicode characters: