This commit is contained in:
Miloslav Ciz 2024-04-03 17:16:51 +02:00
parent b9fd8d7532
commit f8e54987d6
9 changed files with 1846 additions and 1758 deletions

View file

@ -46,7 +46,7 @@ Let's write an extremely primitive Markov bot that will work on the level of ind
Keep in mind this example is really extremely simple, it only looks one letter back and makes some further simplifications, for example it only approximates the probabilities with kind of a [KISS](kiss.md) hack -- we won't record any numeric probability, we'll only hold a table of letters, each one having a "bucket" of letters that may possibly follow; during training we'll always throw a preceding letter's follower to a random place in the preceding letter's bucket, with the idea that once we finish training, statistically in any bucket there will remain more letters that are more likely to follow given letter, just because we simply threw more such letters in. Similarly when generating the output text we will choose a letter to follow the current one by looking into the table and pulling out a random follower from that letter's bucket, again hoping that letters that have greater presence in the bucket will be more likely to be randomly selected. This approach has issues, for example regarding the question of ideal bucket size, and it introduces statistical biases (maximum probability is limited by bucket size, order matters, later letters are kind of privileged), but it kind of works. Try to think of how we could make a better text generator -- for starters it might work on the level of words and could take into account a history of let's say three letters, i.e. it would record triplets of words and then list of words that likely follow, along with each one's probability that we would record as an actual number to make the probabilities accurate.
Anyway with all this said, below is a [C](c.md) code implementing the above described text generator. To use it just pipe some input ASCII text to it, however make it reasonably sized (a few thousand lines maybe, please don't feed it whole Britannica, the output won't be better), keep in mind the program always trains itself from scratch (in practice we might separate training from generation, as serious training might take very long, i.e. we would have a separate training program that would output a trained model, i.e. the learner probabilities, and then a generator that would only take the trained model and generate output text). Here is the code:
Anyway with all this said, below is a [C](c.md) code implementing the above described text generator. To use it just pipe some input ASCII text to it, however make it reasonably sized (a few thousand lines maybe, please don't feed it whole Britannica, the output won't be better), keep in mind the program always trains itself from scratch (in practice we might separate training from generation, as serious training might take very long, i.e. we would have a separate training program that would output a trained model, i.e. the learned probabilities, and then a generator that would only take the trained model and generate output text). Here is the code:
```
#include <stdio.h>
@ -98,4 +98,56 @@ int main(void)
}
```
Trying it out on the text of [this wiki](LRS_wiki.md) may output something like this:
```
Ther thellialy oris threstpl/pifragmediaragmagoiby s agmexes, den
atss pititpenaraly d thiplio s ts gs, tis wily NU gmarags gos
aticel/EEECTherixed atstixedells, s s ores agolltixes tixe. TO: N
s, s, TOpedatssth NUCAPorag: puffrits, pillly ars agmen No tpix abe
aghe. aragmed ssh titixen plioix ag: Th tingoras TOD s wicipixe d
tpllifr.edarenexeramed Thecospix ts ts s osth s pes ovipingor
g: agors agass s TOnamand s aghech th wopipistalioiaris agontibuf
ally Thrixtply tiaceca th oul/EEEEEEEECPU), wicth NU athed wen
aragag athichipl Thechixthass s gmelliptilicex th ostunth gmagh
atictpixe. ar Th on wipixexepifrag gman g: sthabopl/te.
```
We see at first glance it looks a bit like English text, even with some quirks specific to this wiki, for example here and there having FULL CAPS words (due to acronyms and also rants that often appear here). It even generated the word "CPU". Notice the algorithm correctly learned punctuation, i.e. it knows that after commas and periods there is almost always space and after space there is usually not another space. For comparison here is a Spanish-like text generated from Don Quixote (with accents removed):
```
Diloma Dadro hacaci gua usta lesano strore sto do diaco; ma ro
hiciso stue ue dita. do que menotamalmeci ma quen do gue lo;
denestajo qucos rdo horor Da que qunca. quadombuce que queromiderbre
hera ha rlabue F de querdos Dio macino; dombidrompo mi ste derdiba
l, mbiolo Ferbes l ste s lolo que ha Du hano quenore Dio ueno que
hala F uano he Dorame de qus rl, ha didesa que halanora Fla quco
dil qucio ue do mestostaloste hados de gusta querana. stuce F s s
Do lo dre s hal Fro gue sa sa. la sido la dico; hado mbuno Do.
mororo; rdenaja. qunolole Diba. do. Fa gor stamestamo ha quno
unostabro quero mue s Diado Didota. quencoralor dio sotomo Fuen
que halora. gunore quabrbe rol gostuno hadolmbe Da que unendor
que le di so; qunta rajos s F de qucol
```
We see some shorter words like *lo*, *le*, *de*, *he*, *que* and *sido* are real Spanish words. Though punctuation is quite nice, the algorithm fails to learn that after period the word of the next sentence should start with a capital letter (it only does so sometimes by pure chance) -- this is due to the algorithm only seeing one character back; after a period there is also one space which already makes the algorithm forget about the period. This would be addressed by keeping longer history, as said above. Now let's try a difference kind of text altogether, let's try to feed in the source code of [Anarch](anarch.md):
```
2 camechererea = 20;
#erereppon.xereponioightFuaighe16_ARABEIUnst
chtreraySqua->rarepL_RCL_CL_PE;
caminsin.yDINeramaxer = costRCL_PERCL_ditsins->pL_ime1
= 0;
* = RCL_dime1,y 1)
0;
}
}
ck;
camererayDimameaxSqua ca = ca->ra caininin.xS_UAME;
caminstFua-> 0 0;
} ca->ponstramiomereaxSquts chts 154;
1)
```
Here it's pretty clear the code won't work but its structure really does resemble the original source: curly brackets and semicolons are correctly followed by newlines, assignments look pretty correct as well, dereference arrows (`->`) appear too -- the code even generated the `RCL_` prefix of the [raycastlib](raycastlib.md) functions that's widely seen in the original code.