Miloslav Ciz 2025-03-29 16:59:03 +01:00
parent 3fe12a0939
commit 651f779374
25 changed files with 1986 additions and 1969 deletions

www.md

@@ -95,12 +95,12 @@ Other programming languages such as [PHP](php.md) can also be used on the web, b
### How To (Sc)rape And Hack The Web
-A great deal of information on the Internet is sadly presented via web pages in favor or normies and disfavor of [hackers](hacking.md) who would like to just download the info without having to do [clickity click on seizure inducing pictures](gui.md) while dodging jumpscare porn [ads](marketing.md). As hackers we aim to write scripts to rape the page and force it to give out its information without us having to suck its dick. With this we acquire the power to automatically archive data, [hoard](data_hoarding.md) it, analyze it, do some [netstalking](netstalking.md), discover hidden gems, make our own search engines, create [lulz](lulz.md) such as spambots etc. For doing just that consider the following tools:
+A great deal of information on the Internet is sadly presented via web pages in favor of [normies](npc.md) and disfavor of [hackers](hacking.md) who would indeed prefer to just download the data without having to do [clickity click on seizure inducing pictures](gui.md) while dodging jumpscare porn [ads](marketing.md). As hackers we aim to write scripts to rape the page and force it to give out its content without us having to do any dick sucking. With this we acquire the power to automatically archive data, [hoard](data_hoarding.md) it, analyze it, do some [netstalking](netstalking.md), discover hidden gems, make our own search engines, create [lulz](lulz.md) such as spambots etc. For doing just that consider the following tools:
- General [CLI](cli.md) downloaders like [wget](wget.md) and [curl](curl.md). You download the resource and then use normal Unix tools to process it further. Check out the man pages, there exist many options to get around annoying things such as redirects and weirdly formatted URLs.
- Text web browsers like [links](links.md), [lynx](lynx.md) and [w3m](w3m.md) -- these are excellent! Check out especially the `-dump` option. Not only do they handle all the crap like parsing faulty HTML and handling shitty [encryption](encryption.md) [bullshit](bullshit.md), they also nicely render the page as plain text (again allowing further use of standard Unix tools), allow easily filling out forms and all this kind of stuff.
- [Libraries](library.md) and scraping specific tools: there exist many, such as the BeautifulSoup [Python](python.md) library -- although these tools are oftentimes very ugly, you may just abuse them for a one time [throwaway script](throwaway_script.md).
-- Do it yourself: if a website is friendly (plain HTTP, no JavaShit, ...) and you just want to do something simple like extract all links, you may well just program your scraper from scratch let's say in [C](c.md), it won't the that hard.
+- [Do it yourself](diy.md): if a website is friendly (plain HTTP, no JavaShit, ...) and you just desire something simple like extracting all the links, you may as well just program your scraper from scratch let's say in [C](c.md), it won't be that hard, and it'll be [fun](fun.md).
- ...
## See Also