Update

parent d2cc338141, commit 8604bfc7c0
33 changed files with 1866 additions and 1824 deletions

www.md (10)

@@ -90,6 +90,16 @@ When a user enters a URL of a page into the browser, the following happens (it's

[Cookies](cookie.md), small files that sites can store in the user's browser, are used on the web to implement stateful behavior (e.g. remembering if the user is signed in on a forum). However, cookies can also be abused for tracking users, so they can be turned off.
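
To illustrate the mechanism, this is roughly what the exchange looks like on the HTTP level (a simplified sketch, the cookie name and value here are made up):

```
HTTP/1.1 200 OK               <-- server's first response to the browser
Set-Cookie: session=abc123    <-- asks the browser to remember this value

GET /forum HTTP/1.1           <-- the browser's next request to the same site
Cookie: session=abc123        <-- the value is sent back, so the server recognizes the user
```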
Other programming languages such as [PHP](php.md) can also be used on the web, but they are used for server-side programming, i.e. they don't run in the web browser but on the server and somehow generate and modify the sites for each request specifically. This makes it possible to create dynamic pages such as [search engines](search_engine.md) or [social networks](social_network.md).
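
For example the server side program can be something as simple as the following sketch in C using the oldschool CGI interface (this assumes a web server set up to execute CGI programs; the details are only illustrative, but conceptually this per-request generation of HTML is what PHP and similar languages do too):

```
/* a tiny "dynamic page" as a CGI program: the web server runs this for
   every request and sends whatever it prints back to the browser */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  char *query = getenv("QUERY_STRING"); /* part of the URL after '?', e.g. "q=cats" */

  puts("Content-Type: text/html");      /* HTTP header... */
  puts("");                             /* ...then a blank line ends the headers */
  puts("<html><body>");

  if (query && query[0] != 0)
    printf("<p>your query was: %s</p>\n", query); /* a real script must sanitize this! */
  else
    puts("<p>no query given</p>");

  puts("</body></html>");
  return 0;
}
```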
### How To (Sc)rape And Hack The Web
A great deal of information on the Internet is sadly presented via web pages in favor of normies and disfavor of [hackers](hacking.md) who would like to just download the info without having to do [clickity click on seizure inducing pictures](gui.md) while dodging jumpscare porn [ads](marketing.md). As hackers we aim to write scripts to rape the page and force it to give out its information without us having to suck its dick. With this we acquire the power to automatically archive data, [hoard](data_hoarding.md) it, analyze it, do some [netstalking](netstalking.md), discover hidden gems, make our own search engines, create [lulz](lulz.md) such as spambots etc. For doing just that consider the following tools:
- General [CLI](cli.md) downloaders like [wget](wget.md) and [curl](curl.md). You download the resource and then use normal Unix tools to process it further. Check out the man pages; there exist many options to get around annoying things such as redirects and weirdly formatted URLs.
- Text web browsers like [links](links.md), [lynx](lynx.md) and [w3m](w3m.md) -- these are excellent! Check out especially the `-dump` option. Not only do they handle all the crap like parsing faulty HTML and handling shitty [encryption](encryption.md) [bullshit](bullshit.md), they also nicely render the page as plain text (again allowing further use of standard Unix tools), allow easily filling out forms and all this kind of stuff.
- [Libraries](library.md) and scraping specific tools: there exist many, such as the BeautifulSoup [Python](python.md) library -- although these tools are oftentimes very ugly, you may just abuse them for a one time [throwaway script](throwaway_script.md).
- Do it yourself: if a website is friendly (plain HTTP, no JavaShit, ...) and you just want to do something simple like extract all links, you may well just program your scraper from scratch, let's say in [C](c.md), it won't be that hard (see the sketch below this list).
- ...
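
To demonstrate the do-it-yourself approach from the list above, here is a sketch of a primitive link extractor in C: it reads HTML on standard input (e.g. piped from curl or wget) and prints everything that looks like a link target. It is naive -- it ignores single quoted attributes, links split over multiple lines etc. -- but for a quick throwaway scrape this kind of thing is often enough (the program name in the usage comment is of course just an example):

```
/* naive link extractor: prints the targets of href="..." attributes,
   usage e.g.: curl -s "https://example.com" | ./extractlinks */
#include <stdio.h>
#include <string.h>

int main(void)
{
  char line[8192];

  while (fgets(line, sizeof(line), stdin))
  {
    char *s = line;

    while ((s = strstr(s, "href=\"")) != NULL)
    {
      char *start = s + 6;             /* character right after href=" */
      char *end = strchr(start, '"');  /* closing quote */

      if (!end)
        break;                         /* attribute continues on the next line, skip it */

      *end = 0;
      puts(start);                     /* print the link target */
      s = end + 1;
    }
  }

  return 0;
}
```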
## See Also