Miloslav Ciz 2025-05-29 17:30:19 +02:00
parent 8b619fe2cc
commit cf7680ee94
18 changed files with 1999 additions and 1974 deletions

@@ -27,6 +27,7 @@ Techniques of netstalking include port scanning, randomly generating web domains
- **Guess randomly.** It can even be an entertaining pastime to play a lottery, randomly digging and seeing what you find. For example you can type random domains or IP addresses in your URL bar: `nigger.com`, `hitler.il`, `weirdporn.xyz` or whatever. One can even quite effortlessly bash together a script to automatically check millions of such domains (a minimal sketch of such a script is shown after this list). This has a chance of discovering something that would otherwise be unfindable because it's not linked from anywhere on the indexed web.
- **Manually search unindexable material.** A lot of information is out there but search engines don't know about it because it's not in plaintext format or because it's hiding behind a login or captcha wall or whatever. Plenty of stuff is hidden in scanned PDF books, videos, compressed archives, spoken audio etc. Hence when you're searching manually, try to go to places that search engines are less likely to reach.
- **Write own tools.** Today you no longer have to possess a [PhD](phd.md) (or even a brain) to write a simple web scraping script. Custom tools can take you beyond what search engines can (and are willing to) do for you -- for example search engines typically can't search for [regular expressions](regexp.md), but your own crawler can. Your own tool is 100% tailored to your needs, it can behave in exact ways you want (ignore robots.txt, use your credentials to bypass login walls, follow very specific trails, you can even use [OCR](ocr.md) to extract text from images etc.). As said above, a simple tool is for example one that randomly checks various combinations of words and TLDs to discover curious domain names. Writing a simple crawler is also pretty easy, provided you [keep it very simple](kiss.md) -- exploit existing tools like wget or curl to download pages and extract everything that looks like a URL, no need to parse [HTML](html.md) or whatever, literally treat everything as plain text (see the crawler sketch after this list). Then you can keep only documents that are somehow "[interesting](interesting.md)", for example those containing specific keywords, not containing JavaScript tags, only being hosted over plain [HTTP](http.md) etc.
- **Use existing crawlers and similar tools**: for example [YACY](yacy.md). It may not be an awesome search engine for daily use or an example of well written software, but it's a means to an end: discovering obscure stuff. And it does a great job at that. YACY is a crawler that takes a list of websites as a starting point and follows links according to rules you set, indexing everything it finds, without censorship, according to your personal preferences, ignoring robots.txt if you want etc. It creates visual maps and aggregates links leading from and to any website, and this is immensely helpful: it shows you every single link buried deep within a website, somewhere in the middle of a wall of text, that you would most likely never find manually. This really yields many great results.
- **Find lists of obscure sites and other people who search for them.** A sizable number of small sites now like to post links to other interesting sites; it's enough to find one, then you just start following the links, finding more links and so on -- this can never end. Some communities like to share lulzy links, e.g. [4chan](4chan.md), kiwifarms, ... Don't forget to contribute back and publish the list of your findings too ;)
- **Analyze data.** There are tons of publicly accessible but as yet undigested data about the web -- for example Internet Archive's crawl data, [WikiData](wikidata.md), the Yacy index and so on. You may try your luck sniffing around here.
- **Filtering**: today the problem of finding something of value has shifted from discovering paths to filtering out the countless surrounding [noise](noise.md). There is so much data that we get lost in it, so the focus moves to clever filtering. For example on YouTube all the weird, cool videos are still accessible, they're just buried: the algorithm never recommends them and the search never finds them. One way to get to quality videos is to search for older videos (`before:2015`) which also have subtitles -- subtitles are usually a sign of a high quality video, as no one bothers adding subtitles to crappy videos (a toy scoring sketch of this kind of filtering closes this list).
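
The "domain lottery" from the first item can look something like this minimal Python sketch -- the word list, the TLDs and the DNS-resolution check are just arbitrary example choices, not a definitive recipe:

```python
# Minimal sketch of the "domain lottery": try random word+TLD combinations
# and report those that resolve in DNS at all. The word list and TLDs are
# arbitrary examples, swap in whatever amuses you.
import random
import socket

WORDS = ["weird", "doom", "cyber", "basement", "occult", "fish"]
TLDS = [".com", ".net", ".org", ".xyz", ".cz"]

def random_domain():
    # glue one or two random words to a random TLD
    name = "".join(random.sample(WORDS, random.randint(1, 2)))
    return name + random.choice(TLDS)

for i in range(1000):
    domain = random_domain()
    try:
        socket.gethostbyname(domain)  # does the domain have a DNS record?
        print("hit:", domain)
    except OSError:
        pass                          # nothing there, try the next one
```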
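
And here is a sketch of the dead simple crawler described in the "write own tools" item: it treats every downloaded page as plain text, pulls out anything URL-shaped with a regular expression and applies a crude "interestingness" filter. The start URL, the keyword and the filtering rules are made-up placeholders to adapt:

```python
# Dead simple crawler sketch: no HTML parsing, everything is plain text.
# It deliberately doesn't look at robots.txt. The start URL, keyword and
# the "interesting" heuristic below are placeholders, adapt them to taste.
import re
import urllib.request

URL_RE = re.compile(r'https?://[^\s"\'<>)]+')
frontier = ["http://example.com/"]  # hypothetical starting point
seen = set()

while frontier and len(seen) < 1000:
    url = frontier.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        page = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "replace")
    except Exception:
        continue  # dead link, timeout, weird server, ... just move on
    # crude "interestingness" filter: has our keyword, carries no JavaScript
    if "demoscene" in page and "<script" not in page:
        print("interesting:", url)
    frontier.extend(URL_RE.findall(page))  # grab everything URL-shaped
```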
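
Finally, heuristic filtering of the kind mentioned in the last item can be as dumb as scoring items by a few quality signals and keeping the best ones -- the records and the weights here are invented purely for illustration:

```python
# Toy filtering sketch: score records by quality heuristics, best first.
# The data and the weights are invented just to illustrate the idea.
videos = [
    {"title": "cool obscure demo",     "year": 2009, "subtitles": True},
    {"title": "CLICKBAIT COMPILATION", "year": 2022, "subtitles": False},
    {"title": "old lecture",           "year": 2013, "subtitles": True},
]

def score(v):
    s = 0
    if v["year"] < 2015:   # older videos: less SEO-driven noise
        s += 2
    if v["subtitles"]:     # subtitles often signal a careful upload
        s += 1
    return s

for v in sorted(videos, key=score, reverse=True):
    print(score(v), v["title"])
```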