Update

2025-05-08 20:41:37 +02:00
commit 64fd120266
parent 8b530b5952
35 changed files with 2034 additions and 2007 deletions
--- a/data_hoarding.md
+++ b/data_hoarding.md
@@ -16,7 +16,7 @@ Here let be an advice to the good data hoarder.
 - **Use appropriate formats and quality**: if the value of the data is text, save it as txt (even if you found it in pdf), if it's a black and white scan, save it as black and white image (no need for RGB, sometimes not even shades of gray, 1bit is enough), if it's a diagram, find vector version of it and save that, if it's a meme whose entertaining value will be preserved even at half resolution, don't save it at 1080K, save it at lowest acceptable quality etc. Some types of images, such as big pixel art or bitmap diagrams, are best saved by converting them to indexed mode with let's say 32 colors and then saving then as PNG (it often beats even lossy JPEG). Use simple, common file formats that can be handled by free software or custom written tools, do NOT use proprietary formats or formats that are extremely complicated if you can at all avoid it. Go to great lengths to extract valuable data out of shitty formats: for example if you find a vlog video whose main value is in what's being said and not the video itself, rather find and store the video text transcript ([youtube](youtube.md) has automatic or even manual transcripts for almost all videos, they can be downloaded) than the video itself (it takes much less space, can be searched, indexed, printed and backed up on paper, ...), or, as the next best thing, extract only audio and compress that so that it's just barely understandable (convert to mono, 8bit 8 KHz, store as OGG with very low bitrate).
 - **Careful with that [compression](compression.md), Eugene**: compression can be good but again, only use it when appropriate, in most cases compression will be achieved just by saving the data in good format (and such compression will generally be even better than general purpose compression). General purpose compression (zip etc.) brings in trouble, for example it makes the data more prone to corruption (removes redundancy, increases entropy), it adds a dependency on the decompression program, it makes the files harder to inspect etc. Use it only on very large files that will get reduced a lot, for example some extremely huge dump of text data will likely benefit from being zipped.
 - **See how to do [backups](backup.md) well** and stick to that.
- **Use and make tools, automatize**. For example if you're downloading a lot of Wikipedia articles, make a simple script that will extract just the article text, throwing away the unnecessary sidebar, script and styles. Minify all websites you download, remove image tags if you're not saving images etc. Make converting images quicker and simpler e.g. with some ImageMagick scripts. Similarly use ffmpeg to tame your videos. There already exist many web scrapers and format converters and a lot can be achieved with the basic Unix tools, just look stuff up.
+- **Use and make tools, automate**. For example if you're downloading a lot of Wikipedia articles, make a simple script that will extract just the article text, throwing away the unnecessary sidebar, script and styles. Minify all websites you download, remove image tags if you're not saving images etc. Make converting images quicker and simpler e.g. with some ImageMagick scripts. Similarly use ffmpeg to tame your videos. There already exist many web scrapers and format converters and a lot can be achieved with the basic Unix tools, just look stuff up.
 - **Organization may be good**: primarily try to name the files well, only use alphanumeric characters and underscore, limit the filename length and adopt some general naming rules (it may be cool if the filename contains some simple [hash](hash.md) of the file itself so that adding a file with an already existing name won't overwrite the previous file). This will help preserve correct names when copying between different systems, and it will make searching more comfortable too. Some general directory structure may be cool, for example separating free and proprietary data will allow you to easily upload the free part anywhere on the Internet and so partially back it up, whereas with proprietary data you might get in trouble. Do not overdo organization though, that may lead to huge mess (see Wikimedia Commons category hell), obsessions and wasting time, even complicating the search for something -- [keep it simple](kiss.md). Put some thought into WHY you're organizing the files that way, don't just do it because it "looks nice", just use your fucking brain.
 - **NEVER [ENCRYPT](encryption.md)** for fucks sake, encryption is [shit](shit.md), you might as well smear your smelly diarrhea over it.
 - ...
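
The "Use and make tools" item suggests a small script that keeps only the article text of a downloaded page. A minimal sketch of that idea follows, using only the Python standard library. The names `TextExtractor` and `html_to_text` are made up for illustration, and the set of skipped tags is an assumption: real Wikipedia pages mark their sidebars in other ways too, so expect to tune it.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, dropping scripts, styles and navigation blocks."""

    # Tags whose whole content is thrown away (an assumed, tunable list).
    SKIP = {"script", "style", "nav", "aside", "footer", "noscript"}
    # Tags that should start a new line in the plain-text output.
    BLOCK = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK and not self._skip_depth:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in self.BLOCK and not self._skip_depth:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._parts.append(data)

    def text(self):
        # Normalize whitespace inside each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Feeding it a saved page as a string and writing the result to a `.txt` file gives something that can be searched with plain Unix tools. Images, tables and footnotes are lost, which is the point of the exercise but worth checking per source.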
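
The naming advice in the "Organization" item (alphanumeric characters and underscore only, limited length, a simple hash of the file in the name) can be sketched roughly as below. The function name `hoard_name`, the 40 character limit and the choice of 8 hex digits of SHA-256 are all assumptions made for the example, not something the text prescribes. The extension is kept after a single dot.

```python
import hashlib
import re


def hoard_name(original_name: str, data: bytes, max_stem: int = 40) -> str:
    """Build a portable filename: sanitized stem, short content hash, extension."""
    stem, dot, ext = original_name.rpartition(".")
    if not dot:  # no extension present
        stem, ext = original_name, ""
    # Replace every run of non-alphanumeric characters with one underscore.
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_").lower()
    stem = stem[:max_stem].strip("_") or "file"
    ext = re.sub(r"[^A-Za-z0-9]", "", ext).lower()
    # Short hash of the content, so different files never share a name
    # just because their original names matched (8 digits is arbitrary).
    digest = hashlib.sha256(data).hexdigest()[:8]
    name = f"{stem}_{digest}"
    return f"{name}.{ext}" if ext else name
```

For example, `hoard_name("My Cool Meme (final).PNG", b"abc")` returns `my_cool_meme_final_ba7816bf.png`. Two files with the same original name but different content get different names, while an exact duplicate maps to the same name and can simply be skipped.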