Update

2024-08-31 14:44:45 +02:00 · 2024-08-31 14:44:45 +02:00 · 3f374a4713
commit 3f374a4713
parent 124b9d1e7c
85 changed files with 2281 additions and 2272 deletions
--- a/optimization.md
+++ b/optimization.md
@ -1,19 +1,20 @@
 # Optimization

-Optimization means making a program more efficient in terms of consumption of some computing resource or by any similar metric, commonly aiming for greater execution speed or lower memory usage (but also e.g. lower power consumption, lower network usage etc.) while preserving how the program functions externally; this can be done manually (by rewriting parts of your program) or automatically (typically by [compiler](compiler.md) when it's translating your program). Unlike [refactoring](refactoring.md), which aims primarily for a better readability of source code, optimization changes the inner behavior of the executed program to a more optimal one. Apart from optimizing programs/[algorithms](algorithm.md) we may also more widely talk about optimizing e.g. [data structures](data_structure.md), file formats, [hardware](hardware.md), [protocol](protocol.md) and so on.
+Optimization means making a program more efficient in terms of some computing resource usage or by any similar metric, commonly aiming for higher execution speed or lower memory usage (but also e.g. lower power consumption, lower [network](network.md) speed demand etc.) while preserving how the program functions externally; this can be done manually (by rewriting parts of your program) or automatically (typically by [compiler](compiler.md) when it's translating your program). Unlike [refactoring](refactoring.md), which aims primarily for a better readability of source code, optimization changes the inner behavior of the executed program to a more optimal one. Apart from optimizing programs/[algorithms](algorithm.md) we may also more widely talk about optimizing e.g. [data structures](data_structure.md), file formats, [hardware](hardware.md), [protocol](protocol.md) and so on.

 ## Manual Optimization

-These are optimizations you do yourself by writing better code.
+These are optimizations you do yourself by writing better code or fiddling with how you compile your code.

 ### General Tips'N'Tricks

 These are mainly for [C](c.md), but may be usable in other languages as well.

- **Tell your compiler to actually optimize** (`-O3`, `-Os` flags etc.). Also check out further compiler flags that may help you turn off unnecessary things you don't need, AND try out different compilers, some may just produce better code. If you are brave also check even more aggressive flags like `-Ofast` and `-Oz`, which may be even faster than `-03`, but may break your program too.
- **[gprof](gprof.md) is a utility you can use to profile your code**.
+- **Tell your compiler to actually auto optimize** (`-O3`, `-Os` flags etc.). Also check out further compiler flags that may help you turn off unnecessary things you don't need, AND try out different compilers, some may just produce better code. If you are brave also check even more aggressive flags like `-Ofast` and `-Oz`, which may be even faster than `-03`, but may break your program too.
+- **Watch out: what's fast on one platform may be slow on another -- know your platform and compiler**. This depends on the [instruction set](isa.md) as well as on compiler, operating system, library implementation, available hardware, [driver](driver.md) implementation and other details. In the end you always need to test on the specific platform to be sure about how fast it will run. For example with simple compilers that don't do much auto optimizations you may want to do clever tricks to optimize manually, which may however in turn confuse smarter compilers that optimize well but rely on idiomatic code, i.e. optimizing something for one platform may it slower on another platform. A good approach is probably to optimize for the weakest platform you want to support -- if it runs fasts on a weak platform, a "better" platform will most likely still run it fast (even if not optimally).
+- **[gprof](gprof.md) is a utility you can use to profile your code**. You can also program your own [profiling](profiling.md) but be careful, it's not trivial to do it well.
 - **`<stdint.h>` has fast type nicknames**, types such as `uint_fast32_t` which picks the fastest type of at least given width on given platform.
- **Actually measure the performance** to see if your optimizations work or not. Sometimes things behave counterintuitively and you end up making your program perform worse by trying to optimize it! Also make sure that you MEASURE THE PERFORMANCE CORRECTLY, many beginners for example just try to measure run time of a single simple function call which doesn't really work, you want to try to measure something like a million of such function calls in a loop and then average the time.
+- **Actually measure the performance** to see if your optimizations work or not. Sometimes things behave counterintuitively and you end up making your program perform worse by trying to optimize it! Also make sure that you MEASURE THE PERFORMANCE CORRECTLY, many beginners for example just try to measure run time of a single simple function call which doesn't really work, you want to try to measure something like a million of such function calls in a loop and then average the time; also make sure the compiler doesn't auto remove your code if it happens to have no effect.
 - **Keywords such as `inline`, `static`, `const` and `register` can help compiler optimize well**.
 - **Optimize the [bottlenecks](bottleneck.md)!** Optimizing in the wrong place is a complete waste of time. If you're optimizing a part of code that's taking 1% of your program's run time, you will never speed up your program by more than that 1% even if you speed up the specific part by 10000%. Bottlenecks are usually inner-most loops of the main program loop, you can identify them with [profiling](profiling.md). A typical bottleneck code is for example a [shader](shader.md) that processes millions of pixels per second. Generally initialization code that runs only once in a long time doesn't need much optimization -- no one is going to care if a program starts up 1 millisecond faster (but of course in special cases such as launching many processes this may start to matter).
 - **You can almost always trade space (memory usage) for time (CPU demand) and vice versa** and you can also fine-tune this. You typically gain speed by [precomputation](precomputation.md) ([look up tables](lut.md), more demanding on memory) and memory with [compression](compression.md) (more demanding on CPU).
@ -29,8 +30,8 @@ These are mainly for [C](c.md), but may be usable in other languages as well.
 - **Use powers of 2** (1, 2, 4, 8, 16, 32, ...) whenever possible, this is efficient thanks to computers working in [binary](binary.md). Not only may this help nice utilization and alignment of memory, but mainly multiplication and division can be optimized by the compiler to mere bit shifts which is a tremendous speedup.
 - **Memory alignment usually helps speed**, i.e. variables at "nice addresses" (usually multiples of the platform's native integer size) are faster to access, but this may cost some memory (the gaps between aligned data).
 - **Write [cache-friendly](cache-friendly.md) code** (minimize long jumps in memory).
- **Compare to [0](zero.md) rather than other values**. There's usually an instruction that just checks the zero flag which is faster than loading and comparing two arbitrary numbers.
- **Use [bit tricks](bit_hack.md)**, hacks for manipulating binary numbers in clever ways only using very basic operations without which one might naively write complex inefficient code with loops and branches. Example of a simple bit trick is checking if a number is power of two as `!(x & (x - 1)) && x`.
+- **Compare to [zero](zero.md) rather than other values**. There's usually an instruction that just checks the zero flag which is faster than loading and comparing two arbitrary numbers. For example in for loops where order of iteration doesn't matter you may count down rather than up and compare if you're at zero. { E.g. under [tcc](tcc.md) I measured `for (int i = 1000; i; --i)` to save a bit of time compared to `for (int i = 0; i < 1000; ++i)`. ~drummyfish }
+- **Consider [bit tricks](bit_hack.md)** (but be aware that [idiomatic](idiomatic.md) code may be better for advanced compilers), hacks for manipulating binary numbers in clever ways only using very basic operations without which one might naively write complex inefficient code with loops and branches. Example of a simple bit trick is checking if a number is power of two as `!(x & (x - 1)) && x`.
 - **Consider moving computation from run time to compile time**, see [preprocessor](preprocessor.md), [macros](macro.md) and [metaprogramming](metaprogramming.md). E.g. if you make a resolution of your game constant (as opposed to a variable), the compiler will be able to partially precompute expressions with the display dimensions and so speed up your program (but you won't be able to dynamically change resolution).
 - On some platforms such as [ARM](arm.md) the first **arguments to a function may be passed via registers**, so it may be better to have fewer parameters in functions.
 - **Passing arguments costs something**: passing a value to a function requires a push onto the stack and later its pop, so minimizing the number of parameters a function has, using global variables to pass arguments and doing things like passing structs by pointers rather than by value can help speed. { from *Game Programming Gurus* -drummyfish }
@ -38,14 +39,13 @@ These are mainly for [C](c.md), but may be usable in other languages as well.
 - **Use your own [caches](cache.md) where they help**, for example if you're frequently working with some database item you better pull it to memory and work with it there, then write it back once you're done (as opposed to communicating with the DB there and back).
 - **[Single compilation unit](single_compilation_unit.md) (one big program without [linking](linking.md)) can help compiler optimize better** because it can see the whole code at once, not just its parts. It will also make your program compile faster.
 - Search literature for **algorithms with better [complexity class](complexity_class.md)** ([sorts](sorting.md) are a nice example).
- For the sake of simple computers such as [embedded](embedded.md) platforms **avoid [floating point](floating_point.md)** as that is often painfully slowly emulated in software. Use [fixed point](fixed_point.md), or at least offer it as a [fallback](fallback.md). This also applies to other hardware requirements such as [GPU](gpu.md) or sound cards: while such hardware accelerates your program on computers that have the hardware, making use of it may lead to your program being slower on computers that lack it.
+- For the sake of simple computers such as [embedded](embedded.md) platforms **avoid [floating point](floating_point.md)** as that is often painfully slowly emulated in software (and also inserts additional code, making the executable bigger). Use [fixed point](fixed_point.md), or at least offer it as a [fallback](fallback.md). This also applies to other hardware requirements such as [GPU](gpu.md) or sound cards: while such hardware accelerates your program on computers that have the hardware, making use of it may lead to your program being slower on computers that lack it.
 - **Factoring out invariants from loops and early branching can create a speed up**: it's sometimes possible to factor things out of loops (or even long non-looping code that just repeats some things), i.e. instead of branching inside the loop create two versions of the loop and branch in front of them. This is a kind of space-time tradeoff. Consider e.g. `while (a) if (b) func1(); else func2();` -- if *b* doesn't change inside the loop, you can rewrite this as `if (b) while (a) func1(); else while (a) func2();`. Or in `while (a) b += c * d;` if *c* and *d* don't change (are invariant), we can rewrite to `cd = c * d; while (a) b += cd;`. And so on.
 - **Division can be replaced by multiplication by [reciprocal](reciprocal.md)**, i.e. *x / y = x * 1/y*. The point is that multiplication is usually faster than division. This may not help us when performing a single division by variable value (as we still have to divide 1 by *y*) but it does help when we need to divide many numbers by the same variable number OR when we know the divisor at compile time; we save time by precomputing the reciprocal before a loop or at compile time. Of course this can also easily be done with [fixed point](fixed_point.md) and integers!
 - **Consider the difference between logical and bitwise operators!** For example [AND](and.md) and [OR](or.md) boolean functions in C have two variants, one bitwise (`&` and `|`) and one logical (`&&` and `||`) -- they behave a bit differently but sometimes you may have a choice which one to use, then consider this: bitwise operators usually translate to only a single fast (and small) instruction while the logical ones usually translate to a branch (i.e. multiple instructions with potentially slow jumps), however logical operators may be faster because they are evaluated as [short circuit](short_circuit_eval.md) (e.g. if first operand of OR is true, second operand is not evaluated at all) while bitwise operators will evaluate all operands.
 - **Consider the pros and cons of using indices vs pointers**: When working with arrays you usually have the choice of using either pointers or indices, each option has advantages and disadvantages; working with pointers may be faster and produce smaller code (fewer instructions), but array indices are portable, may be smaller and safer. E.g. imagine you store your game sprites as a continuous array of images in RAM and your program internally precomputes a table that says where each image starts -- here you can either use pointers (which say directly the memory address of each image) or indices (which say the offset from the start of the big image array): using indices may be better here as the table may potentially be smaller (an index into relatively small array doesn't have to be able to keep any possible memory address) and the table may even be stored to a file and just loaded next time (whereas pointers can't because on next run the memory addresses may be different), however you'll need a few extra instructions to access any image (adding the index to the array pointer), which will however most definitely be negligible.
 - **Reuse variables to save space**. A warning about this one: readability may suffer, mainstreamers will tell you you're going against "good practice", and some compilers may do this automatically anyway. Be sure to at least make this clear in your comments. Anyway, on a lower level and/or with dumber compilers you can just reuse variables that you used for something else rather than creating a new variable that takes additional RAM; of course a prerequisite for "merging" variables is that the variables aren't used at the same time.
- **To save memory use [compression](compression.md) techniques.** Compression doesn't always have to mean you use a typical compression algorithm such as [jpeg](jpg.md) or [LZ77](lz77.md), you may simply just throw in a few compression techniques such as [run length](run_length.md) or word dictionaries into your data structures. E.g. in [Anarch](anarch.md) maps are kept small by consisting of a small dictionary of tile definitions and map cells referring to this dictionary (which makes the cells much smaller than if each one held a complete tile definition).
- **What's fast on one platform may be slow on another**. This depends on the instruction set as well as on compiler, operating system, available hardware, [driver](driver.md) implementation and other details. In the end you always need to test on the specific platform to be sure about how fast it will run. A good approach is to optimize for the weakest platform you want to support -- if it runs fasts on a weak platform, a "better" platform will most likely still run it fast.
+- **To save memory use [compression](compression.md) techniques.** (Needless to say this will however slow down the code a bit, we're trading space for time here.) Compression doesn't always have to mean you use a typical compression algorithm such as [jpeg](jpg.md) or [LZ77](lz77.md), you may simply just throw in a few compression techniques such as [run length](run_length.md) or word dictionaries into your data structures. E.g. in [Anarch](anarch.md) maps are kept small by consisting of a small dictionary of tile definitions and map cells referring to this dictionary (which makes the cells much smaller than if each one held a complete tile definition).
 - **Prefer preincrement over postincrement** (typically e.g. in a for loop), i.e. rather do `++i` than `i++` as the latter is a bit more complex and normally generates more instructions.
 - **Mental calculation tricks**, e.g. multiplying by one less or more than a power of two is equal to multiplying by power of two and subtracting/adding once, for example *x * 7 = x * 8 - x*; the latter may be faster as a multiplication by power of two (bit shift) and addition/subtraction may be faster than single multiplication, especially on some primitive platform without hardware multiplication. However this needs to be tested on the specific platform. Smart compilers perform these optimizations automatically, but not every compiler is high level and smart.
 - **With more than two branches use switch instead of ifs** (if possible) -- it should be common knowledge but some newcomers may not know that switch is fundamentally different from if branches: switch statement generates a jump table that can branch into one of many case labels in constant time, as opposed to a series of if statements which keeps checking conditions one by one, however switch only supports conditions of exact comparison. So prefer using switch when you have many conditions to check (but know that switch can't always be used, e.g. for string comparisons). Switch also allows hacks such as label fall through which may help some optimizations.
@ -60,7 +60,8 @@ These are mainly for [C](c.md), but may be usable in other languages as well.
 - **Optimizing [data](data.md)**: it's important to remember we can optimize both algorithm AND data, for example in a 3D game we may simplify our 3D models, remove parts of a level that will never be seen etc.
 - **Specialized hardware (e.g. a [GPU](gpu.md)) astronomically accelerates programs**, but as with the previous point, portablity and simplicity greatly suffers, your program becomes bloated and gains dependencies, always consider using specialized hardware and offer software fallbacks.
 - **Smaller code may also be faster** as it allows to fit more instructions into [cache](cache.md).
- Do not optimize everything and for any cost: optimization often makes the code more cryptic, it may [bloat](bloat.md) it, bring in more bugs etc. Only optimize if it is worth the prize. { from *Game Programming Gurus* -drummyfish }
+- Do not optimize everything and for any cost: optimization often makes the code more cryptic, it may [bloat](bloat.md) it, bring in more bugs etc. Only optimize if it is worth the reward. { from *Game Programming Gurus* -drummyfish }
+- ...

 ### When To Actually Optimize?