Update

2024-02-10 12:19:55 +01:00 · 2024-02-10 12:19:55 +01:00 · be97db7fc4
commit be97db7fc4
parent 9d1876387b
24 changed files with 1754 additions and 1683 deletions
--- a/optimization.md
+++ b/optimization.md
@ -38,7 +38,7 @@ These are mainly for [C](c.md), but may be usable in other languages as well.
 - **[Single compilation unit](single_compilation_unit.md) (one big program without [linking](linking.md)) can help compiler optimize better** because it can see the whole code at once, not just its parts. It will also make your program compile faster.
 - Search literature for **algorithms with better [complexity class](complexity_class.md)** ([sorts](sorting.md) are a nice example).
 - For the sake of simple computers such as [embedded](embedded.md) platforms **avoid [floating point](floating_point.md)** as that is often painfully slowly emulated in software. Use [fixed point](fixed_point.md), or at least offer it as a [fallback](fallback.md). This also applies to other hardware requirements such as [GPU](gpu.md) or sound cards: while such hardware accelerates your program on computers that have the hardware, making use of it may lead to your program being slower on computers that lack it.
- **[Early branching](early_branching.md) can create a speed up** (instead of branching inside the loop create two versions of the loop and branch in front of them). This is a kind of space-time tradeoff.
+- **Factoring out invariants from loops and early branching can create a speed up**: it's sometimes possible to factor things out of loops (or even long non-looping code that just repeats some things), i.e. instead of branching inside the loop create two versions of the loop and branch in front of them. This is a kind of space-time tradeoff. Consider e.g. `while (a) if (b) func1(); else func2();` -- if *b* doesn't change inside the loop, you can rewrite this as `if (b) while (a) func1(); else while (a) func2();`. Or in `while (a) b += c * d;` if *c* and *d* don't change (are invariant), we can rewrite to `cd = c * d; while (a) b += cd;`. And so on.
 - **Division can be replaced by multiplication by [reciprocal](reciprocal.md)**, i.e. *x / y = x * 1/y*. The point is that multiplication is usually faster than division. This may not help us when performing a single division by variable value (as we still have to divide 1 by *y*) but it does help when we need to divide many numbers by the same variable number OR when we know the divisor at compile time; we save time by precomputing the reciprocal before a loop or at compile time. Of course this can also easily be done with [fixed point](fixed_point.md) and integers!
 - **Consider the difference between logical and bitwise operators!** For example [AND](and.md) and [OR](or.md) boolean functions in C have two variants, one bitwise (`&` and `|`) and one logical (`&&` and `||`) -- they behave a bit differently but sometimes you may have a choice which one to use, then consider this: bitwise operators usually translate to only a single fast (and small) instruction while the logical ones usually translate to a branch (i.e. multiple instructions with potentially slow jumps), however logical operators may be faster because they are evaluated as [short circuit](short_circuit_eval.md) (e.g. if first operand of OR is true, second operand is not evaluated at all) while bitwise operators will evaluate all operands.
 - **Consider the pros and cons of using indices vs pointers**: When working with arrays you usually have the choice of using either pointers or indices, each option has advantages and disadvantages; working with pointers may be faster and produce smaller code (fewer instructions), but array indices are portable, may be smaller and safer. E.g. imagine you store your game sprites as a continuous array of images in RAM and your program internally precomputes a table that says where each image starts -- here you can either use pointers (which say directly the memory address of each image) or indices (which say the offset from the start of the big image array): using indices may be better here as the table may potentially be smaller (an index into relatively small array doesn't have to be able to keep any possible memory address) and the table may even be stored to a file and just loaded next time (whereas pointers can't because on next run the memory addresses may be different), however you'll need a few extra instructions to access any image (adding the index to the array pointer), which will however most definitely be negligible.
@ -47,7 +47,7 @@ These are mainly for [C](c.md), but may be usable in other languages as well.
 - **What's fast on one platform may be slow on another**. This depends on the instruction set as well as on compiler, operating system, available hardware, [driver](driver.md) implementation and other details. In the end you always need to test on the specific platform to be sure about how fast it will run. A good approach is to optimize for the weakest platform you want to support -- if it runs fasts on a weak platform, a "better" platform will most likely still run it fast.
 - **Prefer preincrement over postincrement** (typically e.g. in a for loop), i.e. rather do `++i` than `i++` as the latter is a bit more complex and normally generates more instructions.
 - **Mental calculation tricks**, e.g. multiplying by one less or more than a power of two is equal to multiplying by power of two and subtracting/adding once, for example *x * 7 = x * 8 - x*; the latter may be faster as a multiplication by power of two (bit shift) and addition/subtraction may be faster than single multiplication, especially on some primitive platform without hardware multiplication. However this needs to be tested on the specific platform. Smart compilers perform these optimizations automatically, but not every compiler is high level and smart.
- **Use switch instead of if branches** -- it should be common knowledge but some newcomers may not know that switch is fundamentally different from if branches: switch statement generates a jump table that can branch into one of many case labels in constant time, as opposed to a series of if statements which keeps checking conditions one by one, however switch only supports conditions of exact comparison. So prefer using switch when you have many conditions to check. Switch also allows hacks such as label fall through which may help some optimizations.
+- **With more than two branches use switch instead of ifs** (if possible) -- it should be common knowledge but some newcomers may not know that switch is fundamentally different from if branches: switch statement generates a jump table that can branch into one of many case labels in constant time, as opposed to a series of if statements which keeps checking conditions one by one, however switch only supports conditions of exact comparison. So prefer using switch when you have many conditions to check (but know that switch can't always be used, e.g. for string comparisons). Switch also allows hacks such as label fall through which may help some optimizations.
 - **Else should be the less likely branch**, try to make if conditions so that the if branch is the one with higher probability of being executed -- this can help branch prediction.
 - Similarly **order if-sequences and switch cases from most probable**: If you have a sequences of ifs such as `if (x) ... else if (y) ... else if (z) ...`, make it so that the most likely condition to hold gets checked first, then second most likely etc. Compiler most likely can't know the probabilities of the conditions so it can't automatically help with this. Do the same with the `switch` statement -- even though switch typically gets compiled to a table of jump addresses, in which case order of the cases doesn't matter, it may also get compiled in a way similar to the if sequence (e.g. as part of size optimization if the cases are sparse) and then it may matter again.
 - **Variable aliasing**: If in a function you are often accessing a variable through some complex dereference of multiple pointers, it may help to rather load it to a local variable at the start of the function and then work with that variable, as dereferencing pointers costs something. { from *Game Programming Gurus* -drummyfish }
@ -71,7 +71,26 @@ Another kind of optimization done during development is just automatically writi

 ## Automatic Optimization

-TODO
+Automatic optimization is typically performed by the compiler; usually the programmer has the option to tell the compiler how much and in what way to optimize (no optimization, mild optimization, aggressive optimization, optimization for speed, size; check e.g. the man pages of [gcc](gcc.md) where you can see how to turn on even specific types of optimizations). Some compilers perform extremely complex reasoning to make the code more efficient, the whole area of optimization is a huge science -- here we'll only take a look at the very basic techniques. We see optimizations as transformations of the code that keep the semantics the same but minimize or maximize some measure (e.g. execution time, memory usage, power usage, network usage etc.). Automatic optimizations are usually performed on the intermediate representation (e.g. [bytecode](bytecode.md)) as that's the ideal way (we only write the optimizer once), however some may be specific to some concrete instruction set -- these are sometimes called *peephole* optimizations and have to be delayed until code generation.
+
+The following are some common methods of automatic optimization (also note that virtually any method from the above mentioned manual optimizations can be applied if only the compiler can detect the possibility of applying it):
+
+{ Tip: man pages of gcc or possibly other compilers detail specific optimizations they perform under the flags that turn them on, so see these man pages for a similar overview. ~drummyfish }
+
+- **Replacing instructions with faster equivalents**: we replace an instruction (or a series of instructions) with another one that does the same thing but faster (or with fewer instructions etc.). Typical example is replacing multiplication by power of two with a bit shift (e.g. `x * 8` is the same as `x << 3`).
+- **Inlining**: a function call may usually (not always though, consider e.g. [recursion](recursion.md)) be replaced by the function code itself inserted in the place of the call (so called inlining). This is faster but usually makes the code bigger so the compiler has to somehow judge and decide when it's worth to inline a function -- this may be affected e.g. by the function size (inlining a short function won't make the code that much bigger), programmer's hints (`inline` keyword, optimize for speed rather than size etc.) or guesstimating how often the function will be called. Function that is only called in one place can be safely inlined.
+- **Loop unrolling**: dupliates the body of a loop, making the code bigger but increasing its speed (a condition check is saved). E.g. `for (int i = 0; i < 3; ++i) func();` may be replaced with `func(); func(); func();`. Unrolling may be full or just partial.
+- **[Lazy](lazy_eval.md) evaluation/short circuit/test reordering**: the principles of lazy evaluation (evaluate function only when we actually need it) and short circuit evaluation (don't further evaluate functions when it's clear we won't need them) may be auto inserted into the code to make it more efficient. Test reordering may lead to first testing simpler things (e.g. equality comparison) and leaving complex tests (function calls etc.) for later.
+- **Algebraic laws, expression evaluation**: expressions may be partially preevaluated and manipulated to stay mathematically equivalent while becoming easier to evaluate, for example `1 + 3 + 5 * 3 * x / 6` may be transformed to just `4 + 5 * x / 2`.
+- **Removing instructions that cancel out**: for example in [Brainfuck](brainfuck.md) the series of instructions `+++--` may be shortened to just `+`.
+- **Removing instructions that do nothing**: generated code may contain instructions that just do nothing, e.g. NOPs that were used as placeholders that never got replaced; these can be just removed.
+- **Register allocation**: most frequently used variables should be kept in CPU registers for fastest access.
+- **Removing branches**: branches are often expensive due to not being CPU pipeline friendly, they can sometimes be replaced by a branch-free code, e.g. `if (a == b) c = 1; else c = 0;` can be replaced with `c = a == b;`.
+- **Memory alignment, reordering etc.**: data stored in memory may be reorganized for better efficiency, e.g. an often accessed array of bytes may actually be made into array of ints so that each item resides exactly on one address (which takes fewer instructions to access and is therefore faster). Data may also be reordered to be more [cache](cache.md) friendly.
+- **Generating [lookup tables](lut.md)**: if the optimizer judges some function to be critical in terms of speed, it may auto generate a lookup table for it, i.e. precompute its values and so sacrifice some memory for making it run extremely fast.
+- **Dead code removal**: parts of code that aren't used can be just removed, making the generated program smaller -- this includes e.g. functions that are present in a [library](library.md) which however aren't used by the specific program or blocks of code that become unreachable e.g. due to some `#define` that makes an if condition always false etc.
+- **[Compression](compression.md)**: compression methods may be applied to make data smaller and optimize for size (for the price of increased CPU usage).
+- ...

 ## See Also