Optimization
Optimization means making a program more efficient in terms of consumption of some computing resource or by any similar metric, commonly aiming for greater execution speed or lower memory usage (but also e.g. lower power consumption, lower network usage etc.) while preserving how the program functions externally. Unlike refactoring, which aims primarily for better readability of source code, optimization changes the inner behavior of the executed program to a more efficient one.
General Tips'N'Tricks
These are mainly for C, but may be usable in other languages as well.
- Tell your compiler to actually optimize (-O3, -Os flags etc.).
- gprof is a utility you can use to profile your code.
- <stdint.h> has fast type nicknames, types such as uint_fast32_t which picks the fastest type of at least given width on given platform.
- Keywords such as inline, static and const can help compiler optimize well.
- Optimize the bottlenecks! Optimizing in the wrong place is a complete waste of time. If you're optimizing a part of code that's taking 1% of your program's run time, you will never speed up your program by more than that 1% even if you speed up the specific part by 10000%. Bottlenecks are usually inner-most loops of the main program loop, you can identify them with profiling. Generally initialization code that runs only once in a long time doesn't need much optimization -- no one is going to care if a program starts up 1 millisecond faster (but of course in special cases such as launching many processes this may start to matter).
- You can almost always trade space (memory usage) for time (CPU demand) and vice versa and you can also fine-tune this. You typically gain speed by precomputation (look up tables, more demanding on memory) and memory with compression (more demanding on CPU).
- Static things are faster and smaller than dynamic things. This means that things that are somehow fixed/unchangeable are better in terms of performance (and usually also safer and better testable) than things that are allowed to change during run time -- for example calling a function directly (e.g. myVar = myFunc();) is both faster and requires fewer instructions than calling a function by pointer (e.g. myVar = myFuncPointer();): the latter is more flexible but for the price of performance, so if you don't need flexibility (dynamic behavior), use static behavior. This also applies to using constants (faster/smaller) vs variables, static vs dynamic typing, normal vs dynamic arrays etc.
- Be smart, use math. Example: let's say you want to compute the radius of a zero-centered bounding sphere of an N-point point cloud. Naively you might be computing the Euclidean distance (sqrt(x^2 + y^2 + z^2)) to each point and taking a maximum of them, however you can just find the maximum of squared distances (x^2 + y^2 + z^2) and return a square root of that maximum. This saves you a computation of N - 1 square roots.
- Learn about dynamic programming.
- Avoid branches (ifs) if you can (remember ternary operators, loop conditions etc. are branches as well). They break prediction in CPU pipelines and instruction preloading and are often source of great performance losses. Don't forget that you can many times compare and use the result of operations without using any branching (e.g. x = (y == 5) + 1; instead of x = (y == 5) ? 2 : 1;).
- Use iteration instead of recursion if possible (calling a function costs something).
- You can use good-enough approximations instead of completely accurate calculations, e.g. taxicab distance instead of Euclidean distance, and gain speed or memory without trading one for the other (you pay with accuracy instead). Nice examples can be found in computer graphics, e.g. some software renderers use perspective-correct texturing only for large near triangles and cheaper affine texturing for other triangles, which mostly looks OK.
- Use quick opt-out conditions: many times before performing some expensive calculation you can quickly check whether it's even worth performing it and potentially skip it. For example in physics collision detections you may first quickly check whether the bounding spheres of the bodies collide before running an expensive precise collision detection -- if bounding spheres of objects don't collide, it is not possible for the bodies to collide and so we can skip further collision detection.
- Operations on static data can be accelerated with accelerating structures (look-up tables for functions, indices for database lookups, spatial grids for collision checking, various trees ...).
- Use powers of 2 (1, 2, 4, 8, 16, 32, ...) whenever possible, this is efficient thanks to computers working in binary. Not only may this help nice utilization and alignment of memory, but mainly multiplication and division can be optimized by the compiler to mere bit shifts which is a tremendous speedup.
- Write cache-friendly code (minimize long jumps in memory).
- Compare to 0 rather than other values. There's usually an instruction that just checks the zero flag which is faster than loading and comparing two arbitrary numbers.
- Use bit tricks, hacks for manipulating binary numbers in clever ways only using very basic operations without which one might naively write complex inefficient code with loops and branches. Example of a simple bit trick is checking if a number is power of two as !(x & (x - 1)) && x.
- Consider moving computation from run time to compile time. E.g. if you make a resolution of your game constant (as opposed to a variable), the compiler will be able to partially precompute expressions with the display dimensions and so speed up your program (but you won't be able to dynamically change resolution).
- On some platforms such as ARM the first arguments to a function may be passed via registers, so it may be better to have fewer parameters in functions.
- Optimize when you already have working code. As Donald Knuth put it: "premature optimization is the root of all evil". Nevertheless you should get used to simple no-brainer efficient patterns by default and just write them automatically.
- Use your own caches where they help, for example if you're frequently working with some database item you better pull it to memory and work with it there, then write it back once you're done (as opposed to communicating with the DB there and back).
- Single compilation unit (one big program without linking) can help compiler optimize better because it can see the whole code at once, not just its parts. It will usually also make a full build of your program faster.
- Search literature for algorithms with better complexity class (sorts are a nice example).
- For the sake of simple computers such as embedded platforms avoid floating point as that is often painfully slowly emulated in software. Use fixed point, or at least offer it as a fallback. This also applies to other hardware requirements such as GPU or sound cards: while such hardware accelerates your program on computers that have the hardware, making use of it may lead to your program being slower on computers that lack it.
- Early branching can create a speed up (instead of branching inside the loop create two versions of the loop and branch in front of them). This is a kind of space-time tradeoff.
- Division can be replaced by multiplication by reciprocal, i.e. x / y = x * 1/y. The point is that multiplication is usually faster than division. This may not help us when performing a single division by variable value (as we still have to divide 1 by y) but it does help when we need to divide many numbers by the same variable number OR when we know the divisor at compile time; we save time by precomputing the reciprocal before a loop or at compile time. Of course this can also easily be done with fixed point and integers!
- Reuse variables to save space. A warning about this one: readability may suffer, mainstreamers will tell you you're going against "good practice", and some compilers may do this automatically anyway. Be sure to at least make this clear in your comments. Anyway, on a lower level and/or with dumber compilers you can just reuse variables that you used for something else rather than creating a new variable that takes additional RAM; of course a prerequisite for "merging" variables is that the variables aren't used at the same time.
- What's fast on one platform may be slow on another. This depends on the instruction set as well as on compiler, operating system, available hardware, driver implementation and other details. In the end you always need to test on the specific platform to be sure about how fast it will run. A good approach is to optimize for the weakest platform you want to support -- if it runs fast on a weak platform, a "better" platform will most likely still run it fast.
- Mental calculation tricks, e.g. multiplying by one less or more than a power of two is equal to multiplying by power of two and subtracting/adding once, for example x * 7 = x * 8 - x; the latter may be faster as a multiplication by power of two (bit shift) and addition/subtraction may be faster than single multiplication, especially on some primitive platform without hardware multiplication. However this needs to be tested on the specific platform. Smart compilers perform these optimizations automatically, but not every compiler is high level and smart.
- Else should be the less likely branch, try to make if conditions so that the if branch is the one with higher probability of being executed -- this can help branch prediction.
- Similarly order if-sequences and switch cases from most probable: If you have a sequence of ifs such as if (x) ... else if (y) ... else if (z) ..., make it so that the most likely condition to hold gets checked first, then second most likely etc. Compiler most likely can't know the probabilities of the conditions so it can't automatically help with this. Do the same with the switch statement -- even though switch typically gets compiled to a table of jump addresses, in which case order of the cases doesn't matter, it may also get compiled in a way similar to the if sequence (e.g. as part of size optimization if the cases are sparse) and then it may matter again.
- You can save space by "squeezing" variables -- this is a space-time tradeoff, it's a no brainer but nubs may be unaware of it -- for example you may store 2 4bit values in a single char variable (8bit data type), one in the lower 4bits, one in the higher 4bits (use bit shifts etc.). So instead of 16 memory-aligned booleans you may create one int and use its individual bits for each boolean value. This is useful in environments with extremely limited RAM such as 8bit Arduinos.
- You can optimize critical parts of code in assembly, i.e. manually write in assembly the code that takes most of the running time of the program, with as few and as inexpensive instructions as possible (but beware, popular compilers are very smart and it's often hard to beat them). But note that such code loses portability! So ALWAYS have a C (or whatever language you are using) fallback code for other platforms, use ifdefs to switch to the fallback version on platforms running on different assembly languages.
- Loop unrolling/splitting/fusion, function inlining etc.: there are optimizations that are usually done at assembly level (e.g. loop unrolling physically replaces a loop by repeated commands which gains speed but also makes the program bigger) and higher level languages try to perform them automatically. However if you're writing in assembly or have a dumb compiler (or are even writing a compiler) you may do these manually, e.g. with macros/templates etc. Sometimes you can hint a compiler to perform these optimizations, so look this up.
- Parallelism (multithreading, compute shaders, ...) can astronomically accelerate many programs, it is one of the most effective techniques of speeding up programs -- we can simply perform several computations at once and save a lot of time -- but there are a few notes. Firstly not all problems can be parallelized, some problems are sequential in nature, even though most problems can probably be parallelized to some degree. Secondly it is hard to do, opens the door for many new types of bugs, requires hardware support (software simulated parallelism can't work here of course) and introduces dependencies; in other words it is huge bloat, we don't recommend parallelization unless a very, very good reason is given. Optional use of SIMD instructions can be a reasonable midway to going full parallel computation.
- Specialized hardware (e.g. a GPU) astronomically accelerates programs, but as with the previous point, portability and simplicity greatly suffer, your program becomes bloated and gains dependencies. Always think twice before using specialized hardware and offer software fallbacks.
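The bit trick and the variable "squeezing" tips above can be sketched in C roughly as follows. This is only an illustration: all function names are made up for this example, and the fixed-width types from <stdint.h> are used in place of char and int just to make the bit widths explicit.

```c
#include <stdint.h>

/* Returns 1 if x is a power of two, otherwise 0 (0 is not a power of two).
   No loop and no explicit if is needed. */
int isPowerOfTwo(uint32_t x)
{
  return !(x & (x - 1)) && x;
}

/* Squeezes two 4bit values (0 to 15) into one byte: a goes to the lower
   4 bits, b to the higher 4 bits. */
uint8_t packNibbles(uint8_t a, uint8_t b)
{
  return (uint8_t) ((a & 0x0f) | ((b & 0x0f) << 4));
}

uint8_t lowNibble(uint8_t packed)
{
  return packed & 0x0f;
}

uint8_t highNibble(uint8_t packed)
{
  return packed >> 4;
}

/* 16 boolean values stored in the bits of one 16bit integer. The index has
   to be in range 0 to 15 (not checked here). */
uint16_t setFlag(uint16_t flags, unsigned int index, int value)
{
  uint16_t mask = (uint16_t) (1u << index);

  /* clear the bit, then write the new value, without branching */
  return (uint16_t) ((flags & ~mask) | ((unsigned int) (value != 0) << index));
}

int getFlag(uint16_t flags, unsigned int index)
{
  return (flags >> index) & 1;
}
```

For example packNibbles(3, 12) gives 0xc3, from which lowNibble and highNibble recover 3 and 12 again. Keep in mind this is the space-time tradeoff mentioned above: every access now costs a few extra shifts and masks.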
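The early branching tip above may be easier to see in code. Below is a minimal sketch, assuming a made-up task (either negate or double every value in an array); the function names are hypothetical. The first version tests the flag in every iteration, the second tests it once and then runs one of two specialized loops.

```c
#include <stddef.h>

/* Branch inside the loop: negate is checked in every iteration even though
   it never changes while the loop runs. */
void negateOrDoubleSlow(int *values, size_t count, int negate)
{
  for (size_t i = 0; i < count; ++i)
  {
    if (negate)
      values[i] = -values[i];
    else
      values[i] *= 2;
  }
}

/* Early branching: the check happens once, in front of two loops that have
   no branch in their body. */
void negateOrDouble(int *values, size_t count, int negate)
{
  if (negate)
  {
    for (size_t i = 0; i < count; ++i)
      values[i] = -values[i];
  }
  else
  {
    for (size_t i = 0; i < count; ++i)
      values[i] *= 2;
  }
}
```

Both functions produce the same result; the second one is bigger (two loops instead of one), which is the space-time tradeoff mentioned in the tip. A smart compiler may do this transformation by itself, so measure before relying on it.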
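The tip about replacing division with multiplication by reciprocal can be done with integers and fixed point, for example like this. It is a rough sketch under stated assumptions: the function name, the choice of 16 fractional bits and the rounding of the reciprocal upwards are all picked just for this illustration. The result is only exact while the numbers stay small (here it is guaranteed as long as value * divisor is below 65536); for bigger numbers the error grows, so test it for your range.

```c
#include <stdint.h>
#include <stddef.h>

#define FRACTION_BITS 16 /* assumed fixed point precision for this sketch */

/* Divides all values by the same divisor. Only one real division is done
   (to get the reciprocal), the loop then uses just multiplication and a bit
   shift. The divisor must not be 0. */
void divideAll(uint32_t *values, size_t count, uint32_t divisor)
{
  /* reciprocal 1 / divisor in fixed point, rounded up */
  uint32_t reciprocal = ((1u << FRACTION_BITS) + divisor - 1) / divisor;

  for (size_t i = 0; i < count; ++i)
    values[i] = (uint32_t)
      (((uint64_t) values[i] * reciprocal) >> FRACTION_BITS);
}
```

If the divisor is known at compile time, the reciprocal becomes a constant and even the one remaining division disappears from the run time code.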
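The quick opt-out tip with bounding spheres can be sketched as below. The Body type and all function names are hypothetical, and preciseCollision is only a stand-in for a real expensive test (it just counts how many times it was called). Note the cheap check also follows the "use math" tip: it compares squared distances, so no square root is computed at all.

```c
/* Hypothetical body for this sketch: center and radius of its bounding
   sphere (a real body would also hold its precise shape). */
typedef struct
{
  int x, y, z;
  int radius;
} Body;

/* Cheap check: the spheres overlap if the distance of their centers is at
   most the sum of their radii; both sides are compared squared. */
int boundingSpheresCollide(const Body *a, const Body *b)
{
  long long dx = (long long) a->x - b->x;
  long long dy = (long long) a->y - b->y;
  long long dz = (long long) a->z - b->z;
  long long r = (long long) a->radius + b->radius;

  return dx * dx + dy * dy + dz * dz <= r * r;
}

/* Counts how many times the expensive test had to run. */
static unsigned int preciseTestsRun = 0;

/* Stand-in for an expensive precise collision test, it only counts calls
   and always reports a collision to keep the sketch short. */
static int preciseCollision(const Body *a, const Body *b)
{
  (void) a;
  (void) b;
  ++preciseTestsRun;
  return 1;
}

int bodiesCollide(const Body *a, const Body *b)
{
  if (!boundingSpheresCollide(a, b))
    return 0; /* quick opt-out, the expensive test is skipped */

  return preciseCollision(a, b);
}
```

In a scene where most bodies are far from each other, almost all pairs will take the quick opt-out path and the expensive test will run only rarely.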
When To Actually Optimize?
Nubs often ask this and this can also be a very nontrivial question. Generally fine, sophisticated optimization should come as one of the last steps in development, when you actually have a working thing. These are optimizations requiring significant energy/time to implement -- you don't want to spend resources on this at the stage when they may well be dropped in the end, or they won't matter because they'll be outside the bottleneck. However there are two "exceptions".
The highest-level optimization is done as part of the initial design of the program, before any line of code gets written. This includes the choice of data structures and mathematical models you're going to be using, the very foundation around which you'll be building your castle. This happens in your head at the time you're forming an idea for a program, e.g. you're choosing between server-client or P2P, monolithic or micro kernel, raytraced or rasterized graphics etc. These choices affect greatly the performance of your program but can hardly be changed once the program is completed, so they need to be made beforehand. This requires wide knowledge and experience as you work by intuition.
Another kind of optimization done during development is just automatically writing good code, i.e. being familiar with specific patterns and using them without much thought. For example if you're computing some value inside a loop and this value doesn't change between iterations, you just automatically put computation of that value before the loop. Without this you'd simply end up with shitty code that would have to be rewritten line by line at the end. Yes, compilers can often do this simple kind of optimization for you, but you don't want to rely on it.
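The loop example from the previous paragraph may look for instance like this in C. It is a minimal sketch with made-up function names and a made-up expression (a * b + c) standing for "some value that doesn't change between iterations".

```c
#include <stddef.h>

/* The factor a * b + c is the same in every iteration, yet it is written so
   that it gets recomputed each time. */
void scaleAllSlow(int *values, size_t count, int a, int b, int c)
{
  for (size_t i = 0; i < count; ++i)
    values[i] *= a * b + c;
}

/* The same thing with the computation moved in front of the loop. */
void scaleAll(int *values, size_t count, int a, int b, int c)
{
  int factor = a * b + c; /* computed only once */

  for (size_t i = 0; i < count; ++i)
    values[i] *= factor;
}
```

Writing the second form costs no extra effort once it becomes a habit, and it stays fast even with a compiler that wouldn't do the move for you.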