diff --git a/fizzbuzz.md b/fizzbuzz.md
index ac20d8c..13f23fd 100644
--- a/fizzbuzz.md
+++ b/fizzbuzz.md
@@ -11,6 +11,8 @@ Fizz, 28, 29, FizzBuzz, 31, 32, Fizz, 34, Buzz, Fizz, 37, 38, Fizz, Buzz, 41, Fi
 
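-**Why the fuss around FizzBuzz?** Well, firstly it dodges an obvious single elegant solution that many similar problems usually have and it leads a beginner to a difficult situation that can reveal a lot about his experience and depth of his knowledge. The tricky part lies in having to check not only divisibility by 3 and 5, but also by BOTH at once, which when following basic programming instincts ("just if-then-else everything") leads to inefficiently checking the same divisibility twice and creating some extra ugly if branches and also things like reusing [magic constants](magic_constant.md) in multiple places, conflicting the "[DRY](dry.md)" principle etc. It can also show if the guy knows things usually unknown to beginners such as that the modulo operation with non-power-of-two is usually expensive and we want to minimize its use. However it is greatly useful even when an experienced programmer faces it because it can serve a good, deeper discussion about things like [optimization](optimization.md); while FizzBuzz itself has no use and optimizing algorithm that processes 100 numbers is completely pointless, the problem is similar to some problems in practice in which the approach to solution often becomes critical, considering [scalability](scalability.md). In practice we may very well encounter FizzBuzze's big brother, a problem in which we'll need to check not 100 numbers but 100 million numbers per second and check not only divisibility by 3 and 5, but by let's say all [prime numbers](prime.md). Problems like this come up e.g. in [cryptography](cryptography.md) all the time, so we really have to come to discussing [time complexity](time_complexity.md) classes, [instruction sets](isa.md) and hardware acceleration, [parallelism](parallelism.md), possibly even [quantum computing](quantum.md), different [paradigm](paradigm.md)s etc. So really FizzBuzz is like a kind of great conversation starter, a bag of topics, a good training example and so on.
+**Why the fuss around FizzBuzz?** Well, firstly it dodges an obvious single elegant solution that many similar problems usually have and it leads a beginner to a difficult situation that can reveal a lot about his experience and the depth of his knowledge. The tricky part lies in having to check not only divisibility by 3 and 5, but also by BOTH at once, which when following basic programming instincts ("just if-then-else everything") leads to inefficiently checking the same divisibility twice and creating some extra ugly if branches and also things like reusing [magic constants](magic_constant.md) in multiple places, conflicting with the "[DRY](dry.md)" principle etc. It can also show if the guy knows things usually unknown to beginners such as that the modulo operation with a non-power-of-two divisor is usually expensive and we want to minimize its use. However it is greatly useful even when an experienced programmer faces it because it can serve as a basis for a good, deeper discussion about things like [optimization](optimization.md); while FizzBuzz itself has no use and optimizing an algorithm that processes 100 numbers is completely pointless, the problem is similar to some problems in practice in which the approach to solution often becomes critical, considering [scalability](scalability.md). In practice we may very well encounter FizzBuzz's big brother, a problem in which we'll need to check not 100 numbers but 100 million numbers per second and check not only divisibility by 3 and 5, but by let's say all [prime numbers](prime.md). Problems like this come up e.g. in [cryptography](cryptography.md) all the time, so we really have to come to discussing [time complexity](time_complexity.md) classes, [instruction sets](isa.md) and hardware acceleration, [parallelism](parallelism.md), possibly even [quantum computing](quantum.md), different [paradigm](paradigm.md)s etc. So really FizzBuzz is like a kind of great conversation starter, a bag of topics, a good training example and so on.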
 
+TODO: some history etc.
+
 ## Implementations
 
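-Let's see how we can implement, improve and [optimize](optimization.md) FizzBuzz in [C](c.md). Keep in mind the question of scalability, i.e. try to imagine how the changes we make to the algorithm would manifest if the problem grew, i.e. if for example we wanted to check divisibility by many more numbers than just 1 and 5 etc. We will only focus on optimizing the core of the algorithm, i.e. the divisibility checking, without caring about other things like optimizing printing the commas between numbers and whatnot. Also we'll be supposing all compiler optimization are turned off so that you the excuse "compiler will optimize this" can't be used :)
+Let's see how we can implement, improve and [optimize](optimization.md) FizzBuzz in [C](c.md). Keep in mind the question of scalability, i.e. try to imagine how the changes we make to the algorithm would manifest if the problem grew, i.e. if for example we wanted to check divisibility by many more numbers than just 3 and 5 etc. We will only focus on optimizing the core of the algorithm, i.e. the divisibility checking, without caring about other things like optimizing printing the commas between numbers and whatnot. Also we'll be supposing all compiler optimizations are turned off so that the excuse "compiler will optimize this" can't be used :)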
@@ -110,7 +112,40 @@ int main(void)
 
 This solution utilizes a [switch](switch.md) structure to only perform single branching in the divisibility check, based on a 2 bit value that in its upper bit records divisibility by 5 and in the lower bit divisibility by 3. This gives us 4 possible values: 0 (divisible by none), 1 (divisible by 3), 2 (divisible by 5) and 3 (divisible by both). The switch structure by default creates a jump table that branches right into the correct label in O(1).
 
-TODO: optimize this further to no jump at all, just by offsetting the string to be printed
+We can even go as far as avoiding any branching at all in the divisibility check with so-called [branchless programming](branchless.md), even though in this specific case saving one branch is probably not worth the cost of making it happen. But for the sake of completeness we can do e.g. something like the following.
+
+```
+#include <stdio.h>
+
+char str[] = "\0\0\0\0\0\0\0\0Fizz\0\0\0\0Buzz\0\0\0\0FizzBuzz";
+
+int main(void)
+{
+  for (int i = 1; i <= 100; ++i)
+  {
+    if (i != 1)
+      printf(", ");
+
+    char *s = str;
+
+    *s = '1'; // convert number to string
+    s += i >= 100;
+    *s = '0' + (i / 10) % 10;
+    s += (*s != '0') | (i >= 100);
+    *s = '0' + i % 10;
+    s++;
+    *s = 0;
+
+    int offset = ((i % 3 == 0) + ((i % 5 == 0) << 1)) << 3;
+    printf("%s", str + offset); // constant format string, text passed as data
+  }
+
+  putchar('\n');
+  return 0;
+}
+```
+
+The idea is to have a kind of [lookup table](lut.md) of all the options we can print, then pick the one to actually print by indexing the table with the 2 bit divisibility value we used in the above example. Our lookup table here is the global string `str`, which we can view as an array of zero terminated strings, each one starting at an index that is a multiple of 8 (this alignment to a power of two makes the indexing more efficient as we can compute the offset with a mere bit shift as opposed to multiplication). The first item in the table is initially empty (all zeros) and in each loop cycle is overwritten with the ASCII representation of the currently checked number, the second item is "Fizz", the third item is "Buzz" and the last one is "FizzBuzz". In each loop cycle we compute the 2 bit divisibility value, which will be a number from 0 to 3, bit shift it by 3 to the left (i.e. multiply it by 8) and use that as an offset, i.e. the place where the printing function will start printing (also note that printing stops upon encountering a zero value). The conversion of the number to ASCII is also implemented without any branches (and could actually be a bit simpler as we know e.g. the number 100 will never be printed). However notice that we pay a great price for all this: the code is quite ugly and unreadable, and performance-wise we often waste time converting the number to ASCII even if it then won't be printed, which is something a branch can actually prevent. So at this point we have probably overengineered this.
 
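-If the problem asks for shortest code, even on detriment of [readability](readability.md) and efficiency, we might try **the [code golfer](code_golf.md) approach**:
+If the problem asks for the shortest code, even to the detriment of [readability](readability.md) and efficiency, we might try **the [code golfer](code_golf.md) approach**: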