# Slightly Faster Powers of Two

A somewhat shallow post - just getting this out there in case someone else is searching the internet for this particular problem. I recently found myself trying to optimise a loop that looked like this:

```c
uint32_t get_next_power_of_two(uint32_t i) {
    uint32_t ret = 1;
    while (ret < i) {
        ret <<= 1;
    }
    return ret;
}
```

The program this was a part of was a fairly boring experiment with a custom memory allocator, but this “find the next power of two” loop was taking a good 30% of the `allocate` function’s runtime in certain pathological (but entirely plausible) use cases - in particular, it was possible to always call this with really big numbers when allocating/deallocating large slabs of memory, and since the loop does one shift per bit, an input near 2^31 costs around 31 iterations.

I thought “there has to be some hardware primitive for something like this”, and indeed there was. I happen to be on x86_64, which has a `bsr` instruction, for “bit scan reverse”. From the Intel docs, this “searches the source operand for the most significant set bit. If a most significant 1 bit is found, its bit index is stored in the destination operand.”

This sounds exactly like what I want, and I managed to get an inline assembly version working.
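Something along these lines - this is a sketch in GCC/Clang extended-asm syntax rather than the exact code from the experiment, and the `bsr32` name is just for illustration:

```c
#include <stdint.h>

/* Sketch: "bit scan reverse" via GCC/Clang extended inline assembly.
 * bsr leaves its destination undefined when the source is 0, so this
 * assumes x != 0. */
static inline uint32_t bsr32(uint32_t x) {
    uint32_t index;
    __asm__("bsrl %1, %0" : "=r"(index) : "rm"(x));
    return index; /* bit position (0-31) of the most significant set bit */
}
```

For nonzero `x`, the most-significant-bit index relates directly to leading zeros: `__builtin_clz(x) == 31 - bsr32(x)`.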
But there’s a better way - both GCC and Clang support the `__builtin_clz` builtin (for “count leading zeros”). This has the advantage of compiling to the analogue of `bsr` on other architectures, and presumably falling back to a loop where an architecture does not have a fast primitive for this operation. The function I ended up with was:

```c
uint32_t get_next_power_of_two(uint32_t i) {
    /* caveats: undefined for i == 0 (__builtin_clz(0) is undefined), and
     * an exact power of two gets rounded up to the next one, where the
     * original loop would have returned it unchanged */
    return 1 << (32 - __builtin_clz(i));
}
```
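As an aside on that fallback: even without a `clz`-style primitive, there is a well-known portable, branch-free way to do this by smearing the top set bit rightward and adding one. A sketch (the function name here is just for illustration):

```c
#include <stdint.h>

/* Sketch: the classic bit-smearing round-up. After the OR cascade, every
 * bit below the highest set bit is 1, so adding 1 yields a power of two.
 * The initial decrement makes an exact power of two map to itself
 * (matching the original loop); note that 0 maps to 0, not 1. */
uint32_t next_power_of_two_portable(uint32_t i) {
    i--;
    i |= i >> 1;
    i |= i >> 2;
    i |= i >> 4;
    i |= i >> 8;
    i |= i >> 16;
    return i + 1;
}
```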

As a back-of-the-envelope calculation of time taken, let’s look at the number of clock cycles Intel’s documentation says this approach should take. First, the generated assembly:

```asm
.LFB1818:
    .cfi_startproc
    bsrl    %edi, %edi
    movl    $32, %ecx
    movl    $1, %eax
    xorl    $31, %edi
    subl    %edi, %ecx
    sall    %cl, %eax
    ret
    .cfi_endproc
```

According to Intel, the `bsrl` takes 3 cycles, the two `movl` instructions take one cycle each, the `xorl` and `subl` take 0.33 cycles each, and the `sall` takes one cycle. Counting naively, this would be about 7 cycles, but the two `movl` instructions can be parallelised with the `bsrl`, and the `sall` looks like it depends on the results of the `xorl` and `subl`, leaving us with an estimate of 5 cycles. Even if this estimate is somewhat off, it will clearly be faster than the iterative approach - but by how much?

I wrote a small microbenchmarking program to test out both versions of the function. Each run of 3 billion repetitions (after an initial cache-warming call) used a different `i` for which to calculate the next power of two. 10 runs with these random `i`s were made, and the results averaged.
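The harness was nothing fancy; it looked roughly like the sketch below (illustrative, not the original program - in particular `RUNS`, `REPS`, and the timing details are stand-ins):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 10
#define REPS 3000000000ULL

/* the function under test - link against either implementation */
uint32_t get_next_power_of_two(uint32_t i);

int main(void) {
    srand((unsigned)time(NULL));
    double total_seconds = 0.0;

    for (int run = 0; run < RUNS; run++) {
        /* a different random i per run; avoid the i == 0 edge case */
        uint32_t i = ((uint32_t)rand() << 16) ^ (uint32_t)rand();
        if (i == 0) i = 1;

        volatile uint32_t sink = get_next_power_of_two(i); /* cache-warming call */

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (uint64_t r = 0; r < REPS; r++) {
            sink = get_next_power_of_two(i); /* volatile sink keeps the call alive */
        }
        clock_gettime(CLOCK_MONOTONIC, &end);

        total_seconds += (double)(end.tv_sec - start.tv_sec)
                       + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
        (void)sink;
    }

    printf("average per run: %.3f s\n", total_seconds / RUNS);
    return 0;
}
```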

This isn’t anything big or formal, so I’m going to be super sloppy and just give the short of it: for 3 billion iterations, the iterative version averaged 20.779 seconds, and the `clz` version averaged 3.343 seconds. This is definitely a worthwhile improvement, and the original benchmarks for the memory allocator improved similarly, in proportion to the amount of time spent in this function - and, just as importantly, no cases got slower.

The next bit is just some nanosecond/clock-cycle maths for curiosity’s sake. At my machine’s non-turbo-boost 4.20GHz clock speed, each clock cycle takes approximately 0.2381 nanoseconds. Dividing the 3,343,000,000 nanoseconds by three billion (the number of repetitions), we see that each iteration took 1.1143 nanoseconds, or 4.68 clocks - this is very close to the estimate we made before, especially considering the turbo-boost clock speed increase that likely occurred while running the benchmark. So, we’re probably on the right track.

As for the original iterative approach, each iteration took 6.926 nanoseconds, or 29.1 cycles - roughly a 6x difference, matching the wall-clock ratio.