A somewhat shallow post - just getting this out there in case someone else is searching the internet for this particular problem. I recently found myself trying to optimise a loop that looked like this:

uint32_t get_next_power_of_two(uint32_t i) {
        uint32_t ret = 1;
        while (ret < i) {
                ret <<= 1;
        }

        return ret;
}

The program this was a part of was a fairly boring experiment with a custom memory allocator, but this “find the next power of two” loop was taking a good 30% of the allocate function’s runtime in certain pathological (but entirely plausible) use cases - in particular, it was possible to always call this with really big numbers when allocating/deallocating large slabs of memory.

I thought “there has to be some hardware primitive for something like this”, and indeed there was. I happen to be on x86_64, which has a bsr instruction, for “bit scan reverse”. From the Intel docs, this “searches the source operand for the most significant set bit. If a most significant 1 bit is found, its bit index is stored in the destination operand.”

This sounds exactly like what I want, and I managed to get an inline assembly version working, but there’s a better way - both GCC and Clang support the __builtin_clz builtin (for “count leading zeros”). This has the advantage of doing the analogue of bsr on other architectures, and presumably falling back to a loop where an architecture does not have a fast primitive for this operation. The function I ended up with was:

uint32_t get_next_power_of_two(uint32_t i) {
        return 1 << (32 - __builtin_clz(i));
}
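One caveat worth flagging: this version doesn’t behave quite like the loop. It rounds an exact power of two up to the *next* one (8 becomes 16), returns 2 for i = 1, and __builtin_clz(0) is undefined. None of that mattered for my allocator, but if you need the loop’s exact semantics, a variant like this (my own sketch, assuming GCC/Clang and i within range) should match it:

```c
#include <stdint.h>

/* Matches the iterative version for all i: exact powers of two map to
 * themselves. Subtracting one first makes 8 round to 8 rather than 16;
 * the i <= 1 guard avoids __builtin_clz(0), which is undefined. */
uint32_t get_next_power_of_two_exact(uint32_t i) {
        if (i <= 1)
                return 1;
        return 1u << (32 - __builtin_clz(i - 1));
}
```

The extra compare and decrement cost a little, but the function stays branch-light and keeps the bsr-based core.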

As a back-of-the-envelope calculation of time taken, let’s look at the number of clock cycles Intel’s documentation says this approach should take. First, the assembly for this approach:

.LFB1818:
	.cfi_startproc
	bsrl	%edi, %edi
	movl	$32, %ecx
	movl	$1, %eax
	xorl	$31, %edi
	subl	%edi, %ecx
	sall	%cl, %eax
	ret
	.cfi_endproc

According to Intel, the bsrl takes 3 cycles, the two movl instructions take one cycle each, the xorl and subl take 0.33 cycles each, and the sall takes one cycle. Counting naively, this would be 7 cycles, but the two movl instructions can be parallelised with the bsrl, and the sall looks like it depends on the xorl and subl, leaving us with an estimate of 5 cycles. Even if this estimate is somewhat off, it will clearly be faster than the iterative approach - but by how much?

I wrote a small microbenchmarking program to test out both versions of the function. Each run of 3 billion repetitions (after an initial cache-warming call) was made with a different i for which to calculate the next power of two. Ten runs with these random values of i were made, and the results averaged.

This isn’t anything big or formal, so I’m going to be super sloppy and give the short of it: for 3 billion iterations, the iterative version averaged 20.779 seconds, and the clz version averaged 3.343 seconds. This is definitely a worthwhile improvement, and the original benchmarks for the memory allocator improved similarly, in proportion with the amount of time spent in this function - and, just as importantly, no cases got slower.

The next bit is just some nanosecond/clock cycle maths for curiosity’s sake. At the 4.20GHz non-turbo-boost speed of my machine, each clock cycle takes approximately 0.2381 nanoseconds. Dividing the 3,343,000,000 nanoseconds by three billion (the number of iterations), we see that each iteration took 1.1143 nanoseconds, or 4.68 clocks - this is very close to the estimate we made before, especially considering the turbo boost clock speed increase that likely occurred while running the benchmark. So, we’re probably on the right track.

As for the original approach, each iteration took 6.926 nanoseconds, or 29.1 cycles.