Introduction#

Recently I came across this old-school XOR swap trick. It’s a trick that impressed me when I was a young coder and one of my friends first showed it to me. As I’ve grown older and gained more understanding of computer architecture, I use this trick as an example of how a more complex solution does not necessarily mean a better one. And in the case of the XOR swap trick, it is one of those things that makes you feel good about a tricky implementation, but will hurt performance at the end of the day.

Swap#

So what is the XOR swap trick, you might ask. Consider the following C code where you want to swap the variables A and B. You might write something akin to

	int A = 3;
	int B = 5;
	int tmp; 

	tmp = A; 
	A = B; 
	B = tmp;

Then at some point in the past, somebody had the idea of reducing this down to using 2 registers instead of the 3 implied in the above program. I suspect this trick is a relic of an age long past, when register usage was important. Hence they did the following:

	int A = 3;
	int B = 5;

	A = B ^ A; // B XOR A
	B = A ^ B; // A XOR B
	A = B ^ A; // B XOR A

Which in itself is a rather nifty little trick. I like the elegance of it.
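To see why the three XORs actually swap the values, note that x ^ x == 0 and x ^ 0 == x, and that XOR is associative and commutative: the second line cancels the original B out of A ^ B, and the third line cancels the original A. A short trace with the values from above (the function name is my own):

```c
#include <assert.h>

/* Trace of the XOR swap with A = 3, B = 5. A0 and B0 denote the
   original values; the cancellation relies on x ^ x == 0 and
   x ^ 0 == x. */
void xor_swap_trace(void) {
    int A = 3, B = 5;
    A = B ^ A;  /* A == A0 ^ B0              == 6 */
    B = A ^ B;  /* B == (A0 ^ B0) ^ B0 == A0 == 3 */
    A = B ^ A;  /* A == A0 ^ (A0 ^ B0) == B0 == 5 */
    assert(A == 5 && B == 3);
}
```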

Problem#

Now, what is the problem, you might ask. On the face of it, this looks innocent enough. However, for those who have studied computer architecture or compilers, you can see that the XOR swap method introduces a dependency on the results: each line in the XOR swap code depends on the line above it. This alone prevents the CPU from issuing the instructions in parallel; it has to wait for one result to come back before it can issue the next instruction (very simplified). The CPU also has a bunch of move optimization tricks (such as eliminating register-to-register moves during register renaming) which make the simple, straightforward implementation quite fast.
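The dependency chains can be read straight off the source. A sketch of the two strategies as functions (function names and the pointer-based signatures are mine, just to make the example compilable):

```c
#include <assert.h>

/* Temporary-variable swap: the loads of *a and *b are independent of
   each other, so an out-of-order core can execute them in parallel,
   and register renaming often makes plain register moves nearly free. */
void swap_tmp(int *a, int *b) {
    int tmp = *a;
    *a = *b;
    *b = tmp;
}

/* XOR swap: every statement reads the result of the one before it,
   forming a serial dependency chain -- nothing here can overlap. */
void swap_xor(int *a, int *b) {
    *a = *b ^ *a;  /* must complete before the next line can start */
    *b = *a ^ *b;  /* reads the new *a */
    *a = *b ^ *a;  /* reads the new *b */
}
```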

Benchmark#

Now, how much worse/better are the two strategies? To test, I wrote two different pieces of code. Mind you, the code was written at the airport while waiting for my delayed plane, so it might have bugs in it. It basically implements the code from above, plus some tricks to make sure the code does not get constant-folded away. I then ran objdump -D on the executable files to make sure no compiler tricks had been applied. Finally, I timed the binaries over a bunch of iterations (268435456, or 2^28, to be exact). The experiments were run on a MacBook Pro with an Apple M2.
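The benchmark source itself is not shown in the post, but a minimal sketch of the two loop kernels might look like this (the function names and the volatile qualifier are my own; volatile is one way to keep the compiler from constant-folding the whole loop away):

```c
#include <assert.h>

/* Kernel for the mv binary: swap via a temporary, `iters` times.
   Hypothetical reconstruction -- the original source is not shown. */
void run_mv(long iters, volatile int *a, volatile int *b) {
    for (long i = 0; i < iters; i++) {
        int tmp = *a;
        *a = *b;
        *b = tmp;
    }
}

/* Kernel for the xor binary: the same loop, using the XOR swap. */
void run_xor(long iters, volatile int *a, volatile int *b) {
    for (long i = 0; i < iters; i++) {
        *a = *b ^ *a;
        *b = *a ^ *b;
        *a = *b ^ *a;
    }
}
```

Each kernel would then be wrapped in its own main, reading A and B from argv (as the ./mv 34 54 invocations below suggest), and timed with the shell’s time builtin. An even iteration count leaves the values where they started.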

./mv 34 54  0.63s user 0.00s system 99% cpu 0.639 total
./xor 34 54  1.65s user 0.01s system 99% cpu 1.665 total

So, the XOR implementation is about 2.6 times slower than the naive mv implementation.

Conclusions#

So, to sum it up: I think what I am trying to say is, be careful with premature optimizations and solutions that look fancy. There are a lot of optimization tricks happening in the entire stack, from your simple program down to the gates on the CPU. At least try the straightforward implementation first, and if that does not work, you can go tricky. Otherwise, you might end up creating more problems than you solve with the tricky solution.
