The final verdict: SSE2 is the best option. It runs at between 150% and 388% of the speed of the CRT strlen function. The 32-bit CRT and libc strlen implementations are quite slow; their 64-bit counterparts are about twice as fast.
1. What about SSE4.2?
First, SSE4.2 is slower. Though it can run on unaligned memory faster than SSE2 can, it seems searching only for zeros can be done more efficiently with SSE2.
Second, it is only available on relatively recent processors (Intel since Nehalem, released in late 2008; AMD since Bulldozer, released in 2011). That is a harsh penalty.
Noteworthy: the SSE4.2 pcmpistri instruction (_mm_cmpistri intrinsic) runs faster when it reads unaligned memory throughout than when it pays the overhead of aligning the initial block of memory.
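For readers who want to see what "searching only for zeros" with SSE2 looks like, here is a minimal sketch. It is not the code that was benchmarked; the names strlen_sse2 and lowest_set_bit are made up for illustration. It assumes an x86 target with SSE2, and it aligns the initial block so that every 16-byte load stays inside one memory page.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit of a non-zero mask.
   A tuned version would likely use a bit-scan intrinsic instead. */
static unsigned lowest_set_bit(unsigned mask)
{
    unsigned i = 0;
    while ((mask & 1u) == 0) {
        mask >>= 1;
        ++i;
    }
    return i;
}

/* Sketch of an SSE2 strlen (illustrative name, not the benchmarked code). */
size_t strlen_sse2(const char *s)
{
    const __m128i zero = _mm_setzero_si128();

    /* Round the address down to a 16-byte boundary. The first load may
       read bytes located before s, but it cannot cross a page boundary. */
    uintptr_t addr = (uintptr_t)s;
    const char *block = (const char *)(addr & ~(uintptr_t)15);
    unsigned offset = (unsigned)(addr & 15);

    /* Compare 16 bytes against zero; movemask yields one bit per byte. */
    __m128i chunk = _mm_load_si128((const __m128i *)block);
    unsigned mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));

    /* Discard the bits that belong to bytes before the start of s. */
    mask >>= offset;
    if (mask != 0)
        return lowest_set_bit(mask);

    /* Every following block is aligned and lies entirely after s. */
    for (;;) {
        block += 16;
        chunk = _mm_load_si128((const __m128i *)block);
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(chunk, zero));
        if (mask != 0)
            return (size_t)(block - s) + lowest_set_bit(mask);
    }
}
```

Note that the aligned loads read a few bytes outside the string itself. That is harmless on real hardware because an aligned 16-byte block never spans two pages, but memory checkers may complain about it.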
2. Is a separate executable for 64-bit worth it?
The benefits of 64-bit code show up here in the traditional, non-SIMD methods. With the 128-bit SIMD instructions, however, it really doesn’t matter much. You would indeed get a boost from running 64-bit code on a 64-bit operating system, but for string functions it is negligible.
3. What is more memory efficient?
None of these methods allocate memory.
4. Is it worth requiring SSE2?
While you can certainly fall back to a standard implementation, SSE2 has been around for a very long time and it is unlikely that any software that you write today will be run on unsupported hardware.
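If you do want the fallback, a runtime check is cheap. The following is a rough sketch rather than anything taken from the benchmark source: cpu_has_sse2, pick_strlen and strlen_fn are hypothetical names. It assumes MSVC's __cpuid intrinsic or the GCC/Clang __builtin_cpu_supports builtin, and on any other compiler or architecture it simply reports no SSE2.

```c
#include <stddef.h>
#include <string.h>
#if defined(_MSC_VER)
#include <intrin.h>  /* __cpuid */
#endif

/* Hypothetical function-pointer type shared by all strlen variants. */
typedef size_t (*strlen_fn)(const char *);

/* Returns 1 if the CPU reports SSE2, 0 otherwise or when unknown. */
int cpu_has_sse2(void)
{
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int info[4];                 /* EAX, EBX, ECX, EDX */
    __cpuid(info, 1);            /* leaf 1: feature flags */
    return (info[3] >> 26) & 1;  /* EDX bit 26 = SSE2 */
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;                    /* assumption: treat as unsupported */
#endif
}

/* Returns simd_impl when SSE2 is available, else the standard strlen. */
strlen_fn pick_strlen(strlen_fn simd_impl)
{
    if (simd_impl != NULL && cpu_has_sse2())
        return simd_impl;
    return strlen;
}
```

You would call pick_strlen once at startup with your SSE2 routine and keep the returned pointer, so the check is not repeated on every call.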
5. Is this faster in ICC 12 or MSVC++ 2010 SP1?
The results were inconclusive; the two compilers appear to perform equally.
The source code and raw data are available. The tests were run on an Intel i7-2600K @ 3.4GHz with 8GB of RAM in Windows 7 SP1 64-bit. The Y-axis is the duration in seconds of 100,000 iterations multiplied by 100,000. The alignment of the string increases by 1 byte for each string length. The libc function is the source code from git on 2011-10-13, updated to use 64 bits in VC++.
“return 0” is essentially a baseline for comparison, since it is the simplest function that can be called. The tests were performed in Visual Studio 2010 SP1.