Archive for the ‘Assembly’ Category

Multi-byte NOPs

Tuesday, June 26th, 2007

Horay, finally official multi-byte NOPs are supported by Intel 80×86 processors. So OK, not all processors support it yet, and you have to check the result of a cpuid instruction to know whether you can use it or not, but it’s a good start nevertheless. It is unnecessary to mention that this kind of NOP is a real NOP, and yet I just did. :)

So why do we care you ask?

Well, probably you certainly don’t. But some code generator tools will use it in the future, or hand written code in assembly. It will make the code more understandable and it will fit all sizes you need from 2 bytes up to 9 bytes. However, to be honest, I can’t really grasp the reason why they actually support it. I mean, one could write X single byte NOPs. But then I thought, the processor will spend time for decoding the instructions in the pipeline until it will realize it’s a NOP. So I guess the multi-byte NOP will be faster than a varying number of NOPs in a row. I tried to think what’s the benefit of such thing. Most of the NOPs or pseudo NOPs are used to align code to 8 or 16 bytes boundaries. This is probably for caching reasons and faster memory read operations. But if the NOPs are used for alignment usually they will follow a branch and (sometimes) even conditional branch, so they might not get to be executed most of the times. So why the effort for making such a new instruction? I can’t come up with a good answer. Got any idea? Ah, and if you were to align your code and it won’t run then why not dumping zeroes as a padding? And it’s not like 80×86 architecture has delayed branch… So they had a good reasoning for coming up with this instruction. The only real answer I came up with is for time critical code where you have to measure your code in MS…

Maybe you already saw some pseudo NOPs in the past. These are real instructions that don’t affect registers or flags of the current execution context. You can come up with many variations of such NOPs using the LEA instruction:

LEA REG, [REG+0]

And you can come up with less popular NOPs, MOV REG, REG…

The advantage of the LEA instruction is that it accepts a complex addressing mode operand so you can make the whole instruction bigger easily by using different sizes of the immediate ‘0’. If you don’t understand how the ModR/M is formatted, you will have to assemble the code and disassemble it until it fits the size you require. Not mentioning that you can’t control 100% of the code generated by the assembler that you would prefer to use the DB thingy directly to emit the pseudo NOP. The multi-byte NOP accepts the same source operand as LEA accepts, therefore its size varies.

Well, you might find yourself find of ADD EAX, 0 as a NOP. Probably because in math it doesn’t mean anything special. But while executed it affects the flags. Though, you might say it’s not a big deal and that depends where you dump the pseudo NOPs in your code. However, it seems that even a popular assembler had some bad issues with generating pseudo NOPs. And believe me, when you trust a tool and it misbehaves, you might get crazy until you realize who’s to blame!

Now I have to make diStorm support it ;)

NOP NOP NOP!!!1

Pop Count

Tuesday, June 5th, 2007

Population Count is counting the number of bits set to one in a word (or any sized integer). I found pop counting as one of the most fascinating small algorithms in the bit twiddling category. There are so many uses for pop counting. Though, rarely, I find myself use them, to make stuff compact, like bit-arrays, special bit-flags which represent collections, etc… There are also popular algorithms which use pop counting in error-correcting codes and sound/image processing.

Here’s my favorite implementation:

int popcount(unsigned int x)
{
int c = 0;
for (; x > 0; x &= x -1) c++;
return c;
}

The whole concept is behind the mere formula: x &= x -1. But I will leave it for you to figure it out.

Speaking of pop count, Imri and I even used this same implementation above in the NULL Terminated Strip. I even hinted for this algorithm. And here’s a real 32 bits bugfree implementation (that strip has the bugs on purpose):

XCHG EBX, EAX
XOR EAX, EAX
OR EBX, EBX
JZ END
A:
LEA ECX, [EBX-1]
INC EAX
AND EBX, ECX
JNZ A
END:

Of course, most implementation are branch-free code, for the sake of speed. However – Lately, I added the SSE4 support to diStorm, and among the new instructions I found out there is a single instruction for it all: popcnt. :)