Archive for the ‘Assembly’ Category

Anti Debugging

Monday, January 14th, 2008

I found this nice page about Anti-Debugging tricks. It covers so many of them, and if you know the techniques it’s really fun to read them quickly one by one. You can take a look yourself here: Window Anti-Debug Reference. One of the tricks really caught my attention, and it goes something like this:

push ss
pop ss
pushf

What really happens is that writing to SS triggers a protection mechanism in the processor, so that you can safely update rSP immediately afterwards. It could lead to catastrophic results if an interrupt occurred precisely after SS was updated but rSP wasn’t yet. Therefore the processor blocks all interrupts until the end of the next instruction, whatever it is. However, it blocks interrupts only for that one following instruction, no matter what; it won’t work if you pop ss and then do it again… This means that if you are under a debugger or a tracer, the above code will push onto the stack the real flags of the processor’s current execution context.

Thus doing this:
pop eax
and eax, 0x100
jnz under_debugging

ANDing the flags we just popped with 0x100 examines the trap flag (TF, bit 8 of EFLAGS). If you simply pushf and then pop eax, the trap flag will appear clear even while you are being traced, a potential lie: the debugger engine recognizes the pushf and fixes the saved flags. With the SS trick, though, even the trap is pended, or just stalled, till the next instruction, so the debugger engine never gets a chance to recognize the pushf and fix it. How lovely.

I really agree with some other posts I saw claiming that an anti-debugging trick is just like a zero-day: if you’re the first to use it, you will win and use it well, until it becomes known and gets taken care of. Although, to be honest, a zero-day is way cooler and a whole different story, but oh well… Besides, anti-debugging can’t really do harm; it just wastes some of the reverser’s time.

Since I wrote diStorm and read the Intel and AMD specs for most instructions upside down, I immediately knew about “mov ss” too. Even the docs mention this special behavior. But it never occurred to me to use this trick. Anyway, another way to do the same is:

mov eax, ss
mov ss, eax
pushf

A weird issue was that the mov ss, eax must really be mov ss, ax, although all disassemblers will show it as mov ss, ax anyway (as if it were in 16 bits). In truth you will need a db 0x66 to make this mov work… You can also do lots of fooling around with this instruction, like mov ss, ax; jmp $-2; and if you single-step that, without seeing the next instruction you might go crazy before you realize what’s going on. :)

I even went further and tried to use a privileged instruction like CLI after the write to SS, in the hope that the processor is executing in a special mode and there might be a weird bug. And guess what? It didn’t work and an exception was raised, of course. Probably otherwise I wouldn’t have written about it here :). It seems the processor’s logic has a kind of internal flag to pend interrupts till the end of the next instruction and that’s all. To find bugs you need to be creative… it never hurts to try even if it sounds stupid. Maybe with another privileged instruction in different rings and modes (pmode/realmode/etc.) it can lead to something weird, but I doubt it, and I’m too lazy to check it out myself. But imagine you could run a privileged instruction from ring3… now stop.

Delegators #3 – The ATL Way

Saturday, November 24th, 2007

The Active Template Library (or ATL) is very useful. I think that if you code in C++ under Windows it’s even a must. It will save you many hours of work. Although I have personally found a few bugs in this library, the code is very tight and does the work well. In this post I’m going to focus on the CWindow class, although there are many other classes which do the delegations seamlessly for the user, such as CAxDialogImpl, CAxWindow, etc. CWindow is the main one, so we will examine it.

I said in an earlier post that ATL uses thunks to call the instance’s window-procedure. A thunk is a mechanism to convert a function invocation between callee and caller. Look it up on Wiki for more info… To be honest, I was somewhat surprised to see that the mighty ATL uses Assembly to implement the thunks. As I had suggested a few ways myself to avoid Assembly, I don’t see a really good reason to use it here. You can say that the ways I suggested are less safe, but if a window chooses to be malicious it can screw everything up anyway, so I don’t think that’s the reason they used Assembly. Another reason I can think of is that their way, they don’t have to look up the instance’s ‘this’ pointer, they just have it, whereas you would have to call GetProp or GetWindowLong. But come on… so if you have any idea, let me know. I seriously have no problem with Assembly, but as most people thought that delegations must be implemented in Assembly, I showed that’s not true. The reason it really surprised me is that Assembly code is not portable among processors, as you know, and ATL is a very popular and widely used library. So if you take a look at ATL’s thunk code, you will see that they support x86 (obviously), AMD64, MIPS, ARM and more. And I ask, why the heck, when you can avoid it all? Again, for speed? Not sure it’s really worth it. The ATL guys know what they do; I doubt they didn’t know they could have done it without Assembly.

Anyhow, let’s get dirty: it’s all about their _stdcallthunk struct in the file atlstdthunk.h. The struct has a few members whose layout in memory is exactly their declaration order in the file; that’s the key here. There is an Init function which fills in the members. These members are the byte code of the thunk itself, that’s why their layout in memory is important: they get executed later by the processor. The Init function takes the ‘this’ pointer and the window-procedure pointer, and then initializes the members to form the following code:

mov dword ptr [esp+4], this
jmp WndProc

Note that ‘this’ and ‘WndProc’ are member values that are determined at construction time; they must be known in advance, prior to the creation of the thunk. Seeing [esp+4] we know they override the first argument of the instance’s window-procedure, which is hWnd. They could have pushed another argument for the ‘this’ pointer, but why do that if they can recover the hWnd from the ‘this’ pointer anyway…? And save a stack access? :)

Since the jmp instruction is relative in its behaviour, that is, the offset to the target address is not absolute but relative to the address of the instruction following the jmp, upon initialization the offset is calculated as well, like this:

DWORD((INT_PTR)proc - ((INT_PTR)this + sizeof(_stdcallthunk)));

Note that the ‘this’ here is the address of the thunk itself in memory (already allocated).
Now that we know how the thunk really looks and what it does, let’s see how it all gets connected, from the global window procedure to the instance’s one. This is some pseudo code I came up with (it does not really reflect the ATL code, I only wanna give you the idea of it):

CWindow::CreateWindow:

WndProcThunk = AllocThunk((DWORD)this.WndProc, (DWORD)this);
m_hWnd = CreateWindow (…);
SetWindowLong(m_hWnd, GWL_WNDPROC, WndProcThunk);

This is not all yet; in reality it’s a bit more complicated in the way they bind the HWND with its thunk… Now when the window is sent a message, the stack will contain the HWND, MESSAGE, WPARAM and LPARAM arguments for the original window procedure, but then the thunk will change the HWND to THIS and immediately transfer control to the global window procedure, which this time has the context of the instance!

CWindow::WindowProc(HWND hWnd, UINT uMsg, WPARAM wParam, LPARAM lParam)
{
 CWindowImplBaseT< TBase, TWinTraits >* pThis = (CWindowImplBaseT< TBase, TWinTraits >*)hWnd;
 LRESULT lRes = 0;
 pThis->ProcessWindowMessage(pThis->m_hWnd, uMsg, wParam, lParam, lRes, 0);
 return lRes;
}

And there we go. Notice the cast from hWnd to the ‘this’ pointer and the call with the context of the instance. At last, mission accomplished ;)

More information can be found here.

A new problem now arises; I will cover it in the next post in this chain, but in a couple of words, here’s a hint: the NX bit.

Delegators #1 – C++ & Win32

Wednesday, November 14th, 2007

From the technical point of view a Delegator is a call-forwarder, as simple as that. But in the design aspect it is a technique where an object outwardly expresses certain behaviour but in reality delegates responsibility for implementing that behavior… And thanks to Wiki for sparing me the ugly explanation. :)

The reason I wanted to implement delegators as seamlessly as possible in C++ was that I got into a situation where I wanted to write a pure OOP wrapper for window objects (CreateWindow). Now let me get into the juicy details before you raise your eyebrows. And even before that – yes, something like ATL but only for windows. So say you have a class that is responsible for creating a window and controlling it by handling its messages and stuff. Sounds legitimate, right? It would look something like:

class Window {
public:
Window(const cstring& name) { m_hWnd = CreateWindow(…); }
void Show () { ShowWindow(m_hWnd, SW_SHOW); }
 …
 ..
private:
 HWND m_hWnd;
};

And you get the notion. It looks OK and it is right, but let’s go further and hit the problem. When we create a window we need to pass the name of the window-class, which is some structure that contains more general info about the class itself, like the background color, mouse cursor and other stuff, but the most important member is the wndproc pointer. The wndproc pointer is a pointer to a callback function that handles the messages of the windows that belong to this window-class. I assume you know how the Win32 windowing system works. Now, can you spot the problem already?

Well, since it asks for a pointer to a function, and we want to have an instance of our Window object per window, there is no way to bind between the two. (OK, I lie, but continue reading please.) In our case we would like the method of our message-handler to be called, not a global function. If you’re not sure why, consider that each window has some private members in its instance that tell special things about that window. So we gotta find a way to link between our instance and our “window-procedure” method.

This is a start:

class Window {
private:
 LRESULT WINAPI InstanceWndProc(HWND hWnd, UINT message, WPARAM wparam, LPARAM lparam)
 {
  switch (message)
  {
   …
   ..
  }
  return DefWindowProc(hWnd, message, wparam, lparam);
 }
public:
 // g_window is a single global instance of Window, defined elsewhere.
 static LRESULT WINAPI WndProc(HWND hWnd, UINT message, WPARAM wparam, LPARAM lparam)
 {
  // Delegate to the instance’s (private) window procedure.
  return g_window.InstanceWndProc(hWnd, message, wparam, lparam);
 }
};

So now the window-class structure will point to the static WndProc, which when called will delegate its parameters to the internal (private) window-procedure method that can access the members. It’s almost a good solution. But now we are limited to only one window, held by a global instance. The good thing is that a static public function can call a private method of the same class, so we can hide the core of it. The bad thing is we still expose an internal function we didn’t want to in the first place.

The problem is now to find a way to link between a real window object and our instance. The ID of a window is its HWND (or Handle to Window). So we could hold a list of all HWNDs we created, look up the HWND before delegation and then make the right call to the correct instance. This is too much hassle for nothing. There ought to be a way to store some private data on the window object itself, right? At least I would suspect so. Eventually, after reading some MSDN and searching the net, I found a savior function which is called SetProp (and GetProp ofc). Examining their prototypes:

BOOL SetProp(HWND hWnd, LPCTSTR lpString, HANDLE hData);
HANDLE GetProp(HWND hWnd, LPCTSTR lpString);

We actually have a kind of dictionary: we give a string and store a pointer (to anything we’re up to). Afterwards, we can retrieve that pointer by using GetProp. Let’s work it out again:

ctor:
m_hWnd = CreateWindow(…);
SetProp(m_hWnd, "THIS_POINTER", (HANDLE)this); // Ah ha!

What we did was to link the HWND with the this pointer. The window procedure will look like this now:

static LRESULT WINAPI WndProc(HWND hWnd, UINT message, WPARAM wparam, LPARAM lparam)
{
 Window* pThis = (Window*)GetProp(hWnd, "THIS_POINTER"); // Some magic?
 if (pThis != NULL) return pThis->WndProc(…); // Forward the call to the correct instance.
 …
 …
 return DefWindowProc(…);
}

 Well, as for the code here, I don’t handle errors, and I use C casts, so sue me :). I merely wanna show you the mechanism. And if you really wanna get dirty, you will have to RemoveProp when you get WM_NCDESTROY, etc…

After I got this code really working, I was still wondering how ATL does this binding. So I took a look at their code… It seems to have a global linked list with all the instances of the windows that ATL created. Then, when the global window procedure gets called, it looks the window up in that list. In reality it is much more complex than my explanation, since they need to synchronize accesses among threads, make sure the window belongs to the same thread, etc… All that happens only on the first call of the window-procedure. Then it sets the REAL ‘window-procedure’ method of the instance itself, and there it uses Assembly, muwahaha. That will be covered next time.

BTW – SetWindowLong cannot work here, since all you can do with it is change the window-class fields. Although maybe there is some undocumented field you can play with to store anything you like. Never know? :)

About DIV, IDIV and Overflows

Thursday, November 8th, 2007

The IDIV instruction is a divide operation. It is less popular than its counterpart DIV. The difference between the two is that IDIV is for signed numbers whereas DIV is for unsigned numbers. I guess the “i” in IDIV means Integer, thus implying a signed integer. Sometimes I still wonder why they didn’t name it SDIV, which is much more readable and self-explanatory. But the name is not the real issue here. However, I would like to say that there is a difference between signed and unsigned division; otherwise they wouldn’t have been two different instructions in the first place, right? :) What a smart ass… The reason it is necessary to have them both is that signed division behaves differently from unsigned division. Take a finite string of bits (i.e., an unsigned char) holding the value -2 and try to unsigned-divide it by -1: the result is 0, since if we look at the numbers as unsigned, they are 0xfe and 0xff, and naively asking how many times 0xff is contained inside 0xfe gives 0. Now that’s a shame, because we would like to treat the division as signed. For that, the algorithm is a bit more complex. I am really not a Math guy, so I don’t wanna get into the dirty details of how signed division works; I will leave that algorithm for the BasicOps column of posts… Anyway, I can just say that if you have an unsigned division you can use it to do a signed division of the same operand size.

Some processors only have signed division instructions. So for doing an unsigned division, one might convert the operands to the next bigger size and then do the signed division. That way the high half of each operand is zero, which makes the signed division work as expected for unsigned inputs.

With x86, luckily, we don’t have to do any nifty tricks; we have them both straight away, DIV and IDIV, for our use. Unlike multiplication, when there is an overflow in division, a divide-error exception will be raised, whereas in multiplication only the CF and OF flags will be set. Whether we like it or not, this is the situation. Therefore it’s necessary to convert the numbers before doing the operation: sign extension or zero extension (depending on the signedness of the operands), and only then do the division operation.

What I really wanted to talk about is the way the overflow is detected by the processor. I am interested in that behavior since I am writing a simple x86 simulator as part of the diStorm3 project. So truly, my code is the “processor”, or should I say the virtual machine… Anyhow, the Intel documentation for the IDIV instruction shows some pseudo algorithm:

temp  = AX / src; // Signed division by 8bit
if (temp > 0x7F) or (temp < 0x80)
// If a positive result is greater than 7FH or a negative result is less than 80H
then #DE; // Divide error

src is a register/immediate or a memory indirection, which yields an 8-bit value that will be sign-extended to 16 bits and only then used to signed-divide AX. So far so good, nothing special.

Then comes some stupid-looking if statement, which at first glance says that if temp is above 0x7F or below 0x80 then bam, raise the exception. So you ask yourself what these special values have to do with overflowing.

Reading the next comment makes things clearer: since for an 8-bit input the division is done on 16 bits, and the result is stored in 8 bits holding a signed value, the result can vary from -128 to 127. Thus, if the result is positive and the value is above 127, there is an overflow, because the value would then be treated as a negative number, which is a no-no. And the same for negative results: if the result is negative and its value is below -128, there is an overflow, since that negative number cannot be represented in 8 bits as a signed number.

It is vital to understand that overflow means that a resulting value cannot be stored inside its destination because it’s too low or too big to be represented in that container. Don’t confuse it with carry [flag].

So how do we know if the result is positive or negative? If we look at temp as a byte, we can’t really know. But that’s why we have temp as 16 bits. The extra half of temp (the high byte) is really the hint for the sign of the whole value. If the high byte is 0xff, we know the result is negative; otherwise the result is positive. Well, I’m not being 100% accurate here, but let’s keep things simple for the sake of conversation. Anyway, it is enough to examine the most significant bit of temp to know its sign. So let’s take a look at the if statement again, now that we have more knowledge about the case.

if temp[15] == 0 and temp > 127 : raise overflow

Suddenly it makes sense, huh? Because we ensure the number is positive (doesn’t have the sign bit set) and the result is still higher than 127, and thus cannot be represented as a signed value in an 8-bit container.

Now, let’s examine its counterpart guard for negative numbers:

if temp[15] == 1 and temp < 128: raise overflow

OK, I tried to fool you here. We have a problem. Remember that temp is 16 bits long? It means that if, for example, the result of the division is -1 (0xffff), our condition is still true and will raise an overflow exception, even though the result is really valid (0xff represents -1 in 8 bits as well). The problem’s origin is in the signed comparison. By now, you should have understood that the first if statement, for a positive number, uses an unsigned comparison as well, although temp is a signed value.

We are left with one option, since we are forced to use unsigned comparisons (my virtual processor supports only unsigned comparisons): we have to convert the signed value -128 into a 16-bit unsigned value, which is 0xff80. As easy as that, just sign-extend it…

So taking that value and putting it in its place we get the following if statement:

if temp[15] == 1 and temp < 0xff80: raise exception

We know by now that temp is being compared to as an unsigned number. Therefore, if the result was a negative number (at or above 0x8000) and yet it was below 0xff80, then we cannot represent that value in an 8-bit signed container, and we have to raise the division-error exception.

Eventually we want to merge both if statements into one. Sparing you some basic boolean algebra (temp[15] == 0 is the same as temp < 0x8000 unsigned, and temp[15] == 1 is temp >= 0x8000, so the positive and negative overflow ranges join into one contiguous range), we end up with:

if (temp > 0x7f) && (temp < 0xff80):

    then raise exception…

X86 Assemblyyy

Monday, October 8th, 2007

Complex instructions are really useful, especially if you try to optimize the size of your code. Of course, modern processors nowadays are becoming more and more RISC’ish. But as for x86, its backward compatibility makes those instructions stay there (forever?), ready for you to use. The funny thing is that in modern x86 processors the RISC-like instructions are probably faster, so compilers don’t generate code with the CISC instructions. The thing is, when you’re size-optimizing your code, or writing a shellcode, you don’t care much about speed at all. So why not take advantage of those instructions?

The most popular x86 CISC instruction is LOOP. It’s a simple one as well: it decrements the general-purpose register CX (/ECX/RCX) by one and jumps to some address if it’s not zero. So you have something like 3 sub-instructions in one, or call it micro-ops: a decrement, an if statement (cmp) and a branch.

So speaking of LOOP, there are also LOOPZ and LOOPNZ. Those instructions, in addition to branching upon rCX not being zero, will also branch only if the Zero flag is set (or clear, respectively). Which means that you “earn” another condition test for free. For instance, if you wanted to run some test on each entry in an array and continue to the next entry only if the previous one was successful and there are still cells to scan, those instructions might be helpful.

I have never seen anyone use those instructions, not even in code crunching. I think it’s because most people just don’t read the specs, and even if they do, they don’t know how to use those instructions. Not that they are hard to use, but maybe a bit confusing or just not popular.

I found a somewhat useless combination: the repeat prefix with the LODS instruction. A REP LODSB means: read into AL the byte at address DS:rSI and advance rSI (by examining DF…). So you end up with code that puts into AL the last byte of the buffer that rSI was pointing to… (Of course it depends on the initial value of rCX.) I think that on the 8086 this repeat-and-lods combination was prohibited, so while I was working on diStorm, I made it so that if a LODS instruction is prefixed with a REP, the REP prefix is ignored. Then I got some angry email saying that today that’s not the case and this combo is supported… I even checked the current specs and it seems that guy was right. So honestly, I’m not sure it’s useful for anything… but it’s cool to note it.

Another instruction I wanted to talk about is SCAS. I guess you know this instruction in the strlen implementation as follows:

sub ecx, ecx
sub al, al
not ecx
cld
repne scasb
not ecx
dec ecx

Now, I’m not sure whether this is the fastest way to implement strlen; some compilers use this implementation and others have a find-a-zero-byte-inside-a-dword trick. Maybe I should talk about those tricks in another post someday…

Anyway, back to SCASB. Now that we saw how strlen is implemented, we know that with the REPNE prefix, which means continue as long as rCX is not zero and as long as the Zero flag is clear, we test two conditions in one instruction. In the code above the REPNE prefix tests ZF, but the truth is that the SCAS instruction updates all the other flags too. So think of the SCAS instruction as a compare instruction between the accumulator register (AL/AX/EAX/RAX) and the source memory… For example you can do SCAS and then JS (jump on sign)…

There are many other forsaken instructions, that are not fully used, so next time when you fire your assembler, take a look at the specs again, maybe you will find something better. Well, if you have more ideas of the like, you are welcome to send a comment.

Code Analysis #1 – Oh, Please Return

Sunday, September 30th, 2007

The first thing you do when you write a high-level disassembler (in contrast with diStorm, which is a flat stream disassembler) is start scanning from the entry point of the binary, assuming you have one. For instance, in .COM files it will be the very first byte in the binary file; for MZ or PE it’s a slightly more complicated story, yet easily achievable.

So what is this scanning really?

Well, as I have never taken a look at any high-level disassembler’s source code, my answer is from scratch only. The way I did it was to disassemble from the entry point, following the control flow (branches, such as jmp/jcc/loop, etc.) and recursively adding the new functions’ addresses (upon encountering the call instruction) to some list. This list will be processed until it is exhausted. So there’s some algo that will first insert the entry point into that list, then pop the first address and start analyzing it. Every time it stumbles upon a new function address, it will add it to that same list. And once it’s finished analyzing the current function (for example) by hitting the ret instruction, it will halt the inner loop, pop the next address off the list (if one exists) and continue again. The disassembled instructions/info will be stored in your favorite collection; in my case, it was a dictionary (address -> instruction’s info), which you can later walk easily and print, or do anything you wish with it.

The thing is, some functions (generated by compilers, for the sake of conversation) are not always terminated by the RET instruction. It might be an IRET, and then you immediately know it’s an ISR. But that’s a simple case. Some functions end with INT3. Even that is OK. When do things get uglier? When they end with a CALL to ExitProcess (for the Win32-minded), because then your stupid scanning algo can’t determine where the function ends: now it also has to ‘know’ the IAT and determine whether the called function was ExitProcess, ExitThread or whatever Exit API there exists. So before you’ve even made your first move in analyzing a binary, you have to make the scanner smarter. And that’s a bummer. My goal was to try and decide where a function starts (usually easy) and where a function ends. Parsing the PE and getting the IAT is no biggie, but it means that if you wanted to write a generic x86 disassembler, you’re screwed. You will have to write plugins or add-ons (whatever you name it) to extend the disassembler’s capabilities for different systems…

But even that’s OK, because the job is the same, although the project is now much bigger. And again, it all depends how accurate you wish to be. In my opinion, I try to be 99% accurate. With heuristics you cannot ask for 100%, right? :P

So tell me, you smart-aleck compiler engineers out there, why the heck do you generate function code in a way that it NEVER ends?

You all know the noreturn keyword or compiler extension, which states that the function doesn’t return. Yes, that’s good for functions where the (invisible) system takes control from that point, like ExitProcess, etc. I really never understood the reason a programmer would like to state such a behaviour for a function. So what? Now your generated code will be optimized? To omit the RET instruction? Wow, you rock! NOT.

To be honest, talking about ExitProcess is not the case, and to be more accurate I was talking about the Linux code:

00000da6 (03) 8b40 14                  MOV EAX, [EAX+0x14] 
00000da9 (05) a3 ec7347c0              MOV [0xc04773ec], EAX 
00000dae (01) c3                       RET 
00000daf (05) 68 dcb834c0              PUSH 0xc034b8dc 
00000db4 (05) e8 8b09d1ff              CALL 0xffffffffffd11744 
00000db9 (05) 68 c5a034c0              PUSH 0xc034a0c5 
00000dbe (05) e8 8109d1ff              CALL 0xffffffffffd11744 
00000dc3 (03) 0068 00                  ADD [EAX+0x0], CH 
00000dc6 (05) 0d 39c06888              OR EAX, 0x8868c039

This is some disassembled code that I got from a good friend, Saul Tamari, while he was researching some stuff in the Linux kernel. He noticed that the panic() function never returns, but this time, for real. The problem is that while flatly disassembling the stream, you go out of synchronization and start to disassemble real code at the wrong offset. You can see in the above snippet the second call, which a zero byte follows. That single byte is the end-of-function marker. How nice, huh? The next instruction, the PUSH (68 …), is now out of sync and is actually considered part of a new, different function.

So now tell me, how should you find this kind of noreturn function when you want to solve this puzzle with static analysis only? It is definitely not an easy question. We (Saul and I) had some ideas, but nothing 100% reliable. Performance was also an issue, which made things harder. And there are some crazy ideas, which I will cover next time.

Meanwhile, if you got any ideas, you’re more than welcome to write them here.

TinyPE Made The World a Safer Place, did it?

Saturday, August 25th, 2007

It’s pretty cool to see, a long while after I started that project, that many AVs now flag the concept of Tiny PE as a virus or a risky application. On the other hand, it’s not a virus, so why alert about it? What most people think of as the Tiny PE project, specifically the variant I started with, was to download a file from the Internet and execute it. It came out that the PE header was really fragile, and yet it worked on Windows, so most AVs and disassemblers didn’t even manage to parse it. That was only a side effect; later on, it was used with WebDAV to have the file downloaded directly by the Windows loader, using the name of a DLL as a URL(!), a real ownage.

So now I see that the link to the file my proof-of-concept code downloads is “censored” by some AVs. My code is really innocent; it merely opens a message box. But I guess you can imagine where it could end. Here’s the output of some AV:

http://ragestorm.net/tiny/_SANITIZED_    # void
Where the original file URL is: http://ragestorm.net/tiny/tiny3.exe

So it seems like it really made the world, or to be accurate the Internet, a safer place… although that wasn’t my real intention, because it all started as a small bet with a friend, and now see where it ended. Respect.

PS: to be really accurate when I say AV I mean malware scanning systems.

AMD SSE4a

Tuesday, August 21st, 2007

In the latest version of the programmer’s manual (for the AMD64 architecture) of July 2007, AMD released a new instruction set – SSE4a. In the beginning we (the YASM mailing list) weren’t sure whether this set is part of Intel’s SSE4.1 or SSE4.2, until a simple check of the documentation for CPUID shed some light and showed that there is a separate bit for indicating SSE4a rather than SSE4.1. So now we’ve got a new instruction set. What’s so special about it? Well, nothing in particular. It only has a few instructions: extrq, insertq, movntsd, movntss.

The ugly thing about these instructions is insertq, which takes four operands: two XMM operands and two byte-sized immediates. We have seen many instructions with 3 operands, so that’s nothing new. Although most of them are in the SSE sets, we have a few in the basic/integer set, such as SHLD/SHRD/IMUL… But four operands? And two of them immediates? Hmm, the ENTER instruction, for example, takes two immediates of different sizes; that’s the only one I can come up with quickly. Maybe a quick test with disOps can yield more, but it doesn’t really matter, I’m just trying to show this irregularity. So in diStorm what I did was to add a fourth operand to my extended instruction-information structure (the structure which holds the data that describes a full instruction). Makes you wonder where we are heading with all those new SSE sets and weird instructions; it gets harder to parse them every time.

I mean, come on, even in the internal engine of the AMD processor’s pipeline, the engineers must have hated adding support for a fourth operand. Or was it rather a quick hack? Who knows… But I am sure they have a generic engine and not an “execution” module of circuitry per instruction.

NTVDM #1

Sunday, July 15th, 2007

DOS is dead; and that’s a fact. But NTVDM is still a cool and handy tool. I guess most of us are not satisfied with the way it works… usually the sound doesn’t work, which is a good enough reason to try the great open-source projects which simulate DOS. Anyway, a few years ago, a friend of mine wrote some piece of code which writes to 0xb800 – remember that one? That’s the starting segment of the text-mode buffer. I was wondering how come you write to this address and something appears on the screen (of course, there are the attribute and the character) – mind you, it’s NTVDM we are talking about. But this wasn’t the interesting part – why do your writes to this buffer sometimes work and sometimes simply not? I decided to study the situation and see why it happens.

So here’s what I did:

mov ax, 0xb800
mov ds, ax          ; DS points at the text-mode buffer segment
xor bx, bx          ; offset 0 - top-left corner (BX isn't guaranteed zero in a .com)
mov ax, 0x0741      ; attribute 0x07 (grey on black), character 0x41 ('A')
mov [bx], ax
ret

Which prints a grey ‘A’ in the top-left corner, yippee. Now, if you open cmd and run the .com file of this binary you won’t see anything at all, which is unexpected, because you write to the ‘screen’, after all. My friend only knew that whenever he ran ‘debug’ before his program (whose important part I just showed above), the letter would be displayed. So I gave it a long thought… and then I tried the following addition to the above code (I put it before the original code):

mov ax, 3
int 0x10

This only sets the current video mode to text mode 3 (80x25, 16 colors)… And then, voila, the writing worked as expected. I then suspected that the VM monitors int 0x10 for function #0, set mode. But it seemed that any function would enable the writes… and I later confirmed that this is true.

So now that I knew how to trigger the magic, I simply searched for ‘cd 10’ (that’s int 0x10) in ‘debug’ and found a few occurrences, which explained my friend’s experience – that after running ‘debug’, writing to 0xb800 would work. Of course, if you run other programs which use int 0x10, you’re good to go as well.

But that was only one part of the mystery; I also wanted to understand how the writes really happen: whether the VM monitors all instructions and checks the final effective address to see if it’s in the buffer range, or maybe the memory is specially mapped with the Win32 API – because, after all, the NTVDM screen is a normal console window (not speaking of graphical modes now). Surprisingly, I found out that the solution was even simpler: a timer fires at short intervals and calls, among other things, a function that copies the 0xb800 buffer to the real console screen using some console APIs… And yes, your simulated writes really go to the virtual memory at the same address inside the real NTVDM.exe process. Maybe it has a filter or something, I assume, but I didn’t look for it, so I really don’t know.

Hot Patching (/Detouring)

Thursday, July 12th, 2007

Hot Patching is a nice feature which lets you apply a patch in memory so that it affects the running code immediately. This is useful whenever you can’t restart the system to do the on-disk patching – which mostly matters on servers, where downtime is not an option.

Well, speaking technically about Hot Patching: if you look at how code is generated in MS binaries, for instance, you can usually see 5 CC’s in a row before every function, and then the function begins with the infamous MOV EDI, EDI.

It looks something like this:

0005951e (01) 90     NOP
0005951f (01) 90     NOP
00059520 (01) 90     NOP
00059521 (01) 90     NOP
00059522 (01) 90     NOP
00059523 (02) 8bff   MOV EDI, EDI
00059525 (01) 55     PUSH EBP
00059526 (02) 8bec   MOV EBP, ESP

This is a real example, but this time it uses NOP’s instead of INT3’s… It doesn’t really matter – that padding code isn’t executed anyway.
First things first – why is the MOV EDI, EDI executed at all?
Before I answer that directly, I will just say that when you want to patch the function, you will make a detour: instead of patching a few bytes here and there, you will probably load a whole new copy of the patched, fixed function into a new region in memory. This is easier than patching specific spots… Then you want this new code to run instead of the old one. Now you have two options: patch all callers of this function, which is a crazy thing to do, or – the more popular way, where the trick comes in – use the MOV EDI, EDI as a pseudo NOP, executed on purpose every time the function runs. When the time comes and you apply the patch, you simply overwrite this instruction with a short JMP instruction, which also takes 2 bytes. That short jump goes 5 bytes backwards, precisely to the beginning of the padding before the patched function. So why 5 bytes of padding and not less or more? This is an easy one: with 5 bytes you can jump anywhere in a 32-bit address space. Thus, no matter where your new patched function lies in memory, you can reach it. The 5 padding bytes are patched to contain a long JMP instruction, whose offset is calculated once, as a relative offset.

Well, I actually didn’t answer the first question yet, but now that you have a better understanding of the mechanism, I can. The thing is that in the old times, the perfect patchers had to disassemble the beginning of the patched function in order to find where they could replace a few instructions with the 5-byte long JMP. The detour transfers control to you at the beginning of the original function, and when you are done, you run the overridden instructions – but as whole instructions(!) – and then continue executing the same function from the place where the overwrite ended.

Here’s an example: say, for the sake of conversation, the first instruction of the function takes 3 bytes and the second instruction takes 3 bytes too. If you put the long JMP instruction at the first byte of the function and then continue execution at offset 5 after you get control back, you will be out of synchronization and run incorrect code, because you are supposed to continue from offset 6… Eventually it will crash, probably with an access-violation exception.

So now, instead of all this headache, you know that you can safely change the first 2 bytes to a short JMP, and it will always work, no matter what.

Another important reason for this new way is thread safety: say the patched function runs in a few threads at the same time. Now suppose you patch the first 5 bytes while a different thread is about to resume at offset 3 (it already ran the first instruction and just continues normally, but now with changed code underneath) – then bam… you broke the instruction mid-stream…

The reason for using this specific MOV instruction is clear: since it’s a pseudo NOP, it doesn’t affect the CPU context (although it is not a real NOP) except for the program counter. And EDI was chosen, at my guess, because it makes the second byte of the instruction 0xFF when both operands are EDI, as in this case (8B FF). Yet there is no specific reason that I can come up with.

You can see that with just two memcpy calls, for that matter, you can detour a function successfully without any potential problems. Piece of cake. The problem is that not all binaries support this feature yet, so sometimes you still have to stick to the old methods and find a generic solution, like I did in ZERT’s patches… but that’s another story.