Code Analysis #1 – Oh, Please Return

September 30th, 2007

The first thing you do when you write a high-level disassembler (in contrast with diStorm, which is a flat stream disassembler) is to start scanning from the entry point of the binary. Assuming you have one; for instance, in .COM files it will be the very first byte in the binary file. For MZ or PE it's a slightly more complicated story, yet easily achievable.

So what is this scanning really?

Well, as I have never taken a look at any high-level disassembler's source code, my answer is from scratch only. The way I did it was to disassemble from the entry point, follow the control flow (branches such as jmp/jcc/loop, etc.) and recursively add new functions' addresses (upon encountering a call instruction) to some list. This list is processed until it is exhausted. So there's some algorithm that first inserts the entry point into that list, then pops the first address and starts analyzing it. Every time it stumbles upon a new function address, it adds it to that same list. And once it's finished analyzing the current function (for example, by hitting the ret instruction), it halts the inner loop, pops the next address off the list (if one exists) and continues again. The disassembled instructions/info are stored in your favorite collection; in my case it was a dictionary (address -> instruction's info), which you can later walk easily and print, or do anything you wish with.
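
To make that concrete, here is a minimal worklist sketch in C. The decoder is a toy stand-in (a fake two-byte encoding), not diStorm, and a real scanner would also queue the fall-through address of conditional branches; this only shows the traversal itself:

#include <stdio.h>

#define CODE_SIZE 0x10

typedef enum { I_OTHER, I_CALL, I_JMP, I_RET } kind_t;
typedef struct { kind_t kind; int len; int target; } insn_t;

/* Toy decoder: byte 0 is the 'opcode', byte 1 the branch target.
 * In a real tool this is where diStorm (or similar) kicks in. */
static insn_t decode(const unsigned char *code, int addr)
{
    insn_t d = { (kind_t)code[addr], 2, code[addr + 1] };
    if (d.kind == I_RET) d.len = 1;
    return d;
}

static void analyze(const unsigned char *code, int entry)
{
    unsigned char visited[CODE_SIZE] = { 0 };
    int worklist[CODE_SIZE], top = 0;
    worklist[top++] = entry;                  /* seed with the entry point */
    while (top > 0) {                         /* until the list is exhausted */
        int addr = worklist[--top];
        while (addr < CODE_SIZE && !visited[addr]) {
            visited[addr] = 1;
            insn_t d = decode(code, addr);
            printf("%04x: kind=%d\n", addr, d.kind);
            if (d.kind == I_RET) break;       /* function ends, pop the next one */
            if (d.kind == I_CALL) worklist[top++] = d.target; /* new function found */
            addr = (d.kind == I_JMP) ? d.target : addr + d.len;
        }
    }
}

int main(void)
{
    /* entry: CALL 6, RET; function at 6: JMP 9; at 9: RET */
    unsigned char code[CODE_SIZE] = { I_CALL, 6, I_RET, 0, 0, 0, I_JMP, 9, 0, I_RET };
    analyze(code, 0);
    return 0;
}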

The thing is that some functions (generated by compilers, for the sake of conversation) are not always terminated by the RET instruction. It might be an IRET, and then you immediately know it's an ISR. But that's a simple case. Some functions end with INT3. Even that is OK. When do things get uglier? When they end with a CALL to ExitProcess (for the Win32-minded), because then your stupid scanning algorithm can't determine where the function ends: now it also has to 'know' the IAT and determine whether the called function was ExitProcess, ExitThread or whatever Exit API there exists. So before you've even made your first move with analyzing binary code, you have to make the scanner smarter. And that's a bummer. My goal was to try and decide where a function starts (usually easy) and where a function ends. Parsing the PE and getting the IAT is no biggie, but it means that if you wanted to write a generic x86 disassembler, you're screwed. So you will have to write plugins or add-ons (whatever you name it) to extend the disassembler's capabilities for different systems…

But even that's OK, because the job is the same, although the project is now much bigger. And again, it all depends how accurate you wish to be. Personally, I try to be 99% accurate. With heuristics you cannot ask for 100%, right? :P

So tell me, you smart-aleck compiler engineers out there, why the heck do you generate function code in a way that it NEVER ends?

You all know the noreturn keyword or compiler extension, which states that the function doesn't return. Yes, that's good for functions where the (invisible) system takes control from that point on, like ExitProcess, etc. I never really understood why a programmer would want to declare such behavior for a function. So what? Now your generated code will be optimized? To omit the RET instruction? Wow, you rock! NOT.

To be honest, ExitProcess is not really the case here; to be more accurate, I was talking about this Linux code:

00000da6 (03) 8b40 14        MOV EAX, [EAX+0x14]
00000da9 (05) a3 ec7347c0    MOV [0xc04773ec], EAX
00000dae (01) c3             RET
00000daf (05) 68 dcb834c0    PUSH 0xc034b8dc
00000db4 (05) e8 8b09d1ff    CALL 0xffffffffffd11744
00000db9 (05) 68 c5a034c0    PUSH 0xc034a0c5
00000dbe (05) e8 8109d1ff    CALL 0xffffffffffd11744
00000dc3 (03) 0068 00        ADD [EAX+0x0], CH
00000dc6 (05) 0d 39c06888    OR EAX, 0x8868c039

This is some disassembled code that I got from a good friend, Saul Tamari, while he was researching some stuff in the Linux kernel. He noticed that the panic() function never returns, but this time, for real. So the problem is that while flatly disassembling the stream, you go out of synchronization and start disassembling real code at the wrong offset. You can see in the above snippet the second call, which is followed by a zero byte. That single byte is the end-of-function marker. How nice, huh? The next instruction, PUSH (68 00 …), is now out of sync and is actually considered part of a new, different function.
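
The obvious baseline mitigation (not the crazy ideas yet) is a table of call targets known not to return, panic(), the Exit APIs resolved through the IAT, and so on, which tells the scanner to stop decoding past such a call. A sketch; the table and names here are mine, not diStorm's:

/* Targets known not to return; in practice this would be filled from
 * the IAT (ExitProcess & co.) or from kernel symbols (panic). */
static const unsigned long noreturn_targets[] = { 0xc0d11744UL /* hypothetical panic address */ };

static int is_noreturn_call(unsigned long target)
{
    unsigned int i;
    for (i = 0; i < sizeof(noreturn_targets) / sizeof(noreturn_targets[0]); i++)
        if (noreturn_targets[i] == target)
            return 1; /* don't decode past this call - there is no fall-through */
    return 0;
}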

So now tell me, how should you find this kind of noreturn function when you want to solve this puzzle with static analysis only? It is definitely not an easy question. We (Saul and I) had some ideas, but nothing 100% reliable. Performance was also an issue, which made things harder. And there are some crazy ideas, which I will cover next time.

Meanwhile, if you've got any ideas, you're more than welcome to write them here.

Some Rants

September 23rd, 2007

Aaargh, the most annoying thing is to visit your own site and see that it is down. And then you check on register.com what's wrong and you see that the domain has expired… Afterwards, you check your inbox for the invoices from the hosting company and see that you were charged last month for the domain registration renewal. Next thing, you find yourself writing an email from some anonymous address (because mine at ragestorm wasn't working at the time), trying to sound polite, and removing the swearwords on the second pass before sending it.

Damn, there are some things that piss me off. Like the fact that I really want to write in this blog more frequently. There are many excuses for not doing so; eventually I suck, and everything else I could say won't make it any better. So I really should try harder. And yet, sometimes I have weird ideas to write about, and I'm not sure that my audience follows my posts, so why write them then? They are too low-level, technical, or some might say boring. But hey, it IS the insanely low level blog, no? So I made up my mind that I will write about just about anything (computer related) that I have on my mind. So you should expect some weird stuff. Usually I'm inspired by ideas from my daily work at my company, and sometimes by the stuff I do in my free time. Thing is, at work – please don't laugh at me – I do SQL and .NET stuff. OK, you can really laugh now, I deserve it. ;) But that's only temporary ("that's what they all say", haha).

So SQL or not, believe me, you can get your hands dirty with some of the stuff there, like transactions, where you suddenly realize that there might be deadlocks because of your queries. And you start thinking of your code as if it were some multi-threaded application whose synchronization you implement on your own. It really gave me the impression that most people who write SQL belong to one of three types: 1. They don't know shit, and therefore their queries tend not to work well or are inefficient. 2. They do know something (I wouldn't call it a programming language, with all due respect to the L in SQL) and manage to get their stuff to work. 3. People who really know the internals and algorithms of SQL, understand how things work together, and write something good.

What I'm saying is that even in SQL there might be some decent 'coders'. But how many – I can't tell. A few, prolly. With all the classification I just did, you still gain experience only by sitting down and trying on your own. But that's true of everything, I guess.

And about .NET, it's really awesome. I like the way that everything is already there, ready for use; you don't have to waste time writing your own collection/container algorithms (like in C, for example). That you can speak Sockets and COM with the same ease. That security is part of the system. Now I really wanna start a flame about Java. But noooo. I just think C# is much better and more permissive; they took all of Java's advantages, fixed the broken stuff and created a whole new, better language. Genericness? Well, less so, and yet it is VMed… so screw it.

Happy new year & Hatima Tova

Say No to Redundant Checks

August 29th, 2007

I had a long argument with a friend about some check that I don't do in diStorm. He only said that apparently I don't do that check (an if statement to test some value). But if it weren't important enough, he wouldn't have mentioned it in the first place, right? :) And I told him that he is right – I don't, because it is absolutely not necessary. Soon I will let you judge for yourself. Eventually we got nowhere with that argument; probably both sides were mad (at least I was pissed off) and the code stayed the same.

Now let me show you an example where I will ask you whether you do an (extra?) test of the result or not. Since I'm a Win32 guy, here we go:

char buf[256];
int length = GetWindowText(hWnd, buf, 256);
if (length == 0) return; // No string for some reason, though I don't care why.

memcpy(DST, buf, length + 1); // Include the null-termination char.

Well, that's useless code, but still good enough for the sake of conversation. So I get the caption of some window and copy it into another destination buffer. Did you notice that I did not check the length before copying to DST?! (And yes, DST is the same size as buf.) Oh no!! Am I doomed now? Of course not; it is guaranteed that the length will be less than 256, because otherwise the API is broken. Now we could start another flame war about whose fault that is, but spare me that, please. So if the API is broken, it is probably a bug, and it will break other applications which use it. But that's not the point. Say the API is fine: why don't I check the length before I memcpy? Am I doing anything wrong? I will say it shortly: NO. My code is just fine (as long as DST's size is at least buf's size, of course). So his accusation was that I should be a code-monkey freak and consistent. And then he said something like, "And what if in the future the API changes?" Well, bummer, all the API's other users are now broken as well. If I gave in to that question, I would have to pad all my code with extra unnecessary IFs. No thanks. Besides, if I can't trust my APIs, what should I trust? They are the basis of every application.

Now, in diStorm the case is almost the same. There is some internal function that you call with lots of parameters; one of them is an output parameter that receives the number of entries written to a buffer you supply, along with the maximum size of that buffer. So I examine the return code to see whether I can use the data in the buffer or not, and then I know that the number of entries written to my buffer cannot exceed the maximum size I supplied. So why the heck should I check that writtenEntriesCount < MaximumEntries? Tell me, please!

The only rationale I can see is to make an assertion (and only in debug mode) that everything is correct. But that's not a must, and not doing it doesn't give you grounds to say I am a bad coder. The thing is that both the internal function and its caller were written by me, and both are internal only (not exported in any way), so it's enough that one of them (preferably the callee) guarantees that the written entries don't exceed the maximum size. Everyone is happy. If that weren't the case, I would have had a crash (buffer overflow, access violation), just something nasty that would hint at the buggy function. Now we can get into the philosophical issue of whether the callee or the caller should make that test, or both. But as I see it, it's better put inside the callee, since it must honor the interface it supplies. And I always think ahead and ask myself, "…and what if someone calls that function and forgets (or doesn't know) to add the extra check?"
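
In code, the debug-only check would look something like this. A minimal sketch; the decoder here is a made-up stand-in, not the actual diStorm internal interface:

#include <assert.h>
#include <stdio.h>

#define MAX_ENTRIES 32

/* Stand-in for the internal function: it promises to never write more
 * than maxEntries entries, reporting the count via writtenCount. */
static int decode_internal(unsigned int *out, unsigned int maxEntries,
                           unsigned int *writtenCount)
{
    unsigned int i, n = maxEntries < 5 ? maxEntries : 5;
    for (i = 0; i < n; i++) out[i] = i;
    *writtenCount = n;
    return 0; /* success */
}

int main(void)
{
    unsigned int buf[MAX_ENTRIES], written = 0;
    if (decode_internal(buf, MAX_ENTRIES, &written) != 0) return 1;
    assert(written <= MAX_ENTRIES); /* the callee's contract, checked in debug builds only */
    printf("%u entries written\n", written);
    return 0;
}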

Don't get me wrong, assertions are really useful devices; I really suggest you use them. It just happened that diStorm doesn't have them. Maybe I should add them. And yet, it's not a must.

TinyPE Made The World a Safer Place, Did It?

August 25th, 2007

It's pretty cool to see, a long while after I started that project, that many AVs now flag the concept of Tiny PE as a virus or a risky application. On the other hand, it's not a virus, so why alert about it? Most people associate the Tiny PE project with what I started – downloading a file from the Internet and executing it. It turned out that the PE header was really fragile, and yet it worked on Windows, so most AVs and disassemblers didn't even manage to parse it. That was only a side effect; later on it was used with WebDAV to have the Windows Loader download the file directly, using the name of a DLL as a URL(!), a real ownage.

So now I see that the link to the file my proof-of-concept code downloads is "censored" by some AVs. My code is really innocent; it merely opens a message box. But I guess you can imagine where it could end. Here's the output of some AV:

http://ragestorm.net/tiny/_SANITIZED_    # void
Where the original file URL is: http://ragestorm.net/tiny/tiny3.exe

So it seems like it really made the world, or to be accurate the Internet, a safer place… although that wasn't my real intention. It all started as a small bet with a friend, and now see where it ended. Respect.

PS: to be really accurate, when I say AV I mean malware-scanning systems.

Lib hell (or DLL hell)

August 21st, 2007

In the last few months I have gotten many emails regarding diStorm compilation, usually for different compilers and platforms. So for the past week I've been working on making diStorm compilable with all those compilers. Some of them are for DOS (OpenWatcom/DJGPP/DMC), some are for Linux (mostly GCC) in its varying distributions, and the rest are for Windows (MSVS, MinGW, TCC, etc…).

The problem begins with the one and only function of the diStorm library, called distorm_decode. This function takes a lot of parameters; the first one is the offset of the code as if it were loaded in memory (AKA the virtual address). For example, you would feed it 0x100 for .COM binary files (that is, DOS). This very parameter's type varies according to your compiler and environment: it might be a 32-bit integer or, preferably, a 64-bit integer (unsigned long long/__uint64). If you have a compiler that can generate code for 64-bit integers, you would prefer that over 32-bit code, so that when you disassemble 64-bit code you can set any 64-bit offset you like. Otherwise, you will be limited to 32-bit offsets.

So diStorm has a special config.h file, where you configure the relevant macros so your compiler will be able to compile diStorm. config.h is eventually there for all portability macros. There you also set whether you would like to use 64-bit offsets or not.
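
The offset-type selection in config.h boils down to something like this (a sketch; the type name here is illustrative, check your diStorm version for the real one):

#ifdef SUPPORT64
typedef unsigned long long _OffsetType; /* 64-bit offsets for 64-bit code */
#else
typedef unsigned long _OffsetType;      /* limited to 32-bit offsets */
#endif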

Now the distorm_decode function uses the integer type that depends on your decision (which will probably be derived from your compiler's support for 64-bit integers). To simplify the matter, the diStorm interface is a function with the following declaration:

#ifdef SUPPORT64
void distorm_decode(unsigned long long offset);
#else
void distorm_decode(unsigned long offset);
#endif

Now, this function is actually exported by the library, so a caller project can use it. And as you can see, the declaration of this function may change. What I did until now was to have this macro (SUPPORT64) defined twice: once for the library itself when it's compiled, and once for the caller project, so when it uses the library it knows the size of the integers it should use. So far, I haven't gotten any problems with it. But since many compilers (say, for DOS) don't support 64-bit integers, it is a pain in the ass to keep these two different files in sync. I couldn't come up with a single header file for the sake of the caller project, since it might get only a compiled version, and then what? Trial and error until it finds the correct size of the integer? I can't allow things to be so loose.

Things could get really nasty if the library was compiled with 32-bit integers and the caller project uses it with 64-bit integers. The stack is unbalanced, and it will probably crash somewhere when trying to use some pointer… sooner or later you will notice it. And then you have to know to comment out the macro in the caller project.

After I got this complaint from one of the diStorm users, I decided to try to put an end to this lib hell. I tried to think of different ways. It would be best if I had reflection in C, or a scripting language where I could determine the size of the integer at runtime and know how to invoke distorm_decode. Unfortunately, this is not the case. I started thinking of COM, or SxS, and other not-really-helpful ways of solving DLL hell. I sensed it was the wrong way. It is a library after all, and I don't have the luxury of running code. I mean, it could easily be solved with running code, asking the library (or rather the DLL, actually) "what interface should I use?"… reminds me of QueryInterface somewhat. But alas, I can't do that either, since it's compile time we're talking about. So all I've got is this lousy idea:

The exported function name will be determined according to the size of the integer it was compiled with. It would look like this:

#ifdef SUPPORT64
void distorm_decode64(unsigned long long offset);
#define distorm_decode distorm_decode64
#else
void distorm_decode32(unsigned long offset);
#define distorm_decode distorm_decode32
#endif
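
On the caller's side nothing changes in the source (a sketch, using the simplified declaration from above):

#include "distorm.h" // Brings in the distorm_decode macro substitution.

int main(void)
{
    // If the caller's SUPPORT64 setting disagrees with the library's,
    // this line references a symbol that doesn't exist in the lib,
    // so linking fails instead of the stack getting corrupted at runtime.
    distorm_decode(0x100);
    return 0;
}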

But we are still stuck with the SUPPORT64 macro for the caller project; I think there is no way around it, and we will have to stick with it to the bitter end…

So what did we earn from this change? Well, pretty much a lot; let's see:

1) The actual code of existing projects doesn't need to change at all; the macro substitution does the work automatically.

2) The application can't crash anymore: if you got the wrong integer size, you simply won't find the exported symbol.

3) It is now a link-time error rather than a runtime crash/error. So before the caller project even fully links, you will know to change the SUPPORT64 macro (whether to add it or drop it).

4) The code as it was before could lead you to a crash, since you could manually enable/disable the macro.

Well, I still feel it's not the best idea out there, but hey, I really don't want to write code in the caller project that determines the interface at runtime and only then knows which function to use… it's simply ugly and not right, because you can do it earlier and cleaner.

I would like to hear how you would do it. I am still not committing my code… so hurry up :)

AMD SSE4a

August 21st, 2007

In the latest version of the programmer's manual (for the AMD64 architecture) of July 2007, AMD released a new instruction set – SSE4a. In the beginning we (the YASM mailing list) weren't sure whether this set is part of Intel's SSE4.1 or SSE4.2, until a simple check of the CPUID documentation shed some light and showed that there is a separate bit indicating the instruction set is SSE4a rather than SSE4.1. So now we've got a new instruction set. What's so special about it? Well, nothing in particular. It only has a few instructions: extrq, insertq, movntsd, movntss.

The ugly thing about these instructions is insertq, which takes four operands: two XMM operands and two byte-sized immediates. We have seen many instructions with 3 operands, so that's nothing new. Although most of them are in the SSE sets, we've got a few in the basic/integer set, such as SHLD/SHRD/IMUL… But four operands? And two of them immediates? Hmm, the ENTER instruction, for example, takes two immediates of different sizes; that's the only one I can come up with quickly. Maybe a quick test with disOps could yield more, but it doesn't really matter, I'm just trying to show this irregularity. So what I did in diStorm was to add a fourth operand to my extended instruction-information structure (the structure which holds the data that describes a full instruction). I wonder where we are heading with all those new SSE sets and weird instructions. It gets harder to parse them every time.
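
Just to illustrate the kind of change involved (this is a mock-up, not diStorm's real structure): the instruction-information record simply grows a fourth operand slot, which only insertq fills completely:

typedef enum { OT_NONE, OT_XMM, OT_IMM8 } operand_type;

typedef struct {
    const char *mnemonic;
    operand_type ops[4]; /* used to be 3 slots */
} insn_info;

/* INSERTQ xmm1, xmm2, imm8, imm8 - the only user of all four slots. */
static const insn_info insertq_info = {
    "INSERTQ", { OT_XMM, OT_XMM, OT_IMM8, OT_IMM8 }
};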

I mean – come on, even in the internal engine of the AMD processor's pipeline, the engineers must have hated adding support for a fourth operand. Or was it rather a quick hack? Who knows… But I am sure they have a generic engine, and not an "execution" module of circuitry per instruction.

Basic Ops #3 – Sub

August 21st, 2007

As promised, though after a long delay, here is the algorithm for subtraction. Just notice how similar it is to the addition algorithm. Also, try to figure out why we use the bitwise NOT operator – it turns the carry into a borrow.

unsigned int sub(unsigned int a, unsigned int b)
{
   a ^= b;         // One-bit subtraction; the borrows still need fixing.
   b &= ~(a ^ b);  // a ^ b restores the original a; the NOT makes this a borrow-mask (bits where a was 0 and b was 1).
   while (b) {     // As long as there is a borrow to apply to the result...
      b <<= 1;     // Propagate the borrow to the next higher bit.
      a ^= b;      // Apply it to the result.
      b &= a;      // Get the next borrow-mask (the mirror of "b &= ~a" in add).
   }
   return a;
}
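
A quick 3-bit trace of 5-3 (my own working, for illustration) shows the borrow rippling:

a = 101, b = 011
a ^= b         -> a = 110
b &= ~(a ^ b)  -> b = 010   (borrow mask: bits where a was 0 and b was 1)
iteration 1: b <<= 1 -> 100 ; a ^= b -> 010 ; b &= a -> 000
result: a = 010 = 2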

Basic Ops #2 – Add/Sub

July 25th, 2007

After we started with the really easy implementation of multiplication's algorithm (right here), I will now proceed to add and sub, which are a bit more complicated. But still, nothing like the dreaded long-division algorithm. :)

Anyway, the interesting part about add and sub is that you will certainly use the XOR operator. If you think about it, XOR can actually calculate both add and sub, only without fixing the carry or borrow; that you have to do yourself. But it's still good enough for one-bit operations. For example: 1-1 = 0 and 1^1 = 0. Or another example: 1+0 = 1 and 1^0 = 1. So this is all nice, but what happens when you have to borrow, or to calculate the carry? If you operate on 1-bit variables and do 1+1, you get 0 and a carry should be propagated to the next bit. Even in this case we can still use the XOR operator, because 1^1 yields a zero, which is what we expect for the result bit. So now comes the big question – how do we calculate the carry (for now I will stick to carry, and later I will talk about borrow)? The carry has to propagate from the least significant bit to the most significant one. In my opinion this is the most fascinating device in this algorithm, because the rest is the same for both add and sub.

So to answer that, let the input be two integers, a and b. Calculating the XOR between them is a good start, and can be done just that easily. But now we would like to fix up the carry and apply it to the other bits. If we AND (bitwise) the original a and b, we get all the parallel bits that match. For example, adding 2+6 results in:

  010
& 110
-----
  010

This ANDed result is the mask of the carry we begin with. And I use the word "begin" since the carry has to propagate to the higher bits, and therefore we will need to iterate a specific operation that fixes the carry on the rest of the bits. The reason we AND a and b is that the resulting bits are surely those which cause a carry… Once we have applied the current carry, we have to calculate the next carry. As I said earlier, we have to iterate this operation; therefore there must be some condition which stops the iteration, otherwise it won't make any sense and will get stuck forever. Do you have any idea what this condition is? It's not the number of bits of a (aka sizeof a), nor of b (which we presume are the same, of course). We run as long as there is a carry to apply to the constructed result of the addition.

By now we know almost everything; there's only one thing left to be done, and that's the actual calculation of the next carry. This is the trickiest thing in the whole algorithm. We have the mask of the carry from the a&b operation. Every time we apply this mask by XORing it into the result, we have to remove (mask off) the used bits and continue until the carry-mask is zero. It is therefore understandable that some bits should "go home", but which bits do we remove? We have to remove the bits that were already applied as a carry (since the carry changes after every bit). The real trick is that we calculate the carry for all bits at once, since we can't and don't want to calculate the carry for each bit separately; that would be less optimized, and not a challenge at all. But even then, it's not possible to apply the carry only once, since it 'carries' to the next bit by 'nature'.

Well, usually I think that when code is shown, no words need to be spoken, but in the case of this algorithm some text must be written – and that was the above… So now that you roughly know what's going on, the code will help you fully understand the algorithm. I never claimed to explain this algorithm well; it's a really hard one, honestly. Sorry.

I will just remind you that XOR is the key to the add and sub operations. So if you take a look at the truth table of XOR, you will see that it is exactly what we need, as I tried to explain in the beginning. XOR alone is not enough to calculate the result of the wanted operation, since we have to apply the carry every time. And the carry is applied using, nothing else but, XOR again… you got that right.

unsigned int add(unsigned int a, unsigned int b)
{
   a ^= b;      // Do a one-bit addition; now we have to calculate the carry and apply it if required.
   b &= a ^ b;  // b will now contain the carry-mask; note that they are XORed again to eliminate a temporary value.
   while (b) {  // As long as there is a carry to apply to the result...
      b <<= 1;  // We've got to shift left, so the carry propagates to higher bits, right?
      a ^= b;   // Apply the new carry to the result.
      b &= ~a;  // Get the next carry-mask by removing bits that were already taken into account and fixed.
   }
   return a;
}

If you really want to master this algorithm, start applying it to small numbers of 3 bits. You will see that the next carry is wisely calculated when you begin with a carry mask that has a few bits set, for example: 7+5…
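
Here's that 7+5 trace written out (my own working, 4 bits shown):

a = 0111, b = 0101
a ^= b       -> a = 0010
b &= a ^ b   -> b = 0101   (the initial carry-mask, i.e. 7 AND 5)
iteration 1: b <<= 1 -> 1010 ; a ^= b -> 1000 ; b &= ~a -> 0010
iteration 2: b <<= 1 -> 0100 ; a ^= b -> 1100 ; b &= ~a -> 0000
result: a = 1100 = 12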

I will leave the implementation of sub until next week, so you can try to come up with it on your own. I can only say that it is really similar to the addition algorithm. Think of how you would change the carry into a borrow, and how that actually affects the mask.

NTVDM #1

July 15th, 2007

DOS is dead, and that's a fact. But NTVDM is still a cool and handy tool. I guess that most of us are not satisfied with the way it works… usually the sound doesn't work, which is reason enough to try the great open-source projects which emulate DOS. Anyway, a few years ago a friend of mine wrote some piece of code which writes to 0xb800 – remember that one? That's the text-mode buffer's starting segment. I was wondering how come you write to this address and something appears on the screen (with the attribute and the character, of course) – mind you, it's NTVDM we are talking about. But this wasn't the interesting part: why do your writes to this buffer sometimes work and sometimes simply not? I decided to study the situation and see why it happens.

So here’s what I did:

xor bx, bx        ; Point BX at the top-left cell (don't rely on its initial value).
mov ax, 0xb800
mov ds, ax        ; DS now covers the text-mode buffer.
mov ax, 0x0741    ; 0x41 = 'A', 0x07 = grey-on-black attribute.
mov [bx], ax
ret

This prints a grey 'A' in the top-left corner. Yeppi. Now, if you open cmd and run the .com file of this binary, you won't see anything at all, which is unexpected because you write to the 'screen', after all. Now, my friend only knew that whenever he ran 'debug' before his program (whose important part I just showed above), the letter 'A' would be displayed. So I gave it a long thought… After that I tried the following addition to the above code (I put it before the original code):

mov ax, 3    ; AH=0: set video mode, AL=3: 80x25 16-color text.
int 0x10

This only sets the current video mode to text mode 80x25x16… And then, voila, the writing worked as expected. I suspected that the VM monitors int 0x10 for function #0, set mode. But it seemed that any function enables the writes… and I later confirmed that this is true.

So now that I knew how to trigger the magic, I simply searched for 'cd 10' (that's int 0x10) in 'debug' and found a few occurrences, which explained my friend's experience – that after running 'debug', writing to 0xb800 would work. Of course, if you ran other programs which used int 0x10, you're good to go as well.

But that was only one part of the mystery; I also wanted to understand how the writes really happen – whether the VM monitors all instructions and checks the final effective address to see if it's in the buffer range, or maybe the memory is specially mapped with Win32 APIs. Because, after all, the NTVDM screen is a normal console window (not speaking of graphical modes now). Surprisingly, I found out that the solution was even simpler: a timer fires every short interval, which calls, among other things, a function that copies the 0xb800 buffer to the real console screen using some console APIs… And yes, your simulated writes really go to the virtual memory at the same address inside the real NTVDM.exe process. Maybe it has a filter or something, I assume, but I didn't look for it, so I really don't know.
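
If you want a feel for what that periodic copy might look like, here is a rough sketch (my guess at the mechanism, not NTVDM's actual code): translate the (character, attribute) pairs of the simulated B800 page into CHAR_INFOs and write them to the real console:

#include <windows.h>

// Called on every timer tick; b800 points at the simulated 80x25 text page.
static void flush_text_page(HANDLE console, const unsigned char *b800)
{
    CHAR_INFO cells[80 * 25];
    for (int i = 0; i < 80 * 25; i++) {
        cells[i].Char.AsciiChar = b800[i * 2];     // character byte
        cells[i].Attributes     = b800[i * 2 + 1]; // attribute byte
    }
    COORD size = { 80, 25 };
    COORD origin = { 0, 0 };
    SMALL_RECT region = { 0, 0, 79, 24 };
    WriteConsoleOutputA(console, cells, size, origin, &region);
}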

Hot Patching (/Detouring)

July 12th, 2007

Hot patching is a nice feature which lets you apply a patch in-memory so it affects the running code immediately. This is good whenever you can't restart your system to do the on-disk patching; there are times when you can't afford to restart your computer, probably only on servers…

Speaking technically about hot patching: if you happen to look at how code is generated in MS binaries, for instance, you can always see the 5 CC's in a row before every function, and then the function begins with the infamous MOV EDI, EDI.

It looks something like this:

0005951e (01) 90      NOP
0005951f (01) 90      NOP
00059520 (01) 90      NOP
00059521 (01) 90      NOP
00059522 (01) 90      NOP
00059523 (02) 8bff    MOV EDI, EDI
00059525 (01) 55      PUSH EBP
00059526 (02) 8bec    MOV EBP, ESP

This is a real example, but this time it uses NOPs instead of INT3s… It doesn't really matter; that padding code isn't actually executed.
First things first – so why is the MOV EDI, EDI really executed?
Before I answer that question directly, I will just say that when you want to patch a function, you make a detour. Instead of patching a few bytes here and there, you will probably load a whole new copy of the patched and fixed function into a new region in memory; this is easier than patching specific spots… And then you want this new code to run instead of the old one. Now you have two options: patch all the callers of this function, which is a crazy thing to do, or the more popular way, where the trick comes in – the MOV EDI, EDI is used as a pseudo-NOP, executed on purpose every time the function runs. When the time comes and you apply the patch, you simply override this instruction with a short JMP instruction, which takes 2 bytes as well. That short jump jumps 5 bytes backwards, precisely to the beginning of the padding before the patched function. So why 5 bytes of padding and not less or more? This is an easy one: with 5 bytes you can jump anywhere in a 32-bit address space. Thus, no matter where your new patched function lies in memory, you can jump to it. The 5 padding bytes are patched to contain a long JMP instruction, whose offset is calculated once, as a relative offset.
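
To make the byte juggling concrete, here is a sketch of the layout (the rel32 bytes, shown as ??, are whatever displacement reaches your detour function; I'm filling in the encodings myself):

; Before patching:
90 90 90 90 90   ; the 5-byte pad (NOPs or INT3s)
8b ff            ; MOV EDI, EDI      <- function entry point
55               ; PUSH EBP
8b ec            ; MOV EBP, ESP

; After patching:
e9 ?? ?? ?? ??   ; JMP rel32 to the new, patched function (fills the pad)
eb f9            ; JMP SHORT -7      <- was MOV EDI, EDI; lands on the E9
55               ; PUSH EBP          (still intact; the detour can return here)
8b ec            ; MOV EBP, ESP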

Well, actually I didn't really answer the first question yet, but now that you have a better understanding of the mechanism, I can. The thing is, in the old times patchers had to disassemble the beginning of the patched function in order to see where they could replace a few instructions with the 5-byte long JMP. So control is transferred to you at the beginning of the original function, and when you are done, you run the overridden instructions – but as whole instructions(!) – and then continue executing that same function from the place where the override ended.

Here's an example: say the first instruction takes 3 bytes, and the second instruction takes 3 bytes too. Now, if you put the long JMP instruction at the first byte of the function and then continue execution after your hook at offset 5, you will be out of synchronization and run incorrect code, because you are supposed to continue execution from offset 6… Eventually it will crash, probably with an access-violation exception.

So now, instead of having all this headache, you know that you can safely change the first 2 bytes to a short JMP, and it will always work no matter what.

Another crazy reason for this new way: say the patched function can run in a few threads at the same time. Now suppose you patched the first 5 bytes, and then a different thread starts running at offset 3 (because it already ran the first instruction; it just continues normally, but with changed code underneath) – then bam… you broke the instruction…

The reason for using this specific MOV instruction is understandable: since it's a pseudo-NOP, it doesn't really affect the CPU context (although it is not a real NOP), only the program counter. And EDI was chosen, my guess is, because it makes the second byte of the instruction 0xFF when both operands are EDI, as in this case. And yet there is no specific reason that I can come up with.

You can see that with two memcpy's, for that matter, you can detour a function successfully without any potential problems. Piece of cake. The problem is that not all binaries support this feature yet, so sometimes you still have to stick to the old methods and find a generic solution, like I did in ZERT's patches… but that's another story.