NULL, Vulnerabilities and Fuzzing

December 31st, 2008

I remember seeing Ilja at BH07. We talked about Kernel attacks, aka privilege escalation. He told me, also, back then, that he found some holes that he managed to execute code through. I think the platform of target was Windows, although Ilja is specializing in Unix. Back in ‘05 already he had a talk about Unix Kernel Auditing. Nothing new probably there, at least for the time being. However, the new approach of fuzzing the kernel, the system calls to be accurate, was pretty new. But feel free to correct me if I’m wrong about it. And it seems Ilja managed to find some holes using fuzzing. (BTW, a much more interesting paper from him about Unusual Bugs.)

Personally, I don’t believe in fuzzing. Usually the holes I find - there is no way a fuzzer will find. Although, I do believe that you need to mix tools/knowledge in order to find holes and audit a software in a better way. It is enough that there is a simple validation of some parameter you pass to a specific potential-hole’y function and all your test can be thrown away because of that validation, though, there is still a weakness in that function, you won’t get to it. Then you say “Ah Uh”, and you think that you can refine the randomness of the parameters you pass to that function and hopefully prevail. Well, it might work, it might not. As I said, I’m not a big fan of fuzzing.

Although, it might be cool to have a tool that analyzes the code of a function and builds the parameters in a special way to make a code coverage of 100% on that function, which is not fuzzing anymore and means: you walk all paths of execution and the chances to find a weakness are so much greater. Writing such a tool is crazyness, and yet possible, if you ask me.

Fuzzing or not, there are still weaknesses in Win32k, which supposed to be one of the most “secured”/audited components in the Kernel. Probably because many researches had their work on it as well. And that’s simply sad.

Speaking about Ilja’s fuzzing of kernel and stuff, and thinking we are cool to find weaknesses nowadays, Mark Russinovich wrote NTCrash back in ‘96 for god sake and, it was a Fuzzer(!), but back then nobody called it or knew about fuzzers. And NTCrash as simple as it is, found some weaknesses in kernel system calls of NT4 ;) Respect (though today it won’t even scratch the kernel, so we might think ourselves cool for still finding stuff :) ).

A friend and I are trying to audit another application, and my friend found some NULL dereference which crashes that software. So we fired up Olly and tried to see what’s going on. It seems that some interface is queried and returns a successful code value and at the same time we get NULL for that interface, which means something is really f*cked up there. Thing is, as you probably can imagine for yourself, we want to execute code out of it. But odds seem to be against us at this time, since we can’t control that NULL or anything about it.

I then wanted to see what people have done with NULL before, how to exploit it better. And usually 99% of the applications running out there don’t have page 0 mapped to their address space. But CSRSS and NTVDM for instance, do have it mapped, but who cares now…? It doesn’t help our cause. Besides, you probably can’t control that page 0 and its data anyway. So I encountered that Flash Exploitation. To be honest, I didn’t read all of the white paper about the exploitation, I only looked for how the arbitary data write worked. And it seems that some CALLOC had failed to allocate memory because of an integer overflow weakness and from there you got a NULL pointer to begin with. But Flash didn’t access that pointer immediately - it had some pointer arithmetic added to it. And you guessed it right, you can control some offset before the pointer is really accessed, thus you can write (almost) anywhere you want. Now I really don’t underestimate the exploitation, from the bits I read it is a crazy and very beautiful exploitation. But to say that it is a new technique and a new class of exploitation is one thing that I really don’t agree to. You know what, looking at it in a different light - it was probably not leading to a code execution if that CALLOC not returned NULL, because then you won’t know where you are on the heap and you couldn’t really write to anywhere you knew accurately. And besides, the NULL wasn’t dererferenced directly and an offset was added to it (no matter what the calculation was for the sake of conversation), so therefore I don’t see it so exciting if you ask me (again, not the exploitation but the “new class of exploitation”). Still you should check it out :)

So, as I saw that no one did anything really useful with a real NULL dereference, it seems that the weakness he found is only a DoS, but maybe we can control something there, yet to be researched…

String Initialization is Tricksy

December 29th, 2008

A friend of mine had to hand in an assignment for Computer Science in university. As I understood, it was a relatively easy assignment. And the point is that that friend is very experienced programmer and knows a thing or two about it. Anyway, he had a line in the code which goes like this:
char buf[1024] = “abc”;

You don’t even need to know C in order to understand that line, right? I assume we all agree to that. It simply initializes the buffer with a constant string literal. So his lecturer asked him, what does this line do precisely. And to his surprise his answer was incorrect. The correct answer is that the whole buffer is initialized and then the string constant is copied (this can be done in a few ways, for example copying a buffer with the zeros at the end of it). So today another friend called me on the phone to ask about this thing, why our first friend was wrong about it. Now as a reverser, I suppose I need to know the answer to such a simple matter as well. But, the sad part was that I was wrong as well as the two of them. I fired up the C standard and started to search for the solution. I wanted al iving proof to the matter at hand. Looking here and there it took me around 15 mins to lie my hands on the piece of sentence that settled all that matter down. And I quote:

“If there are fewer initializers in a brace-enclosed list than there are elements or members of an aggregate, or fewer characters in a string literal used to initialize an array of known size than there are elements in the array, the remainder of the aggregate shall be initialized implicitly the same as objects that have static storage duration.”

The underlined text is the answer - If there are less characters than the size of the array to initialize, the remainder has to be initialized as well. There is another clause which explains how the initialization is being done, but for now, let it be ‘zeroing’.

Now the reason I was wrong about it is because I happened to see many (for example):
mov [buf+1], ‘a’
mov [buf+2], ‘b’
mov [buf+3], ‘c’
mov [buf+4], ‘\0′

in lots of functions, and that means the source C code is:

char buf[] = “abc”;

The standard says about this case that the size of the buffer is to be acquired from the size of the literal constant string, don’t forget the null termination character as well. So that’s why I didn’t see the memset coming in to initialize the all buffer. Besides, maybe most of the people code it this way:
char buf[1024];
strcpy(buf, “abc”);

Which doesn’t lead to a memset or other way of initialization of the rest of the array.

Instructions’ Prefixes Hell

December 21st, 2008

Since the first day diStorm was out people didn’t know how to deal with the fact that I drop(ignore) some prefixes. It seems that dropping unused prefixes isn’t such a great feature for many people and it only complicates the scanning of streams. Therefore I am thinking about removing the whole mechanism, or maybe change it in a way that still preserves the same interface but behaves differently.

For the following stream: “67 50″, the result by diStorm will be: “db 0×67″ - “push eax”. The 0×67 prefix supposes to change the address size, which none is used in our case, thus it’s dropped. However, if we look at the hex code of the “push eax” part we will see “67 50″. And this is where most of the people become dumbfounded. Getting twice the same prefix-byte of the stream in two results is in a way confusing. Taking a look at other disassemblers will tell you that diStorm is not the only one to do such games with prefixes. Sometimes I get emails regarding this “impossible” prefix - since it gets to be output twice, which is wrong, right? Well, don’t know, it depends how you choose to decode it. The way I chose to decode prefixes was really advanced, each prefix could have been ignored, unless it has really affected (one of) the operand itself. I had to really keep tracking on each prefix and know whether it affected any operands in the instructions and only then I examined which prefixes I drop or not. This all sounds right in a way. Hey, at least for me.

However, we didn’t even talk about what you will do if you have multiple prefixes of the same family (segment-overide: DS, ES, SS, etc). Now this one is really up to interpretations of the designer. Probably the way I did it in diStorm is wrong, I admit it, that’s why I want to rewrite the whole prefixes thing from the beginning. There are 4 or 5 types of prefixes and according to the specs (Intel/AMD) I quote: “A single instruction should include a maximum of one prefix from each of the five groups.” …. “The result of using multiple prefixes from a single group is unpredictable.”. This pretty much sums all the problems in the world related to prefixes. I guess you can see for yourself from these 2 lines you can actually treat them in many different ways. We know now that it can lead to “unpredictable” results if you have many prefixes - in reality it won’t shut down your CPU, it won’t even throw an exception. So screw it you say, and you’re right. Now let’s see some CPU (16 bits) logic for decoding the prefixes:

while (prefix byte is read) {
 switch (prefix): {
  case seg_cs: use_seg = cs; break;
  case seg_ds: use_seg = ds; break;
  case seg_ss: use_seg = ss; break;
  ….
  ….
 case op_size: op_size = 32; break;
  case op_addr: op_addr = 32; break;
 case rep_z: rep = z; break;
 …
 }
 - skip byte in stream -
}

The processor will use those flags in order to know which prefix was presented or not. The thing about using a loop (in any form) is that now that you have to show text out of some streams with many prefixes, you don’t know whether the processor really uses the first occurrance of the prefix or its last, or maybe both? And maybe Intel and AMD implement it differently?

You know what? Why the heck do I bother so much with some minor end cases that never really happen in real code sections. I ask myself too, maybe I shouldn’t. Although I happened to see for myself some malware code that tries to screw up the disassembler with many extra prefixes, etc.. and I thought diStorm could help malware analyzers as well with advanced prefixes decoding.

Anyways, according to the above logic code I’m supposed to use the last prefix of each type. Given a stream such as: 66 66 67 67 40. I will get:
0: 66 (dropped)
2: 67 (dropped)
1: 66 67 40
Now you can see that the prefixes used are the second and the fourth and that the instruction starts at the second byte on the stream. Now I officially can commit a suicide, even I can’t follow these addresses, it’s hell. So any better solution?

Welcome Back

December 20th, 2008

Hey you guys again, I’m back from South East Asia after 3 months of traveling all around. Was awesome :)

So here’s some potentially cool real story: What happened is that while I was walking with a few friends in Vietnam (Nha Trang to be accurate) on the beach a friend found a pouch with credit cards and driving license, etc. The only thing we knew about that pouch was the owner’s name and that she was Irish. That didn’t really help us to get to her, unforetunately no cellphone number was attached anywhere in the pouch. The next thing we thought was to look her up on FaceBook, but she wasn’t listed (who doesn’t have FB nowadays? :) ). So, we had to give it to the Vietnamese local police station, but probably that poor girl continued traveling and didn’t find it…

 Anyways, I just realized something very nice, suppose you have somebody’s email. Whether someone left a comment with only his email on this blog, or whatever. And you wish to find that email or who he/she is. So usually we fire up google and looking for that email and we can learn much from that. But sometimes we can’t find anything. And besides, even if we do find something, it might not be relevant or enough information about that person. What I realized was that you can search people using their email in FaceBook, and I really managed to find a few people who were anonymous except their emails, which is quite interesting….Finally we got some way to link a person with an email address, think about it.

So that’s it, I’m back for a couple of months, hopefully I will write some interesting posts, need to get ideas, which usually are originated from my work, stay tuned ;)

Software That Uses diStorm

August 24th, 2008

After a few years that diStorm is out, we can already see it used here and there. Although most users are private users rather than commercial, but even commercial applications use diStorm. I guess many people also use it internally in their companies, but without their word I can’t really know about it. Except some friends who tell me so.

It’s pretty cool that you write something useful which people actually use, and to save commercial use. That was my main reason to release diStorm under the permissive BSD license. The problem arises when there is some commercial applications which don’t give credit for your work, it’s really frustrating and I guess one can’t do much about it. There is this Vietnamic BKV Pro anti virus software that claimed to be written by professors and students (or the like), so I didn’t really expect no credit from such people. But this is our world :( I got an email from an advocate about diStorm’s copyright infrigement. It seems they also abuse WinRAR’s license, so I’m not the only one.. To be honest, I prefer they stop using diStorm immediately rather than not giving me my credit. There are other disassembler libraries out there, they could use them as well. On the other hand, I’m happy to know they use diStorm, but I only ask for recognition, nothing else, after all the hard work I put there. I emailed them but to no response. This licenses’ violation from the AV guys seem to make a lot of noise in Vietnam blogs and forums, though I can’t really understand anything, except where they quote diStorm’s license or saying my name. I haven’t yet contacted OSI, and I’m not sure if they can really help, but it’s worth the try.

Anyway, there are good people who does give credit and I decided it’s about time I will show a small list of users. The first one though goes to a good friend I met through diStorm, who reported many bugs and helped in testing the 64bits environment (do not confuse with AMD64) support, Sanjay Patel. He works(/founder) at RotateRight.com which released last month their Zoom product, which is a very smart Profiler, currently only for Linux though. The product is free for 30 days trial version, you should check it out, it seems to be very promising, because I know more guys behind this product, although I haven’t tested it myself. But hey, it uses diStorm :)

More products which use diStorm:

Apple Shark Profiler

SolidShield - server side protector

DFSee - Low Level disk tools

And some open source projects:

Python-ptrace

Crypto Implementations Analysis Toolkit

Well, that’s what I’m aware about at least, I believe there are more though.

Have fun :)

Proxy Functions - The Right Way

August 21st, 2008

As much as I am an Assembly freak, I try to avoid it whenever possible. It’s just something like “pick the right language for your project” and don’t use overqualified stuff. Actually, in the beginning, when I started my patch on the IPhone, I compiled a simple stub for my proxy and then fixed it manually and only then used that code for the patch. Just to be sure about something here - a proxy function is a function that gets called instead of the original function, and then when the control belongs to the proxy function it might call the original function or not.

The way most people do this proxy function technique is using detour patching, which simply means, that we patch the first instruction (or a few, depends on the architecture) and change it to branch into our code. Now mind you that I’m messing with ARM here - iphone… However, the most important difference is that the return address of a function is stored on a register rather than in the stack, which if you’re not used to it - will get you confused easily and experiencing some crashes.

So suppose my target function begins with something like:

SUB SP, SP, #4
STMFD SP!, {R4-R7,LR}
ADD R7, SP, #0xC

This prologue is very equivalent to push ebp; mov ebp, esp thing on x86, plus storing a few registers so we can change their values without harming the caller, of course. And the last thing, we also store LR (link-register), the register which stores the return address of the caller.

Anyhow, in my case, I override (detour) the first instruction to branch into my code, wherever it is. Therefore, in order my proxy function to continue execution on the original function, I have to somehow emulate that overriden instruction and only then continue from the next instruction as if the original patched function wasn’t touched. Although, there are rare times when you cannot override some specific instructions, but then it means you only have to work harder and change the way your detour works (instructions that use the program counter as an operand or branches, etc).

Since the return address of the caller is stored onto a register, we can’t override the first instruction with a branch-link (’call’ equivalent on x86). Because then we would have lost the original caller’s return address. Give it a thought for a second, it’s confusing in the first time, I know. Just an interesting point to note that it so happens that if there’s a function which don’t call internally to other functions, it doesn’t have to store LR on the stack and later pop the PC (program-counter, IP register) off the stack, because nobody touched that register, unless the function needs around 14 registers for optimizations, instead of using local stack variables… This way you can tell which of the functions are leaves on the call graph, although it is not guaranteed.

Once we understand how the ARM architecture works we can move on. However, I have to mention that the 4 first parameters are passed on registers (R0 to R3) and the rest on the stack, so in the proxy we will have to treat the parameters accordingly. The good thing is that this ABI (Application-Binary-Interface) is something known to the compiler (LLVM with GCC front-end in my case), so you don’t have to worry about it, unless you manually write the proxy function yourself.

My proxy function can be written fully in C, although it’s possible to use C++ as well, but then you can’t use all features…

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
}

That’s my sample foo proxy function, which doesn’t do anything useful nor interesting, but usually in proxies, we want to change the arguments, before moving on to the original function.

Once it is compiled, we can rip the code from the object or executable file, doesn’t really matter, and put it inside our patched file, but we are still missing the glue code. The glue code is a sequence of manually crafted instructions that will allow you to use your C code within the rest of the binary file. And to be honest, this is what I really wanted to avoid in first place. Of course, you say, “but you could write it once and then copy paste that glue code and voila”. So in a way you’re right, I can do it. But it’s bothersome and takes too much time, even that simple copy paste. And besides it is enough that you have one or more data objects stored following your function that you have to relocate all the references to them. For instance, you might have a string that you use in the proxy function. Now the way ARM works it is all get compiled as PIC (Position-Independent-Code) for the good and bad of it, probably the good of it, in our case. But then if you want to put your glue code inside the function and before the string itself, you will have to change the offset from the current PC register to the string… Sometimes it’s just easier to see some code:

stmfd sp!, {lr} 
mov r0, #0
add r0, pc, r0
bl _strlen
ldmfd sp! {pc}
db “this function returns my length :)”, 0

 When you read the current PC, you get that current instruction’s address + 8, because of the way the pipeline works in ARM. So that’s why the offset to the string is 0. Trying to put another instruction at the end of the function, for the sake of glue code, you will have to change the offset to 4. This really gets complicated if you have more than one resource to read. Even 32 bits values are stored after the end of the function, rather than in the operand of the instruction itself, as we know it on the x86.

So to complete our proxy code in C, it will have to be:

int foo(int a, int b)
{

 int (*orig_code)(int, int) = (int (*)(int, int))<addr of orig_foo + 4>; 
// +4 = We skip the first instruction which branches into this code!
 if (a == 1000) b /= 2;
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm(”sub sp, sp, #4″);
 return orig_foo(a, b);
}

This code looks more complete than before but contains a potential bug, can you spot it? Ok, I will give you a hint, if you were to use this code for x86, it would blow, though for ARM it would work well to some extent.

The bug lies in the number of arguments the original function receives. And since on ARM, only the 5th argument is passed through the stack, our “sub sp, sp, #4″ will make some things go wrong. The stack of the original function should be as if it were running without we touched that function. This means that we want to push the arguments on the stack, ONLY then, do the stack fix by 4, and afterwards branch to the second instruction of the original function. Sounds good, but this is not possible in C. :( cause it means we have to run ‘user-defined’ code between the ‘pushing-arguments’ phase and the ‘calling-function’ phase. Which is actually not possible in any language I’m aware of. Correct me if I’m wrong though. So my next sentence is going to be “except Assembly”. Saved again ;)

Since I don’t want to dirty my hands with editing the binary of my new proxy function after I compile it, we have to fix that problem I just desribed above. This is the way to do it, ladies and gentlemen:

int foo(int a, int b)
{
 if (a == 1000) b /= 2;
 return orig_foo(a, b);
}

void __attribute__((naked)) orig_foo(int a, int b)
{
// Emulate the real instruction we overrode, so stack is balanced before we continue with original function.
 asm(”sub sp, sp, #4\nldr r12, [pc]\n bx r12\n.long <FOO ADDR + 4>”);
}

The code simply fixes the stack, reads the address of the original absolute foo address, again skipping the first instruction, and branches into that code. Though, it won’t change the return address in LR, therefore when the original function is over, it will return straight to the caller of orig_foo, which is our proxy function, that way we can still control the return values, if we wish to do so.

We had to use the naked attribute (__declspec(naked) in VC) so that the compiler won’t put a prologue that will unbalance our stack again. In any way the epilogue wouldn’t get to run…

This technique will work on x86 the same way, though for branching into an absolute address, one should use: push <addr>; ret.

In the bottom line, I don’t mind to pay the price for a few code lines in Assembly, that’s perfectly ok with me. The problem was that I had to edit the binary after compilation in order to fix it so it’s becoming ready to be put in the original binary as a patch. Besides, the Assembly code is a must, if you wish to compile it without further a do, and as long as the first instruction of the function hasn’t changed, your code is good to go.

This code works well and just as I really wanted, so I thought so share it with you guys, for a better “infrastructure” to make proxy function patches.

However, it could have been perfect if the compiler would have stored the functions in the same order you write them in the source code, thus the first instruction of the block would be the first instruction you have to run. Now you might need to add another branch in the beginning of the code so it skips the non-entry code. This is really compiler dependent. GCC seems to be the best in preserving the functions’ order. VC and LLVM are more problematic when optimizations are enabled. I believe I will cover this topic in the future.

One last thing, if you use -O3, or functions inline, the orig_foo naked function gets to be part of the foo function, and then the way we assume the original function returns to our foo proxy function, won’t happen. So just be sure to peek at the code so everything is fine ;)

x86 Instruction Set Wars

August 14th, 2008

It all back in the 90’s where Intel came up with the dashing MMX technology. AMD wasn’t so late to respond with its new 3DNow! instruction set. That if you ask me, was much less popular and less used. But the good thing about 3DNow! is that it handled floating points rather than integers. And then came SSE instruction set, and the world got better and nowadays compilers even use it up to SSE2 while knowing it will be there when the code runs for 99% today. However, they still make a SSE test to be sure they can use it, I believe this check will always stay in code for assurance anyway, can’t harm, you know.

Now almost 10 years later, we see another split in technologies, Intel came up with SSE4, I already talked about it in an earlier post, which contained really valuable instructions. And then, guess what? AMD added several instruction on top of Intel’s.

In August 07 AMD announced a new set: SSE5. Aren’t we sick off SSE anymore??? Anyway, in April this year, Intel announced in response a new AVX instruction set (Advanced Vector Extensions).

This doesn’t go anywhere. Every company in its turn announces a new instruction set. The first company doesn’t support the other’s and vice versa. This is just going wrong and it’s our nightmare. Basically, most developers should not care anyway, they don’t use these sets, and they (partly) are not out officially (means you can’t use it as for now). Therefore, the game matters for compiler developers and those who write the tools that mess with machine code, etc. Probably many codecs and crypto algos will use it too directly if it exists on the processor they get to run on…

The thing is, I decided to take a side, and to stick only to AVX, and therefore I won’t support SSE5 in diStorm. If anyone from the community is gonna help with that, I will gladly accept it, don’t get me wrong. Since I don’t have much time for it anyway, and I hate this mess up, I will stick to Intel. I don’t know if it’s a good or bad news for diStorm, but as the only owner and the way I see it, I gotta do something against this lack of standard thing. Everything has standards today, why can’t the damned x86 have one as well?

I’m not the only one who speaks about this, Agner Fog has also talked about it here. The difference is that I can make my own small change.

What do you think should happen with this issue??? Who’s right, Intel or AMD? Should developers code twice now for each instruction set?

arrrrgh so many questions

Import Symbols by Ordinals in Link-Time (using VS)

August 7th, 2008

I bet some of you have always wanted to use a system dll file which has a few exported functions, though by ordinal numbers only. Yesterday it happenned to me. I had a dll which exported many a function by ordinal. For instance, you can take a look yourself at ws2_32.dll to learn quickly using dependency walker or another tool the exports of this file. Sometimes you have a 3rd party dll file which you lack both the header file (.h) and a .lib file. And let’s say that you know which functions (their ordinals) you want to use and their prototypes (which is a must). There are several approaches to use the dll in that case.

1) Use LoadLibrary and GetProcAddress programmatically to control the load and pointers retrieval for the functions you’re interested in. This is like the number one guaranteed way to work. I don’t like this idea, because sometimes you are bound to use the library as it is already linked to the file at link-time. Another disadvantage I find important is that you have to change your code in order to support that 3rd party dll functinoality. And for me this is out of question. The code should stay as it is nevertheless the way I use outsider dlls. And besides, why to bother add extra code for imported functions when the linker should do it for you?

2) Use delay-load dll’s, but you will still have to use some helper functions in order to import a function by its ordinal. If you ask me, this is just lame, and probably you don’t need a lazy load anyway.

3) Patching your PE file manually to use ordinals instead of function names. But there’s a catch here, how can you compile your file, if you can’t resolve the symbols? Well, that depends on the support your 3rd party dll has. For example, you might have a .lib file for one version of the 3rd party dll and then everything is splendid. But with a newer version of that dll, all symbols are exported by ordinals. So you can still compile your code using the older .lib, but then you will have to patch the PE. Well, this is the way I did it the first time yesterday, only to witness that everything works well. But even if you change your code slightly and recompile it you will have to fix the PE again and again. This is out of question. And besides you don’t solve a solution with atomic bomb when you can use a small missle, even if you have the skills to do that, something it’s possibly wrong and you should look for a different way, I believe this is true for most things in life anyway.

4) Using IMPORTS statement in your .def file. Which there you tell the linker how to resolve the symbols and everything. Well, this sounds like the ultimate solution, but unforetunately, you will get a linker warning: “warning LNK4017: IMPORTS statement not supported for the target platform; ignored”. Now you ask yourself what target platform are you on? Of course, x86, although I bet the feature is just not supported, so don’t go that way. The thing is that even Watcom C supports this statement, which is like the best thing ever. It seems that Visual C has supported it for sometime, but no more. And you know what’s worse? Delphi: procedure ImportByOrdinal; external MyLib index Ordinal; When I saw that line I freaked out that VS doesn’t have support for such a thing as well, so cruel you say. ;)

5) And finally the real way to do it. Eventually I couldn’t manage to avoid the compiler tools of VS anymore and I decided to give them a try. Which I gotta say was a pleasant move, because the tools are all easy to use and in 10 mins I finished hacking the best solution out there, apparently…

It all comes down to the way you export and import functions. We all know how to export functions, we can use both dllspec(dllexport) or use a .def file which within we state which symbols to export, as simple as that. Of course, you could have used dllspec(dllimport) but that is only when you got a symbolic name for your import.

So what you have to do is this, write a new .def file which will export (yes, export) all the functions you want to import! but this time it’s a bit more tricky since you gotta handle specially the symbolic names of the imports, I will get to that soon. An example of a .def file that will import from ‘ws2_32.dll’ is this:

LIBRARY ws2_32.dll

EXPORTS

    MyAnonymousWSFunction @ 555

That’s it. We have a function name to use in our code to access some anonymous function which is exported by the ordinal 555. Well but that ain’t enough, you still need to do one more thing with this newly created .def file and that is to create a .lib file out of it using a tool named  ‘lib’: lib /DEF:my.def /OUT:my.lib

Now you got a new .lib file which describes the imports you want to use in your original PE file. You go back to that code and add this new ‘my.lib’ file in the linker options and you’re done.

Just as a tip, you can use ‘dumpbin /exports ida.lib’ in order to verify your required symbols.

We can now understand that in order to import symbols, VS requires a .lib file (and in order to export you need a .def file). The nice thing about it is that we can create one on our own. However, as much as we wish to use a .def file to import symbols too, it’s impossible.

One or two more things you should know. The linker might shout at you that a symbol __imp__blabla is not found. That does NOT mean you have to define your symbol with a prefix of __imp__. Let’s stick to demangled names, since I haven’t tried it on C++ names and it’s easier sticking to a simple case.

For cdecl calling convention you add a prefix of ‘_’ to your symbol name. For stdcall calling convention you don’t add any prefix nor suffix. But again, for stdcall you will have to instruct the linker how many bytes on stack the function requires (according to the prototype), so it’s as simple as the number of params multiplied by 4. Thus if you got a prototype of ‘void bla(int a, int* b);’ You will define your symbol as ‘bla@8 @ 555′. Note that the first number after the @ is the number of bytes, and the second one is the ordinal number, which is the most important thing around here.

And one last thing you should know, if you want to import data (no cdecl nor stdcall that is) you have to end the symbol’s line with ‘DATA’. Just like this: ‘MyFunc @ 666 DATA’.

I am positive this one gonna save some of you guys, good luck.

Context of Execution (Meet Pyda)

August 5th, 2008

While I was writing an interactive console Python plugin for IDA pro I stumbled in a very buggy problem. The way a plugin works in IDA is that you supply 3 callbacks, init, run, term. You can guess which is called when. I just might put in that there are a few types of plugins and each type can be loaded in different times. I used, let’s say, a UI plugin which gets to run when you hit a hotkey. At that moment my run() function executes in the context of IDA’s thread. And only then I’m allowed to use all other data/code from IDA’s SDK. This is all good so far, and as it should be from such a plugin interface. So what’s wrong? The key idea behind my plugin is that it is an interactive Python console. Means that you have a console window which there you enter commands which eventually will be executed by the Python’s interpreter. Indirectly, this means that I create a thread with a tight loop that waits for user’s input (it doesn’t really matter how it’s being done at the moment). Once the user supplies some input and presses the enter key I get the buffer and run it. Now, it should be clear that the code which Python runs is in that same thread of the console, which I created. Can you see the problem? Ok, maybe not yet.

Some of the commands you can run are wrappers for IDA commands, just like the somewhat embedded IDC scripting in IDA, you have all those functions, but this time in Python. Suppose that you try to access an instruction to get its mnemonic from Python, but this time you do it from a different thread while the unspoken contract with IDA plugin is that whatever you do, you do it in run() function time, and unforetunately this is not the case, and cannot be the case ever, since it is an interactive console that waits for your input and should be ready anytime. If you just try to run SDK commands from the other (console) thread then you experience mysterious crashes and unexplained weird bugs, which is a big no no, and cannot be forgiveable for a product like this, of course.

Now that we know the problem we need to hack a solution. I could have done a very nasty hack, that each time the user runs something in Python, I simulate the hotkey for the plugin and then my plugin’s run() will be called from IDA’s context and there I will look at some global info which will instruct me what to run, and later on I need to marshal back the result to the Python’s (console) thread somehow. This is all possible, but com’on this is fugly.

I read the SDK upside down to find some way to get another callback, which I can somehow trigger whenever I want, in order to run from IDA’s context, but hardly I found something good and at the end nothing worked properly and it was too shakey as well.

Just to note that if you call a function from the created thread, most of the times it works, but when you have some algorithm running heavily, you get crashes after a short while. Even something simple like reading some info of an instruction in a busy loop and at the same time scrolling the view might cause problems.

Then I decided it’s time to do it correctly without any hacks at all. So what I did was reading in MSDN and looking for ways to run my own code at another thread’s context. Alternatively, maybe I could block IDA’s thread while running the functions from my own thread, but it doesn’t sound too clever, and there’s no really easy way to block a thread. And mind you that I wanted a simple and clean solution as much as possible, without crafting some Assembly code, etc. ;)

So one of the things that jumped into my mind was APC, using the function QueueUserAPC, which gets among other params an HTHREAD, which is exactly what I needed. However, the APC will get to run on the target thread only when it is in an alertable state. Which is really bad, because most UI threads are never in that state. So off I went to find another idea. Though this time I felt I got closer to the solution I wanted.

Later on a friend tipped me to check window messages. Apparently if you send a message to another window, which was created by the target thread (if the thread is a UI one and has windows, of course we assume that in IDA’s case and it’s totally ok), the target thread will dispatch that window’s window-procedure in its own context. But this is nothing new for me or you here, I guess. The thing was that I could create a window as a child of IDA’s main window and supply my own window-procedure to that window. And whenever a message is received at this window, its window-procedure gets to run in IDA’s thread’s context! (and rather not the one which created it). BAM mission accomplished. [Oh and by the way, timers (SetTimer) work the same way, since they are implemented as messages after all.]

After implementing this mechanism for getting to run in original’s IDA’s context, just like the contract wants, everything runs perfectly. To be honest, it could easily have not worked because it’s really dependent on the way IDA works and its design, but oh well. :)

About this tool, I am going to release it (as a source code) in the upcoming month. It is called Pyda, as a cool abbreviation for Python and IDA. More information will be available when it is out.

My Turn on the IPhone

July 25th, 2008

I tried adding some feature to the IPhone, and therefore I decided that I will do it my way, an in-memory patch, rather than on-disk patch. The reason I went for memory patching is simple, version 2 of IPSW contains code signing. Why should I smash my head against the wall trying to remove that code signing checks where I can easily do anything I want in runtime? Although, I read somewhere that you can sys-call something and disable the checks, they also said it makes the system a bit shaky… and later on I learnt there is a way to sign your own code, or existing patched files using ldid.

The thing is, I started my coding on beta 3 of version 2, and there everything worked well and wasn’t long time before my code work as expected. Then I updated my IPhone to the final second version and first thing I tried was my patch to know it doesn’t work. Now since there are no debuggers for the IPhone are really problematic, and except GDB, the other two I heard of are not free (DataRescue’s and DebuggerX).

Ok, to be honest, debuggers won’t even work because you can’t attach them to some of the tasks, this is since there is a new feature in OSX called ‘PT_DENY_ATTACH’, which simply means, no debugger can attach to this task, and if there is something currently attached, then detach it… This feature is implemented in the ptrace function, which lacks other important features, like reading and writing fro and to memory or getting registers’ values, etc. Of course, there are other ways to bypass those problems, and if you look well, you can find some resources about it.

Anyway, back to the story, I had to spot in my (long long) code what was wrong. After some short time I found that vm_protect failed. Another frustrating thing was to spot that failure in my code, because I didn’t have any way to print to debug (ala DebugView) or printf, or anything. I was kinda doomed, I had to crash the task in order to know that my code reached some point, and each time move the crash further a long the code, this is really lame, I know. Maybe if I had more knowledge with Linux/BSD/OSX I could track it down quicker. Hey, if you still got an idea how to do it next time, please drop a line, heh?

So once I knew the failing code, I tried to fix it, but actually I didn’t know what was wrong. The vm_protect returned something like ‘protection error’, and hell this doesn’t say much. I got really crazy at some point, that I used that same call on my own code block, and that failed too. I didn’t know what to do, I kept playing with that shit for hours :( and nothing came up to my mind. Then I left it and went to sleep with a bad mood (I hate when it happens, usually I keep on trying until I make it, but it was 6am in the morning already…) So later the next day, I decided I will read more in the MAN, and nothing special there, it only shows the parameters, the return value, bla bla, etc. By that time, I was sure that Apple touched something in the code related to vm_protect somehow, that was my hunch. The idea to RE the kernel and this function to see what was changed from beta 3 to the final version crossed my mind, not once. But I knew that I was missing something simple and I should not go that far, after all it’s a usermode API.

As stuck as I was, I googled for as much as possible information on this vm_protect and other OSX code snippets. Eventually I hit something interesting that used vm_protect and used another flag that I didn’t know that exists in OSX. The flag name is ’vm_prot_copy’. This might finish the story for you if you know it. Otherwise, it means that when you try to write to a page, it will make a copy of that same page particulary for the requesting task and then let you write to it. This is used in many operating systems, when some file (code usually) is being loaded from disk, and the OS wants to optimize memory, it maps the same physical page of that code/data to all tasks which loaded that file. Then if you just want to write to that page, you are forbidden, of course. Here comes the COW (copy on write ) to save us.

The annoying thing is that since the documentation sucks I didn’t find this vm_prot_copy anywhere. I even took a look in the header files, where the ‘vm_prot_execute’ for instance, was defined, and didn’t see this extra flag. Only after I knew the solution I came back to that file again and I found this flag to be declared almost in the end of the file, LOL. The cool thing, which came too late, that Cydia had some notes regard ‘how to port applications from earlier versions to final version’ and they wrote something about NX and protections, though they didn’t say anything directly about this COW thingy…

It was kinda a surprise to see that I had to specify such a low level flag. As I come from the Windows world mostly, there you don’t have to specify such a thing when you try to change some page’s protection and write to it. Therefore I didn’t expect it to be the case in other OS’s.

Just wanted to share this frustrating story and the experience of how fun (or not) it is to code for the IPhone.