Import Symbols by Ordinals at Link Time (using VS)

August 7th, 2008

I bet some of you have always wanted to use a system dll that exports a few functions, though by ordinal numbers only. Yesterday it happened to me. I had a dll which exported many of its functions by ordinal. For instance, take a look yourself at ws2_32.dll with Dependency Walker or another tool and you will quickly see what I mean. Sometimes you have a 3rd party dll for which you lack both the header file (.h) and a .lib file. And let's say that you know which functions you want to use (their ordinals) and their prototypes (which is a must). There are several approaches to using the dll in that case.

1) Use LoadLibrary and GetProcAddress programmatically to control the loading and the retrieval of pointers for the functions you're interested in. This is the number one guaranteed way to work. I don't like this idea, because sometimes the library is already linked to your file at link time and you are bound to use it that way. Another disadvantage I find important is that you have to change your code in order to support that 3rd party dll functionality. And for me this is out of the question. The code should stay as it is, regardless of the way I use outsider dlls. And besides, why bother adding extra code for imported functions when the linker should do it for you?

2) Use delay-load dlls, but you will still have to use some helper functions in order to import a function by its ordinal. If you ask me, this is just lame, and you probably don't need lazy loading anyway.

3) Patch your PE file manually to use ordinals instead of function names. But there's a catch here: how can you link your file if you can't resolve the symbols? Well, that depends on the support your 3rd party dll has. For example, you might have a .lib file for one version of the 3rd party dll and then everything is splendid. But with a newer version of that dll, all symbols are exported by ordinals only. So you can still link your code using the older .lib, but then you will have to patch the PE. Well, this is the way I did it the first time yesterday, only to witness that everything works well. But every time you change your code slightly and rebuild it, you will have to fix the PE again and again. This is out of the question. Besides, you don't solve a problem with an atomic bomb when you can use a small missile; even if you have the skills to do that, something is probably wrong and you should look for a different way. I believe this is true for most things in life anyway.

4) Use the IMPORTS statement in your .def file, in which you tell the linker how to resolve the symbols and everything. Well, this sounds like the ultimate solution, but unfortunately you will get a linker warning: "warning LNK4017: IMPORTS statement not supported for the target platform; ignored". Now you ask yourself what target platform you are on. x86, of course; I bet the feature is simply not supported anymore, so don't go that way. The thing is that even Watcom C supports this statement, which is like the best thing ever. It seems that Visual C supported it for some time, but no more. And you know what's worse? Delphi: procedure ImportByOrdinal; external MyLib index Ordinal; When I saw that line I freaked out that VS doesn't have support for such a thing as well, so cruel you say. ;)

5) And finally, the real way to do it. Eventually I couldn't avoid the VS build tools any longer and I decided to give them a try, which I gotta say was a pleasant move, because the tools are all easy to use and in 10 minutes I had hacked up the best solution out there, apparently…

It all comes down to the way you export and import functions. We all know how to export functions: we can use __declspec(dllexport) or a .def file within which we state which symbols to export, as simple as that. For importing, you could have used __declspec(dllimport), but that only works when you have a symbolic name for your import.

So what you have to do is this: write a new .def file which will export (yes, export) all the functions you want to import! But this time it's a bit more tricky, since you have to handle the symbolic names of the imports specially; I will get to that soon. An example of a .def file that imports from 'ws2_32.dll' is this:

LIBRARY ws2_32.dll

EXPORTS

    MyAnonymousWSFunction @ 555

That's it. We have a function name to use in our code to access some anonymous function which is exported by the ordinal 555. Well, but that ain't enough; you still need to do one more thing with this newly created .def file, and that is to create a .lib file out of it using a tool named 'lib':

lib /DEF:my.def /OUT:my.lib

Now you have a new .lib file which describes the imports you want to use in your original PE file. You go back to your project, add this new 'my.lib' to the linker options, and you're done.

Just as a tip, you can use 'dumpbin /exports my.lib' to verify that your required symbols are all there.

We can now understand that in order to import symbols, VS requires a .lib file (and in order to export, a .def file). The nice thing about it is that we can create one on our own. However, as much as we would like to use a .def file directly to import symbols too, it's impossible.

One or two more things you should know. The linker might shout at you that a symbol __imp__blabla is not found. That does NOT mean you have to define your symbol with a prefix of __imp__; the linker synthesizes that part itself. Let's stick to plain C names, since I haven't tried this with mangled C++ names and it's easier to stick to the simple case.

For the cdecl calling convention you add a prefix of '_' to your symbol name. For the stdcall calling convention you add no prefix and no suffix. But for stdcall you will have to tell the linker how many bytes of stack the function's parameters take (according to the prototype), which is as simple as the number of parameters multiplied by 4 (for regular 32-bit parameters). Thus, if you have a prototype of 'void bla(int a, int* b);', you will define your symbol as 'bla@8 @ 555'. Note that the first number after the @ is the number of bytes, and the second one is the ordinal number, which is the most important thing here.

And one last thing you should know: if you want to import data (neither cdecl nor stdcall, that is), you have to end the symbol's line with 'DATA', just like this: 'MyFunc @ 666 DATA'.
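To put all of that together, here is what a complete hand-written .def might look like for a hypothetical 3rd party dll (the dll name, function names, prototypes and ordinals below are all made up for the sake of the example; I'm just following the rules above):

LIBRARY other3rdparty.dll
EXPORTS
    _MyCdeclFunc @ 20        ; int __cdecl MyCdeclFunc(int a);
    MyStdcallFunc@8 @ 21     ; void __stdcall MyStdcallFunc(int a, int* b);
    MySharedData @ 22 DATA   ; some exported variable

Then you run 'lib /DEF:imports.def /OUT:imports.lib' as before and feed the resulting .lib to the linker.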

I am positive this one is gonna save some of you guys. Good luck.

Context of Execution (Meet Pyda)

August 5th, 2008

While I was writing an interactive console Python plugin for IDA Pro, I stumbled upon a very tricky problem. The way a plugin works in IDA is that you supply 3 callbacks: init, run, term. You can guess which is called when. I should add that there are a few types of plugins and each type can be loaded at different times. I used, let's say, a UI plugin which gets to run when you hit a hotkey. At that moment my run() function executes in the context of IDA's thread, and only then am I allowed to use all the other data/code from IDA's SDK. This is all good so far, and it is how such a plugin interface should behave. So what's wrong? The key idea behind my plugin is that it is an interactive Python console. That means you have a console window where you enter commands which will eventually be executed by the Python interpreter. Indirectly, this means that I create a thread with a tight loop that waits for the user's input (it doesn't really matter how it's being done at the moment). Once the user supplies some input and presses the enter key, I get the buffer and run it. Now, it should be clear that the code which Python runs is in that same console thread which I created. Can you see the problem? OK, maybe not yet.

Some of the commands you can run are wrappers for IDA commands, just like the somewhat embedded IDC scripting in IDA; you have all those functions, but this time in Python. Suppose you try to access an instruction to get its mnemonic from Python, but this time you do it from a different thread, while the unspoken contract with an IDA plugin is that whatever you do, you do it during run()'s time. Unfortunately this is not the case here, and cannot ever be the case, since it is an interactive console that waits for your input and should be ready at any time. If you just try to run SDK commands from the other (console) thread, you experience mysterious crashes and unexplained weird bugs, which is a big no-no and unforgivable for a product like this, of course.

Now that we know the problem, we need to hack up a solution. I could have done a very nasty hack: each time the user runs something in Python, I simulate the hotkey for the plugin; then my plugin's run() will be called from IDA's context, and there I will look at some global info which will instruct me what to run, and later on I need to marshal the result back to the Python (console) thread somehow. This is all possible, but come on, this is fugly.

I read the SDK upside down looking for a way to get another callback, one which I could somehow trigger whenever I want in order to run from IDA's context, but I hardly found anything good; in the end nothing worked properly, and it was too shaky as well.

Just to note that if you call a function from the created thread, most of the time it works; but when you have some algorithm running heavily, you get crashes after a short while. Even something simple like reading the info of an instruction in a busy loop while scrolling the view at the same time might cause problems.

Then I decided it was time to do it correctly, without any hacks at all. So what I did was read MSDN, looking for ways to run my own code in another thread's context. Alternatively, maybe I could block IDA's thread while running the functions from my own thread, but that doesn't sound too clever either, and there's no really easy way to block a thread. And mind you, I wanted a solution as simple and clean as possible, without crafting some Assembly code, etc. ;)

So one of the things that jumped into my mind was an APC, using the function QueueUserAPC, which gets, among other params, a thread handle, which is exactly what I needed. However, the APC only gets to run on the target thread when that thread is in an alertable state, which is really bad, because most UI threads are never in that state. So off I went to find another idea, though this time I felt I was getting closer to the solution I wanted.

Later on a friend tipped me to check window messages. Apparently, if you send a message to a window which was created by the target thread (assuming the target thread is a UI one and has windows; of course we can assume that in IDA's case and it's totally OK), the target thread will dispatch that window's window-procedure in its own context. But this is nothing new for me or you here, I guess. The thing was that I could create a window as a child of IDA's main window and supply my own window-procedure for that window. And whenever a message is received at this window, its window-procedure gets to run in IDA's thread's context (and not in the thread that sent the message)! BAM, mission accomplished. [Oh, and by the way, timers (SetTimer) work the same way, since they are implemented as messages after all.]
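Just to make the mechanism concrete, here is a minimal sketch of the idea in Python with ctypes. This is not Pyda's actual code; the window class name, the private message number and the helper names are all mine, and the whole thing assumes the window gets created from IDA's own thread (e.g. inside run()), while run_in_ui_thread() is what the console thread calls:

import ctypes
from ctypes import wintypes

user32 = ctypes.windll.user32
kernel32 = ctypes.windll.kernel32

WM_RUN_CALLABLE = 0x0401          # WM_USER + 1, an arbitrary private message
WS_CHILD = 0x40000000

# LRESULT CALLBACK WndProc(HWND, UINT, WPARAM, LPARAM)
WNDPROC = ctypes.WINFUNCTYPE(ctypes.c_ssize_t, wintypes.HWND, ctypes.c_uint,
                             wintypes.WPARAM, wintypes.LPARAM)

kernel32.GetModuleHandleW.restype = wintypes.HMODULE
user32.DefWindowProcW.restype = ctypes.c_ssize_t
user32.DefWindowProcW.argtypes = [wintypes.HWND, ctypes.c_uint,
                                  wintypes.WPARAM, wintypes.LPARAM]
user32.SendMessageW.restype = ctypes.c_ssize_t
user32.SendMessageW.argtypes = [wintypes.HWND, ctypes.c_uint,
                                wintypes.WPARAM, wintypes.LPARAM]
user32.CreateWindowExW.restype = wintypes.HWND
user32.CreateWindowExW.argtypes = [
    wintypes.DWORD, wintypes.LPCWSTR, wintypes.LPCWSTR, wintypes.DWORD,
    ctypes.c_int, ctypes.c_int, ctypes.c_int, ctypes.c_int,
    wintypes.HWND, wintypes.HMENU, wintypes.HINSTANCE, wintypes.LPVOID]

class WNDCLASSW(ctypes.Structure):
    _fields_ = [("style", ctypes.c_uint), ("lpfnWndProc", WNDPROC),
                ("cbClsExtra", ctypes.c_int), ("cbWndExtra", ctypes.c_int),
                ("hInstance", wintypes.HINSTANCE), ("hIcon", wintypes.HANDLE),
                ("hCursor", wintypes.HANDLE), ("hbrBackground", wintypes.HANDLE),
                ("lpszMenuName", wintypes.LPCWSTR),
                ("lpszClassName", wintypes.LPCWSTR)]

_pending = {}  # ticket -> callable, filled by the console thread

def _wndproc(hwnd, msg, wparam, lparam):
    if msg == WM_RUN_CALLABLE:
        # We are now running in the thread that owns the window (the UI
        # thread), so it is safe to touch the SDK from here.
        _pending.pop(wparam)()
        return 0
    return user32.DefWindowProcW(hwnd, msg, wparam, lparam)

_proc = WNDPROC(_wndproc)  # keep a reference so it is not garbage collected

def create_dispatch_window(ida_main_hwnd):
    # Must be called from the UI thread (e.g. inside the plugin's run()),
    # so that the new window (and its wndproc) belongs to that thread.
    wc = WNDCLASSW()
    wc.lpfnWndProc = _proc
    wc.hInstance = kernel32.GetModuleHandleW(None)
    wc.lpszClassName = u"MyDispatchWindow"
    user32.RegisterClassW(ctypes.byref(wc))
    return user32.CreateWindowExW(0, wc.lpszClassName, None, WS_CHILD,
                                  0, 0, 0, 0, ida_main_hwnd,
                                  None, wc.hInstance, None)

def run_in_ui_thread(hwnd, func):
    # Called from the console thread; SendMessage only returns after the
    # UI thread has dispatched the message, i.e. after func() has run.
    ticket = id(func)
    _pending[ticket] = func
    user32.SendMessageW(hwnd, WM_RUN_CALLABLE, ticket, 0)

The console thread then just wraps whatever SDK call it needs in a callable and hands it to run_in_ui_thread(), and it executes synchronously in IDA's context.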

After implementing this mechanism for getting to run in IDA's own context, just like the contract wants, everything runs perfectly. To be honest, it could easily have not worked, because it's really dependent on the way IDA works and its design, but oh well. :)

About this tool: I am going to release it (as source code) in the upcoming month. It is called Pyda, as a cool abbreviation for Python and IDA. More information will be available when it is out.

My Turn on the IPhone

July 25th, 2008

I tried adding some feature to the IPhone, and I decided that I would do it my way: an in-memory patch rather than an on-disk patch. The reason I went for memory patching is simple: version 2 of the IPSW contains code signing. Why should I smash my head against the wall trying to remove those code signing checks when I can easily do anything I want at runtime? Although I read somewhere that you can syscall something and disable the checks, they also said it makes the system a bit shaky… and later on I learnt there is a way to sign your own code, or existing patched files, using ldid.

The thing is, I started my coding on beta 3 of version 2, where everything worked well, and it wasn't long before my code worked as expected. Then I updated my IPhone to the final second version, and the first thing I tried was my patch, only to find out it didn't work. Now, debuggers for the IPhone are really problematic; except for GDB, the other two I heard of are not free (DataRescue's and DebuggerX).

Ok, to be honest, debuggers won't even work, because you can't attach them to some of the tasks. This is because there is a new feature in OSX called 'PT_DENY_ATTACH', which simply means no debugger can attach to this task, and if there is something currently attached, then detach it… This feature is implemented in the ptrace function, which lacks other important features, like reading and writing to and from memory or getting registers' values, etc. Of course, there are other ways to bypass those problems, and if you look well, you can find some resources about it.

Anyway, back to the story: I had to spot in my (long long) code what was wrong. After a short time I found that vm_protect failed. Another frustrating thing was spotting that failure in my code, because I didn't have any way to print debug output (ala DebugView) or printf, or anything. I was kinda doomed; I had to crash the task in order to know that my code reached some point, and each time move the crash further along the code. This is really lame, I know. Maybe if I had more knowledge of Linux/BSD/OSX I could have tracked it down quicker. Hey, if you still got an idea how to do it better next time, please drop a line, heh?

So once I knew the failing code, I tried to fix it, but I actually didn't know what was wrong. vm_protect returned something like 'protection error', and hell, that doesn't say much. I got really crazy at some point and used that same call on my own code block, and that failed too. I didn't know what to do; I kept playing with that shit for hours :( and nothing came to mind. Then I left it and went to sleep in a bad mood (I hate when that happens; usually I keep on trying until I make it, but it was 6am already…). Later the next day, I decided I would read the man page more carefully, but there was nothing special there: it only shows the parameters, the return value, bla bla, etc. By that time, I was sure that Apple had touched something in the code related to vm_protect; that was my hunch. The idea of REing the kernel and this function to see what was changed from beta 3 to the final version crossed my mind more than once. But I knew that I was missing something simple and should not go that far; after all, it's a usermode API.

As stuck as I was, I googled for as much information as possible on this vm_protect and other OSX code snippets. Eventually I hit something interesting that used vm_protect with another flag that I didn't know existed in OSX. The flag is named 'VM_PROT_COPY'. This might finish the story for you if you know it. Otherwise, it means that when you try to write to a page, the kernel will make a copy of that same page, particularly for the requesting task, and then let you write to it. This is used in many operating systems: when some file (code, usually) is loaded from disk and the OS wants to optimize memory, it maps the same physical page of that code/data into all the tasks which loaded that file. If you then just try to write to that page, you are forbidden, of course. Here comes the COW (copy-on-write) to save us.

The annoying thing is that since the documentation sucks, I didn't find this VM_PROT_COPY anywhere. I even took a look in the header files, where 'VM_PROT_EXECUTE', for instance, was defined, and didn't see this extra flag. Only after I knew the solution did I come back to that file again, and I found the flag declared almost at the end of the file, LOL. The cool thing, which came too late, is that Cydia had some notes regarding how to port applications from earlier versions to the final version, and they wrote something about NX and protections, though they didn't say anything directly about this COW thingy…

It was kind of a surprise to see that I had to specify such a low-level flag. As I come mostly from the Windows world, where you don't have to specify such a thing when you change a page's protection and write to it, I didn't expect it to be the case in other OSes.

Just wanted to share this frustrating story and the experience of how fun (or not) it is to code for the IPhone.

Anti-Unpacker Tricks

July 18th, 2008

Peter Ferrie, a former employee of Symantec who now works for MS, wrote a paper about anti-unpacker tricks. I was really fascinated reading that paper. There are so many examples in there of tricks that still work nowadays. Some I already knew, some were new to me; he covers so many tricks. The useful thing is that every trick has a thorough description and a code snippet (mostly Assembly). So now it becomes one of the most valuable papers on the subject, and you should really read it to get up to date. The paper can be found here.

One idea that I really liked from the paper, something that Peter himself found, is that you can use ReadFile (or WriteProcessMemory) to overwrite a memory block so that no software breakpoints will be triggered when you execute it. But on second thought, why wouldn't a simple memcpy do the same trick?

If you guys remember the Tiny PE challenge I posted 2 years ago on Securiteam.com, Peter was the only one who kicked my ass with a version of 232 bytes, where I came up with 274 bytes. But no worries, after a long while I came back with a version of 213(!) bytes (over here) using some new tricks. Today I'm still waiting for Peter's last word…

Have fun

It’s JS Again

May 14th, 2008

More things I hate about JS. Why would you give a shit about this? Well, actually you don't, but maybe together we can find better ways to solve stuff.

So we all know that there are no associative dictionaries in JS; what you get is really a hack on top of the Object 'class'. I dare to use the word class here, bah. Anyway, say you are passed an object as a parameter and you want to know whether it's empty before you scan it. And say the only way is the most straight-forward one:

function f(x) {
    var isEmpty = true;
    for (var i in x) { isEmpty = false; break; }
    return isEmpty;
}

You really have to iterate over the items in order to find out whether the dictionary is empty or not. Things like x == {} didn't work, though it was worth trying. And you cannot access anything like children, nodes, child or whatever to iterate over the keys on your own.

If you know any shorter and correct way to do it, I would really like to hear it.

Now there's this ugliness with the keys you put in the dictionary. For example:

f({bla:0}) will call f with a dictionary that contains the key "bla" with a value of 0. But what if you precede that call with a line like this:
var bla = "something";
f({bla:0})

Well, the people who really know JS well, or have fallen into this pit before, will know that the dictionary will look exactly the same as before. JS doesn't care whether you put any kind of quotes, if at all, around the name of the key; in an object literal the key is always taken literally. So if you want the key to come from a variable, you can't pass the dictionary inline as a parameter; you must build the whole dictionary before the call and pass it as a parameter.

var tmp = {};
var bla = "something";
tmp[bla] = 0;
f(tmp);

Another thing I really didn't like about JS is that you can start a regexp out of the blue: /bla/.exec … Now stop and think about this. This is not Perl, where regexps are really part of the language. This is an ugly way to create a regexp, and to think that you get an object from that thing and you can execute it.

Now I see this thing often: var myRE = new RegExp(/bla/);

Which is a bit better, but then why do you need the slashes to denote a regexp? You went that far for free. Sucker ;) But yes, it makes the code more readable, I agree with that.

Oh why, another lovely thing happened to me today when I was using some SOAP library written in JS to send a request to my server, back at work. There was some function which tried to serialize the parameters you pass to it into SOAP automagically, without knowing their types. Of course, as JS is a scripting language, we can know the types of the parameters passed to us easily, right? That's what I thought, until I saw that Safari doesn't declare a constructor for its Arrays as some people expect it to (or as some other browsers do). The code to get the type of a parameter:

(/function\s+(\w*)\s*\(/ig).exec(o[p].constructor.toString());

Again, my favorite regexp out of the blue. Leave that aside. See the way it gets the constructor (yes, objects apparently have those) and tries to get its string. Well, beats me why Safari returns an Object here where all the others return Array (in my specific case). But kill me if I know why this fugly hack is used rather than an elegant, safe:

instanceof (o[p]). toString();

Ok, I lied, this doesn't really work, and I wish it would. Unfortunately instanceof can be used only as a boolean operator kind of thing. Therefore:

if (o[p] instanceof Array) ...
if (o[p] instanceof Object) ...
and so on for Date, String, whatever.

So maybe there lies the answer: it's a piece of a few lines rather than one. But if you ask me, I would prefer the latter.
One more catch: if you test instanceof Object first, all types will return true for that one :)
Another point is that 'new Array' and '[]' are of the same type… Strong types, nay.
I forgot to mention that typeof returns 'object' for almost everything.

Overall, I really don't understand how web apps work. There are so many pits to fall into. It's really amazing how the world works with Standards Suggestions! Now don't get me started on CSS.

JavaScript Sucks

April 29th, 2008

I really know many languages pretty well, but this language is really ugly, or stupid, or whatnot. So many features are only "hacks", browsers do whatever they want with the code, each differently from the other, and there's chaos about JS everywhere you go.

For example, what we call a 'dictionary', which is an associative array, is a big hack in the language. It is practically an object on which you can set properties and then iterate over them. There's no formal way to remove a key from the dictionary like you would expect in a scripting language, by doing myDict.remove("key"); you have to do delete myDict.key. Not to mention how to know whether you have any keys in the dictionary at all, because who said you have the length property? Well, if you think you have it, then you're wrong; that's because you used an array as a dictionary instead of creating an object using { }.

Another thing I encountered was that if you have a dictionary whose last element ends with a trailing comma, then the browser (IE) will shout at you, while other browsers eat it just fine. It reminds me of macros in C/C++, where you don't know the originating code which caused the problem, since it gets compiled after it's substituted… So {a:1,} will kick (in IE).

Another ugly thing is this fake OOP. Now who are you kidding? There is a special use for the "this" keyword, but otherwise everything else is just nested functions, err, sorry, methods. This is another ugly hack, and some people even use inheritance. Do me a favor. The error-prone "class" that you declare will probably have memory leaks, because the methods were really defined as nested functions rather than using something like MyClassName.prototype.myMethodName, which would certainly work better and not get allocated per instance.

Did you say private member? Oh yeah, right. That's what you think, and this time you're right, because they are local variables of the "class", which is really a function that gets run when you create an instance. However, you don't have control over public/readonly, etc., which would be pretty useful. So the constructor is free of charge, because it's the code in the "class" function, where you also define the private variables. And I won't call them private members.

Now you say, "of course, there's no need for a destructor, a scripting language has a GC". Well, that's right, but when an element points to code, using onClick for example, and that handler has a variable that points to that same element, then you're in circular trouble ;) So this time you might want to have a destructor, right? Or some function that will be called on unload so you can null() a few variables to break the circular references… But yes, this problem might happen in many environments; Java, for the sake of conversation, solves this one, unlike Python, AFAIK.

Now why the heck do browsers need to compile (yes, in a way) the code??? We all just grew up into believing that's something normal, but stop and give it a thought. I guess those guys didn't hear about standards.

You can even open a new nested block using curly braces, but all the variables you declare there become globals anyway. So you end up having to delete some objects manually. Now don't start with "why would you want to delete a variable"; there are good reasons for that sometimes, and that's another story.

Did you know about the JavaScript compiler time machine? Ahh, of course not, let me show you:

var a = “DEFINED”;

function f() {
 alert(a);
 var a = 5;
}

Will this code snippet open an alert with the text "DEFINED"? No. Now keep on reading.

If you run that code snippet above, the alert will show "undefined" rather than "DEFINED". Now, the compiler, or whatever freak is under there, sees the 'a' which is really defined in the global scope, right? Yes, it is, seriously. But then it sees later on that an 'a' variable is declared inside the scope of the function 'f', so it hoists that local declaration to the top of the function, and at the time of the alert the local 'a' is still undefined. Make an experiment and remove the 'var' from the 'var a = 5;' line and see the results for yourself.

And there are more and more quirks in this language that I will leave for another time. So what do you think, is Silverlight the next best thing?

Signed Division In Python (and Vial)

April 25th, 2008

My stupid hosting company decided to move the site to a different server, blah blah. The point is that I lost some of the recent DB changes, and my email hasn't been working for a week now :(

Anyways, I am reposting it. The sad truth is that I had to find the post in Google's cache in order to restore it. Way to go.

Friday, April 18th, 2008:

As I was working on Vial to implement the IDIV instruction, I needed a signed division operator in Python. And since the x86 is 2's complement based, I first have to convert the number into a Python negative (from unsigned), and only then do the operation, in my case a simple division. It was supposed to be a matter of a few minutes to code this function, which gets the two operands of IDIV and returns the result, but in practice it took a few bad hours.

The conversion is really easy: say we mess with 8-bit integers, then 0xff is -1, and 0x80 is -128, etc. The equation to convert it to a Python negative is: val - (1 << sizeof(val)*8). Of course, you do that only if the most significant bit, the sign bit, is set. Eventually you return the result of val1 / val2. So far so good, but no: as I was feeding my IDIV with random input numbers, I saw that the result my Python code returns is not the same as the processor's. This was when I started to freak out, trying to figure out what was wrong with my very simple snippet of code. And alas, later on I realized nothing was wrong with my code; it's all Python's fault.

What's wrong with Python's divide operator? Well, to be strict, it does not round the negative result toward 0, but toward negative infinity. Now, to be honest, I'm not really into math stuff, but all x86 processors round negative numbers (and positive ones too, to be accurate) toward 0. So one would really assume Python does the same, as C would, for instance. The simple case that shows what I mean: 5/-3 in Python results in -2, rather than -1, which is what the x86 IDIV instruction returns and should return. And besides, -(5/3) is not 5/-3 in Python; now it's the time you say WTF. Which is another annoying point. But again, as I'm not a math guy, though I was speaking with many friends about this behavior, that equality (or to be accurate, inequality) is OK in real-world math. Seriously, what do we, coders, care about real-world math now? I just want to simulate a simple instruction. I really wanted to go and shout "hey, there's a bug in Python's divide operator", and how come nobody saw it before? But after some digging, this behavior is really documented in Python. As much as I and many other people I know would hate it, that's that. I even took a look at the source code of the integer division algorithm, and saw a 'patch' that floors the result if it is negative, because C89 doesn't define the rounding of negative quotients well enough.
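Just to have it all in one place, here is the behavior from an interactive prompt (I use // so the lines read the same under Python 2 and 3):

>>> 5 // -3      # Python rounds toward negative infinity...
-2
>>> -(5 // 3)    # ...so these two are not the same thing
-1
>>> 5 % -3       # and the modulo follows suit (IDIV gives -1 with remainder 2)
-1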

While you're coding something and you have a bug, you usually just start debugging your code, track it down, and then fix it easily while continuing to work on the code, because you're in the middle of the coding phase. Then there are those rare times when you really go crazy: you're absolutely sure your code is supposed to work (which it does not), and then you realize that the layer you trusted is broken (in a way). You really want to kill someone… being a good guy, I won't do that.

Did I hear anyone say modulo? Oh, don't even bother; this time I think Python returns the (mathematically) expected result rather than the CPU's. But what does it matter now? I only want to imitate the processor's behavior. So I had to hack that one too.

The solution, after all, was to take the absolute values of the Python negative numbers and remember the original signs; we do that for both operands. Then we do an unsigned division, and if the signs of the inputs are not the same, we negate the result. This works because we know that the unsigned division behaves like the processor's, so we can use it safely.

res = x/y; if (sign_of_x != sign_of_y) res = -res;
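Putting it all together, here is a minimal toy version of the idea in Python. This is just how I would sketch it for 32-bit operands; it is not Vial's actual code, and the real IDIV divides EDX:EAX and also raises an exception on quotient overflow, which I ignore here:

def idiv32(dividend, divisor):
    # Interpret the raw 32-bit values as 2's complement signed integers.
    def to_signed(v):
        return v - (1 << 32) if v & 0x80000000 else v
    a, b = to_signed(dividend), to_signed(divisor)
    if b == 0:
        raise ZeroDivisionError("IDIV #DE")
    # Divide absolute values so Python's floor rounding can't kick in,
    # then restore the signs, i.e. truncate toward zero like the CPU.
    q, r = abs(a) // abs(b), abs(a) % abs(b)
    if (a < 0) != (b < 0):
        q = -q
    if a < 0:            # the remainder takes the sign of the dividend
        r = -r
    # Back to raw 32-bit values.
    return q & 0xffffffff, r & 0xffffffff

For example, idiv32(5, 0xfffffffd) returns (0xffffffff, 2), which is -1 with a remainder of 2, just like the processor.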

The bottom line is that I really hate this behavior in Python, and it's not even a bug, after all. I'm not sure how many people like me have encountered this issue, but it's really annoying. I don't believe they are going to change it in Python 3; you never know, though.

Anyway, I got my IDIV working now, and that was the last instruction I had to cover in my unit tests. Now it's analysis time :)

Debugging Symbols

April 5th, 2008

We all like to use PDB files when they are available. Sometimes we have to get them from MS over the Internet. Usually I download the whole symbol package for my current OS and am done with it. The problem is that sometimes, after updates and the like, they become out of date, and then the files I have are not relevant anymore. And as I use WinDbg, I decided to set the symbol path variable in the system environment, to have it work for other apps as well. I really ask myself how come I hadn't done it before, because at work I have been using it for a long time now…

Anyhow, I set that variable to: SRV*http://msdl.microsoft.com/download/symbols;C:\symbols;

And I was happy then that everything loaded automatically. Afterwards I noticed that every time I started debugging code in MSVS, it accessed the inet for something, blocked the whole application for a few seconds, and only then resumed starting my application and let me debug it. The point is that it was my own application, with full source and everything, and it still accessed the inet every time, maybe to check timestamps of loaded modules, etc. It even caused some problems, like claiming that my source isn't the same as the binary I want to debug, and it wouldn't let me use the source when debugging that code. So after a few rounds of this same confusion, I couldn't continue working like that anymore, and I tried to think about what I had changed that caused this weird debugging behavior. Only then did it come to my mind that I had added that extra variable to the environment. So I took a look at the variable and, with a hint from a friend, I swapped the places of the local symbols directory and the http address. Since then I don't have any weird trips to the inet to get/check PDBs when it's not required, and everything runs as fast as before.
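For reference, the swapped value would look something like this (the canonical downstream-store form, SRV*C:\symbols*http://msdl.microsoft.com/download/symbols, achieves the same while also caching downloaded PDBs into C:\symbols):

C:\symbols;SRV*http://msdl.microsoft.com/download/symbols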

That's it, just wanted to share this issue. Of course, I don't blame any application for using the first address in the variable first, because it's up to the user to define the priorities. It's just that I didn't think it would matter… I learned I was wrong.

AAA/AAS Quirk

April 3rd, 2008

As I was working on the simulation of these two instructions, I found that they have a quirk: although the algorithms for these instructions are described in Intel's specs, which would seem to make the output defined for all inputs, that is not the case. Every time I finish writing an implementation of a specific instruction, I add that instruction to my unit tests. The instruction is simulated with random (and some smarter) input and then checked against pure native execution to see whether the results are correct. This way I found, for a range of inputs, a quirk that reveals how the instruction is really implemented (microcode stuff, probably) rather than how it's documented.

AL = AL + 6 is done when AF is set or the low nibble of AL is above 9. According to the documentation the destination register is AL, but in reality the destination register is AX. Now how do we know such a thing?

If we try the following input:
mov al, 0xff
aaa

The result will be 0x205, rather than 0x105 (which is what we expect according to the docs).

What really happens is that we supply a number which, when added to 6, creates a carry into AH, thus incrementing AH by 1. Then, looking at the docs again, we see that when 6 is added to AL, AH is also explicitly incremented by 1. Thus AH is really incremented by 2. :P
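Here is a tiny Python model of AAA the way it apparently behaves. This is my own reconstruction from the observation above, not official pseudo-code, but it does reproduce the example:

def aaa(ax, af):
    # The quirk: the "+ 6" is applied to the whole AX, not only to AL,
    # so a carry out of AL propagates into AH on its own.
    al = ax & 0xff
    if (al & 0x0f) > 9 or af:
        ax = (ax + 6) & 0xffff       # carry from AL leaks into AH here
        ax = (ax + 0x100) & 0xffff   # the documented explicit AH += 1
        af = cf = 1
    else:
        af = cf = 0
    ax &= 0xff0f                     # AL's high nibble is cleared
    return ax, af, cf

# aaa(0x00ff, 0) returns (0x0205, 1, 1), matching the example above.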

The question is why they do AX = AX + 6 rather than operating only on AL. No, actually the bigger question is why I get this same behavior on an AMD processor (whereas I work on an Intel processor). And we already know from my last post about SHLD that they don't behave the same in some undefined-behavior aspects (although some people believe AMD copied Intel's architecture and implementation)…

There might be some people who will say that I went too far with testing this instruction, because I supply, somewhat, an input which is not in the valid range (it's unpacked BCD, after all), and therefore I must not rely on the output. The thing is, the algorithm is defined well enough to receive any input I pass it, hence I expect it to work even for undefined input. Though I believe there is no such thing as undefined input, only undefined output, and that's why I implemented my instruction the way they both did. Specifically, neither of them states anything about undefined input/output here, which makes my case stronger. Anyway, the point is they don't tell us something here; the implementation is not the one documented in either the AMD or the Intel docs.

This quirk works the same for AAS, where instead of doing AL = AL - 6, it's really AX = AX - 6. I also tried to see whether they operate on the whole of EAX, but I saw that the high word wasn't changed (by carry/borrow). And I also checked whether this ill behavior exists in DAA/DAS, but no.

Shift Double Precision

March 29th, 2008

If you were to ask me, I have no idea why Intel supports shift double precision in the 80x86. Probably their answer would be "because it used to be a CISC processor". The shift double precision is a pretty easy algorithm to implement, but maybe it was popular back then and they decided to support it in hardware. Like now, when they add very important instructions to the SSE sets. Even so, everyone (me included) seems to implement the algorithm like this:

(a << c) | (b >> (32-c))

Where a and b are the 32-bit input variables (/registers) and c is the count. The code shows a shift-left double precision; shifting right would require flipping the direction of each of the two shifts. However, if a and b are 16 bits, the second shift amount changes to (16-c). And now there is a problem. Why? Because we might enter the magical world of undefined behavior. And why is that? Because the first thing the docs say about the shift/rotate instructions is that the count operand is masked to keep only the 5 least significant bits. This is because the largest shift amount for a 32-bit input is 32 shifts (and then you get 0; ignore SAR for now). And if the input is 16 bits, the count is still masked with 31. That means you can shift a 16-bit register by more than its size, which doesn't make much sense, but is possible with the other shift instructions. But when you use a shift double precision, not only does it not make sense, it is also undefined. That is, the result is undefined, because you try to move bits from b into a, but the count becomes negative. For example: shld ax, bx, 17. Internally the second shift amount is calculated as (16-c), which becomes (16-17). And that's bad, right?

In reality everything is defined when it comes to digital logic, even the undefined stuff. There must be a reason for the result I get from executing an instruction like the one in the example above, even though it's correctly and officially undefined. And I know there is a rationale behind it, because the result is consistent (at least on my Intel Core2Duo processor). So, being the stubborn person I am, I decided I wanted to know how that calculation is really done at the hardware level.

I forgot to mention that the reason I care how to implement this instruction is that I have to simulate it for the Vial project. I guess eventually it's a waste of time, but I really wanted to know what's going on anyway. Therefore I decided to research the matter and come up with the algorithm my processor uses. Examining the results of the officially undefined cases, I quickly managed to see how to calculate the shift the way the processor does, and it goes like this for 16-bit input (I guess it will work the same for 8-bit input as well; note that 32-bit input can't have an undefined range, because you can't get a negative shift amount):

def shld(a, b, c):
    c &= 31
    if c <= 15:
        return ((a << c) | (b >> (16-c))) & 0xffff
    else:
        # Undefined behavior:
        c &= 15
        return ((b << c) | (a >> (16-c))) & 0xffff

Yes, the code is in Python. But you can see that if the count is bigger than 15, we swap the order of the inputs. And then comes the part where you say "NOW WTF?!". Even though I got this algorithm to return the same results as the processor does for both defined and undefined input, I would wager the processor doesn't do this kind of stuff internally. So I sat down some (long) more, stared at the code, and did a few experiments here and there. Eventually it occurred to me:

def shld(a, b, c):
    c &= 31
    x = a | (b << 16)
    return ((x << c) | (x >> (32-c))) & 0xffff

Now you can see that the input for the original equation is one single bit buffer which contains both inputs together. Taking a count of 17 won't yield a negative shift amount, but something else. Anyway, I have no idea why they implemented this instruction the way they did (and it applies to SHRD as well), but I believe it has something to do with the way their processor's so-called 'engine' works and other hardware stuff.
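Just to make it tangible, with either of the Python snippets above: shld(0x1234, 0x5678, 4) returns 0x2345 (a regular, defined case), while shld(0x1234, 0x5678, 17) returns 0xacf0, a value which is of course only meaningful under the model above.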

After I learned how it works, I was so eager to see how it works on AMD. And guess what? They don't do it the same way when it comes to the undefined behavior, of course. And since I don't have an AMD machine anymore, I didn't get to see how they really implemented their shift double precision instructions.

In the Vial project, where I simulate these instructions, I added a special check for the count, to see whether it's bigger than the input size, and if it is, I mark the destination register and some of the flags as Undefined. This way, when I do code analysis, I will know that something is really wrong/buggy with the way the application works. Now, what if the application purposely uses the undefined behavior? Screw us both then. Now why would a sane application do that? Ohh, that's another story…

By the way, the other shift/rotate instructions don't have any problem with the shift amount, since they can't yield a negative shift amount internally in any way; therefore their results are always defined for every input.