Open main menu

Difference between revisions of "HOWTO-Reverse Engineering"

m
Spelling, grammar, and wording fixes
m (Formatting fix for list in the DosBox Debugger section)
m (Spelling, grammar, and wording fixes)
Line 19: Line 19:


== Using the DosBox Debugger ==
== Using the DosBox Debugger ==
It's up to the individual whether they want to use a debugger when reverse engineering a program. Some prefer a more cerebral challenge of only figuring out code execution using a debugger, whereas others may find using a debugger useful for figuring out what values are passed to functions. I would recommend using a debugger particularly when reversing a game for the purpose of adding ScummVM support.  When you start implementing code to implement game functionality once you've got portions of the game disassembled, it can be immensely useful for tracking down bugs, particularly if you initial implement your methods to closely match the methods in the disassembly.
It's up to the individual if you want to use a debugger when reverse engineering a program. Some prefer a more cerebral challenge of only figuring out code execution using a decompiler tool, whereas others may find using a debugger useful for figuring out what values are passed to functions. I would recommend using a debugger particularly when reversing a game for the purpose of adding ScummVM support.  When you start implementing code to implement game functionality, once you've got portions of the game disassembled, it can be immensely useful for tracking down bugs. Particularly if you initially implement your methods to closely match the methods you give the methods in the disassembly.


For debugging purposes, if the game is a DOS game the DosBox Debugger is the best tool I've found for executing and debugging DOS programs. The default distribution of DosBox doesn't have it enabled, but you can either compile DosBox with it enabled, or download a previously compiled executable. See the [http://vogons.zetafleet.com/viewtopic.php?t=7323 DosBox Debugger Thread] for more information.
For debugging purposes, if the game is a DOS game, the DosBox Debugger is the best tool I've found for executing and debugging DOS programs. The default distribution of DosBox doesn't have it enabled, but you can either compile DosBox with it enabled, or download a previously compiled executable. See the [http://vogons.zetafleet.com/viewtopic.php?t=7323 DosBox Debugger Thread] for more information.


One of the biggest initial steps when using the DosBox debugger is matching addresses in executable at run-time with your disassembly in IDA. This can be done either from the debugger to IDA, or from IDA to the debugger:
One of the biggest initial steps when using the DosBox debugger is matching addresses in executable at run-time with your disassembly in IDA. This can be done either from the debugger to IDA, or from IDA to the debugger:
Line 35: Line 35:
- look at the IDA view to find out the current file offset at the bottom of the screen. You'll quickly find it if you try selecting different instructions, since it will keep changing. Now:
- look at the IDA view to find out the current file offset at the bottom of the screen. You'll quickly find it if you try selecting different instructions, since it will keep changing. Now:


- Get the value from the beginning of the segment. This is just to make the calculations easier, since the start of the segment will have an instruction offset between 0h and 0Fh, which means it won't be messing with  
- Get the value from the beginning of the current code segment. This is just to make the calculations easier, since the start of the segment will have an instruction offset between 0h and 0Fh, which means it won't be messing with  
our segment calculations
our segment calculations


Line 52: Line 52:


=== Naming Methods ===
=== Naming Methods ===
Methods can be renamed using the general 'N' hotkey (as well as via the menus), and the 'Y' can be used to specify a C-like prototype for a method. This is particularly useful when some of the parameters for a method are passed using registers. By explicitly documenting what the method expects, it makes it easier to remember later on when you're reversing methods that call it. If a method does have parameters passed in registers, prototypes like the below can be used:
Methods can be renamed using the general 'N' hotkey (as well as via the menus), and the 'Y' can be used to specify a C-like prototype for a method. This is particularly useful when some of the parameters for a method are passed using registers. By explicitly documenting what the method expects, it makes it easier to remember later on when you're reversing methods that call it. Standard methods where parameters are passed via the stack are easy, since IDA can automatically set up the function prototype for you. If a method does have parameters passed in registers, prototypes like the below can be used:
<syntax type="C">
<syntax type="C">
int __usercall sub_100FB<ax>(__int8 param1<al>, int param2<bx>)
int __usercall sub_100FB<ax>(__int8 param1<al>, int param2<bx>)
</syntax>
</syntax>


In this case, the method takes an 8-bit parameter in al, and another 16-bit value in bx, then returns a result in the ax register.
In this case, the method takes an 8-bit parameter in the al register, and another 16-bit value in bx, then returns a result in ax


=== Using Structures ===
=== Using Structures ===
Line 69: Line 69:
</syntax>
</syntax>


In this case, an initial index in the ax register is multiplied by 30h (30 hexadecimal = 48 decimal). So from this we can determine that the given structure is 48 bytes in size. For smaller sized structures, you may want to create as many 2 byte word fields as needed to make up the correct size for the structure. For larger sizes, the easiest way is to simply declare an array of the needed structure size - 1, and follow it with a single byte field. You can then delete/undefine the array. The remaining byte will keep the structure at the correct size, and you can then later fill in the fields as you find references to them.
In this case, an initial index in the ax register is multiplied by 30h (30 hexadecimal = 48 decimal). So from this we can determine that the given structure is 48 bytes in size, and can create a new structure accordingly. For smaller sized structures, you may want to create as many 2 byte word fields as needed to make up the correct size for the structure. For larger sizes, the easiest way is to simply declare an array of the needed structure size - 1, and follow it with a single byte field. You can then delete/undefine the array. The remaining byte will keep the structure at the correct size, and you can then later fill in the fields as you find references to them.


Secondarily, the value of '2D00h' indicates an offset in the data segment, representing the rough starting address of the first element of the given array in memory. Here we run into a minor problem. The offset of '2D00h' may not indicate the precise start of the array. If the code in question wanted to get the value at offset 8 in the structure, then the array may actually start 8 bytes earlier in memory, at address '2CF8h'.
Secondarily, the value of '2D00h' indicates an offset in the data segment, representing the rough starting address of the first element of the given array in memory. Here we run into a minor problem. The offset of '2D00h' may not indicate the precise start of the array. If the code in question wanted to get the value at offset 8 in the structure, then the array may actually start 8 bytes earlier in memory, at address '2CF8h'.


In such cases, the only way to tell for sure is to start searching for immediate values in the program of values one byte backwards at a time, until you can't find any values. For example, if you find references in the code of values '2CFFh', and '2CFEh', each with previous multiplications by 30h/48, but none for '2CFDh', then you can probably be confident that the array starts at offset 2CFDh.  
In such cases, the only way to tell for sure is to start searching for immediate values in the program of values bytes backwards at a time, until you can't find any more values. For example, if you find references in the code of values '2CFFh', and '2CFEh', each with previous multiplications by 30h/48, but none for '2CFDh', '2CFDCh', or '2CFBh', then you can probably be confident that the array starts at offset 2CFDh.  


Once that's determined, you can then create a dummy structure of the correct size, and convert the given address of 2CFDh to an instance of that structure type. Until you're more familiar with the range of values the original array index may be, it'll likely be easier to simply leave the defined array with the single index. Later on, you can always change the structure in memory to specify how many elements it has later on.
Once that's determined, you can then create a dummy structure of the correct size, and convert the given address of 2CFDh to an instance of that structure type. Until you're more familiar with the range of values the original array index may be, it'll likely be easier to simply leave the defined array with the single index. Later on, you can always change the structure in memory to specify how many elements it has later on.
Remember that fields in structures can vary in size, so it's always possible you'll get the starting address wrong. In which case, you may have to later on correct the address of the structure in the data segment. This will affect any fields you figure out as well. In the above example, if you mistakenly presumed the array started at offset 2CFDh, then 2D00h would be thought to be a field at offset 3 in the structure (2CFDh + 3 = 2D00h, as per the above example code fragment). However, if the array structure really starts at 2CF8h, then the same field should be at offset 8 within the structure (2CF8h + 8 = 2D00h). So you need to rebuild the list of fields you'd figured out in the structure, since they'll all be at the wrong position. Overall, it's better when encountering an array to spend the extra time to ensure where it starts in memory so you don't need to fix offset problems later on.


== Disassembly Strategies ==
== Disassembly Strategies ==
Line 81: Line 83:


=== File Access ===
=== File Access ===
One of the easiest places to start a disassembly is generally by identifying file accesses. Using IDA, you can, for example, do a text search for 'open file' to find occurrences of file opening. Likewise for file reading, writing, and closing. Normally, a program will encapsulate these calls into a method of it's own, so your first disassembly step can be in identifying the methods and naming them appropriately with names like 'File_open', 'File_read', and so on. Likewise, giving the passed parameters an appropriate name. In IDA, the 'Y' command can be used to set up an appropriate method signature for methods. By properly naming the method and it's parameters, this will help you in all the methods that call those methods.  
One of the easiest places to start a disassembly is generally by identifying file accesses. Using IDA, you can, for example, do a text search for 'open file' to find occurrences of file opening. IDA provides standard comments for many operating system calls, so even in a new disassembly you should be able to locate such calls by their comment text. Likewise for file reading, writing, and closing. Normally, a program will encapsulate these calls into a method of it's own, so your first disassembly step can be in identifying the methods and naming them appropriately with names like 'File_open', 'File_read', and so on. Likewise, giving the passed parameters an appropriate name. In IDA, the 'Y' command can be used to set up an appropriate method signature for methods. By properly naming the method and it's parameters, this will help you in all the methods that call those methods.  


For example, if a read method has a 'size' parameter and a 'buffer' parameter, then if a method that calls passes '200' for the size, and a reference into the stack, you can be confident that the stack entry can be called something like 'readBuffer', and use the '*' (array size) key when looking at the Stack View to set the size of the array to 200 bytes.
For example, if a read method has a 'size' parameter and a 'buffer' parameter, then if a method that calls it passes '200' for the size, and a reference from a location on the stack, you can be confident that the stack entry can be called something like 'readBuffer', and use the '*' (array size) key when looking at the Stack View (Ctrl-K) to set the size of the array to 200 bytes.


You should hopefully then be able to start working on methods that call the file access functions and hopefully start decoding them. Some examples:
You should hopefully then be able to start working on methods that call the file access functions and hopefully start decoding them. Some examples:
Line 89: Line 91:
1. If the game consists of only a few large data files, the methods that call the open/read/close functions may a resource manager responsible for loading subsets of the file. In which case, the methods may likely load some kind of index into memory and then have a separate 'get resource' method that scans through the list for a resource with a given Id, resulting in a specific portion of the data file being read. In this case, you can identify all the methods with appropriate names like 'ResourceManager_init', 'ResourceManager_loadIndex', 'ResourceManager_getResource', and so on.
1. If the game consists of only a few large data files, the methods that call the open/read/close functions may a resource manager responsible for loading subsets of the file. In which case, the methods may likely load some kind of index into memory and then have a separate 'get resource' method that scans through the list for a resource with a given Id, resulting in a specific portion of the data file being read. In this case, you can identify all the methods with appropriate names like 'ResourceManager_init', 'ResourceManager_loadIndex', 'ResourceManager_getResource', and so on.


The DosBox debugger may prove useful when dealing with games using large resources - try using 'BPINT 21 42' to put a breakpoint on any calls to seek within a file. By clearly identifying the 'Seek method', you can then step out of that routine to find what called it. Hopefully, you can then examine the logic in the disassembly used to generate the file offset to help you figure out how file offsets are generated for specific resources, and from that figure out how the resource's index works.
The DosBox debugger may prove useful when dealing with games using large resources. In DOS, Interrupt 21h is one of the primary system interrupts. Specific command Ids are passed in AH, and the other registers are set with values depending on which function is being called. For example, command 42h of Interrupt 21h is the command for seeking within a file.
Try using 'BPINT 21 42' to put a breakpoint on any calls to seek system function. By clearly identifying the 'Seek method', you can then step out of that routine to find what called it. Hopefully, you can then examine the logic in the disassembly used to generate the file offset to help you figure out how file offsets are generated for specific resources, and from that figure out how the resource's index works.


Remember that game resource managers not only typically merge multiple individual resources into one single bigger file, they frequently also compress them as well, to save space and prevent people from seeing textual resources when viewing the contents of the file. If the resource manager tends to If you can figure out the strategy used for extracting single resources, it may be worthwhile taking the time to code a standalone program to extract and, if necessary, decompress single resources into separate output files. That way, you can more easily look at individual resources that are used by the game without having to worry about manually locating them in the archive/resource file.
Remember that game resource managers not only typically merge multiple individual resources into one single bigger file, they frequently also compress them as well, to save space and prevent people from seeing textual resources when viewing the contents of the file. In such cases, if you can figure out the strategy used for extracting single resources, it may be worthwhile taking the time to code a standalone program to extract and, if necessary, decompress single resources into separate output files. That way, you can more easily look at individual resources that are used by the game without having to worry about manually locating them in the archive/resource file.


2. If the game consists of many different files, it's likely the game will be manually calling the open/read/close methods whenever it wants to access a particular file/resource.  
2. If the game consists of many different files, it's likely the game will be manually calling the open/read/close methods whenever it wants to access a particular file/resource.  
Line 98: Line 101:


=== Graphics access ===
=== Graphics access ===
Another place to get started on the disassembly is the graphic draw routines, those responsible for copying raw pixels to the screen surface. Graphic display is complicated in the early PC days by different modes writing to memory in different ways. In the Monochrome/Hercules mode, for example, 8 pixels are stored per 8-bit byte. In EGA, the addressing can be complicated by how the display is configured - the same areas of memory may  
Another place to get started on the disassembly is the graphic draw routines, those responsible for copying raw pixels to the screen surface. Graphic display is complicated in the early PC days by different modes for the different graphics cards writing to memory in different ways. In the Monochrome/Hercules mode, for example, 8 pixels are stored per 8-bit byte. In EGA, the addressing can be complicated by how the display is configured - the same areas of memory may  
be used to represent different parts of pixels - what part of a pixels index being updated depending on specific values sent to hardware ports. Finally, of them all, the most common 3200x200x256 colour mode is the easiest to deal with, with each pixel taking up a single byte.
be used to represent different parts of pixels - what part of a pixels index being updated depending on specific values sent to hardware ports. Finally, of them all, the most common 3200x200x256 colour mode is the easiest to deal with, with each pixel taking up a single byte.


For most of the graphics mode, you can look at them in a similar manner - as a block of data in memory starting at offset A000h:0, and a number of bytes per line that will vary per line. Assembly routines that deal with the graphics sreen will typically have code to figure out screen offsets based on a passed and y parameter, so it will frequently be easy to identify the parameters and figure out how the screen offsets work. For example, in 320x200x256 MCGA mode, an offset will be calculated using the formula (y * 320) + x.
For most of the graphics mode, you can look at them in a similar manner - as a block of data in memory starting at offset A000h:0. Only the number of bytes per line will vary, depending on what the graphics mode is. Assembly routines that deal with the graphics screen will typically have code to figure out screen offsets based on a passed x and y parameter, so it will frequently be easy to identify the parameters and figure out how the screen offsets work. For example, in 320x200x256 MCGA mode, an offset on the screen will be calculated using the formula (y * 320) + x.


For finding the graphic routines you have two options. The first is to entirely use IDA, and simply search for immediate values of 'A000h'. Since this is the area of memory graphics and displayed in, it can be a quick way to locate graphic routines.
For finding the graphic routines you have two options. The first is to entirely use IDA, and simply search for immediate values of 'A000h'. Since this is the area of memory graphics are commonly displayed in, it can be a quick way to locate graphic routines.


The other alternative is to use the DosBox Debugger. It has a use command called 'bpm' that allow you to set a memory breakpoint, which then gets triggered if the given memory address changes. So you could do 'bpm A000:0' to set a breakpoint on the first byte of the screen memory (i.e. the top left hand corner of the screen). Then whichever routine modifies it first will trigger the breakpoint. Using the previous discussed techniques, you can find the same place in your IDA disassembly, and look into reversing that method first. Generally speaking, it will likely be a clear screen or fill rectangle routine.
The other alternative is to use the DosBox Debugger. It has a use command called 'bpm' that allow you to set a memory breakpoint, which then gets triggered if the given memory address changes. So you could do 'bpm A000:0' to set a breakpoint on the first byte of the screen memory (i.e. the top left hand corner of the screen). Then whichever routine modifies it first will trigger the breakpoint. Using the previous discussed techniques, you can find the same place in your IDA disassembly, and look into reversing that method first.  


It will be likely that realted functions will be next to each other, so once you've looked into the given identified function, you may also be able to review previous or following functions to see if they have identifiable graphic routines.
It will be likely that related functions will be next to each other, so once you've looked into the given identified function, you may also be able to review previous or following functions to see if they have identifiable graphic routines.


=== Data Segment strings ===
=== Data Segment strings ===
The strings in the data segment can be an excellent source for identifying the purposes. If you're very lucky, there may be error messages that contain the name of the function as part of the error. In which case, you can then out what code references it, and name the method appropriately.
The strings in the data segment can be an excellent source for identifying the purposes of various methods. If you're very lucky, there may be error messages that contain the name of the function as part of the error. In which case, you can then find out what code references it, and name the methods appropriately.


Note: IDA is good, but it's not perfect., It's not always guaranteed to be able to figure out that a given value loaded into a register somewhere in the program is for a reference into the data segment. As such, if the cross-reference command doesn't give any references for a given string, try searching for an immediate value of the offset of the string. Chances are, any reference you find is likely pointing to the string. You can then use the 'O' command to change the operand from an immediate value to instead point to the offset in the data segment.
Note: IDA is good, but it's not perfect., It's not always guaranteed to be able to figure out that a given value loaded into a register somewhere in the program is for a reference into the data segment. As such, if the cross-reference command doesn't give any references for a given string, try searching for an immediate value of the offset of the string. Chances are, any reference you find is likely pointing to the string. You can then use the 'O' command to change the operand from an immediate value to instead point to the offset in the data segment.


Even if any error messages don't contain method names, an error message can prove invaluable. For example, an error message like "Unable to initialise mouse" tells you that whatever method uses it is seting up the mouse for access, so could be given a name like 'initialise_mouse', or 'initialise_events'.  
Even if any error messages don't contain method names, an error message can prove invaluable. For example, an error message like "Unable to initialise mouse" tells you that whatever method uses it is setting up the mouse for access, so could be given a name like 'initialise_mouse', or 'initialise_events'.  


Likewise, have a look at the context of what needs to happen for the error message to be printed, since the message can give you insights into what is being done. A message like "No more free inventory slots" tells you that the code that references this error message is likely a routine for adding an item to the inventory (hence the error if no more slots are available).  From there you might be able to identify the area of memory containing the inventory list, and then cross reference other methods that also access it - you could end up identifying a whole group of methods related to inventory manipulation.
Likewise, have a look at the context of what needs to happen for the error message to be printed, since the message can give you insights into what is being done. A message like "No more free inventory slots" tells you that the code that references this error message is likely a routine for adding an item to the inventory (hence the error if no more slots are available).  From there you might be able to identify the area of memory containing the inventory list, and then cross reference other methods that also access it - you could end up identifying a whole group of methods related to inventory manipulation.


Another possibility for data segment strings is to pick strings that may be descriptive enough to guess their puprose. For example, if a string has a name like 'FNT20', or 'UI.FNT', it's likely that it's a font file, containing the images for each character for use in displaying on the screen. In that case, any code that uses it is likely to be passing it to a 'font_open' function, which loads up the font. If you can disassemble that method, you may be able to determine how the loaded font is stored in memory. From there, you can use the cross references function of IDA to find other methods that use the same memory address as where the font is stored, and which will likely give you the methods for actually displaying text on the screen.
Another possibility for data segment strings is to pick strings that may be descriptive enough to guess their purpose. For example, if a string has a name like 'FNT20', or 'UI.FNT', it's likely that it's a font file, containing the images for each character for use in displaying on the screen. In that case, any code that references it is likely to be passing it to a 'font_open' function, which loads up the font. If you can disassemble that method, you may be able to determine how the loaded font is stored in memory. From there, you can use the cross references function of IDA to find other methods that use the same memory address as where the font is stored, and which will likely give you the methods for actually displaying text on the screen.


From there, you may be able to go even further, and start figuring out methods that call the 'write string' function.. menus, hotspots, conversation handlers, and so forth.  
From there, you may be able to go even further, and start figuring out methods that call the 'write string' function.. menus, hotspots, conversation handlers, and so forth.  


=== Program execution ===
=== Program execution ===
Another method for identifying methods may be simply executing the program itself. When running the program in DosBox,you may find it useful to start stepping through the main procedure to see what happens from stepping over every method. IDA may be helpful in identifying the main procedure, but even if not, most programming languages have a series of method calls for setting up the initial application state,and then a single final call to the 'main' method.
Another method for identifying methods may be simply executing the program itself. When running the program in DosBox, you may find it useful to start stepping through the main procedure to see what happens from stepping over every method. IDA may be helpful in identifying the main procedure, but even if not, most programming languages have a series of method calls for setting up the initial application state,and then a single final call to the 'main' method.


With the main method identified, stepping over each method may produce interesting results. For example, a single method call may contain all the code for showing the game's introduction sequence. If this can be identified, you could name the method appropriately. You may then be able to gleam information from how the method is called and for the method itself:
With the main method identified, stepping over each method may produce interesting results. For example, stepping over a single method call may run all the code for showing the game's introduction sequence before returning control to the debugger. If this can be identified, you could name the method appropriately. You may then be able to gleam information from how the method is called and for the method itself:


For where the method is called, there may be conditional checks to see whether the method is called or not. For example, some games may have a stored settings file to flag whether the introduction has been shown, and not show it again after the first time the game is played..  
For where the method is called, there may be conditional checks to see whether the method is called or not. For example, some games may have a stored settings file to flag whether the introduction has been shown, and not show it again after the first time the game is played..  
Line 133: Line 136:
Within the method itself, you may likewise be able to figure out further specific details of how the method is implemented from the method calls it makes. For example, a 'play introduction' method may consist of calling the same method multiple times with different parameter values. These values may well be offsets within the data segment for resource or file names for specific animations to run. In which case, you now know the sub-method is an animation player, and name it accordingly. You could then start work on the animation player, figuring out how it loads it in data, and what method it uses to build up and display graphics on the screen.
Within the method itself, you may likewise be able to figure out further specific details of how the method is implemented from the method calls it makes. For example, a 'play introduction' method may consist of calling the same method multiple times with different parameter values. These values may well be offsets within the data segment for resource or file names for specific animations to run. In which case, you now know the sub-method is an animation player, and name it accordingly. You could then start work on the animation player, figuring out how it loads it in data, and what method it uses to build up and display graphics on the screen.


Particularly for cases like that, identifying and naming the graphic/screen methods may be helpful, since you could work the disassembly from both the front end reading the animation, and from the low level drawing of the graphics.
Particularly for cases like that, identifying and naming the graphic/screen methods may be helpful, since you could work the disassembly from both the front end reading the animation, and from the low level drawing of the graphics of the animation.


== Final Words ==
== Final Words ==
265

edits