Revision as of 13:13, 7 January 2009
WORK IN PROGRESS. DOCUMENT INCOMPLETE
The Sierra PMachine
Original document by Lars Skovlund, Dark Minister and Christoph Reichenbach
This document describes the design of the Sierra PMachine (the virtual CPU used for executing SCI programs). It is a special CPU in the sense that it is designed for object-oriented programs. There are three kinds of memory in SCI: variables, objects, and stack space. The stack space is used in a Last-In-First-Out manner, primarily as temporary space within a routine and for passing data from one routine to another. Note that the original interpreter uses the stack space bottom-up, instead of the more usual top-down. I don’t know if this has any significance for us.
Scripts are loaded into the PMachine by creating a memory image of them on the heap. For this reason, the script file format may seem a bit obscure at times: it is optimized for in-memory performance, not readability. It should be mentioned here that the interpreter performs a lot of fixups. In the script files, all addresses are specified as script-relative; these are converted to absolute offsets when the script is loaded. The species and superClass fields of all objects are converted into pointers to the actual class etc.
There are four types of variables. These are called global, local, temporary, and parameter. All four types are simple arrays of 16-bit words. A pointer is kept for each type, pointing to the list that is currently active. In fact, only the global variable list is constant in memory. The other pointers are changed frequently, as scripts are loaded/unloaded, routines called, etc. The variables are always referenced as an index into the variable list. I’ll explain the four types below - the names in parentheses will be used occasionally in the rest of the text:
Local variables (LocalVar)
This variable type is called "local" because it belongs to a specific script. Each script may have its own set of local variables, defined by script block type 10. As long as the code from a specific script is running, the local variables for that script are "active" (pointed to by the mentioned pointer).
Global variables
These, like the local variables, reside in script space (in fact, they are the local variables of script 0!). But the pointer to them remains constant for the whole duration of the program.
Temporary variables
These are allocated by specific subroutines in a script. They reside on the PMachine stack and are allocated by the link opcode. The temp variables are automatically discarded when the subroutine returns.
Parameter variables
These variables also reside on the stack. They contain information passed from one routine to another. Any routine in SCI is capable of taking a variable number of parameters, if need be. This is possible because a list size is pushed as the first thing before calling a routine. In addition to this, a frame size is passed to the call* functions.
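The four variable lists can be pictured as one small array of pointers, each pointing to the list that is currently active. The following is only a sketch under my own assumptions: the names `VarLists`, `var_read` and `var_write` are hypothetical, and the order of the four types in the enum is chosen for illustration, not taken from the interpreter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the order of the four types is an assumption. */
enum { VAR_GLOBAL, VAR_LOCAL, VAR_TEMP, VAR_PARAM, VAR_TYPE_COUNT };

typedef struct {
    uint16_t *vars[VAR_TYPE_COUNT]; /* one pointer per variable type */
} VarLists;

/* Read variable `index` from the currently active list of `type`. */
static uint16_t var_read(const VarLists *v, int type, unsigned index) {
    assert(type >= 0 && type < VAR_TYPE_COUNT);
    return v->vars[type][index];
}

/* Write variable `index` in the currently active list of `type`. */
static void var_write(VarLists *v, int type, unsigned index, uint16_t value) {
    assert(type >= 0 && type < VAR_TYPE_COUNT);
    v->vars[type][index] = value;
}
```

Switching to the local variables of another script then amounts to changing `vars[VAR_LOCAL]`, while `vars[VAR_GLOBAL]` stays put for the whole run.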
Objects
While two adjacent variables may be entirely unrelated, the contents of an object are always related to one task. The object, like the variable tables, provides storage space. This storage space is called properties. Depending on the instructions used, a property can be referred to by index into the object structure, or by property IDs (PIDs). For instance, the name property has the PID 17h, but the offset 6. The property IDs are assigned by the SCI compiler, and using them is the "compatible" way of accessing object data. Whereas the offset method is used only internally by an object to access its own data, the PID method is used externally by objects to read/write the data fields of other objects. The PID method is also used to call methods in an object, either by the object itself, by another object, or by the SCI interpreter. Yes, this really happens sometimes.
The PMachine “registers”
The PMachine can be said to have a number of registers, although none of them can be accessed explicitly by script code. They are used/changed implicitly by the script opcodes:
- Acc
- The accumulator. Used for result storage and input for a number of opcodes.
- IP
- The instruction pointer. Points to the currently executing instruction.
- Vars
- An array of 4 values, pointing to the current variables of each mentioned type.
- Object
- Points to the currently executing object.
- SP
- The current stack pointer. Note that the stack in the original SCI interpreter is used bottom-up instead of the more usual top-down.
The PMachine, apart from the actual instruction pointer, keeps a record of which object is currently executing.
The instruction set
The PMachine CPU potentially has 128 instructions (however, a couple of these are invalid and generate an error). Some of these instructions have a flag which specifies whether the opcode has byte- or word-sized operands (I will refer to this as variably-sized parameters, as opposed to constant parameters). Other instructions have only one calling form. These instructions simply disregard the operand size flag. Ideally, however, all script instructions should be prepared to take variably-sized operands. Yet another group of instructions takes both a constant parameter and a variably-sized parameter. The format of an opcode byte is as follows:
bit 7-1 | opcode number |
bit 0 | operand size flag |
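Splitting an opcode byte into its two fields can be sketched as follows. The names `DecodedOp` and `decode_opcode` are hypothetical. That a set flag means byte-sized operands is read off the opcode listing, where the odd-numbered form of bt (0x2f) is the byte variant.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint8_t number;     /* bits 7-1: opcode number, 0..127 */
    uint8_t byte_sized; /* bit 0: 1 = byte operands, 0 = word operands */
} DecodedOp;

/* Split a raw opcode byte into opcode number and operand size flag. */
static DecodedOp decode_opcode(uint8_t raw) {
    DecodedOp op;
    op.number = (uint8_t)(raw >> 1);
    op.byte_sized = (uint8_t)(raw & 1);
    return op;
}
```

Both 0x2e and 0x2f decode to opcode number 0x17 and differ only in the size flag.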
Relative addresses
Certain instructions (in particular, branching ones) take relative addresses as a parameter. The actual address is calculated based on the instruction after the branching instruction itself. In this example, the bnt instruction, if the branch is made, jumps over the ldi instruction.
<syntax type="assembler">
eq?
bnt +2
ldi byte 2
push
</syntax>
Relative addresses are signed values.
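The address calculation can be sketched like this. The function name `branch_target` is hypothetical, and the concrete addresses mentioned afterwards are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* next_ip: address of the instruction following the branch.
 * relpos:  the signed operand of the branch instruction.
 * The result wraps around within 16 bits. */
static uint16_t branch_target(uint16_t next_ip, int16_t relpos) {
    return (uint16_t)(next_ip + relpos);
}
```

Suppose the bnt instruction above sits at address 0x0100 and occupies two bytes, and the ldi instruction also occupies two bytes (both sizes are assumptions about the byte-sized forms). The next instruction is then at 0x0102, and `branch_target(0x0102, 2)` gives 0x0104, the address of push.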
Dispatch addresses
The callb and calle instructions take a so-called dispatch index as a parameter. This index is used to look up an actual script address, using the so-called dispatch table. The dispatch table is located in script block type 7 in the script file. It is a series of words - the first one, as in so many other places in the script file, is the number of entries.
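A lookup in such a table might look like the sketch below. Everything here is an assumption beyond what the paragraph states: the helper names are hypothetical, words are read as little-endian, and dispatch index 0 is taken to mean the first word after the entry count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read a 16-bit word; little-endian byte order is an assumption. */
static uint16_t read_word(const uint8_t *p) {
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Look up a dispatch index in a dispatch table.
 * Returns 1 and stores the script address in *addr on success,
 * or 0 if the index is not below the entry count.
 * Index 0 is assumed to be the first word after the count. */
static int dispatch_lookup(const uint8_t *table, uint16_t index,
                           uint16_t *addr) {
    uint16_t count = read_word(table);
    if (index >= count)
        return 0;
    *addr = read_word(table + 2 + 2 * (size_t)index);
    return 1;
}
```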
Frame sizes
In every call instruction, a value is included which determines the size of the parameter list, as an offset into the stack. This value discounts the list size pushed by the SCI code. For instance, consider this example from real SCI code:
<syntax type="assembler">
pushi 3 ; three parameters passed
pushi 4 ; the screen flag
pTos x ; push the x property
pTos y ; push the y property
callk OnControl, 6
</syntax>
Notice that, although the callk line specifies 6 bytes of parameters, the kernel routine has access to the list size (which is at offset 8)!
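To make the offsets concrete, here is a sketch of how the called routine could locate the list size from the stack pointer and the frame size. The function name `find_list_size` is hypothetical, and the stack is modelled as an array of 16-bit words that grows upward, with sp pointing just past the last pushed word.

```c
#include <assert.h>
#include <stdint.h>

/* sp:         points just past the last pushed word (bottom-up stack).
 * frame_size: size of the parameters in bytes, excluding the list size.
 * Returns a pointer to the list size word, which sits directly below
 * the parameters. */
static const uint16_t *find_list_size(const uint16_t *sp,
                                      uint16_t frame_size) {
    return sp - (frame_size / 2) - 1;
}
```

With the four pushes from the example and a frame size of 6, the returned pointer lies 8 bytes below sp and holds the value 3; the three parameters follow it.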
PErrors
These are internal errors in the interpreter. They are usually caused by buggy script code. The PErrors end up displaying an “Oops!” box in the original interpreter (it is interesting to see how Sierra likes to believe that PErrors are caused by the user - judging by the message “You did something we weren’t expecting”!). In the original interpreter, specifying -d on the command line causes it to give more detailed information about PErrors, as well as activating the internal debugger if one occurs.
Class numbers and addresses
The key to finding a specific class lies in the class table. This class table resides in VOCAB.996, and contains the numbers of scripts that carry classes. If a script has more than one class definition, the script number is repeated as necessary. Notice how each script number is followed by a zero word? When the interpreter loads a script, it checks to see if the script has classes. If it does, a pointer to the object structure is put in this empty space.
The instructions
The instructions are described below. I have used Dark Minister's text on the subject as a starting point, but many things have changed; stuff explained more thoroughly, errors corrected, etc. The first 23 instructions (up to, but not including, bt) take no parameters.
These functions are used in the pseudocode explanations:
<syntax type="C">
pop(): sp -= 2; return *sp;
push(x): *sp = x; sp += 2; return x;
</syntax>
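For reference, the two helpers can be written as compilable C roughly like this. This is a sketch, not the interpreter's code: the `Stack` type, `stack_init` and the stack size are hypothetical. The stack pointer is a pointer to 16-bit words, so incrementing it advances by two bytes, matching the `sp += 2` above.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 64 /* hypothetical stack size */

typedef struct {
    uint16_t data[STACK_WORDS];
    uint16_t *sp; /* points at the next free slot; the stack grows upward */
} Stack;

static void stack_init(Stack *s) { s->sp = s->data; }

/* Store x at the stack pointer, then advance it by one word. */
static uint16_t push(Stack *s, uint16_t x) {
    assert(s->sp < s->data + STACK_WORDS); /* overflow check */
    *s->sp++ = x;
    return x;
}

/* Step the stack pointer back by one word and return that word. */
static uint16_t pop(Stack *s) {
    assert(s->sp > s->data); /* underflow check */
    return *--s->sp;
}
```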
The following rules apply to opcodes:
- Parameters are signed, unless stated otherwise. Sign extension is performed.
- Jumps are relative to the position of the next operation.
- *TOS refers to the TOS (Top Of Stack) element.
- "tmp" refers to a temporary register that is used for explanation purposes only.
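The sign extension rule can be sketched as an operand fetch routine. The name `fetch_operand` is hypothetical, and reading word operands as little-endian is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Fetch a signed operand from the code stream.
 * byte_sized != 0: one byte, sign-extended to 16 bits.
 * byte_sized == 0: one 16-bit word (little-endian assumed). */
static int16_t fetch_operand(const uint8_t *code, int byte_sized) {
    int v;
    if (byte_sized) {
        v = code[0];
        if (v & 0x80)
            v -= 0x100; /* sign-extend the byte */
    } else {
        v = code[0] | (code[1] << 8);
        if (v & 0x8000)
            v -= 0x10000; /* interpret the word as signed */
    }
    return (int16_t)v;
}
```

A byte operand of 0xfe and a word operand of 0xfffe therefore both yield -2.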
op 0x00: bnot (1 byte)
op 0x01: bnot (1 byte)
- Binary not
<syntax type="C"> acc ^= 0xffff; </syntax>
op 0x02: add (1 byte)
op 0x03: add (1 byte)
- Addition:
<syntax type="C"> acc += pop(); </syntax>
op 0x04: sub (1 byte)
op 0x05: sub (1 byte)
- Subtraction:
<syntax type="C"> acc = pop() - acc; </syntax>
op 0x06: mul (1 byte)
op 0x07: mul (1 byte)
- Multiplication:
<syntax type="C"> acc *= pop(); </syntax>
op 0x08: div (1 byte)
op 0x09: div (1 byte)
- Division:
<syntax type="C"> acc = pop() / acc; </syntax> Division by zero is caught => acc = 0.
op 0x0a: mod (1 byte)
op 0x0b: mod (1 byte)
- Modulo:
<syntax type="C"> acc = pop() % acc; </syntax> Modulo by zero is caught => acc = 0.
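Two details of div and mod are easy to get wrong: the popped value is the left operand, and a zero in the accumulator yields 0 instead of a fault. A sketch, with hypothetical function names and with the values treated as signed 16-bit numbers (an assumption):

```c
#include <assert.h>
#include <stdint.h>

/* popped: the value taken from the stack (left operand).
 * acc:    the accumulator before the instruction (right operand).
 * Returns the new accumulator value. */
static int16_t op_div(int16_t popped, int16_t acc) {
    if (acc == 0)
        return 0; /* division by zero is caught */
    return (int16_t)(popped / acc);
}

static int16_t op_mod(int16_t popped, int16_t acc) {
    if (acc == 0)
        return 0; /* modulo by zero is caught */
    return (int16_t)(popped % acc);
}
```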
op 0x0c: shr (1 byte)
op 0x0d: shr (1 byte)
- Shift Right logical:
<syntax type="C"> acc = pop() >> acc; </syntax>
op 0x0e: shl (1 byte)
op 0x0f: shl (1 byte)
- Shift Left logical:
<syntax type="C"> acc = pop() << acc; </syntax>
op 0x10: xor (1 byte)
op 0x11: xor (1 byte)
- Exclusive or:
<syntax type="C"> acc ^= pop(); </syntax>
op 0x12: and (1 byte)
op 0x13: and (1 byte)
- Bitwise and:
<syntax type="C"> acc &= pop(); </syntax>
op 0x14: or (1 byte)
op 0x15: or (1 byte)
- Bitwise or:
<syntax type="C"> acc |= pop(); </syntax>
op 0x16: neg (1 byte)
op 0x17: neg (1 byte)
- Sign negation:
<syntax type="C"> acc = -acc; </syntax>
op 0x18: not (1 byte)
op 0x19: not (1 byte)
- Boolean not:
<syntax type="C"> acc = !acc; </syntax>
op 0x1a: eq? (1 byte)
op 0x1b: eq? (1 byte)
- Equals?:
<syntax type="C"> prev = acc; acc = (acc == pop()); </syntax>
op 0x1c: ne? (1 byte)
op 0x1d: ne? (1 byte)
- Is not equal to?
<syntax type="C"> prev = acc; acc = !(acc == pop()); </syntax>
op 0x1e: gt? (1 byte)
op 0x1f: gt? (1 byte)
- Greater than?
<syntax type="C"> prev = acc; acc = (pop() > acc); </syntax>
op 0x20: ge? (1 byte)
op 0x21: ge? (1 byte)
- Greater than or equal to?
<syntax type="C"> prev = acc; acc = (pop() >= acc); </syntax>
op 0x22: lt? (1 byte)
op 0x23: lt? (1 byte)
- Less than?
<syntax type="C"> prev = acc; acc = (pop() < acc); </syntax>
op 0x24: le? (1 byte)
op 0x25: le? (1 byte)
- Less than or equal to?
<syntax type="C"> prev = acc; acc = (pop() <= acc); </syntax>
op 0x26: ugt? (1 byte)
op 0x27: ugt? (1 byte)
- Unsigned: Greater than?
<syntax type="C"> acc = (pop() > acc); </syntax>
op 0x28: uge? (1 byte)
op 0x29: uge? (1 byte)
- Unsigned: Greater than or equal to?
<syntax type="C"> acc = (pop() >= acc); </syntax>
op 0x2a: ult? (1 byte)
op 0x2b: ult? (1 byte)
- Unsigned: Less than?
<syntax type="C"> acc = (pop() < acc); </syntax>
op 0x2c: ule? (1 byte)
op 0x2d: ule? (1 byte)
- Unsigned: Less than or equal to?
<syntax type="C"> acc = (pop() <= acc); </syntax>
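The signed and unsigned comparisons differ only in how the same 16-bit patterns are interpreted. A sketch for gt? and ugt?, with hypothetical function names (the update of prev is left out here):

```c
#include <assert.h>
#include <stdint.h>

/* Signed comparison: both words are interpreted as signed 16-bit. */
static uint16_t op_gt(uint16_t popped, uint16_t acc) {
    return (int16_t)popped > (int16_t)acc;
}

/* Unsigned comparison: both words are compared as 0..65535. */
static uint16_t op_ugt(uint16_t popped, uint16_t acc) {
    return popped > acc;
}
```

For example, with 0xffff popped from the stack and 1 in the accumulator, gt? yields 0 (since -1 is not greater than 1), while ugt? yields 1 (since 65535 is greater than 1).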
op 0x2e: bt W relpos (3 bytes)
op 0x2f: bt B relpos (2 bytes)
- Branch relative if true
<syntax type="C"> if (acc) pc += relpos; </syntax>