Some improvements in HEC virtual machines
Chen Shuo 2004-02-13
In the book "Design and Implementation - C / C ", the author Bill Blunden describes a simple but complete virtual machine - HEC design and implementation. In the process of reading the third chapter, I found a few places worth improvement. Not for the overall design and code structure of HEC - the overall situation is moving, but for some details of the HEC virtual machine implementation code.
1. Convert word sequence
HEC's byte sequence is BIG-Endian, and the host platform (Win32 On Intel) is Little-Endian, so it is necessary to convert word sequence when loading .Run file . The author first defines various types of HEC virtual machines (P.61, original book P.67):
#define u1 unsigned char
#define u2 unsigned short
#define u4 unsigned long
#define u8 unsigned __INT64
Then define 10 conversion functions (P.62, the original book P.68). For example, the role of u2 bytecodetoWord (U1TES [] is to convert Big-Endian's 16-bit unsigned integer to native format:
U2 BytecodetoWord (u1 bytes [])
{
U2 Word;
U1 * buffer;
Buffer = (u1 *) & word;
Buffer [0] = bytes [1];
Buffer [1] = BYTES [0];
Return (Word);
}
This function does not consider portability, which determines that the word sequence of this unit is Little-Endian. A better transplant is:
U2 BytecodetoWord (u1 bytes [])
{
U2 Word = (Bytes [0] << 8) bytes [1];
Return Word;
}
This can operate correctly regardless of the host-Endian or Big-endian. Before performing left-shifting operation (<<), the C language will automatically put Bytes [0] from the unsigned char type increase (promote) as an int type, so don't worry that the value is "removed".
Another good way is to use the NTOHS () function in the Sockets API:
#include
// OR
On unix-like OSES
U2 BytecodetoWord (u1 bytes [])
{
Return ntoHs (* (u2 *) bytes);
}
The U4 BytecodetodWord (u1 bytes []) can also be handled:
U4 bytecodetodword (u1 bytes [])
{
U4 DWORD = (bytes [0] << 24) (bytes [1] << 16)
(Bytes [2] << 8) bytes [3];
Return DWORD;
}
// or using ntohl ()
U4 bytecodetodword (u1 bytes [])
{
Return NTOHL (* (U4 *) bytes;
}
With the same idea, the function of this machine data is converted to the Big-endian to rewrite (with dwordtobytecode () as an example):
Void DWORDTOBYTECODE (U4 DWORD, U1 Arr []) {
Arr [0] = DWORD >> 24;
Arr [1] = (DWORD >> 16) & 0xFF;
Arr [2] = (DWORD >> 8) & 0xFF;
Arr [3] = DWORD & 0xFF;
}
// or using htonl ()
Void DWORDTOBYTECode (U4 DWORD, U1 Arr [])
{
* (U4 *) Arr = HTONL (DWORD);
}
On the Intel processor, the word sequence for 16-bit integer is only one statement (assuming data has read register AX):
Xchg Al, AH
Change the word order of 32-bit integers to only three statements (assuming data read register EAX):
Xchg Al, AH
Ror Eax, 16
Xchg Al, AH
When the GCC compiler is used to optimize the switch, the NTOHL () function and the HTONL () function are compiled into the above form.
2. Verify the bytecode
In the verification phase, the HEC virtual machine will check if the format of each instruction is correct, including the operation code, register number, the legitimacy of the address value, and the integrity of the instruction, and the byte sequence of the converted operands. This can be considered as an anti-pollier. Its basic structure is
While (current_byte U1 opcode = ram [current_byte]; Switch (opcode) { Case LBI: / * LBI $ R1, BYTE_CONSTANT BBB * / ... Case Add: / * Add $ R1, $ R2, $ R3 BBBB * / ... Default: / * Bad opcode * / Report the Error } } Among them, the code of the Switch-Case structure has nearly 550 lines, column of 12 pages (P. 104 ~ P.106, the original books P.109 ~ P.119). Carefully observe that you can find that many of the code segments are repeated, it is difficult to maintain. Since the format of the instruction is very law, we can multiplex code and simplify the verification process. My idea is to store the format of each directive into array char * i_fmt []; verify the bytecode, retrieve the format string according to the command code (opcode), and then verify the legality of the instruction according to the format string . In other words, a small language for describing instruction format is designed here. Basic structure Char * i_fmt [256] = { // LBI, LWI, LDI, LQI "ORB", "ORW", "ORD", "ORQ", ... } // in Reformat () While (current_byte U1 opcode = ram [current_byte]; Char * fmt = i_fmt [opcode]; Char field; IF (fmt == null) {// invalid opcode Report the Error and Return } Field = * fmt ; While (Field! = '/ 0') { Switch (field) { Case 'o': // opcode ... Case 'r': // integer register ... DEFAULT: Fatal_ERROR (); Break; } Field = * fmt ; } } This entire While cycle is only 80, and the original 1/6, the complete code can be downloaded from my personal page (http://www.chenshuo.com). Here 'o' indicates an operation code, 'R' represents an integer register, 'F' indicates a floating point register, 'L' represents the double precision floating point register, 'b' indicates that the number of 8 bits is immediately, 'W' indicates 16-bit immediate number , 'D' indicates that the 32-bit immediate number, 'q' indicates that the number of 64 bits is immediately, 'a' indicates the 64-bit address (LAD command), 'F' indicates the number of floating point (32 bits), 'L' represents double precision Floating point immediately (64-bit), 'V' represents an interrupt vector (int instruction), and the like. 3. Other improvements The author claims that HEC has a symbol number (SIGNED), but his SLT statement is determined to compare the size of unsigned signs (P.134, the original book P.137). That is to say, HEC actually does not have a statement with a symbol integer size. It treats the integers of the participation in size as unsigned (meaning 0 <-1), which is opposite to the author's ideas. Therefore, I recommend adding a SLTU statement to compare unsigned integers, and the original SLT language is changed to compare the number of symbols. This is the same. .Finish.