Implement Machine Forth for ARM Processors

xiaoxiao2021-03-06 36

Author REUBEN THOMAS

Computer Laboratory, University of Cambridge

23rd August 1999

Summary

FOX and Moore [2] have recently proposed a new Forth virtual machine model called Machine Forth. Using a simple and specific model, it is said that it can easily adapt to different hardware, and do not need to turn to the assembler to produce high efficiency code, and it works as a foundation of a excellent Forth compiler. This paper discusses an ARM processor to implement Forth side, comparing a traditional Forth system based on an MF Forth system and a traditional Forth system implemented using the ARM machine model.

1 Introduction

In March 1999, there is a discussion of MachineForth about Charles Moore in Comp.lang.Forth. This virtual machine model is said to be used for all of his own Forth programming and has been implemented on several wafers such as a class of F21.

Jeff Fox is a help hand of Moore. He said that Moore believes that MF VM has much better than the classic Forth VM for Forth, which allows it to get closer to the machine, while maintaining an identical virtual processor regardless of it is below What kind of real processor. In addition, the simplicity of the design means that in order to implement this system, the required tools will not be more than one small macro.

Thus, it is interested, I decided to test this model for me by implementing a Forth on the ARM processor. I want to see how new instructions like automatic incremental addressing are conducive to program design, and how is the instructions that lack SWAP and ROT, how is the system's implementation, speed and What is the difference between the code density and the traditional Forth system. Of course, I also care about what changes need to be in order to adapt to the ARM hardware.

My test about MF Forth includes writing a Forth compiler with MF, which contains the minimum requirements of Moore. (Another obvious experiment is to transplant the system-an extended ANS compatible system-portable system-portable system-portable system-transplantable ANS compatible system). Then, I reuse the ARM code to write a compiler to compare the differences between the ARM write system and use MF. On these two systems, the benchmark test procedure is run on the traditional Forth and C compilers.

The following discussion assumes that the reader is familiar with the MF model and the details of F21. From here, "Machine Forth" is short-written as "MF", and my own version, I am sorry, it is called "MF32".

Change of machine model

Since I only have the F21 specification as MF guidance, I don't know if MOORE implements MF on the traditional processor.

Data width

Since ARM is a 32-bit architecture, the data width becomes 32 bits.

Testing

On the F21 processor, the instruction C = 0 tests an extension of the stack element. If implemented on a conventional processor, this instruction will be particularly slow, so turn this instruction into a T31 bit, which is usually used to test the value is negative, which is not helpful for testing, unless we regard the register as 31 Bit. However, the need for 32-bit precision multiplication is usually less than 20 registers.

Stack

The F21 processor places its stack on the chip, so they are fixed size; they can be any size because ARM has no hardware stack.

Subprogram calls and returns

Similar to the usual RISC processor, ARM does not automatically place the subroutine returns to the stack, but transfer it to a register so that the end subroutine can be used to access the memory access. Since the return address is always slow, and each call requires 8 bytes instead of 4 bytes, it adds a new register L (LINK), which buffers like data stack top element register T Return the top elements. Two new instructions are RET and: (colon) for implementing terminal subroutine optimization. When considering the L registers, the semantics associated with the return stack have only a small change. Byte addressing

Since ARM is byte addressing, the incremental increment must be 4, adding to the reading and storage instructions of bytes look wise.

multiplication

ARM has hardware multiplication instructions, so * can be replaced ( * In addition to multiplication, there are other purposes, but when hardware can provide multiplication, deleting a rarely used instruction seems reasonable).).

NOP

ARM does not require NO-OP operation, so the NOP instruction is removed. This can be simulated by a phrase of Push Pop or Dup DROP at any time.

OS access

An SWI instruction is added to allow ARM's SWI (software interrupt instruction) to implement access to the operating system.

2 revised VM model

Data element

Stack: Data Stack S, return to the stack R

Registers: T, L, A, P

2.2 Perform a loop

Execute the instructions at P, if the instruction does not modify P, P = P 4, repeat

2.3 Instruction Set

Table 1 shows the instruction set and its semantics. The first column gives the name of the instruction, the second column is the number of immediate operands. They are called V1, V2, ... The third column gives semantics. The following simplification is used: "PUSH X to T" means "pressing T to the stack S, set the value of X", "POP T TO X" means "calculates X, to store the value of X in T, Play T in S ".

Directive Operation Operation

# 1 push v to t

Else 1 Jump to v

T = 0 1 JUMP TO V if T = 0

C = 0 1 JUMP TO V if T31 = 0

Call 1 set L to p_ 4, jump to v

RET 0 set P to L

: 0 Push L TO R

; 0 POP P from r

A @ 0 push a to t

A! 0 Pop T TO A

@A 0 push [a] to t

! A 0 pop t to [a]

B @ a 0 push [a] 0-7 to t

B! A 0 POP T0-7 TO [A] 0-7

@ A 0 push [a] to t, add 4 to a

! A 0 Pop T To [A], Add 4 to a

B @ a 0 push [a] 0-7 to t, add 1 to a

B! A 0 POP T0-7 TO [A] 0-7, Add 1 TO A

POP 0 Push L to T, POP R to L

Push 0 push l to r, pop t to l

@ R 0 push [l] to t, add 4 to L

! R 0 Pop T To [L], Add 4 To L

B @ r 0 push [l] 0-7 to t, add 1 to L

B! R 0 POP T0-7 TO [L] 0-7, Add 1 To LCOM 0 One's Complement T

2 * 0 Shift T One Place LEFT

2/0 Shift T Arithmetically One Place Right

* 0 set t to [s] = t, POP S

-or 0 set t to eloclusive-or of [s] and t, pop s

And 0 set t to both 【s] and t, pop s

0 set t to [s] _ t, pop s

Dup 0 Push T TO T

OVER 0 Push S to T

Drop 0 Pop t from s

SWI 3 POP V2 Arguments from T Into Arm Registers R0 To RV2-1,

Call SWI V1, PUSH V3 Results from Arm Registers R0 To RV3-1 To T

Table 1 MF32 instruction set

3 assembler

As you are committed, the write compulsor is easy. I added a commonly used compilation control structure to implement if ... then conditions and begin ... Repeat / Again / Until cycle, increase the MF-IF-based variants.

After completing the compiler, I added a simple snap-hole optimization for the assembler to delete PUSH-POP pairs; this may work more than other assembler needs, but the code is short.

4 compiler

The compiler provides enough tools for the smallest interpretation environment: an interpreter / compiler, numeric input, and output, with a new defined dictionary capability.

5 Machine Forth's difficulties

Start

As before, it is difficult to write because it is not familiar with the stack and arithmetic operator, I found it hard to remember A @ / a! And @ a /! A is doing what is done, I occasionally use the latter as @A /! A successfully remembers them.

Over time, I started to discover some universal habits, such as using the A and R stacks to change the sequence of elements, sometimes using R to store cyclic counts and final values, DUP BEGIN DROP to remove the logo of the condition cycle (because MF The IF and -IF instruction do not pop up T).

Consider portability

No matter what feeling from MF, I found that when considering how it is best to implement a word, it is very difficult to compare the difference between different MF instruction sequences and ARM translation. For example, in my MF32 implementation, in order to get constant 0 in the stack, the method defined using 0 constant is the fastest. In F21, using DUP DUP - may be the best. This seems to be contradictory contradiction to the portability of MF, although the same result can be obtained in different architectural sequences. However, this problem will appear in any portable language, but it is more obvious in this simple structure of MF.

Directive cache synchronization

One problem is that any Strongarm dynamic code generator must be resolved, this is its instruction Cache can't automatically synchronize with the data cache, a simple and secure solution is to add a word code!, It is synchronized when adding , Construct and use it to store the dictionary (there is another word in the MF32).

Increasing primitive

I found that I need a few MF that is not available. I added NEGATE, Minus, OR, LSHIFT and RSHIFT. The clumsy rule I use is: If a word is used by the ARM instruction, I can use ARM encoding than several instructions with MF32.

In fact, all mentioned above is implemented as embedded primitives, as if they are part of the MF32 assembler.

Other words encoded with assembly

There are several messages written to Comp.Lang.Forth to omit SWAP perform lengthy discussions. I didn't participate in this discussion, but in my system, I added SWAP, but it only used four times; at other times, it will be more effective in A or in the return stack. Other additional primitives are rarely used, generally 2-3 times, unique exceptions are OR, which uses 8 times.

EXECUTE, MIN and Dependent OS Emit, Space, Cr and Bye are also written in ARM compilation. For Execute requires special instructions: It is a simple PUSH RET in the F21 processor, but it is more difficult in MF32, although it can be done. You can practice a practice, but before you see the answer from Part 9, let's think about it.

Lack of stacks

MF lacks the words that smash the stack, which increases the pressure to the general Forth programmer, requires them to avoid the stack and more strongly require the factor decomposition. It seems that this is a tight spell, but I found that only 1, 2 words have become difficult to write due to lack of Pick and Roll (Number especially embarrassing). Despite this, I am still not accustomed to MF.

division

For the input and output of the value, the division is required. MF and ARM have no division instructions, so I use the divider program.

Read and write stack pointers

A serious shortcomings is that MF cannot read and write stack pointers: it is very important to implement a depth reset stack in Quit. In the F21 processor, since the stack is implemented with hardware, it is not too important, but for the software, this is very different because it prevents the memory leakage. The method I avoid this problem is: Let Quit simply branch to the initialization code, reset the stack pointer.

6 ARM FORTH

After ending the MF32 of ARM, I wrote the compiler with ARM code. This sounds like the so-called "hardware fortH", but there is a difference: when the hardware structure of the CMForth compiler is specified by a machine (in this case, the Novix Forth chip), ARM's Machine Forth is strictly Machine Forth is consistent, just using the ARM machine model and is no longer an MF model.

The code is not only smaller, but at least easier to write (in fact, many words work for the first time, they are not the MF32 version of local code translation, but completely rewriting to utilize the advantages of ARM). ARM is small and the rule set can be compared to Machine Forth (which contains approximately 20 basic instructions) in code size, but it has a richer arithmetic and logical operation, addressing mode, and Machine Forth have no characteristics, such as Condition execution.

On the other hand, since ARM is a register-based machine, this causes two compilers: Some words are very useful in the interactive environment, such as DUP and , but rarely in compiling code, in compiling code, need Develop registers addressing capabilities.

7 MF32 and ARM FORTH comparison

size

Table 2 shows some of the comparisons of the MF32 and ARM Forth systems. Table indicates that the code density is that each MF32 command is 1.5 ARM instruction or is 6 bytes.

Table 2: Comparison of MF32 and ARM FORTH

Symphony optimization saves about 100 instructions, which is equivalent to 9% of all generated code. But we also see that ARM Forth is more than MF32, because ARM chips are more complicated than Machine Forth assembler, and binary code is 32 units, including 326 smaller code. In addition, many instructions in both systems are smaller than using the ARM directive than using the MF32 instruction. It seems that the code density of ARM is twice the MF32.

speed

We put two benchmarks in the MF32 system, ARM Forth system, a local codes (AFORTH 0.75, can run from http://sc3d.org/rt/) and GNU C. The first code is a simple random number generator, as shown in Figure 1 (considering the reason we have slightly C code, it is a text translation of an ANS FORTH version, can be obtained in the Machine Forth release).

Table 3: Random number benchmark test

Table 4: Proud number benchmark

The first test runs 10,000,000 cycles, the second time is a simple prime quota lookup, find 10,000 prime numbers (it is best to run some famous benchmarks, such as ERTL's integer test set, but there is not enough time to put them Translated into Machine Forth and ARM Forth.

ARM Forth is the smallest and fastest and version, but amazing is that the ANSI Forth code is smaller than the MF32 version and is faster (the prime reference test costs about 60% of the time to perform software division procedures, so this is Not important, although the same tendency is also shown in the random number reference test).

In an earlier version of this article, I have stupidly wrote "MF32 speed will be clearly located between the subroutine series and the local code compiler." Despite several reference tests, it is equally stupid, but the speed and code density of MF requires further investigation, at least in a typical desktop architecture (in Intel's implementation, similar Code Density See Comp.lang.Forth.

Figure 1 Random number benchmark test procedure

Programming Comfort

ANSI Forth and MF32 are the same in programming comfort. Contrary to the experience of writing compilation, ARM Forth is more difficult. The ARM code is more difficult than Forth, in fact, ARM Forth is more difficult to factor, because in ARM FORTH, the definition of the return flag will naturally modify the ARM register instead of returning a flag value in the top of the stack, but this is It is currently impossible because the logo blocks the call of across subprograms. Perhaps this is to change.

However, this compares Machine Forth and ARM code are unfair because the former is designed for MF, but the latter is not. Using a more natural coding style, ARM assembly will be as easy to write as Machine Forth, and if you use it from a traditional Forth compiler, it can also be tested interactively.

8. Conclusion

This article is only a summary of very little experience, and it is always imperative by writing a compiler to get about FORTH programming. Obviously, MF is easy to implement, but that is just the conclusions obtained from the perspective of the machine model. For me, Machine Forth has fallen into two difficulties: For advanced code, I am more willing to write a complete standard Forth, for low-level code, I am more willing to have powerful, high-speed, tightened assembler (of course in one Use in the Forth compiler). This impression may come from my own processor (6502, MC68000 and ARM) having a small and simple instruction set. For an unfamiliar processor, especially for the processors of those large and complex instructions, the simplicity of MF may help us generate reasonable code faster than learning local assembly languages.

The portability of MF can also be mentioned as its characteristics, but it is just a advantage when preparing advanced code; when MF is an alternative to assembly language, the encoding is related to the machine, which is discussed here. There is no meaning.

However, MF has also provided some beneficial ideas. For small Forth compilers that cannot provide an optimizer, explicitly use the address latch is the easiest way to improve performance. More interesting is that it is non-destructive conditional testing can be easily used in traditional Forth.

Therefore, MF's novelty and mixing of traditional simplicity are carefully studied and mixed, although I will not give up the traditional combination of Forth and assembler because of these.

9 exercise answer

Execute can be written in MF32 like this

A! Pop a @ push push;

10 gratitude

Jeff Fox was kind enough to read and comment on the paper, a referee made several helpful comments, Hanno Schwalm pointed out the weakness of the primes benchmark, and Marcel Hendrix asked for the comparison with C.

References

[1] M. Anton Ortl. Performance of Forth Systems, 1996. http://www.complang.

Tuwien.ac.at/forth/performance.html.

[2] Charles Moore and Jeff Fox. Preliminary Specification of the F21, 1995. HTTP:

//pisa.rockefeller.edu: 8080/misc/f21.specs.

[3] VLSI Technology Inc., Eaglewood Cliffs, NJ. Acorn Risc Machine Family Data Manual,

1990.

转载请注明原文地址:https://www.9cbs.com/read-49771.html

9cbs

New Post(0)