Principle and implementation of stack computers



Original title: "Stack Computers: The New Wave"

(Original book cover)

Original author: Philip J. Koopman, Jr.

Translated by Zhao Yu and Zhang Wencui

This is the first book to discuss a new generation of stack computers; the first chip to implement this architecture was Novix's NC4016. The author classifies computers according to how they perform stack-based computation, surveying more than 70 past and present stack machines and describing seven stack computers in detail, complete with block diagrams and instruction sets. These computers come from Harris Semiconductor, Novix, Johns Hopkins University/APL, MISC, WISC Technologies, and Wright State University. Topics discussed in this book also include stack computer architecture analysis, software strategies for stack computers, application areas, and future development prospects.

This book addresses the needs of engineers and programmers working in computer architecture, real-time control systems, expert systems for control applications, computer graphics, image processing, military electronics, and any application requiring compact, powerful computers.

Table of Contents

Chapter 1 Introduction and Overview

Chapter 2 Classification of Hardware-Supported Stack Computers

Chapter 3 Multi-Stack, 0-Operand Computers

Chapter 4 16-Bit Stack Computer Architectures

Chapter 5 32-Bit Stack Computer Architectures

Chapter 6 Understanding Stack Computers

Chapter 7 Stack Computer Software Development

Chapter 8 Stack Computer Application

Chapter 9 Stack Computer's Future

Appendix A Overview of Computers with Hardware Stack Support

Appendix B Glossary of Forth Primitives

Appendix C Instruction Frequency Statistics

Appendix D References

Chapter 1 Introduction and Overview

1.1 Overview

Hardware-supported last-in first-out (LIFO) stacks appeared early in the history of computers. The purpose of adding these stacks was originally to speed up the execution of the high-level language Algol. Since then, the stack has drifted in and out of favor with hardware designers, eventually settling into the role of a secondary data handling structure in most computers. To the puzzlement of stack advocates, the stack as the primary data handling structure has never been widely accepted by designers of register-based machines.

As very large scale integration (VLSI) technology was applied to microprocessors, traditional computer design methods came under question once again. Complex instruction set computers (CISC) were challenged by simplified instruction sets: reduced instruction set computers (RISC) use a streamlined processor core to achieve higher raw execution speed in many applications.

The stack computer is once again being considered as a design style. A new generation of stack computers based on VLSI design technology offers benefits that traditional stack computers could not provide. These new stack computers achieve an impressive combination of speed, flexibility, and simplicity.

The hardware complexity of stack computers is much lower than that of CISC machines, and their system complexity is much lower than that of both RISC and CISC machines. The high performance of stack computers does not depend on complex compilers or cache controllers; they offer competitive raw performance, and higher performance in many programming environments. Their first successful application area is embedded real-time control, where they outperform other system design approaches. In addition, stack computers show excellent promise for other applications, including the logic programming language Prolog, the functional programming languages Miranda and Scheme, and the artificial intelligence research languages OPS-5 and Lisp.

The key difference from older stack computer systems is that the old systems kept their stacks in program memory, while the new stack computers maintain separate stack memories, sometimes even placing stack memory on the processor chip itself. The reason is that large, high-speed, dedicated stack memories are now very inexpensive. These stack computers provide extremely fast subroutine calls, along with excellent interrupt handling and stack switching performance. Taken together, these characteristics yield a new kind of computer system: fast, nimble, and compact. This chapter first discusses how stacks operate, then introduces the terminology of hardware stack computer design, discusses an abstract stack computer and several commercial implementations, examines the performance characteristics of stack computers along with hardware and software considerations, and finally sketches a few future trends in stack computer design.

1.2 What is a stack?

The LIFO stack, also known as a "push-down stack", is conceptually the simplest way to save information during computations such as expression evaluation and recursive subroutine calls in general-purpose computing.

1.2.1 An Example from a Buffet Restaurant

Let us use an everyday example to illustrate how a stack works. Recall the spring-loaded tray dispensers often seen in buffet restaurants. Suppose each tray carries a number, and trays are loaded from the top, each new tray compressing the spring and leaving room for subsequent trays. As shown in Figure 1.1, trays 42, 23, 2, and 9 are loaded in turn: 42 is placed first, and 9 is placed last, on top.

Figure 1.1 An example of stack operation

The last tray loaded is number 9, so the first tray removed is also number 9. Assuming customers always take trays from the top, the first one taken is 9, and the next is 2. Now consider adding more trays at this point: if we want to retrieve the trays loaded earlier, these newly added trays must be removed first. After any sequence of push (load) and pop (remove) operations, tray 42 remains at the bottom. When tray 42 is finally taken from the top, the stack is empty once again.

1.2.2 Example of software implementation

A LIFO stack can be programmed on a conventional computer in several ways. The most direct method is to allocate an array in memory and keep a variable holding the array index of the topmost occupied element. Programmers concerned with efficiency may instead allocate a block of memory and keep a pointer to the actual address of the top element. In either case, "push" is the process of placing a new word on top of the stack and storing data into it; "pop" is the process of removing the top element of the stack and returning its value to the requesting routine.
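As an illustration (not from the original book), a minimal array-based software stack with an index to the top element might be sketched in Python like this:

```python
class ArrayStack:
    """LIFO stack backed by a fixed-size array, with an index to the top element."""
    def __init__(self, capacity=64):
        self.data = [0] * capacity   # pre-allocated "memory" region for the stack
        self.top = -1                # index of the topmost occupied element (-1 = empty)

    def push(self, value):
        self.top += 1                # allocate a new cell on top of the stack...
        self.data[self.top] = value  # ...then store the data into it

    def pop(self):
        value = self.data[self.top]  # read the top element...
        self.top -= 1                # ...and remove it from the stack
        return value

# The tray example from Figure 1.1:
s = ArrayStack()
for tray in (42, 23, 2, 9):
    s.push(tray)
print(s.pop(), s.pop())  # the last trays loaded come off first: 9 2
```

The pointer-based variant differs only in that `top` would hold a memory address rather than an array index.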

Stacks are often placed at the highest addresses of machine memory and grow downward from high addresses to low, which simplifies memory management: the region between the stack top and the program's memory space is the memory still available. For our discussion, it does not much matter whether a stack grows up or down. The "top" element is the most recently pushed element, which will be the first popped. The "bottom" element is the one which, when removed, leaves the stack empty.

Stacks have a very important property: in the purest sense, they allow access only to the top element of the data structure. We shall see later that this property has far-reaching consequences: it keeps both programs and hardware simple, and it improves execution speed.

The stack is an excellent mechanism for storing temporary data across subroutine calls. The fundamental reason is that it allows recursion without risk of destroying the data of previous invocations, and it likewise supports reentrant code. As an added advantage, stacks can be used to pass parameters between those same subroutines. Finally, they save space by letting different procedures reuse the same memory; otherwise, space would have to be allocated permanently for the temporary variables of every procedure.

In addition to arrays, stacks can be built with several other software techniques. A linked list of elements can be used to allocate stack words without knowing actual memory addresses in advance. Similarly, stack space can be allocated from a heap, although this is somewhat circular, since heap management is really a superset of stack management.

1.2.3 Example of hardware implementation

Implementing a stack in hardware has the obvious advantage that it runs much faster than software; on machines that execute many stack instructions, this efficiency is critical to system performance.

Although any of the software stack implementations could be realized in hardware, the usual approach is to reserve a region of memory and point a stack pointer at it. The pointer is a dedicated hardware register that is incremented or decremented as elements are pushed and popped. Sometimes an offset is added to the pointer when accessing memory, so that elements below the top can be reached directly without first popping the elements above them. In most cases the stack resides in the same memory device as program code, but for efficiency the stack sometimes resides in its own memory device.

Another possible way to build a stack in hardware is to use a bank of shift registers. Each shift register is a long chain of registers, with one end treated as a single bit of the top-of-stack element; 32 such N-deep shift registers arranged side by side form a 32-bit-wide stack of N elements. In practice, however, this scheme has seen little use: the technology was not available in the early days, and VLSI stack computer designs have not adopted it because it does not map well onto conventional register-based design methods.

1.3 Why are stack computers important?

In theory, stacks themselves are important because the stack is the most basic and natural structure for executing well-structured code (Wirth 1968). Machines with LIFO stacks are also needed to compile computer languages, and may be required to translate natural language as well (Evey 1963). Any computer with a hardware-supported stack can run faster when executing applications that require stack structures.

Some argue that stack computers are easier to program than conventional computers, and that stack computer programs can be more reliable than other programs (McKeeman 1975). Compilers for stack computers are straightforward to write, because the instruction sets have few special cases that would complicate the compiler (Lipovski 1975). Since compilation itself often consumes a large share of computer resources, building a computer that compiles efficiently is important in its own right.

As we shall see in this book, stack computers run certain kinds of programs, particularly highly modular ones, more efficiently than register-based computers. Stack computers are also simpler than other kinds of computers, delivering substantial computing power with very little hardware. A particularly noteworthy application area for stack computers is real-time embedded control, which demands a combination of small size, high processing speed, and good interrupt handling; only stack computers provide all of these capabilities at once.

1.4 Why are stacks used in computers?

Hardware and software stacks have been used to support four major computing needs: expression evaluation, subroutine return address storage, dynamically allocated local variable storage, and subroutine parameter passing.

1.4.1 Expression Evaluation Stack

The expression evaluation stack was the first kind of stack to receive wide hardware support. When an arithmetic expression is evaluated, intermediate results and operator precedence must be tracked. An interpreted language needs two stacks: one holding pending operators awaiting higher-priority operations, the other holding the intermediate operands corresponding to those pending operators. In a compiled language, the compiler keeps track of pending operations during code generation, and the hardware uses a single expression evaluation stack to hold intermediate results.

To see why a stack suits expression evaluation, consider how the following expression is computed: X = (A + B) * (C + D)

First, A and B are added; the intermediate result must be saved somewhere, so it is pushed onto the expression evaluation stack. Then C and D are added, and their sum is likewise pushed onto the expression evaluation stack. Finally, the two top stack elements, (A + B) and (C + D), are multiplied and the result is stored into X. The expression evaluation stack manages intermediate results automatically, handling expressions nested as deeply as there are available stack elements. Readers who have used Hewlett-Packard calculators with Reverse Polish Notation will find the expression evaluation stack immediately familiar.
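The steps above amount to evaluating the expression in postfix (Reverse Polish) order with an explicit value stack. A small hypothetical sketch, not from the original text:

```python
def eval_rpn(tokens, env):
    """Evaluate a postfix expression using an explicit value stack."""
    stack = []
    for tok in tokens:
        if tok == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)          # intermediate result goes back on the stack
        elif tok == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            stack.append(env[tok])       # push a variable's value

    return stack.pop()                   # final result is the lone remaining element

# X = (A + B) * (C + D) in postfix: A B + C D + *
env = {"A": 1, "B": 2, "C": 3, "D": 4}
print(eval_rpn(["A", "B", "+", "C", "D", "+", "*"], env))  # (1+2)*(3+4) = 21
```

Note that no intermediate result ever needs an explicit name: the stack discipline keeps each partial sum exactly where the next operator expects it.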

The expression evaluation stack is so fundamental to expression evaluation that even on register-based machines, compilers allocate registers in a manner that mimics an expression stack.

1.4.2 Return Address Stack

In the 1950s, recursion was proposed as an essential language feature, and a method was needed to dynamically allocate memory for subroutine return addresses. The problem at the time was that in non-recursive languages such as Fortran, the return address was saved in a space reserved inside the subroutine itself. This of course prevents a subroutine from calling itself, directly or indirectly, because the previously saved return address would be lost.

The solution to the recursion problem is to use a stack for subroutine return addresses. At each subroutine call, the machine pushes the caller's return address onto the stack, which guarantees that return addresses are processed in the order recursion requires. Because a new stack element is automatically allocated on every call, a subroutine can call itself without any difficulty.
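The call/return mechanism can be mimicked in software with an explicit return stack. The following is a hypothetical toy interpreter (not from the original book); real hardware pushes the return address as part of the call instruction itself:

```python
def run(program):
    """Step through a toy program, using a LIFO return stack for call/ret."""
    pc, return_stack, trace = 0, [], []
    while True:
        op, arg = program[pc]
        if op == "call":
            return_stack.append(pc + 1)  # save the caller's resume address
            pc = arg                     # jump to the subroutine
        elif op == "ret":
            pc = return_stack.pop()      # LIFO order matches call nesting
        elif op == "emit":
            trace.append(arg); pc += 1
        elif op == "halt":
            return trace

# Main program at address 0 calls a subroutine at address 100.
program = {
    0: ("call", 100), 1: ("emit", "back in main"), 2: ("halt", None),
    100: ("emit", "in subroutine"), 101: ("ret", None),
}
print(run(program))  # ['in subroutine', 'back in main']
```

Because `call` always pushes and `ret` always pops, nested or recursive calls unwind in exactly the reverse of the order they were made.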

Modern computers provide some form of hardware support for a return address stack. On conventional machines, this support usually takes the form of a stack pointer register together with subroutine call and subroutine return instructions. The return address stack is typically kept in an otherwise unused region of program memory.

1.4.3 Local variable stack

Another problem that arises with recursion is the management of local variables, and it is even more pronounced with reentrancy (the same code being used concurrently by multiple threads). In older languages such as Fortran, a subroutine's state was managed simply by statically allocating a fixed storage area alongside the subroutine code. This static allocation of memory works for programs that are neither reentrant nor recursive.

However, as soon as a subroutine may be invoked concurrently by multiple threads, or may be called recursively, local variables can no longer be bound statically to the procedure: the variable values of one executing thread are easily corrupted by a competing thread. The most common solution is to allocate space on a local variable stack. Each subroutine call allocates a new block of memory on the local variable stack to serve as the subroutine's workspace. Even when temporary values are kept in registers, the called subroutine needs some local variable stack space in which to save register values before they are overwritten.

A local variable stack not only enables reentrancy and recursion, but also saves memory. With statically allocated variables, the space is occupied whether or not the subroutine is active; with a local variable stack, space grows and shrinks with the depth of subroutine nesting, so the same memory is reused again and again.
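The per-call frame idea can be illustrated by managing a local variable stack by hand (a hypothetical sketch; a real machine would adjust a frame pointer in hardware, and Python of course already gives each call its own locals):

```python
locals_stack = []  # each subroutine call pushes a fresh frame of locals

def factorial(n):
    locals_stack.append({"n": n})  # allocate this call's frame on entry
    frame = locals_stack[-1]       # this invocation's private workspace
    if frame["n"] <= 1:
        result = 1
    else:
        # The recursive call pushes its own frame, so our "n" is untouched.
        result = frame["n"] * factorial(frame["n"] - 1)
    locals_stack.pop()             # free the frame on return
    return result

print(factorial(5))  # 120; each nested call had its own copy of "n"
```

When the outermost call returns, `locals_stack` is empty again: the same memory served all six nested invocations in turn.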

1.4.4 Parameter Stack

The final use of stacks in computing is the subroutine parameter stack. Whenever a subroutine is called, it must be given a set of parameters to operate on. Parameter values can be placed in registers, but that approach is limited by the number of available registers. Parameters can also be copied into a list, or accessed through a pointer to a list, when the subroutine is called, but then reentrancy and recursion become impossible. The most flexible method is simply to push the parameters onto a parameter stack before the subroutine call; a parameter stack allows both recursion and reentrancy.

1.4.5 Combined Stacks

Real machines combine the different stack types. On register-based machines, the most common arrangement combines the local variable stack, parameter stack, and return address stack into a single stack of activation records, or "frames". On these machines, the expression evaluation stack is eliminated by the compiler, which performs expression evaluation through register allocation instead.

The stack computers described later in this book use a different arrangement: a hardware expression evaluation stack kept separate from the return address stack, with the expression evaluation stack also used for parameter passing and local variable allocation. Sometimes, particularly when running conventional languages such as C or Pascal, a frame pointer register is used to point to local variables kept in a program memory area.

1.5 The New Generation of Stack Computers

The new generation of stack computers that is the focus of this book inherits the rich design heritage of earlier stack machines while exploiting new VLSI process technology. This combination provides a simplicity and flexibility beyond that of earlier classes of computers. The features that distinguish these machines from traditional designs are: multiple hardware-buffered stacks, zero-operand stack-based instruction sets, and fast subroutine call capability.

These design features produce machines with a striking set of characteristics, including: very high performance without pipelining, very low system complexity, small program size, fast program execution, low interrupt response overhead, program execution speed that is consistent across all time scales, and low context switching overhead. Some of these conclusions are obvious, while others run directly counter to the conventional wisdom of traditional computer architecture.

Most of these stack computer designs have their roots in the Forth programming language. Because these machines have two stacks, Forth serves as both their high-level language and their assembly language: one stack is used for expression evaluation and parameter passing, the other for holding subroutine return addresses. In a sense, the Forth language itself defines a stack-based computer architecture, one that is simulated by the host processor whenever a Forth program runs. The similarity between the Forth language and these hardware designs is no accident, and it makes the machines natural targets for programmers with Forth experience.

One interesting point deserves notice: although some of these machines were originally designed to run the Forth language, they also run conventional languages well. Thus, while none of them is likely to be chosen to replace the core processor of a personal computer or workstation, they can be used effectively for development, applications, and programming in many conventional languages. Most interesting are the applications that exploit the stack computer's unique strengths: small system size, good response to external events, and efficient use of hardware resources.

1.6 Contents of This Book

Stack processors of all types can be classified by the number of stacks, the size of dedicated stack memory, and the number of operands in an instruction. The stack computers discussed in this book are machines with multiple stacks and 0-operand addressing. Their stack buffer sizes are chosen as a trade-off between system cost and operating speed, so "stack computer" in this book refers to such machines.

These stack computers are characterized by small program size, high system performance, and performance that remains consistent under varying conditions. They run programs written in conventional languages at competitive speed, and the hardware needed to achieve this performance is less than that required by register-based computers for the same performance.

Stack computers are exceptionally good at running the Forth language. Forth is known for its interactivity, flexibility, and rapid program development, and it generates very compact code, which makes it especially suitable for real-time control problems.

This book discusses four 16-bit stack computers that span a range of design trade-offs between integration level and speed: the WISC CPU/16, MISC M17, Novix NC4016, and Harris RTX 2000.

The three 32-bit stack computers in this book address further design issues: the Johns Hopkins University/APL FRISC 3 (commercialized as the Silicon Composers SC32), the Harris RTX 32P, and the Wright State University SF1. Understanding stack computers requires collecting and analyzing a large amount of data and comparing them with register-based computers. Available points of comparison include: dynamic and static instruction execution frequencies from roughly 10 million Forth instructions, the effect of combining opcodes with subroutine calls in a single instruction on the RTX 32P, stack size requirements, stack overflow management strategies, and performance when handling frequent interrupts and context switches.

Many factors bear on the choice of software for stack computers. Programs written in a wide range of languages should run quite efficiently on a stack computer, especially after small modifications to frequently executed code segments. One field in which stack computers excel is embedded real-time control, which accounts for a major portion of all computer applications. Other areas of interest are discussed as well.

Future hardware and software efforts for stack computers may include making conventional programming languages run more efficiently on them, and adding hardware so that they need not be limited by memory bandwidth in the way other processors are.

Chapter 2 Classification of Hardware-Supported Stack Computers

Historically, computer designs offering extensive hardware support for high-level languages have included many forms of stack support, ranging from a single hardware stack pointer register to multiple hardware stack memories inside the CPU. Two new classes of processors show special interest in hardware-supported stacks: RISC processors, which often treat large register files as a stack; and real-time control processors, which use stack instructions to reduce program size and processor complexity.

An important step toward understanding stack computers is to classify them. A good taxonomy lets us examine design options clearly rather than getting bogged down in the details of a particular machine, and it also helps us understand why an existing design adopted a given architecture. From the taxonomy we will see that many more processor types exist besides the multiple-stack, 0-operand machines emphasized here.

In Section 2.1, we describe a classification of stack computers along three axes: the number of stacks, the stack buffer size, and the number of operands in an instruction. We also discuss the advantages and disadvantages of each kind of system.

In Section 2.2, we place existing stack processors within this taxonomy. In Section 2.3, we discuss the similarities and differences among the processors in each category, which helps illuminate stack computer design decisions.

2.1 The three axes of the stack design space

Figure 2.1 The three axes of the stack design space

The stack computer design space can be classified along the three axes shown in Figure 2.1: the number of hardware-supported stacks, the size of the dedicated stack element buffer, and the number of operands allowed in an instruction.

Although in some respects these three dimensions form a continuum, for classification purposes we divide them into 12 categories based on the possible values along each axis:

Number of stacks = single (S) or multiple (M)

Stack buffer size = small (S) or large (L)

Number of operands = 0, 1, or 2

2.1.1 Single Stack and Multiple Stacks

The most obvious example of stack support is a single stack used to hold subroutine return addresses, which is often also used to pass parameters to subroutines. Sometimes one or more additional stacks are added, either to allow subroutine calls to be processed without disturbing the parameter list, or to hold values on an expression stack kept separate from call information.

A single-stack computer has only one stack supported by the instruction set. This stack is mainly intended to hold subroutine call and interrupt information, and it may also be used for expression evaluation; in either case, it is commonly used by language compilers to pass subroutine parameters. A single stack keeps the hardware simple, but at the cost of intermixing data parameters with return address information. One advantage of a single stack is simpler operating system support: the operating system manages only one variable-sized memory region per process. Machines designed for structured programming languages typically use one stack that combines subroutine parameters and subroutine return addresses, usually together with some kind of frame pointer mechanism.

The disadvantage of a single stack is that parameters and return addresses must be interleaved. If modular software design requires parameter list elements to be passed down through multiple layers of software interfaces, those parameters must be copied repeatedly into new activation records, adding overhead.

Multiple-stack computers have two or more stacks supported by the instruction set. One stack is usually dedicated to storing return addresses, while the others are used for expression evaluation and subroutine parameters. Multiple stacks allow control flow information to be kept separate from data operands.

When the parameter stack is separate from the return address stack, software can pass a set of parameters down through several layers of subroutines without copying the data into new parameter lists.

An important advantage of multiple stacks is speed. Multiple stacks allow access to multiple values within a single clock cycle. For example, a machine that can access its data stack and return address stack simultaneously can perform a subroutine call or return in parallel with data operations.

2.1.2 Size of stack buffer

The amount of memory dedicated to buffering stack elements is critical to performance. Implementations range from keeping stack elements in program memory, through providing a few top-of-stack registers, to a completely separate stack memory unit. Our classification distinguishes designs in which the stacks reside almost entirely in program memory (with perhaps a few buffered elements in the CPU) from designs that provide a substantial, efficient stack buffer.

Architectures with a small stack buffer typically treat the stack as a reserved portion of the general program memory address space. The stack shares the same memory subsystem as instructions and variables, and stack operands can be accessed with ordinary memory access instructions when needed. Stack elements can also be reached through a stack pointer or frame pointer into memory.

To run at reasonable speed, a stack computer must buffer at least the top one or two stack elements inside the processor. To see why, consider an addition on a machine with no buffered elements: a single add instruction would need three memory cycles, two to fetch the operands and one to store the result. With the top two stack elements buffered, the addition needs only one memory cycle, used to read in the new second-from-top element to replace the stack parameter consumed by the addition.

If a small stack buffer is used and the bulk of the stack resides in program memory, switching between the stacks of different tasks is quick, because the stack elements are already in memory.

Small dedicated stack buffers are easy to implement and manage, so they are quite common. Keeping most data elements in main memory also simplifies the management of pointers, strings, and other data structures. The disadvantage of this approach is that reads and writes of stack elements consume a significant amount of memory bandwidth.

If an architecture has a sufficiently large stack buffer, accessing stack elements usually consumes no main memory bandwidth at all. Such a "large stack buffer" architecture uses one of the following structures: a register file accessed as a stack, as in the RISC I (Sequin & Patterson 1982); a separate memory unit isolated from program memory; or a dedicated stack cache within the processor (Ditzel & McLellan 1982).

Roughly speaking, a stack buffer is "large enough" if several levels of subroutine calls (say, five or more) do not exhaust the stack memory. For an expression evaluation stack, about 16 elements can be considered "large enough", since a single expression is rarely very complex. In Chapter 6, we present measurements of real programs that give a more precise answer to "how big is big enough". One advantage of a large stack buffer is that no program memory cycles are consumed when accessing data elements and subroutine return addresses, which significantly increases execution speed, especially for call-intensive programs.

One disadvantage of a separate stack memory unit is that its size may not suffice for every application, in which case stack elements may have to be spilled to program memory as new elements are pushed. Also, in a multitasking environment, saving the entire stack memory at every context switch may be unacceptable, although the stack memory can instead be partitioned among tasks. At a lower level, separating off-chip stack memory from program memory adds pins, which is expensive for a microprocessor.

Obviously, "large" and "small" are somewhat fuzzy descriptions, but in practice the distinction is usually clear to designers.

2.1.3 0-, 1-, and 2-Operand Addressing

At first glance, the number of operands in a machine instruction might seem irrelevant to hardware-supported stack computers. In fact, the addressing mode has a profound effect on how the stacks are constructed and how programs use them.

0-operand instructions contain no operand fields in the opcode; all operations implicitly take their operands from a designated stack. This is generally referred to as "pure" stack addressing.

Of course, a 0-operand architecture must use a stack for expression evaluation.

Even so, a pure stack machine must have a few instructions that specify an address: loads and stores of variables in program memory, literal (constant) loads, subroutine calls, and conditional branches. These instructions generally use a very simple format, with the operand usually stored in the memory word following the opcode.

Simple 0-operand instructions have several advantages. Since each instruction touches only the top one or two stack locations, the stack memory structure can be simplified: with one or two top-of-stack registers, a single-ported stack memory suffices. Another advantage is speed: because the operand sources of every instruction are known in advance, the operand registers can be loaded in parallel with instruction decoding. This can eliminate the operand fetch and operand store stages of a pipeline entirely.

Another advantage is that each instruction can be extremely compact; an 8-bit format accommodates 256 distinct opcodes. Instruction decoding is also simplified, since the decoding hardware need not interpret operand addressing modes.

A disadvantage of 0-operand addressing is that complex addressing modes for data structures must be synthesized from several instructions. It is also difficult to access a data element buried deep in the stack, unless an instruction to "copy the Nth element" is provided.

A 1-operand instruction computer typically specifies one operand explicitly and uses the top-of-stack element as the implicit second operand. 1-operand addressing is also known as stack/accumulator addressing; it is more flexible than 0-operand addressing because it combines the operand fetch with the stack operation.

Keedy (1978) concluded that the stack/accumulator architecture compares favorably with the pure stack architecture, since the former can evaluate expressions with fewer instructions, arguing that programs for a 1-operand design come out shorter than for a 0-operand design. Of course, there is a tradeoff. Since an operand is specified by the instruction, one of two methods must be employed to access it efficiently: either an operand fetch pipeline stage, or a longer clock cycle. If a parameter resides on a subroutine parameter stack or the evaluation stack, the stack memory must be addressed with an operand offset; this takes more execution time or more pipeline hardware than a top-of-stack element that has been prefetched and is waiting for the operation. 1-operand stack architectures almost always have an expression evaluation stack, and many 1-operand architectures also support a 0-operand addressing mode to save instruction bits when no operand field is needed.

2-operand instruction formats (for the purposes of this book we treat 3-operand instructions as a special case) allow each instruction to specify both a source and a destination. If the stack is used only to store return addresses, a 2-operand computer reduces to a general-purpose register machine. If subroutine parameters are passed on the stack, a 2-operand machine either addresses them as offsets from a stack or frame pointer, or operates on the current register window. 2-operand machines do not require an expression evaluation stack, but this shifts the burden of tracking expression intermediates to the compiler.

The 2-operand machine offers the greatest flexibility, but it requires more complex hardware to be efficient. Because the operands cannot be known until the instruction is decoded, data forwarding paths and dual-ported register files must be provided to deliver operands to the execution unit.

2.2 Expressing the Classification

2.2.1 Notation

For convenience of discussion, we record an architecture with a three-character code, one character per classification axis. The first letter gives the number of stacks (S = single, M = multiple); the second letter gives the size of the dedicated stack memory (S = small, L = large); the third character is a digit giving the number of operands per instruction (0, 1, or 2). Thus SS0 denotes an architecture with a single stack, a small dedicated stack buffer, and 0-operand addressing; ML2 denotes an architecture with multiple stacks, large dedicated stack memory, and 2-operand addressing.
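Purely as an illustration (the helper below is our own, not part of the taxonomy), the three-character code can be decoded mechanically:

```python
def decode_taxonomy(code: str) -> dict:
    """Decode a three-character stack-architecture code such as 'SS0' or 'ML2'."""
    stacks = {"S": "single stack", "M": "multiple stacks"}
    size = {"S": "small dedicated stack buffer", "L": "large dedicated stack buffer"}
    return {
        "stacks": stacks[code[0]],
        "buffer": size[code[1]],
        "operands": int(code[2]),  # operands named per instruction: 0, 1, or 2
    }

print(decode_taxonomy("SS0"))
print(decode_taxonomy("ML2"))
```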

2.2.2 The Taxonomy Table

Table 2.1 uses our classification to catalog current and historical stack-based computers. Appendix A briefly discusses the features and implementation of each machine.

Table 2.1 Stack Computer Classification

Classification  Computers
SS0:  Aerospace Computer, Burroughs family, Caltech Chip, EULER, GLOSS, HITAC-10, ITS, LAX2, MESA, Microdata 32/S, Transputer, WD9000
SS1:  AAMP, Buffalo Stack Machine, EM-1, HP300/HP3000, ICL2900, IPL-VI, MCODE, MU5, POMP Pascal
SS2:  Intel 80x86
SL0:  G Machine, NORMA
SL1:  AADC, Micro-3L
SL2:  AM29000, CRISP, Dragon, Pyramid 90x, RISC I, SOAR
MS0:  ACTION Processor, APL Language Machine, FORTRAN Machine, HUT, Internal Machine, MISC M17, Rockwell Microcontrollers, SYMBOL, Tree Machine
MS1:  PDP-11
MS2:  Motorola 680x0
ML0:  ALCOR, An ALGOL Machine, FRISC 3, KDF-9, Kobe University Machine, MF1600, NC4016, OPA, Pascal Machine, QForth, Reduction Language Machine, REKURSIV, RTX 2000, RTX 32P, RUFOR, The Forth Engine, TM, Vaughan & Smith's Machine, WISC CPU/16, WISC CPU/32
ML1:  Lilith, LISP Machines, SF1, Soviet Machine
ML2:  PSP, SF1, SOCRATES

2.3 Interesting Aspects of the Classification

It may be surprising that all 12 categories in the classification space contain actual designs, which shows how extensively the different stack architectures have been explored. Another feature is that machines tend to cluster along the operand-count axis as the primary design parameter, with the designs within each group differing mainly in the size and number of their stack buffers.

0-operand addressing yields the "pure" stack computers. Not surprisingly, such systems include the most academic and conceptual design projects, since they embody the canonical stack computer model. Owing to their inherent simplicity, SS0 machines are typically used where hardware resources, design time, or both are limited. In SS0 designs, if no efficient means of copying elements below the top of the stack is provided, interleaving return addresses and data elements on the single stack causes efficiency problems.

SL0 systems seem to be used only for combinator graph reduction (a technique for executing functional programming languages; see Jones 1987). This application performs tree traversal, using the stack to store node addresses during the traversal. No expression evaluation stack is needed, because results are stored in the tree memory itself.

MS0 and ML0 are very similar; their main difference is the amount of on-chip or on-board memory devoted to buffering stack elements. All Forth language processors and many other high-level language processors belong to this group. These machines are very useful in real-time embedded control because of their simplicity, high processing speed, and small program code size (Danile & Malinowski 1987, Fraeman et al. 1986). Many MS0 and ML0 machines are designed to allow very fast subroutine calls and returns.

1-operand addressing attempts to break the bottleneck of 0-operand designs by transforming the pure stack model into a stack/accumulator model. SS1 designs can use addresses or frame pointers more easily than SS0 designs. A significant advantage of 1-operand designs is that a push can be combined with an arithmetic operation, saving instructions in some circumstances. Both Pascal and Modula-2 machines use 1-operand addressing, owing to the nature of P-code and M-code. 2-operand addressing is the mainstream design approach, and conventional processors fall into SS2. RISC machines can be classified as SL2 because of their register windows, though few other designs fall there. The MS2 classification of Motorola's 680x0 family reflects that machine's flexibility: any of its 8 address registers can be used as a stack pointer. The PSP machine in ML2 reflects a conceptual design that includes register windows to greatly speed up subroutine calls. The SF1 machine also uses multiple stacks, dedicating a hardware stack to each active process in a real-time control environment.

From the above discussion, we can see that computer designs fall into the 12 categories we have delineated. Designs within each category show great similarity, while designs in different categories differ substantially in ways that affect system implementation and operation. The classification is therefore a useful tool for examining the properties of stack computers.

In the next chapter we focus on a specific part of the stack computer design space: MS0 and ML0. Hereafter, the terms "stack computer" and "stack processor" specifically mean MS0 and ML0 computers.

Chapter 3 Multi-Stack, 0 Operational Computer

This chapter discusses the multi-stack, 0-operand computers of the MS0 and ML0 classes.

In 3.1 we compare stack computers with traditional CISC and RISC architectures.

In 3.2 we describe a prototype design called the Canonical Stack Machine, giving its block diagram and instruction set. This two-stack machine serves as the starting point for the real stack computers of the following chapters.

In 3.3 we briefly discuss the Forth programming language. Forth is a non-traditional programming language built on a two-stack computation model that encourages many short procedure calls. Many ML0 and MS0 machines are designed to run Forth, and are naturally also well suited to being programmed in Forth.

3.1 Why are we interested in such computers?

Multi-stack, 0-operand computers have two inherent advantages over other computers: 0-operand addressing yields the smallest possible instructions, and multiple stacks allow subroutine control flow and data manipulation to proceed without interfering with each other. These and other features lead to compact program code, low system complexity, and high system performance. The main difference between MS0 and ML0 is that MS0 uses minimal resources for the stack buffers in order to reduce CPU cost, at some sacrifice in performance.

Chapter 6 discusses in detail how stack computers obtain these advantages. First, let us summarize them.

Stack computers shrink program size in two ways: first, by encouraging heavy use of subroutines, which reduces the amount of code; second, by the fact that stack computer instructions are short. Small program size lowers memory cost, package count, and power consumption, and improves system speed by permitting the use of smaller, faster memory chips. Further advantages include better performance in virtual memory environments and high hit rates with little or no cache. 0-operand computers achieve smaller code than other machines.

Reduced system complexity means reduced chip size, leaving more chip area for on-chip program memory or semi-custom features. A system's performance is not just raw execution speed; it includes the performance of the whole system and its fitness for the real world. Speed is not only a matter of how many straight-line instructions can be executed per second; the performance lost to branches and procedure calls must also be counted. In a stack computer, the practical result of 0-operand addressing and heavy use of subroutine calls is smaller code and lower system complexity, and hence improved system performance for the application.

An additional benefit of the stack processor's efficient procedure calls is that the architecture itself encourages programmers to improve the structure of their code. Encouraging better coding habits improves maintainability, and using small subroutines as building blocks increases code reuse.

3.2 The Canonical Stack Machine

Before getting into the details of real MS0 and ML0 designs, we need to establish a baseline, so we first examine a canonical ML0 design. This design is kept as simple as possible so that it can serve as a common starting point for comparing the other designs.

3.2.1 Block Diagram

Figure 3.1 is the block diagram of the Canonical Stack Machine. Each block in the figure represents a logical resource corresponding to one of ML0's essential components. These components are: the data bus, the data stack (DS), the return stack (RS), the arithmetic logic unit (ALU) with its top-of-stack register (TOS), the program counter (PC), program memory, control logic with an instruction register (IR), and an input/output section (I/O).

Figure 3.1 The Canonical Stack Machine

3.2.1.1 Data Bus

For the sake of simplicity, the Canonical Stack Machine has only a single bus connecting the system resources. A real processor might use multiple data buses so that instruction fetch and calculation can proceed in parallel. In the Canonical Stack Machine, the data bus allows one transmitting functional block and one receiving functional block in any given operation cycle.

3.2.1.2 Data Stack

The data stack is a memory with internal mechanisms that implement a LIFO stack. The usual implementation is a conventional memory with an up/down counter used to generate the address. The data stack allows two operations: push and pop. A push allocates a new cell on the top of the stack and writes the value on the data bus into it. A pop places the value of the top-of-stack cell onto the data bus, then deletes the cell, exposing the next element on the stack for subsequent operations.
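A minimal software model of these two operations behaves as follows (an illustrative sketch, not hardware; the class name is our own):

```python
class LIFOStack:
    """Software model of the data stack: push and pop as seen from the data bus."""

    def __init__(self):
        self._cells = []  # index -1 is the top of stack

    def push(self, bus_value: int) -> None:
        self._cells.append(bus_value)  # allocate a new top cell, write the bus value

    def pop(self) -> int:
        return self._cells.pop()  # drive the top value onto the bus, delete the cell

ds = LIFOStack()
ds.push(12)
ds.push(45)
assert ds.pop() == 45  # last in, first out
assert ds.pop() == 12
```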

3.2.1.3 Return Stack

The return stack is a LIFO stack implemented in the same way as the data stack. The only difference is that the return stack holds subroutine return addresses instead of instruction operands.

3.2.1.4 ALU and Top-of-Stack Register

The ALU functional block performs arithmetic and logical operations on two data elements. One of the two is the top-of-stack register TOS, which holds the topmost element of the data stack as seen by the programmer. Thus, the top element of the actual hardware data stack is the second element visible to the programmer, since the first visible element is kept in the TOS register at the ALU. This arrangement allows a single-ported data stack memory to be used when operating on the top two stack elements, as in addition.

The ALU supports the primitive operations needed for computation. For purposes of illustration, only addition, subtraction, the logical functions (AND, OR, XOR), and a test for zero are provided. Since this is only a conceptual design, arithmetic is integer-only; there is, of course, no reason the ALU could not be extended to floating point arithmetic.

3.2.1.5 Program Counter

The program counter holds the address of the next instruction to be executed. The PC can be loaded from the data bus to implement branching, or incremented when instructions are fetched sequentially from program memory.

3.2.1.6 Program Memory

The program memory module includes a memory address register (MAR) and some amount of random access memory. To access memory, the MAR is first written with the address to be read or written; in the next system cycle, the addressed program memory cell is read onto the data bus or written from the data bus.

3.2.1.7 I/O

As with many conceptual designs, we defer detailed discussion of input and output. Suffice it to say that I/O is a system resource like any other, and that an I/O module handles the task.

3.2.2 Data Operations

Table 3.1 gives the minimal operation set of the Canonical Stack Machine. This operation set was chosen to illustrate how the machine works; it is clearly not sufficient for efficient program execution. For the same reason of simplicity, multiplication, division, and shift operations are not included. The notation comes from the Forth language (see 3.3), which is also used in the discussions of later chapters. Note that Forth makes free use of special characters, for example ! (pronounced "store") and @ (pronounced "fetch").

Table 3.1 Instruction Set of the Canonical Stack Machine

Instruction   Data stack (in -> out)    Description
!             N1 ADDR ->                Store N1 at program memory location ADDR
+             N1 N2 -> N3               Add N1 and N2, giving N3
-             N1 N2 -> N3               Subtract N2 from N1, giving N3
>R            N1 ->                     Push N1 onto the return stack
@             ADDR -> N1                Fetch the value at program memory location ADDR, returning N1
AND           N1 N2 -> N3               Bitwise AND of N1 and N2, giving N3
DROP          N1 ->                     Drop N1 from the stack
DUP           N1 -> N1 N1               Duplicate N1
OR            N1 N2 -> N3               Bitwise OR of N1 and N2, giving N3
OVER          N1 N2 -> N1 N2 N1         Copy the second element N1 to the top of the stack
R>            -> N1                     Pop the top element of the return stack onto the data stack
SWAP          N1 N2 -> N2 N1            Exchange the top two stack elements
XOR           N1 N2 -> N3               Bitwise exclusive OR of N1 and N2, giving N3
[IF]          N1 ->                     If N1 is false (value 0), branch to the address in the next memory cell; otherwise continue
[CALL]        ->                        Perform a subroutine call
[EXIT]        ->                        Perform a subroutine return
[LIT]         -> N1                     Treat the next program memory cell as an integer constant N1 and push it onto the stack

3.2.2.1 Reverse Polish Notation

Stack computers use postfix notation for their data manipulation operations, which is often referred to as Reverse Polish Notation (RPN). The most obvious feature of postfix notation is that the operands come before the operator. For example, a conventional infix expression is written:

(12 + 45) * 98

In this expression, parentheses force the addition to be computed before the multiplication. Even expressions without parentheses have an implicit operator precedence; for example, an unparenthesized multiplication is computed before an addition. The parenthesized expression above is equivalent to the postfix expression:

98 12 45 + *

In postfix notation, an operator acts on the most recently seen operands, implicitly using a stack for evaluation. In this example, the numbers 98, 12, and 45 are pushed onto the stack as shown in Figure 3.2; the + operation then adds the top two stack elements (45 and 12), giving 57; finally the * operation multiplies the new top two elements, 57 and 98, giving the result 5586.

Figure 3.2 Example of Reverse Polish Notation evaluation

Postfix notation has one great economy compared with infix notation: it requires no operator precedence rules and no parentheses, which suits the needs of a computer very well. In fact, compilers for infix languages such as C and Fortran translate expressions into postfix-ordered machine code, merely substituting explicitly allocated registers for an expression stack. The Canonical Stack Machine is designed to execute postfix operations directly, so no compiler register allocation is required.
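The evaluation just described can be traced with a small postfix evaluator. This is an illustrative sketch (the function name and token handling are our own, not part of the original text):

```python
def eval_rpn(tokens):
    """Evaluate a postfix (RPN) expression using an explicit stack."""
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()          # topmost operand
            a = stack.pop()          # second operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))   # operand: push onto the stack
    return stack.pop()

print(eval_rpn("98 12 45 + *".split()))   # (12 + 45) * 98 = 5586
```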

3.2.2.2 Arithmetic and Logical Operators

To perform basic arithmetic, the Canonical Stack Machine needs arithmetic and logical instructions. Each instruction is discussed below using register transfer level pseudocode, which should be largely self-explanatory. For example, the first operation is addition:

Operator: +
Stack: N1 N2 -> N3
Description: Add N1 and N2, giving N3.
Pseudocode: TOSREG <= TOSREG + POP(DS)

For the + operation, the user-visible top two stack elements N1 and N2 are popped and their sum N3 is pushed. From an implementation point of view, this means popping DS (which yields N1) and adding it to the value of TOSREG (which contains N2); the result N3 remains in the TOSREG register as the new top of the user-visible data stack. Note that the DS element accessed by POP(DS) is not the programmer-visible top of stack but the top element of the actual hardware stack; treating TOSREG as the top-of-stack register is consistent with the POP(DS) notation. Also note that buffering one element in the top-of-stack register eliminates the pop of N2 and the subsequent push of N3.

Operator: -
Stack: N1 N2 -> N3
Description: Subtract N2 from N1, giving N3.
Pseudocode: TOSREG <= POP(DS) - TOSREG

Operator: AND
Stack: N1 N2 -> N3
Description: Bitwise logical AND of N1 and N2, giving N3.
Pseudocode: TOSREG <= TOSREG AND POP(DS)

Operator: OR
Stack: N1 N2 -> N3
Description: Bitwise logical OR of N1 and N2, giving N3.
Pseudocode: TOSREG <= TOSREG OR POP(DS)

Operator: XOR
Stack: N1 N2 -> N3
Description: Bitwise logical exclusive OR of N1 and N2, giving N3.
Pseudocode: TOSREG <= TOSREG XOR POP(DS)

It is clear that the top-of-stack register saves a great deal of work in these operations.

3.2.2.3 Stack Manipulation Operations

Pure stack computers have one problem: arithmetic operations can access only the top two elements of the stack. Thus some additional instructions are needed to arrange operands for subsequent operations. It should be said, though, that register-based computers also spend many instructions on register-to-register copies to prepare for operations, so which approach is better is a complicated question.

The following instructions manipulate stack elements.

Operator: DROP
Stack: N1 ->
Description: Drop N1 from the stack.
Pseudocode: TOSREG <= POP(DS)

The notation TOSREG <= POP(DS) is used in this definition and in several below. To accomplish it, the top element of the data stack is placed on the data bus, then passed through the ALU into the top-of-stack register with a null operation (such as adding 0).

Operator: DUP
Stack: N1 -> N1 N1
Description: Duplicate N1, leaving two copies on the stack.
Pseudocode: PUSH(DS) <= TOSREG

Operator: OVER
Stack: N1 N2 -> N1 N2 N1
Description: Copy the second element N1 to the top of the stack.
Pseudocode:
    PUSH(RS) <= TOSREG     (place N2 on RS)
    TOSREG <= POP(DS)      (place N1 into TOSREG)
    PUSH(DS) <= TOSREG     (push N1 onto DS)
    PUSH(DS) <= POP(RS)    (push N2 onto DS)

Looking at these definitions, we find that OVER, while conceptually simple, is quite involved in operation: it requires temporary storage for N2. In a real machine, one or more temporary holding registers can be added to lighten the burden of OVER, SWAP, and other stack operations.

Operator: SWAP
Stack: N1 N2 -> N2 N1
Description: Exchange the order of the top two stack elements.
Pseudocode:
    PUSH(RS) <= TOSREG     (save N2 on the RS)
    TOSREG <= POP(DS)      (place N1 into TOSREG)
    PUSH(DS) <= POP(RS)    (push N2 onto DS)

Operator: >R (pronounced "to R")
Stack: N1 ->
Description: Push N1 onto the return stack.
Pseudocode:
    PUSH(RS) <= TOSREG
    TOSREG <= POP(DS)

Operator: R> (pronounced "R from")
Stack: -> N1
Description: Pop the top element of the return stack, pushing it onto the data stack.
Pseudocode:
    PUSH(DS) <= TOSREG
    TOSREG <= POP(RS)

The instruction >R and its counterpart R> move data between the data stack and the return stack. Among other uses, this technique gives access to data buried in the data stack, by temporarily placing data stack elements on the return stack.
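As an illustration of using the return stack for temporary storage, the OVER pseudocode can be mirrored with two Python lists standing in for DS and RS (a sketch only; the TOS register is folded into the list, and all names are our own):

```python
ds = [10, 20]      # data stack, 20 on top (TOS register folded into the list)
rs = []            # return stack

# OVER: N1 N2 -> N1 N2 N1, using the return stack as temporary storage
rs.append(ds.pop())    # PUSH(RS) <= TOS      (save N2)
n1 = ds.pop()          # TOS <= POP(DS)       (expose N1)
ds.append(n1)          # PUSH(DS) <= TOS      (restore N1)
ds.append(rs.pop())    # PUSH(DS) <= POP(RS)  (restore N2)
ds.append(n1)          # copy of N1 now on top

print(ds)   # [10, 20, 10]
```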

3.2.2.4 Memory Fetch and Store

Although all arithmetic and logical operations act on data elements on the stack, those data must be fetched from memory onto the stack before use, and results must be stored from the stack back into memory afterwards. The Canonical Stack Machine uses a simple load/store architecture, so it has only a single load instruction, @, and a single store instruction, !.

Since these instructions carry no address field, the memory address is taken from the stack. This is very convenient for accessing data structures, because the stack can hold pointers to array elements. Because memory must be accessed twice, once to fetch the instruction and once for the data, these instructions take two memory cycles to execute.

Operator: ! (pronounced "store")
Stack: N1 ADDR ->
Description: Store N1 at program memory location ADDR.
Pseudocode:
    MAR <= TOSREG
    MEMORY <= POP(DS)
    TOSREG <= POP(DS)

Operator: @ (pronounced "fetch")
Stack: ADDR -> N1
Description: Fetch the value N1 from program memory location ADDR.
Pseudocode:
    MAR <= TOSREG
    TOSREG <= MEMORY

3.2.2.5 Literals

Sometimes a constant must be placed on the stack. The instruction that does this is called a literal instruction, often called a load-constant or load-immediate instruction on register-based computers. A literal occupies two consecutive program memory cells: one for the instruction itself, the other for the constant value to be pushed onto the stack. Literals therefore require two memory cycles: one for the instruction, one for the data element.

Operator: [LIT]
Stack: -> N1
Description: Treat the contents of the next program memory cell as an integer constant and push it onto the stack.
Pseudocode:
    MAR <= PC              (address of the literal)
    PC <= PC + 1
    PUSH(DS) <= TOSREG
    TOSREG <= MEMORY

This implementation assumes that the PC already points to the cell after the current opcode when the instruction executes.

3.2.3 Instruction Execution

So far we have ignored how an instruction is actually fetched from program memory and executed. Execution follows the typical sequence of instruction fetch, decode, and execute.

3.2.3.1 Program Counter

The program counter is a register that points to the next instruction to be executed. After an instruction fetch, the program counter is automatically incremented to point to the next word in memory. In the case of a branch or subroutine call, the program counter is instead loaded with the destination address.

Operation: [fetch an instruction]
Pseudocode:
    MAR <= PC
    INSTRUCTION_REGISTER <= MEMORY
    PC <= PC + 1

3.2.3.2 Conditional Branching

To make decisions, a computer must have some way of performing conditional branches. The Canonical Stack Machine uses perhaps the simplest method possible: a conditional branch taken when the top stack element is 0. Branching on the top of stack eliminates the need for condition codes while still allowing all control flow structures to be implemented.

Operator: [IF]
Stack: N1 ->
Description: If N1 is false (value 0), branch to the address held in the next program memory cell; otherwise continue with the next sequential instruction.
Pseudocode:
    if TOSREG = 0 then
        MAR <= PC
        PC <= MEMORY
    else
        PC <= PC + 1
    endif
    TOSREG <= POP(DS)

3.2.3.3 Subroutine Calls and Returns

Finally, the Canonical Stack Machine must have a way to perform efficient subroutine calls. Because there is a dedicated return stack, a subroutine call simply pushes the current program counter value onto the return stack, then loads the new address into the program counter. We assume that the instruction used for a subroutine call can specify any subroutine address in a single instruction word, and we ignore the mechanics of separating the address field from the rest of the instruction. The real machines discussed later accomplish this with only a small hardware overhead.

A subroutine return simply pops the return address from the top of the return stack and places it in the program counter. Since data parameters are kept on the data stack, a subroutine return needs no pointer or memory manipulation.

Operator: [CALL]
Stack: ->
Description: Perform a subroutine call.
Pseudocode:
    PUSH(RS) <= PC                              (save the return address)
    PC <= instruction register address field

Operator: [EXIT]
Stack: ->
Description: Perform a subroutine return.
Pseudocode:
    PC <= POP(RS)
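The instruction definitions above can be pulled together in a small fetch-execute loop. This is an illustrative sketch, not any real machine's implementation; for simplicity the call target occupies the cell after [CALL], unlike the single-word call of the Canonical Stack Machine, and memory cells hold either an opcode string or a number:

```python
def run(mem, start=0):
    """Minimal interpreter for a subset of the Canonical Stack Machine."""
    ds, rs, pc = [], [], start           # data stack, return stack, program counter
    while True:
        op = mem[pc]; pc += 1            # instruction fetch: IR <= MEMORY, PC <= PC + 1
        if op == "[LIT]":
            ds.append(mem[pc]); pc += 1  # push the next cell as an integer constant
        elif op == "+":
            ds.append(ds.pop() + ds.pop())
        elif op == "[IF]":
            target = mem[pc]; pc += 1
            if ds.pop() == 0:            # branch taken when the top of stack is false (0)
                pc = target
        elif op == "[CALL]":
            rs.append(pc + 1)            # save the return address on the return stack
            pc = mem[pc]                 # jump to the call target
        elif op == "[EXIT]":
            if not rs:
                return ds                # outermost return ends the run (our convention)
            pc = rs.pop()

# program: push 12 and 45, call the add subroutine at cell 7, return
prog = ["[LIT]", 12, "[LIT]", 45, "[CALL]", 7, "[EXIT]",   # main: cells 0-6
        "+", "[EXIT]"]                                     # subroutine at cell 7
print(run(prog))   # [57]
```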

3.2.3.4 Hardwired Versus Microcoded Instruction Implementation

The Canonical Stack Machine is meant as a conceptual simplification and deliberately ignores implementation concerns, but a real design must face them. One such concern is whether to use hardwired or microcoded control; an introduction to these techniques may be found in Koopman (1987a).

Hardwired designs are usually faster and can achieve better space efficiency as well. The price paid for this performance is usually more complex instruction decoding circuitry. A major risk is that if the instruction set changes late in a product's design cycle, the entire control logic may have to be redesigned.

A stack computer's instruction set is very simple, without the combinatorial explosion of formats and modes found in other computer architectures. For this reason, hardwired stack computer designs are comparatively straightforward.

As an added advantage, if a stack computer has instructions of 16 bits or more, the number of bits needed to specify an operation is small relative to the word length. A hardwired stack computer can exploit this by using an unencoded instruction format, further simplifying the design and improving flexibility. Unencoded (also called non-encoded) instruction formats resemble microcode, with particular bits directing particular actions. This makes it possible to combine several separate operations in one instruction (such as DUP and [EXIT]). Even if a 16-bit instruction wastes some bits, fixed-length instructions are chosen to simplify decoding and to let subroutine calls be encoded in the same length as other instructions. A simple strategy is to use the highest bit: 0 designates a subroutine call (giving a 15-bit address field), 1 designates an unencoded instruction (giving a 15-bit unencoded operation field). The speed advantage of fixed-length instructions and the ability to compress several operations into one instruction generally favor the fixed-length format. The use of unencoded, hardwired instructions in stack computers began with the Novix NC4016 and was carried into other designs.
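The high-bit convention can be illustrated with a few lines of bit manipulation. This is a sketch of the general scheme only, not the actual NC4016 encoding:

```python
CALL_MASK = 0x8000          # bit 15: 0 = subroutine call, 1 = unencoded instruction

def encode_call(addr: int) -> int:
    assert 0 <= addr < 0x8000        # 15-bit address field
    return addr                      # high bit left at 0

def encode_unencoded(op_bits: int) -> int:
    assert 0 <= op_bits < 0x8000     # 15-bit unencoded operation field
    return CALL_MASK | op_bits       # high bit set to 1

def decode(word: int):
    if word & CALL_MASK:
        return ("op", word & 0x7FFF)
    return ("call", word)

assert decode(encode_call(0x1234)) == ("call", 0x1234)
assert decode(encode_unencoded(0x0042)) == ("op", 0x0042)
```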

Given all these benefits of hardwired instruction decoding, one might conclude that stack computers would never be microcoded. Microcoded implementations, however, have several advantages of their own.

The main advantage of microcode is flexibility. Since many bit combinations of an unencoded hardwired format go unused, a microcoded machine can use fewer instruction bits to specify the same operations, including optimized instructions that perform whole sequences of stack functions. Such an architecture leaves opcode space free for user-specified instructions. A microcoded machine can have complex, multi-cycle user-defined instructions that would be awkward with hardwired control. If some or all of the microcode memory is RAM, the instruction set can even be customized for each user, or each application.

One potential disadvantage of the microcoded approach is the speed lost in accessing the microcode memory; compensating for this loss typically requires a microinstruction fetch pipeline. That can stretch an instruction's execution beyond one cycle, whereas hardwired machines are optimized for single-cycle execution.

In practice, though, this is not a real shortcoming. If processor speed is properly matched to memory speed, many processors can perform two internal operations within a single memory access cycle. Thus, whether the design is hardwired or microcoded, an instruction can complete in one memory cycle. Moreover, since the microcode engine can step more than once per memory cycle, there are extra opportunities to optimize code sequences and to support user-defined instructions.

In actual practice, microcoded implementations are more convenient for discrete-component designs, so they are found in board-level products; most single-chip designs are hardwired.

3.2.4 State Changes

In real-time control applications, an important consideration is how the processor handles interrupts and task switching. The Canonical Stack Machine's instruction set avoids these issues to some extent, so here we discuss standard approaches to them, to establish a baseline for later design comparisons.

A stack computer using a stand-alone stack memory has an implicit responsibility: switched the stack to the program memory when the task changes, which requires a lot of status information. We will see how this state changes can be avoided in most cases. Chapter 6 discusses better techniques that can further reduce task switching burden.

3.2.4.1 Stack overflow/underflow interrupts

Interrupts are caused either by exceptional events, such as a stack overflow, or by I/O service requests. All of them require fast processing that does not disturb the current task.

On a stack computer, stack overflow/underflow is by far the most common exceptional condition, so we will use it as the example of how interrupts are handled.

A stack overflow/underflow occurs when the application exceeds the capacity of the stack memory. There are several possible responses to this situation: ignore the overflow and let the software crash (easy, but not very useful); halt the program with a fatal error; or copy part of the stack memory to program memory and allow the program to continue. The last method obviously offers the greatest flexibility, although in some environments treating overflow as a fatal execution error may be acceptable because it is simpler to implement. Other exceptional conditions, such as memory parity errors, can also be handled by the system.
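To illustrate the third response (copying part of the stack to program memory so the program can continue), here is a toy model. It is our own sketch, not the book's; the class names and the spill-half-the-stack policy are invented for illustration:

```python
# Illustrative sketch of stack spilling: on overflow, the oldest half of the
# on-chip stack is copied to a spill area in program memory; on underflow,
# elements are refilled from the spill area.

STACK_SIZE = 8  # tiny hardware stack for illustration

class HardwareStack:
    def __init__(self):
        self.cells = []          # on-chip stack cells, last item on top
        self.spill_area = []     # region of program memory used for spills

    def push(self, value):
        if len(self.cells) == STACK_SIZE:              # overflow exception
            # Spill the bottom half of the stack to program memory.
            self.spill_area.extend(self.cells[:STACK_SIZE // 2])
            del self.cells[:STACK_SIZE // 2]
        self.cells.append(value)

    def pop(self):
        if not self.cells:                             # underflow exception
            if not self.spill_area:
                raise RuntimeError("fatal stack underflow")
            # Refill from program memory.
            self.cells = self.spill_area[-(STACK_SIZE // 2):]
            del self.spill_area[-(STACK_SIZE // 2):]
        return self.cells.pop()

s = HardwareStack()
for i in range(20):
    s.push(i)                  # spills happen transparently
assert s.pop() == 19
```

The program never notices the spills; the handler preserves the stack's contents across the exception, which is exactly the property the text asks of exception handling.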

Handling these exceptional conditions takes varying amounts of time, but they share a common property: the current state must be preserved while the condition is processed, so that the task can be restarted when possible. On a stack computer, these conditions require no special processor action beyond forcing the hardware to generate a subroutine call to the handler encoded for the condition.

3.2.4.2 I/O service interrupts

I/O service requests are events that may demand very fast handling in a real-time control system. Fortunately, interrupts usually require only modest processor resources and little or no temporary storage. For this reason, a stack computer can treat an interrupt simply as a hardware-generated subroutine call.

Such a subroutine pushes its parameters onto the stack, performs its computation, and then executes a subroutine return to resume the interrupted program. The only restriction is that the interrupt service routine must not leave "garbage" behind on the stack.

On a stack computer, interrupts cost far less than on a conventional computer, for several reasons: no registers need to be saved, because the stack allocates memory automatically; no condition-code register needs to be saved, because branch conditions are kept as flags on the stack; and most stack processors have only a small data pipeline, or none at all, so there is little or no pipeline state to save when an interrupt is accepted.

3.2.4.3 Task switching

Task switching occurs when the processor changes between programs in order to give the appearance of several programs executing simultaneously. The state of the program being suspended must be saved at the switch so that it can be restored later, and the state of the program being resumed must be put in place before it restarts.

The conventional way to implement task switching is to use a timer and exchange tasks on each clock tick, sometimes subject to priorities and a task scheduling algorithm. In a simple processor, saving the stack contents to memory and reloading them on every context switch is an enormous burden. One solution is to write "lightweight" tasks that use only a little stack space; such tasks can push their parameters on top of the existing stack elements, avoiding the potentially much more expensive stack saving and restoring of a "heavyweight" process when the task completes.

Another solution is to provide multiple stack pointers, one per task, all pointing into the same stack memory hardware.
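A rough illustration of this pointer-per-task idea follows. The `Task` class, region sizes, and names are our own invention, not a real design; the point is only that switching tasks changes a pointer, not the contents of stack memory:

```python
# Illustrative sketch: task switching by swapping stack pointers rather than
# copying stack contents. Each task owns a region of one shared stack memory.

STACK_MEM = [0] * 256                 # one shared hardware stack memory

class Task:
    def __init__(self, base):
        self.sp = base                # this task's private stack pointer

    def push(self, v):
        STACK_MEM[self.sp] = v
        self.sp += 1

    def pop(self):
        self.sp -= 1
        return STACK_MEM[self.sp]

# Two tasks, each with its own pointer into a different region.
a, b = Task(base=0), Task(base=128)
a.push(11); a.push(22)
# A "context switch" to task b costs nothing but selecting b's pointer:
b.push(33)
assert a.pop() == 22 and b.pop() == 33
```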

3.3 Forth programming language overview

3.3.1 Forth as a common thread

Because modern stack processors are descended from the Forth programming language, we now briefly introduce that language.

The Forth programming language was invented by Charles Moore (Moore 1980) to control a telescope with a small computer. Because of these origins, Forth emphasizes efficiency, compactness, and flexible, effective interaction between software and hardware. Forth is also very powerful, and has been applied to a large number of general-purpose programming tasks: database management, accounting software, word processing, graphics, expert systems, and scientific calculation.

Appendix B contains a list of the primitive operations of the Forth language.

Some of the advantages of programming in Forth include: programs that are easy to modularize and debug, extreme flexibility, a very fast compile/edit/test cycle, high portability across a large number of computers, and compact source and object code (Jonak 1986). Kogge (1982) describes the underlying mechanisms of threaded-code software environments such as Forth.

3.3.2 The Forth virtual machine

To solve the original telescope control problem, Forth needed several important qualities: it had to be suitable for real-time control, highly interactive to use without elaborate operating system services, and able to fit within severe memory limits.

In pursuit of these goals, the language acquired two major features: threaded code and 0-operand stack instructions. The conceptual model underlying the language's operation is the Forth virtual machine. This virtual machine has two stacks: a data stack and a return stack. A Forth program is in fact an emulation of an MS0 machine on the host's hardware. Forth programs consist of many small subroutines, each of which simply calls other subroutines or invokes stack operation instructions. The structure of a program resembles a tree, with each subroutine built upon a small set of lower-level subroutines.
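The two-stack model can be suggested with a toy interpreter. This is purely illustrative (the word names and the dictionary representation are ours, and real Forth systems use threaded code rather than a Python dictionary), but it shows the tree of small subroutines bottoming out in stack primitives:

```python
# Illustrative miniature two-stack virtual machine in the spirit of the
# Forth model: a data stack for operands, a return stack for resumption
# points of nested subroutine calls.

data_stack = []
return_stack = []

PRIMITIVES = {
    "+":    lambda: data_stack.append(data_stack.pop() + data_stack.pop()),
    "dup":  lambda: data_stack.append(data_stack[-1]),
    "lit2": lambda: data_stack.append(2),
}

def execute(word, program):
    body = program.get(word)
    if body is None:                       # a primitive stack operation
        PRIMITIVES[word]()
    else:                                  # a subroutine: call each component
        for i, sub in enumerate(body):
            return_stack.append((word, i + 1))   # caller's resumption point
            execute(sub, program)
            return_stack.pop()

# square2 computes 2*2 by calling smaller definitions, tree-fashion.
program = {"square2": ["lit2", "dup", "+"]}
execute("square2", program)
assert data_stack == [4]
```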

Clearly, Forth is a natural assembly language for the 0-operand canonical stack computer we have been discussing.

We should note that although these processors were designed as Forth processors, they retain the ability to execute other high-level languages, because Forth's primitives are defined at such a low level that they correspond to machine-code operations present in any stack computer. Thus, computers billed as "Forth computers" are usually well suited to other high-level languages too.

3.3.2.1 Stack Architecture / RPN

The primitives of the Forth language include all the operations of the canonical stack computer listed in Table 3.1, and the names not enclosed in [brackets] correspond exactly to Forth words with the same function.

The bracketed names [IF], [CALL], [EXIT], and [LIT] are compiled automatically as internal functions that support program execution. [IF] is compiled when the compiler encounters a conditional branch. [CALL] is compiled when a non-primitive Forth word is referenced. [EXIT] is compiled at the end of a word's definition, or wherever the word EXIT appears in it. Finally, when a literal value such as 1234 appears in a program, [LIT] is compiled together with the value.
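A hypothetical sketch of this compilation scheme follows. The token classification and output format are invented for illustration (in a real Forth compiler, IF is an immediate word and branches carry resolved offsets), but the four internal functions are the ones named above:

```python
# Illustrative sketch: compiling a Forth-like token stream into the internal
# functions [LIT], [IF], [CALL], and [EXIT] described in the text.

def compile_word(tokens, primitives):
    code = []
    for tok in tokens:
        if tok.lstrip("-").isdigit():
            code.append(("[LIT]", int(tok)))      # literal such as 1234
        elif tok == "IF":
            code.append(("[IF]", None))           # conditional branch
        elif tok in primitives:
            code.append((tok, None))              # primitive: inline opcode
        else:
            code.append(("[CALL]", tok))          # reference to another word
    code.append(("[EXIT]", None))                 # compiled word exit
    return code

out = compile_word(["1234", "DUP", "TWICE", "IF"], primitives={"DUP"})
assert out == [("[LIT]", 1234), ("DUP", None), ("[CALL]", "TWICE"),
               ("[IF]", None), ("[EXIT]", None)]
```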

Several Forth constructs, such as loops, variables, and constants, have no directly corresponding canonical stack computer operations, but each can be synthesized from a few simple operations. Obviously, a real Forth-language computer should provide direct support for the common Forth structures.

3.3.2.2 Short, frequent procedure calls

The primary difference between Forth programs and programs in other languages is the very high frequency of subroutine calls. Good Forth programming style encourages the use of small subroutines for ease of development and testing; a subroutine usually contains only 5 or 10 instructions. Instruction frequency statistics show that it is considered normal for roughly 50% of the instructions in a program to be subroutine calls.

Such a software environment allows programs to be constructed very quickly and reliably, is particularly effective in environments where memory capacity is limited, and encourages the use of computers with fast subroutine calls.

3.3.3 Emphasis on interactivity and flexibility

A principal advantage of the Forth programming language is the interactivity of every level of its development environment. The development tools include an integrated incremental compiler and editor that allow procedures to be tested interactively and modified separately. The writing of small procedures and modular code is encouraged, so code can be tested easily and quickly during development; after being written once, procedures seldom need to be revisited. Many Forth programmers report that, across a wide range of applications, developing a program in Forth can take as little as one-tenth the time required in other languages.

Forth programs emphasize flexibility in problem solving. Because Forth is an extensible language, new data structures and control structures can be added to support a specific application. This flexibility allows problems that would otherwise require a large programming team to be handled by one or two programmers, reducing the overhead of project management and greatly increasing productivity. Forth makes little special provision for very large programs, however, and so has not gained acceptance for large-scale applications.

Chapter 4 16-Bit Stack Computer Architectures

In this chapter we discuss several representative 16-bit stack computers. The selected examples embody many of the design philosophies and trade-offs found in real implementations.

Section 4.1 first discusses the characteristics of 16-bit systems. One of the most important points is that a 16-bit system is compact enough that, for embedded applications, the entire system can be implemented on a single chip.

The remaining sections discuss four different 16-bit stack computers, presented in order of increasing integration, from a board-level discrete-component design to a highly integrated single-chip implementation.

Section 4.2 discusses the WISC CPU/16, a stack processor implemented with discrete components and a writable control store. The goal of the CPU/16 design is to be a simple and flexible platform for technology development.

Section 4.3 discusses the MISC M17 processor. The M17 aims at the "low end" of price-sensitive applications, so it places its stacks in program memory to save the hardware cost of separate stack memories.

Section 4.4 discusses the Novix NC4016, the first Forth chip to reach the market. The NC4016 offers moderate price and performance, using dedicated off-chip stack memories.

Section 4.5 discusses the Harris RTX 2000, a high-performance processor whose design is based on the Novix NC4016. The RTX 2000 uses a standard-cell design methodology and places its stack memories on-chip to increase speed. The standard-cell approach also made it possible to add a hardware multiplier and counter/timers to the processor.

The CPU/16, NC4016, and RTX 2000 are all ML0-class stack processors, while the M17 was designed as an MS0 stack processor to keep its cost low.

4.1 Characteristics of 16-bit designs

We discuss 16-bit stack computers and processors here because 16 bits is the smallest practical configuration for most commercial stack processor applications.

4.1.1 A good fit with the traditional Forth model

The primary motivation for 16-bit Forth computers is that the traditional Forth programming model is 16 bits wide. There are several reasons for this: the average Forth program needs no more than about 32K bytes of code, and the first Forth compilers were implemented on microprocessors with a 64K-byte addressing range.

4.1.2 The smallest width of interest

There are historical reasons why Forth was designed as a 16-bit language. For ordinary computation and for addressing data structures, 8 bits is clearly too small; some early computers tried 12-bit widths, but 16 bits is the smallest truly usable integer size. Forth has traditionally used a computation model no wider than 16 bits because it was developed before 32-bit processors became practical.

A 16-bit computer can address 64K bytes of memory, and its single-precision integers range from -32 768 to 32 767, which is sufficient for most calculations. Using double precision (32-bit integers), a 16-bit computer can represent integers from -2 147 483 648 to 2 147 483 647, which is sufficient for all but the most demanding applications.
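As a worked example of these ranges, a double-precision add (in the spirit of Forth's D+ word) can be synthesized from 16-bit operations with an explicit carry. This is a sketch of the arithmetic, not any particular machine's implementation:

```python
# Double precision on a 16-bit machine: a 32-bit value lives in two 16-bit
# cells, and addition propagates the carry from the low cell to the high one.

MASK16 = 0xFFFF

def dplus(lo1, hi1, lo2, hi2):
    """32-bit add built from 16-bit adds, as a word like D+ must do."""
    lo = lo1 + lo2
    carry = lo >> 16                     # 0 or 1
    hi = (hi1 + hi2 + carry) & MASK16
    return lo & MASK16, hi

# 0x0001FFFF + 0x00000001 = 0x00020000
lo, hi = dplus(0xFFFF, 0x0001, 0x0001, 0x0000)
assert (hi << 16) | lo == 0x00020000
```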

Of course, 16-bit operation can also be emulated on a machine with a 4-bit or 8-bit data path, but performance is usually unsatisfactory; an 8-bit computer, for example, handles 16-bit data at only about half speed. Because all the designs discussed in this chapter target high-speed processing applications, they all have 16-bit internal data paths.

4.1.3 Small size allows integrated embedded systems

The three Forth chips discussed in this chapter (the M17, NC4016, and RTX 2000) are positioned for the embedded-applications market. Embedded applications require a physically small processor and a small amount of program memory while meeting demanding requirements on power consumption, weight, size, and cost. A 16-bit processor is often a good compromise: it has higher performance than an 8-bit machine, which must spend extra instructions synthesizing 16-bit operations, while for many applications a 32-bit system is overkill.

4.2 WISC CPU/16 architecture

4.2.1 Introduction

The WISC Technologies CPU/16 is intended to be the simplest possible stack computer (measured by TTL component count) that still provides flexibility and speed. The CPU/16 is implemented entirely with discrete MSI components. It is a 16-bit computer that uses a RAM-based microcode memory (writable control store) to allow complete user programmability. The CPU/16 is designed as a plug-in board that can be inserted into an IBM PC compatible machine, where it operates as a coprocessor.

The WISC in the processor's name comes from "Writable Instruction Set Computer"; a more complete technical term would be "WISC/Stack", since the hardware stacks are a central part of the design.

The original purpose of the CPU/16 was to design and evaluate techniques in advance of the RTX 32P (described in Chapter 5). The resulting product is a processor of moderate speed with a very simple and clean design. The CPU/16 prototype fits on an IBM PC expansion card (13 inches by 4 inches) and carries 16K bytes of program memory. Its microcode memory hardware and simple microinstruction format also make it very suitable as a teaching tool for computer design courses.

4.2.2 Block diagram

Figure 4.1 is a block diagram of the CPU/16 architecture.

Figure 4.1 CPU/16 block diagram

The data stack and the return stack each consist of an 8-bit up/down counter (the stack pointer) providing the address for 256 words of 16-bit memory. The stack pointers can be read as well as written, providing the ability to access deep stack elements efficiently.

The ALU is built from 74LS181 chips, a standard multifunction ALU, together with the DHI register for holding intermediate results. For convenience of design and implementation, the DHI register serves as a buffer for the top data stack element. This means that the data stack pointer actually points to what the programmer sees as the second stack element. The result is that operations on the two top stack elements, such as addition, can be executed in a single cycle: the A side of the ALU reads an element from the hardware data stack, while the B side of the ALU reads the buffered top element from DHI (Data HIgh register).

Machine-language programmers see no condition codes. Carry propagation and multiple-precision addition are supported by microcoded instructions that push the carry onto the data stack as a logical value (0 for carry clear, -1 for carry set).

The DLO register holds intermediate results within a single instruction, acting as a temporary register. The DHI and DLO registers are shift registers that can be coupled into a 32-bit shift mode to support multiplication and division.
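To see why a coupled 32-bit shift supports multiplication, here is an illustrative shift-and-add multiply that keeps its growing product in a modeled DHI:DLO pair. This is our sketch of the general technique, not the CPU/16's actual microcode:

```python
# Illustrative 16x16 -> 32 bit shift-and-add multiply. The product is held
# in a DHI:DLO register pair that is shifted right one bit per step, exactly
# the kind of coupled 32-bit shift the text describes.

def umultiply(a, b):
    dhi, dlo = 0, b & 0xFFFF           # DLO starts holding the multiplier
    for _ in range(16):
        if dlo & 1:                    # low bit selects an add into DHI
            dhi += a
        # 32-bit right shift across the DHI:DLO pair
        dlo = ((dlo >> 1) | ((dhi & 1) << 15)) & 0xFFFF
        dhi >>= 1
    return (dhi << 16) | dlo           # 32-bit unsigned product

assert umultiply(1234, 5678) == 1234 * 5678
assert umultiply(0xFFFF, 0xFFFF) == 0xFFFF * 0xFFFF
```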

The program counter is connected directly to the memory address bus, which allows the next instruction to be fetched in parallel whenever the bus would otherwise be idle; instruction fetches can thus be interleaved with data operation instructions that access the ALU and data stack. To save the program counter during a subroutine call, a "program counter save register" captures the program counter before the subroutine address is loaded, and its contents are pushed onto the return stack during the call. On subroutine return, the saved value is incremented by the ALU as it is restored to the program counter, which saves the clock cycle that would otherwise be needed to increment it. Program memory is organized as 64K words of 16-bit width and can only be accessed on word boundaries, but a microcoded byte-swap operation supports single-byte processing.

The microprogram memory is a read/write memory of 2K 32-bit words, organized as 256 pages of 8 words each. The microprogram counter supplies the low-order address bits as the microprogram executes within an 8-word page. This strategy leaves room in each microinstruction for a 3-bit field selecting the next microinstruction within the page, together with one of eight conditional micro-branch codes. This permits conditional branches and loops during the execution of a single opcode.

Instruction decoding is very simple: the 8-bit opcode is loaded into the microprogram counter and used as the page address into microprogram memory. Because the microprogram counter is built from hardware counters, an operation can span more than one 8-microinstruction page if needed.
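This decode scheme can be modeled in a few lines. The names here are invented for illustration; the real machine does this in hardware, with the 3-bit next-address field handling sequencing within a page:

```python
# Hypothetical model of the CPU/16 decode scheme: the 8-bit opcode is the
# page address of an 8-microinstruction page, so "decoding" is just an
# index computation into the 2K-word microprogram memory.

WORDS_PER_PAGE = 8
microstore = [None] * 2048            # 2K microinstructions, 256 pages

def load_page(opcode, microinstructions):
    base = opcode * WORDS_PER_PAGE
    microstore[base:base + len(microinstructions)] = microinstructions

def fetch_microcode(opcode, upc):
    """upc is the 3-bit position within the opcode's page."""
    return microstore[opcode * WORDS_PER_PAGE + upc]

load_page(0x42, ["SOURCE=DS", "ALU=A+B", "DEST=DHI"])
assert fetch_microcode(0x42, 1) == "ALU=A+B"
```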

The microinstruction register holds the output of the microprogram memory, forming a one-stage pipeline. This pipeline allows the current microinstruction to execute in parallel with fetching the next microinstruction from microprogram memory, eliminating the microprogram memory access from the system's critical path. It also forces every instruction to take at least two cycles: if an instruction needs only one clock cycle of work, a NO-OP microinstruction must follow so that the next instruction can flow through the pipeline correctly.

The host interface module allows the CPU/16 to operate in two possible modes: master mode and slave mode. In slave mode, the CPU/16 is controlled by the personal computer host, which can load programs and microcode and set the various system registers or memory for initialization and debugging. In master mode, the CPU/16 runs its own programs freely while the PC monitors a status register to respond to service requests. While the CPU/16 works in master mode, the PC may sit in a dedicated service loop, may perform additional services such as reading the next disk block or displaying an image, or may only poll the status register periodically. Whenever necessary, the CPU/16 simply waits for the PC service to complete.

4.2.3 Instruction set summary

The CPU/16 has two instruction formats: one for invoking microcode, and one for subroutine calls.

Figure 4.2(a) CPU/16 instruction format - microcoded operations

Figure 4.2(b) CPU/16 instruction format - subroutine call

Figure 4.2(a) shows the instruction format for invoking microcoded instructions. Since the 256 pages of microcode memory can support 256 possible opcodes, each instruction needs only 8 bits to specify, and the high 8 bits of a microcode instruction are all set to 1. Consequently, the subroutine call format shown in Figure 4.2(b) can reach any address whose high 8 bits are not all 1s. This strategy avoids the 15-bit limit on subroutine addresses found in the other designs in this chapter. The price is that parameters cannot be included in the instruction word, so a branch target must be stored in the memory word following the instruction, which works against the goal of minimum-size instructions. The attraction of this trade-off is that instruction decoding requires almost no logic.
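The two formats can be summarized by a small decoding sketch. The exact bit assignment here is our reading of the description above, offered as an illustration rather than a definitive statement of the hardware:

```python
# Sketch of the two CPU/16 instruction formats: a 16-bit word whose high
# 8 bits are all 1s invokes the microcoded instruction selected by its low
# 8 bits; any other value is a subroutine call to that address.

def decode(word16):
    if word16 >> 8 == 0xFF:                  # high 8 bits all ones
        return ("microcode", word16 & 0xFF)  # 8-bit opcode = page number
    return ("call", word16)                  # any address below 0xFF00

assert decode(0xFF07) == ("microcode", 0x07)
assert decode(0x1234) == ("call", 0x1234)
```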

Because the CPU/16 uses RAM chips for its microcode memory, the microcode can be changed completely by the user as needed. The standard software environment of the CPU/16 is MVP-FORTH, a dialect of Forth-79 (Haydon 1983). The Forth instructions included in the standard microcoded instruction set are shown in Table 4.1. Other software environments can be used, but no high-level language other than Forth has been implemented.

!        DDUP
+        DNEGATE
-        DROP
0        DSWAP
0<       DUP
0=       I
0BRANCH  I'
1+       J
1-       LEAVE
2*       LIT
2/       NEGATE
<        NOT
=        OR
>        OVER
>R       PICK
?DUP     R>
@        R@
ABS      ROLL
AND      ROT
BRANCH   S->D
D!       SWAP
D+       U*
D@       U/MOD
DDROP    XOR

Table 4.1(a) CPU/16 instruction set summary - Forth primitives (see Appendix B for descriptions)

@ +
@ -
DROP ;
DROP DUP
I +
I @
OVER +
OVER -
R> DROP
R> SWAP >R
SWAP -
SWAP DROP
DUP @ SWAP 1+       (fetch with autoincremented address)
DUP ROT ROT ! 1-    (store with autodecremented address)
@ @                 (indirect fetch)
@ !                 (indirect store)
DUP @ @ 1 ROT +!    (fetch from software stack with autoincrement)
-1 OVER +! @ !      (store to software stack with autodecrement)

Table 4.1(b) CPU/16 instruction set summary - combined Forth primitives

Opcode     Data stack    Return stack

DOCOL      ->            -> addr
    Perform subroutine call.

SEMIS      ->            addr ->
    Perform subroutine exit.

HALT       ->            ->
    Return control to the host processor.

SYSCALL    n ->          ->
    Request I/O service from the host.

DOVAR      -> addr       ->
    Used to implement Forth variables.

DOCON      -> n          ->
    Used to implement Forth constants.

Table 4.1(c) CPU/16 instruction set summary - special words

The following Forth operations have microcode support for their inner loops or run-time behavior.

SP@       (fetch contents of data stack pointer)
SP!       (initialize data stack pointer)
RP@       (fetch contents of return stack pointer)
RP!       (initialize return stack pointer)
MATCH     (string compare primitive)
(error checking & reporting words)
+LOOP     (variable increment loop)
/LOOP     (variable unsigned increment loop)
CMOVE     (string move)
DO        (loop initialization)
ENCLOSE   (text parsing primitive)
LOOP      (increment-by-1 loop)
FILL      (block memory initialization word)
TOGGLE    (bit mask/set primitive)

Table 4.1(d) CPU/16 instruction set summary - high-level support words

Opcode     Data stack              Return stack

           exp1 ud2 -> exp2 ud4    ->
    Floating point normalize of unsigned 32-bit mantissa.

ADC        n1 n2 cin -> n3 cout    ->
    Add with carry. cin and cout are logical flags on the stack.

ASR        n1 -> n2                ->
    Arithmetic shift right.

BYTESWAP   n1 -> n2                ->
    Swap high and low bytes of n1.

D+!        d addr ->               ->
    Sum d into 32-bit number at addr.

D>R        d ->                    -> d
    Move d to return stack.

DLSLN      d1 n2 -> d3             ->
    Logical shift left of d1 by n2 bits.

DLSR       d1 -> d2                ->
    Logical shift right of d1 by 1 bit.

DLSRN      d1 n2 -> d3             ->
    Logical shift right of d1 by n2 bits.

DR>        -> d                    d ->
    Move d from return stack to data stack.

DROT       d1 d2 d3 -> d2 d3 d1    ->
    Perform double-precision ROT.

LSLN       n1 n2 -> n3             ->
    Logical shift left of n1 by n2 bits.

LSR        n1 -> n2                ->
    Logical shift right of n1 by 1 bit.

LSRN       n1 n2 -> n3             ->
    Logical shift right of n1 by n2 bits.

Q+         q1 q2 -> q3             ->
    64-bit addition.

QLSL       q1 -> q2                ->
    Logical shift left of q1 by 1 bit.

RLC        n1 cin -> n2 cout       ->
    Rotate left through carry n1 by 1 bit. cin is carry-in, cout is carry-out.

RRC        n1 cin -> n2 cout       ->
    Rotate right through carry n1 by 1 bit. cin is carry-in, cout is carry-out.

TDUP       d1 n2 -> d1 n2 d1 n2    ->
    Duplicate a temporary floating point number (32-bit mantissa, 16-bit exponent).

Note: the CPU/16 uses RAM microcode memory, so the user can add to or modify all instructions as needed. The list above describes only the instructions of the standard development package.

Table 4.1(e) CPU/16 instruction set summary - extended arithmetic and floating point support words

A point worth noting is the density of this instruction set. Table 4.1(a) shows that the instruction set includes a large collection of Forth primitive operations. Table 4.1(b) shows common Forth word combinations that can be implemented as single instructions. Table 4.1(c) shows words that support the underlying Forth execution model, such as subroutine calls and exits. Table 4.1(d) lists high-level Forth words implemented in microcode to increase execution speed. Table 4.1(e) lists words that support extended-precision integer operations and 32-bit floating point operations.

Instruction execution times vary over a wide range with the complexity of the instruction. Simple instructions that manipulate data on the stack, such as + and SWAP, take 2 or 3 microcycles; complex instructions require more clock cycles (Q+, the 64-bit add, takes 18 cycles) but are still much faster than the equivalent high-level code. If necessary, a single microcoded operation can run for thousands of clock cycles, for example to move a block of memory or perform some other repetitive operation.

As discussed earlier, each instruction invokes a sequence of microinstructions that reside in the microprogram page corresponding to the instruction's 8-bit opcode. Figure 4.3 shows the microinstruction format. The microcode uses horizontal encoding; that is to say, there is only one microinstruction format, divided into several independent fields that control different parts of the machine.

Figure 4.3 CPU/16 microinstruction format

Because of the simplicity of the stack architecture, only 32 bits are needed in each microinstruction. Compare this with the 48-bit or wider microinstructions of other horizontally microcoded machines, such as those built from AMD 2900 series components. This simplicity makes microcode programs not much harder to write than assembly language programs for a conventional computer. For example, where our canonical stack computer describes addition with the pseudocode:

TOSREG <= TOSREG + POP(DS)

the CPU/16 microcode operation can be written:

SOURCE = DS
ALU = A + B
DEST = DHI
INC [DP]

Here the micro-operation SOURCE = DS places the top element of the hardware data stack on the data bus; ALU = A + B directs the ALU to add its A input (from the data bus) to its B input (the top stack element buffered in DHI); and DEST = DHI returns the result to the DHI register. The INC [DP] micro-operation then increments the data stack pointer after the data stack has been read, which accomplishes the pop.
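To make the listing concrete, here is a toy interpreter for those four micro-operations acting on a register-level model of the data path. The model is ours; in the real machine the four fields execute in parallel within one microinstruction, and the pop is an increment because the stack grows downward in memory:

```python
# Illustrative register-level model of the CPU/16 addition microinstruction.

state = {
    "DHI": 7,                  # buffered top stack element
    "DS":  [100, 3],           # hardware data stack, last item on top
    "DP":  0,                  # data stack pointer (abstracted)
    "BUS": 0,
}

def step(micro):
    if micro.get("SOURCE") == "DS":
        state["BUS"] = state["DS"][-1]        # drive bus from stack top
    if micro.get("ALU") == "A+B":
        state["RESULT"] = state["BUS"] + state["DHI"]
    if micro.get("DEST") == "DHI":
        state["DHI"] = state["RESULT"]
    if micro.get("INC") == "DP":
        state["DS"].pop()                     # incrementing DP pops the stack
        state["DP"] += 1

step({"SOURCE": "DS", "ALU": "A+B", "DEST": "DHI", "INC": "DP"})
assert state["DHI"] == 10 and state["DS"] == [100]
```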

4.2.4 Architectural features

The CPU/16 is very similar to the canonical stack computer. This is of course largely because they come from the same designer, but beyond that, the canonical stack computer and the CPU/16 share a common pursuit: simplicity.

The main efficiency improvement of the CPU/16 over the canonical stack computer is the replacement of the memory address register with a program counter. The advantage is that the next instruction can be fetched over the data bus whenever it is free, so instruction fetches can alternate with stack operations. A disadvantage is that the @ and ! operations require a trick: the memory address must temporarily overwrite the program counter, which is afterwards restored from the program counter save register. Of course, the program counter and a memory address register (or the DHI register) could be multiplexed onto the RAM address bus, but this would increase complexity and component count.

The DLO register was added to the original design to support multiplication and division through efficient 32-bit shifts. It also turns out to be a significant improvement simply as an intermediate-result register, since it gives four places to hold intermediate results (DHI, DLO, the data stack, and temporaries on the return stack). For example, the DLO register serves as the intermediate holding location for the SWAP operation, which is conceptually much cleaner than using the return stack for the same purpose.

An important implementation feature of the CPU/16 is that all resources of the machine can be directly controlled by the host PC: the PC interface supports loading any specified microinstruction and single-stepping the clock. Using these features, one can place values into any or all registers, issue a clock, read the data back, and observe the results. This design technique makes microcode development extremely straightforward, avoids the need for expensive microcode development tools, and makes diagnostic programs very easy to write.

The CPU/16 has no interrupt support.

4.2.5 Implementation and intended applications

The CPU/16 uses conservative (one might even say rugged) 74LS00 series chips and relatively slow 150 ns RAMs for its stacks and program memory. Its design priorities were, in decreasing order: simplicity, minimum design and development tool cost, compactness, flexibility, and speed. The CPU/16 clock cycle is 280 ns, and an instruction takes 3 clock cycles on average.

Discrete components were used because they are inexpensive and, compared with single-chip gate arrays, require very little start-up tooling. A discrete-component design is also easier and cheaper to modify, which suits the philosophy of an exploratory project, even though the result can only be a processor slower than a single chip. Even so, the CPU/16 compares well with the slower speed grades of the Novix NC4016 (the leading stack computer of its era). To improve flexibility and limit the width of the microcode, the CPU/16 uses discrete ALU chips (74LS181) rather than bit-slice components. Its basic application area is as a coprocessor for IBM PC personal computers. Although the redefinable instruction set could adapt the CPU/16 to most languages, the primary application language is Forth.

Another application area of interest is as a teaching aid for computer architecture courses. Since the machine contains only about 100 simple TTL chips, students can readily understand the design. An additional benefit of the discrete-component technology is that all system signals can be observed with external probes, making the machine ideal for students to test and explore hardware, software, and microcode.

The information in this section comes from the CPU/16 Technical Reference Manual (Koopman 1986). Further information about the CPU/16 can be found in Haydon & Koopman (1986) and Koopman & Haydon (1986); Koopman (1987b) describes a conceptual WISC architecture very similar to the CPU/16.

4.3 MISC M17 architecture

4.3.1 Introduction

The MISC M17 microprocessor (MISC stands for Minimum Instruction Set Computer) was designed as a low-cost embedded processor. To reduce system cost, the M17 keeps both of its stacks in program memory, with only a few stack-top buffer registers on the chip. Its other compromises likewise keep both chip cost and total system cost low while maintaining acceptable system performance.

The MISC M17's goal is acceptable performance (modest compared with other stack computers, but extremely high compared with standard microcontrollers) across a large number of embedded control applications.

4.3.2 Block diagram

Figure 4.4 is a block diagram of the M17.

Figure 4.4 M17 block diagram

The data stack and return stack reside in program memory, with the top elements of each stack held in registers for speed. The X, Y, and Z registers hold the top three data stack elements, with X the topmost. These registers are interconnected by multiplexers so that data can move among them within a single cycle; at the same time, the Z register can read or write the portion of the stack resident in program memory. Thus a data stack pop (Forth's DROP operation) is accomplished by simultaneously copying Y into X, copying Z into Y, and reading the new Z from memory. Similarly, a data stack push (such as Forth's DUP operation) is accomplished by writing Z out to program memory, copying Y into Z, and copying X into Y while X keeps its original value.
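This register movement can be modeled directly. The following is an illustrative sketch only; the class and attribute names are ours:

```python
# Register-transfer model of the M17's stack-top registers: a push or pop
# moves X, Y, Z and the memory-resident part of the stack all in one step.

class M17Stack:
    def __init__(self):
        self.x = self.y = self.z = 0
        self.mem = []                 # stack portion resident in memory

    def push(self, value):
        self.mem.append(self.z)       # Z spills out to program memory
        self.z, self.y = self.y, self.x
        self.x = value

    def pop(self):                    # the Forth DROP pattern
        top = self.x
        self.x, self.y = self.y, self.z
        self.z = self.mem.pop() if self.mem else 0
        return top

s = M17Stack()
for v in (1, 2, 3, 4, 5):
    s.push(v)
assert [s.pop() for _ in range(5)] == [5, 4, 3, 2, 1]
```

Only elements deeper than the third ever touch program memory, which is how the M17 gets acceptable speed without dedicated stack RAM.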

The LASTX register is updated with the contents of the X register on every instruction cycle. It therefore preserves the top-of-stack value overwritten by the previous instruction, which is very useful in many instruction sequences.

The M17's ALU is designed to compute all possible ALU functions simultaneously; the instruction merely selects which function's output is written back to the X and/or Y registers. This technique lets ALU operation overlap with instruction decoding, because once an instruction is decoded its only remaining task is to select the correct output from the already-computed ALU functions.

The M17 has an 8-bit I/O bus that allows data transfer concurrently with ALU operation. As in all the 16-bit single-chip stack computers discussed here, this permits high-speed I/O without stealing cycles from the memory data bus. The return stack and data stack are kept in program memory, with the return stack top element buffered in the Index register. The Index register also serves as a decrement counter for program loops or as an instruction repeat counter.

The instruction pointer is a conventional program counter. It can be loaded from the instruction register to implement subroutine calls, from the data bus to implement branches, or from the Index register to implement subroutine returns. The Index register can in turn be loaded from the instruction pointer to save the return address during a subroutine call.

The return stack pointer is an increment/decrement counter holding the memory address of the return stack resident in program memory (actually the address of the second stack element from the programmer's point of view, since the top element is in the Index register). Similarly, the data stack pointer points to the data stack resident in program memory; it actually addresses the fourth stack element, because the top three are kept in registers X, Y, and Z. The data stack grows from high memory toward low memory, while the return stack grows from low memory toward high memory. With this arrangement, the free space between the two stack tops can be shared efficiently.

The M17 directly addresses 5 memory segments, each of up to 64K 16-bit words. Byte-swap and byte pack/unpack instructions provide access to 8-bit data. The M17 provides five pins indicating which memory space is active: data stack, return stack, code space, A buffer, and B buffer. Small systems can ignore these pins; in larger systems each pin can control its own memory chips, providing 5 separate 64K-word memory spaces. Combined with a memory control chip, a memory space of up to 16M words can be built.

Each M17 instruction requires two clock cycles: one to fetch the instruction from program memory, and one to perform the operation together with any stack read or write in program memory. With this two-cycle scheme the memory bus stays continuously busy, and the simplest system can operate with just two 8-bit memory chips.

The M17 has 6 instruction cache registers forming a history buffer that holds the sequence of instructions just executed. When a special instruction triggers loop mode, these saved instructions form a loop of 1 to 6 instructions that repeats until an exit condition becomes true. From the second iteration of the loop onward, each instruction executes in one cycle instead of the usual two, since no instruction fetch from memory is needed. To simplify interrupt and control logic, these loops must be suitably arranged and aligned with respect to 8-word boundaries. A loop can be interrupted, but if the interrupt service routine itself needs a loop sequence, it is responsible for saving and restoring a special flag.
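The cycle savings from this loop cache can be estimated with a back-of-envelope calculation. The sketch below is my own arithmetic based on the figures in the text (2 clocks per instruction normally, 1 clock for cached instructions from the second iteration on); the function name and parameters are assumptions for the example.

```python
# Estimate clock cycles for a loop held in the M17's 6-entry
# instruction cache versus running uncached from program memory.

def loop_cycles(n_instructions, iterations, cached=True):
    assert 1 <= n_instructions <= 6, "loop cache holds at most 6 instructions"
    first = 2 * n_instructions          # first pass still fetches each instruction
    if cached:
        rest = 1 * n_instructions * (iterations - 1)   # fetch-free iterations
    else:
        rest = 2 * n_instructions * (iterations - 1)
    return first + rest

# A 4-instruction loop executed 100 times:
print(loop_cycles(4, 100))                 # 2*4 + 1*4*99 = 404 cycles
print(loop_cycles(4, 100, cached=False))   # 2*4*100     = 800 cycles
```

For long-running inner loops the cache thus approaches a 2x speedup, which is why the feature pays off despite holding only six instructions.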

The last M17 feature is a variable-length clock cycle implemented with an asynchronous memory interface. Operating in asynchronous mode, the M17 issues a memory request signal on each memory cycle, and the responding memory device returns a READY signal when its data is available. This handshake requires no crystal oscillator, so the system runs asynchronously. One advantage of this strategy is that memory devices of different speeds can share the bus without wait-state devices and without wasting memory bandwidth; another is that cycles which do not access memory finish sooner, allowing internal operation cycles to run faster than memory access cycles. In cost-sensitive applications, the whole system can instead simply run from a normal clock oscillator.
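The benefit of the handshake can be seen with a small conceptual model. This is purely illustrative (my own sketch, not M17 bus timing): each asynchronous cycle stretches to exactly the addressed device's response time, whereas a fixed clock must accommodate the slowest device on every cycle.

```python
# Conceptual model of the asynchronous memory handshake: the cycle
# ends when the addressed device asserts READY, so its length equals
# that device's access time rather than a fixed worst-case period.

def memory_cycle(device_access_ns):
    # request asserted -> device works -> READY returned
    return device_access_ns

# Mixed-speed system: three fast SRAM accesses and one slow peripheral.
accesses = [30, 30, 120, 30]                    # ns per responding device
total = sum(memory_cycle(t) for t in accesses)
print(total)   # 210 ns, versus 4 * 120 = 480 ns with a fixed worst-case clock
```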

4.3.3 Instruction Set Summary

Figure 4.5 shows the instruction formats of the M17. Each instruction takes two clock cycles: one for instruction fetch, one for the operation and stack memory access. All the canonical stack computer primitives listed in Table 3.1 can be executed in a single instruction cycle (two clock cycles), although the details of some operations differ on the M17. For example, a memory store does not pop its data, nor does the M17 support stack-relative addressing, which would otherwise add two extra memory transfer operations.

Figure 4.5A M17 instruction format - subroutine call

Figure 4.5A shows the subroutine call instruction: a call simply stores the subroutine's address as the instruction (the address must, of course, be even). Bit 0 = 0 identifies a subroutine call, which forces subroutines to start at even addresses but allows calls to reach the entire 64K-word address space.

Figure 4.5B M17 instruction format - conditional instruction

The M17 has 3 conditional instructions: SET, RETURN, and JUMP. Figure 4.5B shows the basic conditional instruction format. Bits 6-15 select which conditions enter a logical-OR condition function. For example, if bits 15 and 13 are 1, a "less than or equal to 0" condition is selected. Bit 5, when 1, inverts the logical condition value: if bits 15, 13, and 5 are all 1, a "greater than 0" condition is selected. Bit 4 controls use of the Index register: for RETURN it lets the programmer control dropping the return stack top, while for SET and JUMP it selects testing Index for 0 and decrementing it. In this way many useful conditions on X, Y, Z, or Index can be established in a single instruction.

Importantly, the M17's conditional instructions do not change the data on the stack; they simply derive a condition code value from the data and act on it. For example, selecting the carry condition with bit 9 computes the carry out of X plus Y without actually changing the contents of the X and Y registers. Unless the SET instruction is used, the result of the condition computation is not retained.

Figure 4.5C M17 Instruction Format - Set User Signs

Figure 4.5C shows the format of the SET conditional instruction. This instruction sets a user flag, which can be viewed as a traditional flag register; bits 4-15 select the condition. The user flag can be tested by other instructions later in the program and used for branching. Bit 3 indicates whether the stack top element is popped after the computation (equivalent to a Forth DROP operation).

Figure 4.5D M17 instruction format - Conditions Returns

Figure 4.5D gives the format of the conditional subroutine return instruction RETURN. When bit 4 is 0, the instruction executes as a conditional return: if the computed condition is true, the return takes place and the return address is popped from the return stack (resident in the Index register). When bit 4 is 1, the return address stays on the return stack, and a branch to it is taken when the condition is false. This is a traditional way to implement a BEGIN ... UNTIL_FALSE control structure: the loop start address is kept in the Index register, and a data stack condition determines when the loop ends.

Figure 4.5E M17 instruction format - condition jump

A conditional jump is shown in Figure 4.5E. This instruction evaluates a specified condition and jumps when it is true; the destination address is placed in program memory immediately after the JUMP instruction. If the condition is false, the M17 skips over the jump target address and executes the following instruction (the second word after the JUMP). With bit 4 set to 1, the JUMP instruction uses the Index register to implement a count-down loop.

Figure 4.5F M17 instruction format - process

Figure 4.5F shows the PROCESS instruction format. This instruction has several independent control fields, reminiscent of the horizontally microcoded instruction format of the CPU/16. Bits 3-5 specify Z register control, bits 6-7 control the Y register, bit 13 the LASTX register, and bit 14 the X register. In addition, bits 8-12 select the ALU/shift function to execute. Finally, bit 15 can update the data stack pointer.

Figure 4.5G M17 instruction format - Access

Figure 4.5G gives the format of the ACCESS instruction. This format is very similar to the PROCESS instruction; the main difference is that bits 8-11 specify a source or source/destination for moving data around the processor. Bits 12 and 14 control updating of the source and destination registers, allowing exchanges between internal registers.

The M17 handles an interrupt as a hardware-forced subroutine call to memory address 0. An alternative address can be specified by the interrupting device. The M17 also has a context register that saves processor status when an interrupt is received.

4.3.4 Architecture Features

The biggest difference between the M17 and the canonical stack machine described in Chapter 3 is that the M17's stacks share the same bus as program memory accesses and can reside in the same memory chips. To obtain performance as high as possible, the M17 buffers the top three data stack elements and the return stack top element in internal registers.

Compared with the single internal bus of the canonical stack computer, the M17 provides a rich interconnection structure between registers. It can not only move data among the LASTX/X/Y/Z registers to perform push and pop operations, but can also perform quite complex stack manipulations in a single decode/execute cycle.

Since the stacks reside in program memory, a multiplexer selects which address is presented to program memory. An advantage of keeping the stacks in program memory is that very little information must be moved on a context switch. The on-chip stack elements need not be copied wholesale into main memory; it suffices to flush the stack top registers to memory and then point the stack pointers at different memory areas to activate the new task.

4.3.5 Implementation and targeted applications

The M17 is implemented in a 2.0 micron HCMOS process as a 6600-gate array packaged in a 68-pin PLCC. The goal is reasonable system performance while keeping development and production costs down; a single off-chip 16-bit-wide memory holds both programs and stacks.

The maximum clock rate of the M17 is approximately 15 MHz when using 30 ns SRAM, and 6 MHz when using 120 ns SRAM. Each instruction takes two clock cycles, but a sequence of up to six instructions held in the instruction cache can execute at a rate of one instruction per clock cycle.

Several characteristics of the MISC M17 help designers build small, high-performance products, with applications in places such as smoke detectors, mines, hazardous environments, and remote equipment installations. Keeping the stacks in program memory keeps system cost and complexity low, and the asynchronous memory bus protocol lets high-speed processing coexist with low-speed data sampling devices without complex interfaces.

The information in this section comes from the MISC M17 Technical Reference Manual (MISC 1988).

4.4 Architecture of NOVIX NC4016

4.4.1 Introduction

The Novix NC4016, originally called the NC4000, is a 16-bit stack-based microprocessor designed to execute Forth language programs. It was the first single-chip Forth computer, and many features of subsequent designs stem from this device. The NC4016 is intended for high-speed execution of Forth programs in real-time control and general programming tasks.

The NC4016 uses dedicated off-chip stack memories for the data stack and return stack. Since the NC4016 uses three separate groups of pins for the two stack buses and the RAM data bus, most of its instructions execute in a single cycle.

4.4.2 block diagram

Figure 4.6 shows the block diagram of NC4016

Figure 4.6 NC4016 block diagram

The ALU section includes buffers for the top two data stack elements (T holds the top element, N the second element), as well as a special MD register to support multiplication and division and an SR register for fast integer square root. The ALU can operate on the T register and one of the N, MD, or SR registers. The data stack is an off-chip memory of 256 elements; the on-chip data stack pointer supplies the external memory address. A separate 16-bit stack data bus lets the data stack be read and written in parallel with other operations. As mentioned earlier, the top two data stack elements are buffered in the ALU's T and N registers.

The return stack is a separate memory very similar to the data stack; the only difference is that a single on-chip Index register buffers the return stack top element. Because Forth keeps subroutine return addresses and loop counters on the return stack, the Index register can decrement by 1 to implement efficient count-down loops.

The stacks have no overflow or underflow protection. In a multitasking environment, an on-chip stack page register controlled through an I/O port can give each task its own independent 256-element region of stack memory. This guarantees that one task cannot alter another task's stack, and it minimizes context-switching overhead.
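The paging scheme can be sketched as a simple address calculation. This is a conceptual model only (my own, not Novix register-level detail): the page register supplies the high-order bits of the stack RAM address, so a context switch changes a page number instead of copying stack contents.

```python
# Conceptual model of per-task stack pages: physical stack RAM address
# = page number concatenated with the 8-bit stack pointer, so each
# task's 256-element stack occupies a disjoint window of stack memory.

PAGE_SIZE = 256

def stack_address(page_register, stack_pointer):
    return page_register * PAGE_SIZE + (stack_pointer & 0xFF)

# Task 0 and task 3 use the same stack pointer value but different
# pages, so their stacks never collide:
print(stack_address(0, 17))   # 17
print(stack_address(3, 17))   # 785
```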

The program counter points to the next location in external program memory and is changed automatically by jump, loop, and subroutine call instructions. Program memory is organized as 16-bit words; byte addressing is not directly supported.

The NC4016 also has two I/O buses connected to dedicated pins: the B port is a 16-bit I/O bus, and the X port is a 5-bit I/O bus. These I/O ports let applications access I/O devices directly without stealing cycles from the memory bus. Some I/O port bits can also supply high-order memory address bits to extend the program memory address space.

In each clock cycle the NC4016 can use four separate 16-bit buses for high-performance data transfers: program memory, data stack, return stack, and I/O bus.

4.4.3 Instruction Set Summary

The NC4016 pioneered the use of an unencoded instruction format for stack computers. In the NC4016, ALU instructions use separate bit fields that control different parts of the machine, much like horizontal microcode. The NC4016 and the many Forth processors derived from it are among the only 16-bit computers to use this technique. The unencoded instruction format keeps instruction decoding hardware simple. Figure 4.7 shows the NC4016 instruction formats.

Figure 4.7A NC4016 instruction format - subroutine call

Figure 4.7A shows the subroutine call instruction format. The high bit of the instruction is set to 0, and the remaining bits hold a 15-bit subroutine address, limiting programs to a 32K-word address space.

Figure 4.7B NC4016 Instruction Format - Conditional Branch

Figure 4.7B gives the conditional branch instruction format. Bits 12-13 select a branch taken when T is 0, an unconditional branch, or a decrement-and-branch-until-zero loop. Bits 0-11 specify the low 12 bits of the target address, restricting branches to the 4K-word page containing the instruction.

Figure 4.7C NC4016 instruction format - ALU operation

Figure 4.7C gives the format of the ALU instructions, which use several bit fields to control different on-chip resources. Bits 0-1 control the shifter at the ALU output, and bit 2 specifies a non-restoring division step. Bit 3 makes the T and N registers operate as a 32-bit shift register.

Bit 5 of the ALU instruction indicates a subroutine return, allowing the return to be combined with an arithmetic operation; in many cases this yields a "free" subroutine return.

Bit 6 specifies whether a stack push is performed; combined with bit 4, it controls the push and pop operations of the stack elements.

Bits 7-8 control ALU input selection and specify iteration for the multiply or square root functions. Bits 9-11 specify the ALU function to execute.

Figure 4.7D NC4016 instruction format - memory reference

Figure 4.7D shows the format of the memory reference instructions. These instructions require 2 cycles: one for instruction fetch, one for the actual read or write. The memory address always comes from the T register. Bit 12 selects memory read or write. Bits 0-4 specify a small constant to be added to or subtracted from the address to implement auto-increment or auto-decrement operations. Bits 5-11 of this instruction specify ALU and control functions similar to those of the ALU instruction format.

Figure 4.7E NC4016 instruction format - user space / register transmission / constant

Figure 4.7E shows the remaining instruction formats. These instructions can read or write the "user space" in the first 32 words of program memory faster than ordinary memory accesses, which must first place the address on the stack. They can also transfer values between on-chip registers, or push a 5-bit constant (in a single cycle) or a 16-bit constant (in two cycles) onto the stack. The ALU and control functions specified by these instructions are very similar to the ALU instruction format.

The NC4016 was designed to execute the Forth language. Thanks to the unencoded format of many instructions, a single instruction can perform the machine operations corresponding to a sequence of Forth operations. Table 4.2 gives the Forth primitives and combined instruction sequences supported by the NC4016.

4.4.4 Architecture Features

The NC4016 architecture is designed for single-cycle execution. All primitives except memory accesses and long literals complete in a single cycle. This requires more interconnection paths than the canonical stack computer, but provides better performance.

The NC4016 allows non-conflicting sequential operations to be combined into a single instruction. For example, a Forth program can fetch a value from memory and add it to the top stack element with the sequence @ + ; on the NC4016 these operations can be combined into a single instruction.
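The combining step can be sketched as a simple peephole pass. This is a toy illustration (my own, not the actual NC4016 toolchain); the set of combinable pairs shown is an assumption for the example, chosen only to include the @ + case from the text.

```python
# Toy peephole pass fusing adjacent non-conflicting Forth operations
# into one machine instruction, as an unencoded instruction format
# permits: each operation uses a different field/resource.

COMBINABLE = {("@", "+"), ("@", "SWAP"), ("OVER", "+")}  # illustrative set

def combine(ops):
    out = []
    i = 0
    while i < len(ops):
        pair = tuple(ops[i:i + 2])
        if pair in COMBINABLE:
            out.append("_".join(pair))   # one fused machine instruction
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out

print(combine(["DUP", "@", "+", "DROP"]))   # ['DUP', '@_+', 'DROP']
```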

The NC4016's subroutine return bit allows a subroutine return to be combined with other instructions. By merging the instruction at a subroutine's exit with the return, a "free" subroutine return is obtained in many cases. One optimization performed by the NC4016 compiler is tail-call elimination: replacing a subroutine call/return pair with an unconditional branch to the subroutine.
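Tail-call elimination can be illustrated with a small rewrite pass. This is a hypothetical sketch (not the actual Novix compiler): the tuple-based instruction encoding and function name are assumptions; only the transformation itself, CALL followed by RETURN becoming a BRANCH, comes from the text.

```python
# Sketch of tail-call elimination: a subroutine CALL immediately
# followed by RETURN is replaced by an unconditional BRANCH, so the
# callee's own return serves the caller and the return stack push
# is avoided entirely.

def eliminate_tail_calls(code):
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        nxt = code[i + 1] if i + 1 < len(code) else None
        if op[0] == "CALL" and nxt == ("RETURN",):
            out.append(("BRANCH", op[1]))   # branch instead of call/return
            i += 2                          # consume the RETURN as well
        else:
            out.append(op)
            i += 1
    return out

body = [("CALL", "square"), ("ALU", "+"), ("CALL", "square"), ("RETURN",)]
print(eliminate_tail_calls(body))
# only the trailing CALL/RETURN pair becomes a BRANCH
```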

Another innovative mechanism of the NC4016 is access to the first 32 locations of program memory as global "user variables". Keeping a task's key information there, such as pointers to additional stacks kept in main memory, helps solve several problems in implementing high-level languages. It can also give high-level language compilers originally developed for register machines good performance, by treating the 32 quick-access variables as a register set.

4.4.5 Implementation and Target Applications

The NC4016 is implemented in fewer than 4000 gates using 3.0 micron HCMOS gate array technology, is packaged in a 121-pin PGA, and operates at 8 MHz.

When the NC4016 was first implemented, gate array technology did not allow stack memory to be placed on-chip. A minimal NC4016 system therefore includes three 16-bit memories: one for programs and data, one for the data stack, and one for the return stack.

Since the NC4016 executes most instructions in a single cycle, a large amount of time elapses between the start of the clock cycle and the availability of the memory address used to fetch the next instruction; this time is approximately half the clock period. This means the program memory must be roughly twice as fast as the clock, with an access time of about half the clock cycle.
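The memory-speed requirement follows from a one-line timing calculation. The sketch below is my own arithmetic based on the text's "about half the clock cycle" figure; the function name and the 0.5 fraction parameter are assumptions for the example.

```python
# Why program memory must be roughly twice as fast as the clock:
# the next-instruction address becomes valid only partway through
# the cycle, so memory must respond within the remaining window.

def required_access_time_ns(clock_mhz, address_valid_fraction=0.5):
    period_ns = 1000.0 / clock_mhz
    return period_ns * (1 - address_valid_fraction)

# At the NC4016's 8 MHz clock (125 ns period), memory must respond
# within about 62.5 ns:
print(required_access_time_ns(8))   # 62.5
```

The same calculation applies to the RTX 2000 discussed later, whose 10 MHz clock leaves a window of about 50 ns.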

The NC4016 was originally designed as a proof of concept and prototype, so it contains quirks that must be overcome with software and external hardware. For example, the NC4016 was intended to handle interrupts, but an error in the gate array made the interrupt response incorrect. Novix therefore released an application note showing how to fix the problem with a 20-pin PAL device. Subsequent products eliminated these difficulties and added other capabilities.

The NC4016 targets the embedded control market, achieving very high performance in modestly sized systems. Applications suited to the NC4016 include laser printer control, graphics CRT display control, telecommunications control (T1 switches, fax controllers, etc.), local area network controllers, and optical character recognition systems. The information in this section comes from Golden et al. (1985), Miller (1987), Stephens & Watson (1985), and the Novix NC4016 microprocessor programmer's introduction (Novix 1985).

4.5 Harris RTX2000 architecture

4.5.1 Introduction

The Harris Semiconductor RTX 2000 is a 16-bit processor derived from the Novix NC4016. The RTX 2000 is highly integrated: besides the core processor, the chip contains stack memories, a hardware multiplier, and counter/timers.

4.5.2 block diagram

Figure 4.8 RTX2000 block diagram

Figure 4.8 shows the block diagram of the RTX 2000. The main differences between the RTX 2000 and the NC4016 are the on-chip resources beyond the CPU core: a 256-element return stack, a 256-element data stack, a 16 x 16-bit single-cycle hardware multiplier, 3 counter/timers, and a prioritized vectored interrupt controller. Apart from the on-chip stacks, the RTX 2000 accesses all external devices through the I/O bus (called the G bus on the NC4016 and the ASIC bus on the RTX 2000).

Refinements of the RTX 2000 over the original Novix design include byte-swap capability for handling 8-bit data, the ability to jump between adjacent memory pages in conditional branches, and stack overflow/underflow interrupts.

Another feature of the RTX 2000 is its on-chip memory page control logic. Dedicated page registers (for the code page, extended return stack address values, the user memory base address, and data pages) overcome the 32K-word program memory limitation. Because return stack elements are 21 bits wide, the full return address can be saved on the return stack, so any location in memory can be reached with a long call instruction sequence.

4.5.3 Instruction Set Summary

The RTX 2000 instruction set is very similar to the NC4016's, but differs enough to need a separate description. Figure 4.9 gives the RTX 2000 instruction formats.

Figure 4.9A RTX instruction format - subroutine call

Figure 4.9A shows the subroutine call instruction format. In this format the highest bit of the instruction is 0, and the remaining bits hold a 15-bit subroutine address, limiting programs to 32K words.

Figure 4.9B RTX Instruction Format - Conditional Branch

Figure 4.9B shows the conditional branch instruction format. Bits 11-12 select the branch condition: branch when T is 0 (with the data stack optionally popped at the same time, or unconditionally), unconditional branch, or decrement-and-branch-until-zero using the index register for loops. Bits 0-8 specify the low 9 bits of the target address, while bits 9-10 control an incrementer/decrementer for the branch address to implement branches within the same 512-word memory page, to an adjacent page, or to page zero.

Figure 4.9C RTX instruction format - ALU operation

Figure 4.9C shows the format of the ALU instructions. Bits 0-3 control shifting of the ALU output.

Bit 5 of the ALU instruction indicates a subroutine return, which allows the return to be mixed with an arithmetic operation, obtaining a "free" subroutine return in many cases.

Bits 8-11 select the ALU function, and bit 7 controls inversion of the output.

Figure 4.9D RTX instruction format - ALU operation (multi-step mode)

Figure 4.9D is the multi-step-mode ALU instruction format, which is very similar to the ALU instruction format: bits 0-3 select the shift control function, bit 5 controls the subroutine return, and bits 9-11 select the ALU operation.

In multi-step mode, bit 8 selects either the multiply/divide register or the square root register for special operations, while bits 6-7 select special multi-step control. One basic purpose of multi-step mode is repeating multiply and divide steps.

Figure 4.9E - RTX instruction format - memory reference

Figure 4.9E gives the format of the memory access instructions, which require two cycles: one for instruction fetch, one for the actual operand read or write. The memory address always comes from the T register. Bit 12 selects byte or word operation; since the RTX 2000 uses word memory addresses, this bit selects the low half-word, the high half-word, or the full word of the memory word.

Bits 6-7 indicate memory read, write, and other control information. Bits 0-4 specify a small constant to be added or subtracted to implement auto-increment and auto-decrement addressing. The ALU function indicated by bits 8-11 is the same as in the ALU instruction.

Figure 4.9F RTX instruction format - other instructions

Figure 4.9F shows the remaining instruction formats. These can read and write the 32-word user space without placing a memory address on the stack, as ordinary memory reads and writes must. They are also used to transfer values between on-chip registers (in a single cycle) or to load a 16-bit constant onto the top of the stack (using two clock cycles). The ALU function indicated by bits 8-11 is the same as in the ALU instruction format.

The RTX 2000 is specifically designed to execute the Forth language. Because many instructions use an unencoded format, the machine operations corresponding to a sequence of Forth operations can be encoded into a single instruction. Table 4.3 gives the Forth primitives and combined instruction sequences executed by the RTX 2000.

:  (subroutine call)     AND
;  (subroutine exit)     BRANCH
!                        DROP
+                        DUP
-                        I
0                        LIT
0<                       NOP
0BRANCH                  OR
1+                       OVER
1-                       R>
2*                       R@
>R                       SWAP
@                        XOR

Table 4.3(a) RTX 2000 Instruction Set Summary - Forth Primitives (see Appendix B)

inv shift                dup @ swap
lit inv                  dup nn G! inv
lit swap inv             dup U! inv
lit swap op              dup U@ op
nn inv (short literal)   nn G! inv
nn over op               nn G@ inv
nn swap op               nn G@ drop inv
op shift                 nn G@ over op
! inv                    nn G@ swap op
! nn                     over inv shift
@ inv                    over swap op shift
@ nn                     over swap ! inv
@ swap inv               over swap ! nn
@ swap op                over swap @ op
? dup ? branch           swap inv shift
ddup inv                 swap drop inv shift
ddup nn swap op          swap drop @ nn
ddup !                   swap drop dup inv shift
drop inv                 swap drop dup @ nn rot op
drop lit inv             swap drop dup @ swap
drop nn inv              swap over op shift
drop dup inv shift       swap over !
dup inv shift            U! inv
dup lit op               U@ op
dup @ nn rot op          U@ swap inv

Description: inv   - 1's complement or no-op
             lit   - long literal value
             nn    - short literal value
             op    - ALU operation
             shift - shift select or no-op

Table 4.3(b) RTX 2000 Instruction Set Summary - Combined Forth Primitives

Instruction        Data Stack      Return Stack

nn G@              -> N            ->
     Fetch the value from internal register or ASIC bus device
     nn (stored as a 5-bit literal in the instruction).

nn G!              N ->            ->
     Store N into the internal register or ASIC bus device nn
     (stored as a 5-bit literal in the instruction).

*                  N1 N2 -> D3    ->
     Single clock cycle hardware multiply.

*'                 D1 -> D2       ->
     Unsigned multiply step (takes two 16-bit numbers and
     produces a 32-bit product).

*-                 D1 -> D2       ->
     Signed multiply step (takes two 16-bit numbers and
     produces a 32-bit product).

*F                 D1 -> D2       ->
     Fractional multiply step (takes two 16-bit fractions and
     produces a 32-bit product).

*/'                D1 -> D2       ->
     Divide step (takes a 16-bit dividend and divisor and
     produces a 16-bit remainder and quotient).

*/''               D1 -> D2       ->
     Last divide step (performs the non-restoring division
     fixup step).

2/                 N1 -> N2       ->
     Arithmetic shift right (same as division by two for
     non-negative integers).

D2/                D1 -> D2       ->
     32-bit arithmetic shift right (same as division by two for
     non-negative integers).

S'                 D1 -> D2       ->
     Square root step.

NEXT               ->             N1 -> N2
     Count-down loop using top of return stack as a counter.

Table 4.3(c) RTX 2000 Instruction Set Summary - Special Purpose Words

4.5.4 Architecture Features

Like the NC4016, the internal structure of the RTX 2000 is optimized for single-cycle instruction execution; all primitive operations except memory reads and long literals complete in a single cycle.

The RTX2000 also allows some Forth instruction sequences to be compressed into a single instruction. A key capability it provides is the return bit present in some instruction formats, which allows an ALU operation and a subroutine return to be combined.

4.5.5 Implementation and Intended Applications

The RTX2000 is implemented in 2.0 micron CMOS standard cell technology and packaged in an 84-pin PGA. The RTX2000 runs at 10 MHz. A large advantage of standard cell technology is that RAM can be mixed with logic circuitry on the same chip, allowing the return stack and data stack to be placed on-chip.

Since the RTX2000 performs most instructions, including conditional branches and subroutine calls, in a single cycle, there is a long delay between the start of a clock cycle and the availability of the memory address used to fetch the next instruction. This delay is approximately half the clock cycle, which means that the program memory must be roughly twice as fast as the clock speed.

The RTX2000 was designed after the NC4016, so it improves upon the NC4016 design and omits the quirky hardware found on the NC4016.

The RTX2000 is aimed at the high-end 16-bit microcontroller market; because semi-custom technology is used, special versions of the processor can be created for specific applications. Some possible applications include laser printers, graphics CRT display control, telecommunications control, optical character recognition, signal processing, and military control applications.

This section is based on the RTX2000 Data Sheet (Harris 1988a) and the RTX2000 Instruction Set Handbook (Harris 1988b).

4.5.6 Standard Cell Design

The Harris RTX2000 uses a standard cell approach rather than a gate array, and benefits considerably from it. In a gate array, the designer customizes the interconnection of a regular pattern of logic gates already present on the silicon. With standard cells, the designer works from a library of logic functions that can be arranged on the silicon in any fashion, with no predetermined layout. Gate arrays also come with predetermined memory areas; the flexibility of standard cell design is something a gate array cannot match.

This shows the main difference between the NC4016 and the RTX2000: the RTX2000 exploits the flexibility of standard cells to put the stack RAMs onto the chip. Because of this flexibility, RTX2000 series processors with different capabilities and capacities can use the same core, which can be treated as one large standard cell during design. Beyond the standard versions of the RTX2000 series, users can benefit from application-specific hardware. Examples of such dedicated hardware include serial ports, FFT address generators, data compression circuits, or other hardware that would otherwise have to be placed off-chip. Using standard cell technology, users can also tailor a version of the chip to their own needs. As feature sizes shrink below 2.0 microns, this tailoring can include some amount of on-chip program RAM and ROM.

Chapter 5, 32-bit system architecture

32-bit stack computers were only beginning to appear as products in 1989, but they will soon play an important role in future stack computer systems. In 5.1 we discuss the advantages and the problems that come with 32-bit stack processors.

In 5.2 we discuss the Johns Hopkins University / Applied Physics Laboratory FRISC 3, also known as the Silicon Composers SC32. The FRISC3 is a hardwired stack processor that embodies the design spirit of the NC4016 and its successors, but is more flexible. It uses a very small on-chip stack buffer managed by automatic buffer control circuitry.

In 5.3 we discuss the design of the Harris RTX 32P. The RTX32P is a microcoded processor derived from the WISC CPU/16; it is a two-chip implementation of the WISC CPU/32 processor. The RTX 32P uses a RAM-based microcode memory to increase flexibility, and uses a large on-chip stack buffer. The RTX 32P is the prototype for a commercial 32-bit processor under development.

In 5.4 we discuss the Wright State University SF1 design. The SF1 is actually an ML1 stack computer that places stack frames in multiple hardware stacks to support C and other conventional languages. However, the SF1 has strong ML0 characteristics, so it can serve as an example of how an ML1 design may be approached.

Although these three processors use greatly different strategies, their common goal is high-speed execution of stack programs.

5.1 Why use 32-bit systems

The 16-bit processors described in Chapter 4 are sufficiently capable for a wide range of applications, especially in embedded control environments. However, some applications require the greater power of a 32-bit processor: extended 32-bit integer arithmetic, a large memory address space, or floating point calculations.

A technical difficulty in designing 32-bit stack processors is the management of the stacks. A brute-force method is to use off-chip stack memories as the NC4016 does, but for a 32-bit design this requires 64 additional pins for data and addresses, which makes it impractical in cost-sensitive applications. The FRISC3 solves this problem in a simple way: two automatically managed on-chip stack-top buffers spill and refill stack elements through the normal RAM data pins. The RTX32P simply allocates a large stack space on-chip, with stack elements moved to and from memory under program control.

Chapter 6 will further discuss details of these methods.

5.2 FRISC3 (SC32) Architecture

5.2.1 Introduction

The Johns Hopkins University / Applied Physics Laboratory (JHU/APL) FRISC 3 is a hardwired 32-bit processor optimized for executing the Forth language. The name "FRISC" stands for "Forth Reduced Instruction Set Computer"; the 3 distinguishes it from the first two generations of prototypes. The focus of the FRISC3 is single-cycle execution of Forth primitives in real-time control environments. JHU/APL developed the FRISC3 to meet the need for a fast Forth language processor, primarily for space control processing applications such as satellite and Space Shuttle experiments. The FRISC3 project can be traced back to the JHU/APL HUT project (see Appendix A), a bit-serial processor optimized for the Forth language.

After completing the HUT processor, the Johns Hopkins design team built a 4.0 micron silicon-on-sapphire 32-bit Forth processor (FRISC 1) and a 3.0 micron bulk CMOS version (FRISC 2), both full-custom designs. The latest version, the FRISC3, is a commercial processor and an outgrowth of this earlier work.

Silicon Composers purchased the rights to the FRISC 3 as a commercial product and renamed it the SC32. The descriptions in this chapter apply to both the FRISC3 and the SC32, but we will continue to use the name FRISC3 in the remainder of this book.

The initial application of the FRISC3 is embedded real-time control, especially spacecraft control (the focus of the JHU/APL team), but it is also suitable for other industrial and commercial applications.

5.2.2 block diagram

Figure 5.1 is a block diagram of the FRISC3 architecture.

Figure 5.1 FRISC 3 block diagram

The data stack and return stack are implemented with dedicated hardware. Each includes a stack pointer and special control logic that addresses a 16-element, 32-bit-wide stack memory arranged as a circular buffer. The top 4 elements of each stack can be read directly onto the B-bus. In addition, the top element of the data stack can be read onto the T-bus (top element bus), and the top element of the return stack onto the A-bus (return address bus). This allows two different stack elements to be read simultaneously, while one stack element can be written in the same clock cycle.

An innovation of the FRISC3 is the stack management logic attached to the stack pointers. This logic automatically swaps stack elements between the 16-word on-chip stack buffers and stack overflow areas in program memory, guaranteeing that the on-chip stack buffers never overflow or underflow. This feature saves the extra pins a dedicated stack memory bus would require by stealing program memory cycles, at a small cost in program execution speed.

The FRISC3 designers call this feature a stack cache, because it provides fast on-chip access to a small number of top-of-stack elements. Unlike a conventional data or instruction cache, however, it needs no associative lookup structure, because the cached elements reside in a contiguous region of memory.
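The spill-and-refill behavior just described can be modeled in a few lines. The following Python sketch is an illustrative model only, not the FRISC3 gate design; the buffer size and the low/high thresholds are assumptions chosen for the example.

```python
class StackCache:
    """Demand-fed stack cache: a small on-chip buffer that spills its
    oldest element to a memory overflow area when nearly full, and
    refills from memory when nearly low, so the program never sees an
    overflow or underflow."""

    def __init__(self, low=4, high=12):
        self.buf = []        # on-chip buffer (oldest element at index 0)
        self.memory = []     # overflow area in program memory
        self.low, self.high = low, high

    def push(self, x):
        if len(self.buf) >= self.high:
            self.memory.append(self.buf.pop(0))   # spill oldest element
        self.buf.append(x)

    def pop(self):
        if len(self.buf) <= self.low and self.memory:
            self.buf.insert(0, self.memory.pop()) # refill one element
        return self.buf.pop()

stack = StackCache()
for i in range(100):                 # push far more than the buffer holds
    stack.push(i)
popped = [stack.pop() for _ in range(100)]
assert popped == list(range(99, -1, -1))   # LIFO order preserved
```

Each spill or refill in the real machine steals a program memory cycle, which is the source of the small performance loss mentioned above.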

The arithmetic logic unit of the FRISC3 contains a standard ALU whose two inputs are latched from the B-bus and the T-bus, respectively. These two distinct ALU sources allow the data stack top element (via the T-bus) and any of the top four data stack elements (via the B-bus) to be operated upon by a single instruction, since the data stack is dual-ported. The B-bus can also supply any non-stack resource to the ALU's B input.

During the first half of the clock cycle, the ALU's B-bus and T-bus latches capture their data. This allows the B-bus to be used during the second half of the clock cycle to write a result from the ALU to other registers. A shift module on the ALU's B input can shift right by 1 bit for division, or pass data through unshifted. Likewise, a shift unit on the ALU output can shift left by 1 bit for multiplication while driving the B-bus. The latched output of the ALU supports pointer-plus-offset addressing for memory access. In the first clock cycle of a memory read or write, the ALU adds a literal offset supplied on the T-bus to the data stack word selected by the B-bus. In the second cycle, the B-bus transfers the selected data to or from its destination.

The flag register FL saves one of 16 selectable condition codes generated by the ALU, which can be used for branching and extended-precision arithmetic. The ZERO register provides a constant 0 to the B-bus.

Four user registers hold pointers into memory or other values. Two of these registers are used by the stack control logic, pointing to the memory-resident overflow areas of the data stack and the return stack.

The program counter supplies the memory address on the A-bus for instruction fetch. The PC can also be passed through the ALU to the return stack to implement subroutine calls. The return stack can drive the A-bus in place of the PC to implement subroutine returns. The instruction register can drive the A-bus for instruction fetches, subroutine calls, and branches.

5.2.3 Instruction Set Summary

Figure 5.2 shows the four instruction formats of the FRISC3: one for control flow, one for memory loads and stores, one for ALU operations, and one for shift operations. The FRISC3 uses an unencoded instruction format similar to those of the NC4016, RTX2000, and M17. All instructions are distinguished by their high 3 bits.

Figure 5.2A FRISC3 instruction format - control flow

Figure 5.2a shows the control flow instruction format. The three control flow instructions are subroutine call, unconditional branch, and conditional branch. A conditional branch is taken based on the FL register, whose value was set by the most recent instruction to load it. The address field contains a 29-bit absolute address. Unconditional branches can be compiled to eliminate tail recursion.

Figure 5.2B FRISC3 instruction format - Memory Access

Figure 5.2b shows the format of the memory access instructions. Bits 0-15 contain an unsigned offset that is added to the address supplied by the bus source operand. This is done by latching the bus source and the instruction's offset field at the ALU inputs, performing an addition, and gating the ALU output onto the A-bus for memory addressing.

Bits 16-19 specify increment/decrement control for the return stack and data stack pointers. Bits 20-23 specify the B-bus source. Here TOS denotes the top element of the data stack, SOS the second element of the data stack, 3OS the third element of the data stack, and TOR the top of the return stack. Bits 24-27 specify the B-bus destination in a similar manner.

Bit 28 specifies whether the next instruction is fetched from the address on top of the return stack or from the address in the program counter. Bit 28 enables a useful combination: the return stack top element supplies the address for the instruction fetch and the return stack is popped, causing a subroutine return to execute in parallel with other operations.

Bits 29-31 indicate the instruction type. For the memory access format, the 4 possible instructions are: load from memory, store to memory, load low address, and load high address. Memory read and write instructions use the bus source field to provide an address, while the bus destination field indicates the register destination or source for the data. Load and store instructions are the only 2-clock-cycle instructions, because they must access memory twice: once for the data movement and once to fetch the next instruction.
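The load high address / load low address pair named above can assemble a full 32-bit constant in two single-cycle instructions. The following Python sketch illustrates the arithmetic under the semantics described in the text (register plus 16-bit offset, with no memory access); the function names are invented for this example.

```python
ZERO = 0  # the FRISC3 ZERO register supplies a constant 0 base

def load_high_address(base, offset16):
    """Load address with the 16-bit offset shifted left 16 bits first."""
    return (base + ((offset16 & 0xFFFF) << 16)) & 0xFFFFFFFF

def load_low_address(base, offset16):
    """Load address: base register plus unsigned 16-bit offset."""
    return (base + (offset16 & 0xFFFF)) & 0xFFFFFFFF

# Assemble the 32-bit constant 0xDEADBEEF in two instructions:
tmp = load_high_address(ZERO, 0xDEAD)   # high half: 0xDEAD0000
val = load_low_address(tmp, 0xBEEF)     # plus low half
assert val == 0xDEADBEEF
```

Because the ZERO register can be chosen as the base, the same pair of instructions doubles as a load-immediate mechanism.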

The two load address instructions simply load the computed memory address into the destination register, performing no memory access at all. They can also be thought of as load-immediate instructions. The load high address instruction shifts the offset left by 16 bits before performing the addition. The load address instructions can also load a constant value, since the ZERO register can be selected as the address register. In this way a 32-bit constant can be assembled with a load high address instruction followed by a load low address instruction.

Figure 5.2C FRISC3 instruction format - ALU operation

Figure 5.2c gives the ALU instruction format. In this format, bits 0-6 specify the ALU operation to be executed. The ALU's A input is connected to the T-bus, and its B input to the B-bus. Bit 7 controls loading of the FL register with the condition code selected by bits 10-13. The selectable condition codes provide various combinations of the zero, negative, carry, and overflow bits, plus the constants 0 and 1. Bits 8-9 select the carry input for the ALU operation. Bit 14 selects whether the ALU result or the contents of the FL register is driven onto the B-bus. Bit 15 is 0, indicating that the instruction is an ALU operation.

Bits 16-28 specify the memory access information given in Figure 5.2b. Bits 29-31 indicate the ALU/shift instruction type.

Figure 5.2D FRISC3 instruction format - shift operation

Figure 5.2d gives the shift instruction format. Bit 0 of this instruction is unused. Bit 1 specifies whether the input to the FL register is the condition code selected by bits 10-13 or the bit shifted out by the shifter. Bits 2-3 select special steps used to perform multiplication and restoring division. Bit 4 selects whether the input bit for right shifts comes from the FL register or from the ALU condition code. Bits 5-6 specify left or right shift operations. Bit 7 specifies whether the FL register is loaded from the shifter output or from the condition code selected by bits 10-13 and bit 1. Bits 8-9 select the carry input for the ALU operation, and bit 14 decides whether the B-bus is driven by the ALU output or by the FL register. Bit 15 is 1, indicating that the instruction is a shift operation.

Bits 16-28 specify the memory access information given in Figure 5.2b, and bits 29-31 specify the ALU/shift operation type.

All instructions require only one clock cycle, with the sole exception of memory read and write instructions, which require two. Each clock cycle is divided into a source phase and a destination phase. In the source phase, the selected B-bus and T-bus sources are read into the ALU input latches. In the destination phase, the B-bus destination is written. Each instruction fetch is completed in parallel with the execution of the previous instruction.

Subroutine calls also complete in a single cycle, and subroutine returns take no additional time because they are combined with other instructions.

0         >R

0<        @

0=        AND

0>        BRANCH

0BRANCH   CALL

1         DROP

1+        DUP

1-        EXIT

2*        LITERAL

2+        NEGATE

2/        NOT

4         OR

          OVER

-1        R>

-         R@

D<        U<

=         U>

>         XOR

Table 5.1 (a) FRISC 3 Instruction Set Summary - Forth Primitives (see Appendix B for descriptions)

The FRISC 3 supports a great many combinations of Forth primitives, so we give only a few examples here.

LIT @ (address plus offset fetch)

LIT ! (address plus offset store)

@ (fetch a variable)

! (store a variable)

2 PICK (copy the third element on the stack)

3 PICK (copy the fourth element on the stack)

R> DROP R@ <

SWAP DROP OVER OVER

LIT DROP LIT

OVER DUP LIT

OVER - DROP DUP

DUP DROP OVER

DUP AND OVER @

DUP XOR 2 PICK @

DUP 1 3 PICK @

OVER OVER !

2 PICK 2 PICK !

3 PICK 3 PICK !

R@ >R

R> DUP >R

DUP DROP

DUP >R R> DROP DUP

The flexibility of the FRISC 3 also supports many operations not found in the Forth language, such as words that manipulate the return stack (for example, a SWAP on the return stack), since the FRISC3 can access the top 4 elements of both the data stack and the return stack.

Table 5.1 (b) FRISC 3 Instruction Set Summary - Combined Forth Primitives

5.2.4 Architecture Features

Like the other computer designs we have discussed, the FRISC3 has a separate memory address bus, which allows instruction fetch to overlap instruction execution. In addition, the FRISC3 does not provide a dedicated top-of-stack register for the data stack; instead, a dual-ported stack memory allows arbitrary access to the top four stack elements. This provides more general capability than a pure stack machine and can speed up the execution of certain code sequences.

Stack control logic prevents catastrophic stack overflow or underflow at run time by keeping both elements and free space available in each stack buffer throughout execution. Details of this demand-fed stack buffer management method are discussed in 6.4.2.2. Each stack is a circular buffer; whenever the number of on-chip elements crosses the buffer's low or high threshold, the stack controller exchanges data with memory one element at a time, since a stack pointer can only be incremented or decremented once per instruction. Each stack element exchanged costs two clock cycles. These additional clock costs are discussed in Chapter 6; the FRISC3 designers believe they typically amount to less than 2% of machine execution time.

5.2.5 Implementation and Intended Applications

The FRISC3 is implemented in 2.0 micron CMOS technology using a silicon compiler, uses 35,000 transistors, and is packaged in an 85-pin PGA. The FRISC3 operates at 10 MHz.

The FRISC3 is designed for real-time control applications, especially aboard spacecraft. Its focus is efficient execution of Forth language programs, although it should also execute C and other conventional high-level languages effectively. When implementing C, a user register can serve as the stack frame pointer, with addressing relative to the frame pointer performed using the memory access offset field.

The information in this section is based on Hayes & Lee (1988). Architectural information about previous FRISC versions can be found in Fraeman et al. (1986), Hayes (1986), and Hayes et al. (1987).

5.3 RTX 32P architecture

5.3.1 Introduction

The Harris Semiconductor RTX 32P is a 32-bit member of the RTX (Real Time Express) processor family, and is the prototype for Harris's commercial 32-bit machine.

The RTX32P is a CMOS implementation of the WISC Technologies CPU/32 (Koopman 1987c), which was originally built with discrete components. The CPU/32 was developed from the WISC CPU/16 described in Chapter 4. Because of this history, the RTX32P is a microcoded machine with on-chip microcode RAM.

The RTX32P is a 2-chip stack processor whose initial design goal was to serve as a system development platform with maximum flexibility. The large amount of high-speed RAM forced the designers to use two chips, but this is also consistent with the goal of producing a research and development platform. Real-time control is the initial application area of the RTX32P.

The original programming language of the RTX32P is Forth. However, successors to the RTX32P will provide good support for more traditional languages such as C, Ada, and Pascal, as well as for special-purpose languages such as LISP and functional programming languages.

An important design decision of the RTX32P is that, as processor speeds increase, the ALU can cycle twice for each external memory access. The RTX32P therefore executes two microinstructions for each main memory access, including instruction fetches. Each instruction is two or more clock cycles long, executing a different microinstruction on each clock cycle. The reasons for this strategy take some space to explain; we will discuss them in 9.4.

5.3.2 block diagram

Figure 5.3 shows the architecture of the RTX32P.

Figure 5.3 RTX 32P block diagram

The data stack and return stack are implemented as dedicated hardware stacks, each including a 9-bit up/down counter (the stack pointer) that addresses 512 elements of 32-bit width. The stack pointers can be read and written by the system to give efficient access to deep stack elements.

The ALU section includes a standard multifunction ALU and a DHI register for holding intermediate results. By convention, the DHI register acts as a buffer for the top stack element, which means the stack pointer actually points to the second element of the program's stack. The result is that operations on the top two stack elements, such as addition, can be completed in a single cycle: the ALU's B input reads from the data stack while its A input reads the top element from the DHI register. The ALU's B-side input latch is normally transparent, so results can be returned within a single cycle, improving the speed of operations between the DHI register and the data stack.
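The effect of buffering the top stack element in a register can be sketched in a few lines. This Python model is illustrative only (the class and method names are invented, and cycle counting is omitted); it shows why a two-operand operation such as Forth + needs only one stack RAM read when the top element already sits in DHI.

```python
class TosBufferedStack:
    """Stack with the top element held in a register (DHI), as on the
    RTX 32P: the stack RAM holds the second element downward."""

    def __init__(self):
        self.dhi = None      # top-of-stack buffer register
        self.ram = []        # hardware stack RAM; pointer = len(self.ram)

    def push(self, x):
        if self.dhi is not None:
            self.ram.append(self.dhi)   # old top moves into stack RAM
        self.dhi = x

    def add(self):
        # Forth + : one stack RAM read (B input) plus DHI (A input),
        # result lands back in DHI, all in a single ALU cycle.
        self.dhi = (self.ram.pop() + self.dhi) & 0xFFFFFFFF

s = TosBufferedStack()
s.push(7)
s.push(35)
s.add()
assert s.dhi == 42
```

Without the DHI buffer, the same addition would require two reads from the stack RAM before the ALU could begin.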

There are no condition codes visible to machine language programs. Add-with-carry and similar operations are supported by microcode, which places the carry flag on the stack as a logical value (0 indicates carry clear, -1 indicates carry set).
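The convention of passing the carry as a logical stack value matches the ADC entry in Table 5.2(e) (n1 n2 cin -> n3 cout). This Python sketch shows the idea for a 32-bit word; the word width and wrap-around behavior are stated in the text, while the helper itself is an illustration rather than the actual microcode.

```python
def adc(n1, n2, cin):
    """Add with carry, with carry-in/carry-out passed as logical flags
    on the stack: 0 = carry clear, -1 = carry set."""
    total = (n1 & 0xFFFFFFFF) + (n2 & 0xFFFFFFFF) + (1 if cin == -1 else 0)
    cout = -1 if total > 0xFFFFFFFF else 0
    return total & 0xFFFFFFFF, cout

n3, cout = adc(0xFFFFFFFF, 1, 0)
assert (n3, cout) == (0, -1)       # wrapped around, carry set
n3, cout = adc(2, 2, -1)
assert (n3, cout) == (5, 0)        # carry-in contributes one
```

Chaining such calls, feeding each cout into the next cin, yields multi-precision addition without any hidden condition-code state.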

The DLO register is a temporary holding register for intermediate results within an instruction. The DHI and DLO registers are both shift registers that can be concatenated into a 64-bit shift path to support multiplication and division.

A host interface connects the processor to a personal computer. Because all on-chip memory is RAM-based, an external host is required to initialize the CPU.

The RTX32P has no program counter; each instruction contains either the address of the next instruction or a return-from-subroutine indication. This design decision reflects the observation that Forth programs contain a very high proportion of subroutine call operations. Section 6.3.3 discusses the implications of the RTX32P instruction format in more detail.

Replacing the program counter is the Next Address Register (NAR), shown within the memory address logic of the block diagram, which holds the pointer used to fetch the next instruction. The memory address logic uses the return stack top element to address memory for subroutine returns, and also uses a RAM address register (ADDR REG) for efficient memory reads and writes. The memory address logic further includes an increment-by-4 circuit that generates return addresses for subroutine call operations. Because the return stack and memory address logic are isolated from the system data bus, subroutine calls, subroutine returns, and unconditional jumps can be performed in parallel with other operations. The result is that these control transfer operations require no extra clock cycles in many cases.

Program memory is organized as a 4G-byte addressable space, with instructions and 32-bit data items aligned on word boundaries, since memory is accessed as 32-bit words. However, because of package pin limitations, the actual RTX32P can address only 8M bytes.

The microprogram memory is a single on-chip 2K-word by 30-bit read/write memory, organized as 256 pages of 8 words each. Each opcode in the machine is assigned its own 8-word page. The microprogram counter provides a 9-bit page address, of which only 8 bits are used in this implementation. This scheme lets each microinstruction supply a 3-bit address selecting the next microinstruction within the page, with the lowest of those bits selectable by a conditional micro-branch, allowing branching and looping within the execution of a single opcode.

Instruction decoding is very simple: the 9-bit opcode is placed into the microprogram counter, where it serves as the page address into the microprogram memory. Since the microprogram counter is implemented as a counter circuit, execution can cross 8-microinstruction page boundaries when needed.

The microinstruction register (MIR) holds the output of the microprogram memory, making it possible to fetch the next microinstruction while executing the current one. The MIR completely removes microprogram memory access delay from the system's critical path. It is also one reason for the two-clock-cycle minimum per instruction: if an instruction could otherwise complete in a single clock cycle, a no-op microinstruction must follow it to ensure the next instruction is fetched correctly through the MIR.

The host interface allows the RTX 32P to operate in two possible modes: master mode and slave mode. In slave mode, the RTX 32P is controlled by the PC, which can step the microprogram and arbitrarily change register and system state for initialization and debugging. In master mode, the RTX 32P runs programs freely while the host monitors a status register for service requests. While the RTX 32P is in master mode, the host can sit in a service loop, or perform other tasks such as reading the next block from a disk input stream, displaying an image, or simply polling the status register. When needed, the RTX 32P waits for service from the host.

5.3.3 Instruction Set Summary

Figure 5.4 RTX32P instruction format

The RTX32P has only one instruction format, shown in Figure 5.4. Each instruction contains a 9-bit opcode used to address a microcode page, and a 2-bit program flow control field indicating unconditional branch, subroutine call, or subroutine return. For subroutine calls and unconditional branches, bits 2-22 give a word-aligned target address. This confines programs to 8M bytes, but long jumps and subroutine calls can be implemented using a page register in the memory address logic. For data access, memory is treated as a flat 4G-byte address space.
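The single instruction format described above can be illustrated by packing and unpacking a 32-bit word in Python. The field boundaries follow the text (9-bit opcode, 2-bit flow field, word-aligned address in bits 2-22); the exact bit positions of the opcode and flow fields, and the flow-field encodings, are assumptions for this sketch.

```python
CALL, JUMP, RETURN, NONE = 0, 1, 2, 3   # hypothetical flow-field encodings

def pack(opcode, flow, target=0):
    """Pack opcode (bits 23-31 assumed), word-aligned target address
    (bits 2-22), and flow control field (bits 0-1 assumed)."""
    assert opcode < 2**9 and flow < 4
    assert target % 4 == 0 and target < 2**23   # word aligned, 8M bytes
    return (opcode << 23) | (target & 0x7FFFFC) | flow

def unpack(insn):
    return insn >> 23, insn & 3, insn & 0x7FFFFC

insn = pack(0x1A5, CALL, 0x1234)
assert unpack(insn) == (0x1A5, CALL, 0x1234)
```

Because the address and opcode coexist in every instruction, a call or jump rides along with an opcode at no extra cost, which is exploited by the compiler as described next.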

Whenever possible, the RTX32P compiler compresses an opcode together with a subroutine call, subroutine return, or jump into a single instruction. When compression is not possible, it pairs a NOP with the call, jump, or return, or pairs the opcode with a jump to the next in-line instruction. Tail-end elimination converts a subroutine call followed by a return into a jump to the subroutine entry, saving the overhead of a return after the call completes.
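The tail-end elimination just described is a classic peephole rewrite. The following Python sketch is an illustration of the idea, not Harris's actual compiler; the tuple-based instruction representation is invented for the example.

```python
def eliminate_tail_calls(code):
    """Rewrite CALL followed immediately by RETURN as JUMP: the callee's
    own RETURN then returns directly to the original caller."""
    out, i = [], 0
    while i < len(code):
        op = code[i]
        if op[0] == "CALL" and i + 1 < len(code) and code[i + 1][0] == "RETURN":
            out.append(("JUMP", op[1]))   # tail call becomes a jump
            i += 2
        else:
            out.append(op)
            i += 1
    return out

prog = [("LIT", 5), ("CALL", "square"), ("RETURN",)]
assert eliminate_tail_calls(prog) == [("LIT", 5), ("JUMP", "square")]
```

The rewrite is safe because nothing remains to execute after the call: the return address that would have been pushed is never needed.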

Because the RTX32P uses RAM for its microcode memory, the microcode can be changed to suit the user's requirements. The standard software environment of the CPU/32 is MVP-FORTH, a dialect of Forth-79 (Haydon 1983). Some Forth instructions included in the microcoded instruction set are shown in Table 5.2. Note the number and complexity of the instructions in this set.

          DDROP

          DDUP

!         DNEGATE

-         DROP

0         DSWAP

0<

0=        I

0BRANCH   I'

1         J

1-        LEAVE

2*        LIT

2/        NEGATE

PICK      NOT

ROLL      OR

=         OVER

>R        R>

?DUP      R@

@         ROT

ABS       S->D

AND       SWAP

BRANCH    U*

D!        U/MOD

D+        XOR

D@

Table 5.2 (a) RTX 32P Instruction Set Summary - Forth Primitives (see Appendix B)

@ (fetch a variable)

@ (fetch and add a variable)

! (store a variable)

@

DUP @

LIT

OVER

OVER -

R> DROP

R> SWAP >R

SWAP !

SWAP -

SWAP DROP

Table 5.2 (b) RTX 32P Instruction Set Summary - Combined Forth Primitives

The RTX 32P instruction set can be extended according to the user's requirements, to support the stack operations needed by special applications.

HALT   ->   ->

Returns control to the host processor.

SYSCALL   n ->   ->

Requests I/O service number n from the host.

DOVAR   -> addr   ->

Used to implement Forth variables.

DOCON   -> n   ->

Used to implement Forth constants.

Table 5.2 (c) RTX 32P Instruction Set Summary - Special Words

The following Forth operations have most of their work done by microcode.

SP@ (fetch contents of data stack pointer)

SP! (initialize data stack pointer)

RP@ (fetch contents of return stack pointer)

RP! (initialize return stack pointer)

MATCH (string compare primitive)

(error checking & reporting word)

+LOOP (variable increment loop)

/LOOP (variable unsigned increment loop)

CMOVE (string move)

DO (loop initialization)

ENCLOSE (text parsing primitive)

LOOP (increment by 1 loop)

FILL (block memory initialization word)

TOGGLE (bit mask/set primitive)

Table 5.2 (d) RTX 32P Instruction Set Summary - High Level Forth Words

Opcode   Data Stack   Return Stack

exp1 u2 -> exp3 u4   ->

Floating point normalize of unsigned 32-bit mantissa.

ADC   n1 n2 cin -> n3 cout   ->

Add with carry. cin and cout are logical flags on the
stack.

ASR   n1 -> n2   ->

Arithmetic shift right.

BYTE-ROLL   n1 -> n2   ->

Rotate right by 8 bits.

D+!   d addr ->   ->

Sum d into 32-bit number at addr.

D>R   d ->   -> d

Move d to return stack.

DLSLN   d1 n2 -> d3   ->

Logical shift left of d1 by n2 bits.

DLSR   d1 -> d2   ->

Logical shift right of d1 by 1 bit.

DLSRN   d1 n2 -> d3   ->

Logical shift right of d1 by n2 bits.

DR>   -> d   d ->

Move d from return stack to data stack.

DROT   d1 d2 d3 -> d2 d3 d1   ->

Perform double-precision ROT.

LSLN   n1 n2 -> n3   ->

Logical shift left of n1 by n2 bits.

LSR   n1 -> n2   ->

Logical shift right of n1 by 1 bit.

LSRN   n1 n2 -> n3   ->

Logical shift right of n1 by n2 bits.

Q+   q1 q2 -> q3   ->

128-bit addition.

QLSL   q1 -> q2   ->

Logical shift left of q1 by 1 bit.

RLC   n1 cin -> n2 cout   ->

Rotate left through carry n1 by 1 bit. cin is carry-in,
cout is carry-out.

RRC   n1 cin -> n2 cout   ->

Rotate right through carry n1 by 1 bit. cin is carry-in,
cout is carry-out.

Table 5.2 (e) RTX 32P Instruction Set Summary - Extended Arithmetic and Floating Point Support Words

Note: The RTX 32P uses a RAM microcode memory, so the user can add or modify any desired instructions. The above list includes only the instructions provided with the standard development package.

Table 5.2(b) gives some common Forth word combinations that execute as single instructions. Table 5.2(c) gives words that support low-level Forth operations. Table 5.2(d) lists some high-level Forth words supported by special microcode, and Table 5.2(e) gives the words added to the microcode to support extended precision integer operations and 32-bit floating point calculations.

Execution time varies greatly among instructions. Instructions that simply manipulate stack data, such as + and SWAP, require 2 clock cycles (one memory cycle). Complex instructions such as Q+ (128-bit add) require 10 or more microinstructions, but are still much faster than the equivalent high-level code. If necessary, a microcode loop can run for thousands of cycles, for example to complete a memory block move.

Figure 5.5 RTX32P micro-instruction format

As mentioned above, each instruction invokes a microcode sequence in the microprogram memory page corresponding to the instruction's 9-bit opcode. Figure 5.5 gives the microinstruction format. The microcode is horizontally encoded, meaning there is only one microinstruction format, divided into fields that control different parts of the machine. As with the WISC CPU/16, the simplicity of the stack computer keeps the RTX32P's microcode format simple. Each microinstruction uses only 30 bits, and the RTX32P's microcode format is very similar to that of the CPU/16 already discussed.

Microcode bits 0-3 specify the source of the system data bus. Two of the bus sources provide special control signals that configure the RTX 32P to perform one-bit-per-clock multiplication steps and 32/64-bit non-restoring division.

Two of the data bus destinations are special cases: DLO can be specified separately as a bus destination, while DHI is always loaded from the ALU output. Bits 8-9 and 10-11 specify the data stack pointer and return stack pointer controls. Bits 12-13 control the shifter on the ALU output; this shifter can shift left and right, and has an 8-bit rotate function.

Bits 14-15 of the microinstruction are unused and are therefore not included in the microcode RAM. Bits 16-20 control the function of the ALU. Bit 21 specifies a carry-in of 0 or 1; to perform multi-precision calculations, the microcode takes a conditional branch based on the carry out of the low word and forces the next carry-in to 0 or 1 accordingly. Bits 22-23 control DLO loading and shifting.

Bits 24-29 are used to compute the address of the next microinstruction. Bits 24-26 select one of 8 conditions to form the low address bits, while bits 27-28 supply high address bits as constants. Bit 29 increments the 9-bit microprogram counter, allowing more than 8 microcode memory locations to be used per instruction. Bit 30 initiates the decoding sequence for the next instruction, since instruction decoding takes one clock cycle. Bit 31 controls the address incrementer used for memory block data accesses.
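As a concrete illustration, the bit assignments described above can be unpacked mechanically. The following sketch is illustrative only; the field names and the helper are invented, not taken from Harris documentation, and only the fields explicitly named in the text are decoded.

```python
# A sketch of unpacking the RTX 32P horizontal microinstruction fields
# described above.  Bit positions follow the text; the field names and the
# helper itself are illustrative, not taken from Harris documentation.

def bits(word, lo, hi):
    """Extract bits lo..hi inclusive (bit 0 = least significant)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_microinstruction(word):
    return {
        "bus_source":    bits(word, 0, 3),    # system data bus source
        "ds_ptr_ctl":    bits(word, 8, 9),    # data stack pointer control
        "rs_ptr_ctl":    bits(word, 10, 11),  # return stack pointer control
        "shifter":       bits(word, 12, 13),  # ALU output shifter control
        "alu_function":  bits(word, 16, 20),  # ALU function select
        "carry_in":      bits(word, 21, 21),  # forced carry-in of 0 or 1
        "dlo_control":   bits(word, 22, 23),  # DLO load and shift control
        "cond_select":   bits(word, 24, 26),  # 1-of-8 condition select
        "addr_high":     bits(word, 27, 28),  # constant high address bits
        "upc_increment": bits(word, 29, 29),  # bump 9-bit microprogram ctr
        "decode_next":   bits(word, 30, 30),  # start next instruction decode
        "mem_addr_inc":  bits(word, 31, 31),  # memory block address increment
    }
```

Since bits 14-15 are unused and omitted from the microcode RAM, only 30 of the 32 bit positions carry information, matching the 30-bit width stated above.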

One microinstruction executes in each microcode clock cycle, and each machine instruction executes in two or more microcycles.

5.3.4 Architecture characteristics

In the RTX 32P architecture one can clearly see its heritage from the WISC CPU/16. The most obvious improvements are the addition of an effective memory addressing structure and the separation of the return stack from the data stack for subroutine calls and returns. These changes, plus the RTX 32P's single instruction format, allow subroutine calls, returns, and jumps to be combined with opcodes at no run-time cost.

The RTX 32P's clock runs at twice the rate of main memory access, so each memory cycle contains two clock cycles, and each instruction takes a minimum of two clock cycles.

The RTX 32P instruction format can be exploited in a variety of ways, many of which are not obvious, such as performing conditional branches. The RTX 32P has no direct hardware support for conditional branches, since that would either greatly reduce the execution speed of other instructions or require higher program memory access speed. Instead, a conditional branch is accomplished by combining a 0BRANCH opcode with a subroutine call to the branch target. The subroutine call is processed by the hardware in parallel with the test of whether the top stack element is 0 (if 0, the branch is taken). If the branch is taken, the microcode pops the return address, turning the subroutine call into a jump. If the branch is not taken, the microcode pops the return stack and uses that value to fetch the instruction following the branch, undoing the effect of the just-performed subroutine call. Conditional branches take three clock cycles when taken and four clock cycles when not taken. Remember that each memory cycle of this processor is two clock cycles.
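The control flow of this scheme can be sketched as follows. This is an illustration of the mechanism, not the actual microcode; the function and stack representations are invented.

```python
# A sketch (not the actual microcode) of the conditional-branch scheme just
# described: a 0BRANCH opcode rides on an ordinary subroutine call to the
# branch target.  The call happens unconditionally; microcode then either
# keeps it (branch taken, the call becomes a jump) or pops the return stack
# and resumes at the saved address (branch not taken).

def zero_branch(data_stack, return_stack, fall_through_pc, branch_target):
    """Return the next program counter value."""
    return_stack.append(fall_through_pc)  # effect of the subroutine call
    if data_stack.pop() == 0:             # top of stack tested in parallel
        return_stack.pop()                # discard return address:
        return branch_target              #   the call becomes a jump
    return return_stack.pop()             # undo the call: fall through
```

For example, with 0 on the data stack the next PC is the branch target; with any nonzero value the saved fall-through address is used, and in both cases the return stack is left unchanged.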

Another interesting capability of the RTX 32P is fast access to any memory location as a variable. Although the 0-operand instruction format would seem to require another memory location to hold the variable's address, one can instead compile a special opcode combined with a subroutine call, where the "subroutine" address is actually the address of the variable to be read. The microcode then "steals" the variable's value as it is fetched by the instruction-read logic, and forces a subroutine return before the value can be executed as an instruction. The two techniques discussed here mainly illustrate that the hardware has some important capabilities that are not at all obvious to programmers of conventional computers. These capabilities are particularly useful for writing programs that traverse data structures (such as expert-system decision trees); in fact, they allow direct execution of data structures. This is done by putting the data in a tagged format: a 9-bit tag (corresponding to special user-defined microcode) combined with a 23-bit address, which acts as a subroutine call or jump to the next element, or as a return in the case of a NIL pointer.
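The idea of "directly executing a data structure" can be made concrete with a small sketch. Everything here, the memory layout, the tag names, and the dispatch function, is invented for illustration; only the tag-plus-link principle comes from the text.

```python
# A sketch of "directly executing a data structure" as described above.
# Each cell pairs a tag (standing in for the 9-bit opcode of user-defined
# microcode) with the address of the next cell, so traversing the structure
# is just instruction fetch; a NIL link acts as a subroutine return.  The
# memory layout and tag actions are invented for illustration.

NIL = None

def execute_structure(memory, addr, actions):
    while addr is not NIL:
        tag, payload, next_addr = memory[addr]
        actions[tag](payload)      # tag dispatches user-defined microcode
        addr = next_addr           # link acts as a call/jump to next cell

visited = []
memory = {
    0: ("visit", "root", 4),
    4: ("visit", "leaf", NIL),     # NIL link: "subroutine return"
}
execute_structure(memory, 0, {"visit": visited.append})
print(visited)                     # -> ['root', 'leaf']
```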

An important implementation feature of the RTX 32P is that all machine resources can be controlled by a host computer, because the host interface supports loading and single-step clocking of the microinstruction register. With these capabilities, a development system can load any or all system registers, load a microinstruction, issue a single clock cycle, and then read the values back to examine the results. This technique makes microcode writing very straightforward and saves the expense of external development hardware; it also makes diagnostic programs easy to write.

The RTX 32P supports interrupt processing, including overflow/underflow interrupts for the data and return stacks. The usual way of handling an overflow is to copy half the on-chip stack contents to program memory, so that programs can use stacks of arbitrary depth. With a hardware stack buffer of 512 elements, typical Forth programs never overflow the stack.
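The spill-half-the-buffer policy just described can be sketched as follows. The class and the tiny buffer size are illustrative only (the real buffer holds 512 elements, and the real mechanism is an interrupt handler, not a Python class).

```python
# A sketch of the overflow policy described above: when the on-chip stack
# buffer fills, the bottom half is copied out to program memory; when the
# buffer empties, half a buffer's worth is reloaded.  Sizes and structure
# are illustrative.

class SpillingStack:
    def __init__(self, buffer_size=512):
        self.buffer_size = buffer_size
        self.buffer = []          # on-chip portion (top of stack at end)
        self.memory = []          # spilled portion in program memory

    def push(self, x):
        if len(self.buffer) == self.buffer_size:
            half = self.buffer_size // 2
            self.memory.extend(self.buffer[:half])   # spill bottom half
            del self.buffer[:half]
        self.buffer.append(x)

    def pop(self):
        if not self.buffer and self.memory:
            half = self.buffer_size // 2
            self.buffer = self.memory[-half:]        # reload from memory
            del self.memory[-half:]
        return self.buffer.pop()
```

With a 4-element buffer, pushing six values spills the two oldest to memory, and popping all six transparently reloads them, so the program sees a stack of unlimited depth.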

5.3.5 Implementation and application areas

The RTX 32P is implemented as 2 chips using 2.5 micron CMOS standard cell technology. The data path chip contains the ALU, the data stack, and the program memory, in an 84-pin LCC package. The control chip contains the rest of the system, in a 145-pin PGA package. The RTX 32P runs at an 8 MHz clock frequency.

The RTX 32P is designed for real-time embedded control applications, especially low-power, small-size embedded systems. As mentioned earlier, the RTX 32P is the prototype of a commercial processor, scheduled to be named the RTX 4000. That processor has several features that make it suitable for real-time control applications as well as for accelerating computation on personal computers. These features include: mixed ROM and RAM microcode to shrink the system toward a single chip, stand-alone operation, on-chip hardware floating point support, faster clock speed, and support for dynamic program memory chips. Of course, some versions of the chip may not contain all of these features. In addition, the architecture can be enhanced to support the C, Ada, and Lisp languages by using the address field of the instruction for fast access to 21-bit constants, which allows key operations such as stack-frame-pointer-plus-offset addressing to run at high speed.

The description in this chapter is based on the WISC CPU/32 descriptions in Koopman (1987c) and Koopman (1987d) and the RTX 32P introduction in Koopman (1989).

5.4 SF1 architecture

5.4.1 Introduction

Wright State University's SF1 (the name means "Stack Frame computer number 1") is a multi-stack processor designed to execute high-level languages, including Forth and C. The design allows a large number of stacks; the implementation described here has 5 stacks. The SF1 is influenced by the Forth language, but it spans the boundary between the ML0 and ML2 machine classes, since each instruction can directly address any of the stack elements in stack memory. It also mixes some features of the FRISC 3 and RTX 32P with some unique innovations. Wright State University has developed a series of stack-based computers, starting with RUFOR (Grewe & Dixon 1984), a pure Forth computer implemented with bit-slice components. In 1985-86 a computer architecture class built a discrete-component prototype of a more general stack processor called the SF1. In 1986-87 a VLSI class extended this architecture and used multiple custom silicon chips to build a VLSI version of the SF1. The following description is of the VLSI SF1 implementation.

The intended application of the SF1 is real-time control using Forth, C, and other high-level languages.

5.4.2 Block diagram

Figure 5.6 is a block diagram of the SF1 architecture.

Figure 5.6 SF1 block diagram

The SF1 has two buses, SBUS and MBUS. MBUS is a multiplexed bus used to transmit addresses to program memory, to exchange instructions with program memory, and to transfer data between system resources. The dual-bus design allows instruction fetch on MBUS to overlap data operations on SBUS.

The ALU has a top-of-stack register TOS that receives the results of all ALU operations. The ALU input register ALUI holds the second operand of the ALU. ALUI can take an operand from SBUS for most operations, or from MBUS for memory reads. Both ALUI and TOS can be placed on MBUS and SBUS. By convention the TOS register holds the top stack element, although programmers are free to manage it differently for special operations.

There are 8 different sources and destinations connected to the stack bus (SBUS):

S - parameter and general-purpose stack

L - loop counter stack

G - Global Stack

F - frame stack

R - Return Address Stack

C - in-line constant value

I - I / O address space

P - program counter

The machine manual calls all 8 of these sources "stacks," but in fact C, I, and P are not stack structures. The stacks L, G, F, and R are interchangeable in most cases and can be used for any purpose. The S stack is somewhat special: all subroutine return addresses are automatically pushed onto the S stack.

Any one of the top 8192 elements of these stacks can be designated as a bus source or destination. Whenever a stack is read, the top element can optionally be popped. When a stack is popped, the top element is always removed from stack memory, regardless of which element was actually read.

Similarly, when a stack is used as a bus destination, any one of the top 8192 elements can be written, or a new top element can be pushed from SBUS.
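These access rules, indexed read/write anywhere in the top of the stack, but a pop that always deletes the true top, can be sketched as a small class. The class and method names are invented for illustration.

```python
# A sketch of the SF1 stack access rules just described: any of the top
# stack elements can be read or written by giving its distance from the
# top, while a pop always deletes the true top element no matter which
# element was read.  Class and method names are invented for illustration.

class SF1Stack:
    def __init__(self, depth=8192):
        self.depth = depth
        self.items = []                  # top of stack kept at end of list

    def read(self, n, pop=False):
        value = self.items[-1 - n]       # element n positions below the top
        if pop:
            self.items.pop()             # the top element is always removed
        return value

    def write(self, n, value, push=False):
        if push:
            self.items.append(value)     # push a new top element from SBUS
        else:
            self.items[-1 - n] = value   # overwrite an element in place
```

For example, reading element 2 with the pop option returns the third element from the top but removes the top element, exactly the asymmetry described above.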

The C bus source places a 13-bit signed constant from the address field of the instruction onto the bus. P is used to read and write the program counter. I addresses an 8K-word I/O address space.

The program counter PC provides addresses to MBUS for instruction fetches; it can also be loaded from MBUS to implement subroutine calls and jumps. The PC can be read and written on SBUS to save and restore subroutine return addresses.

5.4.3 Instruction Set Summary

The SF1 has two instruction formats, as shown in Figure 5.7. The first format is used for subroutine calls and jumps, and the second for all other instructions.

Figure 5.7a SF1 Instruction Format - Jump/Call

Figure 5.7a gives the jump/subroutine call instruction format. Bit 0 of this instruction is 0. Bit 1 selects a jump when 1 and a subroutine call when 0. Bits 2-31 are a word-aligned jump/call target address. This instruction format is very similar to that of the RTX 32P given in Figure 5.4, but without the opcode field in the high bits. Jump and subroutine call instructions execute in a single cycle.

Figure 5.7B SF1 instruction format - operation

Figure 5.7b shows the format of operate instructions, which is rather like the FRISC 3 ALU instruction format of Figure 5.2c. In this format, bit 0 is always 1.

Bit 1 selects a SKIP operation. If SKIP is selected, the next instruction in the instruction stream is skipped when the Zero flag register is set to 1. This can be used to build branch-on-nonzero instruction sequences.

Bits 2-7 select the ALU operation. One special ALU operation returns the status flags generated by the previous ALU instruction. These flags can be used as an offset into a multi-way jump table for multi-way conditional branches. Such conditional branches are slower than the SKIP mechanism, but more flexible.

Before discussing bits 8-28, let us examine how SBUS works within a clock cycle. SBUS is used twice per clock cycle. In the first half of the cycle, SBUS reads one of the 8 bus sources, and the data read is always placed in the ALUI register. In the second half of the cycle, the ALU operates on the new ALUI value and the old TOS value; at the same time, the old TOS value is written back to one of the 8 bus destinations. Bit 29 of the instruction can override TOS as the written value, forcing the ALUI value (captured in the first half of the cycle) onto SBUS during the second half.

Bits 8-11 select the SBUS destination, which is written during the second half of the instruction's clock cycle. Bit 8 selects whether the destination stack is pushed or just written. Similarly, bits 12-15 select the SBUS source, which is read in the first half of the cycle. Bit 12 selects whether the source stack is popped.

Bits 16-28 provide the address for stack reads and writes, allowing access to any one of the top 8K stack elements. Note that there is only one address field in the instruction, so the source and destination stacks must use the same address within a given cycle.

Bit 29 overrides the value of TOS when writing to the SBUS destination. When this bit is set, the ALUI register is written instead of the TOS register. This allows direct transfers between two SBUS resources: the value is loaded into ALUI in the first half of the clock cycle and stored from ALUI in the second half.

Bits 30-31 control memory access. Bit 31 selects an extended instruction cycle, which accesses RAM through MBUS during a second clock cycle (the first clock cycle is used to fetch the next instruction). Bit 30 specifies RAM read or write; the TOS register provides the address in the first clock cycle, and data is provided or received in the second clock cycle. Note that the RAM read or write happens in the second of the two clock cycles, so bits 2-29 can be used to perform a normal operation during the first clock cycle. The first clock cycle is often used to reload TOS (which holds the address) with the value that is to be written to RAM in the second clock cycle.
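The two-phase SBUS cycle at the heart of the operate format can be sketched in a few lines. This is a behavioral illustration under the description above, not a cycle-accurate model; the callables stand in for the 8 bus sources and destinations.

```python
# A sketch of the two-phase SBUS cycle described above.  First half cycle:
# the source is read into ALUI.  Second half: the old TOS (or, with bit 29
# set, the just-loaded ALUI) is written to the destination while the ALU
# combines the old TOS with the new ALUI; TOS then holds the ALU result.
# The callables stand in for the 8 bus sources/destinations.

def operate_cycle(tos, read_source, write_dest, alu_op, bit29=False):
    """Simulate one operate instruction; returns (new_tos, new_alui)."""
    alui = read_source()                 # first half cycle: SBUS read
    write_dest(alui if bit29 else tos)   # second half cycle: SBUS write
    return alu_op(tos, alui), alui       # ALU result captured in TOS
```

For example, with TOS holding 10, a source yielding 3, and addition as the ALU operation, the destination receives the old TOS value 10 and the new TOS is 13; with bit 29 set, the destination instead receives the freshly read 3, which is the direct source-to-destination transfer described above.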

5.4.4 Architecture Features

Here again we see the importance of providing one data path through MBUS for instruction fetch while providing a second path through SBUS for data processing. Like other stack computers, the SF1 is designed for fast instruction execution, especially fast subroutine calls. The SF1 operate-instruction format uses operands in a very novel way. The single top-of-stack register used as an ALU input, and the single address field per instruction, make the architecture look like a 1-operand stack computer. However, the fact that each instruction can specify both a source and a destination makes the machine look more like a 2-operand computer. Perhaps it is most appropriate to call this a 1-operand instruction format, since only one address is available to be shared by the source and the destination.

The reason each stack has 8K directly addressable elements is to support high-level languages that use stack frames, such as C. In addition, a stack can also be used as a very large register file rather than for pure stack push and pop operations.

Several hardware stacks are provided to allow fast context switching in real-time control applications. Although the implementation described here contains only 5 hardware stacks, the number can be increased in other design versions. A simple way to allocate stacks is to assign one each to the 4 highest-priority tasks, and then let lower-priority tasks share the fifth stack, swapping its contents to and from program memory as needed.

A subroutine return pops the return stack and writes the top element into the PC register. Because an instruction prefetch pipeline is used, the instruction after the subroutine return is executed before the return takes effect.

32-bit constants are fetched from program memory using the 13-bit offset for PC-relative addressing; the constants are typically placed after an unconditional branch, or after the subroutine return of the procedure that uses them.

5.4.5 Implementation and application areas

The SF1 is implemented in 3.0 micron CMOS MOSIS technology using two custom chips: one chip contains the ALU and PC, and the other implements a 32-bit-wide stack. Control and instruction decoding currently use hard-wired discrete logic, to be eventually combined into a custom VLSI chip.

The implementation of the stacks is very different from the other stack computers we have seen. Because the stack must be randomly accessible, an obvious design method is to combine an adder and a stack pointer with standard memory. A disadvantage of this method is that it is relatively slow, and it is also difficult to extend to multi-chip stacks.

The method used by the SF1 is completely different. Each stack memory is actually a huge shift register that moves stack elements between adjacent memory words on each push or pop. Addressing the Nth word of the stack is simply a matter of accessing the Nth word of the memory, because the top element of the stack is always kept shifted to memory address 0. One disadvantage of this method is that a shift register cell is larger than a regular memory cell, so the largest SF1 stack chip contains only 128 words of 32 bits.
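The shift-register stack can be sketched as follows; the class is illustrative, but the behavior, every push shifting all elements down one word so the top is always at address 0, follows the description above.

```python
# A sketch of the SF1 shift-register stack described above: every push
# shifts all elements down one word so the top of stack is always at
# address 0, which makes "the Nth element below the top" a plain memory
# read at address N.  Capacity 128 words, as in the largest SF1 stack chip.

class ShiftRegisterStack:
    def __init__(self, capacity=128):
        self.mem = [0] * capacity       # address 0 holds the top of stack
        self.used = 0

    def push(self, x):
        self.mem[1:] = self.mem[:-1]    # shift every element down one word
        self.mem[0] = x
        self.used += 1

    def pop(self):
        x = self.mem[0]
        self.mem[:-1] = self.mem[1:]    # shift every element up one word
        self.used -= 1
        return x

    def read(self, n):
        return self.mem[n]              # random access: a plain memory read
```

Note that `read` needs no adder or stack pointer at all, which is exactly the advantage over the pointer-plus-adder design, at the cost of the larger shift-register cells.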

For the time being the SF1 is a research platform that emphasizes real-time control through fast context switching (assigning one stack chip to each task) and support for high-level languages.

This section's description of the SF1 is based on Dixon (1987) and Longway (1988).

Chapter 6 Understanding Stack Computers

In the previous chapters we described an abstraction of stack computers, and also several actual stack computer examples. What we discuss now is: why are these computers designed the way they are? What advantages do stack computers have over traditional computers?

This chapter uses three different computer design approaches as reference points. The first is the complex instruction set computer (CISC), typified by the VAX series and the microprocessors used in personal computers; the second is the reduced instruction set computer (RISC), typified by the Berkeley RISC project (Sequin & Patterson 1982) and the Stanford MIPS project (Hennessy 1984); the third is the stack computers we discussed in the previous chapters.

Section 6.1 reviews the debates of past years between advocates of register-based, stack-based, and memory-to-memory computers, and the more recent debate between high-level language architectures and RISC architectures.

Section 6.2 discusses the advantages of stack computers: small program size, low hardware complexity, high system performance, and more consistent performance than other processors across many applications.

Section 6.3 gives measurements of instruction execution frequencies in Forth programs. Not surprisingly, subroutine call instructions account for a large proportion of Forth programs.

Section 6.4 gives stack management results obtained by simulating stack accesses. The results show that a stack buffer of about 32 elements suffices for many applications. Four different methods of handling stack overflow are also discussed: using a very large stack memory, demand-fed stack management, paging, and an associative cache memory.

Section 6.5 examines the cost of interrupts and multitasking on stack computers. Simulation results show that in many environments context switching is a minor cost. The cost of stack buffer context switches can be reduced further by several methods, including: programming interrupts appropriately, using lightweight tasks, and dividing the stack memory into multiple smaller buffers.

6.1 Review of History

The debate between hardware-supported stack computers and other architectures has a long history. The argument can be divided into two main areas: the debate between register and non-register designers, and the debate between high-level language machine designers and RISC designers. We cannot give a final answer to these controversies, but the ideas in the references are worth considering for interested readers.

6.1.1 Register and non-register computer

Should registers be explicitly exposed to the programmer when designing a computer? This design decision goes back to the 1950s. The existence of the stack-based KDF.9 computer (Haley 1962) shows that the issue was being explored well before register-based design methods became dominant.

Several alternatives to register-based computers have been proposed: pure stack computers, single-operand stack computers (also known as stack/accumulator computers), and memory-to-memory computers.

Pure stack computers fall into the SS0, MS0, SL0, and ML0 classifications, and are very simple. A clear argument for stack computers is that expression evaluation needs a stack-like structure, while a register-based computer must spend some time simulating stack behavior when evaluating expressions. On the other hand, pure stack computers may need more instructions than stack/accumulator computers (the SS1, MS1, SL1, and ML1 classifications), because they cannot read a value and perform an arithmetic operation on it at the same time. Alert readers may have noticed that the stack computers discussed in Chapter 5 use multi-operand instructions such as Variable @ to compensate for this defect on large programs.
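The flavor of 0-operand evaluation can be seen in a toy interpreter. The mini instruction set below is invented for illustration and belongs to no real machine; it shows each operand fetch costing a separate push, which is exactly the cost a fused fetch-and-operate instruction (like Variable @) removes.

```python
# A toy illustration (no real machine's instruction set) of expression
# evaluation on a pure 0-operand stack machine: operands are pushed, and
# each operator works implicitly on the top of stack.  A stack/accumulator
# machine could fuse each push with the following operation, which is what
# combined words such as Variable @ recover on the machines of Chapter 5.

def run_pure_stack(program, memory):
    stack = []
    for op, arg in program:
        if op == "push":
            stack.append(memory[arg])          # fetch a variable
        elif op == "add":
            stack.append(stack.pop() + stack.pop())
        elif op == "mul":
            stack.append(stack.pop() * stack.pop())
    return stack.pop()

# a*b + c compiles to five 0-operand instructions:
program = [("push", "a"), ("push", "b"), ("mul", None),
           ("push", "c"), ("add", None)]
print(run_pure_stack(program, {"a": 2, "b": 3, "c": 4}))   # -> 10
```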

In a memory-to-memory computer architecture, all operands reside in memory, which looks valuable for running high-level languages like C or Pascal. The reason is this: in these languages, most assignment expressions have only one or two variables on the right-hand side of the assignment operator, which means many expressions can be handled by a single instruction, with no need to move data in and out of registers. The CRISP architecture (Ditzel et al. 1987a, 1987b) is an example of a memory-to-memory processor. Some of the most frequently cited views in this debate can be found in these papers: Keedy (1978a), Keedy (1978b), Keedy (1979), Myers (1977), Schulthess & Mumprecht (1977), and Sites (1978). These papers do not cover all the issues, and some aspects are outdated, but they provide a good starting point for those interested in the history of these debates.

6.1.2 High-level language computers versus RISC computers

A related debate is whether to build a high-level language computer or to follow the RISC philosophy of computer design.

High-level language computers can be seen as an evolution of the CISC philosophy. These computers have potentially very complex instructions that map directly onto one or more high-level languages. In some cases a compiler front end is used to generate intermediate code that can be executed directly by the machine, such as Pascal P-code or M-code for Modula-2. The limit of this philosophy is perhaps the SYMBOL project (Ditzel & Kwinn 1980), which implemented all system functions in hardware, including source program editing and compilation.

The RISC philosophy of supporting high-level languages is to provide the simplest possible building blocks for the compiler to synthesize high-level language operations. This typically means sequences of load, store, and arithmetic instructions to implement each high-level language statement. RISC designers claim that the code sequences thus produced run as fast as, or faster than, the equivalent complex instructions of a CISC computer.

The stack computer design philosophy lies somewhere between high-level language design and RISC design. The stack computer provides very simple primitives that can execute in a single memory cycle, in the spirit of the RISC philosophy. However, efficient application code on a stack computer is built from inexpensive subroutine calls; a collection of such subroutines can be seen as a virtual instruction set tailored to the needs of a high-level language compiler, without complex hardware support.

For more on high-level language computers versus RISC computers, see Cragon (1980), Ditzel & Patterson (1980), Kavipurapu & Cragon (1980), Kavi et al. (1982), Patterson & Piepho (1982), and Wirth (1987).

6.2 Differences between stack computers and traditional computers

The most significant difference between stack computers and traditional computers is the former's use of 0-operand stack addressing in place of the latter's register or memory addressing. When this difference is combined with fast subroutine calls, the stack computer surpasses traditional computers in the following respects: program size, processor complexity, system complexity, processor performance, and consistency of program execution.

6.2.1 Program Size

People often say: "memory is cheap." They say this because they have watched the history of rapid growth in memory, in which chip capacity increases fantastically over time.

The problem is that as chip capacity increases, the size of the problems computers are asked to solve grows at least as fast, which means that programs and data sets grow faster than the available memory capacity. The situation has become more serious as people use high-level languages in all phases of program design; these are of course bulky, but they also improve programmer productivity. Not surprisingly, in the face of this explosion of program complexity, the old saying seems to hold: "programs expand to fill all available memory, and then some."

The amount of memory available to a program is in fact fixed for a given application. This comes partly from the cost of memory chips, printed circuit board space, and so on, and partly from limits on physical properties such as power consumption, heat dissipation, and the number of expansion slots in the system (also an economic consideration). Even with unlimited cost and budget, electrical loading and propagation delays limit the number of memory chips a processor can use. Small program size reduces memory cost, component count, and power requirements, and can improve system speed by allowing the money saved to be spent on smaller, higher-speed memory chips. Additional benefits include better performance in virtual memory environments (Sweet & Sandman 1982, Moon 1985), and the fact that only a small cache chip may be needed to achieve a good hit ratio. Some applications, especially embedded processor applications, are very tight on printed circuit board space and memory chips, because these resources are a fixed part of overall system cost (Ditzel et al. 1987b).

Faced with ever-increasing program size, the traditional solution is a memory hierarchy offering a series of capacity/cost/access-time tradeoffs. The hierarchy includes (from cheapest/largest/slowest to most expensive/smallest/fastest): tape, disk, dynamic memory, cache memory, and on-chip instruction buffers. So a more accurate version of "memory is cheap" would be: "slow memory is cheap, but fast memory is very expensive."

The memory problem finally comes down to providing, at a price the processor's owner can afford, a memory that is large enough and fast enough. This is accomplished by arranging for most program accesses to hit the fastest level of the memory hierarchy.

The fastest level of the memory hierarchy is managed using cache memory. The principle of cache memory is that a small piece of program will be used many times within a certain period, so when this short piece of program is first referenced, it is copied from slow memory into the cache and kept there for later use; the second and subsequent accesses then proceed without delay. Because cache capacity is limited, instructions in the cache may be replaced when their locations are needed for more recently fetched instructions. The problem with a cache is that it must be large enough to retain a sufficiently long code segment for eventual reuse.

A cache memory large enough to hold the set of instructions in active use, called the "working set," can effectively improve system performance. How does program size affect cache performance? Suppose the working set contains a fixed number of high-level language operations, and consider the effect of more compact code. Intuition tells us that if the code sequence implementing a high-level language statement is more compact on machine A than on machine B, then machine A needs fewer cache bytes than machine B to hold the generated code of the same working set. This means machine A needs a smaller cache than machine B to obtain the same memory response time.

Davidson and Vaughan (1987) suggest, based on actual measurements, that RISC programs are about 2.5 times the size of CISC programs (although other sources, especially RISC developers, put the factor closer to 1.5). They also suggest that a RISC computer needs a cache twice the size of a CISC cache to achieve the same miss ratio. Furthermore, a RISC computer with twice the cache still generates twice the number of cache misses (because a constant miss ratio on twice as many cache accesses yields twice as many misses), so equal performance requires faster main memory devices. Hence the rule of thumb: a RISC processor running at 10 MIPS needs about 128K bytes of cache memory for satisfactory performance, while a high-end CISC processor typically needs no more than 64K bytes.

Compared with both RISC and CISC, stack machines have much smaller programs. Stack computer programs are roughly 2.5 to 8 times smaller than CISC code (Harris 1980, Ohran 1984, Schoellkopf 1980), although the discussion later in this section places some restrictions on this conclusion. This suggests that, to obtain comparable memory response times, a RISC processor may need a cache as big as a stack processor's entire program memory! Is this plausible? Consider the situation: UNIX/C programmers on RISC processors think in terms of 8M to 16M bytes of memory and 128K bytes of cache, while Forth programmers hotly debate whether a program on a stack computer really ever needs more than 64K bytes of program space.
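The miss-count arithmetic behind this argument is simple enough to write down. The numbers below are illustrative, and the key assumption, stated above, is that the miss ratio is held fixed by scaling the cache with the code.

```python
# The rule-of-thumb arithmetic described above, under the stated assumption
# of a fixed cache miss ratio: a code expansion factor of k means k times
# as many instruction fetches, hence k times as many absolute misses even
# if the cache is scaled up to keep the ratio fixed.  All numbers below
# are illustrative.

def absolute_misses(instruction_fetches, miss_ratio):
    return instruction_fetches * miss_ratio

cisc_fetches = 1_000_000
expansion = 2           # assumed RISC code expansion over CISC
miss_ratio = 0.02       # assumed fixed by scaling the cache

cisc_misses = absolute_misses(cisc_fetches, miss_ratio)
risc_misses = absolute_misses(cisc_fetches * expansion, miss_ratio)
print(risc_misses / cisc_misses)   # -> 2.0
```

Since each miss costs a main-memory access, twice the absolute misses means the main memory must be roughly twice as fast for equal performance, which is the point made above.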

The small program size of stack computers not only saves memory chips and reduces system cost, it also actually improves system performance. This is because it increases the chance that a needed instruction is already resident in high-speed memory, and may even make it possible to keep the entire program in fast memory.

Why do stack computers need so little memory? Two factors make small program size possible on stack computers. The more obvious factor, the one usually cited in papers, is that stack computer instructions are small. A traditional architecture must specify not only an operation in each instruction, but also operands and addressing modes. For example, a register-based computer typically adds two numbers with an instruction like:

Add R1, R2

This instruction must specify not only the ADD opcode, but also the two registers R1 and R2.

Conversely, a stack-based instruction set only needs to specify the ADD operation, because the operands are implicitly on top of the stack; operands are named only by load and store instructions, or when pushing an immediate value onto the stack. The WISC CPU/16 and Harris RTX 32P use 8-bit and 9-bit opcodes respectively, leaving many unassigned opcodes that provide room for efficient application-specific operations. Other processors discussed in this book, such as the Novix NC4016, use loosely encoded instructions in which 2 or more operations can be packed into one instruction, which compensates to some extent for the loss of byte-oriented code density.

The less obvious but more important reason stack computers have compact code is that they efficiently support the reuse of code through frequent subroutine calls, often referred to as threaded code (Bell 1973, Dewar 1975). When such code runs on a conventional computer, its execution speed suffers; in fact, a basic compiler optimization on RISC and CISC machines is to compile procedure calls as in-line macros. Moreover, many programmers have learned from experience that excessive procedure calls on traditional computers damage performance, which causes programs on traditional computers to be significantly larger than necessary. In contrast, stack computers support procedure calls efficiently. Since all working parameters are already on the stack, the overhead of a procedure call is minimal, with no memory cycles spent on parameter passing. On many stack processors a procedure call takes only one clock cycle, and a subroutine return can in most cases be combined with other operations at no cost in clock cycles.
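The threaded-code idea mentioned above can be illustrated with a minimal sketch: the compiled program is just a list of routine addresses (here, Python function references), walked by a tiny "inner interpreter." The word names and literal convention are invented for illustration; on a stack computer each list entry would instead be a one-cycle subroutine call.

```python
# A minimal sketch of threaded code: the compiled program is a list of
# routine addresses, and the inner interpreter simply walks the list.
# Words and the inline-literal convention are invented for illustration.

stack = []

def lit(value):            # push a literal (takes its operand inline)
    stack.append(value)

def add():
    stack.append(stack.pop() + stack.pop())

def dup():
    stack.append(stack[-1])

def run(thread):
    ip = 0                               # "instruction pointer"
    while ip < len(thread):
        word = thread[ip]
        ip += 1
        if word is lit:                  # literals carry an inline operand
            word(thread[ip])
            ip += 1
        else:
            word()

run([lit, 3, dup, add])                  # computes 3 + 3
print(stack)                             # -> [6]
```

Each "instruction" in the thread is just an address, which is why threaded code is so compact, and why a machine with one-cycle subroutine calls executes it without the interpretive overhead a conventional processor pays.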

Several caveats apply when claiming that stack-computer code is more compact than that of other computers, especially since we cannot offer results from a comprehensive study. Program size varies greatly with the programming language, the compiler, the programming style, and the processor's instruction set. Also, the stack-computer studies by Harris, Ohran, and Schoellkopf were based mainly on variable-length instructions, whereas the processors described in this book use fixed-length 16- or 32-bit instructions.

With fixed-length instructions, consider the following fact: processors running Forth have smaller programs than other stack computers. The programs are smaller because they use subroutine calls more freely, allowing a high degree of code reuse within a single application. We shall also see in a later section that even 32-bit processors such as the RTX 32P do not consume as much program memory as one might expect.

6.2.2 Processor and System Complexity

When examining the complexity of a computer, it is important to distinguish two levels: processor complexity and system complexity. Processor complexity is the amount of logic (measured in chip area, transistor count, and so on) in the processor core actually used for computation. System complexity considers the processor embedded in a fully functional system, including the supporting circuitry, the memory, and the software.

CISC computers have grown extremely complex over the years, and this complexity stems from the requirement to perform many functions well. Much of the complexity comes from tightly encoding many instruction formats; more comes from supporting multiple programming and data models. A machine that must, in one time slice, process COBOL packed-decimal numbers, in the next perform Fortran floating-point matrix operations, and in a third run a Lisp expert system, with still more demands besides, is bound to be extremely complicated.

Part of the complexity of CISC is also the result of striving for smaller programs. The goal is to narrow the semantic gap between high-level languages and machine instructions and thereby generate more efficient code. Unfortunately, this consumes almost all the available chip area for control logic and data paths (as in the Motorola 680x0 and Intel 80x86 families). Hence RISC designers like to say that CISC designers have traded away performance for the sake of program size.

The extreme complexity of some CISC processors may look indefensible, but it derives from a universal and worthy goal: establishing a consistent and simple interface between hardware and software. Its success can be seen in the IBM System/370 product line, which spans a wide range of performance and price while using the same assembly-language instruction set, from a personal-computer plug-in card up to a supercomputer.

A consistent hardware/software interface at the assembly-language level means that compilers need not be very complex to produce acceptable code in most respects, and that code can be shared across different machines in the same family. Another advantage of CISC is that, because the code is very compact, a large cache is not needed for good system performance; that is, CISC reduces system complexity by increasing processor complexity. The idea behind RISC computers is to make the processor faster by reducing its complexity: compared with CISC, a RISC processor spends fewer transistors on control circuitry. It accomplishes this with simple instruction formats and instructions of low semantic content; each instruction does little work, but takes only a little time to complete. The instruction formats are usually chosen to suit a particular programming language and task, typically integer operations in the C programming language.

Reducing processor complexity is not without cost. Many RISC processors use a large register file to ensure that frequently accessed data can be reused quickly, and these registers must be dual-ported memory (allowing two different addresses to be accessed simultaneously) to deliver two source operands per cycle. Furthermore, since the instructions have low semantic content, a RISC processor needs high memory bandwidth to keep the instruction stream flowing into the CPU. This means that, to obtain full performance, substantial system resources must be devoted to cache memory. In addition, RISC processors are internally pipelined, which means extra hardware and compiler technology to manage the pipeline. Interrupt processing requires special care and additional hardware resources to ensure that pipeline state is properly saved and restored.

Finally, the various RISC implementation strategies place heavy demands on the compiler: scheduling the pipeline to avoid hazards, filling branch delay slots, and managing register allocation and spilling. In reducing processor complexity, much complexity is pushed onto the compiler, making compilers more complex, more expensive to develop and debug, and more prone to bugs.

Reducing the complexity of the RISC processor thus produces an offsetting (perhaps even greater) increase in the complexity of the system.

Stack computers strive for a better balance between processor complexity and system complexity. Stack-computer designers recognize that processor simplicity is not just a matter of limiting the number of instructions, but also of limiting the data an instruction may operate on: all operations work on the top of the stack. In this sense a stack computer should be called not a "reduced instruction set computer" but a "reduced operand set computer."

By restricting operand selection rather than how much the operations do, instructions can be very compact, because they need specify only the actual operation and not where the sources come from. The on-chip stack memory can be single-ported, because only one element need be pushed or popped per stack operation (assuming the top two stack elements are held in registers). More important, because the operands are always known in advance to be the top stack elements, no pipelining is needed to prefetch them: the operands are always immediately available in the top-of-stack registers. As an example, consider the T and N registers in the NC4016 design, and contrast them with the large, randomly addressed register files of typical RISC computers.
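The arrangement just described, in which the top two cells live in registers (T and N, as in the NC4016) backed by a single-port stack memory, can be modelled in a few lines. This is a behavioral sketch under simplifying assumptions, not a description of the actual chip datapath.

```python
# Behavioral model: top two stack cells in registers, rest in a
# single-port memory. An ALU op like ADD touches no memory at all.

class HardwareStack:
    def __init__(self):
        self.t = 0      # T: top-of-stack register
        self.n = 0      # N: next-on-stack register
        self.mem = []   # single-port backing stack memory

    def push(self, value):
        self.mem.append(self.n)   # exactly one memory write
        self.n = self.t
        self.t = value

    def pop(self):
        value = self.t
        self.t = self.n
        self.n = self.mem.pop() if self.mem else 0   # one memory read
        return value

    def add(self):
        # Both operands are already in registers: the ALU input needs
        # no operand-fetch cycle; only the stack refill touches memory.
        self.t, self.n = self.t + self.n, (self.mem.pop() if self.mem else 0)

s = HardwareStack()
s.push(2)
s.push(3)
s.add()
print(s.pop())   # -> 5
```

Because every operation moves at most one cell through `mem`, a single memory port suffices, which is the point of the comparison with dual-ported RISC register files.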

Selecting operands implicitly also simplifies the instruction formats. Whereas even RISC computers have several instruction formats, a stack computer needs only a few; in the extreme case of the RTX 32P there is only one. This paucity of formats simplifies the instruction-decoding logic and speeds up the operation of the system.

Stack computers are exceptionally simple: the core of a 16-bit stack computer typically takes only 20,000 to 35,000 transistors. Compare the Intel 80386 chip with 275,000 transistors and the Motorola 68020 with 200,000. Even allowing for the fact that the 80386 and 68020 are 32-bit computers, the difference is striking. The compilers for stack computers are also very simple, because instruction formats and operand selection are uniform. In fact, compilers for register machines commonly use a stack for expression evaluation and then map that information onto the registers; a stack-computer compiler has very little work to do in mapping stack sources onto assembly language. The simplicity and flexibility of Forth compilers is well known.

Stack-computer systems are also very simple. Because stack programs are so small, cache control hardware for high performance is superfluous: a typical program can reside entirely in memory of cache speed without any complex cache control circuitry.

When the program and/or data are too large to fit in such memory, a software-managed memory hierarchy can be used: frequently used subroutines and program segments can be placed in high-speed memory, with rarely used segments in slower memory. Single-cycle calls into the high-speed memory make this technique very effective.

The data stack itself acts as a data cache for purposes such as procedure parameter passing, and data elements can be moved into and out of high-speed memory under software control. Although a conventional data cache, and to a lesser extent an instruction cache, can indeed improve speed somewhat, a stack computer running small or medium-sized applications does not require a cache at all.

Stack computers, then, reduce processor complexity by limiting the operands available to an instruction. This does not force a reduction in the available instructions, nor does it require sweeping changes in the hardware and software that support the processor. One result of this reduced complexity is that stack computers have room to spare for on-chip program memory; a further consequence is that many application programs are small enough to fit entirely on-chip. Such memory is faster than off-chip cache, saving complex control circuitry without sacrificing any speed.

6.2.3 Performance of the processor

Processor performance is a difficult subject. An enormous amount of effort goes into arguing which processor is better than which, but these efforts are often based on crude and plainly questionable benchmark programs, and the enthusiasm behind them comes mainly from self-interest and loyalty to a particular product (or from a purchasing relationship).

Comparison is difficult mainly because of differences in application areas: a benchmark built on integer arithmetic says nothing about floating-point performance, commercial applications, or symbol processing. At best, a benchmark shows that processor A beats processor B on a given program, under a given operating system, with a given compiler and a given programming language, and that result holds only for that benchmark. Clearly, comparing the performance of different computers is hard.

Comparing entirely different architectures is harder still. The difficulty begins with the amount of work done by a single instruction: because a VAX instruction does a very different amount of work from a register-to-register RISC instruction, "instructions per second" is meaningless across architectures (and even with a normalized standard instruction mix, we still cannot really trust results from different benchmark programs). Another problem is that different processors are built in different technologies (bipolar, ECL, SOS, NMOS, and CMOS at various feature sizes) and with different implementation strategies (expensive full-custom layout, standard-cell automatic layout, gate-array layout). Comparing architectures on a purely conceptual level requires deducting the effects of these implementation technologies. Furthermore, the software has an enormous influence: in the real world, the efficiency of a given computer is determined not only by processor speed but also by the performance and quality of the system hardware, operating system, programming language, and compiler. All these difficulties lead the reader to one conclusion: precise performance measurement is an unsolvable problem, and we shall not attempt it here. Instead we concentrate on a few questions: Why can stack computers be faster than other kinds of computers? Why do stack computers have good system-level performance characteristics? And what kinds of programs are they best suited to run?

6.2.3.1 Rate of Instruction Execution

Figure 6.1A Instruction stage overlap - unpipelined execution

The most sophisticated RISC processors claim the highest execution rate: one instruction per clock. They achieve this with a pipeline that divides each instruction into a sequence of stages, such as address generation, instruction fetch, execution, and data storage cycles, as shown in Figure 6.1A. This acceleration of the instruction stream comes at a price, and problems arise. The most obvious is avoiding the hazards caused by data dependencies, which occur when one instruction depends on the result of the previous one: the second instruction must wait until the first has produced its result. There are several hardware and software strategies to mitigate the impact of data dependencies, but none solves the problem completely.

A stack computer can execute programs as fast as a RISC machine, and perhaps faster, because it has no data-dependency problem. It used to be said that register machines are effective because they can use pipelining in a way stack computers cannot, since on a stack machine each instruction depends on its predecessor's effect on the stack. But in fact the stack computer simply does not need a deep pipeline, and can still be as fast as RISC.

Consider how a RISC design compares with a stack design, stage by stage. Both machines must fetch instructions, and both can overlap the fetch with execution of the previous instruction. For convenience, we lump instruction decoding in with the fetch. Since a stack computer such as the RTX 32P need perform no conditional operations to extract parameter fields from the instruction or to select among formats, its decoding is easier than a RISC computer's.

At the next pipeline stage, the main differences become apparent. The RISC computer must spend a pipeline stage (at least in some cases) reading operands after the instruction has been decoded, since a RISC instruction specifies two or more registers as inputs to the ALU operation. Stack computers do not need an operand-fetch stage; the top-of-stack data is waiting when needed. This means that, at minimum, the stack computer saves the operand-read portion of the pipeline. In fact, a stack access can be done faster than a register access, because the single-ported stack memory is small, and it is certainly faster than dual-ported register memory.

The instruction-execution stages of the RISC and stack computers can be considered the same, since the same ALU could be used in both systems. Even here the stack computer can have an advantage: on the M17 stack computer, for example, the ALU can begin operating on the top stack elements before instruction decoding is finished.

The operand-storage phase requires another memory cycle in some RISC designs, since the result must be written back to the register file. This write conflicts with the operand reads of a new instruction, requiring either a delay or a triple-ported register file. Alternatively, the ALU may hold its output in a register, which is then written into the register file during the next clock cycle. The stack computer, by contrast, simply deposits the ALU output into the top-of-stack register. A related problem is that the RISC processor needs extra data-forwarding logic to catch the case where the register being written is an ALU input of the next instruction; the stack computer's ALU output is always available, since it is the implied ALU input.

Figure 6.1B Instruction stage overlap - typical RISC computer

Figure 6.1C Instruction stage overlap - typical stack computer

Figure 6.1B shows that a RISC processor requires at least three or four pipeline stages to maintain its throughput: instruction fetch, operand fetch, and instruction execute/operand store. Notice also that several problems of the RISC design approach, such as data dependencies and resource contention, are very simple matters on a stack computer. Figure 6.1C shows that a stack computer needs only a two-stage pipeline: instruction fetch and instruction execute.
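The cycle-count argument behind Figures 6.1B and 6.1C can be sketched with back-of-envelope formulas. The stage counts and the one-cycle interlock stall per dependent pair are illustrative assumptions, not measurements of any particular processor.

```python
# Rough pipeline timing model under stated assumptions.

def risc_cycles(n_insns: int, n_dependent_pairs: int) -> int:
    """4-stage pipeline: 3 cycles of fill latency, one instruction per
    cycle thereafter, plus an assumed 1-cycle interlock stall for each
    back-to-back dependent pair."""
    return 3 + n_insns + n_dependent_pairs

def stack_cycles(n_insns: int) -> int:
    """2-stage pipeline: 1 cycle of fill latency, one instruction per
    cycle, and no operand-fetch stage, hence no data-dependency stalls."""
    return 1 + n_insns

# Ten instructions where each one consumes the previous result:
print(risc_cycles(10, 9), stack_cycles(10))   # -> 22 11
```

The absolute numbers matter less than the structure: the stack machine's short pipeline has nothing to stall, which is the point made in the text.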

All of this means that there is no reason for a stack computer to be slower than a RISC processor, and every reason for it to be both faster and simpler.

6.2.3.2 System Performance

System performance is even harder to measure than raw processor performance. It encompasses not only how much straight-line code can be executed per second, but also interrupt-handling speed, context-switching speed, and the performance degradation caused by conditional branches and subroutine calls. Measures such as three-dimensional computer performance (Rabbat et al. 1988) reflect a computer's system performance better than raw instruction rates do.

Both RISC and CISC computers are normally tuned for executing straight-line code, and the many procedure calls found in well-factored programs seriously degrade their performance. The cost of a call includes not only saving the program counter and fetching a different instruction stream, but also saving and restoring registers, arranging parameters, and possibly disrupting the pipeline. Stack computers have a dedicated hardware structure, the return stack, that supports their control flow, and because all working variables are kept on the hardware stack, very little time is needed to pass parameters to a subroutine: usually just a DUP or OVER instruction.

Conditional branches are difficult for all types of processors. Prefetch and pipeline mechanisms are premised on an uninterrupted instruction stream that keeps them busy, and a conditional branch forces the prefetcher to wait until the branch is resolved. The only alternatives are to predict the likely path and to do only non-destructive work while waiting for the branch. RISC machines use "branch delay slots," placing after the branch a non-destructive instruction or a NOOP, which is always executed.

Stack computers handle branches in different ways, achieving a branch within a single cycle without delay slots and without added compiler complexity. The NC4016 and RTX 2000 address the problem by making the memory cycle longer than the processor cycle, which leaves time within a processor cycle to generate the target address and still fetch the instruction at the tail of the clock cycle. This works well, but the problem is that advances in semiconductor processing improve processor speed much faster than they improve program-memory speed.

The FRISC 3 generates the branch condition with one instruction and completes the branch with the next. This is a clever approach, because on any machine most branch instructions are preceded by a comparison or some other condition-setting operation. Along with the comparison (usually a subtraction), the FRISC 3 specifies which condition the following branch instruction will use. This moves most of the branch decision into the compare instruction, so the conditional branch that follows need only test a single bit.
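The split just described can be sketched as follows. This is a behavioral illustration of the idea (the compare decides the condition, the branch merely samples one precomputed bit), not the actual FRISC 3 encoding or class names.

```python
# Behavioral sketch of compare-then-branch with a single condition flag.

class BranchUnit:
    def __init__(self):
        self.flag = False

    def compare_lt(self, a: int, b: int) -> None:
        # The comparison (a subtraction-style test here) also decides
        # the condition the next branch will use.
        self.flag = (a - b) < 0

    def branch_if(self, taken_pc: int, fallthrough_pc: int) -> int:
        # The branch itself only tests the single precomputed bit.
        return taken_pc if self.flag else fallthrough_pc

bu = BranchUnit()
bu.compare_lt(3, 7)
print(bu.branch_if(100, 200))   # -> 100
```

Because `branch_if` does no arithmetic, the branch instruction's critical path is just a one-bit multiplexer decision.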

The RTX 32P uses microcode to combine a comparison and a conditional branch into two instruction cycles, the same time taken by the comparison alone. For example, = and 0BRANCH can be combined into a single operation of four microcode periods (two instruction cycles).

For interrupt processing, stack computers are simpler than RISC and CISC computers. In a CISC computer, a complex instruction may take many cycles to complete, so it must be interruptible; this imposes considerable overhead in control logic for saving and restoring machine state in mid-instruction. RISC computers, because they are pipelined, must save and restore pipeline state on every interrupt. They also have registers that must be saved and restored to free resources for the interrupt service routine. On both RISC and CISC machines, responding to an interrupt commonly takes a few hundred microseconds.

The interrupt-processing time of a stack computer, on the other hand, is typically just a few clock cycles. Interrupts are treated as subroutine calls invoked by hardware; no pipeline need be flushed and restored. All the processor must do to handle an interrupt is insert the interrupt-response address into the instruction stream as a subroutine call and push the interrupt mask register onto the stack (to prevent recursive calls to the interrupt service routine). Once in the interrupt service routine, no registers need be saved, because the new routine simply pushes its data onto the top of the stack. To illustrate how fast stack-processor interrupt handling is, consider an example: the RTX 2000 needs only 4 clock cycles (400 ns) between interrupt acknowledgment and execution of the first instruction of the interrupt service routine.
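The interrupt-as-subroutine-call mechanism described above amounts to pushing two items and loading the program counter. The sketch below illustrates that idea under simplifying assumptions; the register names and vector values are hypothetical, not taken from the RTX 2000.

```python
# Behavioral sketch: an interrupt is a hardware-forced subroutine call.

class StackCPU:
    def __init__(self):
        self.pc = 0
        self.return_stack = []
        self.mask = 0b0   # interrupt mask register

    def interrupt(self, vector: int, new_mask: int) -> None:
        # No general registers to save: push PC and mask, then jump.
        self.return_stack.append(self.pc)
        self.return_stack.append(self.mask)
        self.mask = new_mask   # masked to prevent recursive entry
        self.pc = vector

    def return_from_interrupt(self) -> None:
        self.mask = self.return_stack.pop()
        self.pc = self.return_stack.pop()

cpu = StackCPU()
cpu.pc = 0x1234
cpu.interrupt(vector=0x40, new_mask=0b1)
print(hex(cpu.pc))             # -> 0x40
cpu.return_from_interrupt()
print(hex(cpu.pc), cpu.mask)   # -> 0x1234 0
```

The service routine itself needs no register save/restore because its workspace is simply the (shared) top of the data stack.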

One might think that context switching on a stack computer must be slower than on other processors, but as we shall see later, this is not the case.

A final advantage of stack computers is that their simplicity leaves plenty of room on the microprocessor for customer- and application-specific hardware. For example, the Harris RTX 2000 has an on-chip hardware multiplier; other application-specific hardware might include FFT address generators, A/D and D/A converters, or communication interfaces. Such hardware can sharply reduce the component count of the final system and can greatly reduce program execution time.

6.2.3.3 Which Programs Run Best on Stack Computers?

Programs that execute effectively on a stack computer include: programs dominated by subroutine calls; programs with many control-flow structures; programs that perform symbolic computation (which often involves stack structures and recursion); programs designed to handle frequent interrupts; and programs that must run in limited memory.

6.2.4 Consistency of program execution

Advanced RISC and CISC processors depend on a battery of special techniques that achieve high performance averaged over a fairly long period, with no way to guarantee high performance over any short interval. The techniques include instruction prefetch queues, complex pipelines, scoreboarding, cache memory, branch target buffers, and branch prediction buffers. The problem is that none of these techniques can guarantee the processor's instantaneous performance at any particular moment: an unfortunate sequence of external events or internal data can cause cache misses, queue flushes, and other delays. Average high performance is acceptable for some programs, but for many real-time applications, predictable and consistently high performance is more important.

Stack computers use none of the dynamic acceleration techniques listed above, yet deliver excellent system performance. Stack-computer program execution is so simple that performance is highly consistent over any measurement interval. As we shall see in Chapter 8, this has a major impact on real-time control applications.

6.3 Studies of Forth Instruction Execution

Having gained a conceptual understanding of how stack computers compare with other computers, we can now give some quantitative statistics on stack-computer program execution. Work measuring instruction frequencies and code sizes on stack computers and register-based computers includes: Blake (1977), Cooke & Donde (1982), Cooke & Lee (1980), Cragon (1979), Haikala (1982), McDaniel (1982), Sweet & Sandman (1982), and Tanenbaum (1978). Unfortunately, many of these results were obtained from programs written in conventional languages rather than in an inherently stack-based language such as Forth. Hayes et al. (1987) give execution statistics for a Forth program, and we extend those results here. The results in this chapter are all based on Forth, so that most of the advantages of the stack computer are brought into play. Note that any benchmark set is limited, so these results should be treated only as an approximation of the "truth" (however one chooses to understand it).

The following sections refer to several different benchmark programs, all written in 16-bit Forth. They are:

FRAC: a fractal picture generation program employing a random-number generator; it always uses the same initial value as a seed, so it generates the same picture (Koopman 1987e, Koopman 1987f).

LIFE: a simple implementation of Conway's game of Life on an 80-column by 25-line character display. Each run computes 10 generations of a full screen.

MATH: a 32-bit floating-point package written entirely in high-level Forth code, with no machine-level primitives. Each run generates tables of the sine, cosine, and tangent functions for integer degrees from 1 to 10. (Koopman 1985)

COMPILE: a script that compiles Forth programs, exercising the behavior of the Forth compiler itself.

FIB: computes the 24th Fibonacci number with a recursive procedure, also known as the "dumb" Fibonacci.

HANOI: the Towers of Hanoi problem, written as a recursive procedure; each run moves 12 disks.

QUEENS: the N-queens problem (a generalization of the 8-queens problem), written recursively. The program finds the first acceptable placement on an N x N board; each run computes the N = 12 case.
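The three recursive benchmarks above are classic exercises; minimal Python equivalents are sketched below. FIB and HANOI use the text's parameters (24th Fibonacci number, 12 disks); QUEENS is shown counting solutions for a small board, since the text's N = 12 search would be slow in this toy form.

```python
def fib(n: int) -> int:
    """The 'dumb' doubly recursive Fibonacci of the FIB benchmark."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def hanoi(n, src, dst, via, moves):
    """Towers of Hanoi: append (from, to) moves for n disks."""
    if n > 0:
        hanoi(n - 1, src, via, dst, moves)
        moves.append((src, dst))
        hanoi(n - 1, via, dst, src, moves)
    return moves

def queens(n: int, cols=()) -> int:
    """Count N-queens solutions by recursive backtracking; cols[i] is
    the column of the queen placed on row i."""
    if len(cols) == n:
        return 1
    return sum(queens(n, cols + (c,))
               for c in range(n)
               if all(c != pc and abs(c - pc) != len(cols) - i
                      for i, pc in enumerate(cols)))

print(fib(24))                               # -> 46368
print(len(hanoi(12, 'A', 'C', 'B', [])))     # -> 4095 moves for 12 disks
print(queens(6))                             # -> 4 solutions
```

All three are dominated by subroutine calls with small bodies, which is exactly the workload the preceding sections argue stack processors handle best.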

Three of these programs best represent the mix found in different applications: MATH, which uses intensive stack manipulation to manage 32-bit quantities (and a 48-bit intermediate floating-point format) on a 16-bit stack computer; LIFE, which manages a matrix of memory cells with many conditional branches; and FRAC, which performs graphics computation and elementary graphics mappings.

The COMPILE benchmark is also useful: it reflects the behavior of a compiler, which must tokenize its input stream and search for identifiers.

6.3.1 Dynamic Instruction Frequencies

Name        FRAC      LIFE      MATH     COMPILE    AVE
CALL       11.16%    12.73%    12.59%    12.36%    12.21%
EXIT       11.07%    12.72%    12.55%    10.60%    11.74%
VARIABLE    7.63%    10.30%     2.26%     1.65%     5.46%
@           7.49%     2.05%     0.96%    11.09%     5.40%
0BRANCH     3.39%     6.38%     3.23%     6.11%     4.78%
LIT         3.94%     5.22%     4.92%     4.09%     4.54%
+           3.41%    10.45%     0.60%     2.26%     4.18%
SWAP        4.43%     2.99%     7.00%     1.17%     3.90%
R>          2.05%     0.00%    11.28%     2.23%     3.89%
>R          2.05%     0.00%    11.28%     2.16%     3.87%
CONSTANT    3.92%     3.50%     2.78%     4.50%     3.68%
DUP         4.08%     0.45%     1.88%     5.78%     3.05%
ROT         4.05%     0.00%     4.61%     0.48%     2.29%
USER        0.07%     0.00%     0.06%     8.59%     2.18%
C@          0.00%     7.52%     0.01%     0.36%     1.97%
I           0.58%     6.66%     0.01%     0.23%     1.87%
=           0.33%     4.48%     0.01%     1.87%     1.67%
AND         0.17%     3.12%     3.14%     0.04%     1.61%
BRANCH      1.61%     1.57%     0.72%     2.26%     1.54%
EXECUTE     0.14%     0.00%     0.02%     2.45%     0.65%

Instructions:  2051600   1296143   6133519   447050

Table 6.1. Dynamic execution frequencies of important Forth primitives

Table 6.1 shows the dynamic execution frequencies of Forth primitives for four benchmarks: FRAC, LIFE, MATH, and COMPILE. The dynamic instruction frequency is the number of times an instruction is executed during a program run. Appendix C gives the complete results corresponding to Table 6.1. The AVE column is the weighted average of the four benchmarks and can be taken as an approximation for most Forth programs. The Forth words in the table were selected because they rank in the top 10 of the AVE column or in the top 10 for a particular program. For example, EXECUTE accounts for only 0.65% in the AVE column, but 2.45% in COMPILE, which places it 10th in that benchmark.

The most striking feature of these numbers is that subroutine calls and returns far outnumber other operations, which explains why Forth-derived stack processors emphasize combining subroutine call/return with other instructions. The number of subroutine returns is lower than the number of calls because some Forth operations pop a return address from the return stack and thereby exit two levels of subroutine call at once; conditionally executed exits behave this way.

It is also interesting to consider the time spent on stack manipulation. About 25% of all sampled instructions go to pure stack handling. This looks high at first; however, since stack processors can combine stack manipulations with other work (such as combining OVER with -), the effective cost is much lower than the raw figure suggests. Moreover, this 25% is inflated by about 5 percentage points by the floating-point math package; that cost disappears on a 32-bit processor, or on one with fast access to user memory space (such as the NC4016 and RTX 2000).

Another interesting aspect is the effort spent preparing, fetching, and storing data on the stack (included in VARIABLE, @, LIT, CONSTANT, and USER). Fortunately, stack computers can combine these instructions with other operations. A final observation is that many instructions listed in Appendix C have low dynamic execution frequencies, but it would be wrong to conclude that such instructions are unimportant: many of them, if executed without hardware support, would take a long time, so execution frequency alone cannot determine whether an instruction matters.
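Dynamic frequencies like those in Table 6.1 are gathered by running the benchmarks under an instrumented inner interpreter that ticks a counter for every primitive executed. A minimal sketch with an illustrative (and hypothetical) three-word vocabulary:

```python
from collections import Counter

counts = Counter()
stack = []

# Toy primitive set; names are illustrative, not a real Forth kernel.
PRIMITIVES = {
    "LIT1": lambda: stack.append(1),
    "DUP":  lambda: stack.append(stack[-1]),
    "+":    lambda: stack.append(stack.pop() + stack.pop()),
}

def run(program):
    for name in program:
        counts[name] += 1      # dynamic count: one tick per execution
        PRIMITIVES[name]()

run(["LIT1", "DUP", "+", "DUP", "+"])   # computes (1+1)*2 = 4

total = sum(counts.values())
for name, c in counts.most_common():
    print(f"{name:5s} {100 * c / total:5.1f}%")
```

Static frequencies (Section 6.3.2) would instead count occurrences in the compiled program text, without running it.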

6.3.2 Static instruction frequency

Name        FRAC      LIFE      MATH     COMPILE    AVE
CALL       16.82%    31.44%    37.61%    17.62%    25.87%
LIT        11.35%     7.22%    11.02%     8.03%     9.41%
EXIT        5.75%     7.22%     9.90%     7.00%     7.47%
@          10.81%     1.27%     1.40%     8.88%     5.59%
DUP         4.38%     1.70%     2.84%     4.18%     3.28%
0BRANCH     3.01%     2.55%     3.67%     3.16%     3.10%
PICK        6.29%     0.00%     1.04%     4.53%     2.97%
+           3.28%     2.97%     0.76%     4.61%     2.90%
SWAP        1.78%     5.10%     1.19%     3.16%     2.81%
OVER        2.05%     5.10%     0.76%     2.05%     2.49%
!           3.28%     2.12%     0.90%     2.99%     2.32%
I           1.37%     5.10%     0.1%      1.62%     2.05%
DROP        2.60%     0.85%     1.69%     2.31%     1.86%
BRANCH      1.92%     0.85%     2.09%     2.05%     1.73%
>R          0.55%     0.00%     4.1%      0.77%     1.36%
R>          0.55%     0.00%     4.68%     0.77%     1.50%
C@          0.00%     3.40%     0.61%     0.34%     1.09%
=           0.14%     2.76%     0.29%     0.26%     0.86%

Instructions:  731   471   2777   1171

Table 6.2. Static frequencies of important Forth primitives

Table 6.2 gives the static compilation frequencies of the most commonly used primitives in FRAC, LIFE, MATH, and the COMPILE benchmark (which compiles FRAC, QUEENS, HANOI, and FIB). The static frequency of an instruction is the number of times it appears in the compiled program. The AVE column gives the weighted average of the four benchmarks and can be taken as typical of most compiled Forth programs. The Forth words listed rank in the top 10 of the AVE column or in the top 10 for a particular program. In these static results, subroutine calls are very frequent, accounting for approximately one quarter of all compiled instructions. Note that FRAC is counted twice, since it is also included in COMPILE, so the actual subroutine-call proportion is slightly lower than the tabulated value.

6.3.3 Instruction Compression on the RTX 32P

Since subroutine calls are so common, it is no surprise that most stack processors have mechanisms for combining subroutine calls with other instructions. A key point is that subroutine calls are even more common in source code than subroutine returns, making their combination with other instructions all the more attractive.

The RTX 32P discussed in Chapter 5 is unique in having a single instruction format that combines an opcode with a subroutine call, a subroutine return, or an unconditional branch. At first this seems to waste memory space, but in fact performance improves greatly while the memory cost remains relatively low. Unfortunately, this single instruction format is practical only on a 32-bit processor, because a 16-bit instruction does not have enough bits to hold both an opcode and a large address field.
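The combined format can be sketched as simple bit packing. The 9-bit opcode width follows the earlier remark about the RTX 32P; the remaining 23-bit address field and the example opcode values are assumptions for illustration, not the chip's documented layout.

```python
# Illustrative packing of an opcode plus call target in one 32-bit word.

OPCODE_BITS = 9
ADDR_BITS = 32 - OPCODE_BITS   # assumed 23-bit address field

def pack(opcode: int, target: int) -> int:
    assert 0 <= opcode < (1 << OPCODE_BITS)
    assert 0 <= target < (1 << ADDR_BITS)
    return (opcode << ADDR_BITS) | target

def unpack(word: int):
    return word >> ADDR_BITS, word & ((1 << ADDR_BITS) - 1)

word = pack(0x1F, 0x1234)   # e.g. "do op 0x1F, then call 0x1234"
print(unpack(word))         # -> (31, 4660)
```

This also shows why the scheme needs 32 bits: a 16-bit word minus a 9-bit opcode leaves only a 7-bit address field, far too small for call targets.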

Tables 6.3 and 6.4 give the execution and compilation results for the FRAC, Life, and Math programs rewritten to take advantage of the 32-bit environment.

6.3.3.1 Execution speed benefits

Table 6.3 gives four sets of dynamic execution statistics for programs optimized for the RTX32P. Part (a) shows programs compiled without compression of opcodes with subroutine calls, and without peephole optimization (opcode combination).

Table 6.3. Dynamic instruction execution frequencies by RTX32P instruction type

               FRAC      LIFE      MATH      AVE
OP             57.54%    46.07%    49.66%    51%
CALL           19.01%    26.44%    19.96%    22%
EXIT           10.80%    12.53%    16.25%    13%
OP-CALL         0.00%     0.00%     0.00%     0%
OP-EXIT         0.00%     0.00%     0.00%     0%
CALL-EXIT       0.00%     0.00%     0.00%     0%
OP-CALL-EXIT    0.00%     0.00%     0.00%     0%
COND            5.89%     9.95%     6.56%     7%
LIT             6.76%     5.01%     7.57%     6%
LIT-OP          0.00%     0.00%     0.00%     0%
VARIABLE-OP     0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP  0.00%     0.00%     0.00%     0%
OP-OP           0.00%     0.00%     0.00%     0%
Instructions:  8381513   1262079   940448

Table 6.3(a) Instruction compression off, opcode combination off

               FRAC      LIFE      MATH      AVE

OP             50.92%    42.22%    45.94%    46%
CALL           17.81%    28.31%    21.42%    23%
EXIT           12.48%    13.42%    17.45%    14%
OP-CALL         0.00%     0.00%     0.00%     0%
OP-EXIT         0.00%     0.00%     0.00%     0%
CALL-EXIT       0.00%     0.00%     0.00%     0%
OP-CALL-EXIT    0.00%     0.00%     0.00%     0%
COND            6.82%    10.66%     7.05%     8%
LIT             2.60%     1.94%     2.53%     2%
LIT-OP          5.21%     3.43%     5.59%     5%
VARIABLE-OP     2.67%     0.00%     0.01%     1%
VARIABLE-OP-OP  1.49%     0.00%     0.01%     1%
OP-OP           4.72%     3.68%     1.76%     3%
Instructions:  7250149   1178235   875882

Table 6.3(b) Instruction compression off, opcode combination on

Part (b) of the table shows the effect of combining commonly used opcode sequences (such as SWAP DROP, OVER +, VARIABLE @, and VARIABLE !) into single instructions. The column labeled OP-OP counts instructions containing two combined operations, including OP-OP, OP-OP-CALL, OP-OP-EXIT, and OP-OP-CALL-EXIT forms. Sequences such as literal + and literal AND are special cases counted as LIT-OP, while VARIABLE @ and VARIABLE ! are counted as VARIABLE-OP, and sequences such as VARIABLE @ + and VARIABLE @ - are counted as VARIABLE-OP-OP. Constants and variables each require a complete instruction to hold an opcode and address, and so cannot absorb further operations. Overall, peephole combination of opcodes reduces the number of instructions executed by roughly 10%.

               FRAC      LIFE      MATH      AVE
OP             48.84%    31.26%    40.81%    40%
CALL            8.46%    22.20%    15.53%    15%
EXIT            4.57%     0.00%     4.80%     3%
OP-CALL        13.93%    11.47%     6.68%    11%
OP-EXIT         7.71%    15.96%    12.90%    12%
CALL-EXIT       0.80%     0.00%     2.04%     1%
OP-CALL-EXIT    0.15%     0.00%     0.03%     0%
COND            7.23%    12.69%     7.99%     9%
LIT             8.31%     6.39%     9.22%     8%
LIT-OP          0.00%     0.00%     0.00%     0%
VARIABLE-OP     0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP  0.00%     0.00%     0.00%     0%
OP-OP           0.00%     0.00%     0.00%     0%
Instructions:  6827482   990313    772865

Table 6.3(c) Instruction compression on, opcode combination off

Part (c) gives the effect of instruction compression without opcode combination. This means that, whenever possible, an opcode is packed together with the subroutine call or return that follows it, and a subroutine call immediately followed by a return is converted into an unconditional branch. The result is that 24% of all executed instructions combine an opcode with a subroutine call or return. This makes about 40% of the subroutine calls in the source program "free", and almost all subroutine returns free. Only special instructions, such as literals and return stack manipulation instructions, cannot be combined with subroutine calls.

               FRAC      LIFE      MATH      AVE
OP             39.05%    24.91%    39.19%    34%
CALL            6.75%    24.27%    15.94%    16%
EXIT            6.54%     0.01%    10.78%     6%
OP-CALL        12.71%    12.53%     6.87%    11%
OP-EXIT         6.78%    17.44%     7.40%    11%
CALL-EXIT       0.95%     0.01%     2.10%     1%
OP-CALL-EXIT    0.09%     0.00%     0.03%     0%
COND            7.84%    13.86%     8.21%    10%
LIT             3.00%     2.52%     2.95%     3%
LIT-OP          6.00%     4.45%     6.51%     6%
VARIABLE-OP     3.08%     0.00%     0.01%     1%
VARIABLE-OP-OP  1.72%     0.00%     0.01%     1%
OP-OP           5.44%     4.79%     2.05%     4%
Instructions:  6294109   906469    752257

Table 6.3(d) Instruction compression on, opcode combination on

Part (d) gives the effect of applying opcode combination and instruction compression at the same time. The result is 25% fewer instructions executed than the original program. This performance improvement comes almost free in hardware, because there is natural parallelism between a subroutine call and an opcode. An interesting point is that the Math test program drops from 6.1 million instructions on the 16-bit system to approximately 940,000 instructions on the RTX32P, showing that a 32-bit processor is far better suited to floating point calculation. The Life test program (which is almost entirely 8-bit logic and data handling) takes about the same number of instructions on each system. The FRAC test program grew by a factor of 4, but this is because the 32-bit version of FRAC uses a higher graphics resolution, with 4 times as many points to compute, and therefore takes about 4 times as many instructions.
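As an illustration, the compression step can be sketched as a peephole pass over a token stream. This is a simplified model using the instruction-category names from Table 6.3; the function name and token representation are hypothetical, basic-block boundaries are ignored, and opcode combination (the OP-OP pass) is treated as a separate, earlier step:

```python
def compress(tokens):
    """Peephole instruction compression: pack an opcode with a following
    CALL or EXIT, and turn CALL immediately followed by EXIT into a
    combined CALL-EXIT (executed as an unconditional branch).
    Token names follow the categories of Table 6.3."""
    out = []
    i = 0
    while i < len(tokens):
        t = tokens[i]
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        nxt2 = tokens[i + 2] if i + 2 < len(tokens) else None
        if t == "OP" and nxt == "CALL" and nxt2 == "EXIT":
            out.append("OP-CALL-EXIT")   # opcode + tail call in one word
            i += 3
        elif t == "OP" and nxt in ("CALL", "EXIT"):
            out.append("OP-" + nxt)      # opcode rides along with call/return
            i += 2
        elif t == "CALL" and nxt == "EXIT":
            out.append("CALL-EXIT")      # tail call becomes a branch
            i += 2
        else:
            out.append(t)                # LIT, COND, etc. stay unchanged
            i += 1
    return out
```

For example, `compress(["OP", "CALL", "EXIT", "LIT", "OP", "EXIT"])` yields `["OP-CALL-EXIT", "LIT", "OP-EXIT"]`, reproducing the "free" calls and returns described above.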

6.3.3.2 Memory size cost

The performance gained by combining opcodes with subroutine calls seems worthwhile, especially since it requires no additional hardware in the processor; in fact, it simplifies the hardware by requiring only one instruction format. One question remains to be answered: what is the memory cost?

Fortunately, the static subroutine call frequency of Forth programs is even higher than the dynamic frequency. This provides ample opportunity for opcode/subroutine-call combination. Table 6.4 shows the difference in static program size between programs compiled with no compression and programs compiled with both instruction compression and opcode combination.

Table 6.4. Static instruction frequencies by RTX32P instruction type

               FRAC      LIFE      MATH      AVE
OP             48.40%    51.46%    44.72%    48%
CALL           28.48%    33.01%    35.64%    32%
EXIT            5.12%     6.41%     7.55%     6%
OP-CALL         0.00%     0.00%     0.00%     0%
OP-EXIT         0.00%     0.00%     0.00%     0%
CALL-EXIT       0.00%     0.00%     0.00%     0%
OP-CALL-EXIT    0.00%     0.00%     0.00%     0%
COND            3.52%     4.46%     4.04%     4%
LIT            14.48%     4.66%     8.05%     9%
LIT-OP          0.00%     0.00%     0.00%     0%
VARIABLE-OP     0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP  0.00%     0.00%     0.00%     0%
OP-OP           0.00%     0.00%     0.00%     0%
Instructions:  1250      515       2422

Table 6.4(a) Instruction compression off, opcode combination off

               FRAC      LIFE      MATH      AVE
OP             33.71%    35.78%    37.05%    36%
CALL           17.33%    21.94%    27.03%    22%
EXIT            1.47%     2.87%     2.39%     2%
OP-CALL        11.65%    21.15%    10.54%    14%
OP-EXIT         3.78%     4.70%     1.73%     3%
CALL-EXIT       1.05%     1.04%     4.02%     2%
OP-CALL-EXIT    0.42%     0.00%     1.17%     1%
COND            4.62%     6.00%     4.98%     5%
LIT            16.17%     4.18%     8.61%    10%
LIT-OP          2.83%     2.08%     1.32%     2%
VARIABLE-OP     5.46%     0.26%     1.01%     2%
VARIABLE-OP-OP  1.47%     0.00%     0.15%     1%
OP-OP           2.73%     5.22%     1.98%     3%
Instructions:  952       383       1965

Table 6.4(b) Instruction compression on, opcode combination on

The RTX32P uses a 9-bit opcode, a 21-bit address, and a 2-bit control field. If we wanted to assume an optimally packed instruction format, we could design one that uses 11 bits to specify an opcode (including a separate subroutine-return bit), together with a 23-bit subroutine call/branch format, converting each call followed by a return into a jump. Such a hypothetical machine would have variable-length instructions of 11 or 23 bits. We need not worry about actually building it: it merely establishes a theoretical minimum for comparison.

In this optimized format, the three programs consist of 1952 plain opcodes (11 bits each), 1389 subroutine calls (23 bits each), and 565 combined opcode/address instructions (34 bits each), for a total of 72,640 bits.

Now consider the programs actually compiled for the RTX32P. Each instruction uses a fixed-length 32-bit encoding, with some bit fields left unused. The three programs total 3300 instructions of 32 bits each, or 105,600 bits, a memory overhead of 32,960 bits; in other words, the fixed-length encoding "wastes" 31% of the memory compared with the theoretical minimum.

Of course, actually designing a computer with 11-bit opcodes and 23-bit subroutine calls would be untidy. As a more practical comparison, note that the compressed version contains 766 "blank" opcode fields (9 bits each) and 917 "blank" subroutine call fields (23 bits each), a total of 27,985 wasted bits: only 27% waste, in exchange for executing 25% fewer instructions. So even when measured against a variable-length instruction format, we get a large performance improvement at relatively low cost.
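The arithmetic above can be checked mechanically, using only the figures quoted in the text:

```python
# Figures quoted in the text for the three benchmark programs.
packed_bits = 72640            # theoretical optimally packed encoding
instructions = 3300            # compiled RTX32P instructions
fixed_bits = instructions * 32 # actual fixed 32-bit encoding

overhead = fixed_bits - packed_bits
print(fixed_bits, overhead, round(100 * overhead / fixed_bits))   # 105600 32960 31

# Unused fields in the compressed fixed-length version:
wasted = 766 * 9 + 917 * 23    # blank opcode fields + blank call fields
print(wasted, round(100 * wasted / fixed_bits))                   # 27985 27
```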

There is a small problem with this section: the test programs are relatively small. Compact programs that perform complex operations are exactly where stack computers shine, but large Forth programs are difficult to obtain for analysis. However, the selected programs include representative samples of typical Forth code, and in the author's view the results would hold up if larger programs could be obtained. There is, of course, a way to get larger programs: use a compiler for a conventional high-level language. But such code has completely different characteristics, because programmers solve problems differently in Forth than in C or Fortran. We discuss this issue in Chapter 7.

6.4 Stack Management Strategy

Since a stack computer depends on access to high-speed stack memory on every instruction, the characteristics of the stack memory used are critical. In particular, as processors get faster and faster, the question arises: how much stack memory must be placed on-chip to obtain high performance? The answer is important because it determines the cost and performance of a high-end single-chip stack processor.

A related question is: how should stack memory be managed, especially in multitasking environments?

6.4.1 Estimating stack size: an experiment

The first question is how much on-chip memory is needed for a stack buffer, and it is best answered by simulating different programs with buffers of different sizes. The simulation measures the number of memory cycles caused by stack buffer overflows and underflows. On an overflow, a data element must be copied from the hardware stack buffer into memory; on an underflow, a data element must be copied from memory back into the stack buffer.
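The measurement can be sketched as a simple simulation over a push/pop trace. This is an illustrative model of the one-element-per-overflow policy, not the simulator actually used for Table 6.5:

```python
def spill_cycles(ops, buffer_size):
    """Count memory cycles caused by stack buffer overflow/underflow.

    ops is a sequence of +1 (push) / -1 (pop); buffer_size is the number
    of on-chip stack elements. Each spill or fill of one element costs
    one memory cycle, as in the model described in the text."""
    in_buffer = 0      # elements currently held on-chip
    in_memory = 0      # elements spilled to program memory
    cycles = 0
    for op in ops:
        if op == +1:
            if in_buffer == buffer_size:      # overflow: spill bottom element
                in_buffer -= 1
                in_memory += 1
                cycles += 1
            in_buffer += 1
        else:
            if in_buffer == 0 and in_memory:  # underflow: refill one element
                in_memory -= 1
                in_buffer += 1
                cycles += 1
            in_buffer -= 1
    return cycles
```

For example, a deeply recursive trace of 10 pushes followed by 10 pops costs 12 memory cycles with a 4-element buffer, and none at all with a 16-element buffer, mirroring the trend seen in Table 6.5.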

                 FRAC     LIFE     MATH      FIB      HANOI   QUEENS
# Instructions   2051600  1296143  6133519   880997   235665  140224
Max stack depth  44       6        23        25       52      29
# Stack operands 3670356  1791638  11786764  1483760  446642  257320

# Stack spills for buffer size:
 0   3670356  1791638  11786764  1483760  446642  257320
 2    838960   148448   3919622   370940  155656   41426
 4    202214     4098   1313566    92732   69608    9216
 8     39040        0    238020    13526   32752     512
12     10236        0     28300     1970    8184     196
16      3580        0       800      284    4088      64
20      1532        0       280       38    2040      22
24       636        0         0        2    1016      10
28       220        0         0        0     504       2
32        92        0         0        0     248       0
36        36        0         0        0     120       0
40        10        0         0        0      56       0
44         0        0         0        0      24       0
48         0        0         0        0       8       0
52         0        0         0        0       0       0

Table 6.5. Memory traffic cycles for data stack spilling

Figure 6.2 Data stack spilling

Table 6.5 and Figure 6.2 give the simulation results, monitoring the number of memory cycles spent spilling and refilling the data stack for the programs Life, Hanoi, Frac, Fib, Math, and Queens. The "toy" programs Hanoi and Queens do not represent typical programs; they are deeply recursive, and can be seen as worst cases for stack programs.

The spill algorithm copies a single data element into memory each time a push is attempted on a full stack buffer, and reads a single element back from memory on a pop from an empty buffer. The simulation assumes that the hardware handles spilling automatically, reading or writing one element per memory cycle. The simulation used the RTX32P instruction set, each instruction of which does roughly twice the work of an instruction on a hardwired processor such as the RTX2000, and the measured cost is in memory cycles. The purpose of the simulation is to show the best behavior that can be expected; most real implementations should come within a factor of 3 or 4 of it.

Surprisingly, the FRAC program is almost as bad as Hanoi. This is because FRAC pushes 6 elements onto the stack each time it performs one of its many scaled fixed-point divisions. Clearly, deeply recursive programs are not the only ones that build up many data elements on the stack.

The good news about stack size is that a modest buffer eliminates nearly all spill traffic for all these applications. With a stack buffer of 24 elements, the spills generated even by the Hanoi program amount to only 1% of the instructions executed. In fact, with a buffer of 32 elements, almost none of the programs produce any stack spills at all.

                 FRAC     LIFE     MATH     FIB      HANOI   QUEENS
# Instructions   2051600  1296143  6133519  880997   235665  140224
Max stack depth  14       7        30       22       14      39
# Stack operands 725224   680676   3199170  185472   41056   53722

# Stack spills for buffer size:
 0    725224   680676  3199170   185472   41056   53722
 2    326778   135608  1235678    57312   12310   26070
 4    179938      118   642798    21890    2048   13306
 8     27932        0   273686     3192     128    1158
12       132        0    57262      464       8     572
16         0        0    13442       66       0     314
20         0        0     1062        8       0     154
24         0        0      382        0       0      62
28         0        0       42        0       0      32
32         0        0        0        0       0      16
36         0        0        0        0       0       8
40         0        0        0        0       0       0

Table 6.6. Memory traffic cycles for return stack spilling

Figure 6.3 Return stack spilling

Table 6.6 and Figure 6.3 give the return stack spill and refill results for the same programs. Clearly, the Math program needs a large return stack. This is because the math package is written in a modular style to remain portable across systems, and therefore uses a large number of deeply nested subroutine calls. In addition, the Math program uses the return stack to hold many temporary variables while processing 48-bit data on a 16-bit processor.

6.4.2 Overflow Processing

Now that we have examined how stacks overflow and underflow during program execution, how should the problem be handled? There are four plausible approaches to stack spilling: ensure that overflows never happen; treat them as catastrophic system errors; use a demand-fed stack controller; or use paged stack control logic or a conventional data cache. Each method has its own strengths and weaknesses.

6.4.2.1 A very large stack memory

The simplest way to resolve the stack spilling problem is to assume that overflow never happens. This method sounds foolish, but it has real advantages: system performance is completely predictable (no stack spilling ever slows the system down), and no stack management hardware is required.

This method avoids overflow by providing a very large stack memory. The MISC M17 takes this approach: stack overflow can occur only when program memory itself is exhausted. The NC4016 approach, a dedicated high-speed off-chip stack memory that can hold thousands of data elements, simply ignores the overflow problem. The price of this method is a compromise between the size and speed of the off-chip memory and the speed of the processor.

When only a small on-chip stack is available, this method treats overflow as a catastrophic event that crashes the system. During program development, programmers are simply told that the stack may hold at most X elements, and it is the programmer's responsibility not to exceed that limit. With small programs this is quite simple, and with X of 16 or 32 elements it is also quite practical. The WISC CPU/16 uses this method with a stack size of 256 elements, which keeps the stack hardware simple.

6.4.2.2 Demand-fed stack management

If stack overflows are allowed to happen as a matter of course, the conceptually simplest way to handle them is a demand-fed stack manager, which moves single elements between the stack buffer and memory only as needed.

To implement this policy, the stack buffer is built as a circular buffer with head and tail pointers. A pointer into memory is also needed, recording the top of the spilled portion of the stack. Whenever the buffer overflows, the element residing at the bottom of the buffer is copied into memory, freeing one buffer location. On underflow, one element is copied from memory back into the buffer. This technique moves stack elements only when absolutely necessary, keeping spill traffic to a minimum. A small refinement is to always keep a few elements' worth of both free space and valid data in the buffer; the manager can then use idle memory cycles and reduce the number of pauses caused by spills. Unfortunately, this refinement may be of little value on a real stack computer, because the processor uses the program memory bus for instruction fetch and data access nearly 100% of the time, leaving no spare bandwidth for the stack manager.
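The circular-buffer policy can be sketched as follows. This is an illustrative model, with hypothetical names; real hardware would use pointer registers and counters rather than Python lists:

```python
class DemandFedStack:
    """Demand-fed stack buffer: a circular on-chip buffer backed by memory.

    Elements move between buffer and memory only on overflow/underflow,
    one element at a time, as in the policy described in the text."""
    def __init__(self, buffer_size):
        self.buf = [None] * buffer_size
        self.head = 0          # index where the next push goes
        self.count = 0         # valid elements currently in the buffer
        self.memory = []       # spilled portion of the stack (top at end)
        self.spill_cycles = 0  # memory cycles spent spilling/filling

    def push(self, value):
        if self.count == len(self.buf):               # overflow: spill bottom
            bottom = (self.head - self.count) % len(self.buf)
            self.memory.append(self.buf[bottom])
            self.count -= 1
            self.spill_cycles += 1
        self.buf[self.head] = value
        self.head = (self.head + 1) % len(self.buf)
        self.count += 1

    def pop(self):
        if self.count == 0:                           # underflow: fill from memory
            if not self.memory:
                raise IndexError("stack empty")
            self.spill_cycles += 1
            return self.memory.pop()
        self.head = (self.head - 1) % len(self.buf)
        self.count -= 1
        return self.buf[self.head]
```

Pushing six values through a 4-element buffer spills the two oldest elements to memory, and popping them all back costs two fill cycles, for four spill cycles in total.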

The benefit of demand-fed management is that elements are always available in the stack buffer when needed, so it suits designs that can afford the chip area for a stack buffer and its control logic. As an added benefit, over the execution of an entire program each stack element generates at most two memory cycles of spill traffic: one when it is spilled on overflow, and one when it is restored on a later underflow. The cost of this high performance is relatively complex hardware control circuitry, with three counters per stack needed to implement the policy.

The FRISC3's stack management strategy is very similar to this demand-fed strategy. Stack management has been studied in depth for that system, producing a general algorithm called the cutback-K algorithm. A related strategy was proposed by Hasegawa & Shigei (1985), and Stanley & Wedig (1987) discuss top-of-stack buffer management strategies for RISC computers.

6.4.2.3 Paged stack management

An alternative to demand-fed management is to generate an interrupt on stack overflow or underflow, and manage the spilling in software. This method needs less hardware than demand-fed management, but requires a larger stack buffer to keep the interrupt frequency low.

One implementation of this method uses "limit registers" pointing near the top and bottom of the stack buffer space. When an instruction moves the stack pointer below the lower limit, an amount of data equal to half the buffer size is copied from program memory into the stack. When an instruction moves the stack pointer past the upper limit, half the buffer's worth of elements is copied out to program memory.
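The half-buffer policy can be sketched as a cycle-counting model over a push/pop trace. The function name and trace encoding are illustrative assumptions:

```python
def run_paged_stack(ops, buffer_size):
    """Paged stack management: on an overflow or underflow interrupt,
    move half the buffer at a time between the on-chip stack and
    program memory. Returns memory cycles spent (one per element moved).

    ops is a sequence of +1 (push) / -1 (pop)."""
    half = buffer_size // 2
    depth = 0        # elements currently in the on-chip buffer
    spilled = 0      # elements resident in program memory
    cycles = 0
    for op in ops:
        depth += op
        if depth > buffer_size:            # overflow interrupt: spill half
            depth -= half
            spilled += half
            cycles += half
        elif depth < 0 and spilled > 0:    # underflow interrupt: fill half
            fill = min(half, spilled)
            depth += fill
            spilled -= fill
            cycles += fill
    return cycles
```

On the toy trace of 10 pushes followed by 10 pops, an 8-element buffer costs 8 memory cycles (one 4-element spill and one 4-element fill), while a 16-element buffer costs none.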

The paged policy allows segments of a large stack memory to be allocated to different processes. The stack memory is thereby treated as paged memory rather than as a circular buffer. Consequently, in practice a stack overflow actually copies half the buffer out to memory and then moves the remaining half to the start of the stack buffer.

The cost of the paged method is that it moves twice as many elements as demand-fed management. The buffer must also be twice as large as with the demand-fed policy, to guarantee the same number of pushes and pops between overflow and underflow interrupts, although this extra size matters little in real applications.

An interesting property of this method is that it does not constrain programming style: programs that fit within a 32-element stack buffer never pay for the mechanism, while programs that exceed their buffer size degrade gracefully. A pathological program will still run (although very slowly), and the operating system can issue a warning message identifying the offending program.

Both the RTX2000 and the RTX32P use the paged method for stack management.

6.4.2.4 Conventional data cache

Many conventional processors manage their program stacks with an ordinary data cache, typically mapped into program memory space. This approach requires far more complex hardware than the stack buffers discussed above, yet offers no advantage, because a stack computer does not need random access to its stack elements. It can provide some benefit for variable-length data structures, such as strings and records, when these are kept on the "stack" as in C or Ada program environments.

Other interesting papers about stack management include Blake (1977), Hennessy (1984), Prabhala & Sethi (1977), and Sites (1979).

6.5 Interrupts and Multitasking

The performance of interrupt processing has three components. The first is the time between the processor receiving an interrupt request and the processor starting to execute interrupt-servicing code; this is called interrupt latency.

The second component of interrupt performance is interrupt processing time: the time the processor actually spends saving the interrupted task's machine state and starting to execute the interrupt service routine. The state that must be saved at this point is usually very small, assuming the service routine minimizes overhead by saving only the registers it is about to use. The term "interrupt latency" is sometimes used loosely to mean the sum of these first two components.

The third component of interrupt service performance is what we shall call state-saving overhead: the time spent saving machine registers that are not automatically saved by the interrupt processing logic, but that must be used by the interrupt service routine. State-saving overhead varies widely with the complexity of the service routine; in the extreme case it requires a full context switch between tasks.

Of course, restoring this machine state and exiting the interrupt service routine also affect system performance. We do not discuss these explicitly here, because they are almost equal to the state-saving times (everything saved must be restored), and they are not important for interrupt response time.

6.5.1 Interrupt Response Delay

A CISC computer can have instructions that take a long time to complete, which greatly lengthens interrupt response. Stack computers can have interrupt latencies as fast as RISC computers. This is because most stack computer instructions are only a single cycle long, so even in the worst case only a few clock cycles pass between interrupt recognition and interrupt processing.

However, once an interrupt is being processed, the RISC and stack computers begin to differ. A RISC computer must save the state of its pipeline at the point the interrupt was recognized, and restore that pipeline state on return, to avoid losing the effects of partially executed instructions. A stack computer, on the other hand, has no instruction pipeline, so it only needs to save the address of the next instruction to be executed. This means an interrupt can be treated as a subroutine call invoked by hardware, and since subroutine calls are very fast, interrupt processing time is very short.

6.5.1.1 Instruction Restart

One thing that can hurt stack computer interrupt latency is streamed instructions and microcoded loops.

Streamed instructions repeat an operation, such as writing the top stack element to successive memory locations. Examples include the NC4016 and RTX2000 repeat instructions, the M17 instruction buffer, and microcoded loops written for the CPU/16 and RTX32P. These primitives are very useful: they can be used to build highly efficient string-processing primitives and stack overflow/underflow service routines. The problem is that, in most implementations, these instructions are not interruptible.

One solution is to make these instructions interruptible and restartable with external interrupt logic, at the cost of some extra processor complexity. A potential problem for non-stack processors using this solution is saving intermediate results, but this is no problem at all for a stack processor: the intermediate results are already saved on the stack, which is an ideal mechanism for preserving state during an interrupt.

Another approach for a stack processor is to use software to limit the repetition count of streamed instructions. This means that if a block of 100 characters must be moved in memory, it can be moved as several groups of 8 characters each. This meets interrupt response time requirements without sacrificing much performance; it is a compromise between absolute efficiency (using long streamed instructions) and interrupt response time.
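The software compromise can be sketched as follows. The function `move_block`, the chunk size of 8, and the `interrupt_pending` callable are all illustrative assumptions; a real system would vector to the interrupt service routine in each polling window rather than just count it:

```python
def move_block(src, dst, interrupt_pending, chunk=8):
    """Move a block of characters in short bursts, polling for
    interrupts between bursts, as in the compromise described above.
    Returns the number of polling windows in which an interrupt
    could have been serviced."""
    serviced = 0
    for start in range(0, len(src), chunk):
        dst[start:start + chunk] = src[start:start + chunk]  # one burst
        if interrupt_pending():       # window for interrupt response
            serviced += 1             # (a real system would vector here)
    return serviced

src = list(range(100))
dst = [0] * 100
print(move_block(src, dst, lambda: True))   # 13 windows for a 100-byte move
```

A 100-element move in bursts of 8 yields 13 opportunities to respond to an interrupt, instead of one uninterruptible 100-element stream.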

In microcoded machines the compromise is similar. However, the commercial version of the RTX32P has a very simple microcode strategy that does better. The strategy uses a microcode-visible condition bit that indicates whether an interrupt is pending. This bit can be tested in each iteration of a microcode loop without consuming any execution time. If no interrupt is pending, another loop iteration is performed. If an interrupt is pending, the address of the streamed instruction itself is pushed onto the return stack as the interrupt return address, and the interrupt routine is allowed to execute. As long as the streamed instruction keeps its working parameters on the stack (and simple operations such as character block moves do this naturally), this method costs only a small overhead when an interrupt is processed, and no overhead at all during normal execution.

6.5.2 Lightweight Interrupts

Let us examine the amount of state saving required for interrupts of different weights: fast interrupts, lightweight threads for multitasking, and full context switches.

Fast interrupts are the most common interrupts at run time. All they do is add a few milliseconds to a time-of-day counter, or copy a few bytes from an input port to memory. When a conventional computer processes such an interrupt, it usually needs to save two or three registers to program memory to open up working space in the register file. On a stack computer, no state needs to be saved at all: the interrupt service routine simply pushes its own information onto the top of the stack without disturbing the interrupted program's data. For fast interrupts, the state-saving cost on a stack computer is therefore zero.

Lightweight threads are tasks in a multitasking system whose execution policy is the same as that of the interrupts just described. They obtain the benefits of multitasking without the startup and teardown costs of a full process. A stack processor can implement lightweight threads simply by requiring each task to run as a short sequence of instructions; this may be called non-preemptive or cooperative task management. If each task leaves no parameters on the stack between activations, there is no context-switching overhead between tasks. The multitasking cost of this method is zero, because a task yields control to the task manager only at logical breakpoints in its program, where the stack is empty.
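Cooperative lightweight threads of this kind can be modeled with coroutines. In this minimal sketch, Python generators stand in for tasks that yield control at points where their stack is logically empty; all names are illustrative:

```python
def task(name, n, log):
    """A lightweight thread: does a short burst of work, then yields
    at a point where its stack is logically empty."""
    for i in range(n):
        log.append((name, i))   # one burst of work
        yield                   # give control back to the scheduler

def run(tasks):
    """Round-robin non-preemptive scheduler: a switch only resumes the
    next task; no stack elements are saved or copied."""
    while tasks:
        t = tasks.pop(0)
        try:
            next(t)
            tasks.append(t)     # still live: back of the queue
        except StopIteration:
            pass                # task finished

log = []
run([task("A", 2, log), task("B", 2, log)])
print(log)   # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
```

The two tasks interleave with zero state-saving cost per switch, which is the property claimed above.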

As these two examples show, interrupt handling and lightweight threads are very cheap on stack computers. The only remaining issue is full-process, preemptive multitasking, which requires a complete context switch.

6.5.3 Context Switching

Context-switching overhead is the issue behind the often-heard claim that "stack computers can't do multitasking". The reasoning behind this view is usually based on the idea that a large stack buffer must be saved to program memory on every switch. The notion that stack computers are worse than other computers at multitasking is, however, mistaken.

Context switching is a potential overhead for all tasking systems. On RISC and CISC computers that use caches, the real cost of a switch can be far higher than the manufacturer's figure, because of the hidden performance degradation after the switch caused by cache misses. RISC computers with large register files face exactly the same problem as stack computers. In fact, RISC computers are at a disadvantage: because their registers are accessed at random, every register must be saved (or complex hardware added to detect which registers are in use), whereas a stack computer need only save the active portion of its stack buffer.

6.5.3.1 A context-switching experiment

Table 6.7 shows data obtained from a trace-driven simulator, measuring the number of data stack elements that Forth programs save and restore in a context-switching environment. The simulated programs were Queens, Hanoi, and Quick-Sort, with small values of N used for Queens and Hanoi to keep the simulation time reasonable. The effects of stack spilling and context switching were measured together, because in such an environment they are thoroughly intertwined.

Table 6.7. Memory traffic cycles for spilling and context switching, for various stack buffer sizes and context-switch frequencies

Combined Queens, Qsort, and Hanoi
# Instructions = 36678
Average stack depth = 12.1
Maximum stack depth = 36

Buffer  Timer=100  Timer=500  Timer=1000  Timer=10000
size
  2     17992      16334      16124       15916
  4     12834       9924       9524        9214
  8      8440       3950       3430        2910
 12     10380       3944       3068        2314
 16     11602       2642       1620         632
 20     12886       3122       1846         626
 24     13120       2876       1518         330
 28     14488       3058       1584         242
 32     15032       3072       1556         124
 36     15458       3108       1568          82

Table 6.7(a) Paged buffer management

Buffer  Timer=100  Timer=500  Timer=1000  Timer=10000
size
  2     26424      24992      24798       24626
  4     11628       8912       8548        8282
  8      7504       3378       2762        2314
 12      6986       1930       1286         630
 16      7022       1876       1144         322
 20      7022       1852       1084         180
 24      7022       1880       1066         124
 28      7022       1820       1062          90
 32      7022       1828       1060          80
 36      7022       1822       1048          80

Table 6.7(b) Demand-fed buffer management

Figure 6.4 Overhead of paged stack management

Table 6.7(a) and Figure 6.4 show the results for paged management. The label "xxx clocks/switch" gives the number of clock cycles between context switches. At a switching interval of 100 clocks, the memory cycles spent on stack management at first fall as the buffer size increases, because the spill rate drops as more of the program's stack accesses are captured on-chip. However, once the buffer grows past about 8 elements, memory traffic rises again, because the amount of data copied at each context switch grows with the buffer size.

Note that with 500 clock cycles between context switches, even at this rather high rate (it corresponds to 20,000 switches per second on a 10 MHz processor, which is far higher than practical applications require), the cost of context switching for stack buffers larger than 12 elements is only about 0.08 clock cycles per instruction. Since the programs average 1.688 clock cycles per instruction without context switching, this is only a 4.7% overhead. At 10,000 clocks per switch (one context switch per millisecond), the overhead is below 1%. How can the overhead be so low? One reason is that while executing these three deeply recursive programs, the average stack depth was only 12.1 elements. The stack almost never holds much information, so a context switch needs to save very little. In fact, the stack computer simulated in this experiment saves less state on a context switch than a CISC computer with 12 registers would.

Figure 6.5 Overhead of demand-fed stack management

Table 6.7(b) and Figure 6.5 give the simulation results for the demand-fed stack management algorithm. Here, the rise of the 100-clock-interval curve beyond 12 elements is absent. This is because the stack buffer does not need to be refilled when machine state is restored; it is refilled one element at a time, on demand, as execution proceeds. At reasonable context-switching frequencies (less than 1000 per second), the demand-fed policy does slightly better than paged management, but not overwhelmingly so.

6.5.3.2 Multiple stack spaces for multiple tasks

There is another way stack computers can handle context switching, one with almost no cost at all. Instead of all processes sharing a single huge stack, each high-priority or time-critical process is given its own stack. Each process has its own stack pointer and stack limit registers, creating a private stack space. On a context switch, the process manager simply saves the current stack pointer, since the stack limits are already known. Once the new stack pointer and limit register values are loaded, the new process is ready to run. No time at all is spent copying stack elements.
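A minimal model of this scheme, assuming a 128-element stack memory divided among 4 tasks (the class, its register names, and its sizes are illustrative assumptions):

```python
class MultiStackCPU:
    """Context switching with per-task stack regions: a switch only saves
    and reloads the stack pointer and limit registers; no stack elements
    are ever copied between switches."""
    def __init__(self, stack_mem_size=128, ntasks=4):
        self.mem = [0] * stack_mem_size
        region = stack_mem_size // ntasks          # e.g. 4 stacks of 32
        # (base, limit, saved stack pointer) per task
        self.tasks = [[i * region, (i + 1) * region, i * region]
                      for i in range(ntasks)]
        self.cur = 0
        self.base, self.limit, self.sp = self.tasks[0]

    def push(self, v):
        assert self.sp < self.limit, "stack overflow for this task"
        self.mem[self.sp] = v
        self.sp += 1

    def pop(self):
        self.sp -= 1
        assert self.sp >= self.base, "stack underflow"
        return self.mem[self.sp]

    def switch(self, task_id):
        self.tasks[self.cur][2] = self.sp          # save one register
        self.cur = task_id
        self.base, self.limit, self.sp = self.tasks[task_id]
```

Each task's pushed data survives a switch untouched in its own region, so switching cost is independent of stack depth.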

In typical cases the stack memory required by most programs is quite small, and processes can further be designed to be small and short-running. Thus even a moderately sized stack buffer of 128 elements can be divided into four stacks of 32 elements each. If a multitasking system needs more than four processes, one of the partitions can be designated the low-priority stack buffer and shared among several low-priority tasks by copying stack contents to and from memory.
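The partitioning scheme can be sketched in a few lines. This is an illustrative model only (the buffer size and four-way split come from the text; all names and the descriptor layout are assumptions): each task owns a fixed window of the on-chip buffer, and a context switch re-points at another window without copying a single element.

```python
# Sketch of per-process stack partitions in a 128-element on-chip buffer.
BUFFER_SIZE = 128
NUM_TASKS = 4
WINDOW = BUFFER_SIZE // NUM_TASKS   # 32 elements per task

buffer = [0] * BUFFER_SIZE
# Each task descriptor holds its [base, limit) bounds and its stack pointer.
tasks = [{"base": i * WINDOW, "limit": (i + 1) * WINDOW, "sp": i * WINDOW}
         for i in range(NUM_TASKS)]
current = 0

def push(value):
    t = tasks[current]
    if t["sp"] >= t["limit"]:
        raise OverflowError("stack partition full")
    buffer[t["sp"]] = value
    t["sp"] += 1

def pop():
    t = tasks[current]
    if t["sp"] <= t["base"]:
        raise IndexError("stack partition empty")
    t["sp"] -= 1
    return buffer[t["sp"]]

def context_switch(task_id):
    # No elements are copied: the outgoing task's sp already lives in its
    # descriptor, so switching is just selecting another partition.
    global current
    current = task_id

push(10); push(20)      # task 0 pushes two values
context_switch(1)
push(99)                # task 1 works in its own partition
context_switch(0)
print(pop())            # task 0's data is untouched -> 20
```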

From this discussion we can see that the multitasking efficiency of stack processors cannot be dismissed with a one-sentence generalization. In fact, in many cases a stack processor can handle multitasking and interrupt processing better than conventional computers.

Hayes and Fraeman (1988) independently obtained results for interrupt handling and context switching overhead on the FRISC 3 that are similar to the results of this chapter.

Chapter 7 Stack Computer Software Development

However efficient the hardware may be, what matters most is how well that hardware supports software; without software, any computer system is purposeless. Stack computers offer new trade-offs and options when software issues are considered.

Section 7.1 discusses the importance of fast subroutine calls: how they directly affect program execution speed, and how they affect software quality and programmer productivity.

Section 7.2 explains how to choose an appropriate programming language for a stack computer, and why stack computers can also support conventional languages efficiently.

Section 7.3 discusses the interface between a stack computer and its software. A uniform software interface of this kind is unlikely to be achieved on register-based computers, so it is a significant advantage.

7.1 The Advantage of Fast Subroutine Calls

"Programmers have learned to avoid using process calls and parameters delivery for program efficiency. However, process call is an important tool for designing a good program, the stack architecture has a very efficient access process (Schulthess & Mumprecht 1977," p. 25) Complex emotions called expensive processes, due to the ability to consider efficiencies, this is more embodied in the software style of CISC and RISC structures (Atkinson & Mccreight 1987, Ditzel) & Mclellan 1982, Parnas 1972, Sequin & Patterson 1982, Wilkes 1982). In fact, Lampson (1982) is far more, he even said that the process should replace unconditional transfer, which enables the program to run faster.

7.1.1 The Importance of Small Procedures

Writing programs as a large number of small procedures reduces the complexity of each piece of code that must be written, tested, debugged, and understood by the programmer. Lower software complexity means lower development and maintenance costs, and more reliable software. But do programmers actually write their programs this way?

Most applications are written in general-purpose languages such as FORTRAN, COBOL, PL/1, Pascal, and C. Early high-level languages like FORTRAN were direct extensions of the philosophy of the machines they ran on: register-based, sequential von Neumann machines. Consequently these languages, and the habits of their users, have emphasized long sequences of assignment statements with very few conditional branches or procedure calls.

In recent years, however, the way software complexity is managed has changed. Currently accepted best practice in software design calls for modular, structured programming. At the large scale, modules let work be parceled out easily among the members of a design team. At the small scale, complexity is kept manageable by limiting the amount of information a programmer must deal with at any one time.

Newer languages such as Modula-2 and Ada place particular emphasis on modular design, but the widespread use of modules has not prompted corresponding innovation in hardware: structured languages still run on machines whose registers point into main memory. Apart from a stack pointer and a few complex instructions (which compilers seldom use), CISC hardware has for years given little real support to subroutine calls. As a result, the optimized code emitted by compilers for modern languages looks remarkably like the output generated for the older, unstructured languages, and therein lies the problem: conventional computers are statically optimized for long sequential instruction streams. Traces of many programs show procedure call instructions forming only a small proportion of all instructions executed, which can of course be attributed to the fact that programmers avoid them. Modern programs, by contrast, should emphasize non-sequential control flow and many small procedures, and it is exactly that style which today's hardware and software environments make expensive.

This does not mean that structured languages are not better to write and maintain in. It means that the perceived loss of efficiency, and hardware that rewards sequentially written programs, have prevented modular programming from achieving all it could. Although current philosophy says programs should be decomposed into very small procedures, most procedures are still fewer, larger, and more complicated than they ought to be.

7.1.2 The Proper Size for a Procedure

How big should a typical procedure be? Miller gives the well-known figure of seven plus or minus two for the number of items a person can keep in mind at once (Miller 1967). Humans handle complex problems by grouping similar objects into a smaller number of more abstract objects. In a computer program, this suggests that each procedure should contain roughly seven basic operations, such as assignments or calls to other procedures, so that it can be grasped easily. If a procedure contains many more than seven distinct operations, it should be split into several procedures to reduce the complexity of each part of the program. Elsewhere in the same work, Miller notes that people can keep track of only about three levels of nesting within a single context, which strongly suggests that deeply nested loops and conditionals should be arranged as procedure calls, not as serpentine in-line control structures within one procedure.

7.1.3 Why Don't Programmers Use Small Procedures?

The only question remaining is: why don't more programmers write programs this way?

One reason programmers avoid small, deeply nested procedures is the cost of executing procedure calls. If calls are used freely, setting up subroutine parameters and executing the call instructions themselves can consume much of a program's execution time. If procedures are nested, even the best optimizing compilers are of limited help, and the optimizations that do exist are conditional; the result is that efficient programs end up with relatively shallow procedure nesting.

Another deterrent to procedure calls is that they make programming more work. The overhead of declaring and defining procedures often makes a short procedure seem like too much trouble to write. When this nuisance is combined with project management and documentation practices, the creation of small procedures on large projects is discouraged further (the rule goes: every procedure must have its own management control document). So it should be no surprise that the average procedure ends up one or two pages long rather than one or two lines.

There is a deeper reason why procedures are hard to create in most programming languages, and why they occur less frequently than readers of structured-programming textbooks would expect: traditional programming languages, and the people who use them, are steeped in batch processing. Batch processing offers no convenience to anyone trying to work with small procedures. Truly interactive processing (by which we do not mean the batch-oriented edit-compile-link-execute-crash-debug cycle) is available in only a few environments, and is hardly taught in university computing curricula.

The combined result of these factors is that today's programming languages provide only limited support for efficient modular programming. Today's hardware and programming environments place unnecessary restrictions on modularity, and thereby add unnecessarily to the cost of solving problems with computers.

7.1.4 Architectural Support for Procedures

The problems caused by poor procedure call performance have drawn the attention of computer architects. RISC designers have taken two different approaches. The Stanford MIPS team relies on compiler technology, expanding procedures into in-line code where needed; the MIPS compiler also uses clever register allocation to avoid saving and restoring registers across procedure calls. The statistics motivating this strategy come from traditional software design methods that use large, deeply nested procedures. While the MIPS approach works well on existing software, in a better software environment it may fare no better than what we have already seen on CISCs.

The second approach was taken by the Berkeley RISC I group, which uses register windows to form a stack of register frames. A pointer into the register file can be moved to push or pop a group of registers at once, much as a stack computer does. The implementation details matter; indeed, they determine the success or failure of the product. The question becomes one of trade-offs: a single stack or multiple stacks, fixed-size or variable-size register frames, overflow management, and the overall complexity of the machine.
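The window mechanism can be sketched abstractly. This is a toy model, not the actual RISC I design (window counts, sizes, and the spill policy are assumptions): each call pushes a fresh window, and when the on-chip file overflows, the oldest window is spilled to memory and refilled on return.

```python
# Illustrative sketch of fixed-size register windows with overflow spilling.
NUM_WINDOWS = 4      # on-chip capacity, in windows (assumed for illustration)
WINDOW_SIZE = 8      # registers per window (assumed)

on_chip = []         # list of windows, innermost (current) last
spilled = []         # windows pushed out to main memory

def call():
    if len(on_chip) == NUM_WINDOWS:          # overflow: spill oldest window
        spilled.append(on_chip.pop(0))
    on_chip.append([0] * WINDOW_SIZE)        # fresh frame for the callee

def ret():
    on_chip.pop()                            # discard the callee's frame
    if spilled and len(on_chip) < NUM_WINDOWS:
        on_chip.insert(0, spilled.pop())     # underflow: refill from memory

for _ in range(6):   # six nested calls overflow a four-window file twice
    call()
print(len(on_chip), len(spilled))   # 4 windows on chip, 2 in memory
for _ in range(6):
    ret()
print(len(on_chip), len(spilled))   # everything unwound
```

The design questions listed above (fixed vs. variable frames, spill policy) all live in `call` and `ret`; a stack computer's buffer management faces exactly the same choices, one element at a time instead of one window at a time.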

7.2 Language Selection

Choosing a programming language for a particular problem is no simple matter. Sometimes the choice is dictated from outside: contracts for the U.S. Department of Defense, for example, must use Ada. In other cases the choice is limited to languages for which a compiler already exists. When designing a new system, of course, many choices are possible. Language selection should not be treated as an isolated issue. The language used should reflect the whole job of developing the system, including the system's operating environment, the match between the language and the problem, development time and cost, maintainability of the product, the ability of the underlying processor to run a variety of languages, and the experience of the project's programmers. Note that programmer experience is not at the head of the list; a language chosen merely for its familiarity may improve productivity far less than the other factors listed.

7.2.1 Forth: Strengths and Weaknesses

Forth is the language of choice for programming stack computers. This is because Forth is built upon a set of primitives that run on a virtual stack machine. All the stack computers in this book support efficient execution of Forth programs, and all of them have Forth compilers that generate efficient machine code. The greatest advantage of using Forth is that it extracts the highest processing rate the machine can deliver.

One characteristic of Forth is its very high frequency of subroutine calls, which encourages unprecedented modularity. Hand in hand with this modularity goes Forth's interactive development environment, in which programs are designed top-down using stub modules where needed, then built and tested bottom-up, with each short procedure tested as it is written. On a large project, this cycle of top-down design and bottom-up coding and testing is repeated several times.

Because Forth is a stack-based, interactive language, no test driver or "scaffolding" code needs to be written. Instead, values for the procedure under test are pushed onto the stack from the keyboard, the procedure is invoked by typing its name, and the results are examined on top of the stack. Experienced Forth programmers claim that this interactive development of modular programs raises programmer productivity, improves software quality, and reduces maintenance costs. Part of the benefit comes from the fact that Forth programs are usually much smaller than programs with the same function written in other languages, so less code needs to be written and debugged. A Forth application of 32K bytes (excluding the symbol table) is considered enormous, and requires a great deal of source code to generate.

One strength of the Forth language is that it spans every level of program development. Some languages, such as assembly language, deal only with the hardware. Others, such as FORTRAN, deal with abstractions and seldom touch the underlying machine. Forth programs can cover the entire range: at the lowest level, Forth can manipulate the system's hardware interfaces directly to implement real-time I/O handling and interrupt service; at higher levels, the same Forth program can manage a complex knowledge base.

An interesting aspect of Forth (one not well understood outside the Forth community) is that it is an extensible language. As procedures are incorporated into the language, the vocabulary available to the programmer keeps growing, and in this respect Forth resembles Lisp. There is no difference between the core words of the language and the extensions added by the programmer, which gives the language a flexibility reaching far beyond what the phrase "extension capability" usually suggests.

Forth's extensibility is decidedly a mixed blessing. Forth is said to be a programmer amplifier: good programmers become very good when programming in Forth; excellent programmers can achieve remarkable results; mediocre programmers produce code that works; and poor programmers go back to other languages. Forth also has a notoriously steep learning curve, because it differs greatly from other programming languages: many old habits must be overcome, and new ways of approaching problems must be acquired through practice. Once these new skills are mastered, they turn out to be a mix of generally applicable experience and Forth-specific technique, and they can actually improve a programmer's effectiveness in other languages as well. Another problem with some Forth systems is that they do not provide a rich enough set of programming tools, and older Forth systems seldom cooperate with the host operating system. These shortcomings stem from Forth's history on small machines with very limited hardware resources. In real-time control applications these limitations cause no difficulty, but other applications demand better support tools. Fortunately, Forth systems now come with much better development environments and library support.

The net result of all these influences is that Forth is best suited to programs of small to medium size, within the capability of one or two programmers with compatible styles. On large projects, clashes of style and ability get in the way of high-quality software. Even within these limits, though, Forth programmers typically deliver very good results in remarkably short times, often solving problems that other languages could not have solved at all, at least within the allotted development time.

7.2.2 C and Other Conventional Languages

Of course, there will always be applications best served by conventional languages. Probably the most common reason for needing a conventional language is that existing code must be ported to a better processor.

To explain the trade-offs involved, let us consider the problem of compiling a program written in C for a stack computer, using a C compiler targeted at that machine. We will skip the problems of translating C source into intermediate code, since that part is machine-independent. The interesting part of porting a C compiler is the so-called "back end", the part of the compiler that generates target-machine code from the simplified intermediate code produced by the front end.

As it happens, generating stack computer code for expressions is relatively straightforward. The evaluation of arithmetic expressions in stack-based (postfix/RPN) form has been studied in depth (Bruno & Lassagne 1975, Couch & Hamm 1977, Randell & Russell 1964).
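Why the back end is straightforward can be shown in a few lines: a post-order walk of the expression tree emits stack code directly, with no register allocation at all. A minimal sketch (the tuple-based AST and instruction names are illustrative, not from the book):

```python
# Compile an expression tree to postfix stack code by post-order traversal,
# then run the code on a simulated data stack.
import operator

def gen(node, out):
    """Post-order walk: children first, then the operator."""
    if isinstance(node, tuple):
        op, left, right = node
        gen(left, out)
        gen(right, out)
        out.append(op)
    else:
        out.append(("push", node))   # literal or variable fetch
    return out

ast = ("*", ("+", "a", "b"), ("-", "c", 4))   # (a + b) * (c - 4)
code = gen(ast, [])
# code is now: push a, push b, +, push c, push 4, -, *

def run(code, env):
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    stack = []
    for insn in code:
        if isinstance(insn, tuple):             # push literal or variable
            v = insn[1]
            stack.append(env.get(v, v) if isinstance(v, str) else v)
        else:                                   # binary operator
            b, a = stack.pop(), stack.pop()
            stack.append(ops[insn](a, b))
    return stack.pop()

print(run(code, {"a": 2, "b": 3, "c": 10}))   # (2+3)*(10-4) = 30
```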

The problems in generating stack computer code from C arise from several assumptions C makes about its operating environment. The most significant is that there must be a single stack in program memory containing both data and subroutine return information; unless one is willing to break many C programs, this assumption cannot be changed. As an example, consider what happens when a pointer to a local variable is taken: the local variable must reside in program memory, or it cannot be referenced correctly through the pointer.

Worse, C programs typically push large amounts of data onto the C stack, including strings and data structures, and C code may then access anything within the current stack frame in arbitrary order (the variables on the stack belong to the current procedure). These requirements make it impossible to keep the C stack on the machine's hardware data stack.

How, then, can a stack computer run C programs efficiently? The answer is that the stack computer should efficiently support a "frame pointer plus offset" addressing mode into program memory. The RTX 2000 can do this with its user pointer; the FRISC 3 can use one of its user-defined registers with its offset load/store instructions; the commercial successor to the RTX 32P will have a frame register with a dedicated adder for completing the memory address. In all these cases, the time needed to access a local variable is the same as for any other memory access: one of two cycles for the instruction, the other for the data itself. For many computers this is about as good as can be done without resorting to expensive techniques such as separate data and instruction caches. One concept of C and other high-level languages that does not map onto stack computers is the "register variable": a stack computer has no register file, which would seem to imply that a stack computer must lose this optimization opportunity. But this is only partly true. What a stack computer cannot do is hold large numbers of temporary variables on its stacks; a small number of frequently accessed values, however, can be kept on the stacks for quick reference. For example, a loop counter can be kept on the return stack, and the addresses of two strings being compared can be kept on the data stack. Used this way, much of the efficiency of the hardware remains available to C programs.
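The frame-pointer scheme can be modeled abstractly. A minimal sketch, with an assumed memory layout and made-up offsets: because locals live in ordinary program memory at `fp + offset`, taking a C pointer to one is just taking an address, exactly as the text requires.

```python
# "Frame pointer plus offset" addressing for C locals, simulated.
memory = [0] * 1024     # program memory holding the single C stack
fp = 512                # frame pointer of the current procedure (assumed)

# The compiler assigns each local a fixed offset within the frame:
X_OFF, Y_OFF = 0, 1     # offsets of locals x and y (illustrative)

def load(addr):
    return memory[addr]

def store(addr, value):
    memory[addr] = value

store(fp + X_OFF, 41)                 # x = 41;
p = fp + X_OFF                        # int *p = &x;  -- just an address
store(p, load(p) + 1)                 # *p += 1;      -- works, x is in memory
store(fp + Y_OFF, load(fp + X_OFF))   # y = x;
print(load(fp + Y_OFF))               # 42
```

Each `load`/`store` here corresponds to one frame-relative memory access on the real hardware: one cycle for the instruction, one for the data.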

One further idea allows many C programs to run on a stack computer very nearly as fast as Forth programs. The idea is to treat Forth as the assembly language of the processor, an approach vigorously promoted by several stack computer vendors. The method is to port an existing C program to the stack computer, then profile it to see where the time goes. The profile will point to a few critical loops in the program. These loops can be rewritten in Forth for speed, or even implemented as application-specific microcoded instructions on the RTX 32P. With this technique, only a small effort is needed to make a C program perform almost as well as one written entirely in Forth.

When these good properties are combined with the low system complexity and high processing speed of stack computers, C becomes a viable language choice for programming them.

7.2.3 Rule-Based Systems and Functional Programming Languages

Rule-based programming languages such as Prolog, Lisp, and OPS-5 appear remarkably well suited to stack computers. A particularly exciting possibility is combining real-time control applications with rule-based knowledge bases. Early research in this field is encouraging, and much of the work has been done using Forth as the tool. The areas involved include Lisp implementations (Hand 1987, Carr & Kessler 1987), an OPS-5 implementation (Dress 1986), Prolog implementations (Odette 1987), neural network simulation, and real-time expert system development environments (Matheus 1986, Park 1986). Many of these Forth efforts were later ported to stack computer hardware with excellent results. For example, the rule-based EXPERT-5 system described by Park (1986) ran 15 times faster on a WISC CPU/16 than on a standard IBM PC. A similar rule-based system (actually closer to Park's EXPERT-4, which is somewhat slower than EXPERT-5) ran 740 times faster on the RTX 32P than on a 4.77 MHz 8088 PC. A speedup approaching three orders of magnitude astonishes some people, but it simply reflects how well suited the stack computer is to the task: traversing trees, and solving problems that call for decision trees.

This acceleration of rule-based systems actually rests on a broadly applicable principle: a stack computer can treat a data structure as executable program code. Consider a tree data structure with pointers at its internal nodes and actions to be performed at its leaves. Each internal node holds pointers to the addresses of its subtrees, which on many stack machines are equivalent to subroutine calls. The leaves of the tree can be executable instructions, or procedure calls that carry out some processing task. A conventional processor must run an interpreter to traverse such a tree in search of a leaf. Because a stack computer performs subroutine calls so quickly, executing the tree directly is extremely efficient, and this technique of directly executing a tree data structure is the reason for the RTX 32P speeds discussed earlier. Stack computers are also well matched to expert systems programmed in Lisp, because Lisp and Forth are alike in many ways: both treat programs as functions operating on other functions and lists, both are extensible languages, and both use reverse Polish expression of arithmetic. The principal difference is that Lisp uses dynamic storage allocation for its lists while Forth uses static storage allocation. Since there is no reason to believe that a stack computer manages dynamic memory any worse than other computers, Lisp should run efficiently on stack computers.
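The "data structure as executable code" idea can be sketched by building a decision tree out of callables: interior nodes are thin subroutine calls that select a subtree, and leaves are the action routines. No interpreter loop exists; traversal *is* execution. The tree contents and names below are made up for illustration.

```python
# A decision tree whose nodes execute directly, with no interpreter.
def leaf(result):
    """A leaf is an executable action (here it just returns a result)."""
    return lambda facts: result

def node(key, if_true, if_false):
    """An interior node: one test, then a 'subroutine call' into a subtree."""
    return lambda facts: (if_true if facts[key] else if_false)(facts)

# A toy fault-diagnosis tree for an embedded controller (illustrative):
diagnose = node("power_ok",
                node("sensor_ok",
                     leaf("system nominal"),
                     leaf("replace sensor")),
                leaf("check power supply"))

print(diagnose({"power_ok": True, "sensor_ok": False}))  # replace sensor
```

On a stack machine each `node` call above would be a two-cycle subroutine call, which is why direct execution of the tree beats interpreting it.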

The conclusion that stack computers are well suited to Lisp applies equally to Prolog. In the course of a Prolog implementation for the RTX 32P, the author discovered a new way to map Prolog efficiently onto a stack computer. A Prolog data element can be actual data or a pointer to another element. One possible encoding uses the high bits of a 32-bit word as a data type tag, with the low 23 bits available as a pointer to another node, a pointer to a 32-bit constant, or a short literal value. With this data format, data elements can actually be executed as instructions. The RTX 32P can be configured so that instruction sequences of arbitrary length stream through at the rate of one per memory cycle, simply by executing the data structure as a program; nil pointers can be detected by defining the nil pointer value as a call to an error-trapping subroutine. This kind of data representation can be executed with an efficiency unattainable on other types of computers.
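The tagged-cell encoding can be sketched with bit operations. The 23-bit pointer field is from the text; the specific tag values and the trap address are assumptions for illustration only.

```python
# Tagged 32-bit Prolog cells: high bits hold the type tag, low 23 bits
# hold a pointer or short literal.
PTR_BITS = 23
PTR_MASK = (1 << PTR_BITS) - 1          # low 23 bits

TAG_REF, TAG_CONST, TAG_NIL = 1, 2, 3   # illustrative tag assignments

def make_cell(tag, value):
    assert 0 <= value <= PTR_MASK
    return (tag << PTR_BITS) | value

def tag_of(cell):
    return cell >> PTR_BITS

def value_of(cell):
    return cell & PTR_MASK

cell = make_cell(TAG_REF, 0x1234)
assert (tag_of(cell), value_of(cell)) == (TAG_REF, 0x1234)

# Defining NIL as a call to an error-trap subroutine means that merely
# executing a nil cell invokes the trap, as the text describes.
ERROR_TRAP_ADDR = 0x7FFFFF              # illustrative trap address
NIL = make_cell(TAG_NIL, ERROR_TRAP_ADDR)
print(hex(NIL))
```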

Functional programming languages offer a new way of approaching problems, one quite different from the model underlying conventional computers (Backus 1978). One particular way of executing functional programs is graph reduction. Since direct execution of the program graph uses the same technique as the rule-based systems discussed above, stack computers should be well suited to executing functional programming languages. Belinfante (1987) gives a Forth-based implementation of graph reduction. Koopman & Lee (1989) describe a serial, interpretive graph reduction engine.
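The essence of graph reduction can be sketched in miniature: an expression graph with shared subgraphs, where each node is reduced at most once and then overwritten with its value. This sketch is only a toy (real graph reducers work on combinator graphs, not arithmetic DAGs), but the update-in-place and sharing are the defining ingredients.

```python
# Minimal graph reduction: shared nodes are reduced once, then overwritten.
class Node:
    def __init__(self, op=None, args=(), value=None):
        self.op, self.args, self.value = op, args, value
        self.reduced = value is not None     # literals start fully reduced

def reduce(node, counter):
    if not node.reduced:
        counter[0] += 1                      # count actual reduction steps
        args = [reduce(a, counter) for a in node.args]
        node.value = node.op(*args)          # overwrite the node in place
        node.reduced = True
    return node.value

shared = Node(op=lambda a, b: a + b,
              args=(Node(value=2), Node(value=3)))        # (2 + 3), shared
root = Node(op=lambda a, b: a * b, args=(shared, shared)) # shared * shared

n = [0]
print(reduce(root, n), n[0])   # 25, in only 2 reductions thanks to sharing
```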

From a theoretical viewpoint, efficient graph reduction machines such as the G-machine and NORMA fall within the SL0 category discussed in Chapter 2. ML0 machines are a superset of the SL0 capabilities, so they too should be efficient at graph reduction. Research in this area using the RTX 32P shows that a very simple stack computer can compete with very complex graph reduction computers such as NORMA.

A side effect of using functional programming languages is the high degree of parallelism available during execution. This raises an attractive prospect: massively parallel computation using large numbers of stack processors, programmed in a functional programming language.

7.3 A Uniform Software Interface

A key concept in stack computers is that they present a single, consistent interface between the high-level language and the machine language: both procedure calls and opcodes use the stack as the means of passing parameters. This consistent interface has several consequences for software development.

Source code need not distinguish in any way between instructions the machine supports directly and operations implemented as procedures (in Forth, as colon definitions). One implication of this capability is that low-level stack operations resembling Forth primitives can be used as the target code for all languages. With an assembler defined on top of this target language, users need never care how a given function is implemented. It also means that different implementations of an architecture can be highly compatible at the stack-based source level, with low-cost implementations omitting some instructions from hardware. And if the same interface is used for conventional languages as for Forth, there is no difficulty in mixing C code, Forth code, and code in other languages. On a microcoded machine such as the RTX 32P, this interface can be extended further: application-specific microcoded instructions can replace critical instruction sequences in an application, transparently to the code sequences generated by the normal compiler. In fact, on a microcoded stack computer the usual development method is to write the whole application in high-level code, then profile it to find the critical loops, and finally rewrite those high-level language subroutines in microcode. This is invisible to the rest of the software, except that the program runs faster; for a number of applications the speedup amounts to a factor of about 2.
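The transparency argued for above can be sketched with a tiny word dictionary, a minimal model assuming nothing beyond the text (the names and dispatch mechanism are illustrative): callers invoke a word by name, so a defined procedure can later be swapped for a "microcoded" primitive without touching any caller.

```python
# Uniform calling interface: primitives and defined procedures are invoked
# identically, so one can transparently replace the other.
stack = []
words = {}

def execute(name):
    words[name]()                  # same dispatch for primitive and defined

words["dup"] = lambda: stack.append(stack[-1])
words["*"]   = lambda: stack.append(stack.pop() * stack.pop())

# First implementation: a defined procedure built from other words.
words["square"] = lambda: (execute("dup"), execute("*"))

def caller(x):                     # application code calls "square" by name
    stack.append(x)
    execute("square")
    return stack.pop()

before = caller(9)

# Later, "square" is replaced by a single primitive (the "microcoded"
# version); the caller is untouched and behaves identically, only faster.
words["square"] = lambda: stack.append(stack.pop() ** 2)

after = caller(9)
print(before, after)               # 81 81
```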

Chapter 8 Stack Computer Application

Stack computers, like most computers, are suited to a wide range of applications. Any system that requires high speed together with low system complexity is a candidate for a stack processor.

Section 8.1 discusses the application area whose needs the stack processor fits best: real-time embedded control. Real-time control applications demand small size, light weight, low cost, and high reliability.

Section 8.2 discusses the different capabilities and trade-offs of 16-bit and 32-bit hardware. Choosing the right processor width is critical to a successful design.

Section 8.3 discusses system implementation considerations. The choice between hardwired and microcoded systems involves a series of trade-offs among complexity, speed, and flexibility; the level of integration also affects system performance.

Section 8.4 presents four broad application areas suitable for stack computers, with a list describing possible applications in detail.

8.1 Real-time embedded control

Real-time embedded control processors are the class of computers built into complex devices such as cars, aircraft, computer peripherals, consumer audio electronics, and military vehicles and weapons. The devices they control are not themselves used as computers, even though a computer is inside.

8.1.1 Requirements for real-time control

In most cases the computer embedded in a system is invisible to the user, as in an automotive anti-lock braking system. Typically the processor replaces, at low cost and with more functionality, some component of the system that was expensive or bulky. In some systems the computer's presence is obvious, as in an aircraft autopilot. In every case, however, the computer is only one component of a larger system.

Many embedded systems place harsh constraints on the processor, including size, weight, cost, power, reliability, and operating environment. This is because the processor is only one component of a larger system, and the larger system has operating environment and manufacturing constraints of its own.

At the same time, the processor must provide the best possible response to real-time events. Real-time events are typically requests from outside the system, arriving asynchronously, that demand a response within a few microseconds to a few milliseconds. For example, some high-performance jet aircraft are inherently unstable and depend on their computers to keep them flying smoothly. An airborne computer must be very light and very small, and cannot make excessive demands for power and cooling; at the same time, it cannot fall behind the aircraft that depends on it. At supersonic speed an aircraft travels on the order of 1000 feet per second; at that speed, a few milliseconds can make the difference between life and death for the aircraft.

8.1.2 How does the stack computer meet these needs?

The stack computer manufacturers in Chapters 4 and 5 all describe real-time control as one of the intended applications of their technology. What makes stack computers well matched to this application area?

Size and weight

We have seen that, as processors go, stack computers are very simple. But the size and weight of the whole system is determined not just by the processor's own gate count, but by the complexity of the entire system. A processor with a large number of pins occupies precious printed circuit board area; if it needs a cache controller and many memory devices, still more board area is consumed; if it requires virtual memory management, matters are worse yet. The key to keeping size and weight down is keeping the parts count down, and stack computers do well here because of their low hardware complexity and small program memory requirements. And because stack computers are simpler than other computers, they are also more reliable.

Power consumption and cooling

Processor complexity affects the power consumption of the system. The processor's own power consumption is related to its transistor count, and especially to its pin count, but a processor whose performance depends on high-speed memory can be a real "power hog": a large bank of fast, high-powered memory devices may consume more power than the processor itself.

Stack computers tend toward low power consumption. The manufacturing process has a huge influence on power: devices built in newer CMOS processes consume far less power than bipolar or NMOS designs. Power consumption directly affects cooling requirements, of course, since all the power a computer uses is ultimately dissipated as heat. Keeping CMOS components cool reduces component failure rates and improves system reliability.

Operating environment

Embedded processor applications can make extremely harsh demands on the operating environment, especially in automotive and military equipment. The processing system must withstand vibration, shock, temperature extremes, and perhaps radiation. In remote installations the system must survive without an on-site technician. The usual rule for avoiding environmentally induced problems is to minimize the number of components and the number of pins; stack computers do well here because of their low system complexity and high levels of integration.

Cost

In low-end and mid-range systems, the cost of the processor itself can matter a great deal. Since chip cost is related to the number of transistors on the die, the low complexity of stack computers gives them a natural cost advantage.

In high-performance systems, the processor's cost is submerged by that of multi-layer printed circuit boards, support chips, and high-speed memory chips. Here again, the low system complexity of a stack computer offers an advantage.

Computer performance

In a real-time embedded control environment, computing performance is not simply instructions executed per second. Raw computational speed matters, but other factors can make or break a system, including interrupt response characteristics and context-switching overhead. Another requirement is good subroutine call performance, since subroutine calls are an effective way to reduce program memory size; even when the cost of fast program memory chips is not a design driver, limited enclosure space and printed circuit board area may force the program to fit in a small program memory. The stack computer characteristics discussed earlier show that they are ideally suited to this area.

8.2 16-bit or 32-bit hardware

When selecting a stack processor for a particular application, a basic decision is the size of the processor's data element: 16 bits or 32 bits. This requires weighing cost, size, and performance.

8.2.1 16-bit hardware is often best

A 16-bit stack processor usually costs less than a 32-bit processor. Its internal data paths are narrower, so it uses fewer transistors and is cheaper to manufacture. Its connection to external memory is only 16 bits wide, so it needs only half the memory bus of a 32-bit processor. System cost is lower as well, because the simplest 16-bit configuration needs only half as many memory chips as a 32-bit processor.

A 16-bit chip also leaves a reasonable amount of silicon area for special functions such as hardware multipliers, on-chip program memory, and peripheral interfaces. The current trend is for 16-bit stack processors such as the RTX2000 to include I/O peripherals and program memory on-chip for embedded systems.

When facing a new application, the 16-bit processor should be considered first; a 32-bit processor should be chosen only when it offers a clear and obvious advantage.

8.2.2 Sometimes 32-bit systems are needed

A 16-bit processor can serve most traditional real-time control applications, providing high-speed processing in a small system at minimal cost. Of course, 16-bit processors have fit traditional applications well for so long partly because 32-bit capability has not been widely available; as 32-bit processors become more common, new application areas suited to them will be found.

A 32-bit processor should be used instead of a 16-bit processor only in the following cases: 32-bit integer calculations, access to large amounts of memory, or floating point calculations.

Applications dominated by 32-bit integer calculations, mainly graphics or programs that manipulate large data structures, are clearly suited to a 32-bit processor. A 16-bit processor can simulate 32-bit calculations using double precision arithmetic, but a 32-bit processor is far more efficient.
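To see why double-precision emulation costs a 16-bit processor extra work, the sketch below (illustrative, not from the book) adds two 32-bit numbers held as 16-bit halves, propagating the carry by hand; a 32-bit processor does the same job in one native add:

```python
MASK16 = 0xFFFF

def add32_on_16bit(a_hi, a_lo, b_hi, b_lo):
    """Add two 32-bit values held as (high, low) 16-bit halves, the way
    a 16-bit processor must: low halves first, then the high halves
    plus the carry out of the low addition."""
    lo = a_lo + b_lo
    carry = lo >> 16                       # carry out of the low half
    hi = (a_hi + b_hi + carry) & MASK16
    return hi, lo & MASK16

# 0x0001FFFF + 0x00000001 = 0x00020000
print(add32_on_16bit(0x0001, 0xFFFF, 0x0000, 0x0001))  # (2, 0)
```

Two adds, a shift, and two masks replace the single instruction a 32-bit machine would use, and every 32-bit value occupies two stack elements.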

A 16-bit processor can use segment registers to access more than 64K elements of memory, but this technique becomes clumsy if those elements are accessed frequently. If a program must continually reload a segment register to access a data structure (especially a single data structure larger than 64K elements), considerable time is wasted recomputing segment values. Worse, when computing the position of a record in such a structure, the address arithmetic itself must be done in double precision, since addresses exceed 16 bits. A 32-bit processor provides a linear 32-bit address space and can complete address calculations on its 32-bit data path.
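As an illustration of the clumsiness just described, the following sketch models 8086-style segment:offset address formation (an assumption for the example; the text does not name a specific processor). Stepping past a 64K offset boundary forces the segment value, not just the offset, to be recomputed:

```python
def segmented_to_linear(segment, offset):
    """8086-style address formation: segments are 16-byte paragraphs."""
    return (segment << 4) + offset

# The last byte reachable without touching the segment register:
addr = segmented_to_linear(0x1000, 0xFFFF)        # 0x1FFFF
nxt = addr + 1                                    # one element further
# One way to renormalize: the segment register must be reloaded too.
segment, offset = nxt >> 4, nxt & 0xF
print(hex(addr), hex(segmented_to_linear(segment, offset)))
# 0x1ffff 0x20000
```

A 32-bit linear address space removes the renormalization step entirely: the next element is simply `addr + 1`.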

Floating point calculations also need a 32-bit processor for efficiency. A 16-bit processor spends a great deal of time shuffling stack elements when manipulating floating point numbers, whereas a 32-bit processor naturally matches the data element size. In many cases, scaled integer arithmetic is more appropriate than floating point, and then a 16-bit processor is sufficient. However, floating point is often used to reduce programming cost and to support code written in high-level languages, and as high-speed floating point hardware spreads, the traditional speed advantage of integer over floating point arithmetic is shrinking.

The disadvantages of a 32-bit processor are cost and system complexity: a 32-bit processor chip has more transistors and more pins than a 16-bit one. It also requires 32-bit-wide program memory and usually a larger printed circuit board area, and it has little spare silicon for extras such as hardware multipliers, although these will appear as process technology becomes finer.

8.3 System implementation method

Once the choice between a 16-bit and a 32-bit system has been made, the next step is to choose among implementations. The seven stack computers discussed in this book embody different design trade-offs in system complexity, flexibility, and performance, reflecting their suitability for different applications. A fundamental trade-off is between hardwired and microcoded control.

8.3.1 Hardwired versus microcoded control

As in every branch of computing, the debate between hardwired control circuits and microcoded control continues. Hardwired machines have the advantage that the instructions they support execute faster, and the disadvantage that only simple instructions can be supported: a complex operation must be synthesized by executing several instructions.

A microcoded machine is more flexible than a hardwired machine, because it can implement very complex instructions by executing arbitrarily long microcode sequences. Each instruction can be thought of as a subroutine call to a microcode routine. On a computer with writable microcode RAM, the instruction set can be extended with application-specific instructions, which can give particular programs a substantial speedup.

Hardwired stack computers support certain compound stack operations that combine stack manipulation, arithmetic, and subroutine return by decoding separate fields within the instruction. In this sense, the hardwired instruction format resembles microcode; in fact, Novix called the NC4016's instructions "external microcode".

On a microcoded stack computer, simple operations such as addition usually take longer than on a hardwired computer, while long, complex opcodes such as double-precision arithmetic operations are not available on the hardwired machine at all, since they cannot be packed into a single instruction. For such complex operations, the microcoded computer runs faster by providing special compound opcodes. In practice, this difference in flexibility often matters more than raw-speed differences between two processors. The conclusion is that we cannot say hardwired or microcoded computers are faster without evaluating all the alternatives; the important thing is to evaluate the application's needs carefully before selecting a processor.

8.3.2 Integration level and system cost/performance

Having discussed the hardwired/microcoded trade-off, recall the consideration of level of integration from the discussion of 16-bit stack processors in Chapter 4. The level of integration is the amount of system hardware placed on the processor chip: the more system functions on the processor, the higher the integration. On the other hand, a design must also weigh cost/performance trade-offs so that the number and kinds of components needed for a working system are minimized.

The WISC CPU/16 has the lowest system integration of all the stack processors presented: it implements the processor with discrete components on a circuit board rather than on a single chip. Of course, this approach saves the large layout investment required to produce a custom chip.

The MISC M17 is a simple single-chip processor. Because it uses program memory to hold its stacks, only program memory and the processor are needed for a working system; integration is therefore high and system complexity low. The limitation of this simple design is that it is slower than designs with separate stack memories.

The Novix NC4016 is also a single-chip processor, with a level of integration comparable to the M17. Not surprisingly, both processors use a gate array process and are of roughly equivalent complexity. The main difference is that the NC4016 uses separate memories for its two stacks. Separate stacks give faster processing at a given clock rate, because more memory bandwidth is available, but at the cost of more system-level components.

The Harris RTX2000 increases system integration beyond the NC4016 by placing the stacks on-chip. This actually reduces system complexity while improving speed, because on-chip memory is faster than off-chip memory; the cost is more transistors on the chip. However, these extra transistors do not necessarily enlarge the chip, because the RTX2000 uses a different technology, standard cells, which is well suited to providing on-chip memory. In fact, the RTX2000's semicustom design allows on-chip program memory to be included as well, yielding a single-chip stack computer system.

Future stack computer designs will have to make trade-offs along these dimensions: data path width (16 and 32 bits for most processing, 24 bits for signal processing, 36 bits for tagged data structures), level of integration, required external support, and raw system performance. These issues should be weighed against each target application before a suitable processor can be chosen.

8.4 Example applications

Stack computers, like computers in general, can be applied anywhere imagination reaches. Applications particularly well suited to stack computers include:

Image Processing

Target recognition, including optical character recognition, fingerprint recognition, and handwriting recognition, as well as image enhancement, requires especially powerful processors and has a wide range of applications. Many commercial applications need processors that are small, low-power, and portable.

Robot control

A robot arm typically has five or six joints (degrees of freedom). A common strategy is to assign one microcontroller to each joint, with a more powerful microcontroller as the central controller, so that each joint can perform complex position calculations in real time. In a vehicle-mounted robot, small size and low power consumption are critical.

Digital filtering

Filters require high processing speed to keep up with high data rates. A stack computer has on-chip room for a hardware multiplier and algorithm-specific hardware to perform digital filtering calculations quickly.

Process control

Beyond simple process control techniques, more powerful processors can apply expert system technology to real-time process monitoring and control, and stack computers are particularly well suited to rule-based systems.

Computer graphics

Several special-purpose graphics accelerator chips are on the market; they tend to concentrate on primitive line drawing and bit-block moves. An exciting opportunity in this field is interpreting high-level graphics command languages for laser printers and device-independent screen displays. One such language, PostScript, is very similar to Forth.

Other computer peripherals

The low cost of stack computers makes them well suited to controlling computer peripherals such as disk drives and communication links.

Telecommunications

High-speed controllers can provide data compression, lowering costs in fax and modem applications. They can also monitor the performance of transmitting and receiving equipment.

Automotive control

The automotive market imposes strict cost and environmental constraints. In this business, a small difference in per-component cost translates into a huge profit or loss. A high level of system integration is mandatory, and computers can improve system performance and safety while reducing cost. Application areas include computerized ignition, braking, fuel metering, anti-theft devices, collision alarm systems, and head-up displays.

Consumer electronics

Consumer products are even more cost-sensitive than automobiles. Anyone who has used a cheap calculator or digital watch may wonder how such a product can be built from little more than plastic and a single chip. High-speed, portable, cost-effective stack processors have opportunities in music synthesis (such as MIDI devices), compact disc and digital tape players, slow-scan television over telephone lines, interactive cable TV services, and video games.

Military and space control applications

Modern space applications may serve commercial purposes, but they share reliability and environmental requirements with military applications. Stack processors are well suited to high-speed control applications in missiles and spacecraft. Other applications include acoustic and electronic signal processing, image enhancement, communications, fire control, and battle management.

Parallel processing

Preliminary research shows that stack computers can efficiently execute functional programming languages. Programs written in these languages contain large amounts of parallelism, which multiprocessor stack computer systems could be developed to exploit.

Chapter 9 Stack Computer's Future

The stack computers discussed in the preceding chapters are the first generation of commercial stack processors. As these computers come into wide use, stack computers will be redesigned to suit the market and improve efficiency. The questions for this chapter are: what machines can we expect to see, and what effect will they have on stack computer architecture and applications?

Answering every question about how stack computers will fare in so many different fields would be rash, but we can set out some important observations. The viewpoints and reasoning here can form a basis for future exploration of stack computer concepts. This chapter, however, is speculation, not established fact.

Section 9.1 discusses several aspects to consider when traditional programming languages must be supported; as will be explained, current stack computer designs can already handle most of the problems.

Section 9.2 discusses approaches to virtual memory and memory protection. Current stack computers have no virtual memory, since most of their applications do not require it, but future applications may.

Section 9.3 considers the need for a third stack and offers a suggestion: stack frames resident in program memory can satisfy both the need for a third stack and the need for traditional language support.

Section 9.4 discusses the pressing problem of limited memory bandwidth and the history behind hierarchical memory structures. The stack computer offers a solution to the memory bandwidth problem that applies well to an important class of applications.

Section 9.5 introduces two interesting design ideas not used in existing designs: one is to replace conditional branches with conditional subroutine returns; the other is to use a stack to temporarily hold assembly language code.

Section 9.6 offers some thoughts on the influence stack computers may have on computing.

9.1 Support for traditional languages

The initial market for stack computers is real-time control, but the high integration of stack computers also makes them useful as low-cost, high-performance coprocessor cards for personal computers or low-end workstations, where they can be customized for a specific application area, or even for a single important program. All of these environments need to run a great deal of application code written in traditional high-level languages.

Traditional high-level languages are easy to implement on stack computers; the only problem is that a pure stack computer may not run traditional programs, written in the usual programming style, as fast as a register machine does. The problem is mainly a mismatch between what traditional languages demand and what stack computers provide: traditional programs tend to use few procedure calls and many local variables, whereas stack computers run best on programs composed of many small procedures with few local variables. This mismatch is caused by the programming styles and structures that conventional programming languages and their communities have developed. Put differently, register computers are well adapted to general-purpose data processing, while stack computers behave better in real-time control environments. In any case, given modest additional hardware, traditional-language performance on a stack computer can approach that of the highest-performance register computers across applications. The idea, of course, is to adopt the best features of register machines where they help, without sacrificing the stack computer's strengths in other areas.

9.1.1 Stack frames

Next, we will determine what additional hardware is needed to support the structures of high-level languages. Curiously, the one major area where a pure stack computer cannot support traditional high-level languages well is the run-time stack itself. High-level languages use an "activation record" structure, created on each subroutine call and located in program memory as a software-managed "frame". In a typical implementation, a large block of stack memory is allocated for each frame before the subroutine call. The frame contains the input parameters (which are actually placed there by the caller), the user-declared local variables, and intermediate variables generated by the compiler for its own needs. While the subroutine runs, any element in the frame can be accessed at random, without executing PUSHes and POPs.
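A minimal model of such an activation record, held in program memory and addressed relative to a frame pointer, might look like the sketch below. This is an illustration only; names such as `FrameStack` are invented here, not taken from any actual processor:

```python
class FrameStack:
    """Toy model of software-managed activation records ("frames")
    in program memory, addressed through a frame pointer (fp)."""

    def __init__(self, size=256):
        self.mem = [0] * size
        self.fp = size                 # frames grow downward

    def enter(self, args, n_locals):
        """Subroutine entry: allocate one frame holding the input
        parameters followed by uninitialized locals."""
        frame_size = len(args) + n_locals
        self.fp -= frame_size
        self.mem[self.fp:self.fp + len(args)] = args
        return frame_size

    def local(self, index):
        """Random access relative to fp: no PUSH or POP required."""
        return self.mem[self.fp + index]

    def set_local(self, index, value):
        self.mem[self.fp + index] = value

    def leave(self, frame_size):
        """Subroutine exit: discard the frame."""
        self.fp += frame_size

fs = FrameStack()
size = fs.enter(args=[10, 20], n_locals=1)
fs.set_local(2, fs.local(0) + fs.local(1))   # local := arg0 + arg1
print(fs.local(2))                            # 30
fs.leave(size)
```

The point of the model is the random access: within the frame, any slot can be read or written directly, which is exactly what a LIFO hardware stack does not provide.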

A stack frame is thus a randomly accessed temporary memory allocation device, not a traditional last-in-first-out structure, which means it is incompatible with the hardware stacks we have studied in stack computers. An obvious way to modify a stack computer to meet this requirement is to make the stack buffer a large block of randomly addressable memory. This is exactly the register window solution of RISC computers (classified as SL2 in Chapter 2), but it makes little sense for a stack computer, because every data access would then pay the price of specifying an operand address in the instruction format.

An alternative is to build a second hardware stack for frames, accessed somewhat more slowly for local variables, while ordinary data manipulation continues on the LIFO hardware data stack. This would give us the best of both worlds, but not without cost: to obtain good operating characteristics, the second frame stack typically must be 5 to 10 times the size of the data stack.

9.1.2 Register and memory mix

If this trade-off were the only problem, we might still try to build the frame stack on-chip. However, another factor argues for placing stack frames in program memory: the semantics of traditional languages allow local variables to be accessed through memory addresses. C is notorious on this point, and it limits how closely a stack computer can imitate a register computer.

Treating registers as addressable memory can be done with clever hardware and compilers, but the cost in hardware and/or software complexity conflicts with the stack computer's goal of minimal complexity. Therefore, the best choice for a stack computer is to keep traditional-language stack frames in program memory, with a frame pointer register providing hardware-assisted access to the current frame. If chip area is plentiful, the stack computer can provide a block of on-chip RAM as part of the program memory space to speed up frame access. One idea may seem attractive: write a sophisticated compiler that keeps most of a traditional program's local variables on the hardware stack during execution. However, according to the author, experiments on stack computers comparing a C compiler that kept all local variables on the hardware stack against one that used a frame pointer into program memory found little difference; in fact, on a computer with a frame pointer, keeping frames in program memory was faster.

One would normally expect that keeping local variables on the hardware data stack must be faster than placing them in program memory. In practice it is not, because a stack computer manipulates deeply buried stack elements slowly, especially when those elements may have spilled into program memory. The "everything on the hardware stack" approach spends a great deal of time digging for buried stack elements; accessing a frame element resident in program memory is slower than accessing an element on the hardware data stack, but the time saved on stack manipulation makes the two approaches come out about even.

9.1.3 A strategy for stack frame processing

A good compromise when handling stack frames is to make program memory the residence for frame data while the hardware data stack carries parameters. In detail: all procedure-call parameters are placed on the hardware data stack, since these values either have just been computed or are being moved between frames, so passing them through the hardware data stack wastes no time. On procedure entry, most of the parameters are copied from the hardware stack into their allocated frame slots as local variables, leaving only one or two parameters on the stack. The compiler guarantees that the parameters left on the data stack are never referenced by address; this can be declared by the programmer, determined by compiler analysis, or, most conservatively, everything can be copied to memory for safety. Leaving one or two items on the hardware data stack minimizes stack spills and improves performance. When the procedure ends, its return value is pushed onto the data stack.
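The calling convention just described might be sketched as follows. This is illustrative Python only; `enter_subroutine` and the `keep` parameter are inventions of this sketch, not features of any actual stack processor:

```python
def enter_subroutine(data_stack, n_args, keep=2):
    """Split the n_args parameters the caller pushed: the deepest
    (n_args - keep) are spilled to a memory-resident frame (where they
    may be referenced by address); the top `keep` stay on the hardware
    data stack, where the compiler guarantees no address references."""
    spill = max(n_args - keep, 0)
    base = len(data_stack) - n_args          # deepest parameter
    frame = data_stack[base:base + spill]    # copy to memory frame
    del data_stack[base:base + spill]        # remove spilled items
    return frame

stack = [1, 2, 3, 4]                         # caller pushed 4 parameters
frame = enter_subroutine(stack, n_args=4, keep=2)
print(frame, stack)                          # [1, 2] [3, 4]
```

Only the deep parameters pay for a memory reference; the one or two items left on top are immediately usable by the arithmetic instructions.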

This approach gives good efficiency by eliminating a large number of memory references, and it also provides a clean interface between languages such as Forth and C; compilers that handle these conventions are relatively easy to write. Many existing stack computers can easily implement this strategy.

9.1.4 Execution Efficiency of Traditional Languages

From this discussion we can see that, using the stack frame pointer approach, stack computers can support traditional languages with adequate efficiency. Of course, a frame pointer into program memory cannot be expected to reach the efficiency of a RISC machine, because RISC computers have large on-chip register windows that support frames directly, or smart optimizing compilers that perform global register allocation.

Traditional-language benchmarks portray stack computers as poor performers because those benchmarks are modeled on high-performance register computers. Program properties that cause this effect include long stretches of straight-line code, near-100% cache hit rates, local variables cross-referenced between long procedures, and procedures that can be compiled into in-line code.

Looking instead at program structures that suit stack computers, a stack computer can match or even exceed register-based performance. Code that runs well on stack computers includes: highly modular procedures with many levels of nesting or recursion; a relatively small set of frequently used subroutines whose inner nest can be placed in fast memory, possibly with dedicated hardware support; procedure interfaces that pass only a few variables between layers; and deeply nested subroutine calls. Programs that run in environments with frequent interrupts and context switches also benefit from stack computers. A practical way to use a traditional language on a stack processor is to write most of the program in a moderately efficient high-level language, then recode the inner loops in assembly language. With modest effort this yields very high performance. For real-time control projects that need the stack computer's strengths, this method obtains the greatest return on programmer time.

We should remember that when choosing a computer, many factors beyond the raw speed of a single program matter, and they may tip the balance toward a stack computer: interrupt handling speed, task switching speed, low system complexity, and support for application-specific software and/or hardware operations. In the final analysis, a stack computer may not run traditional-language programs as fast as a register computer, but other considerations can outweigh this shortcoming, making the stack computer a good choice, especially for real-time control applications.

9.2 Virtual memory and memory protection

The concepts of virtual memory and memory protection have received little consideration in current stack computers, because most stack computer applications to date have been fairly small, and tight hardware and software budgets have left no room for these techniques.

9.2.1 Memory protection is sometimes important

Memory management can mean many things; here we discuss only its protection aspect. Protection is an important part of memory management, and it matters to some real-time control users, especially in the military. Protection is a hardware capability that prevents a program from accessing memory outside its strictly controlled region. An unauthorized memory access triggers an interrupt, letting the operating system shut down or reset the offending task; this provides a safe way to keep a runaway task from corrupting other tasks' memory. Memory protection is typically implemented by a separate, operating-system-managed chip, and nothing prevents stack computers from using such a chip. On the contrary, because one of the stack computer's strengths is its small size, it may be possible to put the memory protection circuitry on the processor chip itself.

9.2.2 No virtual memory, for now

Similarly, there is no reason a stack computer cannot provide virtual memory. One difficulty in virtual memory is handling a page fault in mid-instruction, which may require restarting the instruction; since the stack computer is essentially a load/store machine, restarting instructions is no harder than on RISC computers. In fact, since there is no pipeline, interrupts are handled faster on stack computers, so they should handle virtual memory at least as well.

The reason stack computers lack virtual memory is simple: many stack computer applications are real-time control, and virtual memory does not fit embedded real-time control, because virtual memory requires a large-capacity disk, which seldom suits the operating environment.

9.3 Using a third stack

Some stack computer designs have proposed adding a third hardware stack, dedicated to storing loop variables and local variables.

Current stack computers usually keep the loop counter as the top element of the return stack. This works because subroutines and loops nest cleanly together, but letting a subroutine access its parent's loop variable is poor program design style. Giving loop variables their own stack, off the return stack, therefore has some conceptual appeal, but the resulting gain in program execution efficiency is not necessarily proportional to the hardware cost. The other use for a third stack is storing local variables: even when writing in Forth, programmers find that compiler-managed local variables make some programs easier to write and maintain. To implement these capabilities efficiently, the hardware needs a frame-allocated stack with random read/write access within the frame, which is the same requirement as supporting traditional languages. Therefore, the best approach in practice is not a third hardware stack but a frame pointer into a software-managed stack in program memory, which supports local variables for both high-level languages and Forth.
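The conventional scheme, with the loop index riding on the return stack, can be modeled in a few lines. This is a toy sketch of the idea only, not actual Forth or hardware behavior:

```python
# Toy model: a Forth-style DO ... LOOP keeps the index and limit on
# the return stack, so loops and subroutine calls nest on a single
# structure.
return_stack = []

def do_loop(limit, start, body):
    return_stack.append([limit, start])    # [limit, index] on R-stack
    while return_stack[-1][1] < return_stack[-1][0]:
        body(return_stack[-1][1])          # body may itself call do_loop
        return_stack[-1][1] += 1
    return_stack.pop()

out = []
do_loop(3, 0, lambda i: do_loop(2, 0, lambda j: out.append((i, j))))
print(out)   # [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]
```

Because inner loops and subroutine calls both push onto the same return stack, nesting works automatically; the drawback noted above is that a called subroutine can also reach its caller's loop index.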

9.4 Memory bandwidth limit

Perhaps the biggest challenge in computer architecture in recent years is the memory bandwidth problem. Memory bandwidth is the amount of information that can be exchanged with memory per unit time; put another way, it determines how many memory values can be accessed per second.
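A back-of-the-envelope version of this definition can make the numbers concrete (the bus figures below are hypothetical, not taken from the book):

```python
def bandwidth_bytes_per_sec(bus_bits, clock_hz, cycles_per_access=1):
    """Peak memory bandwidth: bytes moved per second over the bus."""
    return (bus_bits // 8) * clock_hz / cycles_per_access

# A hypothetical 16-bit bus at 8 MHz with one wait state
# (two clock cycles per access) moves at most 8 MB per second:
print(bandwidth_bytes_per_sec(16, 8e6, cycles_per_access=2) / 1e6, "MB/s")
# 8.0 MB/s
```

Note how a single wait state halves the achievable bandwidth, which is why the wait states discussed later in this section matter so much.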

The crux of the problem is that program memory usually contains far more transistors and other devices than the processor, which means that, under given conditions, the CPU is easier to make fast than the memory. The usual rule applies: for a given manufacturing technology and process, faster components cost more, consume more power, and so on.

9.4.1 History of memory bandwidth problem

The specter of the memory bandwidth limit runs through the entire history of computer design. In the beginning, everyone was grateful that computers ran at all, and the relative speeds of program memory and processor were not an issue. As computers found applications, main memory capacity mattered chiefly for holding data files, and anything that worked was better than nothing.

Because the magnetic core memory used in early large computers was slow, very complex instruction sets arose, packing a great deal of work into each instruction. This also made microcode practical, because a small amount of microcode memory could, within limits, be made as fast as the CPU, while a large program memory could not. After semiconductor memory appeared, caches that capture small, frequently repeated blocks of code, especially loops, came into use. Cache memories grew larger and larger, so that more and more of the program resided in cache fast enough to match the processor.

Then came the microprocessor. Early microprocessors were slow, and program memory chips easily matched their speed (so the question changed from whether it could run at all to how fast it could run). The memory bandwidth problem retreated for a while, and later microprocessor manufacturers followed the large systems into complex instructions and microcode.

As mainstream microprocessors advanced, "wait states" began appearing on memory accesses. A wait state is a clock cycle spent by the processor waiting for memory to respond when the processor is faster than its memory chips. One simple solution is to spend more money on faster memory chips; another is to wait for memory vendors to produce faster memory at a reasonable price. Eventually, CISC microprocessors borrowed caches and other techniques from large systems to put more and more program code in fast memory.

RISC upset this pattern by declaring the compact code of traditional computers a mistake; RISC machines place very low-level instructions, in effect good compiler-generated microcode, directly in program memory. This methodology has great advantages, but it further raises memory bandwidth requirements, making RISC performance even more dependent on cache memory.

9.4.2 The current memory bandwidth situation

For the latest generation of computers, cache memory chips may no longer suffice to keep future processors busy. This is not because processor transistors have stopped switching faster than memory chips (they have not); the problem is that the pin counts of processor and memory packages have become the chief limitation.

The reason pins are a bottleneck is that transistors get smaller and faster as chip density increases, while the solder pads and bonding wires that connect the pins cannot shrink correspondingly without exotic packaging. The current needed to drive a pin is enormous compared with that needed to drive an on-chip transistor, so the pins become the bottleneck. This means that any off-chip memory will be slower than on-chip memory, simply because of the delay in getting signals across the pins.

The situation we are now in is that all off-chip memory is too slow to keep the processor busy. This adds yet another level to the memory speed hierarchy: on-chip cache memory. Unfortunately, this approach has a fundamental problem that earlier approaches did not. Printed circuit boards can be made quite large without trouble: board defect rates grow only linearly with the number of chips on the board, and a faulty board can still be repaired. Chip defect rates, however, grow roughly exponentially with chip area, and a faulty chip cannot be repaired.

With separate cache chips, one can simply add more chips to the printed circuit board to meet any reasonable need. But if a single chip does not have enough area for its on-chip cache, enlarging the chip may make the processor unmanufacturable because of yield problems. High-speed processors, especially RISC processors, place such heavy demands on memory that the best we can hope for is a modest amount of high-speed on-chip cache backed by a slower, larger off-chip cache. Program performance then depends on the interplay of these two different caches: is this really the best result we can hope for?

9.4.3 Stack Computer Solution

Stack computers offer a completely different way to solve the problem. Traditional computers use caches to capture short program fragments at run time, hoping to reuse instructions already fetched for loops and frequent subroutine calls. Part of the cache-performance problem is that traditional programming languages and compilers generate sprawling, haphazardly arranged code; stack computers instead encourage compact code with heavy reuse.

The consequence of stack computer program behavior is that, instead of dynamically managed cache memory, a small, statically allocated (or operating-system managed) region of fast program memory can be used to hold selected subroutines. Frequently executed subroutines can be placed in this region of program memory whenever the compiler or user knows they need to run quickly. This actively encourages modular, reusable procedures, since doing so improves performance rather than hurting it as on other computers. Moreover, since on-chip program memory needs none of the complex control circuitry that manages a cache, the area saved can be spent on additional program memory. On a 16-bit stack processor, a complete real-time control program and its local variable data memory can reasonably fit on-chip. As submicron fabrication processes arrive, 32-bit processors can begin to use the same solution.

To see how this method differs from a conventional hierarchical memory structure, consider a microcoded computer such as the RTX32P, which has a large dynamic RAM available to hold many programs and data. In practice it would be unusual for an RTX32P program to need more capacity than its static memory provides, but let us assume that it does. The dynamic RAM forms the memory level for large, infrequently accessed programs and for bulk data that is touched only occasionally.

Next, static memory chips are added to the system to hold more frequently executed mid-level code, as well as program data that is processed often; such data can reside permanently in this memory, or be copied in from dynamic memory on demand and reside there for a while. There may in fact be two levels of static memory chips: large and slow versus small and fast, each with different power consumption, cost, and printed circuit board space. On-chip program memory can form yet another level, where the inner loops of important procedures reside for fast access. Several hundred bytes of program RAM for data and code can easily fit on the processor chip, and when a specific program is run (the usual case in real-time embedded systems) several thousand bytes of program ROM can reside on-chip. Indeed, any language can keep many general-purpose ROM subroutines there to help programmers and compilers.

Finally, microcode memory resides on-chip in direct control of the CPU. Viewed in this light, microcode is simply another level of the program memory hierarchy, holding the most frequent processor behaviors as the implementations of the supported machine instructions; once again a mix of ROM and RAM is appropriate. And of course the data stack acts as the fastest-access level of all, holding the intermediate results of computations.

So we arrive at a hierarchy of memory sizes and speeds running through the entire system. The hierarchy concept is not new; what is new is that it need not be managed by hardware at run time: compilers and programmers can manage it easily themselves. The key is that stack computer programs are small, so a large amount of code can reside at each statically allocated level. Because stack computers support fast procedure calls, only inner loops and small pieces of frequently used code need reside in high-speed memory; there is no need to pull entire user-defined procedures into it. In practice, this means dynamic memory management is simply not needed.
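The compile-time hierarchy management described above can be illustrated with a toy placement pass (my own sketch, not from the book): subroutines are sorted by estimated execution frequency and greedily assigned to the fastest memory level that still has room, with no run-time cache management. All names, sizes, and frequencies below are hypothetical.

```python
# Toy static placement of subroutines into a memory hierarchy.
# (name, code size in bytes, estimated executions per second)
subroutines = [
    ("inner_loop",  64, 900_000),
    ("pid_update", 128, 200_000),
    ("log_sample", 256,   5_000),
    ("init",       512,       1),
]

# Memory levels, fastest first, with capacities in bytes.
levels = [("on-chip RAM", 256), ("fast SRAM", 512), ("DRAM", 65536)]

def place(subroutines, levels):
    free = {name: cap for name, cap in levels}
    placement = {}
    # Hottest code first, so it claims the fastest remaining memory.
    for name, size, freq in sorted(subroutines, key=lambda s: -s[2]):
        for level, _ in levels:
            if free[level] >= size:
                placement[name] = level
                free[level] -= size
                break
    return placement

print(place(subroutines, levels))
# {'inner_loop': 'on-chip RAM', 'pid_update': 'on-chip RAM',
#  'log_sample': 'fast SRAM', 'init': 'DRAM'}
```

Because the assignment is made once, before the program runs, no hardware is needed at run time to decide what resides where.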

9.5 Two new ideas for stack computer design

Here are two interesting ideas about details of stack computer design. Neither is used in any existing design, but both may prove valuable in future designs.

9.5.1 Conditional subroutine returns

The idea is this: Doran (1972) observed that stack computers do not actually need conditional branches; conditional subroutine return instructions suffice. Consider the IF statement of a high-level language. If we ignore the optional ELSE clause, an IF statement is "a block with one entry and two exits". The entry code begins the statement and computes the branch condition. The first exit is taken when the condition is false, skipping all remaining statements; the second exit is the end of the statement, after all its actions have completed.

The usual implementation of an IF statement tests the condition and conditionally branches around the body. Instead, we can place the entire IF statement in its own subroutine: the entry point of the IF statement becomes a call to this special subroutine; the first exit point becomes a conditional subroutine return instruction placed right after the condition clause; and the second exit is an unconditional return.

This approach eliminates conditional branches from the program, replacing them with conditional subroutine returns. The technique is ideally suited to stack computers, since both the call into the conditional subroutine and the return from it are cheap; it may well be more efficient than the conditional branch methods in current use.
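Doran's transformation can be sketched in a few lines of Python (my own illustration, not from the book): the IF statement becomes a subroutine whose first exit is a conditional return taken when the condition is false, and whose second exit is the normal unconditional return at the end of the body.

```python
def make_if(condition, body):
    """Compile an IF statement (no ELSE) into a single subroutine.

    condition and body are callables standing in for compiled code.
    """
    def if_subroutine(x):
        if not condition(x):
            return x          # exit 1: conditional subroutine return
        return body(x)        # exit 2: unconditional return
    return if_subroutine

# Example: IF x < 0 THEN x := -x  (absolute value)
absolute = make_if(lambda x: x < 0, lambda x: -x)

print(absolute(-5))  # 5
print(absolute(3))   # 3
```

No branch instruction skips over the body; control flow is carried entirely by the call and the two returns.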

9.5.2 Executing code stored on the stack

Another interesting stack computer proposal comes from Tsukamoto (1977), who examined the advantages and disadvantages of self-modifying code. Although self-modifying code can be very efficient, it is avoided by almost all professional programmers because the risks are too great: once code modifies itself, the programmer can no longer trust that the instructions generated by the compiler or assembler remain correct throughout the program.

Tsukamoto's idea captures the benefits of self-modifying code without its defects. He suggests simply using the run-time program stack to store modified code segments for execution. The code can be generated by the application and executed at run time; the process never disturbs program memory, and after the code has executed it is simply popped off the stack and discarded. Neither of these techniques is in use today, but both may become important in future applications.
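Tsukamoto's scheme can be illustrated with a minimal sketch (my own, not his actual proposal): the application builds a code fragment at run time, pushes it onto a stack, executes it from the stack top, then pops and discards it. The "program memory" (the compiled functions) is never modified.

```python
code_stack = []

def run_generated(threshold):
    # Generate a specialized code fragment at run time. Here it is a
    # Python closure; on a stack machine it would be machine code words.
    fragment = lambda xs: [x for x in xs if x > threshold]
    code_stack.append(fragment)         # push the generated code
    result = code_stack[-1]([1, 5, 9])  # execute from the stack top
    code_stack.pop()                    # pop and discard after use
    return result

print(run_generated(4))  # [5, 9]
```

After the call the stack is empty again, so the generated code leaves no trace, and nothing in static program memory was ever overwritten.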

9.6 The impact of stack computers on computing

We have already seen that, measured in raw instructions executed per second, stack computers are as fast as register-based computers, and that they show superior characteristics in real-time control applications. We have also seen that they excel on programs built from small, deeply nested procedure calls that use few local variables. Asking which of CISC, RISC, and stack computers is "best" misses the point, however, because each of these design techniques has its own place among applications. Stack computers do not seem well suited as the central processor for the workstation and minicomputer markets, and for this reason they have received little attention; but in the fields that suit them they have a real presence and perform well.

One can argue that the obstacle to supporting certain computing tasks well is not the stack computer but current programming practice. Consider the characteristics of programs that suit stack computers: highly modular programs built from many small, deeply nested procedures; procedures that take few parameters and hide the details of their operation; frequent reuse of these small modules to reduce program size and complexity; programs that are easy to debug because their pieces are small; and a single uniform interface connecting modules at every level of abstraction, from high-level subroutines down to individual instructions. All of these characteristics seem desirable. Unfortunately, they are rarely practiced, and stack computers could change that.

It may well be that traditional programming languages, shaped by register-based hardware, are to blame. Procedure calls are used sparingly because they cost time. Because procedures tend to be long, language syntax does not make definition and reference simple, so each procedure accumulates special-case code, parameter lists, type information, and the like; even project management styles demand separate documentation and formal processes for every new subroutine. Because writing and using many small procedures is relatively difficult, fewer procedures are used, and a vicious circle forms.

Stack computers offer an opportunity to break this cycle. Procedure call overhead is very low, and languages like Forth make the cost of defining a new procedure minimal, providing an environment that encourages developing and testing modular, easily verified code. What remains is to design new high-level languages to fit high-level language computers (Chen et al. 1980), and stack computers are the natural vehicle. We may hope to see the control structures of traditional languages evolve to make better use of stack computer hardware.

Register computers give some performance payback even to poorly structured programs, but such programs are usually harder to maintain, harder to debug, and larger. Stack computers encourage better programming practice by rewarding the programmer who writes well-structured code. This may in turn push traditional languages toward providing better ways to create, maintain, and execute procedures.

Appendix A Computer Overview of Hardware Stack Support

This appendix surveys the stack architectures classified in Chapter 2. It includes most stack computers that have appeared in conference proceedings and journals. Each machine is listed with its classification, implementation technology, and intended application, together with references and a brief description.

AADC

Category: SL1 Implementation: 16-bit minicomputer Application: Direct APL execution, military environments Manufacturer and time: Raytheon for the US Navy, 1971 Reference: Nissen & Wallach (1973)

The AADC (All Applications Digital Computer) was designed for direct execution of the APL language. The target application area was Naval platforms (especially Naval aircraft), so small size and weight were important. The APL language was chosen for efficient machine code and execution. In particular, APL was chosen for its conciseness, which was predicted to give smaller programs and therefore fewer page faults in a virtual memory environment. The AADC converted expressions from infix to Polish notation on-the-fly at execution time. Programs were interpreted at run time by the program management unit and executed by an arithmetic processor. The execution unit used 1-operand stack notation.

AAMP

Category: SS1 Implementation: 16-bit microcoded silicon-on-sapphire Application: Military multi-tasking Manufacturer and time: Rockwell International, 1981 Reference: Best et al. (1982)

The AAMP (Advanced Architecture MicroProcessor) was designed for military and space use. A stack architecture was chosen for ease of compilation and good code density since it can use mostly 1-byte instructions. AAMP uses a single stack with a frame pointer for activation records as well as expression evaluation with a separate stack pointer. The expression evaluation area is just on top of the current frame. Many instructions are 1 byte long, with the possibility of using local variable addresses relative to the frame pointer for 1-operand addressing. Four top-of-stack registers are used for evaluation, with spillage into program memory.

Action Processor

Category: MS0 Implementation: 16-bit microcoded bit-sliced processor Application: Direct execution of Forth Manufacturer and time: Computer Tools, 1979 Reference: Rust (1981)

The ACTION Processor FORTHRIGHT is a microcoded Forth-language processor. Typical of Forth hardware implementations, it has a data stack used for expression evaluation and parameter passing as well as a return address stack used for subroutine return address storage. The top elements of both stacks are kept in registers in the bit slices. Stacks reside in program memory to reduce hardware costs.

Aerospace Computer

Category: SS0 Implementation: 64-bit processor Application: High reliability, multiprocessor spacecraft computer Manufacturer and time: Intermetrics, 1973 Reference: Miller & Vandever (1973)

The Aerospace Computer used stack instructions to save program memory space, which has a major impact on reducing size, weight, power, and cost for spacecraft applications. Stack instructions were also chosen to directly support high order languages to enhance software reliability. The design draws heavily from the B6700 architecture. All computation was done in floating point in the ALU, with integers converted to floating point format as fetched from memory. When set, the highest bit of an operand on the stack indicated that the element was really a pointer to memory, which caused a transparent fetch.

ALCOR

Category: ML0 Implementation: Emulator on various early European computers Application: Conceptual machine emulated for transportable ALGOL programming Manufacturer and time: ALCOR joint project, 1958-60 Reference: Samelson & Bauer (1962)

The ALCOR (ALgol COnverteR) joint project was a very early conceptual design for the interpretation of ALGOL 60. The European group devised a high level language machine architecture which was emulated on various machines. The conceptual machine had two stacks which were used for expression parsing and evaluation. One stack held operators, while the other stack held intermediate results. Variables and return addresses were statically allocated in program memory.

An ALGOL Machine

Category: ML0 Implementation: Research prototype project Application: Direct execution of ALGOL Manufacturer and time: Burroughs, 1961 Reference: Anderson (1961)

The exploration for a direct execution architecture was motivated by the observation that two-thirds of computer time was then spent doing compilation and debugging. The focus of the research was on making computers easier to use. The approach taken was to directly execute a high level language. The machine discussed used hardware stacks to execute ALGOL constructs. Two stacks formed a value and operator stack pair for expression parsing and evaluation. The third stack held subroutine return address information.

AM29000

Category: SL2 Implementation: 32-bit microprocessor Application: General purpose RISC processor Manufacturer and time: Advanced Micro Devices (AMD), 1987 Reference: Johnson (1987)

The AM29000 is a RISC processor. While its instructions are not stack-oriented, it provides considerable hardware support for stacks for parameter passing for high level languages. It has 192 registers, 64 of which are conventional registers, the other 128 of which are used as a stack cache. A stack frame pointer into the register file provides relative addressing of registers. If the stack cache overflows, it is spilled to program memory under software control. The chip has the capability of dividing the 256-register address space into 16 banks for multi-tasking. Each instruction may access registers either globally or based on the register stack pointer.

APL Language

Category: MS0 Implementation: Microcoded emulation on IBM 360/25 with WCS Application: Direct execution of APL Manufacturer and time: International Business Machines, 1973 Reference: Hassitt et al. (1973)

APL is an inherently interpreted language, so creating an APL direct execution machine is an attractive alternative to interpreters on conventional machines. This project used an IBM 360 Model 25 with writable control store to emulate an operational APL machine. The machine used two stacks resident in program memory: one for expression evaluation, the other for temporary allocation of variable space.

Buffalo Stack Machine

Category: SS1 Implementation: 32-bit microcoded emulation on a B1700 Application: Block structured language execution Manufacturer and time: State University of New York at Buffalo, 1972 Reference: Lutz (1973)

The BSM (Buffalo Stack Machine) was a microcoded emulation of a stack architecture that ran on a Burroughs B1700 system. The architecture was designed to support ALGOL-60 type languages. Variables were stored as tagged data in memory with 32 data bits and 4 tag bits. Interrupts were treated as hardware-invoked procedure calls, thus saving state automatically on the stack. A sticky point with doing this was that interrupts on stack overflow/underflow had to be raised before the stack was completely full/empty, to prevent a system crash.

Burroughs Machines

Category: SS0 Implementation: A family of minicomputers Application: General purpose multi-user computing Manufacturer and time: Burroughs Corporation, 1961-77 (and beyond) Reference: Carlson (1963), Doran (1979), Earnest (1980), Organick (1973)

The Burroughs line of stack computers originated with the ALGOL-oriented B5000 machine in 1961. One of the motivations for this machine was the observation that conventional machines required compilers that were so complex that they were too expensive to run (in 1961).

The B5000 was a 0-operand pure stack machine that kept the stack elements in program memory. The top two stack elements of the B5000 were kept in special registers in the CPU. A special feature of these registers is that there were hardware status bits that allowed 0, 1, or 2 of the registers to contain valid data. This reduced the amount of memory bus traffic by eliminating redundant reads and writes to memory (for example, a POP followed by a PUSH would not cause the same value to be read from memory).
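The effect of those validity bits can be shown with a toy model (my own sketch, not from the book): pushes and pops use the two top-of-stack registers first, and memory is touched only when both registers fill (spill) or both empty (refill), so a POP followed by a PUSH costs no memory traffic at all.

```python
class B5000Stack:
    """Toy model of two top-of-stack registers with validity status."""
    def __init__(self):
        self.memory = []      # stack portion resident in program memory
        self.regs = []        # 0, 1, or 2 valid top-of-stack registers
        self.mem_ops = 0      # count of memory reads + writes

    def push(self, value):
        if len(self.regs) == 2:            # both registers valid: spill
            self.memory.append(self.regs.pop(0))
            self.mem_ops += 1              # one memory write
        self.regs.append(value)

    def pop(self):
        if not self.regs and self.memory:  # registers empty: refill
            self.regs.append(self.memory.pop())
            self.mem_ops += 1              # one memory read
        return self.regs.pop()

s = B5000Stack()
s.push(10)
s.pop()           # served entirely from the registers
s.push(20)
print(s.mem_ops)  # 0 -- no redundant memory traffic
```

Without the validity bits, every push and pop would have to touch the memory-resident stack to keep the registers consistent.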

The stacks on these machines were used both for expression evaluation and stack frames for ALGOL procedure calls. Thus, return addresses were interleaved with parameters on the stack. One of the advantages to keeping the stacks resident in program memory was rapid response to interrupts and a low cost for task swapping. Stacks enabled the hardware to treat procedure calls, interrupts, and task calls in a uniform manner.

Caltech Chip

Category: SS0 Implementation: 8-bit microcoded VLSI chip Application: University VLSI design project Manufacturer and time: California Institute of Technology, 1979 Reference: Efland & Mosteller (1979)

This stack machine was implemented as a student project for a VLSI design course. The objective was to design and lay out the simplest possible computer in a two and one-half week period. To keep the design simple, the students chose a 0-operand stack machine. The stack on this machine was maintained in program memory with 2 registers containing the top two stack elements on-chip. The instruction set was patterned after the primitives needed by a student-written Pascal compiler.

CRISP

Category: SL2 Implementation: 32-bit CMOS microprocessor Application: C language RISC machine Manufacturer and time: AT&T Bell Laboratories, 1987 Reference: Ditzel et al. (1987a), Ditzel et al. (1987b), Ditzel & McLellan (1982), Ditzel & McLellan (1987)

The CRISP microprocessor is a RISC machine optimized for executing the C programming language. It is designed as a register-less machine, with all operands memory-resident. However, since the C language uses stacks to allocate storage for local parameters, most of the operand data references are to memory locations relative to a stack pointer register. To support these stack references, CRISP has a 32-element stack cache on-chip. Thus, when a memory-to-memory operation is performed on data near the top of the stack, the operands are fetched and stored using the on-chip cache. CRISP also supports branch folding, a technique in which branches are executed in parallel with other instructions.

Dragon

Category: SL2 Implementation: 2-chip microprocessor Application: Experimental multiprocessor design Manufacturer and time: Xerox Palo Alto Research Center, 1985 Reference: Atkinson & McCreight (1987)

The Dragon is an experimental design created with an emphasis on compact binary instruction encodings and fast procedure calls. Variable length instructions and the use of stack-register addressing keep instruction size small while allowing the use of 3-operand instructions. The Dragon has a 128-element execution unit register stack with variable size frames, implemented by a pointer pair that defines the upper and lower frame bounds.

EM-1

Category: SS1 Implementation: Conceptual design Application: Structured programming Manufacturer and time: Vrije Universiteit, The Netherlands, 1978 Reference: Tanenbaum (1978)

This often-cited paper gives a discussion of how structured programming techniques should impact machine design, then presents an example design, the Experimental Machine-1 (EM-1). The motivation behind the EM-1 is to provide an efficient environment for well-structured programs. To do this, it uses a single memory-resident stack in typical block-structured language style, and provides 1-operand addressing to access local variables on the stack frame for evaluation using the top of stack. The design skirts the issue of memory bus contention between stack items and instructions by presuming the existence of a stack cache independent of an instruction cache.

EULER

Category: SS0 Implementation: Microcoded interpreter on IBM 360/30 Application: Research into implementing direct high level language interpreters in microcode Manufacturer and time: IBM Systems Development Division, 1967 Reference: Weber (1967)

EULER is an extension of the ALGOL programming language. The EULER project discussed by this paper was an early attempt to implement a direct interpretation machine by adding special-purpose microprogramming to a standard IBM Model 360 computer. In operation, an EULER program was compiled to an intermediate byte-code format. Each byte-code invoked a routine resident in microprogram memory. This system may well have been the first "p-coded" machine. The use of microcoded interpretation was justified for this project by the fact that EULER supports structures such as dynamic typing, dynamic storage allocation, and list processing that were poorly handled by available compilers. The EULER implementation used a single stack resident in program memory for dynamic data allocation and expression evaluation. Data operation instructions were 0-operand RPN byte codes.

Forth Engine

Category: ML0 Implementation: Discrete LS-TTL Application: Execution of the Forth programming language Reference: Winkel (1981)

The Forth Engine was a discrete TTL microcoded stack processor for the Forth language. In addition to a hardware stack for evaluation and subroutine parameter passing and a hardware stack for return address storage, this processor featured a 60-bit writable control store for microcode.

Fortran Machine

Category: MS0 Implementation: Conceptual design Application: Direct execution of the FORTRAN language Manufacturer and time: University of Science and Technology of China, PRC, 1980 Reference: Chen et al. (1980)

This paper presents a conceptual design for a direct execution FORTRAN machine. The proposed machine would have several hardware stacks for return address, loop limit and branch address, and expression evaluation storage. While the implementation method was not specified, isolated memory space stacks would certainly be appropriate to reduce memory traffic. As with most other direct execution machines, stacks were mandatory to support program parsing.

FRISC 3

Category: ML0 Implementation: 32-bit 2-micron silicon compiler CMOS microprocessor Application: General purpose space-borne computing and control, optimized for the Forth language Manufacturer and time: Johns Hopkins University, 1986 Reference: Fraeman et al. (1986), Hayes (1986), Hayes & Lee (1988), Hayes et al. (1987), Williams et al. (1986)

The Johns Hopkins University / APL Forth processing chip is designed for spacecraft processing applications. The chip executes Forth primitives, and allows multiple operations to be compacted into microcode-like fields in the instruction. Although Forth is a 0-operand language, the chip allows selecting any of the top 4 stack elements to be used with the top stack element for an operation, thus making it a "1/2-operand" machine. The on-chip data and return stacks are rather small: 16 elements each, forced mostly by technology constraints.

G-Machine

Category: SL0 Implementation: 32-bit processor simulation Application: Graph reduction Manufacturer and time: Oregon Graduate Center, 1985 Reference: Kieburtz (1985)

The G-Machine was specially built to perform graph reduction in support of executing functional programming languages. It executed G-code, which was a zero-address machine language designed to manipulate its single stack. Program memory was highly structured to support the requirements of graph reduction. Each memory word included four fields used for reference counting and two 32-bit cells used for graph pointers.

GLOSS

Category: SS0 Implementation: Conceptual design Application: Multiple communicating processors Manufacturer and time: University of Washington, 1973 Reference: Herriot (1973)

The GLOSS conceptual design was an attempt to define a generic high level language machine for a variety of languages, including ALGOL 68, LISP 1.5, and SNOBOL 4. It was based on using a demand driven data-flow system where sub-processes were invoked on multiple parallel processors in a manner similar to procedure calls. Each processor had a set of evaluation stacks resident in memory.

HITAC-10

Category: SS0 Implementation: Add-on hardware to a minicomputer Application: Experimental minicomputer addition Manufacturer and time: Keio University, Japan, 1974 Reference: Ohdate et al. (1975)

The stack hardware discussed in this paper was back-fitted onto an existing HITAC-10 minicomputer. In order to simplify the design and construction, the stack hardware was added as an I/O device on the system bus. The stack controller had four top-of-stack registers. All extra elements were stored in an area of memory using DMA. The controller had two stack limit pointers for memory bounds checking. The stack controller was used for subroutine parameter passing; no arithmetic could be performed on stack elements.

HP300 & HP3000

Category: SS1 Implementation: 16-bit minicomputer family Application: General purpose multi-user computing Manufacturer and time: Hewlett Packard, 1976-1980s Reference: Bartlett (1973), Bergh & Mei (1979), Blake (1977)

The HP3000 family is a commercial line of minicomputers based on a 1-operand stack architecture. The origins of the family may be found in the HP300 computer, which could be considered a 3-address machine that had two top-of-stack registers buffering a program memory resident stack. Later, the HP3000 series used a stack / accumulator addressing mode, and included four top-of-stack registers. The stacks were featured in the architectures to ease implementation of reentrancy, recursion, code sharing, program protection, and dynamic storage allocation in a multi-user environment. The stack is used not only for expression evaluation, but also for parameter passing and subroutine return address storage.

HUT

Category: MS0 Implementation: 16-bit AM2903 bit-sliced processor Application: Spacecraft experiment control, optimized for the Forth language Manufacturer and time: Johns Hopkins University, Applied Physics Laboratory, 1982 Reference: Ballard (1984)

The HUT processor was designed to control the Hopkins Ultraviolet Telescope (HUT) Space Shuttle experiment. At the time it was designed, no space-qualified microprocessors were powerful enough for the task, so a bit-sliced processor was custom designed for the job. The designers chose to implement a Forth language processor for simplicity of implementation, extensibility, and flexibility.

ICL2900

Category: SS1 Implementation: Family of minicomputers Application: General purpose computing Manufacturer and time: ICL, 1975 Reference: Keedy (1977)

The designers of the ICL family were concerned with protection and code sharing in a multiprogrammed environment, as well as efficient compilation and execution with compact object code. While often compared with the contemporary Burroughs machines, the ICL machines had several distinct characteristics.

Intel 80x86

Category: SS2 (when used in stack mode) Implementation: Family of 16 and 32-bit microprocessors Application: General purpose computing Manufacturer and time: Intel Corporation, 1980s Reference: Intel (1981)

The 80x86 processor family, which includes the 8088, 8086, 80186, 80286, and 80386, is a family of microprocessors with a general purpose register architecture. Simple PUSH and POP instructions are supported to manipulate the stack. Many high level language compilers produce code that uses the BP (base pointer) register as a frame pointer to a combined return address and parameter passing stack. When used in this mode, the 80x86 family can be considered to be doing stack processing. In the context of stack computers, the 80x86 is simply included as a representative example of a conventional machine that can be used as an SS2 architecture.

Internal Machine

Category: MS0 Implementation: Conceptual design Application: Directly interpretable languages Manufacturer and time: North Electric Co., 1973 Reference: Welin (1973)

The Internal Machine was a conceptual design for a machine that could efficiently execute directly interpretable languages. A stack instruction model was picked for generality. The design specifies two stacks: one for expression evaluation and parameter passing, and a second stack for subroutine return addresses.

IPL-VI

Category: SS1 Implementation: Conceptual Design for Microcoded Interpreter App: General Purpose Computing Manufacturers and Time: Rand Corporation, 1958 Reference: Shaw et al. (1959)

The Information Processing Languages (IPLs) were a series of conceptual language designs for implementing high level programs. IPL-VI was designed to be implemented as an interpreted language with microcode support. IPL-VI emphasized advanced (for 1959) computing structures for non-numerical computing, especially list manipulation. A stack was used to pass parameters between subroutines. Since all memory in the IPL-VI design was formatted as list elements, the subroutine parameter LIFO consisted of a list of elements, each pointing to the next element further down the list. IPL-VI instructions used 1-operand addressing.

ITS (Pascal)

Category: SS0 Implementation: 16-Bit Microprocessor App: Direct Execution of Pascal P-CODE Manufacturers and Time: Nippon Electric Co., 1980 Reference: Tanabe & Yamamoto (1980)

The ITS processor was designed to execute UCSD Pascal P-code. The designers claimed a several-times speedup over fully compiled code on a contemporary microprocessor (presumably an 8086). The ITS had a 256-word stack on-chip, which was apparently only used for expression evaluation. The top two stack elements were kept in registers for speed.

KDF-9

Category: ML0 Implementation: 48-bit mainframe Application: General purpose computing using ALGOL Manufacturer and Time: English Electric, 1960 Reference: Allmark & Lucking (1962), Duncan (1977), Haley (1962)

The KDF-9 was perhaps the first true stack computer. It was inspired by the advent of ALGOL-60, and introduced many of the features found on modern stack computers. The KDF-9 had an expression evaluation stack which could be used for parameter passing, as well as a separate return address stack. Unfortunately, these stacks were limited by technology considerations to only 16 elements apiece (constructed from magnetic cores!). A problem with the design was that while 16 elements is quite sufficient for expression evaluation, the ALGOL compiler was constrained by the 16-element stack depth, causing slow compilation.

Kobe University Machine

Category: ML0 Implementation: 16-bit word AM2903 bit-slice with writable control store Applications: Academic research Manufacturer and Time: Kobe University, Kobe, Japan, 1983 Reference: Kaneda et al. (1983), Wada et al. (1982a), Wada et al. (1982b)

This machine was designed to execute both Forth and Pascal efficiently using a stack architecture. Forth was executed by directly implementing Forth primitives in microcode. Pascal was executed by supporting a UCSD P-code emulator. This machine had separate data memory in addition to program memory .

LAX2

Category: SS0 Implementation: Microcoded Interpreter on Varian V73 Applications: Experimental Manufacturers And Time: Group for Datalogical Research & Royal Institute of Technology, Sweden, 1980 Reference: Bage & Thorelli (1980)

The LAX2 architecture was implemented as a partially microcoded interpreter with the goals of cost effective software production along with good memory and execution time economy for string manipulation and interactive applications. The architecture used tagged data types. Each process in the machine had a private memory area shared between the evaluation stack and a garbage-collected heap for temporary string storage.

Lilith

Category: ML1/MS1 Implementation: 16-bit AM2901 bit-sliced processor Application: Direct execution of Modula-2 M-code and interactive user interfaces Manufacturer and Time: ETH (Swiss Federal Institute of Technology), 1979 Reference: Ohran (1984), Wirth (1979)

Lilith was a Modula-2 execution machine developed by Niklaus Wirth. The goals of the machine were to provide efficient support for the Modula-2 language as well as an effective user interface. A stack architecture was chosen for compact code. The Lilith had two stacks: a program memory resident stack for parameter passing, and a hardware expression evaluation stack The instruction set was stack-based, with the ability to read an element from any location in the parameter stack and operate upon it with the top element of the. Evaluation stack.

Lisp Machines

Category: ML1 Implementation: Various machines Applications: List processing, artificial intelligence research Manufacturer and Time: Various companies (such as Symbolics) Reference: Lim (1987), Moon (1985), Sansonnet et al. (1982)

LISP machine architecture is a whole topic in its own right; this is simply a summary of some of the common characteristics of LISP-specific processors. LISP machines tend to have multiple stacks and 1-address (relative to top of stack) instruction formats. Some machines have substantial hardware stacks (over 1K words) that can overflow into program memory. Procedure calls tend to be very important, because of the recursion commonly used in traversing lists. These machines typically store data elements in a garbage-collected program/data memory.

MCODE

Category: SS1 Implementation: Unspecified Applications: Execution of Modula-2 M-CODE (Using Starmod Language) Manufacturers and Time: University of Wisconsin - Madison, 1980 Reference: Cook, R. & Donde, N. (1982), Cook , R. & Lee, I. (1980)

The MCODE machine was designed to execute Modula-2 M-code, which was compiled from a language called StarMod, a Modula-2 derivative. MCODE was based on Tanenbaum's EM-1 design, but with several improvements to solve problems that arise in real machines. One improvement was the use of a set-mode instruction that changed the interpretation of the data types (integer, floating point, etc.) for all subsequent operations.

Mesa

Category: SS0 Implementation: Architectural family for workstations Applications: Graphics-intensive workstations (Alto, Dorado machines) Manufacturer and Time: Xerox Office Products Division, 1979 Reference: Johnsson & Wick (1982), McDaniel (1982), Sweet & Sandman (1982)

Mesa was actually a modular high level language expanded to include an architecture for a family of processors. The goals of the architecture were efficient implementation, compact encoding, and technology independence. In order to accomplish these goals, the Mesa architecture specified a single stack for use in expression evaluation and parameter passing to procedures, for the purpose of producing compact 0-operand stack instructions. While the stack implementation was not specified, Mesa follows the general pattern of machines with a small stack buffer and the bulk of the stack in program memory. An interesting instruction was the "recover" operation, which recaptured a previously popped stack value that had not yet been overwritten, by doing a stack push without writing a value.

MF1600

Category: ML0 Implementation: TTL Discrete Components, 16-Bit Machine App: General Purpose Processing Manufacturers and Time: XYCOM & Advance Processor Designs, 1987 Reference: Burnley & Harkaway, 1987

The Advanced Processor Design MF1600, which is the processor used on the Xycom XVME-616 product, is a high performance Forth machine design that makes use of fast TTL logic devices. It features a 16-bit data path and a microcode memory ROM that can be customized by the manufacturer for specific applications.

Micro-3L

Category: SL1 Implementation: SIMULATED MACHINE App: Functional Programs Manufacturers and Time: University of Utah, 1982 Reference: Castan & Organick (1982)

The Micro-3L processor project used the 3L-model (Lisp Like Language model) for specifying a processor that is well suited to list processing. The project proposed creating a multiprocessor system to execute functional languages. Each Micro-3L processor was to use a 256-element register file. 128 of the registers were intended to be used as a return address stack, with overflow handled by swapping into main memory. Data manipulations were performed using an accumulator and an operand from the bottom 128 elements of the register file.

Microdata 32/S

Category: SS0 Implementation: Microcode Upgrade to A 16-Bit Register-Oriented MiniComputer. Applications: Running a Version of PL / I Manufacturers and Time: MicroData Corporation, 1973 Reference: Burns & Savitt (1973)

The Microdata 32/S was a version of the Microdata 3200 general-purpose minicomputer that had additional microcode to implement stack instructions. The 3200 system was a 16-bit minicomputer implemented in discrete TTL technology. The reason for adding the stack-based capabilities was that compilers of the time could not produce efficient code. Stack architectures made code generation easier. The reason good code generation was important was to remove the impetus for programming in assembly language. The main memory stack was used for expression evaluation and parameter passing, with up to four stack elements buffered in registers.

MISC M17

Category: MS0 Implementation: 16-Bit 2.0 Micron HCMOS Gate Array App: Low Cost Real Time Control Manufacturers and Time: Minimum Instruction Set Computer, Inc., 1988 Reference: Misc (1988)

The MISC M17 microprocessor is a low cost, embedded micro-controller. The M17 instruction set is based on Forth primitives. In contrast with most other Forth machines, the M17 reduces hardware costs, at some compromise in performance, by keeping its two stacks in program memory with a few top-of-stack buffer registers on-chip.

Motorola 680x0

Category: MS2 (when used in stack mode) Implementation: Family of 32-bit microprocessors Applications: General purpose computing Manufacturer and Time: Motorola Corporation, 1980's Reference: Kane et al. (1981)

The 680x0 processor family, which includes the 68000, 68010, 68020, and 68030, is a family of microprocessors with a general purpose register architecture. Registers are divided into two groups: address registers and data registers. The address registers support postincremented and predecremented addressing modes. This allows a programmer to use up to eight stacks, one stack per address register. By convention, the A7 register is used as the stack frame pointer for most languages. Of course, the 680x0 family is usually not used as a multiple-stack machine, but nonetheless this capability exists.

MU5

Category: SS1 Implementation: MiniComputer App: Research Manufacturers and Time: University of Manchester, 1971 Reference: Morris & Ibbett (1979)

The MU5 used a 1-operand instruction format with a single stack in program memory. Stack instructions were used because they led to easy code generation, compact programs, and easily pipelined hardware. An interesting twist is that there were five registers accessible to the programmer, all of which were simultaneously at the "top-of-stack." Pushing a value into any register pushed the previous register value onto the single stack. This arrangement is subtly different from having the top five stack elements accessible as registers.

NC4016

Category: ML0 Implementation: 16-Bit Gate Array Processor App: Real Time Control, Direct Support for Forth Programming Language Manufacturers and Time: Novix, 1985 Reference: Golden et al. (1985), Jennings (1985), Miller (1987 ), Novix (1985)

The NC4000, later renamed the NC4016, was the first chip designed to execute Forth. Since it is on a gate array, the two hardware stack memories reside off-chip, connected to the processor by dedicated stack busses. The top two data stack elements are buffered on-chip, as is the top return stack element. The processor executes most Forth primitives including subroutine call in a single clock cycle, and allows multiple primitive operations to be compressed into a single instruction in some circumstances.
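The idea of compressing multiple primitive operations into one instruction can be sketched as follows. This is a minimal Python model with a hypothetical bit-field layout invented for illustration — it is not the real NC4016 encoding — but it shows why an unencoded ("horizontal") format lets independent fields, such as an ALU operation and a subroutine return, execute in the same cycle:

```python
# Sketch of an unencoded stack-machine instruction word.
# Hypothetical field layout for illustration -- not the real NC4016 bits.
#   bits 0-3 : ALU operation on the top two data stack elements
#   bit  4   : extra data stack pop after the ALU operation
#   bit  5   : subroutine return (pop the return stack into PC)

ALU_NOP, ALU_ADD, ALU_AND = 0, 1, 2

def execute(word, data, rstack, pc):
    """Execute one instruction word; returns the next PC."""
    alu = word & 0xF
    if alu == ALU_ADD:
        n2, n1 = data.pop(), data.pop()
        data.append(n1 + n2)
    elif alu == ALU_AND:
        n2, n1 = data.pop(), data.pop()
        data.append(n1 & n2)
    if word & (1 << 4):          # optional extra stack pop
        data.pop()
    if word & (1 << 5):          # return folded into the same cycle
        return rstack.pop()
    return pc + 1

# Forth "+ ;" -- add and subroutine return packed into one instruction:
data, rstack = [3, 4], [100]
pc = execute(ALU_ADD | (1 << 5), data, rstack, pc=7)
print(data, pc)                  # [7] 100
```

Because each field controls its own piece of hardware, no decoding conflict prevents the add and the return from being specified together, which is the sense in which the NC4016 compresses several Forth primitives into a single instruction.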

NORMA

Category: SL0 Implementation: Experimental machine using MSI/LSI standard logic and gate arrays Applications: Functional programming/graph reduction Manufacturer and Time: Burroughs Corporation Austin Research Center, 1986 Reference: Scheevel (1986)

The Normal Order Reduction MAchine (NORMA) is a research processor developed by Burroughs for high speed graph reduction operations in support of functional programming languages. Five specialized functional units that handle arithmetic, graph memory, garbage collection, graph processing, and external I/O are connected using a central bus. The graph processor maintains a single stack used during the depth-first traversal of the tree-structured program graphs.

OPA (Pascal)

Category: ML0 Implementation: Emulator Running On Lilith Computer App: Support for Pascal & Modula-2 Programs Manufacturers and Time: Federal Institute of Technology, Zurich, 1984 Reference: Schulthess (1984)

The Object Pascal Architecture (OPA) is a design for a machine that efficiently executes compiled Pascal code. The OPA contains three stacks: one for descriptors and expression evaluation, one for storing subroutine parameters, and one for return addresses. The OPA instruction set is billed as a "reduced high level language" instruction set, since it supports Pascal constructions with a small number of opcodes.

Pascal Machine

Category: ML0 Implementation: Experimental Processor App: Direct Execution Of Tiny-Pascal Source Code Manufacturers and Time: University of Maryland, 1981 Reference: Lor & Chu (1981)

The Pascal interactive computer is an experimental system for direct execution of Pascal source code. Since the system includes a hardware compiler as well as an execution unit, hardware stacks in the system abound. Some of the stacks are used to store return addresses, operator precedence, expression evaluation values, and subprogram nesting levels. Since expressions are evaluated as they are interpreted, the actions taken by the execution unit are the same as would be taken by a 0-operand stack architecture.

PDP-11

Category: MS1 (when used in stack mode) Implementation: Family of mini- and microcomputers (also, later, the VAX family) Applications: General purpose mini-computer Manufacturer and Time: Digital Equipment, 1970 Reference: Bell et al. (1970)

The DEC PDP-11 was an early general-purpose computer to integrate stack usage into a general-purpose register machine. While the machine is clearly register-oriented, it includes as a subset of its capabilities those of a one-address stack machine. By using register-indirect addressing with auto-postincrement and auto-predecrement, a general-purpose register can be used as a stack pointer for an evaluation stack. The PDP-11 also has a stack pointer for use with interrupts, traps, and subroutine calls. Later, the VAX line of computers introduced hardware support for single-stack dynamic frame allocation for block-oriented languages. Of course the PDP-11 is really a general-purpose register machine, but Bell's article describes how it can be used in an MS1 stack mode.

POMP Pascal

Category: SS1 Implementation: Bit-sliced processor (AMD 290x) Applications: Research into emulating intermediate forms for block structured languages Manufacturer and Time: Stanford University, 1980 Reference: Harris (1980)

The Pascal Oriented MicroProcessor (POMP) project used a bit-sliced processor to execute stack code. Stack code was chosen to reduce program size, yielding programs 3 to 8 times smaller than traditional compiler outputs. In fact, the POMP code was claimed to be only 50% larger than Flynn's ideal DEL encoding, but is much easier to decode since it was encoded in byte-wide blocks. The stack machine could access up to 8 local variables for operations, making it a 1-operand machine.

PSP

Category: ML2 Implementation: Architectural Proposal App: General Purpose Computing Manufacturers and Time: University of Illinois, 1985 Reference: Eickemeyer & Patel (1985)

The Parallel Stack Processor (PSP) architecture is an attempt to preserve the function of a normal general purpose register machine yet reap the benefits of having hardware stacks for saving registers on a subroutine call. To accomplish this, the machine hides a stack behind every register in the machine. Whenever a subroutine call is encountered, each register is pushed onto its own stack simultaneously, performing a single-cycle multiple register save. Strictly speaking, this is more of a register machine that has hardware to save registers than a stack processor architecture, but the idea is intriguing for other stack applications.

Pyramid 90x

Category: SL2 Implementation: 32-Bit MiniComputer App: General Purpose Risc Processor Manufacturers And Time: Pyramid Technology, 1983 Reference: Ragan-Kelley & Clark (1983)

The Pyramid 90x was one of the first commercial processors to have many RISC attributes. The 90x uses a register stack that is organized as 16 non-overlapped windows of 32 registers, plus 16 global registers, for a total of 528 registers. The registers are spilled to memory if subroutine nesting is more than 15 levels deep.

QForth

Category: ML0 Implementation: Architectural study Application: Direct support for Forth programming language Manufacturer and Time: Queens College of CUNY, 1984 Reference: Vickery (1984)

The QFORTH architecture was built for multitasking single-user execution of the Forth programming language. The internal architecture included two source busses (which could read the top two elements of the data stack) and a single destination bus to write the top-of-stack back. The stack management unit internally buffered the top stack elements in high speed registers, and allowed a single stack memory to be partitioned into several simultaneously used stacks.

Reduction Language Machine

Category: ML0 Implementation: Laboratory Model App: Execution of Reduction Language Program Manufacturers and Time: GMD Bonn, 1979 Reference: Kluge & Schlutter (1980)

Reduction languages use structures of the form: apply function to argument. These structures are well represented by subtrees, with a function node having children that are its operands. Since the execution of a program involves evaluating these tree structures, three major stacks are central to the operation of the machine. One stack acts as a program source, another as a program sink, and the third as a temporary evaluation stack area. An interesting feature of the machine is that there is no program memory, and the operation of the machine does not involve any addresses as such. All programs are shuffled between the source and sink stack memories.

Rekursiv

Category: ML0 Implementation: 1.5 Micron CMOS Using 3 Gate Arrays Applications: Object Oriented Programming Manufacturers And Time: Linn Products, 1984-88 Reference: Pountain (1988)

Rekursiv is designed for fast execution of object-oriented programs. It supports a very high level instruction set that may be extended using a large amount of off-chip microcode, and has extensive support for memory management designed into the system. An evaluation stack is used for expression evaluation, while a control stack is used for microcode procedure return address storage.

RISC I

Category: SL2 Implementation: 32-bit microprocessor Application: RISC processor for C and other high level languages Manufacturer and Time: University of California, Berkeley, 1981 Reference: Patterson & Piepho (1982), Patterson & Sequin (1982), Sequin & Patterson (1982), Tamir & Sequin (1983)

The RISC I was the first highly publicized RISC computer. It owes a substantial amount of its performance to the use of register windows. The "gold" RISC I chip uses an overlapped register window scheme with 78 registers. At any given time, there are 32 addressable registers: 10 global registers, 6 registers shared with the calling subroutine, 10 private registers, and 6 registers used to pass parameters to subroutines at the next deeper nesting level. The registers are accessed using normal 2-operand register-to-register instructions. The RISC I allows accessing the contents of a register as a memory location by automatically mapping the memory access into the register space. This solves the up-level addressing problem that can occur in languages like Pascal.

Rockwell MicroControllers

Category: MS0 Implementation: Forth-in-ROM on 6502 and 68000 MicroControllers. Applications: Embedded Controllers That Run Forth Programs. Manufacturers and Time: Rockwell International, 1983 Reference: Dumse (1984)

While not strictly speaking hardware-supported stack machines, microcontrollers that have Forth burned into their ROMs are an interesting member of the stack-based computer family. The R65F11, based on the 6502 processor, and the F68K processor, based on the 68200 microcontroller of the 68000 processor family, are general purpose microcontrollers that come with preprogrammed Forth primitives. These chips in effect emulate a two-stack Forth engine, using variables and program memory to provide the emulation. Other dedicated Forth microcontrollers have been made since (including the Zilog Super8 chip), but Rockwell was the first to do it.

RTX 2000

Category: ML0 Implementation: 16-bit, 2 micron standard cell CMOS microprocessor Applications: Semicustom design for application-specific designs; optimized for the Forth programming language Manufacturer and Time: Harris Semiconductor, 1987-89 Reference: Danile & Malinowski (1987), Harris Semiconductor (1988a), Harris Semiconductor (1988b), Jones et al. (1987)

The RTX (Real Time Express) is a macrocell in the Harris standard cell library. This allows the processor to be built as a stand-alone microprocessor, or as an integrated microprocessor with I/O devices, hardware multiplier, and stack memory on-chip. The instruction set directly corresponds to Forth programming language primitives. The design uses an unencoded instruction format that allows multiple operations to be compacted into each instruction. As with many Forth processors, the RTX 2000 supports single-cycle subroutine calls.

RTX 32P

Category: ML0 Implementation: 32 bits, 2.5 micron CMOS Applications: Stack-based processing for real time control and expert systems Manufacturer and Time: Harris Semiconductor and WISC Technologies, 1987-89 Reference: Koopman (1987c), Koopman (1987d), Koopman (1989)

The Harris RTX 32P is a prototype 32-bit stack processor chip set. A unique feature of the RTX 32P is the combination of an opcode with a next-address field in every instruction. This allows zero-cost subroutine calls, returns, and unconditional branches by overlapping the next-address computation with instruction execution. The system can execute one opcode and one subroutine call each memory cycle.

RUFOR

Category: ML0 Implementation: 16-Bit AM2901 Bit-Slice Microcoded Processor App: Research Processor for Forth Language Manufacturers And Time: Wright State University, 1984 Reference: Grewe & Dixon (1984)

The RUFOR system is a conventional bit-sliced approach to building a machine optimized for the Forth programming language. There are two hardware stacks, one for data and one for return addresses. The top entry of each stack is held in one of the 2901 internal registers, so that only a single input bus to the ALU and a single output bus back to the stacks are required.

SF1

Category: ML2 Implementation: 3-Chip, 32-Bit Microprocessor Using 3 Micron CMOS App: High Level Language Support for Real Time Control Manufacturers and Time: Wright State University, 1987-88 Reference: Dixon (1987), Longway (1988 )

The SF1 (which stands for Stack Frame computer number 1) is an experimental multi-stack processor designed to efficiently execute high level languages, including both Forth and C. The current implementation has five stacks, any two of which may be selected as the source and destination for an instruction. The SF1 allows arbitrary access to its stack elements by using a 13-bit address, relative to the top stack element, in the instruction format.

SOAR

Category: SL2 Implementation: Microprocessor App: Support for SmallTalk-80 Language Manufacturers and Time: University of California, Berkeley, 1984 Reference: Bush et al. (1987)

The Smalltalk On A RISC project (SOAR) modified the Berkeley RISC II architecture to adapt it to Smalltalk-80. Since Smalltalk-80 is a stack-oriented bytecode language, this is an exercise in mapping stack code onto a register-oriented RISC, which in turn has its registers arranged in an overlapped window register stack. The window size of the register stack was only 16 registers, half that of RISC II, since Smalltalk methods tend to be smaller than procedures in traditional programming languages.

SOCRATES

Category: ML2 Implementation: Conceptual Design App: Use of Bubble Memories for Main Program Storage Manufacturers And Time: University of Massachusetts / Amherst, 1975 Reference: Foster (1975)

SOCRATES (Stack-Oriented Computer for Research and Teaching) was a design that proposed using magnetic bubble memories as its main storage. At the time of the design, bubble memories were projected to cost 100 times less per bit than other memories. The only problem was that they could only be accessed sequentially. SOCRATES took advantage of this situation by proposing 64 addressable registers of 32 bits, with each register being the top element of a 32K word bubble memory configured as a LIFO stack.

Soviet Machine

Category: ML1 Implementation: Conceptual design Application: Execution of block-structured languages Manufacturer and Time: Academy of Sciences of the USSR, 1968 Reference: Myamlin & Smirnov (1969)

This paper presented a design for a stack computer for executing block-structured languages. The design had two stacks: one for holding arithmetic operations and one for holding operands. While not a directly interpreting machine, it was apparently intended to have source programs maintained in infix format, with infix-to-postfix conversion done on the fly. The stacks could be addressed as part of programmed operation.

SYMBOL

Category: MS0 Implementation: Discrete TTL prototype machine Application: Research Manufacturer and Time: Iowa State University, 1971 Reference: Ditzel & Kwinn (1980), Hutchison & Ethington (1973)

The SYMBOL project constructed an operational computer using no software. The editor, debugger, and compiler were all implemented using random logic circuits. User programs were entered in source code, then compiled and executed using hardwired control circuits. The compilation unit transformed code into a stack-based intermediate form before execution. Several other stacks were used elsewhere as required by the compiler.

Transputer

Category: SS0 Implementation: Family of 16- and 32-Bit Microprocessors Applications: Parallel Processing Manufacturers And Time: Inmos Limited, 1983 Reference: Whitby-Strevens (1985)

The Transputer is a single-chip microprocessor system designed for parallel processing. Since replicating a complete processor with memory and peripherals is very expensive, the Transputer attempts to squeeze an entire functional system onto a single chip to hold costs down for systems with large numbers of processors. This constraint places program memory space at a premium, so a stack-based instruction set was selected to reduce program size. The Transputer uses 3 registers to form an expression evaluation stack.

TM

Category: ML0 Implementation: Simulated design Applications: Research Manufacturer and Time: Carnegie Mellon University, 1980 Reference: Harbison (1982)

The Tree Machine (TM) architecture was an attempt to make compilers simpler by performing common compiler optimizations using a value cache. This cache would do common subexpression elimination and invariant code motion in hardware by caching the results of recently computed expressions. A stack-based architecture was chosen because this allowed better operation with the value caching hardware and eliminated the compiler complexity associated with register allocation. The TM used two stacks: a data stack for expression evaluation, and a control stack for dependency information and return address storage.

Tree Machine

Category: MS0 Implementation: Conceptual Design App: Executing Block-Structured Languages, Manufacturers and Time: Massey University, New Zealand, 1971 Reference: Doran (1972)

Doran's tree machine recognized that good programs have an inherent tree structure, and was tailored to execute these well-structured programs. The machine had three stacks resident in program memory: a control stack for return addresses, a value stack to store intermediate results for non-tree-leaf nodes, and a data stack for scratch storage allocation. An interesting feature of the machine was that conditional branches are not required. All conditional execution was accomplished with a conditional procedure return to the parent program node.

Vaughan & Smith's Machine

Category: ML0 Implementation: Conceptual Design App: Support for Forth Programming Language Reference: Vaughan & Smith (1984)

This paper discusses the design of a Forth-based computer. The architecture was chosen because Forth is good at representing the tree nature of structured programs. Forth's small subroutine size allows good code compaction through subroutine reuse. The proposed design featured two independent hardware stacks. The return stack had one top-of-stack register, while the data stack had two registers.

WD9000 P-Engine

Category: SS0 Implementation: 5-chip LSI set Application: Direct execution of Pascal P-code Manufacturer and Time: Western Digital, 1979 Reference: O'Neill (1979)

The Western Digital Pascal micro-engine (the WD9000 chip set) was built to execute Pascal P-code. Since P-code presumes the existence of a single data stack, the WD9000 supported a single program memory resident stack for expression evaluation and parameter passing.

WISC CPU/16

Category: ML0 Implementation: Discrete LS-TTL, 16-Bit Data Paths Applications: Stack-Based Processing Manufacturers and Time: Wisc Technologies, 1986 Reference: Haydon & Koopman (1986), Koopman (1986), Koopman (1987B), Koopman & Haydon (1986)

The WISC CPU/16 is a user-microcodable processor with a Forth-language machine heritage. It has both a data stack and a return address stack. Additionally, it has a 2K word by 30 bit writable control store for user-defined microcode.

WISC CPU/32

Category: ML0 Implementation: 32 BITS, Discrete TTL App: Stack-Based Processing for Real Time Control and Expert Systems. Manufacturer and Time: Wisc Technologies, 1986-87 Reference: Koopman (1987C), Koopman (1987D)

The WISC CPU/32 is the discrete TTL system upon which the Harris RTX 32P is based. The RTX 32P and CPU/32 are microcode and instruction set compatible.

Other Resources

Bulman (1977): A general tutorial on stack architectures, with emphasis on the B5500 and HP3000 as example architectures. Also mentions the Data General Eclipse and PDP-11 as examples of conventional machines incorporating some stack concepts.

Carlson (1975): A good survey of various high level language computer architectures, many of which are stack-oriented.

Doran (1975): Reviews the use of stacks for expression evaluation, data tree traversal, and subroutine return address saving, concentrating on the Burroughs architectures.

McKeeman, W. (1975): A comprehensive tutorial on the operation of stack-based computers and the role of stacks in general purpose computing.

Myers (1982): While not specifically about stack machines, this text describes many of the architectural innovations that are pertinent to discussions of stack machines. Of particular interest is the discussion of the semantic gap.

Siewiorek, Bell & Newell (1982): This computer architecture text contains many chapters on stack machines.

Yamamoto, M. (1981): Gives a list of high level language machines developed in Japan, many of which are stack-oriented.

Appendix B Forth Original Summary

The Forth language is based on an extensible, interactive compiler that generates code for a virtual stack computer. This virtual machine has two stacks. The data stack is used for expression evaluation and subroutine parameter passing. The return stack is used to save subroutine return addresses and loop control variables. Forth source code directly reflects the underlying stack computer: it uses reverse Polish notation (RPN) for all operations.

Forth programs are built from subroutines, each of which is called a "word" in Forth terminology. A program consists of words that call other words, forming a tree structure. At the lowest level, the leaves of the tree are Forth primitives, which manipulate the stacks and perform computation.

Below are the Forth primitives discussed in this book. Most of these primitives can be used when writing programs in any language (for example, adding the top two stack elements, or swapping the order of the top two stack elements). Forth terminology is used in the discussion to ensure consistency with the standard words of existing stack computers. Each Forth word is followed on the same line by a "stack picture" that describes the word's input and output parameters on the Forth data stack. Values before the "-" are input parameters; values after it are output parameters. In the parameter list, the top of the stack is on the right. N1, N2, N3, etc. denote single-precision integers. D1, D2, etc. denote double-precision integers, which occupy two data stack locations. ADDR is an address, which may be thought of as a pointer. FLAG is an integer that is false if 0 and true otherwise.
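As an illustration (a Python sketch, not from the original text), the stack pictures above can be modeled with a list whose right-hand end is the top of stack:

```python
# Model of the Forth data stack as a Python list whose right-hand
# end is the top of stack, matching the "top at the right" convention
# of the stack pictures.
stack = []

def push(n):     # ( - n )
    stack.append(n)

def dup():       # ( n1 - n1 n1 )
    stack.append(stack[-1])

def swap():      # ( n1 n2 - n2 n1 )
    stack[-1], stack[-2] = stack[-2], stack[-1]

def plus():      # ( n1 n2 - n3 )
    n2, n1 = stack.pop(), stack.pop()
    push(n1 + n2)

# The RPN phrase "3 4 + DUP" leaves ( - 7 7 ):
push(3); push(4); plus(); dup()
assert stack == [7, 7]
```

The helper names (`push`, `plus`, etc.) are invented for the sketch; only the stack behavior mirrors the glossary entries that follow.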

0 - 0

Put the integer 0 on the stack

0< N1 - FLAG

If N1 is negative, return a true FLAG.

0= N1 - FLAG

If N1 is 0, return a true FLAG.

0> N1 - FLAG

If N1 is greater than zero, return a true FLAG.

0BRANCH N1 -

If N1 is false (value 0), branch to the address in the next program cell; otherwise continue execution.

1+ N1 - N2

Add 1 to N1, returning N2.

1- N1 - N2

Subtract 1 from N1, returning N2.

2+ N1 - N2

Add 2 to N1, returning N2.

2* N1 - N2

Multiply N1 by 2, returning N2.

2/ N1 - N2

Divide N1 by 2, returning N2.

4+ N1 - N2

Add 4 to N1, returning N2.

< N1 N2 - FLAG

If N1 is less than N2, return a true FLAG.

<> N1 N2 - FLAG

If N1 is not equal to N2, return a true FLAG.

= N1 N2 - FLAG

If N1 is equal to N2, return a true FLAG.

>R N1 -

Push N1 onto the return stack.

> N1 N2 - FLAG

If N1 is greater than N2, return a true FLAG.

! N1 ADDR -

Store N1 at program memory location ADDR.

+ N1 N2 - N3

Add N1 and N2, giving the sum N3.

+! N1 ADDR -

Add N1 to the value at memory location ADDR, storing the result at that location.

- N1 N2 - N3

Subtract N2 from N1, giving the difference N3.

: -

Begin a subroutine definition. Each time this subroutine is referenced by another definition, the primitive [CALL] is compiled.

; -

Perform a subroutine return and end a subroutine definition. The primitive [EXIT] is compiled.
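The [CALL]/[EXIT] mechanism can be sketched (a Python illustration, not the book's actual implementation) as a threaded-code interpreter: a colon definition compiles to a list of word references ending in EXIT; entering a definition saves the caller's resume point on the return stack, and EXIT pops it:

```python
# Sketch of ":" and ";" behavior: a colon definition is a threaded
# list of word references ending in EXIT.  Entering a definition
# pushes the caller's resume point on the return stack ([CALL]);
# EXIT pops it ([EXIT]).  Names here are invented for the example.
data_stack = []

words = {
    "DUP": lambda: data_stack.append(data_stack[-1]),
    "+":   lambda: data_stack.append(data_stack.pop() + data_stack.pop()),
    # : DOUBLE  DUP + ;   compiles to:
    "DOUBLE": ["DUP", "+", "EXIT"],
    "MAIN":   ["DOUBLE", "EXIT"],
}

def run(start):
    ip = (start, 0)                   # instruction pointer: (word, index)
    return_stack = []
    while True:
        body, i = ip
        token = words[body][i]
        ip = (body, i + 1)
        if token == "EXIT":           # [EXIT]: return to the caller
            if not return_stack:
                return
            ip = return_stack.pop()
        elif callable(words[token]):  # primitive: execute directly
            words[token]()
        else:                         # [CALL]: save resume point, enter word
            return_stack.append(ip)
            ip = (token, 0)

data_stack.append(21)
run("MAIN")
assert data_stack == [42]
```

Chapter 3's multi-stack machines implement exactly this call/return traffic in hardware, which is why [CALL] and [EXIT] dominate the frequency counts of Appendix C.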

?DUP N1 - N1 N1 (if N1 is non-zero)

N1 - N1 (if N1 is zero)

If N1 is not 0, duplicate the top stack element; otherwise leave the data stack unchanged.

@ ADDR - N1

Fetch the value at program memory location ADDR, returning N1.

ABS N1 - N2

Take the absolute value of N1, returning the result N2.

AND N1 N2 - N3

Perform a bitwise AND of N1 and N2, giving the result N3.

BRANCH -

Perform an unconditional branch to the address in the next program cell.

D! D1 ADDR -

Store the double-precision value D1 at the program memory locations beginning at ADDR.

D+ D1 D2 - D3

Add the double-precision values D1 and D2, giving the sum D3.

D@ ADDR - D1

Fetch the double-precision value D1 from the program memory locations beginning at ADDR.

DDROP D1 -

Drop the double-precision integer D1 from the stack.

DDUP D1 - D1 D1

Duplicate the double-precision value D1 on the stack.

DNEGATE D1 - D2

Return D2, the two's complement of D1.

DROP N1 -

Drop N1 from the stack.

DSWAP D1 D2 - D2 D1

Swap the two double-precision numbers on top of the stack.

DUP N1 - N1 N1

Duplicate N1, the top stack element.

I - N1

Return the index of the currently active loop.

I' - N1

Return the limit of the currently active loop.

J - N1

Return the index of the next outer loop in a nested loop structure.

LEAVE -

Force the current loop to exit at the next iteration by setting the loop index equal to the loop limit.

LIT - N1

Treat the compiled in-line value as an integer, pushing it onto the data stack as N1.

NEGATE N1 - N2

Return N2, the two's complement of N1.

NOP -

Do nothing.

NOT FLAG1 - FLAG2

Complement FLAG1, returning FLAG2.

OR N1 N2 - N3

Perform a bitwise OR of N1 and N2, giving the result N3.

OVER N1 N2 - N1 N2 N1

Copy N1, the second stack element, pushing it as the new top of stack.

PICK ... N1 - ... N2

Copy the N1-th data stack element to the top of the stack as N2. In Forth-83, 0 PICK is equivalent to DUP, and 1 PICK is equivalent to OVER.

R> - N1

Pop the top element of the return stack, pushing it onto the data stack as N1.

R@ - N1

Copy the top element of the return stack onto the data stack as N1.

ROLL ... N1 - ... N2

Move the N1-th data stack element to the top of the stack as N2, closing the vacancy left on the stack. In Forth-83, 1 ROLL is equivalent to SWAP, and 2 ROLL is equivalent to ROT.
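The PICK/ROLL equivalences above can be checked with a small Python sketch (list right end = top of stack; indexing assumed from the Forth-83 definitions):

```python
# PICK copies the n-th element (0 = top) to the top of stack;
# ROLL moves it there, closing the gap it leaves behind.
def pick(stack, n):
    stack.append(stack[-(n + 1)])

def roll(stack, n):
    stack.append(stack.pop(-(n + 1)))

s = [1, 2, 3]
pick(s, 0)                 # 0 PICK acts like DUP
assert s == [1, 2, 3, 3]

s = [1, 2, 3]
pick(s, 1)                 # 1 PICK acts like OVER
assert s == [1, 2, 3, 2]

s = [1, 2, 3]
roll(s, 1)                 # 1 ROLL acts like SWAP
assert s == [1, 3, 2]

s = [1, 2, 3]
roll(s, 2)                 # 2 ROLL acts like ROT
assert s == [2, 3, 1]
```

Note that ROLL's "close the vacancy" step is what makes it expensive on real stack hardware, while PICK only needs a read.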

ROT N1 N2 N3 - N2 N3 N1

Pull the third stack element to the top of the stack.

S->D N1 - D2

Sign-extend N1 to occupy two words, making it a double-precision value D2.
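A sketch of the S->D sign extension, assuming a 16-bit cell size (an assumption for illustration; the cell width depends on the machine): the high cell of the double number is all ones for a negative value and zero otherwise.

```python
# S->D: sign-extend a single number into a (low, high) cell pair,
# assuming 16-bit cells.
CELL_BITS = 16
CELL_MASK = (1 << CELL_BITS) - 1

def s_to_d(n1):
    # ( n1 - d2 ), returned here as (low cell, high cell)
    high = CELL_MASK if n1 < 0 else 0
    return n1 & CELL_MASK, high

assert s_to_d(5)  == (5, 0)
assert s_to_d(-1) == (0xFFFF, 0xFFFF)
```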

SWAP N1 N2 - N2 N1

Exchange two stack top elements

U< U1 U2 - FLAG

If the unsigned number U1 is less than U2, return a true FLAG.

U> U1 U2 - FLAG

If the unsigned number U1 is greater than U2, return a true FLAG.

U* N1 N2 - D3

Perform unsigned multiplication of N1 and N2, giving the unsigned double-precision result D3.

U/MOD D1 N2 - N3 N4

Perform unsigned integer division of the double-precision value D1 by the single-precision value N2, giving the quotient N4 and the remainder N3.
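The double-by-single division of U/MOD can be sketched in Python, again assuming a 16-bit cell size for illustration:

```python
# U/MOD: divide a double-precision dividend (two unsigned 16-bit
# cells, low and high) by a single-precision unsigned divisor,
# giving the remainder N3 and quotient N4.
CELL = 1 << 16

def u_div_mod(d1_lo, d1_hi, n2):
    d1 = d1_hi * CELL + d1_lo        # reassemble the double number
    return d1 % n2, d1 // n2         # ( d1 n2 - n3 n4 )

rem, quot = u_div_mod(0x5678, 0x1234, 1000)
assert quot * 1000 + rem == 0x12345678
assert 0 <= rem < 1000
```

On real hardware the quotient must also fit in a single cell, a restriction this sketch does not enforce.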

XOR N1 N2 - N3

Perform a bitwise exclusive OR of N1 and N2, giving the result N3.

Appendix C Instruction Frequency Statistics

Below are the dynamic instruction frequencies not discussed in Chapter 6.

Name Frac Life Math Compile Ave

! 1.89% 0.00% 0.71% 0.98% 0.90%

* 0.00% 0.00% 0.02% 0.05% 0.02%

+ 3.41% 10.45% 0.60% 2.26% 4.18%

+! 0.00% 0.00% 0.1% 0.83% 0.24%

, 0.34% 0.00% 0.00% 0.02% 0.09%

- 0.97% 1.24% 0.08% 1.94% 1.06%

/ 0.07% 0.00% 0.00% 0.05% 0.03%

0< 1.84% 0.00% 0.66% 0.05% 0.64%

0= 0.00% 0.00% 0.77% 0.00% 0.19%

0> 0.00% 0.00% 0.09% 0.02% 0.03%

0BRANCH 3.39% 6.38% 3.23% 6.11% 4.78%

1+ 1.72% 0.08% 0.01% 1.36% 0.79%

1- 0.41% 0.00% 0.54% 0.01% 0.24%

2* 2.11% 2.05% 0.02% 0.64% 1.21%

2+ 0.49% 0.00% 0.19% 0.66% 0.34%

2- 0.07% 0.00% 0.00% 1.02% 0.27%

2/ 0.92% 0.00% 0.00% 0.01% 0.23%

< 0.11% 0.08% 0.01% 1.08% 0.32%

(LOOP) 0.00% 0.00% 0.00% 0.00% 0.00%

0.20% 0.00% 0.01% 0.18% 0.10%

(CMOVE) 0.00% 0.00% 0.00% 0.00% 0.00%

0.00% 0.00% 0.00% 0.56% 0.14%

0.23% 0.00% 0.09% 0.02% 0.09%

0.00% 0.00% 0.00% 0.00% 0.00%

0.00% 0.00% 0.00% 0.84% 0.21%

1.44% 3.32% 1.08% 0.01% 1.46%

= 0.33% 4.48% 0.01% 1.87% 1.67%

> 0.62% 0.08% 0.06% 1.19% 0.49%

>R 2.05% 0.00% 11.28% 2.16% 3.87%

? DUP 0.00% 0.00% 0.00% 1.11% 0.28%

?STACK 0.00% 0.00% 0.00% 0.49% 0.12%

@ 7.49% 2.05% 0.96% 11.09% 5.40%

ABS 0.51% 0.00% 0.01% 0.01% 0.13%

ADC 0.00% 0.00% 2.53% 0.00% 0.63%

AND 0.17% 3.12% 3.14% 0.04% 1.61%

ASR 0.00% 0.00% 0.88% 0.00% 0.22%

BRANCH 1.61% 1.57% 0.72% 2.26% 1.54%

C! 0.07% 0.36% 0.03% 0.87% 0.33%

C@ 0.00% 7.52% 0.01% 0.36% 1.97%

CALL 11.16% 12.73% 12.59% 12.36% 12.21%

CONSTANT 3.92% 3.50% 2.78% 4.50% 3.68%

CONVERT 0.00% 0.00% 0.00% 0.04% 0.01%

D! 0.21% 0.00% 0.59% 0.00% 0.20%

D+ 1.15% 0.00% 0.54% 0.00% 0.42%

D- 0.07% 0.00% 0.03% 0.02% 0.03%

D< 0.00% 0.00% 0.00% 0.00% 0.00%

D@ 0.21% 0.00% 0.62% 0.00% 0.21%

DDROP 2.08% 0.52% 0.1% 0.35% 0.77%

DDUP 1.86% 0.00% 1.16% 0.84% 0.97%

DIGIT 0.00% 0.00% 0.00% 0.00% 0.00%

DNEGATE 0.00% 0.00% 0.1% 0.00% 0.03%

DOVER 0.00% 0.00% 0.91% 0.00% 0.23%

DROP 3.08% 0.16% 0.68% 1.04% 1.24%

DROT 0.00% 0.00% 0.17% 0.00% 0.04%

DSWAP 0.00% 0.00% 0.92% 0.00% 0.23%

DUP 4.08% 0.45% 1.88% 5.78% 3.05%

ENCLOSE 0.00% 0.00% 0.00% 0.58% 0.15%

EXECUTE 0.14% 0.00% 0.02% 2.45% 0.65%

EXIT 11.07% 12.72% 12.55% 10.60% 11.74%

I 0.58% 6.66% 0.01% 0.23% 1.87%

I' 0.00% 0.00% 0.00% 0.00% 0.00%

J 0.16% 0.08% 0.00% 0.00% 0.06%

LEAVE 0.00% 0.00% 0.00% 0.00% 0.00%

LIT 3.94% 5.22% 4.92% 4.09% 4.54%

LSL 0.00% 0.00% 0.04% 0.00% 0.01%

LSR 0.00% 0.00% 0.96% 0.00% 0.24%

MAX 0.00% 0.00% 0.00% 0.01% 0.00%

MIN 0.00% 0.00% 0.05% 0.00% 0.01%

NEGATE 0.52% 0.00% 0.00% 0.00% 0.13%

NOT 0.00% 0.00% 0.69% 0.25% 0.24%

OR 0.00% 0.08% 1.41% 0.64% 0.53%

OVER 1.23% 1.75% 1.24% 0.89% 1.28%

PICK 1.92% 0.00% 0.53% 0.09% 0.64%

R> 2.05% 0.00% 11.28% 2.23% 3.89%

R @ 0.14% 0.00% 0.02% 0.71% 0.22%

RLC 0.00% 0.00% 0.01% 0.00% 0.00%

ROLL 0.21% 0.00% 0.81% 0.00% 0.26%

ROT 4.05% 0.00% 4.61% 0.48% 2.29%

RP! 0.00% 0.00% 0.00% 0.00% 0.00%

RP @ 0.00% 0.00% 0.00% 0.00% 0.00%

RRC 0.00% 0.00% 0.00% 0.00% 0.00%

S->D 0.07% 0.00% 0.00% 0.01% 0.02%

SP @ 0.00% 0.00% 0.00% 0.05% 0.01%

SWAP 4.43% 2.99% 7.00% 1.17% 3.90%

TOGGLE 0.00% 0.06% 0.00% 0.08% 0.04%

TRAVERSE 0.00% 0.00% 0.00% 0.05% 0.01%

U* 0.62% 0.00% 0.34% 0.01% 0.24%

U/MOD 0.60% 0.00% 0.01% 0.05% 0.17%

U< 0.00% 0.00% 0.00% 0.00% 0.00%

USER 0.07% 0.00% 0.06% 8.59% 2.18%

VARIABLE 7.63% 10.30% 2.26% 1.65% 5.46%

XOR 0.29% 0.00% 0.24% 0.01% 0.14%

Instructions: 2051600 1296143 6133519 447050
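The Ave column in the table above is the unweighted mean of the four per-program frequencies. A quick check (a Python aside, not from the original text) against the dynamic CALL row:

```python
# Ave = unweighted mean of the four benchmark frequencies.
# Values are the dynamic CALL row from the table above.
frac, life, math_prog, compile_prog = 11.16, 12.73, 12.59, 12.36
ave = (frac + life + math_prog + compile_prog) / 4
assert round(ave, 2) == 12.21   # the table lists 12.21% for CALL
```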

Below are the static instruction frequencies discussed in Chapter 6.

Name Frac Life Math Bench Ave

! 3.28% 2.12% 0.90% 2.99% 2.32%

* 0.00% 0.21% 0.00% 0.43% 0.16%

+ 3.28% 2.97% 0.76% 4.61% 2.90%

+! 0.00% 0.00% 0.18% 0.17% 0.09%

, 0.14% 0.00% 0.00% 0.09% 0.06%

- 2.05% 1.91% 0.58% 1.54% 1.52%

/ 0.14% 0.00% 0.00% 0.09% 0.06%

0< 0.96% 0.00% 0.65% 0.68% 0.57%

0= 0.00% 0.00% 0.1% 0.26% 0.10%

0> 0.00% 0.00% 0.47% 0.00% 0.12%

0BRANCH 3.01% 2.55% 3.67% 3.16% 3.10%

1+ 0.41% 0.64% 0.72% 0.51% 0.57%

1- 1.09% 0.42% 0.54% 1.28% 0.83%

2* 1.92% 2.12% 0.14% 1.79% 1.49%

2+ 0.27% 0.00% 0.1% 0.34% 0.18%

2- 0.27% 0.00% 0.00% 0.34% 0.15%

2/ 0.96% 0.00% 0.00% 0.77% 0.43%

< 0.14% 0.42% 0.47% 0.34% 0.34%

(LOOP) 0.27% 0.21% 0.04% 0.26% 0.20%

0.27% 0.00% 0.00% 0.17% 0.11%

(CMOVE) 0.00% 0.00% 0.00% 0.00% 0.00%

0.00% 0.00% 0.00% 0.00% 0.00%

1.92% 2.34% 0.61% 1.96% 1.71%

0.00% 0.00% 0.00% 0.00% 0.00%

0.00% 0.00% 0.00% 0.00% 0.00%

1.37% 2.12% 0.58% 1.54% 1.40%

= 0.14% 2.76% 0.29% 0.26% 0.86%

> 1.23% 0.21% 0.32% 1.11% 0.72%

>R 0.55% 0.00% 4.1% 0.77% 1.36%

?DUP 0.00% 0.00% 0.04% 0.00% 0.01%

?STACK 0.00% 0.00% 0.07% 0.09% 0.04%

@ 10.81% 1.27% 1.40% 8.88% 5.59%

ABS 0.27% 0.00% 0.18% 0.17% 0.16%

ADC 0.00% 0.00% 0.07% 0.00% 0.02%

AND 0.27% 1.06% 0.54% 0.43% 0.58%

ASR 0.00% 0.00% 0.1% 0.00% 0.03%

BRANCH 1.92% 0.85% 2.09% 2.05% 1.73%

C! 0.00% 1.49% 0.04% 0.68% 0.55%

C@ 0.00% 3.40% 0.61% 0.34% 1.09%

CALL 16.82% 31.44% 37.61% 17.62% 25.87%

CONSTANT 1.23% 1.91% 0.07% 1.62% 1.21%

CONVERT 0.00% 0.00% 0.00% 0.00% 0.00%

D! 0.41% 0.00% 0.18% 0.17% 0.19%

D+ 0.55% 0.21% 0.25% 0.51% 0.38%

D- 0.00% 0.00% 0.14% 0.00% 0.04%

D< 0.00% 0.00% 0.14% 0.00% 0.04%

D@ 0.27% 0.00% 0.32% 0.17% 0.19%

DDROP 2.60% 0.42% 0.79% 1.88% 1.42%

DDUP 1.23% 0.21% 0.61% 1.71% 0.94%

DIGIT 0.00% 0.00% 0.1% 0.00% 0.03%

DNEGATE 0.00% 0.00% 0.18% 0.00% 0.05%

DOVER 0.00% 0.00% 0.32% 0.00% 0.08%

DROP 2.60% 0.85% 1.69% 2.31% 1.86%

DROT 0.00% 0.00% 0.29% 0.00% 0.07%

DSWAP 0.00% 0.00% 1.22% 0.00% 0.31%

DUP 4.38% 1.70% 2.84% 4.18% 3.28%

ENCLOSE 0.00% 0.00% 0.00% 0.00% 0.00%

EXECUTE 0.00% 0.00% 0.07% 0.00% 0.02%

EXIT 5.75% 7.22% 9.90% 7.00% 7.47%

I 1.37% 5.10% 0.1% 1.62% 2.05%

I' 0.00% 0.00% 0.07% 0.00% 0.02%

J 0.27% 1.91% 0.07% 0.26% 0.63%

LEAVE 0.00% 0.00% 0.00% 0.09% 0.02%

LIT 11.35% 7.22% 11.02% 8.03% 9.41%

LSL 0.00% 0.00% 0.04% 0.00% 0.01%

LSR 0.00% 0.00% 0.07% 0.00% 0.02%

MAX 0.00% 0.00% 0.1% 0.09% 0.05%

MIN 0.00% 0.00% 0.04% 0.17% 0.05%

NEGATE 0.14% 0.00% 0.04% 0.26% 0.11%

NOT 0.00% 0.00% 0.47% 0.26% 0.18%

OR 0.00% 0.21% 0.61% 0.00% 0.21%

OVER 2.05% 5.10% 0.76% 2.05% 2.49%

PICK 6.29% 0.00% 1.04% 4.53% 2.97%

R> 0.55% 0.00% 4.68% 0.77% 1.50%

R @ 0.00% 0.00% 0.29% 0.17% 0.12%

RLC 0.00% 0.00% 0.07% 0.00% 0.02%

ROLL 0.14% 0.00% 0.32% 0.09% 0.14%

ROT 1.50% 0.00% 0.58% 1.37% 0.86%

RP! 0.00% 0.00% 0.00% 0.00% 0.00%

RP@ 0.00% 0.00% 0.00% 0.00% 0.00%

RRC 0.00% 0.00% 0.07% 0.00% 0.02%

S->D 0.00% 0.00% 0.25% 0.00% 0.06%

SP @ 0.00% 0.00% 0.00% 0.00% 0.00%

SWAP 1.78% 5.10% 1.19% 3.16% 2.81%

TOGGLE 0.00% 0.42% 0.00% 0.00% 0.1%

TRAVERSE 0.00% 0.00% 0.00% 0.00% 0.00%

U* 0.41% 0.00% 0.14% 0.26% 0.20%

U/MOD 0.14% 0.00% 0.00% 0.09% 0.06%

U< 0.00% 0.00% 0.04% 0.00% 0.01%

USER 0.00% 0.00% 0.00% 0.00% 0.00%

VARIABLE 1.09% 1.91% 0.29% 1.37% 1.17%

XOR 0.14% 0.00% 0.50% 0.09% 0.18%

Instructions: 731 471 2777 1171

Appendix D References

Allmark, R. & Lucking, J. (1962) Design of an arithmetic unit incorporating a nesting store. In: Information Processing 1962: Proc. of the IFIP Congress 62, 27 August - 1 September 1962, Munich, North-Holland, Amsterdam, 1963, pp. 694-698

Anderson, J. (1961) A computer for direct execution of algorithmic languages. In: Proc. of the EJCC, 12-14 December 1961, Washington DC, Vol. 20, Macmillan, New York, 1961, pp. 184-193

Atkinson, R. & McCreight, E. (1987) The Dragon processor. In: Proc. of the Second Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto CA, 5-8 October 1987, pp. 65-69

Backus, J. (1978) Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Comm. of the ACM, August 1978, 21(8) 613-641

Bage, G. & Thorelli, L. (1980) Partial evaluation of a high-level architecture. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 44-51

Ballard, B. (1984) Forth direct execution processors in the Hopkins Ultraviolet Telescope. J. Forth Application and Research 2(1) 33-47

Bartlett, J. (1973) The HP 3000 computer system. In: ACM-IEEE Symp. on High-Level-Language Computer Architecture, College Park MD, 7-8 November 1973, pp. 61-69

Belinfante, J. (1987) S/K/I: combinators in Forth. J. Forth Application and Research, 4(4) 555-580

Bell, G., Cady, R., McFarland, H., Delagi, B., O'Laughlin, J., Noonan, R. & Wulf, W. (1970) A new architecture for mini-computers: the DEC PDP-11. In: SJCC, 1970, pp. 657-675. Reprinted in: Siewiorek, D., Bell, C.G. & Newell, A. (1982) Computer Structures: Principles and Examples, McGraw-Hill, 1982, pp. 649-661

Bell, J. (1973) Threaded code. Comm. of the ACM, June 1973, 16(6) 370-372

Bergh, A. & Mei, K. (1979) HP300 architecture. In: Proc. of the Nineteenth IEEE Computer Society Int. Conf. (Fall COMPCON 79), Washington DC, 4-7 September 1979, pp. 62-66

Best, D., Kress, C., Mykris, N., Russell, J. & Smith, W. (1982) CMOS/SOS microprocessor. IEEE Micro, August 1982, 2(3) 10-26

Blake, R. (1977) Exploring a stack architecture. Computer, May 1977, 10(5) 18-28

Bruno, J. & Lassagne, T. (1975) The generation of optimal code for stack machines. J. of the ACM, July 1975, 22(3) 382-396

Bulman, D. (1977) Stack computers: an introduction. Computer, May 1977, 10(5) 18-28

Burnley, P. & Harkaway, R. (1987) A high performance VME processor card: when 32-bit super-micros can't cut it. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 101-107

Burns, R. & Savitt, D. (1973) Micro-programming, stack architecture ease minicomputer programmer's burden. Electronics, 15 February 1973, 46(4) 95-101

Bush, W., Samples, A., Ungar, D. & Hilfinger, P. (1987) Compiling Smalltalk-80 to a RISC. In: Proc. of the Second Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto CA, 5-8 October 1987, pp. 112-116

Carlson, C. (1963) The mechanization of a push-down stack. In: AFIPS Conf. Proc., 1963 FJCC, Vol. 24, Spartan Books, Baltimore, pp. 243-250

Carlson, C. (1975) A survey of high-level language computer architecture. In: Chu, Y. (ed.) High-Level Language Computer Architecture, Academic Press, New York, 1975, pp. 31-62

Carr, H. & Kessler, R. (1987) An emulator for Utah Common Lisp's abstract virtual machine. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 113-116

Castan, M. & Organick, E. (1982) Micro-3L: an HLL-RISC processor for execution of FP-language programs. In: Conf. Proc.: The 9th Annual Symp. on Computer Architecture, 26-29 April 1982, Austin TX, pp. 239-247

Chen, Y., Chen, K. & Huang, K. (1980) Direct-execution high-level language FORTRAN computer. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 9-16

Cook, R. & Donde, N. (1982) An experiment to improve operand addressing. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 87-91

Cook, R. & Lee, I. (1980) An extensible stack-oriented architecture for a high-level language machine. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 231-237

Couch, J. & Hamm, T. (1977) Semantic structures for efficient code generation on a stack machine. Computer, May 1977, 10(5) 42-48

Cragon, H. (1979) An evaluation of code space requirements and performance of various architectures. Computer Architecture News, February 1979, 7(5) 5-21

Cragon, H. (1980) A case against high-level language computer architecture. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 88-91

Danile, P. & Malinowski, C. (1987) Forth processor core for integrated 16-bit systems. VLSI Systems Design, June 1987, 8(7) 98-104

Davidson, J. & Vaughan, R. (1987) The effect of instruction set complexity on program size and memory performance. In: Proc. of the Second Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto CA, 5-8 October 1987, pp. 60-64

Dewar, R. (1975) Indirect threaded code. Comm. of the ACM, June 1975, 18(6) 330-331

Ditzel, D. & Kwinn, W. (1980) Reflections on a high level language computer system, or, parting thoughts on the SYMBOL project. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 80-87

Ditzel, D. & McLellan, H. (1982) Register allocation for free: the C machine stack cache. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 48-56

Ditzel, D. & McLellan, H. (1987) Branch folding in the CRISP microprocessor: reducing branch delay to zero. In: The 14th Annual Int. Symp. on Computer Architecture; Conf. Proc., 2-5 June 1987, Pittsburgh, pp. 2-9

Ditzel, D., McLellan, H. & Berenbaum, A. (1987A) The hardware architecture of the CRISP microprocessor. In: The 14th Annual Int. Symp. on Computer Architecture; Conf. Proc., 2-5 June 1987, Pittsburgh, pp. 309-319

Ditzel, D., McLellan, H. & Berenbaum, A. (1987B) Design tradeoffs to support the C programming language in the CRISP microprocessor. In: Proc. of the Second Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto CA, 5-8 October 1987, pp. 158-163

Ditzel, D. & Patterson, D. (1980) Retrospective on high-level language computer architecture. In: Proc. of the 7th Int. Conf. on Computer Architecture, 1980, pp. 97-104. Reprinted in: Fernandez, E. & Lang, T. (eds.) Tutorial: Software-Oriented Computer Architecture, IEEE Computer Society Press, Washington DC, 1986, pp. 44-51

Dixon, R. (1987) A stack-frame architecture language processor. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 11-25

Doran, R. (1972) A computer organization with an explicitly tree-structured machine language. The Australian Computer Journal, February 1972, 4(1) 21-30

Doran, R. (1975) Architecture of stack machines. In: Chu, Y. (ed.) High-Level Language Computer Architecture, Academic Press, New York, 1975, pp. 63-108

Doran, R. (1979) Computer Architecture: A Structured Approach (APIC Studies in Data Processing No. 15), Academic Press, London.

Dress, W. (1986) REAL-OPS: a real-time engineering application language for writing expert systems. In: Proc. of the 1986 Rochester Forth Conf., (J. Forth Application and Research 4(2)) 113-124

Dress, W. (1987) High-performance neural networks. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 137-140

Dumse, R. (1984) The R65F11 and F68K single-chip Forth computers. J. Forth Application and Research 2(1) 11-21

Duncan, F. (1977) Stack machine developments: Australia, Great Britain, and Europe. Computer, May 1977, 10(5) 50-52

Earnest, E. (1980) Twenty years of Burroughs high-level language machines. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, 26-28 May 1980, Fort Lauderdale FL, pp. 64-71

Efland, G. & Mosteller, R. (1979) Stack Data Engine: Description and Implementation, Technical Report #3364, Computer Science Department, California Institute of Technology, Pasadena CA, December 1979.

Eickemeyer, R. & Patel, J. (1985) A parallel stack processor (PSP). In: Proc.: IEEE Int. Conf. on Computer Design: VLSI in Computers (ICCD 85), 7-10 October 1985, Port Chester NY, pp. 473-476

Evey, R. (1963) Application of pushdown-store machines. In: AFIPS Conf. Proc., 1963 FJCC, Vol. 24, Spartan Books, Baltimore MD, 1963, pp. 215-227

Foster, C. (1975) Socrates. In: Conf. Proc.: The 2nd Annual Symp. on Computer Architecture, 20-22 January 1975, pp. 165-169

Fraeman, M., Hayes, J., Williams, R. & Zaremba, T. (1986) A 32 bit processor architecture for direct execution of Forth. In: 1986 FORML Conf. Proc., 28-30 November 1986, Pacific Grove CA, pp. 197-210

Golden, J., Moore, C. & Brodie, L. (1985) Fast processor chip takes its instructions directly from Forth. Electronic Design, 21 March 1985, 127-138

Grewe, R. & Dixon, R. (1984) A Forth machine for the S-100 system. J. Forth Application and Research 2(1) 23-32

Haikala, I. (1982) More design data for stack architectures. In: Proc. of the ACM '82 Conf., Dallas, 25-27 October 1982, pp. 30-36

Haley, A. (1962) The KDF.9 computer system. In: AFIPS Conf. Proc., Vol. 22: 1962 Fall Joint Computer Conf., Spartan Books, Washington DC, 1962, pp. 108-120

Hand, T. (1987) A Forth implementation of LISP. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 141-144

Harbison, S. (1982) An architectural alternative to optimizing compilers. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 57-65

Harris, N. (1980) A directly executable language suitable for a bit slice microprocessor implementation. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 40-43

Harris Semiconductor (1988A) RTX 2000 Instruction Set, Harris Corporation, Melbourne FL.

Harris Semiconductor (1988B) RTX 2000 Real Time Express Microcontroller Data Sheet, Harris Corporation, Melbourne FL.

Hasegawa, M. & Shigei, Y. (1985) High-speed top-of-stack scheme for VLSI processor: a management algorithm and its analysis. In: The 12th Annual Int. Symp. on Computer Architecture, 17-19 June 1985, Boston, pp. 48-54

Hassitt, A., Lageschulte, J. & Lyon, L. (1973) Implementation of a high level language machine. Comm. of the ACM, April 1973, 16(4) 199-212.

Haydon, G. (1983) All About Forth: An Annotated Glossary, 2nd ed., Mountain View Press, Mountain View CA.

Haydon, G. & Koopman, P. (1986) MVP microcoded CPU/16: history. In: Proc. of the 1986 Rochester Forth Conf., (J. Forth Application and Research 4(2)) pp. 273-276

Hayes, J. (1986) An interpreter and object code optimizer for a 32 bit Forth chip. In: 1986 FORML Conf. Proc., 28-30 November 1986, Pacific Grove CA, 211-221

Hayes, J. & Fraeman, M. (1988) Private communication, October 1988.

Hayes, J., Fraeman, M., Williams, R. & Zaremba, T. (1987) An architecture for the direct execution of the Forth programming language. In: Proc. of the Second Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), Palo Alto CA, 5-8 October 1987, pp. 42-49

Hayes, J. & Lee, S. (1988) The architecture of FRISC 3: a summary. In: Proc. of the 1988 Rochester Forth Conf., 14-18 June 1988, pp. 81-82.

Hennessy, J. (1984) VLSI processor architecture. IEEE Trans. Computers, December 1984, C-33(12) 1221-1246. Reprinted in: Fernandez, E. & Lang, T. (eds.) Tutorial: Software-Oriented Computer Architecture, IEEE Computer Society Press, Washington DC, 1986, pp. 90-115.

Herriot, R. (1973) GLOSS: a high level machine. In: ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park MD, pp. 81-90.

Hutchison, P. & Ethington, K. (1973) Program execution in the SYMBOL 2R computer. In: ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park MD, pp. 20-26.

Intel (1981) The iAPX 88 Book, Intel Corporation, 1981.

Jennings, E. (1985) The Novix NC4000 project. Computer Language, October 1985, 2(10) 37-46

Johnson, M. (1987) System considerations in the design of the Am29000. IEEE Micro, August 1987, 7(4) 28-41

Johnsson, R. & Wick, J. (1982) An overview of the Mesa processor architecture. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 20-29

Jonak, J. (1986) Experience with a Forth-like language. SIGPLAN Notices, February 1986, 21(2)

Jones, S. (1987) The Implementation of Functional Programming Languages, Prentice-Hall, New York

Jones, T., Malinowski, C. & Zepp, S. (1987) Standard-cell CPU toolkit crafts potent processors. Electronic Design, 14 May 1987, 35(12) 93-101

Kane, G., Hawkins, D. & Leventhal, L. (1981) 68000 Assembly Language Programming, Osborne/McGraw-Hill, Berkeley CA

Kaneda, Y., Wada, K. & Maekawa, S. (1983) High-speed execution of Forth and Pascal programs on a high-level language machine. In: Microcomputers: Developments in Industry, Business and Education, Ninth EUROMICRO Symp. on Microprocessing and Microprogramming, 13-16 September 1983, Madrid, North-Holland, Amsterdam, 1983, pp. 259-266

Kavi, M., Belkhouche, B., Bullard, E., Delcambre, L. & Nemecek, S. (1982) HLL architectures: pitfalls and predilections. In: Conf. Proc.: The 9th Annual Symp. on Computer Architecture, 26-29 April 1982, Austin TX, pp. 18-32

Kavipurapu, K. & Cragon, H. (1980) Quest for an 'ideal' machine language. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 33-39

Keedy, J. (1977) An outline of the ICL 2900 series system architecture. Australian Computer Journal, July 1977, 9(2) 53-62. Reprinted in: Siewiorek, D., Bell, C.G. & Newell, A., Computer Structures: Principles and Examples, McGraw-Hill, 1982, pp. 251-259

Keedy, J. (1978A) On the use of stacks in the evaluation of expressions. Computer Architecture News, February 1978, 6(6) 22-28

Keedy, J. (1978B) On the evaluation of expressions using accumulators, stacks, and store-to-store instructions. Computer Architecture News, December 1978, 7(4) 24-27

Keedy, J. (1979) More on the use of stacks in the evaluation of expressions. Computer Architecture News, 15 June 1979, 7(8) 18-21

Kieburtz, R. (1985) The G-machine: a fast, graph-reduction evaluator. In: Jouannaud, J. (ed.) Functional Programming Languages and Computer Architecture, 16-19 September 1985, Nancy, France, pp. 400-413 (Goos, G. & Hartmanis, J. (eds.) Lecture Notes in Computer Science, No. 201)

Kluge, W. & Schlutter, H. (1980) An architecture for direct execution of reduction languages. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 174-180

Kogge, P. (1982) An architectural trail to threaded-code systems. Computer, March 1982, 15(3) 22-32

Koopman, P. (1985) Forth Floating Point: MVP-Forth Series Vol. 3 (revised), Mountain View Press, Mountain View CA

Koopman, P. (1986) CPU/16 Technical Reference Manual, WISC Technologies, Inc., La Honda CA

Koopman, P. (1987A) Microcoded versus hard-wired control. Byte, January 1987, 12(1) 235-242

Koopman, P. (1987B) The WISC concept. Byte, April 1987, 12(4) 187-194

Koopman, P. (1987C) Writable instruction set, stack oriented computers: the WISC concept. In: Proc. of the 1987 Rochester Forth Conf., (J. Forth Application and Research 5(1)) 49-71

Koopman, P. (1987D) CPU/32 Technical Reference Manual, WISC Technologies, Inc., La Honda CA

Koopman, P. (1987E) Bresenham line drawing. Forth Dimensions, March/April 1987, 8(6) 12-16. Reprinted in: Dr. Dobb's Toolbook of Forth, Vol. 2, M&T Books, Redwood City CA, 1987, pp. 347-356

Koopman, P. (1987F) Fractal landscapes. Forth Dimensions, March/April 1987, 9(1) 12-16. Reprinted in: Dr. Dobb's Toolbook of Forth, Vol. 2, M&T Books, Redwood City CA, 1987, pp. 357-365

Koopman, P. Introduction of the RTX 32P. J. Forth Application and Research 5(2), forthcoming

Koopman, P. & Haydon, G. (1986) MVP microcoded CPU/16: architecture. In: Proc. of the 1986 Rochester Forth Conf., (J. Forth Application and Research 4(2)) pp. 277-280

Koopman, P. & Lee, P. (1989) A fresh look at combinator graph reduction. In: 1989 Conf. on Programming Language Design and Implementation, June 1989

Lampson, B. (1982) Fast procedure calls. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 66-76

Lim, R. (1987) LISP machine architecture issues. In: Digest of Papers, Thirty-second IEEE Computer Society Int. Conf. (Spring COMPCON 87), San Francisco, 23-27 February 1987, pp. 116-119

Lipovski, G. (1975) On a stack organization for microcomputers. In: Hartenstein, R. & Zaks, R. (eds.) Workshop on the Micro-architecture of Computer Systems, 23-25 June 1975, Nice, North-Holland, Amsterdam, 1975, pp. 137-147

Longway, C. (1988) Instruction Sequencing and Decoding in the SF1, Master of Science Thesis, Wright State University

Lor, K. & Chu, Y. (1981) Design of a Pascal Interactive Direct-Execution Computer, Technical Report TR-1088, Department of Computer Science, University of Maryland, College Park MD, August 1981.

Lutz, M. (1973) The design and implementation of a small scale stack processor system. In: AFIPS Conf. Proc., Vol. 42: 1973 National Computer Conf. and Exposition, 4-8 June 1973, AFIPS Press, Montvale NJ, pp. 545-553

Matheus, C. (1986) The internals of FORPS: a Forth-based production system. J. Forth Application and Research 4(1) 7-27

McDaniel, G. (1982) An analysis of a Mesa instruction set using dynamic instruction frequencies. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 167-176

McFarling, S. & Hennessy, J. (1986) Reducing the cost of branches. In: The 13th Annual Int. Symp. on Computer Architecture; Conf. Proc., 2-5 June 1986, Tokyo, pp. 396-403

McKeeman, W. (1975) Stack computers. In: Stone, H. (ed.) Introduction to Computer Architecture, Science Research Associates, Chicago, 1975, pp. 281-317

Miller, G. (1967) Psychology of Communication: Seven Essays, Basic Books, New York

Miller, J. & Vandever, W. (1973) Instruction architecture of an aerospace multiprocessor. In: ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park MD, pp. 52-60.

Miller, D. (1987) Stack machines and compiler design. Byte, April 1987, 12(4) 177-185

MISC (1988) MISC M17 Technical Reference Manual, MISC Inc., 1988.

Moon, D. (1985) Architecture of the Symbolics 3600. In: The 12th Annual Int. Symp. on Computer Architecture, 17-19 June 1985, Boston, pp. 76-83

Moore, C. (1980) The evolution of Forth, an unusual language. Byte, August 1980, 5(8) 76-92

Morris, D. & Ibbett, R. (1979) The MU5 Computer System, Springer-Verlag, New York.

Myamlin, A. & Smirnov, V. (1969) Computer with stack memory. In: Morrell, A. (ed.) Information Processing 68: Proc. of IFIP Congress 1968, 5-10 August 1968, Edinburgh, Vol. 2, North-Holland, Amsterdam, 1969, pp. 818-823

Myers, G. (1977) The case against stack-oriented instruction sets. Computer Architecture News, August 1977, 6(3) 7-10

Myers, G. (1982) Advances in Computer Architecture, John Wiley & Sons, New York, 1982

Nissen, S. & Wallach, S. (1973) The all applications digital computer. In: ACM-IEEE Symp. on High-Level-Language Computer Architecture, 7-8 November 1973, College Park MD, pp. 43-51.

Novix (1985) Programmers Introduction to the NC4016 Microprocessor, Novix Inc., Cupertino CA.

Odette, L. (1987) Compiling Prolog to Forth. J. Forth Application and Research, 4(4) 487-533

Ohdate, S., Yamashita, K. & Hishinuma, C. (1975) Push-down stack architecture to a minicomputer interface. In: Information Processing in Japan, Vol. 15, Information Processing Society of Japan, Tokyo, 1975

Ohran, R. (1984) Lilith and Modula-2. Byte, August 1984, 9(8) 181-192

O'Neill, E. (1979) Pascal Microengine. In: Proc. of the Nineteenth IEEE Computer Society Int. Conf. (Fall COMPCON 79), Washington DC, 4-7 September 1979, pp. 112-113

Organick, E. (1973) Computer System Organization: The B5700/B6700 Series, Academic Press, New York, 1973

Park, J. (1986) Toward the development of a real-time expert system. In: Proc. of the 1986 Rochester Forth Conf., (J. Forth Application and Research 4(2)) 133-154

Parnas, D. (1972) On the criteria to be used in decomposing systems into modules. Comm. of the ACM, December 1972, 15(12) 1053-1058

Patterson, D. (1985) Reduced instruction set computers. Comm. of the ACM, January 1985, 28(1) 8-21. Reprinted in: Fernandez, E. & Lang, T. (eds.) Tutorial: Software-Oriented Computer Architecture, IEEE Computer Society Press, Washington DC, 1986, pp. 76-89

Patterson, D. & Piepho, S. (1982) RISC assessment: a high-level language experiment. In: Conf. Proc.: The 9th Annual Symp. on Computer Architecture, 26-29 April 1982, Austin TX, pp. 3-8

Patterson, D. & Sequin, C. (1982) A VLSI RISC. In: Proc. of the Eighth Int. Symp. on Computer Architecture, May 1981, pp. 443-457. Reprinted in: Milutinovic, V. (ed.) Tutorial on Advanced Microprocessors and High-Level Language Computer Architecture, IEEE Computer Society, Washington DC, 1986, pp. 145-157

Pountain, D. (1988) Rekursiv: an object-oriented CPU. Byte, November 1988, 13(12) 341-349

Prabhala, B., …

…, G., Furht, B. & Kibler, R. (1988) Three-dimensional computer performance. Computer, July 1988, 21(7) 59-60

Ragan-Kelley, R. & Clark, R. (1983) Applying RISC theory to a large computer. Computer Design, November 1983. Reprinted in: Milutinovic, V. (ed.) Tutorial on Advanced Microprocessors and High-Level Language Computer Architecture, IEEE Computer Society, Washington DC, 1986, pp. 297-301

Randell, B. & Russell, L. (1964) ALGOL 60 Implementation: The Translation and Use of ALGOL 60 Programs on a Computer (APIC Studies in Data Processing No. 5), Academic Press, London, 1964, pp. 22-33

Rust, T. (1981) ACTION processor FORTHRIGHT. In: Proc. of the 1981 Rochester Forth Standards Conf., Institute for Applied Forth Research, Rochester NY, 1981, pp. 309-315.

Samelson, K. & Bauer, F. (1962) The ALCOR project. In: Symbolic Languages in Data Processing: Proc. of the Symp. organized and edited by the Int. Computation Center, Rome, 26-31 March 1962, Gordon and Breach, New York, pp. 207-217

Sansonnet, J., Castan, M., Percebois, C., Botella, D. & Perez, J. (1982) Direct execution of LISP on a list-directed architecture. In: Proc. of the Symp. on Architectural Support for Programming Languages and Operating Systems (ASPLOS I), Palo Alto CA, 1-3 March 1982, pp. 132-139

Scheevel, M. (1986) NORMA: a graph reduction processor. In: Proc. of the 1986 ACM Conf. on LISP and Functional Programming, pp. 212-218

Schoellkopf, J. (1980) PASC-HLL: a high-level-language computer architecture for Pascal. In: Proc. of the Int. Workshop on High-Level Language Computer Architecture, Fort Lauderdale FL, 26-28 May 1980, pp. 222-225

Schulthess, P. (1984) A reduced high-level-language instruction set. IEEE Micro, June 1984, 4(3) 55-67

Schulthess, P. & Mumprecht, E. (1977) Reply to the case against stack-oriented instruction sets. Computer Architecture News, December 1977, 6(5) 24-26

Sequin, C. & Patterson, D. (1982) Design and implementation of RISC I. In: Randell, B. & Treleaven, P. (eds.) VLSI Architecture: Advanced Course on VLSI Architecture, 19-30 July 1982, Bristol England, Prentice-Hall, 1983, pp. 276-298

Shaw, J., Newell, A., Simon, H. & Ellis, T. (1959) A command structure for complex information processing. In: Proc. of the Western Joint Computer Conf., 6-8 May 1958, Los Angeles CA, American Institute of Electrical Engineers, 1959, pp. 119-128

Siewiorek, D., Bell, C.G. & Newell, A. (1982) Computer Structures: Principles and Examples, McGraw-Hill, 1982

Sites, R. (1978) A combined register-stack architecture. Computer Architecture News, April 1978, 6(8) 19

Sites, R. (1979) How to use 1000 registers. In: Seitz, C. (ed.) Proc. of the Caltech Conf. on Very Large Scale Integration, 22-24 January 1979, pp. 527-532

Stanley, T. &

Wedig, R. (1987) A performance analysis of automatically managed top of stack buffers In:...... The 14th Annual Int Symp on Computer Architecture; Conf Proc, 2-5 June 1987, Pittsburgh, pp 272-281 Stephens , C. & Watson, W. (1985) Preliminary Report On The Novix 4000, Computer Solutions Ltd., Chertsey, Surrey England. Sweet, R. & Sandman, J. (1982) Empirical Analysis of the Mesa Instruction Set. In: Proc. Of the symp. On Architectural Support for Programming Languages ​​and Operating Systems (Asplos i), Palo Alto CA, 1-3 March 1982, PP. 158-166 Tamir, Y. & Sequin, C. (1983) Strategies for Managing the register file in RISC IEEE Trans Computers, November 1983, C-32 (11) 977-989 Reprinted in:... Milutinovic, V. (Ed.) Tutorial on Advanced Microprocessors and High-Level Language Computer Architecture, IEEE Computer Society , Washington DC, 1986, PP. 167-179 Tanabe, K. & Yamamoto, M. (1980) Single Chip Pascal Processor: ITS Architecture and Performance Evaluation. In: Proc. Of the TWENTY-FIRST IE EE computer society Int. Conf. (Fall COMPCON 80), Washington DC, 23-25 ​​September 1980, pp. 395-399 Tanenbaum, A. (1978) Implications of structured programming for machine architecture. Comm. Of the ACM, March 1978 21 (3) 237-246 Tsukamoto, M. (1977) Program Stacking Technique. IN: Information Processing in Japan, Vol. 17, Information Processing Society of Japan, Tokyo, 1977, PP. 114-120 Vaughan, J. & SMITH, R. (1984) The Design of a forth computer. J. Forth Application and research 2 (1) 49-64 Vickery, C. (1984) QForth: a Multitasking Forth Language Processor. J. Forth Application and research 2 ( 1) 65-75 WADA, K., KaNeda, Y, &

Maekawa, S. (1982A) Software and System Evaluation of a Forth Machine System. Systems, Computers, Controls, 1982, 13 (2) 19-28 Wada, K., Kanda, Y., & Maekawa, S. (1982B) System design and hardware structure of a Forth machine system. Systems, Computers, Controls, 1982, 13 (2) 11-18 Weber, H. (1967) A microprogrammed implementation of EULER on IBM System / 360 Model 30. Comm. of the ACM, September 1967, 10 (9) 549-558 Welin, A. (1973) The Internal Machine. In: ACM-IEEE SYMP. ON High-Level-Language Computer Architecture, 7-8 November 1973, College Park, MD, PP. 101-108. Williams, R., Fraeman, M., Hayes, J. & Zaremba, T. (1986) The Development of A VLSI Forth MicroproProcessor. In: 1986 Forml conf Conf. proc., 28-30 November 1986 Pacific Grove Ca 189-196 Whitby-Strevens, C. (1985) The Transputer. In: The 12th Annual Int. Symp. On Computer Architecture, 17-19 June 1985, Boston Pp. 292-300 Wilkes, M. (1982 Keynote Address: The Processor Instruction Set. 15th Workshop On Microprogramming, PP. 3-5 WINK El, D. (1981) The Forth Engine. Forth Dimensions, September / October 1981, 3 (3) 78-79 Wirth, N. (1968) Stack vs. MultiRegister Computers. Sigplan Notices, March 1968, 3 (3) 13 -19 Wirth, N. (1979) a Personal Computer on A High-Level Language. In: Tobias, J. (Ed.), Language Design and Programming Methodology. Proc. Of a symp. Held in Sydney, Australia, 10 -11 September 1979, PP. 191-193. REPRINTED IN: Goos, G. &
