[Reproduced] Caches for MIPS


A MIPS CPU without a cache can hardly be called a real RISC. Perhaps that's unfair: for a few special purposes you can build a MIPS CPU with small, tightly coupled memory that can be accessed in a fixed number of pipeline stages (preferably one). But most MIPS CPUs have caches.

This chapter describes how the MIPS caches work and what software must do to make them usable and dependable. After a MIPS CPU is reset, the state of the caches is undefined, so the software must be very careful. There are tricks for finding out the cache sizes at run time (building a known cache size directly into initialization code is a bad software habit). For diagnostics programmers, we discuss how to test the caches and probe individual cache entries.

Programmers of real-time applications sometimes want to control the caches while the CPU is running. We will discuss how to do that too, although I have my doubts about some of these tricks.

Of course, all this has evolved along with the MIPS CPUs. On the early 32-bit MIPS processors, you initialize the cache or invalidate its contents by first putting the cache into a special state and then performing ordinary reads and writes. Later processors define special instructions for these jobs.

4.1 Caches and Cache Management

The cache's job is to keep a copy of part of memory's data, so that the data can be returned to the CPU quickly, within a fixed and very short time, keeping the pipeline running.

Most MIPS CPUs have separate caches for instructions and data (known as the I-cache and D-cache, respectively), so that an instruction fetch and a data read can happen at the same time.

Older CPU families (such as x86) grew up without caches, so existing software did things like writing instructions to memory and then executing them. To keep such code working, current x86 chips have quite elaborate hardware that keeps the caches consistent automatically, so software never has to know the cache is there at all (if the chip is to boot MS-DOS, it must provide consistency of this kind).

But MIPS machines, which had caches from the start, never needed to be so flexible. The cache must be transparent to applications, apart from making them run faster. For system programs and drivers, however, MIPS CPUs with caches make no attempt at transparency: the cache exists only to make the CPU run faster, not to help the system programmer. In an operating system like UNIX, the OS can hide the cache from applications completely; even a humbler OS can hide most of the cache handling, but you may well need to know when to call the appropriate subroutines that carry out the necessary cache operations.

4.2 How Caches Work

Conceptually, a cache is an associative memory: when data is written into it, part of the data is stored as a key under which it is filed. In a cache, the key is the full memory address. Present the same key to an associative memory and you get back the same data. A true associative memory will keep accepting new key/data pairs until it is full, matching purely on keys. However, because an incoming key must be compared simultaneously with every stored key, a real associative memory of any useful size is either inefficient, slow, or both.

So how can we build a useful cache that is both efficient and fast? Figure 4.1 shows the basic layout of the simplest kind, the direct-mapped cache, which was used by virtually all MIPS CPUs designed before 1992.

A direct-mapped cache is built from a number of simple storage slots (usually each called a line), indexed by the low-order bits of the address. Each line holds one or more words of data plus a tag field, which records the high-order bits of the memory address the data came from. On a read, the indexed line is accessed and its tag is compared with the high bits of the memory address: if they match, we know we have the right data, and that is called a hit. Where a line holds more than one word, the remaining low address bits select the right word within the line.
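
To make the index and tag arithmetic concrete, here is a small C model of a direct-mapped lookup (my illustration, not from the original text; the 8KB size and 16-byte line are assumed values):

#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 8192u                 /* total bytes (assumed)    */
#define LINE_SIZE  16u                   /* bytes per line (assumed) */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)

struct line {
    bool     valid;
    uint32_t tag;                        /* high-order address bits  */
    uint8_t  data[LINE_SIZE];
};

static struct line cache[NUM_LINES];

/* Split an address into index and tag, then test for a hit. */
static bool is_hit(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % NUM_LINES;
    uint32_t tag   = addr / CACHE_SIZE;  /* bits above the index     */

    return cache[index].valid && cache[index].tag == tag;
}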

If the tag does not match, that is a miss: the data must be read from memory and copied into the cache line that the address indexes. The data that previously occupied the line is simply discarded; if the CPU wants it again, it must be fetched from memory again.

A direct-mapped cache has the characteristic that, for any given memory address, there is exactly one line in the cache where the data can be kept. That can be good or bad. The good part is that so simple a structure lets the CPU run faster. The bad part: if your program alternates between two data items that happen to map onto the same line (perhaps just because the low-order bits of their addresses coincide), the two items keep displacing each other from the cache, and the cache's efficiency collapses.

A true associative memory would never suffer from this kind of thrashing, but at any reasonable size it is hopelessly impractical, expensive, and slow.

A traditional compromise is the two-way set-associative cache, which is really just two direct-mapped caches operating in parallel and matching memory locations simultaneously (Figure 4.2). Now any given address has two chances of being cached. Four-way set-associative caches (four direct-mapped sub-caches) are also common in cache design. But there is a penalty: a set-associative cache needs more bus connections than a direct-mapped one, so caches too big to build on the CPU chip are usually direct mapped.

The direct-mapped cache keeps one clever advantage, though: since there is only one candidate line for the data you want, the cache can hand the data on before the tag match completes (so long as the CPU doesn't do anything irrevocable with it). That improves the use of each clock cycle.

Since the cache fills up after the program has run a while, storing data fetched from memory means discarding data already in the line. If you know the cached copy is consistent with memory, you can simply discard it; but if the copy in the cache has been updated, you must first save it back to memory.

Which brings us to the question: how does a cache handle writes?

4.3 Write-Through Caches in Early MIPS CPUs

CPUs don't just read data (as discussed above); they write it too. Since a cache holds a copy of part of main memory's data, there is an obvious way of handling CPU writes, called the write-through cache.

In a write-through cache, the CPU always writes the data straight to main memory; if the cache holds a copy of that location's data, the copy is updated too. If we do this, data in the cache is always consistent with main memory, so we can discard any cache line at any time without losing anything; a miss costs nothing but time. Of course, it would cripple the processor to wait for every write to finish, but we can fix that: the data and its address can be put to one side, and the main memory controller can pick them up and complete the write by itself. The piece of hardware that holds pending writes is called a write buffer, and it is a first-in first-out (FIFO) store.
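
In outline (a toy model of the mechanism, not any particular CPU's hardware), a write buffer is nothing more than a small FIFO of address/data pairs; the depth of four matches the R3000-generation parts described below:

#include <stdint.h>

#define WB_DEPTH 4                        /* typical R3000-era depth */

struct wb_entry { uint32_t addr, data; };

static struct wb_entry wb[WB_DEPTH];
static unsigned wb_head, wb_tail, wb_count;

/* CPU side: returns 0 if the write was absorbed, -1 if the CPU must stall. */
int wb_push(uint32_t addr, uint32_t data)
{
    if (wb_count == WB_DEPTH)
        return -1;                        /* buffer full: CPU waits  */
    wb[wb_tail] = (struct wb_entry){ addr, data };
    wb_tail = (wb_tail + 1) % WB_DEPTH;
    wb_count++;
    return 0;
}

/* Memory-controller side: retire one pending write when memory is ready. */
int wb_pop(struct wb_entry *out)
{
    if (wb_count == 0)
        return -1;                        /* nothing pending         */
    *out = wb[wb_head];
    wb_head = (wb_head + 1) % WB_DEPTH;
    wb_count--;
    return 0;
}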

Early MIPS CPUs had a direct-mapped write-through cache and a write buffer, and the R3000 set the pattern: it has the cache controller on the same chip but needs extra high-speed memory chips to hold the tags and data. This way of working is fine, and works very well, so long as the main memory system can digest writes at the average rate the CPU's programs produce them.

However, CPU speeds have grown much faster than memory speeds. Somewhere around the time the 32-bit MIPS CPUs gave way to the 64-bit R4000, MIPS CPU speed passed the point where a memory system could reasonably digest every write.

4.4 Write-Back Caches in Recent MIPS CPUs

Early MIPS CPUs used simple write-through caches. Later MIPS CPUs cannot: they are too fast, and would drown the memory system in write traffic and slow to a crawl.

The solution is to hold onto written data in the cache. The data is written into the cache line only, and the line is marked so that we cannot forget to write it back to memory at some point (a line that needs writing back is called dirty).

Write-back caches subdivide further, according to what happens when the CPU writes a location that is not currently in the cache. We can write straight to main memory, bypassing the cache; or we can first read the line into the cache in the usual way and then write into it, a policy called write-allocate. Viewed selfishly from one program running on one CPU, write-allocate looks like a waste of time; but it makes the design of the whole system simpler, because every read or write of memory done on the program's behalf becomes the transfer of one cache-line-sized block.

From the MIPS R4000 onward, MIPS CPUs have had on-chip caches that run write-back with write-allocate, with line sizes of 16 or 32 bytes.

These cache modes suited the large computer systems that Silicon Graphics designed around the R4000 and other big CPUs, and the cache operations also show the influence of multiprocessor systems on those designs.

4.5 Other Choices in Cache Design

A great deal of work and research went into cache design in the 1980s and 1990s, so there are many other design choices.

Physically addressed / virtually addressed:

When the CPU runs a mature operating system, the addresses a program uses for its data and instructions (program addresses, or virtual addresses) are translated into the physical addresses used on the system memory.

A cache that works purely on physical addresses is easier to manage (we'll discuss why later). But using the raw virtual address lets the cache start its lookup earlier, which lets the system run a little faster. So what is the problem with virtual addresses? They aren't unique: many different programs running in different address spaces on the CPU may share the same virtual address while using different data. One answer is to reinitialize the whole cache every time we switch address spaces; that was done many years ago and is a reasonable solution for very small caches, but for a big cache it is ridiculously inefficient. Instead, we need a field in the cache tag that identifies the address space, so entries cannot be confused.

Virtual addresses have another problem: the same physical location can be known by different virtual addresses in different tasks. The same physical data can then end up in different cache entries (since the entries are selected by indexes built from the different virtual addresses). Such situations must be headed off by the operating system's memory management; the details are covered in Section 4.14.2.

From the R4000 on, both MIPS primary caches have used a virtual-address index, to provide fast cache lookup. For the tag that labels each cache line, though, the physical address is the better choice: it is unique, and it is efficient, because the design means the CPU can translate the virtual address to a physical one in parallel with the cache index lookup.

Choice of Line Size:

The line size is how many words of data are stored with each tag. Early MIPS caches stored just one word of data per tag. But it pays to store several words per tag, especially where the memory system supports fast burst reads. Modern MIPS caches tend to use four-word or eight-word lines, with larger lines in the bigger second-level and third-level caches.

When a cache miss occurs, the whole line must be filled from memory. It is likely to be worth fetching several words at a time; even the MIPS CPUs with one-word cache lines often fetched multiple words on a miss.

Split or unified:

MIPS primary caches have always been split: instruction fetches look in the I-cache, and data reads and writes look in the D-cache. (By the way, if you want the CPU to execute instructions it has just copied into memory, you must both make sure the code data is no longer held only in the D-cache, so that it actually reaches memory, and also guarantee that the corresponding I-cache lines are invalidated so the instructions get freshly loaded into the I-cache.)

However, a second-level cache that is not on the CPU chip is rarely split this way. It would do no good, and it would cost far too many pins, unless you could afford to provide a separate data bus for each cache.

4.6 Cache Management

On a MIPS CPU, the cache system needs help from system software to guarantee that every application always sees consistent data, particularly in systems with DMA I/O controllers, which read and write data directly in memory.

On CISC CPUs there is usually no cache work for system software to do; but making the cache genuinely transparent costs extra memory-system hardware, silicon area, and clock cycles.

When the system starts up, the MIPS CPU must initialize its caches; this is a fairly complex business, and there are some recipes for it below. Beyond startup, here is when the caches need the CPU's intervention:

Before a DMA device reads data from memory:

If a device is going to fetch data from memory, it must get the right data. If the D-cache is write-back and a program has recently written some data, some of the correct data may still be held in the D-cache and not yet written back to main memory. The CPU cannot see this as a problem, of course; if it wants the data, it gets the correct values from the cache. So before a DMA device starts reading data from memory, any data for that region still held dirty in the D-cache must first be written back to memory.

When a DMA device writes data to memory:

This is just as important. If a device is going to store data into memory, it is vital to invalidate any cache lines that correspond to the memory locations being written; otherwise, a CPU reading those locations would get stale data from the cache. The corresponding cache lines should be invalidated before the DMA device writes the data to memory.

Writing instructions:

When the CPU writes instructions into memory for later execution, it must first ensure that the instructions actually get written back to memory, and then invalidate the corresponding I-cache lines. In MIPS CPUs, the D-cache and I-cache know nothing about each other. (When the CPU writes instructions to memory, they are written as data and may well end up only in the D-cache, so we must make sure they reach memory; as for why the I-cache must be invalidated, the reasoning is exactly the same as for a DMA device writing directly to memory.)

If your software has to handle these situations, it needs two operations that act on individual cache lines.

The first operation is write-back: the CPU must be able to find the cache line (if any) holding data for a given address and, if the line is there and dirty, write its data to memory. The second is invalidation: find the line for a given address and mark it invalid, so its data can no longer be used.
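
To make the driver's-eye view concrete, here is a sketch of the two operations applied to a whole buffer, built from hypothetical per-line primitives (dcache_writeback_line() and dcache_invalidate_line() are my names, not a standard API; on R4000-style CPUs they would wrap the cache instruction of Section 4.10.2, and on a write-through R3000-style cache the write-back half is unnecessary):

#include <stddef.h>
#include <stdint.h>

#define LINE 16u   /* assumed D-cache line size in bytes */

extern void dcache_writeback_line(void *addr);   /* hypothetical */
extern void dcache_invalidate_line(void *addr);  /* hypothetical */

/* Call before a DMA device reads this buffer from memory. */
void dma_flush_for_device_read(void *buf, size_t nbytes)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
    uintptr_t end = (uintptr_t)buf + nbytes;

    for (; p < end; p += LINE)
        dcache_writeback_line((void *)p);    /* push dirty data out  */
}

/* Call before the CPU reads a buffer a DMA device has written. */
void dma_invalidate_for_device_write(void *buf, size_t nbytes)
{
    uintptr_t p   = (uintptr_t)buf & ~(uintptr_t)(LINE - 1);
    uintptr_t end = (uintptr_t)buf + nbytes;

    for (; p < end; p += LINE)
        dcache_invalidate_line((void *)p);   /* discard stale copies */
}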

CPUs add further levels of cache, of different speeds and sizes, to reduce the cost of misses. That lets the designer keep the innermost cache mechanism simple, so it can be cycled at a high clock rate; the obvious consequence is that the innermost caches stay small. Since 1998, many high-speed CPUs have put the second-level cache on the same chip while keeping the primary caches modest in size; a pair of 16KB primary caches has been the favorite.

Caches that are not on the CPU chip are usually direct mapped, because a set-associative cache system needs more buses and hence more pins. Even so, it is an area worth exploring: the MIPS R10000 drives a two-way set-associative external cache with only one data bus, and if the hit is not in the expected way, the data comes back after an extra delay (the two ways share the one bus).

Over this development, two main software interfaces to the caches grew up. From the software's point of view, one class is the 32-bit MIPS CPUs typified by the R3000; the other is the 64-bit CPUs typified by the R4000. The R3000-class cache is write-through, direct mapped, and physically indexed. Its minimum unit is the word, so writing a byte (or anything smaller than a word) must be handled specially, and cache management itself is done with ordinary reads and writes performed in special cache modes.

Why not manage the caches in hardware?

Hardware cache management usually goes by the name of snooping: the cache hardware watches the memory bus, so that whenever another CPU or a DMA device accesses memory, the cache sees the access to the affected addresses and can keep itself consistent.

4.7 Second-Level and Third-Level Caches

In big systems, caches are often nested several levels deep. A small, fast primary cache is closest to the CPU; when an access misses in the primary cache, it is looked for in the second-level cache rather than going straight to memory. The second-level cache sits between the primary cache and memory in both speed and size. How many cache levels are worthwhile is set by the gap between memory speed and the CPU's fastest access rate; because CPU speed has grown so much faster than memory speed, desktop computer systems went from no cache at all to two levels of cache in about 12 years. The fastest CPUs of the late 1990s, running at roughly 500MHz, could use three levels of cache.

4.8 MIPS CPU Cache Configurations

Looking at how cache designs and hierarchies developed (see Table 4.1), we can divide MIPS CPUs into two classes: old and modern.

The faster the clock, the more elaborate the cache structures you see, because designers must cope with CPUs that run ever faster relative to the memory system. To keep things running smoothly, the cache must be fast enough to feed data to the CPU while still being refillable from the slower memory beyond it. Compared with what went before, the R4000-class CPUs have primary caches that are write-back, write-allocate, virtually indexed, physically tagged, and two-way or four-way set associative.

Many R4x00 and subsequent CPUs have an on-chip controller for the second-level cache; the first CPUs with the second-level cache itself on the chip appeared in 1998.

Because the two generations differ so much, we will cover them in two separate sections.

Note! In some systems the second-level cache is not controlled by hardware inside the MIPS CPU but is built out on the memory bus. The software interface to such a cache may be very different from the interface to a cache controlled by the CPU.

4.9 Programming R3000-Style Caches

The MIPS R2000 broke new ground by putting the cache controller on the CPU chip and splitting the cache into an I-cache and a D-cache. It should be no surprise that such a pioneering venture shaped much of what came after; nor that these caches have some odd software-visible quirks.

To save chip pins, the cache has no separate write-enables for bytes, half-words, or other partial-word writes. So when an R2000-family CPU performs a partial-word write, it writes the data through to main memory and invalidates the cache line containing that word. This quirk was pressed into service for cache management: it provides a way of invalidating a cache line, namely, just write one byte to it.

You can see how such simplifications were defended. The R2000's designers argued that partial-word writes are mostly used for character operations, that character operations are nearly always provided by library functions, and that those library functions could be rewritten in terms of whole-word operations. Assumptions like these always turn out to be wrong, or at least half wrong.

The arguments lasted until people realized that not all systems can use the same function libraries and that invalidating a line on every byte write is not a good idea. It could not be tolerated, so there was a big change: the R3000-family CPUs perform partial-word writes with a read-modify-write (RMW) sequence. The RMW sequence appeared in those 32-bit MIPS CPUs and adds one clock cycle to the latency of such a write.

That left cache invalidation in a quandary: the R2000 had the advantage that, because of its strange habits, a line could be invalidated just by a byte write. The answer is that the R3000 cache keeps a mode called isolation, otherwise used only for cache diagnostics. In isolated mode the RMW sequence is suppressed, and a partial-word write still invalidates the line. That is unfortunate but not a disaster, and for some operating systems it even has a helpful side. The significant point is that while the cache is isolated, reads and writes affect only the cache and never reach memory at all.

4.9.1 Using Cache Isolation and Swapping

All R3000-family caches are write-through, which is to say the cache never holds data newer than memory. That means data in the cache never needs writing back, so cache management comes down to being able to invalidate D-cache and I-cache lines.

Cache management needs no special memory address space, just ordinary reads and writes performed in special cache modes. There is a status register bit, SR(IsC), that puts the D-cache into isolated mode: in this mode, reads and writes affect only the cache, and every access hits, regardless of whether the tag actually matches the address. While the D-cache is isolated, a partial-word write invalidates the corresponding cache line.

Caution!

While the D-cache is isolated, no read or write reaches memory, not even one whose address or TLB entry marks it as uncached. The consequence is that the cache-management routine must be guaranteed not to touch any memory data; if you can control your compiler well enough to be sure all the routine's variables are held in registers, you can write it in a high-level language. You must also ensure that interrupts are off while this code runs.

The I-cache is just as inaccessible in normal operating mode. So the CPU provides another mode, cache swapping, selected by the SwC bit of the status register: the D-cache then takes the I-cache's place and the I-cache takes the D-cache's. While the caches are swapped, isolated I-cache entries can be read, written, and invalidated.

The D-cache can serve perfectly well as an I-cache (and probably the I-cache could do some work as a D-cache), but the I-cache cannot reliably act as a D-cache. That is all right, because swapped mode is of no use except for cache maintenance, when the caches will be isolated anyway.

If you have used the swapped I-cache to store word-sized data (partial-word writes, as before, just invalidate the line), you must make sure the corresponding cache lines are invalidated before you return to normal mode.

4.9.2 Initializing and Sizing

The state of the caches when the machine starts up is undefined and unpredictable. You should also realize that the SwC and IsC bits of the status register are equally undefined after reset, so it is best to set those bits to known values before the first read or write of any data (even uncached data).

Cache sizes differ between MIPS CPUs. To keep your software portable, it is better to calculate the D-cache and I-cache sizes at initialization time than to build in a fixed value.

Here is how to measure a cache's size (a C sketch follows the steps):

a. Isolate the cache; to size the I-cache, swap the caches as well.

b. On R3000-family CPUs the cache size will be one of 256KB, 128KB, 64KB, 32KB, 16KB, 8KB, 4KB, 2KB, 1KB, or 0.5KB (K is 1024; the unit is bytes). For each possible size N, going from largest to smallest, write the value N to the physical address N. The easiest way to generate the physical address is through the kseg0 segment (N + 0x80000000). Because cache addresses wrap around, any write at a multiple of the true cache size lands on line zero, so the larger values written earlier get overwritten by the smaller ones written later.

c. Now read physical address zero (that is, 0x80000000): the value you get back is the cache size.
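
In C the probe might look like this (a sketch only: it assumes the cache has already been isolated, so reads and writes touch only the cache, and swapped if it is the I-cache being sized):

#include <stdint.h>

#define KSEG0 0x80000000u

static uint32_t size_cache(void)
{
    uint32_t n;

    /* Write each candidate size to the matching offset, biggest first. */
    for (n = 256 * 1024; n >= 512; n >>= 1)
        *(volatile uint32_t *)(KSEG0 + n) = n;

    /* Because cache addresses wrap, offset 0 now holds the true size. */
    return *(volatile uint32_t *)KSEG0;
}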

To initialize the cache, you must make sure every cache entry is invalid, corresponds correctly to some memory location, and holds correct check bits:

a. Check that the PZ bit of the status register SR is zero (set to 1, it suppresses the cache parity bits, which is not a good idea for an on-chip cache).

b. Isolate the D-cache; to do the I-cache, swap it with the D-cache as well.

c. For each word of the cache, first write a word value (setting up the tag, data, and parity bits of the line) and then write one byte (invalidating the line).

Note, however, that on an I-cache with four words per line this is quite inefficient, since a single byte write per line would be enough to invalidate it. That does not matter unless you expect to call this invalidation code often. If you do want to optimize the invalidation routine for the actual cache structure, you need to determine that structure at startup.
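
In outline, the word-write/byte-write sequence looks like this (a sketch: it assumes the cache is already isolated, and swapped for the I-cache, and in practice it is written in assembly so that it touches no memory data):

#define KSEG0 0x80000000u

static void init_r3k_cache(unsigned size)   /* size from the probe above */
{
    unsigned off;

    for (off = 0; off < size; off += 4) {
        volatile unsigned *wp = (volatile unsigned *)(KSEG0 + off);

        *wp = 0;                    /* word write: sets tag, data, parity */
        *(volatile char *)wp = 0;   /* byte write: invalidates the line   */
    }
}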

4.9.3 Cache Invalidation

To invalidate a range of addresses in the cache, follow the steps below (a C sketch follows the list):

a. Work out the range of addresses you need to invalidate in the cache. Covering a range bigger than the cache itself is a waste of time.

b. Isolate the D-cache. Once it is isolated you cannot read or write memory, so you must take every precaution against exceptions: turn off all interrupts, and make sure the code that follows cannot cause a memory-access exception.

c. If it is the I-cache you want to invalidate, put the caches into swapped mode as well.

d. Write one byte to each line within the address range you just calculated.

e. Turn off cache swapping and isolation.
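
Putting steps a through e together (a sketch: isolate_dcache(), swap_caches(), and restore_modes() are hypothetical helpers that set and clear the IsC and SwC bits in SR, and the real routine must obey the register-only, interrupts-off rules of Section 4.9.1):

extern void isolate_dcache(void);   /* hypothetical: set IsC in SR  */
extern void swap_caches(void);      /* hypothetical: set SwC in SR  */
extern void restore_modes(void);    /* hypothetical: clear SwC, IsC */

#define LINE_BYTES 4                /* assume one-word lines        */

static void invalidate_range(unsigned kseg0_addr, unsigned nbytes, int icache)
{
    unsigned p, end = kseg0_addr + nbytes;

    isolate_dcache();
    if (icache)
        swap_caches();

    for (p = kseg0_addr; p < end; p += LINE_BYTES)
        *(volatile char *)p = 0;    /* byte write invalidates a line */

    restore_modes();
}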

Normally you would arrange for the invalidation routine itself to run cached in the I-cache. That sounds incestuous and dangerous, but in fact it takes no extra steps to run it cached, and an invalidation routine runs some 4 to 10 times faster cached than uncached.

While your CPU has the IsC bit set, you must keep all interrupts off, because memory is inaccessible.

4.9.4 Testing and Probing

Being able to read out cache entries is very helpful for testing, debugging, or profiling. You cannot read a tag's value directly, but there is a roundabout way to discover which lines hold valid data for which addresses:

a. Isolate the cache.

b. Load from the possible start address of each line in turn: the low address bits match the line's index, while the high bits run over the physical addresses of the memory present in your system. After each load, check the CM bit of the status register; only when that bit is zero (a hit) is the tag value you have inferred correct.

This takes a lot of CPU cycles, but on a 20MHz processor a complete survey of a 1KB D-cache against 4MB of physical memory takes only a few seconds.
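
A sketch of the probe for one line follows (my illustration: sr_cm_clear() is a hypothetical helper that reads SR after an isolated load and reports whether the CM bit is clear, meaning the load hit; the sizes are the 1KB cache and 4MB memory of the example):

#define KSEG0     0x80000000u
#define DCACHE_SZ 1024u
#define MEM_BYTES (4u * 1024 * 1024)

extern int sr_cm_clear(void);         /* hypothetical SR-reading helper */

/* Returns the physical address a line currently holds, or ~0 if none. */
static unsigned probe_line(unsigned line_offset)
{
    unsigned hi;

    for (hi = 0; hi < MEM_BYTES; hi += DCACHE_SZ) {
        (void)*(volatile unsigned *)(KSEG0 + hi + line_offset);
        if (sr_cm_clear())            /* hit: the tag matches this address */
            return hi + line_offset;
    }
    return ~0u;
}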

4.10 Programming R4000-Style Caches

The R4000 dropped the early CPUs' cache quirks as inappropriate. More positively, the R4000's caches have several modes of operation (write-back, write-allocate) and longer lines. Because there is a write-back mode, each line needs a status bit that marks it dirty when the CPU writes into it (indicating that the cached data now differs from memory).

For such caches we need both invalidate and write-back operations, and we must be able to guarantee that any data the CPU has written into the cache gets written to memory.

For diagnostic and maintenance purposes, it is better if the tags can be read; the R4000 adds a pair of registers, TagLo and TagHi, to move data between the cache tags and the system-management software. On the R4000 the data inside a cache line cannot be read directly, though of course you can get at it through ordinary cached accesses. The CPU can copy a cache line's tag into the TagLo and TagHi registers, or write the contents of those registers into a cache tag. Figure 4.3 shows the layout of these registers.

The cache's address tag consists of all the physical address bits other than those used for the cache index, so the length of the primary-cache tag varies with the maximum physical address (36 bits on R4x00 CPUs) and the number of index bits. The original R4000's 8KB primary caches use a 13-bit index, and no later CPU uses fewer bits than that. That leaves a tag of 23 bits, while TagLo provides 24; so TagHi is always zero on current CPUs. It would matter only if the minimum cache size shrank or the maximum physical address grew. For now TagHi is redundant: set it to zero and forget it.

So the TagLo register holds all the tag bits for a cache line. TagLo(PState) holds the line's state bits as well. In some circumstances (multiprocessors) the states get complicated, but for cache management and initialization it is enough to know that a PState of zero is a legal value denoting an invalid cache entry.

A further field is used by later CPUs to hold state information for the second-level cache, but there too a value of zero is safe and suitable for initialization.

Finally, TagLo(P) is a parity bit covering the rest of the cache tag. A TagLo of all zeros has correct parity. Some CPUs ignore this bit without checking it, and then no harm is done.

4.10.1 The CacheErr, ECC, and ErrorEPC Registers: Cache Error Handling

The CPU's caches are a critical part of the memory system. For a system that must be both efficient and correct, it can be worth paying something to ensure the integrity of the data stored there.

Memory-system checking is ideally end to end: check bits should be generated as the data is created or brought into the system, stored along with the data, and verified only when the data is finally used. That way the check catches faults not just in the memory array but in the complex buses and widgets the data passes through on its way to the CPU and back.

For this reason, the R4x00 CPUs (designed with large computer systems in mind) provide error checking inside the caches. As with main memory, you can use simple parity or an error-correcting code (ECC).

Parity simply keeps one extra bit for each byte of memory. A parity error cannot be fixed, but it tells the system the data is unreliable and lets the system stop in a controlled way instead of creeping onward with random errors. A vital role of parity is the huge help it gives during system development, by pointing squarely at memory-data-integrity problems that would otherwise show up as baffling misbehavior.

But a garbage byte still has a 50-50 chance of showing correct parity, so completely random garbage on a 72-bit data bus goes undetected one time in 256. Some systems need to do better.

An error-correcting code is more elaborate: for each 64 bits of data there are 8 check bits. It is very thorough: any single-bit error can be pinpointed and corrected, and no two-bit error goes unnoticed. In a very large memory array, ECC essentially eliminates random errors.

Because the ECC covers a whole 64 bits at a time, an ECC-protected memory cannot perform a partial-word write directly: the new bytes must be merged with the old data and the check bits recalculated. MIPS CPUs require the memory system to support uncached partial-word writes, and that complicates things: the memory hardware must turn each partial-word write into a read, a merge, a check-bit recalculation, and finally the actual write.

For simple systems, the usual choice is byte parity or nothing. It makes good sense to make checking a configuration option, so that it aids diagnosis during development but does not have to be paid for in the production system.

Whatever checking mechanism runs in the memory system, the caches of an R4x00 can keep a parity bit per byte, or an 8-bit ECC over each 64 bits, or simply go unprotected.

Where error detection is supported, the check bits arriving with the data on a cache fill are normally stored straight into the cache from the system interface. They are checked only when the data is used, which ensures that any cache parity error is charged to the instruction that consumed the bad data rather than to an innocent cache refill. As a degenerate result, a parity error on uncached data is also flagged as a cache parity error, which can be confusing.

Note that the system interface can mark incoming data as carrying no valid check bits; in that case the CPU regenerates the check bits itself for the cache's internal use.

If an error is detected, the CPU takes a special error trap, whose vector is in an uncached location (if something is wrong inside the cache, it would be stupid to run the error handler from the cache). If the system uses ECC, the hardware generates and stores the check bits on writes; but the hardware does not know how to correct errors. That is the software's job.

The CacheErr register (Figure 4.4) contains the following fields:

a. ER/ED/EE/EB: these bits record where the error occurred, that is, in which cache (primary or second level, instruction or data), or whether it was outside, at the system interface.

b. PIdx: the cache index of the failing location, in a form you can feed straight to an index-type cache operation; it picks out the correct line regardless of whether the cache is direct mapped or set associative.

When the error occurs, the ErrorEPC register points at the instruction that was executing. The ECC register holds the raw check bits, which you need in order to repair a correctable error; but we will not pursue that here, since it needs a lot of space and you would want the processor manual open anyway. You should be able to obtain some sample code from your CPU vendor.

4.10.2 The Cache Instruction

The cache instruction uses a format like that of the MIPS load/store instructions (the usual base register plus a signed 16-bit offset forms the address), but the field that would normally select the data register instead encodes which cache operation to perform. There are no standard names for these operations; I will use the names from the Algorithmics SDE-MIPS include files. The selector field is not a complete binary encoding, but it is nearly one; see Table 4.2. The cache selector lets you specify:

a. Which cache: the I-cache or D-cache, primary or second level. There are no bits to spare for a third-level cache. I should remind you that this encoding is CPU dependent, although the 64-bit CPUs after the R4000 have found it very helpful to remain R4000 compatible.

b. How the line is addressed: there are two different ways. A hit-type operation takes a normal program (virtual) address, which is translated as usual. If that address is present in the cache, the operation is carried out on the corresponding cache line; if it is not in the cache, nothing happens at all.

The other way is an index-type operation: the low bits of the address directly select a cache line, regardless of what that line currently holds. For this you need to know the size and internal organization of the cache.

Hit-type operations are the usual way of doing routine cache maintenance, while index-type operations are needed at initialization time.

c. Writeback: if the corresponding line is dirty, write its data back to memory; if not, the operation is like a nop.

d. Invalidate: mark the line invalid, so its data can no longer be used. Combined writeback-and-invalidate operations exist; but writeback is never automatic, so if you invalidate a dirty line without it, an application may lose data.

e. Load/store tags: these operations copy the tag of the addressed line into the TagLo and TagHi registers, or write the contents of those registers into the line's tag.

The store-tag operation is most often used in a cruder way, with the TagLo and TagHi registers set to zero; that is part of cache initialization.

f. Fill: defined only for the I-cache, this fills a cache line from a specified memory address. No fill is needed for the D-cache, because an ordinary cached load has the same effect when it misses.

g. Create dirty: this operation exists to let software write whole regions of memory at cache speed while avoiding any cache refill traffic. It is safe only if you can guarantee that all the data in the line will be overwritten before it is either used or written back to memory.

It is also very helpful for initialization and diagnostics (you will see it in the second-level cache code example later, used to clear the second-level cache).
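
There is no single standard C-level interface to these operations, but here is a sketch (mine, in the spirit of the SDE-MIPS include files) of how the helper macros used in the examples below might be built with GNU C inline assembly. The selector values are the standard R4000-style encodings; check them against Table 4.2 and your CPU's manual:

#define Index_Store_Tag_I 0x08   /* index op, primary I-cache */
#define Index_Store_Tag_D 0x09   /* index op, primary D-cache */
#define Hit_Invalidate_I  0x10   /* hit op, primary I-cache   */
#define Fill_I            0x14   /* fill, primary I-cache     */

#define cache_op(op, addr)                                        \
    __asm__ __volatile__(                                         \
        ".set push\n\t"                                           \
        ".set mips3\n\t"                                          \
        "cache %0, 0(%1)\n\t"                                     \
        ".set pop"                                                \
        : : "i"(op), "r"(addr) : "memory")

#define index_store_tag_i(a) cache_op(Index_Store_Tag_I, (a))
#define index_store_tag_d(a) cache_op(Index_Store_Tag_D, (a))
#define hit_invalidate_i(a)  cache_op(Hit_Invalidate_I, (a))
#define fill_i(a)            cache_op(Fill_I, (a))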

4.10.3 Cache Sizing and Figuring Out Configuration

On R4x00 CPUs (and the vast majority of others) the sizes and line sizes of the primary caches can be read reliably from the CP0 Config register.

Finding out whether your cache is direct mapped or set associative is rather harder. For a running, initialized cache it is not difficult to test: fetch into the cache the data for several addresses that would collide in a direct-mapped cache, then use index-type operations to check that they are all present. Of course, that is no use before the cache has been initialized. Fortunately, you can write a single initialization routine that works for both direct-mapped and set-associative caches.

4.10.4 Initialization

Here is a reasonable procedure:

1. Set aside some memory whose contents can be anything, but if your system uses parity or ECC, its check bits must be correct, because the caches will be filled from this data. (In the Algorithmics toolkit we reserve the low 32KB of system memory until cache initialization is done; as long as you write that memory while the caches are still disabled, its check bits will be correct.) You also need enough such memory when initializing the second-level cache; we will get there by a roundabout route.

2. Set the TagLo register to zero, which makes sure that the valid bit of each line will be cleared and that the tag parity is consistent.

The TagLo register is used by the cache Index_Store_Tag instructions to force the addressed line invalid and clear the tag parity.

3. Mask interrupts; otherwise something unexpected may intervene.

4. Initialize the I-cache first, then the D-cache. Below is C code for the I-cache initialization. (You have to take it on trust that functions or macros such as index_store_tag_i() and fill_i() perform the corresponding cache operations; they are either tiny assembly-code functions or, for GNU C users, macros wrapping an inline cache instruction, like the sketch at the end of Section 4.10.2.)

/* addr is an unsigned address; size and lnsize come from the Config register */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize) {
    /* clear tag to invalidate */
    index_store_tag_i(addr);
    /* fill so data field parity is correct */
    fill_i(addr);
    /* invalidate again - prudent but not strictly necessary */
    index_store_tag_i(addr);
}

5. Initializing the D-cache is more complicated, because there is no index_fill_d operation; the only way to get data into the D-cache is to rely on the ordinary refill that happens when a cached load misses. Where the I-cache fill works by index, a load finds its cache line by matching the tag against the memory address, so you have to be careful what the tags contain. There is a further trap for two-way set-associative caches: a naive load loop would initialize only half the D-cache, because the store-tag operation also resets the bit in the line that determines which set of cache lines is refilled on the next miss. The following sequence gets it right.

/* clear all tags */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize)
    index_store_tag_d(addr);

/* load from each line (in cached space) */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize)
    junk = *(volatile unsigned *)addr;

/* clear all tags */
for (addr = KSEG0; addr < KSEG0 + size; addr += lnsize)
    index_store_tag_d(addr);

4.10.5 Invalidating or Writing Back a Region of Memory in the Cache

The parameters for an invalidate or write-back routine are normally a region of memory: a range of program addresses or, for some I/O jobs, of physical addresses.

You will almost always use hit-type cache instructions, picking out by address exactly the lines you need to invalidate or write back. When the region is much larger than the cache, it is quicker to use index-type instructions to invalidate or write back the whole cache instead; that optimization is easy to overlook. The straightforward hit-type routine is reason enough:

void pi_cache_invalidate(void *buf, int nbytes)
{
    char *s;

    /* lnsize: the primary I-cache line size, from the Config register */
    for (s = (char *)buf; s < (char *)buf + nbytes; s += lnsize)
        hit_invalidate_i(s);
}

Note that there is no need to manufacture a special address: any program address mapping to the right physical locations will do, and buf can usually be passed in directly. If what you have is a physical address p, you must add a constant to bring it into the kseg0 range, as in the following example:

pi_cache_invalidate((void *)(p + 0x80000000), nbytes);

4.11 Cache Efficiency

From the early 1990s, caches have been built onto the same chip as the CPU, and the performance of a fast CPU is largely determined by the performance of its cache system. In many systems today (especially embedded systems, where cache size and memory performance are squeezed to save cost), the CPU spends 50-60% of its time stalled waiting for cache refills. Doubling such a CPU's raw speed improves application performance by only 15-25%.

Cache performance comes down to the total time the system spends waiting for cache refills. You can break that down into the product of two parameters (a toy calculation follows the definitions):

a. Cache miss rate: the proportion of CPU accesses (instruction fetches or data reads/writes) that miss in the cache and so must go to memory.

b. Cache refill penalty: the delay from the miss until the data fetched from memory allows the CPU pipeline to continue.
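
To see how the two parameters combine, here is a toy calculation (the numbers are mine, purely illustrative):

#include <stdio.h>

int main(void)
{
    double miss_rate = 0.05;   /* 5% of accesses miss (assumed)      */
    double penalty   = 20.0;   /* 20 CPU cycles per refill (assumed) */
    double accesses  = 1.35;   /* memory accesses per instruction    */

    /* Extra cycles per instruction lost to cache refills. */
    printf("stall CPI = %.2f\n", miss_rate * penalty * accesses);
    return 0;
}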

Of course, neither parameter measures anything well by itself, and comparisons can mislead. For example, x86 CPUs have very few registers, so the same program compiled for x86 makes many more data accesses than it would on MIPS. But the x86 uses the stack rather than registers for that extra traffic, and the stack is such a heavily used area of memory that its cache hit rate is very high. Measuring the number of cache misses across a large suite of programs gives a fairer picture.

The observations above suggest several ways to improve system running speed:

a. Reduce the cache miss rate:

1. Make the cache bigger. This is the most effective method, and of course the most expensive. In 1996, 64KB of cache occupied half the silicon area of a high-speed embedded CPU or more; to fit more cache into the same area, you can only wait for Moore's law to provide more gates.

2. Increase the cache's set associativity. It is worth going up to four-way; beyond that the improvement is barely measurable.

3. Add extra levels of cache. Of course, this makes the sums more complicated. Apart from the extra complexity of the memory subsystem, the second-level cache's hit rate is depressed, because the primary cache has already skimmed the cream off the program's memory behavior. To be worthwhile, the second-level cache must be much bigger than the primary cache (eight times or more, as a rule) and its access time must be much less than memory's (twice as fast or better).

4. Optimize your program to reduce its miss rate. Whether that is practical is hard to say. It is relatively simple for small programs but very laborious for sprawling ones, and so far nobody has produced a tool that can optimize arbitrary programs this way. See Section 4.12.

b. Reduce the cache refill penalty:

1. Get the first word of the data to the CPU sooner. DRAM memory systems need a long run-up before they deliver data fast. Put the memory and CPU closer together, with fewer stages on the path between them, and data moves between them sooner. Note that this is about the only remedy applicable in an inexpensive system, and it works well. Oddly, it gets little attention, perhaps because it requires the CPU interface and the memory system to be designed together. CPU designers probably feel their job is complicated enough already!

2. Increase the memory burst bandwidth. The traditional, expensive technique is interleaving: two or more memory banks store alternate words, and after start-up you take data alternately from each bank, so the memory appears twice as fast. The first mainstream memory technology to build this in was the synchronous DRAM (SDRAM) that appeared in 1996; SDRAM revised the DRAM interface to provide much greater bandwidth.

c. Restart the CPU early. The simple version is to fetch first the word the CPU actually missed on and restart the CPU the moment that word arrives; the rest of the cache refill then runs in parallel with the CPU. MIPS CPUs have applied this technique from the R4x00 on, refilling through a sub-block ordering that delivers the needed word first; but only the R4600 and its successors really show the benefit.

A more radical approach lets the CPU carry straight on past the missed load: the refill is handled by the bus interface controller, and the CPU keeps running until it actually needs the data that has not yet arrived in the register. This is called a non-blocking load, and it appears from the R10000 and RM7000 onward.

More aggressive still is to go on executing any subsequent code that does not depend on the still-awaited data: the R10000 can execute instructions out of order. That technique is thoroughgoing, applying not just to loads but to computation and branch instructions too.

4.12 Reorganizing Software to Influence Cache Efficiency

Mostly we build software on the assumption that the cache is just part of the memory system, and we play fair by not trying to outguess it. For most purposes we can also assume that accesses are spread more or less randomly. For a workstation, which must run a wide mix of applications acceptably, that is a fair assumption. But when an embedded system runs one particular application, the misses produced by that particular program, compiled in that particular way, are what count. It is tempting to tune the application code to improve the system's cache efficiency. To understand how that might be done, classify the misses by what causes them:

a. First-time misses: any data must be read from memory at least once.

b. Replacement misses: the cache is of limited size, so as your program runs it keeps displacing valid data that it will want again, and it misses when it comes back for that data. You can minimize replacement misses with a bigger cache or a smaller program (really, it is the ratio of program size to cache size that counts).

c. Thrashing misses: in practice, caches are rarely more than four-way set associative, so for any program address there are at most four cache locations where its data can live; a direct-mapped cache offers one, a two-way set-associative cache two. (Endless comparative studies keep suggesting that a four-way set-associative cache loses almost nothing from its limited choice of ways.)

If your program makes very frequent use of N separate regions of data whose addresses are close in their low-order bits, they all contend for the same cache lines. If N is greater than the cache's set associativity, misses become very frequent, because each region's data keeps evicting the other regions' data from the cache. Knowing all this, how might you change your program to make it work better with the cache?

a. Make the program smaller. If you can do it, it is a good idea. Choose your compiler optimizations appropriately (aggressive optimization usually produces a bigger program).

b. Make the frequently executed part of the program smaller. Access density is never uniform across a program. There is usually a considerable body of code that is almost never used (error handling, obscure system management) or used only once (initialization code). If you can split off that code, the remaining program gets a much better cache hit rate.

A worthwhile refinement is to identify a handful of frequently used routines and fix them at chosen positions in memory to reduce conflicts. At the least, those frequently used routines will then never collide with each other in the cache.

c. Force important code or data to stay in the cache. Some machines let you protect part of the cache so its contents are never replaced. The protected code is typically an interrupt handler or other software someone has decided is important. Such code or data is generally kept in one of the ways of a set-associative cache (so while the lock is in force, the rest of the cache behaves like a direct-mapped cache).

I doubt this approach's viability, and I know of no research supporting its effectiveness. The loss from denying that cache capacity to everything else is likely to be greater than the gain for the code locked in. Cache locking probably persists as a marketing feature, pandering to the wish to tame the cache's heuristic nature. The wish is understandable: everyone wants their program to go predictably faster. But the cache is only one of the heuristics that determine the result.

d. Lay the program out to avoid collisions. Beyond making the frequently executed part smaller, as above, this does not strike me as a good idea; and set-associative caches (even two-way ones) make the exercise still less worthwhile.

e. Make rarely used data and code uncached. It looks very attractive: let the cache serve only the important code and data, and shut out whatever is used once or rarely.

But this is almost always a mistake. If the data really is rarely touched, it will not be occupying the cache anyway. And because the cache fetches lines of 4 to 16 words in a burst, even data that is used only once goes faster cached: a burst refill takes hardly longer than a single-word access and gives you the next 3 to 15 words free.

In short, here is a starting point (stray from it only after much practice and hard thought). Begin by making everything cached except I/O registers, and be sparing with uncached memory. Before trying to make predictions, find out how your application actually exercises the cache. Next, eliminate any problems on the hardware side; no software ingenuity can recover the performance lost to a high cache refill cost or meager memory bandwidth. Then try to reduce the cache miss rate by reorganizing the software, without letting it grow or get complicated, but understand that the gains will be small and hard won. And keep trying to optimize the hardware as well.

4.13 Write Buffers and When You Need to Worry

The 32-bit MIPS CPUs usually have a write-through cache, so every write operation goes straight through to main memory. If the CPU had to wait for each write to finish before continuing, writes would be a serious performance bottleneck.

When C programs are compiled for MIPS, an average of about 10% of the instructions executed are stores. Moreover, stores tend to bunch into bursts; a typical example is the register saves at the start of a function. DRAM memories have a matching character: the first write of a group usually takes a long time (5 to 10 clock cycles on these CPUs), while subsequent ones can follow quickly. If the CPU simply waited for every write, the performance loss would be severe. So a write buffer is usually provided, holding the data and address of each pending write as it is issued.

The 32-bit MIPS CPUs with write-through caches rely heavily on the write buffer. In these CPUs a FIFO that buffers four writes is typical; by the time CPU clock rates reached 40MHz, it was becoming difficult for the write buffer to bridge the gap to the memory system.

In the later CPUs (with write-back caches) the write buffer holds the cache lines waiting to be written back, and also smooths out the timing of uncached writes.

Most of the time the operation of the write buffer is transparent to software. But there are situations a programmer should watch for:

a. Timing of I/O register accesses: this affects all MIPS CPUs. When you perform a write to an I/O register, it reaches the device after a delay you cannot determine, and other communication with the I/O system may overtake it. One example: after you write to a device register telling the device to drop its interrupt, you may still see the interrupt active for a while. In other cases, if an I/O register needs a recovery time after a write, you must be sure the write buffer is empty before you start counting out that delay: you must make the CPU wait until the write buffer empties. It is a good habit to define a subroutine that does this; calling it wbflush() is traditional. See Section 4.13.1 for how to implement it.

The above can happen on any MIPS CPU from the R4x00 (MIPS II ISA) onward, and equally on the whole IDT R3051 family and the great majority of popular embedded CPUs. But on some early 32-bit systems, stranger things can happen:

a. Reads overtaking writes: suppose a read (uncached, or missing in the cache) is to be performed while the write buffer is not empty. The CPU has a choice: wait for the queued writes to finish, or let the read use the memory interface first. Letting the read go first improves efficiency, because the CPU must stall until the read data arrives anyway, while the queued writes can still be retired in parallel with CPU execution afterward.

The first R3000 hardware left this choice to the system hardware designer. Most MIPS I CPUs from IDT onward give writes unconditional priority, never letting a read overtake them. Most MIPS II CPUs also keep reads from overtaking writes, but software should not necessarily rely on that; see the description of the sync instruction in Section 8.4.9.

If you cannot be sure your MIPS I CPU gives writes unconditional priority, then even careful address checking will not help you with I/O registers: an earlier write to a different address may still be uncompleted, and acting on a status read before it completes will go wrong. In such cases you need to call wbflush().

b. Byte gathering: a write buffer that notices several queued writes to the same word address may combine them into a single write. This is not done by the R3051 family of CPUs, because it can produce errors when the writes are to I/O registers.

If your I/O registers are each mapped into their own word, byte gathering does no harm. You should arrange that anyway, but sometimes you cannot.

4.13.1 Implementing wbflush()

Unless your CPU is one of the peculiar types mentioned above, a read of any uncached location guarantees the write buffer has emptied (it stalls the CPU until the queued writes, and then the read itself, have completed). But that is not fast; you can limit the cost by directing the dummy read at your fastest memory. For those who never want to think about it again, the safe recipe is an uncached write followed by an uncached read of the same location (on a MIPS III or later CPU, put a sync instruction between the two); it is hard to imagine a CPU on which that will not drain the buffer.
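
Here is a sketch of a minimal wbflush() under those assumptions; the drain location is a made-up example (the kseg1 alias of physical address 0) and should be replaced by a fast, side-effect-free uncached address on your board:

#define KSEG1 0xa0000000u

static volatile unsigned *const wb_drain = (volatile unsigned *)KSEG1;

void wbflush(void)
{
#if defined(__mips) && __mips >= 3
    __asm__ __volatile__("sync" ::: "memory");   /* order earlier writes */
#endif
    (void)*wb_drain;   /* uncached read: completes only after the buffer drains */
}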

Some systems bring out a hardware signal showing that the write FIFO is empty, perhaps even wired to an input the CPU can test immediately. But no CPU so far does this for you.

Caution! Some systems have write buffers outside the CPU as well: any bus bridge or memory interface that posts (buffers) writes has the same character. An external write buffer behaves just like the one inside the CPU and brings you exactly the same problems.

4.14 More About MIPS Caches

Although you may never need to know any of this, there are still good reasons to mention it.

4.14.1 Multi-processor Cache Features

The discussion in this book covers only single-CPU systems. If you are interested, read the literature (Sweazey and Smith, 1986).

4.14.2 Cache Aliases

This problem affects only caches in which the address used to generate the index and the address stored in the tag are different. In the primary caches of R4000-class CPUs, the index comes from the virtual address while the tag holds the physical address. That is very good for performance, because the cache lookup and the address translation can proceed in parallel, but it can also give rise to aliases.

Most of these CPUs translate addresses in 4KB pages, while the cache is 8KB or more. Two different virtual addresses can map to the same physical address; suppose they lie in two pages whose virtual start addresses are 0KB and 4KB. If the program accesses the data through the 0KB address, the data is read into the cache at index 0. Suppose the program then accesses the same data through the 4KB address: the cache fetches it from memory again and stores it, but this time at index 4K. The cache now holds two copies of the same data; change one, and the other is unaffected. That is a cache alias.

MIPS second-level caches use the physical address both to generate the index and to store in the tag, so they do not suffer from this problem.

Fortunately, avoiding the problem is easier than curing it. If no two virtual addresses for the same data can ever generate different cache indexes, the problem never arises. With 4KB pages, the lowest 12 bits of virtual and physical address always agree; what is needed is that any two virtual addresses mapped to the same physical page be equal modulo the primary cache size. If you can ensure that shared mappings are placed at virtual addresses that are multiples of 64KB apart, no alias can form and you will stay out of trouble.
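
As a concrete statement of the rule (my sketch; the 16KB way size is an assumed figure), two virtual mappings of one physical page are safe exactly when they generate the same primary-cache index:

#include <stdbool.h>
#include <stdint.h>

#define WAY_SIZE 0x4000u   /* assumed: 16KB per primary-cache way */

/* True if mapping the same physical page at va1 and va2 could create
   two cache lines for the same data (a cache alias). */
static bool would_alias(uint32_t va1, uint32_t va2)
{
    return (va1 % WAY_SIZE) != (va2 % WAY_SIZE);
}

/* An OS avoids aliases by choosing va with va % WAY_SIZE equal to
   pa % WAY_SIZE, or simply by aligning shared mappings to 64KB. */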

Please credit the original source when reposting: https://www.9cbs.com/read-86904.html
