CPU Learning Notes (2)
Author: Badcoffee
Email: blog.oliver@gmail.com
April 2005
Original article: http://blog.9cbs.net/yayong
Copyright: When reposting, please be sure to credit the original source, author information and this statement with a hyperlink.
These are the author's notes from the process of learning about hardware. Because my earlier knowledge is limited and I have not
studied the subject systematically, errors are inevitable; I hope readers will correct them.
I. Cache Coherence
An article I wrote in 2004, X86 Assembly Language Learning Notes (1), raised the question of why code compiled by GCC
aligns the stack to 16 bytes by default. The main reason is performance optimization.
Most modern CPUs have on-die L1 and L2 caches. Most L1 caches are write-through; the L2 cache
is write-back and does not write modified data back to memory immediately, which can leave cache and memory inconsistent.
In an MP (multi-processor) environment, because each cache is private to its CPU, the cache contents of different CPUs can
also be inconsistent. Therefore many MP architectures, whether ccNUMA or SMP, implement a cache coherence
mechanism, that is, a mechanism that keeps the caches of different CPUs consistent.
One implementation of cache coherence is a cache-snooping protocol: the snoop logic of each CPU
monitors the cache reads and writes of the other CPUs.
First, note that a cache line is the minimum unit of data transfer between cache and memory.
1. When CPU1 writes to its cache, the other CPUs check the corresponding cache line in their own caches. If it is dirty,
they write it back to memory and CPU1's cache line is refreshed; if it is not dirty, they invalidate
this cache line.
2. When CPU1 reads from its cache, any other CPU whose corresponding cache line is marked dirty
writes it back to memory, and CPU1's cache line is refreshed.
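The two rules above can be sketched as a toy model in C. This is only an illustration of the behavior described in these notes, not MESI or any real protocol: it tracks a single cache line for two CPUs, and all names (struct line, cpu_read, cpu_write, write_back_count) are hypothetical. It also assumes that a remote write invalidates the other copy even after a dirty write-back, which the rules above do not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 2

/* One cache line per CPU, all caching the same memory location. */
struct line {
    bool valid;   /* this CPU holds a copy of the line */
    bool dirty;   /* the copy was modified and not yet written back */
    int  data;
};

static int memory_word;           /* backing memory for the line */
static struct line cache[NCPU];   /* private cache of each CPU */
static int write_backs;           /* number of cache-to-memory transfers */

/* Snoop step: every CPU other than `cpu` reacts to the access. */
static void snoop_others(int cpu, bool is_write)
{
    for (int i = 0; i < NCPU; i++) {
        if (i == cpu || !cache[i].valid)
            continue;
        if (cache[i].dirty) {         /* dirty copy: write back to memory */
            memory_word = cache[i].data;
            cache[i].dirty = false;
            write_backs++;
        }
        if (is_write)                 /* assumption: a remote write makes this copy stale */
            cache[i].valid = false;
    }
}

/* Refresh the requesting CPU's line from memory if it has no valid copy. */
static void refill(int cpu)
{
    if (!cache[cpu].valid) {
        cache[cpu].data  = memory_word;
        cache[cpu].valid = true;
        cache[cpu].dirty = false;
    }
}

int cpu_read(int cpu)
{
    snoop_others(cpu, false);
    refill(cpu);
    return cache[cpu].data;
}

void cpu_write(int cpu, int value)
{
    snoop_others(cpu, true);
    refill(cpu);
    cache[cpu].data  = value;
    cache[cpu].dirty = true;   /* write-back cache: memory is updated later */
}

int write_back_count(void)
{
    return write_backs;
}
```

Calling cpu_write(0, 42) followed by cpu_read(1) forces CPU 0 to write its dirty line back before CPU 1 can see the value, so every hand-off of the line between CPUs costs a memory transfer in this model.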
Therefore, improving the CPU's cache hit rate and reducing data transfers between cache and memory improves system performance.
For this reason, keeping data cache-line aligned when allocating memory for programs and binary objects is very important. If
cache-line alignment is not guaranteed, processes or threads running in parallel on multiple CPUs are much more likely to
read and write the same cache line. When that happens, write-backs and refreshes are repeated between the CPU caches and memory.
This situation is called cache thrashing.
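As a rough illustration of the layout problem described above, here is a sketch in C11 that compares two counters packed into one cache line with two counters that each start on their own line. The 64-byte line size is an assumption (check the actual value for your CPU), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size, not taken from the article */

/* Both counters fall into one cache line: two CPUs updating them
   keep forcing write-backs and refreshes of the same line. */
struct counters_shared {
    long a;   /* updated by thread 1 */
    long b;   /* updated by thread 2 */
};

/* Each counter starts on its own cache line. */
struct counters_padded {
    alignas(CACHE_LINE) long a;
    alignas(CACHE_LINE) long b;
};

/* Returns 1 if two byte offsets fall into the same cache line,
   assuming the enclosing object itself starts on a line boundary. */
int same_line(size_t off1, size_t off2)
{
    return off1 / CACHE_LINE == off2 / CACHE_LINE;
}
```

The padded layout costs memory: under the 64-byte assumption, struct counters_padded occupies two full lines instead of a few bytes, which is the same size-versus-speed trade-off that appears with stack alignment.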
To avoid cache thrashing effectively, there are usually two approaches:
1. For heap allocation, many systems enforce alignment in the malloc call.
2. For stack allocation, many compilers provide a stack-alignment option.
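For the heap case, the C standard only requires malloc to return memory aligned suitably for any standard type, which is usually smaller than a cache line. When a full cache line of alignment is wanted, it can be requested explicitly. The following is a minimal sketch using the C11 function aligned_alloc; the 64-byte line size and the helper names are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* assumed line size, not taken from the article */

/* Round n up to the next multiple of CACHE_LINE (a power of two). */
size_t round_up_line(size_t n)
{
    return (n + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
}

/* Allocate at least n bytes starting on a cache-line boundary.
   Returns NULL on failure; release the result with free(). */
void *alloc_line_aligned(size_t n)
{
    /* C11 aligned_alloc expects the size to be a multiple of the alignment. */
    return aligned_alloc(CACHE_LINE, round_up_line(n));
}

/* Returns 1 if p starts on a cache-line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}
```

Rounding the size up means a 100-byte request actually consumes 128 bytes here, so alignment on the heap also trades memory for speed.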
Of course, if stack alignment is specified in the compiler, the program becomes larger and takes up more memory. Therefore,
this trade-off needs to be considered carefully. Below is a discussion I found on Google:
One of our customers complained about the additional code generated to
maintain the stack aligned to 16-byte boundaries, and suggested us to
default to the minimum alignment when optimizing for code size. This
has the caveat that, when you link code optimized for size with code
optimized for speed, if a function optimized for size calls a
performance-critical function with the stack misaligned, the
performance-critical function may perform poorly.
II. GCC Alignment Parameters
-mpreferred-stack-boundary
This option was already mentioned in X86 Assembly Language Learning Notes (1). In addition, I found a mailing-list
message on Google that discusses stack alignment, which I share here:
----- Original Message -----
From: "Andreas Jaeger"
To: gcc@gcc.gnu.org
Cc: "Jens Wallner" Wallner@ims.uni-hannover.de
Sent: Saturday, February 03, 2001 2:37 AM
Subject: Question about -mpreferred-stack-boundary
>
> We (glibc team) got a bug report that the stack is not aligned
> properly - and I'm a bit confused by the documentation of
> -mpreferred-stack-boundary which is:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> @item -mpreferred-stack-boundary=@var{num}
> Attempt to keep the stack boundary aligned to a 2 raised to @var{num}
> byte boundary. If @samp{-mpreferred-stack-boundary} is not specified,
> the default is 4 (16 bytes or 128 bits).
>
> The stack is required to be aligned on a 4 byte boundary. On Pentium
> and PentiumPro, @code{double} and @code{long double} values should be
> aligned to an 8 byte boundary (see @samp{-malign-double}) or suffer
> significant run time performance penalties. On Pentium III, the
> Streaming SIMD Extension (SSE) data type @code{__m128} suffers similar
> penalties if it is not 16 byte aligned.
>
> Further, every function must be generated such that it keeps the stack
> aligned. Thus calling a function compiled with a higher preferred
> stack boundary from a function compiled with a lower preferred stack
> boundary will most likely misalign the stack. It is recommended that
> libraries that use callbacks always use the default setting.
>
> This extra alignment does consume extra stack space. Code that is sensitive
> to stack space usage, such as embedded systems and operating system kernels,
> may want to reduce the preferred alignment to
> @samp{-mpreferred-stack-boundary=2}.
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> Who has to align the stack for calls to a function - the caller or the
> callee? In other words: Does this mean that the stack has to be
> aligned before calling a function? Or does it has to be aligned by
> entering a function?
>
> Andreas
> -
> Andreas Jaeger
> SuSE Labs aj@suse.de
> private aj@arthur.inka.de
> http://www.suse.de/~aj
I believe the preferred alignment for long double is a 16 byte boundary, and
the stack (and instruction) alignments must be set before entering a function.
Pentium 4 increases preferred data alignments to 32 bytes in some situations,
as well as increasing the number of situations (SSE2 instructions) where 16 byte
alignment is needed.
As you can see here, stack alignment must be guaranteed before the function is called:
The stack (and instruction) alignments must be set before entering a function.
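To make the arithmetic of the option concrete, here is a small sketch in C. The documentation quoted above says the boundary is 2 raised to num bytes; the frame-size rounding is only an illustration of why a larger boundary consumes extra stack space, and does not describe how GCC actually lays out a frame. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Alignment in bytes requested by -mpreferred-stack-boundary=num,
   i.e. 2 raised to the power num. */
size_t boundary_bytes(unsigned num)
{
    return (size_t)1 << num;
}

/* Round a frame size up to the boundary. Illustrative only:
   real frame layout depends on the compiler and the function. */
size_t padded_frame(size_t frame, unsigned num)
{
    size_t align = boundary_bytes(num);
    return (frame + align - 1) & ~(align - 1);
}
```

With the default of 4, a hypothetical 20-byte frame is padded to 32 bytes, while with 2 it stays at 20 bytes, which is the size saving that embedded and kernel code may want.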
Related documents:
X86 Assembly Language Learning Notes (1)
CPU Learning Notes (1)
Cache Coherence with Multi-Processor