Floating point representation and its implementation.

xiaoxiao, 2021-04-04

I have known for two years that floats should not be compared with ==, because of precision problems, but I never looked into it any further. I had in fact studied the structure of floating-point numbers, but never understood it clearly. As a C enthusiast I like to get to the bottom of every problem, so I worked out how floating-point numbers are represented and implemented. Once the principles are clear, everything else is easy to understand and remember.

First, a word about sign-magnitude (original code), ones' complement (inverse code), two's complement (complement code) and offset binary (shift code). For a positive number the first three are identical. For a negative number, the ones' complement keeps the sign bit of the sign-magnitude form and inverts all the other bits, and the two's complement adds 1 to the lowest bit of the ones' complement. Offset binary is the two's complement with the sign bit inverted, so to obtain it, form the two's complement and then flip the sign bit.
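The four encodings can be compared side by side with a small sketch. The function names below are made up for illustration, the width is fixed at 8 bits, and the offset code uses a bias of 128 (the IEEE exponent discussed later uses 127 instead).

```cpp
#include <bitset>
#include <cstdint>
#include <string>

// Encode a small signed integer (-127..127) in four 8-bit schemes.
// All names here are illustrative, not taken from any library.

std::string signMagnitude(int v) {
    // Top bit is the sign, the remaining 7 bits hold |v|.
    std::uint8_t mag = static_cast<std::uint8_t>(v < 0 ? -v : v);
    std::uint8_t bits = static_cast<std::uint8_t>((v < 0 ? 0x80 : 0x00) | mag);
    return std::bitset<8>(bits).to_string();
}

std::string onesComplement(int v) {
    // Positive: same as sign-magnitude. Negative: invert every bit of |v|.
    std::uint8_t mag = static_cast<std::uint8_t>(v < 0 ? -v : v);
    std::uint8_t bits = v < 0 ? static_cast<std::uint8_t>(~mag) : mag;
    return std::bitset<8>(bits).to_string();
}

std::string twosComplement(int v) {
    // Conversion to an unsigned 8-bit type is modulo 256,
    // which is exactly the two's complement pattern.
    return std::bitset<8>(static_cast<std::uint8_t>(v)).to_string();
}

std::string offsetBinary(int v) {
    // Bias of 128: stored value = v + 128, which equals the
    // two's complement pattern with the top bit flipped.
    return std::bitset<8>(static_cast<std::uint8_t>(v + 128)).to_string();
}
```

For -5 these give 10000101, 11111010, 11111011 and 01111011 respectively. For +5 the first three all give 00000101, and the offset code gives 10000101.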

Floating-point types come as float and double, occupying 4 and 8 bytes, that is, 32 and 64 bits. I use the 32-bit float as the main example; double works the same way.

The IEEE 754 standard divides the 32 bits of a float as follows:

Sign bit (S)    Exponent (E)    Mantissa (M)
1 bit           8 bits          23 bits

Three points should be noted here:

A. The exponent is stored in offset (biased) form with a bias of 127: a stored value of 127 means an exponent of 0, a stored value below 127 means a negative exponent, and a stored value above 127 a positive one. For example, 10000001 is 129, so the exponent is 129 - 127 = 2 and the scale factor is 2^2, while 01111110 is 126 and represents 2^(-1).

B. The mantissa holds only the digits after the binary point.

C. A leading 1 is omitted from the mantissa, so a mantissa of all zeros still stands for 1.00...00.

With that in place, one example explains the rest. Take 123.456. In binary it is approximately N(2) = 1111011.01110100101111001 (the fraction does not terminate; this value is rounded to the 24 significant bits a float can hold). Moving the binary point 6 places to the left gives N(2) = 1.11101101110100101111001 * 2^6, a form that can be stored in the format shown above.

Sign bit (S)    Exponent (E)    Mantissa (M)
0               10000101        11101101110100101111001
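The three fields can be checked on a real machine with a sketch like the following. It assumes float is the 32-bit IEEE 754 type (the static_assert checks this), and the names FloatFields, splitFloat and mantissaBits are invented for this example. The stored exponent field is the biased value 6 + 127 = 133, which is 10000101.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

static_assert(std::numeric_limits<float>::is_iec559 &&
              sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

struct FloatFields {            // illustrative helper type
    unsigned sign;              // 1 bit
    unsigned exponent;          // 8 bits, biased by 127
    std::uint32_t mantissa;     // 23 bits, the leading 1 is not stored
};

FloatFields splitFloat(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // raw bit copy, no numeric conversion
    FloatFields r;
    r.sign     = static_cast<unsigned>(bits >> 31);
    r.exponent = static_cast<unsigned>((bits >> 23) & 0xFFu);
    r.mantissa = bits & 0x7FFFFFu;
    return r;
}

// The 23 stored mantissa bits as text.
std::string mantissaBits(float f) {
    return std::bitset<23>(splitFloat(f).mantissa).to_string();
}
```

For 123.456f, splitFloat reports sign 0 and exponent field 133, and mantissaBits returns 11101101110100101111001.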

Note that the exponent field must hold the biased value 6 + 127 = 133, that is 10000101, not the bare 6 (00000110), and that the mantissa has one bit fewer than N(2), because the leading 1 is implied.

Converting a decimal fraction to binary usually does not come out exactly (a value like 4.0 loses nothing, but 1.0 / 3.0 inevitably does), and this is where the precision problem of floating-point numbers comes from. The 23 binary digits after the point determine roughly the first 7 significant decimal digits. Why? People are often puzzled by this, but it is very simple. The mantissa is a binary fraction, and its last bit is worth 1/2^23 = 0.000000119, roughly 0.0000001 relative to the leading 1. So for a float, about 7 significant decimal digits counted from the left are reliable, and anything from the 8th digit on is not.

VC6 can nevertheless print 1.0 / 3.0 as 0.33333333333333331, with 17 digits after the decimal point. That is because the literals 1.0 and 3.0 have type double, so the division is done in double, which carries about 16 significant digits; it does not mean that a float has 16 valid digits. The digits printed beyond the reliable ones are not random noise either: they are the exact decimal expansion of the binary value that is actually stored, which is why a float one-third prints as 0.333333343... and always the same way. You can try this:

float f = 123456789;
cout << f << endl;   // the stored value is 123456792, not 123456789; with the default precision cout prints 1.23457e+08
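The effect can be made visible without any output formatting. The sketch below assumes IEEE 754 float and double; storedAsFloat and gapAbove are names invented here. Widening a float back to double is exact, so the result shows precisely what the float holds.

```cpp
#include <cmath>

// Round a double to the nearest float, then widen it again so the
// value that was actually stored can be inspected exactly.
double storedAsFloat(double x) {
    return static_cast<double>(static_cast<float>(x));
}

// Distance from f to the next larger float: the weight of the last
// mantissa bit at that magnitude.
float gapAbove(float f) {
    return std::nextafter(f, INFINITY) - f;
}
```

storedAsFloat(123456789.0) is 123456792, because near 10^8 neighbouring floats are 8 apart (gapAbove returns 8 there). storedAsFloat(1.0 / 3.0) is exactly 0.3333333432674407958984375, which is where the stable digits after the 7th come from.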

One thing was skipped above: how a decimal fraction is converted to a binary fraction. It is very simple: multiply the fractional part by 2 repeatedly, each time writing down the integer part (0 or 1) as the next binary digit and carrying on with the remaining fraction. Because the digits have to be cut off somewhere, N(2) = 1.11101101110100101111001 * 2^6 converted back to decimal is no longer exactly 123.456; it is 123.45600128173828125. That should make the precision issue clear. Next comes the range of values.
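The multiply-by-2 procedure is short enough to write out. This is a minimal sketch with an invented function name; it truncates after the requested number of digits, whereas storing a float rounds to nearest, so the last digit can differ from the stored mantissa.

```cpp
#include <string>

// Convert a fraction x (0 <= x < 1) to `count` binary digits by
// repeated doubling. Name and signature are illustrative.
std::string fractionToBinary(double x, int count) {
    std::string digits;
    for (int i = 0; i < count; ++i) {
        x *= 2.0;                 // shift the next binary digit in front of the point
        if (x >= 1.0) {
            digits += '1';        // integer part is 1: emit it and drop it
            x -= 1.0;
        } else {
            digits += '0';        // integer part is 0
        }
    }
    return digits;
}
```

fractionToBinary(0.456, 17) gives 01110100101111000. The two digits that follow are 11, so rounding to 17 digits turns the last 0 into 1, which matches the mantissa used in the example. A fraction such as 0.625 terminates: fractionToBinary(0.625, 3) gives 101.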

The exponent is an 8-bit biased code. Its largest usable value is 127 and its smallest is -126 (the all-zero and all-one fields are reserved), and it is used as a power of 2, so the largest scale factor is 2^127, about 1.7014 * 10^38. Yet we know that the range of float is approximately -3.4 * 10^38 to 3.4 * 10^38. This is because the 24-bit significand (counting the implied leading 1) can be all ones, 1.11...11, which is very nearly 2, and 2 * 2^127 is about 3.4 * 10^38. That gives the range of float. Double is entirely analogous to float, but its layout is:

Sign bit (S)    Exponent (E)    Mantissa (M)
1 bit           11 bits         52 bits

The main difference is that its exponent has 11 bits, with a maximum of 1023, and 2^1023 is about 0.8988 * 10^308; the significand has 53 bits counting the implied 1. So the range of double is about -1.8 * 10^308 to 1.8 * 10^308. Precision works the same way: the last mantissa bit is worth 1.0 / 2^52 = 2.2 * 10^(-16), so about 15 decimal digits after the point are valid, and counting the digit in front of the point, roughly 16 significant digits from the left are reliable for a double.
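These figures can be recomputed from the field widths alone. The sketch assumes IEEE 754 binary formats; largestFinite and lastBitWeight are illustrative names, and p is the number of significand bits including the implied 1.

```cpp
#include <cmath>
#include <limits>

static_assert(std::numeric_limits<double>::is_iec559,
              "this sketch assumes IEEE 754 double");

// Largest finite value: significand 1.11...1 (p bits) times 2^emax,
// which is (2 - 2^(1-p)) * 2^emax.
double largestFinite(int p, int emax) {
    return std::ldexp(2.0 - std::ldexp(1.0, 1 - p), emax);
}

// Weight of the last significand bit relative to the leading 1.
double lastBitWeight(int p) {
    return std::ldexp(1.0, 1 - p);
}
```

largestFinite(24, 127) equals std::numeric_limits<float>::max() and largestFinite(53, 1023) equals std::numeric_limits<double>::max(). lastBitWeight(24) is 2^(-23) and lastBitWeight(53) is 2^(-52), the values that std::numeric_limits reports as epsilon for float and double.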

Sometimes we hear the term "fixed-point number". Single-chip microcontrollers (such as those in mobile phones) often use only fixed-point numbers. A common confusion is to think that float a = 23.4; is a fixed-point number and float a = 2.34e1; is a floating-point number. This is wrong: these are just two ways of writing the same floating-point number. In a fixed-point number the position of the point is fixed by convention. If it is fixed after the lowest bit, the fractional part is 0 and the number is an integer; if it is fixed before the highest bit, the number is a pure fraction and can only represent magnitudes less than 1.
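As a contrast with float, here is a minimal fixed-point sketch. It is not from the original article: it assumes one common convention, Q16.16, in which the point sits between bits 16 and 15 of a 32-bit integer, and all names are invented. It is only exercised with non-negative values here.

```cpp
#include <cstdint>

// Q16.16: 16 integer bits, 16 fraction bits, stored in a 32-bit integer.
using Fixed = std::int32_t;
constexpr int kFracBits = 16;

// Scale by 2^16 and truncate toward zero.
constexpr Fixed toFixed(double x) {
    return static_cast<Fixed>(x * (1 << kFracBits));
}

constexpr double toDouble(Fixed f) {
    return static_cast<double>(f) / (1 << kFracBits);
}

// Addition is plain integer addition because both operands share the
// same point position.
constexpr Fixed fixedAdd(Fixed a, Fixed b) {
    return a + b;
}

// Multiplication doubles the number of fraction bits, so a wider
// intermediate is used and the result is shifted back into place.
constexpr Fixed fixedMul(Fixed a, Fixed b) {
    return static_cast<Fixed>((static_cast<std::int64_t>(a) * b) >> kFracBits);
}
```

toFixed(23.5) is the integer 1540096, that is 23.5 * 65536. The hardware only ever sees integers; the position of the point exists purely in the programmer's convention, which is the difference from float, where the exponent field moves the point.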

Now a few functions in C/C++. By default C++ output shows 6 significant digits, but this can be changed in two ways: the setprecision manipulator or cout.precision. Both set the same stream precision, but setprecision only takes effect when it is inserted into the stream with <<:

float mm = 123.456789f;
cout << mm << endl;    // 123.457: the default is 6 significant digits in total, not 5 digits after the point
setprecision(10);      // no effect: the manipulator is never inserted into a stream
cout << mm << endl;    // 123.457, no different from the default
cout.precision(4);     // set the total number of significant digits to 4
cout << mm << endl;    // 123.5 (rounded, not truncated)

The results look odd at first, but they follow from these rules and not from any limitation of the hardware.
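The two mechanisms can be compared directly. In this sketch std::ostringstream stands in for cout so that the results are available as strings; the helper names are invented for illustration.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Default formatting: 6 significant digits.
std::string defaultFormat(float v) {
    std::ostringstream out;
    out << v;
    return out.str();
}

// The manipulator has to be inserted into the stream to take effect.
std::string withManipulator(float v, int n) {
    std::ostringstream out;
    out << std::setprecision(n) << v;
    return out.str();
}

// The member function changes the same setting.
std::string withMemberFunction(float v, int n) {
    std::ostringstream out;
    out.precision(n);
    out << v;
    return out.str();
}

// With std::fixed, the precision counts digits after the point instead.
std::string fixedFormat(float v, int n) {
    std::ostringstream out;
    out << std::fixed << std::setprecision(n) << v;
    return out.str();
}
```

For 123.456789f, defaultFormat gives 123.457, both withManipulator and withMemberFunction with n = 4 give 123.5, and fixedFormat with n = 2 gives 123.46. To control the number of digits after the decimal point, std::fixed is the setting to reach for.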

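On the actual representation of 0: some people think that 0 is an absolute 0 while -0 may stand for some very small number. To settle this I thought of a good way to check, by looking at the bits in memory. Read by the ordinary rule, an all-zero pattern would mean 1.0 * 2^(-127), so IEEE 754 makes it a special case: an exponent field of 0 with a mantissa of 0 is exactly zero, and -0 is the same pattern with the sign bit set. The code is as follows: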

float fdigital = 0.0f;
uint32_t nmem;   // temporary variable that receives the memory image of the float (needs <cstdint>)
// Copy the memory bits into the temporary. nmem is not numerically equal to fdigital; it is a bit-for-bit copy.
memcpy(&nmem, &fdigital, sizeof nmem);   // needs <cstring>; safer than *(unsigned long *)&fdigital, which breaks where long is 64 bits
cout << nmem << endl;   // for most floats this is a large integer; for 0.0f it is 0

bitset<32> mybit(nmem);   // needs <bitset>; holds the 32 bits of the float's memory, which makes everything visible
cout << mybit << endl;    // 00000000000000000000000000000000; with -0.0f the first bit is 1

That long string of 0 bits is the special encoding of exactly 0, not the ordinary reading 1.0 * 2^(-127). The technique itself is handy: replace fdigital with any other floating-point number and the bitset shows its representation in memory.
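A self-contained version of this check, covering both zeros, might look like the following. It assumes a 32-bit IEEE 754 float, and floatBits is a name invented for this sketch.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

static_assert(std::numeric_limits<float>::is_iec559 &&
              sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

// Return the 32 stored bits of a float as text, sign bit first.
std::string floatBits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // bit-for-bit copy
    return std::bitset<32>(bits).to_string();
}
```

floatBits(0.0f) is 32 zeros, while floatBits(-0.0f) is a 1 followed by 31 zeros. The two patterns differ, yet 0.0f == -0.0f is true: both are exactly zero, differing only in sign. For comparison, floatBits(1.0f) is 00111111100000000000000000000000, with the exponent field 01111111 = 127.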

There is a reason the exponent is stored as a biased code: it makes exponents easy to compare, and therefore makes it easy to compare the magnitudes of two floating-point numbers. Note that the exponent field of an ordinary number never reaches 11111111: IEEE 754 reserves the field 0xFF for overflow results (infinity) and NaN, just as the field 0 is reserved for zero and subnormal numbers. In short, the exponent is an integer in the range -126 to 127. Finally, one point that is often missed and must be remembered: floating-point types are always signed, so unsigned float and unsigned double are errors.
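Both points can be illustrated with a few lines. The sketch assumes a 32-bit IEEE 754 float; the helper names are invented, and the integer comparison shown is only valid for non-negative, non-NaN values.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 &&
              sizeof(float) == sizeof(std::uint32_t),
              "this sketch assumes a 32-bit IEEE 754 float");

std::uint32_t rawBits(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Because the biased exponent sits above the mantissa and grows with
// the true exponent, non-negative floats order the same way as their
// bit patterns read as unsigned integers.
bool lessByBits(float a, float b) {
    return rawBits(a) < rawBits(b);
}

// The stored 8-bit exponent field.
unsigned exponentField(float f) {
    return static_cast<unsigned>((rawBits(f) >> 23) & 0xFFu);
}

// An exponent field of 0xFF marks infinity or NaN.
bool isInfOrNaN(float f) {
    return exponentField(f) == 0xFFu;
}
```

lessByBits(0.5f, 2.0f) is true even though the true exponents are -1 and 1, because they are stored as 126 and 128. The largest finite float has exponent field 254, and infinity has 255.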

My notes here are rather scattered; criticism and corrections are welcome.

Please credit the original address when reposting: https://www.9cbs.com/read-131940.html
