@@ -14,14 +14,14 @@ Arduino library to implement float16 data type.
1414
1515## Description
1616
17- This **experimental** library defines the float16 (2 byte) data type, including conversion
17+ This **experimental** library defines the float16 (2 byte) data type, including conversion
1818functions to and from the float32 type. It is definitely **work in progress**.
1919
20- The library implements the **Printable** interface so one can directly print the
20+ The library implements the **Printable** interface so one can directly print the
2121float16 values to any stream, e.g. Serial.
2222
23- The primary use of the float16 data type is to efficiently store and transport
24- a floating point number. As it uses only 2 bytes where float and double typically use
23+ The primary use of the float16 data type is to efficiently store and transport
24+ a floating point number. As it uses only 2 bytes where float and double typically use
25254 and 8 bytes, storage and bandwidth are saved at the price of range and precision.
2626
2727
@@ -31,13 +31,39 @@ a floating point number. As it uses only 2 bytes where float and double have typ
3131| attribute | value | notes |
3232| :----------| :-------------| :--------|
3333| size | 2 bytes | layout s eeeee mmmmmmmmmm |
34- | sign | 1 bit | |
35- | exponent | 5 bit | |
36- | mantissa | 10 bit | 11 bits of precision with the implicit leading bit, ~ 3 digits |
37- | minimum | 5.96046 E−8 | smallest positive (subnormal) number, 2^−24. |
38- | | 1.0009765625 | 1 + 2^−10 = smallest number larger than 1. |
39- | maximum | 65504 | |
40- | | | |
34+ | sign | 1 bit | |
35+ | exponent | 5 bit | |
36+ | mantissa | 10 bit | 11 bits of precision with the implicit leading bit, ~ 3 digits |
37+ | minimum | 5.96046 E−8 | smallest positive (subnormal) number, 2^−24. |
38+ | | 1.0009765625 | 1 + 2^−10 = smallest number larger than 1. |
39+ | maximum | 65504 | |
40+ | | | |
41+
42+
43+ #### example values
44+
45+ ``` cpp
46+ /*
47+ SIGN EXP MANTISSA
48+ 0 01111 0000000000 = 1
49+ 0 01111 0000000001 = 1 + 2^−10 = 1.0009765625 (smallest float larger than 1)
50+ 1 10000 0000000000 = −2
51+
52+ 0 11110 1111111111 = 65504 (max half precision)
53+
54+ 0 00001 0000000000 = 2^−14 ≈ 6.10352 × 10^−5 (minimum positive normal)
55+ 0 00000 1111111111 = 2^−14 − 2^−24 ≈ 6.09756 × 10^−5 (maximum subnormal)
56+ 0 00000 0000000001 = 2^−24 ≈ 5.96046 × 10^−8 (minimum positive subnormal)
57+
58+ 0 00000 0000000000 = 0
59+ 1 00000 0000000000 = −0
60+
61+ 0 11111 0000000000 = infinity
62+ 1 11111 0000000000 = −infinity
63+
64+ 0 01101 0101010101 = 0.333251953125 ≈ 1/3
65+ */
66+ ```
4167
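
The example values above can be checked by decoding the three bit fields by hand. The following is a minimal sketch, assuming the standard IEEE 754 binary16 layout shown in the table; `halfBitsToFloat()` is a hypothetical helper written for illustration and is not part of this library's interface.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a library function): decode a raw 16 bit
// pattern, laid out as s eeeee mmmmmmmmmm, into a float32 value.
float halfBitsToFloat(uint16_t bits)
{
  const int sign     = (bits >> 15) & 0x01;
  const int exponent = (bits >> 10) & 0x1F;
  const int mantissa = bits & 0x03FF;

  float value;
  if (exponent == 0)
  {
    // zero and subnormals: mantissa * 2^-24, no implicit leading bit
    value = std::ldexp(static_cast<float>(mantissa), -24);
  }
  else if (exponent == 31)
  {
    // all exponent bits set: infinity (mantissa == 0) or NaN
    value = (mantissa == 0) ? std::numeric_limits<float>::infinity()
                            : std::numeric_limits<float>::quiet_NaN();
  }
  else
  {
    // normal numbers: (1024 + mantissa) * 2^(exponent - 15 - 10)
    value = std::ldexp(static_cast<float>(1024 + mantissa), exponent - 25);
  }
  return sign ? -value : value;
}
```

For instance, the pattern `0 01101 0101010101` is `0x3555`, which this sketch decodes as 1365 / 4096 = 0.333251953125.
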
4268
4369## Interface
@@ -66,7 +92,7 @@ See array example for efficient storage using set/getBinary() functions.
6692
6793#### Compare
6894
69- Standard compare functions. Since 0.1.5 these are quite optimized,
95+ Standard compare functions. Since 0.1.5 these are quite optimized,
7096so it is fast to compare e.g. 2 measurements.
7197
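
The text does not say how the compare functions are optimized. One common technique, shown here purely as an assumption and not as this library's implementation, is to compare the raw 16 bit patterns without converting to float. The names `halfEqual()`, `halfLess()`, `isNaNBits()` and `orderKey()` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helpers (not library functions) operating on raw binary16 bits.

// NaN: exponent all ones and a non-zero mantissa.
bool isNaNBits(uint16_t b)
{
  return (b & 0x7FFF) > 0x7C00;
}

// Map the sign-magnitude pattern to a signed integer that sorts in the
// same order as the encoded values. +0 and -0 both map to 0.
int32_t orderKey(uint16_t b)
{
  const int32_t magnitude = b & 0x7FFF;
  return (b & 0x8000) ? -magnitude : magnitude;
}

bool halfEqual(uint16_t a, uint16_t b)
{
  if (isNaNBits(a) || isNaNBits(b)) return false;  // NaN equals nothing
  return orderKey(a) == orderKey(b);
}

bool halfLess(uint16_t a, uint16_t b)
{
  if (isNaNBits(a) || isNaNBits(b)) return false;  // NaN is unordered
  return orderKey(a) < orderKey(b);
}
```

This works because, for non-NaN values of the same sign, a larger magnitude always has a larger 15 bit pattern, so integer comparison gives the same ordering as floating point comparison.
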
7298- **bool operator == (const float16& f)**
@@ -80,7 +106,7 @@ so it is fast to compare e.g. 2 measurements.
80106#### Math (basic)
81107
82108Math is done by converting to double, doing the math, and converting back.
83- These operators are added for convenience only.
109+ These operators are added for convenience only.
84110There are no plans to optimize them.
85111
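
The convert, compute, convert back pattern described above can be sketched on raw bit patterns as follows. This is an illustration under stated assumptions, not the library's code: `halfBitsToDouble()`, `doubleToHalfBits()` and `halfAdd()` are hypothetical names, and the rounding (round to nearest, ties to even under the default rounding mode) may differ from what the library does.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper: decode raw binary16 bits to double.
double halfBitsToDouble(uint16_t bits)
{
  const int exponent = (bits >> 10) & 0x1F;
  const int mantissa = bits & 0x03FF;
  double value;
  if (exponent == 0)       value = std::ldexp(static_cast<double>(mantissa), -24);
  else if (exponent == 31) value = (mantissa == 0)
                                     ? std::numeric_limits<double>::infinity()
                                     : std::numeric_limits<double>::quiet_NaN();
  else                     value = std::ldexp(1024.0 + mantissa, exponent - 25);
  return (bits & 0x8000) ? -value : value;
}

// Hypothetical helper: encode a double as raw binary16 bits.
uint16_t doubleToHalfBits(double v)
{
  if (std::isnan(v)) return 0x7E00;                       // a quiet NaN pattern
  const uint16_t sign = std::signbit(v) ? 0x8000 : 0x0000;
  const double a = std::fabs(v);
  if (a >= 65520.0)                                       // rounds to infinity
    return static_cast<uint16_t>(sign | 0x7C00);
  if (a < std::ldexp(1.0, -14))                           // zero / subnormal range
    return static_cast<uint16_t>(sign | static_cast<int>(std::nearbyint(std::ldexp(a, 24))));
  int e;
  const double m = std::frexp(a, &e);                     // a = m * 2^e, 0.5 <= m < 1
  const int q = static_cast<int>(std::nearbyint(std::ldexp(m, 11)));  // 1024..2048
  // q == 2048 carries into the exponent field through the addition
  return static_cast<uint16_t>(sign | (((e + 14) << 10) + (q - 1024)));
}

// The operator pattern: widen both operands, do the math, narrow the result.
uint16_t halfAdd(uint16_t a, uint16_t b)
{
  return doubleToHalfBits(halfBitsToDouble(a) + halfBitsToDouble(b));
}
```

The narrowing step is where precision is lost: with 11 bits of precision, adding 1 (`0x3C00`) to 2048 (`0x6800`) in this sketch gives 2048 again, because 2049 is not representable.
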
86112- **float16 operator + (const float16& f)**
@@ -106,7 +132,7 @@ negation operator.
106132## Future
107133
108134
109- #### 0.1.6
135+ #### 0.1.x
110136
111137- update documentation.
112138- unit tests of the above.