Let’s go into bits and bytes
For 50 years, from the time of Kernighan, Ritchie, and their 1st version of the C Language e book, it was recognized {that a} single-precision “float” sort has a 32-bit dimension and a double-precision sort has 64 bits. There was additionally an 80-bit “lengthy double” sort with prolonged precision, and all these sorts lined nearly all of the wants for floating-point knowledge processing. Nonetheless, throughout the previous couple of years, the arrival of huge neural community fashions required builders to maneuver into one other a part of the spectrum and to shrink floating level sorts as a lot as attainable.
Actually, I used to be shocked after I found that the 4-bit floating-point format exists. How on Earth can it’s attainable? One of the simplest ways to know is to check it on our personal. On this article, we are going to uncover the preferred floating level codecs, make a easy neural community, and see the way it works.
Let’s get began.
A “Commonplace” 32-bit Floating level
Earlier than going into “excessive” codecs, let’s recall a typical one. An IEEE 754 commonplace for floating-point arithmetic was established in 1985 by the Institute of Electrical and Electronics Engineers (IEEE). A typical quantity in a 32-float sort seems like this:
Right here, the primary bit is an indication, the following 8 bits symbolize an exponent, and the final bits symbolize the mantissa. The ultimate worth is calculated utilizing the method:
This straightforward helper operate permits us to print a floating level worth in binary type:
import structdef print_float32(val: float):
""" Print Float32 in a binary type """
m = struct.unpack('I', struct.pack('f', val))[0]
return format(m, 'b').zfill(32)
print_float32(0.15625)
# > 00111110001000000000000000000000
Let’s additionally make one other helper for backward conversion, which might be helpful later:
def ieee_754_conversion(signal, exponent_raw, mantissa, exp_len=8, mant_len=23):
""" Convert binary knowledge into the floating level worth """
sign_mult = -1 if signal == 1 else 1
exponent = exponent_raw - (2 ** (exp_len - 1) - 1)
mant_mult = 1
for b in vary(mant_len - 1, -1, -1):
if mantissa & (2 **…