K&R has an exercise 2-1, stated like this:
Write a program to determine the ranges of char, short, int, and long variables, both signed and unsigned, by printing appropriate values from standard headers and by direct computation. Harder if you compute them: determine the ranges of the various floating-point types.
This exercise appears only in the 2nd edition of K&R, which uses C89 (ANSI C).
This exercise forced me to learn a fair bit about the representation of floating point numbers. Really, I should know this stuff already, but the last time I learned it I was probably under 18 years old -- more than 20 years ago; and back then, I didn't have enough knowledge to understand it. At the time, it was taught to me in a very first-principles way. I don't say this is the wrong way to teach, but at that time it didn't work for me.
I actually got this information from a fun visualization on YouTube, so there are a few remaining perks to living in 2025.
There are three main components to floating point representation, all sub-fields of a single float. These are standardized under IEEE 754:
- A sign bit (size 1).
- An exponent (size 8).
- A mantissa (size 23).
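Laid out from most- to least-significant bit, a single-precision IEEE 754 float looks like this:

S EEEEEEEE MMMMMMMMMMMMMMMMMMMMMMM
(1 sign bit, 8 exponent bits, 23 mantissa bits)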
To 'decode' a number, you essentially multiply out all the components according to a specific formula:

value = (-1)^sign * mantissa * 2^exponent

(where the exponent has already had its bias subtracted -- more on that shortly).
The strategy is basically analogous to scientific notation. Each part needs to be decoded separately. The mantissa essentially encodes a sum of successively smaller fractions, implicitly starting with 1 and getting closer and closer to 2 at its maximum value. E.g. if there were a mantissa with a size of 1, it could encode either 1 or 1.5. If the mantissa had size 2, it could encode 1, 1.25, 1.5, or 1.75.
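To make that concrete, here's a tiny sketch (mine, not from K&R) that enumerates the values a hypothetical 2-bit mantissa can encode:

#include <stdio.h>

int main(void) {
    int bits;
    for (bits = 0; bits < 4; bits++) {
        /* implicit leading 1, then each mantissa bit adds a halved fraction */
        double value = 1.0 + (bits >> 1) * 0.5 + (bits & 1) * 0.25;
        printf("mantissa bits %d%d -> %g\n", bits >> 1, bits & 1, value);
    }
    return 0;
}

This prints 1, 1.25, 1.5, and 1.75, matching the enumeration above.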
The exponent is encoded as a binary integer, but with a few quirks. A 'bias' is used instead of a sign bit. For instance, imagine an exponent of size 8. This exponent would have 256 possible values (0-255). However a bias of 127 is used, so subtracting the bias (255 - 127), we would get 128 as the highest value. However we know (from reading the spec... right?) that the top exponent value (255) is reserved, for infinities and NaNs. So the actual maximum is 254 - 127 = 127. By the same token, the lowest value (0) is also reserved, for zero and the subnormal numbers, so the actual minimum is 1 - 127 = -126.
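You can see all three fields directly by copying a float's bytes into an integer and masking them out. Here's a sketch of mine, assuming a 32-bit IEEE 754 float on a little-endian machine (none of which C89 actually guarantees):

#include <stdio.h>
#include <string.h>

int main(void) {
    float f = 3.14f;
    unsigned char bytes[sizeof(float)];
    unsigned long bits;
    memcpy(bytes, &f, sizeof f);
    /* assemble the 32 bits, assuming little-endian byte order */
    bits = ((unsigned long)bytes[3] << 24) | ((unsigned long)bytes[2] << 16)
         | ((unsigned long)bytes[1] << 8) | bytes[0];
    printf("sign     = %lu\n", (bits >> 31) & 0x1);
    printf("exponent = %lu (biased), %ld (after removing the bias)\n",
           (bits >> 23) & 0xFF, (long)((bits >> 23) & 0xFF) - 127);
    printf("mantissa = 0x%06lX\n", bits & 0x7FFFFF);
    return 0;
}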
Apparently a side effect of this strategy is that absolute precision gets worse as you move towards the top of the range, while it is better at smaller magnitudes. This does seem like a reasonable trade-off to me. However I do intuitively feel that fixed-point would be a better approach for most problems that I actually encounter in computing.
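A quick demonstration of that effect (my own snippet, not part of the exercise): at 2^24, the gap between adjacent floats is already 2, so adding 1 does nothing.

#include <stdio.h>

int main(void) {
    float big = 16777216.0f;  /* 2^24 */
    float sum = big + 1.0f;   /* assignment forces rounding back to float */
    /* adjacent floats at this magnitude are 2 apart, so the +1 is lost */
    printf("%.1f + 1.0 = %.1f\n", big, sum);
    return 0;
}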
Note that the exercise has a big problem, which is that the details of floating point are actually not defined in C89. This is where language-lawyering the wording of the question gets a bit murky -- from the most abstract perspective, a float is just a method of storing a real number. You can't just impute a mantissa/exponent structure to it. However, perhaps the intended reading is for the word float itself to correspond directly to said representation, which is assumed background knowledge for the reader? We can't read K&R's minds, unfortunately.
There are several things to note. Using these facts about the representation requires the ability to compute powers. Luckily, K&R have already introduced a power routine, so we just need to rewrite it to work on floats. However, the journeyman programmer might wonder: won't this suffer from the notorious floating-point precision issues? Luckily, because multiplying by 2 operates directly on the exponent and doesn't touch the mantissa at all, it's always exact. This means that you can compute everything using the type in question (float or double).
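Here's a sketch of what that rewrite might look like, patterned on K&R's power routine from Chapter 1 (fpower is my name, not theirs):

#include <stdio.h>

float fpower(float base, int n) {
    int i;
    float p = 1.0f;
    for (i = 1; i <= n; ++i)
        p = p * base;  /* with base 2, each step just bumps the exponent */
    return p;
}

int main(void) {
    /* 2^127 is computed exactly: no mantissa bits are ever consumed */
    printf("2^127 = %g\n", fpower(2.0f, 127));
    return 0;
}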
The other thing to note is that while FLT_MAX is a valid comparison here, FLT_MIN is actually not that relevant. At least colloquially, when we talk about the range of a value, we mean its full range from the maximally-negative to the maximally-positive. So, strictly, the endpoints would be -inf and +inf. However FLT_MIN is not the maximally-negative value; that would be -FLT_MAX. FLT_MIN is instead the smallest positive normalized float.
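A quick way to see the distinction:

#include <float.h>
#include <stdio.h>

int main(void) {
    /* FLT_MIN is the smallest positive normalized float,
       not the most negative representable value */
    printf("FLT_MIN  = %g\n", FLT_MIN);
    printf("-FLT_MAX = %g\n", -FLT_MAX);
    return 0;
}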
The solution to the exercise that I eventually settled on uses a direct encoding of the representation rules for IEEE 754 floats: ex2-1.c.
However, my intuition about the intended solution is that it shouldn't rely on any knowledge of the internal structure. The closest that I could get to this was patterned on a suggestion from a Reddit user:
#include <float.h>
#include <stdio.h>

/* Stage 1: find the largest float reachable by repeated doubling.
   Doubling only increments the exponent, so each step is exact. */
float get_stage1_max(void) {
    float f = 1;
    float prev;
    while ((f * 2) != f) {
        prev = f;
        f *= 2;
    }
    return prev;
}

/* Stage 2: starting from the stage-1 value, add successively halved
   increments until addition no longer changes the value. */
float calculate_by_approach(float start_point) {
    float increment = start_point;
    float f = start_point;
    float prev = f;
    while (1) {
        increment /= 2;
        if ((f + increment) == f)
            break;
        prev = f;
        f += increment;
    }
    return prev;
}

int main(void) {
    float result = calculate_by_approach(get_stage1_max());
    printf("Stage 2 max: %g\n", result);
    if (result == FLT_MAX) {
        printf("This is the largest float.\n");
    } else {
        printf("This is not the largest float.\n");
    }
    return 0;
}
This calculates the value in two stages: first it finds the largest float value that's reachable by doubling. Then it repeatedly adds ever-smaller values, halving the increment each time until infinity is reached, and returns the last value before that happens. Realistically this does rely on a certain number of assumptions about the behaviour of float arithmetic, and therefore (by implication) about the internal structure of a float, but I believe that this is the closest one can get to a fully 'abstract' solution; happy to be corrected, though.
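For what it's worth, on a machine with IEEE 754 floats the program above should print:

Stage 2 max: 3.40282e+38
This is the largest float.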
This exercise really illustrates a characteristic of K&R: the amount of complexity they can entail from such a tiny statement, a mere appendage to an exercise.