Essay #6: Bits & Bytes - nuwen.net

Free of trickery that exploits the representation's properties

Abstract.

This is an introductory essay on bits, bytes, and how computers store information. Endianness issues are discussed, and the portable method to store and retrieve multi-byte integers is explained.

Essay.

As it so happened, humans evolved on Earth with ten fingers. There is nothing especially magical about this number. Five fingers on a hand happened to increase the chances that our ancestors would survive. There's no fundamental reason why we couldn't have, say, four fingers on a hand. If we lacked a thumb, of course, we might not ever have gotten so good at making tools. Similarly, there's no fundamental reason why we couldn't have six fingers on a hand, aside from the fact that five fingers is the maximum that most mammals have (barring the occasional mutation). Really, the only signficiant thing about ten fingers is that it's divisible by two - we have bilateral symmetry. As a result, when humans learned to count, they most often did so in "base ten". As there is nothing special about the number of fingers on our hands, there is nothing special about base ten. It's just what we happen to use. In base ten, we have ten "digits" or "numerals". The word "digit" comes from the Latin word "digitus", which means "finger" - as usual, the English language is full of interest. The ten digits are: 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9. In this way we can count any small group of things. However, we run into a problem when there are more than nine things to count: what digit comes after 9?

We could make up new digits to represent numbers greater than nine, but then we wouldn't have a base ten system anymore. The Romans used a different, non-base-ten system, with the symbols I, V, X, L, C, D, M, and it turns out that doing arithmetic in such a system - especially division - is a nightmare. No, we need to represent numbers greater than nine using only the digits 0-9. We use a "positional" system, whereby we use multiple digits to represent a single number - the position of a digit in a number determines its meaning. The leftmost digit in a number is the "most significant" and the rightmost digit is the "least significant". In the number 1983, the 1 stands for a thousand, the 9 stands for nine hundred, the 8 for eighty, and the 3 just means 3. The 1 is the most significant digit, since it represents the highest number. (There's a lot of difference between the years 1983 and 2983, but not a lot of difference between 1983 and 1984.) The rightmost place in a number is the "one's place" and determines how many ones to add to the number. The second rightmost place in a number is the "ten's place" and denotes how many tens we should add to the number, and so forth. We multiply each digit in a number by the amount it represents, and then we sum them all up. Thus, "1983" means 1 * 1000 + 9 * 100 + 8 * 10 + 3 * 1. We can write this using exponents, too: 1983 = 1 * 103 + 9 * 102 + 8 * 101 + 3 * 100. (When we write ab, we mean "a multiplied by itself b times". Thus 103 = 10 * 10 * 10 = 1000. By convention, a0 is 1 no matter what a is.)

It is important to remember that there is nothing intrinsically special about the fact that we write down numbers with the most significant digit first. We just as easily could write them the other way. We would count: 0, 1, 2, ..., 9, 01, 11, 21, 31, ..., 91, 02, .... However, we don't do it this way, because we read from left to right, and we like to read about the biggest things first. This convention is termed "big-endian", after a scene in Gulliver's Travels. This is because the most significant end, the "big end", comes first. The convention that we don't use, of putting the least significant end first, is "little-endian". Endianness is not an issue with decimal numbers.

Now enter the electronic computer. Computers count too, of course. In fact, that's just about all that they do. Unlike humans, however, computers count in a slightly different fashion. They don't have ten fingers, after all. Rather, in a wire, there can either be current flowing or no current flowing. Things are either on or off. Thus, the most natural base for computers is "base two". Base ten is termed decimal, while base two is termed binary. In binary, there are only two digits, 0 and 1. That's about the only difference, really. If we have some number in binary, say, 10011, we can figure out what it means using the exact same method that we determine the meaning of a decimal number. All that changes is the amount that each digit represents. Instead of powers of 10, we use powers of 2. "10011" means 1 * 24 + 0 * 23 + 0 * 22 + 1 * 21 + 1 * 20 = 1 * 16 + 0 * 8 + 0 * 4 + 1 * 2 + 1 * 1 = 16 + 0 + 0 + 2 + 1 = 19. In binary, the rightmost place is the "one's place", just like in decimal. The second rightmost place is the "two's place". The next is the "four's place", the next is the "eight's place", and so forth. Obviously, numbers in binary tend to use a lot more digits than if they were written in decimal. That's okay, though - we can add lots of wires into a computer for all the necessary digits.

Now, what of endianness? Here, I just wrote 10011 in big-endian form. Big-endian is the most natural for us to read, because it behaves just like the decimal numbers that we're taught in elementary school. Indeed, computers often use big-endian representation. It's standard on the Internet, in Portable Network Graphics files, in the Secure Hash Algorithm-1, and in plenty of other places. Unfortunately, some idiotic computer engineers have built their machines to represent numbers in little-endian. Among them is the "x86" architecture, meaning Intel microprocessors and other microprocessors compatible with them. It's idiotic because there is no reason why little-endian or big-endian should be preferable in a computer. The computer will still work either way. The deciding factor is the humans who have to work with and program the computers - they like to use big-endian, so there's no reason why the computer shouldn't use big-endian too.

Computers occasionally (Note 1) need to display information to the user. However, all they can work with are numbers in binary form. So, in order to display text on the screen, there has to be some way to associate letters and other symbols to numbers. The standard way this is done is called ASCII (pronounced "ask-ee"). Here is a table of the ASCII symbols (which are called characters) and their associated numbers:

Table of ASCII characters

There are 256 different ASCII characters, numbered 0 through 255 inclusive. In binary, that's 00000000 through 11111111 inclusive. Computers have various ways to store information. One is RAM, Random Access Memory, which is a collection of silicon chips which have millions of capacitors in them. A capacitor can either be full of charge or empty. Hence, the computer uses one capacitor to store a binary digit. This phrase, "BInary digiT" was contracted to "bit". A bit can either be on (1) or off (0). To represent larger numbers, computers use multiple bits. It is traditional to group together 8 bits and call them a "byte". A byte can represent any number from 0 to 255 inclusive. Thus, we say that there are 256 types of bytes, just as there are 10 decimal digits. ASCII associates each type of byte with a character. There is nothing intrinsically special about 8-bit bytes - some old computers have 9-bit bytes, and some digital signal processing chips have 32-bit bytes. However, personal computers all have 8-bit bytes.

Now, it so happens that most computers are not "bit addressable". That is to say, you can't ask the computer to go and retrieve a single bit from memory or the hard drive. Rather, computers are "byte addressable" - you have to ask for 8 bits at a time (and then you can look at any particular bit you want). This means that, as long as you work with bytes, you don't have to care about endianness issues. If you're programming in C, say, then a byte is called an unsigned char (Note 2). If you ask the machine to store, say, 137 in a variable of type unsigned char, it will gladly do so. You can then ask for it back, and you'll get the number 137. This is independent of how the machine actually stores a byte in memory. Since you can't actually look at the bits which compose a byte, you can't figure out in what pattern they are arranged, and you don't need to know anyways.

However, life is not entirely carefree. Occasionally (Note 3) it is necessary to store numbers larger - much larger - than 255. In that case, we need more than 8 bits. We might need, say, 32 bits (to store values up to 4,294,967,295) or even 64 bits (to store values up to 18 quintillion). Now, here's where the endianness of computers comes in. Modern computers are 32-bit - that is to say, they usually work with 32-bit numbers. (In a few years, personal computers will be 64-bit.) We can still look at individual bytes - it's way too useful of an ability to lose. Thus, there is the potential for us to figure out whether the machine is big-endian, little-endian, or worse (Note 4) by storing an integer larger than 8 bits and then asking for different parts of it. If we have a 32-bit number and want to store it, we write it down as 32 consecutive bits, and then make 4 groups of 8 bits. Thus, we can use 4 bytes to represent a 32-bit number. We don't care how the machine stores the bytes themselves, but there is the matter of what order the bytes should go in.

There are two ways to do this: the nonportable way, and the portable way. If code is portable, that means that it is written in a robust way which does not assume too much about the computer it is run on. Such code can easily be "ported" from one system to another - say, from Windows to Linux. Nonportable code assumes things about the system that it is compiled with or run on. Sometimes this is done in the interests of speed, but just as often it's due to lazy programmers writing crocky code.

Let's take a concrete example. If you have a 32-bit number in C programming, which is called an unsigned long (Note 5), you can work with it as a unit, without regard at all to how it is stored. However, eventually you may have to write it to disk, and that means you have to break it up into bytes. The lazy way to do this is to write it all as one unit, meaning that the machine will use its endianness convention to determine what order the bytes should be written in. The portable, correct way to do this is to write it yourself, byte by byte. To read it in later, you read individual bytes and synthesize the number yourself. You still have to choose an endianness convention, but this way it is part of the format you are writing the file in. If your program is the Foo Text Editor and it writes .foo files, then you simply define that 32-bit numbers in .foo files are to be written big-endian (because little-endian is evil), and so forth. Then, even if you compile and run the Foo Text Editor on a little-endian machine, it can still read .foo files created anywhere else. Some programmers simply never realize that there is a portable way to do this.

The C programming language tends to be a rather low-level language, "close to the machine". As such, it provides plenty of mechanisms for working with bytes. This is also called "bit twiddling". There are several operations you can perform on numbers, which are just sequences of bits. The simplest is NOT (~ in C), which inverts bits (changing 0 into 1 and 1 into 0). NOT 11001001 is 00110110. Another is AND (& in C). AND is a bitwise function; if you give AND two sequences of bits, it takes the first bit of each input and produces the first bit of the output, and so forth. If both input bits are 1, then the output bit is 1, otherwise it is 0. A similar operator is OR (| in C). If both input bits to OR are 0, then the output is 0, otherwise it is 1. (Finally, there is XOR (^ in C), which outputs 1 if both inputs are different and 0 if the inputs are identical.) Additionally, there is left shift (<< in C) and right shift (>> in C). If you left shift a number by N, you add N zeros on the right side of the number's binary representation. If you right shift it by N, you remove N digits from the right side of the number and add N zeroes to the left. When working with unsigned (positive) numbers, a left shift by 1 is identical to multiplication by 2, and a right shift by 1 is identical to division by 2 (throwing out any remainder).

A quick interlude. We can also write numbers in hexadecimal, which is base sixteen. It has sixteen digits: 0, 1, ..., 9, A, B, C, D, E, F. One hexadecimal digit represents four bits: 0 is 0000, 1 is 0001, 2 is 0010, ..., 9 is 1001, A is 1010, B is 1011, C is 1100, D is 1101, E is 1110, F is 1111. Hexadecimal numbers are written with 0x in front of them, so that it is clear that they are hexadecimal. A byte can be represented by two hexadecimal digits (called hexits). Byte 0 is 0x00. Byte 255 is 0xFF, and so forth.

Suppose we have a 32-bit number n, and we want to write it to disk in big-endian format and read it again, portably. Here is how we do it. (Comments are enclosed in /* comment */)

unsigned long n = 1234567890;               /* Under 4 billion, so this can be represented in 32 bits. */
unsigned long t;                            /* 32-bit number used for temporary storage. */
unsigned char first, second, third, fourth; /* These are the bytes we will break up n into. */
t = n & 0xFF000000;                         /* We have used AND to "mask out" the first byte of the number. */
                                            /* The only bits which can be on in t are the first 8 bits. */
first = t >> 24;                            /* Now we shift t so that it is between 0 and 255. This is the first, highest byte of n. */
t = n & 0x00FF0000;
second = t >> 16;
t = n & 0x0000FF00;
third = t >> 8;
t = n & 0x000000FF;
fourth = t;

/* Now we can write first, second, third, then fourth to disk */
Now, suppose we want to do the reverse. Again, we use shifting to do things portably.

unsigned long n = 0;                        /* This will hold the number we want to read in. It is important that it starts off being zero. */
unsigned long t;                            /* 32-bit number used for temporary storage. */
unsigned char first, second, third, fourth;

/* Read four bytes from the disk. Store the first byte in first, and so forth. */

t = first;
t = t << 24;
n = n | t;
t = second;
t = t << 16;
n = n | t;
t = third;
t = t << 8;
n = n | t;
t = fourth;
n = n | t;

/* Now n contains the original 32-bit number. */
Obviously, a real C program would look very different. It actually doesn't require all that much computation to do this portably, and the computer would have to do it anyways. And so ends this introduction to bits and bytes.

Notes.

  1. All the time, actually. (Back)
  2. The C Standard does not guarantee that bytes are 8 bits. It guarantees that bytes are at least 8 bits. They can be more. You could be programming C on a machine with 9-bit or 32-bit bytes. However, it is not that horrible to assume that bytes are 8 bits, especially in this day and age. (Back)
  3. All the time, actually. (Back)
  4. There's something even worse than little-endian. It's called middle-endian. Fortunately, it is very rare. (Back)
  5. Again, the C standard does not guarantee that unsigned long is 32 bits. It guarantees that at least that many, but there may be more. Again, it is not too horrible to assume that unsigned long is 32 bits. Now, since computers are currently 32-bit, the natural size of numbers they work with (termed a word or an unsigned int) is 32 bits. It IS bad to assume that unsigned int is 32 bits, because this will be changing within the next decade to 64 bits. Even worse, there are still programmers alive who remember the DOS era when machines were 16 bits, and hence say "word" and mean "16 bits". They call 32-bit values "dwords", for "double words", when modern programmers call 32-bit values "words" and 64-bit values "dwords". (Back)


http://nuwen.net/essay6.html
Stephan T. Lavavej
stl@nuwen.net
Updated a long time ago.