1A brief CRC tutorial. 2 3A CRC is a long-division remainder. You add the CRC to the message, 4and the whole thing (message+CRC) is a multiple of the given 5CRC polynomial. To check the CRC, you can either check that the 6CRC matches the recomputed value, *or* you can check that the 7remainder computed on the message+CRC is 0. This latter approach 8is used by a lot of hardware implementations, and is why so many 9protocols put the end-of-frame flag after the CRC. 10 11It's actually the same long division you learned in school, except that 12- We're working in binary, so the digits are only 0 and 1, and 13- When dividing polynomials, there are no carries. Rather than add and 14 subtract, we just xor. Thus, we tend to get a bit sloppy about 15 the difference between adding and subtracting. 16 17Like all division, the remainder is always smaller than the divisor. 18To produce a 32-bit CRC, the divisor is actually a 33-bit CRC polynomial. 19Since it's 33 bits long, bit 32 is always going to be set, so usually the 20CRC is written in hex with the most significant bit omitted. (If you're 21familiar with the IEEE 754 floating-point format, it's the same idea.) 22 23Note that a CRC is computed over a string of *bits*, so you have 24to decide on the endianness of the bits within each byte. To get 25the best error-detecting properties, this should correspond to the 26order they're actually sent. For example, standard RS-232 serial is 27little-endian; the most significant bit (sometimes used for parity) 28is sent last. And when appending a CRC word to a message, you should 29do it in the right order, matching the endianness. 30 31Just like with ordinary division, you proceed one digit (bit) at a time. 32Each step of the division you take one more digit (bit) of the dividend 33and append it to the current remainder. Then you figure out the 34appropriate multiple of the divisor to subtract to being the remainder 35back into range. In binary, this is easy - it has to be either 0 or 1, 36and to make the XOR cancel, it's just a copy of bit 32 of the remainder. 37 38When computing a CRC, we don't care about the quotient, so we can 39throw the quotient bit away, but subtract the appropriate multiple of 40the polynomial from the remainder and we're back to where we started, 41ready to process the next bit. 42 43A big-endian CRC written this way would be coded like: 44for (i = 0; i < input_bits; i++) { 45 multiple = remainder & 0x80000000 ? CRCPOLY : 0; 46 remainder = (remainder << 1 | next_input_bit()) ^ multiple; 47} 48 49Notice how, to get at bit 32 of the shifted remainder, we look 50at bit 31 of the remainder *before* shifting it. 51 52But also notice how the next_input_bit() bits we're shifting into 53the remainder don't actually affect any decision-making until 5432 bits later. Thus, the first 32 cycles of this are pretty boring. 55Also, to add the CRC to a message, we need a 32-bit-long hole for it at 56the end, so we have to add 32 extra cycles shifting in zeros at the 57end of every message, 58 59These details lead to a standard trick: rearrange merging in the 60next_input_bit() until the moment it's needed. Then the first 32 cycles 61can be precomputed, and merging in the final 32 zero bits to make room 62for the CRC can be skipped entirely. This changes the code to: 63 64for (i = 0; i < input_bits; i++) { 65 remainder ^= next_input_bit() << 31; 66 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 67 remainder = (remainder << 1) ^ multiple; 68} 69 70With this optimization, the little-endian code is particularly simple: 71for (i = 0; i < input_bits; i++) { 72 remainder ^= next_input_bit(); 73 multiple = (remainder & 1) ? CRCPOLY : 0; 74 remainder = (remainder >> 1) ^ multiple; 75} 76 77The most significant coefficient of the remainder polynomial is stored 78in the least significant bit of the binary "remainder" variable. 79The other details of endianness have been hidden in CRCPOLY (which must 80be bit-reversed) and next_input_bit(). 81 82As long as next_input_bit is returning the bits in a sensible order, we don't 83*have* to wait until the last possible moment to merge in additional bits. 84We can do it 8 bits at a time rather than 1 bit at a time: 85for (i = 0; i < input_bytes; i++) { 86 remainder ^= next_input_byte() << 24; 87 for (j = 0; j < 8; j++) { 88 multiple = (remainder & 0x80000000) ? CRCPOLY : 0; 89 remainder = (remainder << 1) ^ multiple; 90 } 91} 92 93Or in little-endian: 94for (i = 0; i < input_bytes; i++) { 95 remainder ^= next_input_byte(); 96 for (j = 0; j < 8; j++) { 97 multiple = (remainder & 1) ? CRCPOLY : 0; 98 remainder = (remainder >> 1) ^ multiple; 99 } 100} 101 102If the input is a multiple of 32 bits, you can even XOR in a 32-bit 103word at a time and increase the inner loop count to 32. 104 105You can also mix and match the two loop styles, for example doing the 106bulk of a message byte-at-a-time and adding bit-at-a-time processing 107for any fractional bytes at the end. 108 109To reduce the number of conditional branches, software commonly uses 110the byte-at-a-time table method, popularized by Dilip V. Sarwate, 111"Computation of Cyclic Redundancy Checks via Table Look-Up", Comm. ACM 112v.31 no.8 (August 1998) p. 1008-1013. 113 114Here, rather than just shifting one bit of the remainder to decide 115in the correct multiple to subtract, we can shift a byte at a time. 116This produces a 40-bit (rather than a 33-bit) intermediate remainder, 117and the correct multiple of the polynomial to subtract is found using 118a 256-entry lookup table indexed by the high 8 bits. 119 120(The table entries are simply the CRC-32 of the given one-byte messages.) 121 122When space is more constrained, smaller tables can be used, e.g. two 1234-bit shifts followed by a lookup in a 16-entry table. 124 125It is not practical to process much more than 8 bits at a time using this 126technique, because tables larger than 256 entries use too much memory and, 127more importantly, too much of the L1 cache. 128 129To get higher software performance, a "slicing" technique can be used. 130See "High Octane CRC Generation with the Intel Slicing-by-8 Algorithm", 131ftp://download.intel.com/technology/comms/perfnet/download/slicing-by-8.pdf 132 133This does not change the number of table lookups, but does increase 134the parallelism. With the classic Sarwate algorithm, each table lookup 135must be completed before the index of the next can be computed. 136 137A "slicing by 2" technique would shift the remainder 16 bits at a time, 138producing a 48-bit intermediate remainder. Rather than doing a single 139lookup in a 65536-entry table, the two high bytes are looked up in 140two different 256-entry tables. Each contains the remainder required 141to cancel out the corresponding byte. The tables are different because the 142polynomials to cancel are different. One has non-zero coefficients from 143x^32 to x^39, while the other goes from x^40 to x^47. 144 145Since modern processors can handle many parallel memory operations, this 146takes barely longer than a single table look-up and thus performs almost 147twice as fast as the basic Sarwate algorithm. 148 149This can be extended to "slicing by 4" using 4 256-entry tables. 150Each step, 32 bits of data is fetched, XORed with the CRC, and the result 151broken into bytes and looked up in the tables. Because the 32-bit shift 152leaves the low-order bits of the intermediate remainder zero, the 153final CRC is simply the XOR of the 4 table look-ups. 154 155But this still enforces sequential execution: a second group of table 156look-ups cannot begin until the previous groups 4 table look-ups have all 157been completed. Thus, the processor's load/store unit is sometimes idle. 158 159To make maximum use of the processor, "slicing by 8" performs 8 look-ups 160in parallel. Each step, the 32-bit CRC is shifted 64 bits and XORed 161with 64 bits of input data. What is important to note is that 4 of 162those 8 bytes are simply copies of the input data; they do not depend 163on the previous CRC at all. Thus, those 4 table look-ups may commence 164immediately, without waiting for the previous loop iteration. 165 166By always having 4 loads in flight, a modern superscalar processor can 167be kept busy and make full use of its L1 cache. 168 169Two more details about CRC implementation in the real world: 170 171Normally, appending zero bits to a message which is already a multiple 172of a polynomial produces a larger multiple of that polynomial. Thus, 173a basic CRC will not detect appended zero bits (or bytes). To enable 174a CRC to detect this condition, it's common to invert the CRC before 175appending it. This makes the remainder of the message+crc come out not 176as zero, but some fixed non-zero value. (The CRC of the inversion 177pattern, 0xffffffff.) 178 179The same problem applies to zero bits prepended to the message, and a 180similar solution is used. Instead of starting the CRC computation with 181a remainder of 0, an initial remainder of all ones is used. As long as 182you start the same way on decoding, it doesn't make a difference. 183