how does ECC for burst error correction work? [clo

2019-03-28 05:59发布

问题:

How does ECC (error correction codes) for burst error correction work (disk drive style)?

It is either a curse or blessing, but often my brain tries to solve technical problems in my dreams. Sometimes it does. Like last night, my brain demanded to understand how to design an ECC algorithm (software program but possibly FPGA circuitry eventually) to implement the kind of ECC appropriate for disk drives. The kind of ECC appropriate for these devices appears to be "burst error detection".

As I understand this, the reason disk drives have errors is due to imperfections on the disk surface (specs or scratches). When the head is reading data bits and passes over a narrow scratch, the circuitry generates a random mix of correct and erroneous bit values over a "burst" of perhaps 1 to 64-ish bits. Therefore, as I understand it, the goal of disk drive ECC is to be able to correct all erroneous bits in any one random burst of errors.

BTW, I don't naturally "think in math", so please don't point me to math papers! I already spent a couple hours trying to read through wikipedia about reed-solomon and various other schemes, but the math in those articles is utterly incomprehensible to me (unless I spend a few weeks studying them... if I'm lucky). Besides, from the text, I don't think any of those schemes apply to disk drives (but maybe CDs / DVDs).

Anyway, I'll describe what my brain dreamed up in my sleep, and ask anyone to explain how this kind of ECC actually should be done, and how much better are conventional approaches. I'm sure my scheme must be less efficient that a technique done by someone who knows what they are doing, and maybe even designed while they were awake! Before I woke up, I was trying to figure out how to handle two bursts per track, but woke up defeated. So I also ask how to achieve that.


My mental image was a 4096 byte sector, which I mentally broke up into 512 chunks of 64-bits each (since I'm used to thinking in 64-bit chunks, and because I'm guessing 64-bit bursts of errors is sufficient for practical purposes. In my application each data-stream will definitely be 4096 to 8192 bytes.

My approach is to compute ten 64-bit ECC codes from the 4096 bytes of data. So the ECC my scheme would write after the last of the 4096 bytes of data would be ten 64-bit code == 80 bytes, which is just short of 2% overhead. I'll call those ten 64-bit ECC codes "code0" to "code9", each of which starts out cleared to zero before each sector is processed. And every 64-bit (8-byte) sequence of data I will call a "chunk" for lack of any better term.

code9 = XOR chunks 000 to 511 == 000 to 1FF : every chunk
code8 = XOR chunks 256 to 511 == 100 to 1FF : every chunk # with bit #8 == 1
code7 = XOR chunks 128 to 255 == 080 to 0FF : every chunk # with bit #7 == 1
        and chunks 384 to 511 == 180 to 1FF
code6 = XOR chunks 064 to 127 == 040 to 07F : every chunk # with bit #6 == 1
        and chunks 192 to 255 == 0C0 to 0FF
        and chunks 320 to 384 == 140 to 17F
        and chunks 448 to 511 == 1C0 to 1FF
code5 = XOR chunks 032 to 063 == 020 to 03F : every chunk # with bit #5 == 1
        and chunks 096 to 127 == 060 to 07F
        and chunks 160 to 191 == 0A0 to 0BF
        and chunks 224 to 255 == 0E0 to 0FF
        and chunks 288 to 319 == 120 to 13F
        and chunks 352 to 383 == 160 to 17F
        and chunks 416 to 447 == 1A0 to 1BF
        and chunks 480 to 511 == 1E0 to 1FF
code4 = XOR chunks 016 to 031 == 010 to 01F : every chunk # with bit #4 == 1
        and chunks 048 to 063 == 030 to 04F
        and chunks 080 to 095 == 050 to 07F
        and so forth
code3 = XOR chunks 008 to 015 == 008 to 00F : every chunk # with bit #3 == 1
        and chunks 024 to 031 == 018 to 01F
        and chunks 040 to 047 == 028 to 02F
        and so forth
code2 = XOR chunks 004 to 007 == 004 to 007 : every chunk # with bit #2 == 1
        and chunks 012 to 015 == 00C to 00F
        and chunks 020 to 023 == 014 to 017
        and so forth
code1 = XOR chunks 002 to 003 == 002 to 003 : every chunk # with bit #1 == 1
        and chunks 006 to 007 == 006 to 007
        and chunks 010 to 011 == 00A to 00B
        and so forth
code0 = XOR chunks 001 to 001 == 001 to 001 : every chunk # with bit #0 == 1
        and chunks 003 to 003 == 003 to 003
        and chunks 005 to 005 == 005 to 005
        and so forth

Okay, I should explain the purpose of this approach. The ECC produced by the algorithm must somehow encode the following information:

#1: What is the correct state of every bit (all 4KB == 32Kb)?

#2: Where in the 4KB == 32Kb stream did the error burst occur?

Now I'll try to explain why my dream (?nightmare?) believed these ten 64-bit ECC codes can detect any one burst of error bits up to 64-bits long anywhere in the 4KB == 32Kb stream.

Let's start slow and consider a simple example. Let's assume that when the disk drive read back a sector, bit #0 in one of the 512 "chunks" was wrong == inverted.

Does the ECC code9 tell us anything? Well, code9 is an XOR of every 64-bit chunk in the sector. Therefore, bit #0 of code9 is the parity of bit #0 of every 64-bit chunk of data written to the sector. Therefore, when we read the sector back, an error in bit #0 of ANY one 64-bit chunk of data will generate an error we can detect with the 64-bit code9 alone (no need for code8, code7... code0). If bit #0 of any 64-bit chunk of data is incorrect, then bit #0 of code9 in the ECC read back from disk will not agree with bit #0 of code9 we compute from the read-back data!

Nice! We detected an error in bit #0 of some 64-bit chunk with only code9. However, we have no idea which of the 511 chunks of data contains an error in its bit #0.

That's what the other eight ECC codes are for (in a manner of speaking).

The other eight ECC codes let us "narrow down" where this error is.

So we ask ourselves, what can code8 tell us? Well, that's totally obvious! code8 only considers chunks 256-511 (last half of the sector), so if the bit #0 error is anywhere in chunks 000-255 (first half of sector), code8 will not detect any error. But wait! If we know the error in bit #0 is NOT in chunks 256-511, then it MUST BE somewhere in chunks 000-255 (first half of sector). So now we know the error is somewhere in chunk 000-255, and not in chunk 256-511. Excellent!

Now we ask ourselves, what can code7 tell us? Well, of the region we are interested in (chunks 000-255), code7 only checks chunks 128-255. So if bit #0 of the code7 ECC we read back from disk differs from the code7 ECC we compute from the read data, we know that bit #0 error is somewhere in chunk 128-255. Sweet! Again we cut the possible location of the error to half the range.

Now what can code6 tell us? How this works is becoming obvious. As before, code6 only detects errors in half of the region we know the error is in. Of the region we narrowed the error down to (chunks 128-255), code6 only checks the second half (chunks 192-255). So when we find no error in bit #0 of code6, we know the bit #0 error is not in chunks 192-255, and thus must be somewhere in chunk 128-191.

When we find an error in bit #0 of code5, we know the error must be somewhere in chunks 160-191.

When we find an error in bit #0 of code4, we know the error must be somewhere in chunks 176-191.

When we find an error in bit #0 of code3, we know the error must be somewhere in chunks 184-191.

When we find NO error in bit #0 of code2, we know the error must be somewhere in chunks 184-187.

When we find NO error in bit #0 of code1, we know the error must be somewhere in chunks 184-185.

When we find an error in bit #0 of code0, we know the error must be in chunk 185.

!!!!! DONE !!!!!

We now know exactly where the error is in our 4096-byte sector --- at bit #0 of 64-bit chunk #185.

And chunk 185 == 0x0B9 == 0 1011 1001

Hmmm. Very interesting! Each zero bit in the chunk # of the error is a code# where we did not find an error, and each one bit in the chunk # of the error is a code# where we did find an error. Which means, we automagically get the chunk # that contains the error in the process of checking the code chunks. When a bit in the readback ECC matches the same bit in the ECC we computed from the read data, we generate a 0, otherwise we generate a 1 (readback-ECC XOR computed-ECC). How simple is that? !!!

With a little extra thought, we see that each bit # in the data-chunks and ECC-chunks is independent. In other words, bit #0 in the ECC chunks is checking bit #0 of every 64-bit chunk of data, while bit #1 in the ECC chunks is checking bit #1 of every 64-bit chunk of data, and so forth. Adjacent bits in the ECC values and data chunks are utterly independent from each other. No bits closer than 64-bits from each other interact in any way. This scheme treats a sector like 64 separate sectors, each only 1/64 as large.

Aha! This must be why this technique can detect and correct any burst of error bits up to 64-bits long --- because each bit is utterly independent (in the ECC chunks) from any bit closer than 64-bits away.

That's it. In my dream state at least, that explains how this works.

What the above doesn't explain is the following (my questions):

#1: Is this how burst ECC is performed on disk drives, etc?

#2: If this technique not conventional, is this technique new?

#3: Is this technique easier for software than conventional techniques?

#4: Do conventional routines have more or less overhead (1.8% for 64-bit bursts in 4K packet)?

#5: How can this be extended to correct two bursts?

PS: My application for ECC is not related to disk drives, but has similar characteristics and requirements. At least initially, my application would do everything in software: compute ECC in software, append to the data-stream (packet), send the data-stream (packet), receive the data-stream (packet), and perform the above process to detect and fix any errors in the data-stream (packet). I suppose this might also turn into circuitry in an FPGA someday, but... not today.