Wednesday, September 09, 2020

Checksums

I want to have checksums on parts of the database in gSuneido. In cSuneido I used Adler32, which is available in Go, but was it the best option? (Checksums are hash functions but designed for error detection, as opposed to hash functions designed for cryptography or for hash tables.)

I didn't find a lot of clear recommendations on the web. Adler32 is a variant of Fletcher and there was some question over which was better.

A Cyclic Redundancy Check (CRC) should be better at error detection than Adler or Fletcher. But most of what I could find talks about single bit errors. I'm more concerned with chunks of data not getting written to disk or getting overwritten. It's unclear how that relates to single bit errors.

I was also wondering if I could get away with a 16 bit (2 byte) checksum. But the Go standard library doesn't have any 16 bit checksums. I could use half the bits of a 32 bit checksum. In theory a good checksum should spread the information over all the bits, so it seemed reasonable. But how would that affect error detection?

One drawback to CRC's is that they don't handle zero bytes very well. And unfortunately, memory mapped files can easily have extra zero bytes (their default value) if data doesn't get written. But in that case the zero bytes would be instead of actual data, which should result in different checksums.

My other concern was speed. Adler and Fletcher are simpler algorithms than CRC so they should be faster. Adler uses a theoretically slower division than Fletcher, but the Go compiler converts division by constants into multiplication by the reciprocal so that shouldn't matter.

I wrote a quick benchmark (using Go's handy benchmark facility) and I was quite surprised to see that crc32 was much faster than adler32. Of course, the results vary depending on the details. This was for 1023 bytes of random data.

BenchmarkAdler32-24              3124998               332 ns/op
BenchmarkCrc32IEEE-24           11993930                91 ns/op
BenchmarkCrc32Cast-24           24993260                47 ns/op

I assume the speed difference is due to hardware (instruction set) support for CRC.

I found third party crc16 implementations, but I don't think they are using hardware support so I assume it will be faster to use half of crc32.

I also found that IEEE slowed down with random block sizes. e.g. a checksum of 1023 bytes was a lot slower than a checksum of 1024 bytes. Castagnoli was better in this respect, and supposedly it's also better at error detection.

I also did a bunch of manual experimenting with error detection. I found using 16 bits of crc32 worked quite well. (missed only one or two more errors out of 100,000 cases) It didn't seem to matter which 16 bits. (Dropping to 8 bits made a bigger difference, not surprisingly.)

So for now I'm going with 16 bits from crc32 Castagnoli.