A while ago I wrote a new decimal floating point implementation when I was looking into a Go implementation of Suneido. The original cSuneido (C++) implementation is ok, but it definitely could be better. I ported the Go implementation to jSuneido (Java) but never got around to porting it to cSuneido, and never ended up putting it into production. I did port it to suneido.js (JavaScript) although that was harder because JavaScript doesn't have integers, let alone 64 bit ones. All it has are floating point numbers which you can use as integers but you only get 53 bits of precision. And the bitwise operations are only 32 bit.
Anyway, I decided to port the implementation to C++. Go and C++ aren't that different for this kind of code so it should have been a simple project. But, of course, I couldn't get into the code without thinking of improvements.
One thing I discovered was that although C++ has 64 bit integers, and CPUs are 64 bit, 32 bit programs (like cSuneido) don't have access to 64 bit instructions :-( On Visual C++ most of the 64 bit operations turn into calls to runtime routines. (Comparison is inlined, but still takes multiple instructions.) One of the drawbacks to going 64 bit is that pointers double in size, which can make programs larger and slower. (Even Knuth complained about this.) Linux has an x32 mode which uses 32 bit pointers but runs with the 64 bit instruction set, but as far as I know there's nothing comparable on Windows or OS X. (If you're an old timer like me, you may remember the variety of memory models that came with the transition from 16 bit to 32 bit and breaking the 640k limit.)
The first area where I could see room for improvement was "shifting". Sometimes you want the 64 bit coefficient minimized / shifted to the right, so there are no trailing (decimal) zeroes. Other times you want it maximized / shifted to the left so there is a digit in the most significant position. In the Go version I had simple iterative routines that multiplied or divided by 10 repeatedly (up to 20 times).
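To make that concrete, the naive routines look something like this (a C++ sketch, not the actual code; coef is the 64 bit coefficient, exp is the decimal exponent, and the names are mine):

    #include <cstdint>

    const uint64_t E18 = 1000000000000000000ULL; // 10^18

    // shift left: put a digit in the most significant (19th) position
    uint64_t maximize(uint64_t coef, int& exp) {
        while (coef != 0 && coef < E18) {
            coef *= 10; // safe: coef < 10^18, so coef * 10 < 10^19
            --exp;
        }
        return coef;
    }

    // shift right: strip trailing decimal zeroes
    uint64_t minimize(uint64_t coef, int& exp) {
        while (coef != 0 && coef % 10 == 0) {
            coef /= 10;
            ++exp;
        }
        return coef;
    }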
My first thought was that you could use the number of leading or trailing zero bits to at least estimate how much you had to multiply or divide. When dividing you can use the factor of 2 in 10: if you have no trailing zero bits, then you know the value isn't evenly divisible by 10, and if you have 3 trailing zero bits then you might be able to divide evenly by up to 1000. I combined that with a kind of binary search to take larger steps initially.
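A sketch of the trailing zeros idea, using the GCC/Clang intrinsic __builtin_ctzll (the binary search stepping is left out to keep it short):

    #include <cstdint>

    // 10^k divides coef only if 2^k does, so the number of trailing
    // zero bits is an upper bound on how far we can shift right
    uint64_t minimize2(uint64_t coef, int& exp) {
        int limit = coef != 0 ? __builtin_ctzll(coef) : 0;
        while (limit-- > 0 && coef % 10 == 0) {
            coef /= 10;
            ++exp;
        }
        return coef;
    }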
I had a harder time with the leading bits. 3 leading zero bits means you can multiply by 8. 4 leading zero bits means you can multiply by 16. From that I drew the incorrect conclusion that 4 leading zeroes were required to allow multiplying by 10, and it took me way too long to discover my mistake. For example, with 8 bit values, 16 has only three leading zero bits, but 16 * 10 = 160, which safely fits within the maximum 8 bit value of 255.
Eventually I realized what I needed was integer log 10. That can be calculated efficiently using the number of leading zero bits (see Hacker's Delight), which in turn can be calculated efficiently using compiler intrinsics like _BitScanReverse (Visual C++) or __builtin_clzll (GCC).
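Roughly, that calculation looks like this (my sketch; the 1233/4096 approximation of log10(2) is the standard trick from Hacker's Delight):

    #include <cstdint>

    static const uint64_t pow10[20] = {
        1ULL, 10ULL, 100ULL, 1000ULL, 10000ULL, 100000ULL,
        1000000ULL, 10000000ULL, 100000000ULL, 1000000000ULL,
        10000000000ULL, 100000000000ULL, 1000000000000ULL,
        10000000000000ULL, 100000000000000ULL, 1000000000000000ULL,
        10000000000000000ULL, 100000000000000000ULL,
        1000000000000000000ULL, 10000000000000000000ULL };

    // number of decimal digits in x, i.e. integer log 10 plus one
    int digits10(uint64_t x) {
        if (x == 0)
            return 0;
        int bits = 64 - __builtin_clzll(x); // position of highest set bit
        int d = (bits * 1233) >> 12;        // approximates bits * log10(2)
        return d + (x >= pow10[d]);         // estimate may be one low
    }

With something like digits10 in hand, maximizing becomes a single multiply by the right power of ten instead of a loop.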
I was also wrong in assuming I could improve the division implementation. I thought maybe I could use double floats to do the division, but converting back and forth isn't easy. I looked into division by reciprocals calculated with Newton-Raphson, and I looked into Goldschmidt division. But none of them seemed better than simple long division. Long division is normally slow because you only calculate one digit of the quotient per iteration. But if the divisor is less than the full 64 bit precision (the normal case), then you can get a bunch of digits per iteration. For example, the first iteration of 1/3 calculates 1e19 / 3, which gives 19 digits of precision in one iteration. Another trick I found was that even if the divisor has full precision, after a few iterations you can reduce its precision (and therefore the number of iterations) without losing precision in the result. (Although I have no formal proof of that.)
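In rough outline (a simplified sketch: it assumes the divisor has less than full precision, i.e. y < 10^18, and it leaves out the decimal exponent bookkeeping and the reduced precision trick):

    #include <cstdint>

    const uint64_t E18 = 1000000000000000000ULL; // 10^18

    // quotient coefficient of x / y to roughly 19 significant digits
    uint64_t divide(uint64_t x, uint64_t y) {
        uint64_t q = x / y; // first chunk of quotient digits
        x %= y;
        while (x != 0 && q < E18) {
            // scale remainder and quotient up by the same power of ten
            while (x < E18 && q < E18) {
                x *= 10;
                q *= 10;
            }
            q += x / y; // one divide yields many digits when y is short
            x %= y;
        }
        return q;
    }

For 1/3 the first pass through the loop divides 10^18 by 3, producing 18 quotient digits with a single machine division.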
In comparison, multiplication should be easy - just split the numbers into "halves" and calculate partial products and add them up. But, as the saying goes, the devil is in the details. The first problem is that 19 digits doesn't divide evenly into halves. One approach is to drop one digit and split 9 - 9, but then you lose precision. You could split into three parts, but that adds a bunch of extra multiplies and adds. Unfortunately, 10 digits * 10 digits can overflow 64 bits. But 10 digits * 9 digits is safe, so I realized I could split one of the numbers 10-9 and the other 9-9. The asymmetry complicates the code a little, but gives a bit more precision. Then I realized I could split one 10-9 and the other 9-10. That does lead to one of the partial products being 10 digits * 10 digits, but you only need the most significant digits of the result so you can reduce it to 9 * 9 without losing precision.
The other trick to getting as much precision as possible is that if high * high is less than full precision, then you can use an additional digit of precision from the middle partial products. And then you can use the high digit from low * low.
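Here's an outline of the split (my reconstruction, not the actual code; it assumes maximized 19 digit coefficients and just truncates the partial products, so it gives up the last digit or two that the tricks above recover):

    #include <cstdint>

    const uint64_t E8 = 100000000ULL;    // 10^8
    const uint64_t E9 = 1000000000ULL;   // 10^9
    const uint64_t E10 = 10000000000ULL; // 10^10

    // top ~19 digits of x * y, i.e. x * y / 10^19, for coefficients
    // in [10^18, 10^19); exponent handling omitted
    uint64_t multiply(uint64_t x, uint64_t y) {
        uint64_t a = x / E9;  // high 10 digits of x
        uint64_t b = x % E9;  // low 9 digits of x
        uint64_t c = y / E10; // high 9 digits of y
        uint64_t d = y % E10; // low 10 digits of y

        // x * y / 10^19 = a*c + a*d/10^10 + b*c/10^9 + b*d/10^19
        uint64_t result = a * c;            // 10 x 9 digits: fits in 64 bits
        result += (a / 10) * (d / 10) / E8; // the 10 x 10 product, reduced to 9 x 9
        result += b * c / E9;               // 9 x 9 digits: fits
        // b * d / 10^19 < 1, so it only affects rounding
        return result;
    }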
Getting this all working was way more of a struggle than it should have been. I spent a lot of time staring at 19 digit numbers and trying to figure out why the results were wrong in the last few digits.
Along the way I discovered several bugs in the Go implementation, which isn't surprising considering it was never actually used. But obviously I need better test cases.
One option I haven't explored is using "magic" numbers to replace division (e.g. by 10) with multiplication and shifting. GCC and Clang do this automatically; Visual C++ does not. However, even GCC and Clang don't appear to do it in 32 bit code for 64 bit values. (Tip: you can use Compiler Explorer to see/get the magic numbers.)
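For example, the standard magic number for dividing a 64 bit value by 10 works like this (the constant is well known, but note the 128 bit multiply, which is exactly what's awkward in 32 bit code; unsigned __int128 is GCC/Clang only):

    #include <cstdint>

    // x / 10 for any uint64_t x: multiply by 0xCCCCCCCCCCCCCCCD
    // (the ceiling of 2^67 / 10), keep the high 64 bits, shift right 3
    uint64_t div10(uint64_t x) {
        unsigned __int128 p = (unsigned __int128) x * 0xCCCCCCCCCCCCCCCDULL;
        return (uint64_t)(p >> 64) >> 3;
    }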
When I'm debugging this kind of code, I often end up writing it with each sub-expression computed separately and assigned to a separate variable. (reminiscent of SSA) This makes it clearer in the debugger what all the intermediate values are. Unfortunately, it makes the code harder to read so I usually remove it once the code is working. Of course, then I find a new bug and wish I had it back. It would be nice if there were refactoring tools to easily switch between the two forms. Or if the debugger would automatically capture and display the intermediate results.
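For example, one line of the multiply sketch above versus its debug friendly expansion (variable names invented for illustration):

    // compact form
    result += (a / 10) * (d / 10) / E8;

    // expanded form: every intermediate value visible in the debugger
    uint64_t a1 = a / 10;     // drop a digit from each factor
    uint64_t d1 = d / 10;
    uint64_t mid = a1 * d1;   // 9 x 9 digit partial product
    uint64_t mid9 = mid / E8; // scale back into position
    result += mid9;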
I also tend to add a bunch of debugging output to print all these intermediate results, which I also usually end up wishing I had back soon after removing it. It would be nice if IDEs could toggle showing or hiding this kind of debugging code.
Some tools I find useful for this kind of work:
Hacker's Delight - a great resource. Most of the material can be found on-line these days, but you have to search through a lot of junk to find it.
WolframAlpha - great for things like 2^64 - 1 or multiplying two 20 digit numbers, or converting between number bases. There are other ways to do this stuff, but I find Alpha the easiest. I've often wished I was a Mathematica whiz, but other than bit twiddling I'm not really into math these days.
Compiler Explorer - handy for getting an idea how different compilers translate your code
Big Integer Calculator - not as powerful as WolframAlpha but simpler for e.g. truncating integer division