Tuesday, March 20, 2018

Bit Twiddling

A while ago I wrote a new decimal floating point implementation when I was looking into a Go implementation of Suneido. The original cSuneido (C++) implementation is ok, but it definitely could be better. I ported the Go implementation to jSuneido (Java) but never got around to porting it to cSuneido, and never ended up putting it into production. I did port it to suneido.js (JavaScript) although that was harder because JavaScript doesn't have integers, let alone 64 bit ones. All it has are floating point numbers which you can use as integers but you only get 53 bits of precision. And the bitwise operations are only 32 bit.

Anyway, I decided to port the implementation to C++. Go and C++ aren't that different for this kind of code so it should have been a simple project. But, of course, I couldn't get into the code without thinking of improvements.

One thing I discovered was that although C++ has 64 bit integers, and CPUs are 64 bit, 32 bit programs (like cSuneido) don't have access to 64 bit instructions :-( On Visual C++ most of the operations turn into calls to runtime routines. (Comparison is inlined but still takes multiple instructions.) One of the drawbacks to going 64 bit is that pointers double in size, which can make programs larger and slower. (Even Knuth complained about this.) Linux has an x32 mode which uses 32 bit pointers but runs with the 64 bit instruction set. But as far as I know there's nothing comparable on Windows or OS X. (If you're an old timer like me, you may remember the variety of memory models related to the transition from 16 bit to 32 bit and breaking the 640k limit.)

The first area where I could see room for improvement was "shifting". Sometimes you want the 64 bit coefficient minimized / shifted to the right, so there are no trailing (decimal) zeroes. Other times you want it maximized / shifted to the left so there is a digit in the most significant position. In the Go version I had simple iterative routines that multiplied or divided by 10 repeatedly (up to 20 times).

My first thought was that you could use the number of leading or trailing zero bits to at least estimate how much you had to multiply/divide. When dividing you can use the factor of 2 in 10. i.e. if you have no trailing zero bits, then you know it's not evenly divisible by 10. And if you have 3 trailing zero bits then you might be able to divide evenly by up to 1000. I combined that with a kind of a binary search to take larger steps initially.
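The idea can be sketched in Go (hypothetical names, not the actual routines): since each factor of 10 contributes a factor of 2, the trailing zero bit count is an upper bound on how many decimal shifts are even possible.

```go
package main

import (
	"fmt"
	"math/bits"
)

// shiftRight divides coef by 10 while it is evenly divisible.
// 10^k has k factors of 2, so k trailing decimal zeros require at
// least k trailing zero bits - that bounds the work up front.
func shiftRight(coef uint64) (uint64, int) {
	if coef == 0 {
		return 0, 0
	}
	shifts := 0
	max := bits.TrailingZeros64(coef) // upper bound on decimal shifts
	for shifts < max && coef%10 == 0 {
		coef /= 10
		shifts++
	}
	return coef, shifts
}

func main() {
	c, s := shiftRight(12300)
	fmt.Println(c, s) // 123 2
}
```

If the trailing zero bit count is 0, the loop never runs at all, which is the common case for coefficients that are already minimized.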

I had a harder time with the leading bits. 3 leading zero bits means you can multiply by 8. 4 leading zero bits means you can multiply by 16. From that I drew the incorrect conclusion that 4 leading zeroes were required to allow multiplying by 10. It took me way too long to discover my mistake. For example, with 8 bit values, 16 only has three leading zero bits, but 16 * 10 = 160, which safely fits within the maximum 8 bit value of 255.

Eventually I realized what I needed was integer log 10. That can be calculated efficiently using the number of leading zero bits (see Hacker's Delight). Which in turn can be calculated efficiently using compiler intrinsics like _BitScanReverse (Visual C++) or __builtin_clzll (GCC).
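Here's a minimal sketch of that integer log 10, along the lines of the Hacker's Delight approach, using Go's math/bits for the leading zero count (the names are mine, not the actual implementation):

```go
package main

import (
	"fmt"
	"math/bits"
)

var pow10 = [...]uint64{1, 10, 100, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8,
	1e9, 1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19}

// ilog10 returns floor(log10(x)) for x > 0.
// bits.Len64 is 64 minus the leading zero count, and
// 1233/4096 approximates log10(2), so t is either exact or one high.
func ilog10(x uint64) int {
	t := (bits.Len64(x) * 1233) >> 12
	if x < pow10[t] {
		t--
	}
	return t
}

func main() {
	fmt.Println(ilog10(123456789)) // 8
}
```

On most targets bits.Len64 compiles down to a single instruction (the same _BitScanReverse / __builtin_clzll intrinsics mentioned above).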

I was also wrong in assuming I could improve the division implementation. I thought maybe I could use double floats to do the division, but converting back and forth isn't easy. I looked into division by reciprocals calculated by Newton-Raphson. And I looked into Goldschmidt. But none of them seemed better than simple long division. Long division is normally slow because you only calculate one digit of the quotient per iteration. But if the divisor is less than the full 64 bit precision (the normal case), then you can get a bunch of digits per iteration. For example, the first iteration of 1/3 calculates 1e19 / 3, which gives 19 digits of precision in one iteration. Another trick I found was that even if the divisor has full precision, after a few iterations you can reduce the precision (and therefore the iterations) without losing precision in the result. (Although I have no formal proof of that.)
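To illustrate the point about digits per iteration: since both operands fit in 64 bits, one hardware divide of the scaled dividend yields 19 quotient digits at once.

```go
package main

import "fmt"

func main() {
	// the first "iteration" of 1/3 computes 1e19 / 3 -
	// a single 64-bit divide producing 19 quotient digits
	const dividend = uint64(10000000000000000000) // 1e19
	fmt.Println(dividend / 3)                     // 3333333333333333333
}
```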

In comparison, multiplication should be easy - just split the numbers into "halves" and calculate partial products and add them up. But, as the saying goes, the devil is in the details. The first problem is that 19 digits doesn't divide evenly into halves. One approach is to drop one digit and split 9 - 9, but then you lose precision. You could split into three parts, but that adds a bunch of extra multiplies and adds. Unfortunately, 10 digits * 10 digits can overflow 64 bits. But 10 digits * 9 digits is safe, so I realized I could split one of the numbers 10-9 and the other 9-9. The asymmetry complicates the code a little, but gives a bit more precision. Then I realized I could split one 10-9 and the other 9-10. That does lead to one of the partial products being 10 digits * 10 digits, but you only need the most significant digits of the result so you can reduce it to 9 * 9 without losing precision.

The other trick to getting as much precision as possible is that if high * high is less than full precision, then you can use an additional digit of precision from the middle partial products. And then you can use the high digit from low * low.

Getting this all working was way more of a struggle than it should have been. I spent a lot of time staring at 19 digit numbers and trying to figure out why the results were wrong in the last few digits.

Along the way I discovered there were several bugs in the Go implementation. Which isn't surprising considering it was never used. But I obviously need better test cases.

One option I haven't explored is using "magic" numbers to replace division (e.g. by 10) with multiplication and shifting. GCC and Clang compilers do this automatically, Visual C++ does not. However they don't appear to do it in 32 bit code for 64 bit values. (Tip - you can use Compiler Explorer to see/get the magic numbers.)
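For illustration, here's what the magic number approach looks like for dividing by 10, sketched in Go using the constant that GCC and Clang emit (this is the standard unsigned divide-by-10 magic number; the function name is mine):

```go
package main

import (
	"fmt"
	"math/bits"
)

// div10 computes floor(x / 10) without a divide instruction:
// floor(x * ceil(2^67 / 10) / 2^67), where the multiply by 2^-64
// comes for free by taking the high half of a 128-bit product
func div10(x uint64) uint64 {
	hi, _ := bits.Mul64(x, 0xCCCCCCCCCCCCCCCD)
	return hi >> 3
}

func main() {
	fmt.Println(div10(123456789)) // 12345678
}
```

On 64 bit targets this compiles to a multiply and a shift, which is why compilers prefer it; in 32 bit code the 128 bit product itself would need runtime helpers, matching the observation above.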

When I'm debugging this kind of code, I often end up writing it with each sub-expression computed separately and assigned to a separate variable. (reminiscent of SSA) This makes it clearer in the debugger what all the intermediate values are. Unfortunately, it makes the code harder to read so I usually remove it once the code is working. Of course, then I find a new bug and wish I had it back. It would be nice if there were refactoring tools to easily switch between the two forms. Or if the debugger would automatically capture and display the intermediate results.

I also tend to add a bunch of debugging output to print all these intermediate results, which I also usually end up wishing I had soon after removing it. It would be nice if IDE's could toggle showing or hiding this debugging code.

Some tools I find useful for this kind of work:

Hacker's Delight - a great resource. Most of the material can be found on-line these days, but you have to search through a lot of junk to find it.

WolframAlpha - great for things like 2^64 - 1 or multiplying two 20 digit numbers, or converting between number bases. There are other ways to do this stuff, but I find Alpha the easiest. I've often wished I was a Mathematica whiz, but other than bit twiddling I'm not really into math these days.

Compiler Explorer - handy for getting an idea how different compilers translate your code

Big Integer Calculator - not as powerful as WolframAlpha but simpler for e.g. truncating integer division

Tuesday, January 16, 2018

Small Size Data Structure Optimization

The usual approach for an expandable "vector" data structure is to use a dynamically allocated array that doubles in size as necessary. That's how C++ std::vector, Java ArrayList, and Go slice/append work.

But that means the time and space overhead of a heap allocation. And it means indirection and poor locality for CPU caches. (The array itself is contiguous, but you have to chase a pointer to get to it.)

If you have a lot of small lists, one possible optimization is to "embed" a small array inside the list structure. That's not really feasible in Java, but easily done in C++ or Go. One example of this is the small string optimization (SSO) in C++.

But what happens when you grow beyond that small embedded array? The simplest solution is to just stop using it. But that seems wasteful, both in space and in locality.

Another option is to treat the embedded array as the start of the larger overall array, at the cost of complicating the indexing a little. That fixes the space wastage, but we still don't get much benefit from locality since the additions to a vector go at the end.

What if we treat the embedded array as following the used elements of the larger dynamically allocated array? Additions go into the embedded array until it fills up, and then we spill the contents into the larger dynamically allocated array.

If the embedded array holds e.g. 4 elements, then 3/4 of the adds only access the embedded array with good locality. Every 4th addition will block move the entire contents of the embedded array (4 elements) into the larger dynamic array (which will occasionally double in size). So the embedded array acts as a "buffer".
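A minimal Go sketch of the idea (a hypothetical SmallVec, not my actual code): the embedded array buffers additions, and logically the spilled slice comes first, followed by the buffer contents.

```go
package main

import "fmt"

const bufSize = 4

// SmallVec buffers additions in an embedded array and spills them
// into a heap-allocated slice when the buffer fills.
// Logically, rest holds the older elements and buf the newest.
type SmallVec struct {
	buf  [bufSize]int
	n    int   // number of elements currently in buf
	rest []int // spilled elements, logically preceding buf
}

func (v *SmallVec) Add(x int) {
	if v.n == bufSize {
		v.rest = append(v.rest, v.buf[:]...) // spill the whole buffer
		v.n = 0
	}
	v.buf[v.n] = x
	v.n++
}

func (v *SmallVec) Get(i int) int {
	if i < len(v.rest) {
		return v.rest[i]
	}
	return v.buf[i-len(v.rest)]
}

func (v *SmallVec) Len() int { return len(v.rest) + v.n }

func main() {
	var v SmallVec
	for i := 0; i < 10; i++ {
		v.Add(i)
	}
	fmt.Println(v.Len(), v.Get(0), v.Get(9)) // 10 0 9
}
```

Indexing pays for one extra comparison, but additions touch only the embedded array three times out of four.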

Merge trees and buffered Btrees have a somewhat similar pattern in that most of the activity takes place in a small area which periodically spills into the rest of the data structure.

I did some quick benchmarking using Go. My initial result was that it was slightly slower, but did about half as many allocations (because of the embedded small array). The speed was a little disappointing, but I was comparing against a simple Go slice append so my code was admittedly more complex.

Of course, the locality part of it is harder to measure. A small test that fits entirely in cache will not benefit much from locality. Sure enough, if I made the test create many more vectors, then the new version beat a simple Go slice append by about 30%.

The allocation advantage is entirely dependent on how big the vectors are. If they are small enough to fit in the embedded array, then there are zero allocations. If the vectors are very large then the number of allocations approaches the same as the slice append. Of course, the whole reason for the embedded array is that you have a lot of little objects.

One drawback of the embedded array is that it makes the data structure larger. A vector can be as small as a pointer and two integers or pointers, which is reasonable to pass by value. Once you add an embedded array, you probably don't want to pass it by value or store it by value in other vectors or maps. If that means you have to allocate it on the heap, then you won't get as much advantage. However, if you're using it locally or embedding it in a larger data structure then it's fine.

Sunday, January 07, 2018

A Tale of Two Hash Tables

For some reason I started thinking about hash tables again. I realize only a geek would think so, but I find them pretty cool. How can you not like a data structure whose lookup time is O(1) i.e. independent of size. The other fascinating part is that there are a huge number of different designs for hash tables, each with its own set of tradeoffs.

To beat the critics to the punch ... Did I have a pressing problem that I needed to solve? Nope. Did I benchmark the old code or the new code? Nope. Would it make more sense to just use std::unordered_map and quit reinventing the wheel? Probably. Was it a fun challenge that would give incremental improvements? I think so. I could say it's because I'm semi-retired and can afford to mess around. But I've been doing it all along. What's Suneido but the biggest example of reinventing the wheel? Right or wrong, it's what I enjoy and the way I roll, and I think it's worked out pretty well. I try not to take it too far. Originally I wrote my own garbage collectors for cSuneido. I'm not talking about yet-another-reference-counting-class, this was a full blown conservative bit mapped garbage collector. A fun challenge and probably some of my best work. But when the obviously better Boehm collector came along I switched over and threw out my own version.

cSuneido uses a pretty standard traditional hash table design - an array of pointers to nodes with separate chaining for collisions. It has worked fine for 20 years. But it has some drawbacks - every insert requires an allocation, there's considerable space overhead (pointers, separately allocated blocks, padding), and all the indirection means it's not cache friendly.

The design of the Go hash table is quite interesting. It uses an array of buckets, each holding e.g. 8 entries. Within a bucket there are separate arrays for keys and values to eliminate padding. Although the initial buckets are allocated in one large array, overflow buckets are chained. It also caches the high byte of hashes to speed up searches.

I reimplemented the Go hash table when I was working on gSuneido. (I couldn't just use the built-in one because it doesn't allow arbitrary types for keys like Suneido needs.) At that point the Go one was written in C, nowadays it's written in Go (albeit unsafe).

So I thought I'd rewrite cSuneido's hash table using the Go design. It took me a day or two to code and debug and it worked well.

But I remembered an article I'd read with the provocative title of "I Wrote the Fastest Hashtable". It uses open addressing, linear probing, robin hood hashing, prime sizing, and a probe limit. One drawback was that it was using a traditional array-of-structs model, which means wasted space for padding, especially since it includes a one byte value per slot which is pretty much guaranteed to get padded. But that issue is easy to overcome with an approach similar to the Go design, by storing the table in "blocks", and within each block using separate arrays for the keys and values (and the one byte "distance"). However, unlike the Go design, the table is still treated as a single large flat array. The tradeoff is that it slows down indexing slightly.

So I took another day or two and implemented this design. I love the simplicity of the lookup code:

int i = index_for_key(k);
for (int d = 0; d <= distance(i); ++d, ++i)
    if (HE::keyeq(k, key(i)))
        return i;
return -1;

You can't get much simpler than that. (Another nice feature of this design is that bounds checks are eliminated by adding extra slots on the end, which is feasible because of the probe limit.)

This design trades off slightly more expensive inserts and deletes for faster lookups. For the usual case of more lookups than updates, that's a good tradeoff. One negative aspect is that resizing the hash table will be slightly more expensive. But that's probably not a large factor because when you're growing a hash table, the new table is sparse and therefore collisions are less frequent.

One feature that makes me a little nervous is that the table grows if an insert exceeds the probe limit, with the probe limit set to log2(n). In most hash tables if all the keys hash to the same value you end up with a linear search. In this design the table would grow until you ran out of memory or hit a size limit. (hash table denial of service vulnerability). But in practice, the prime table sizes make it almost impossible for this to happen.
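The probe limit itself is trivial to compute from the table size; a sketch:

```go
package main

import (
	"fmt"
	"math/bits"
)

// probeLimit caps linear probing at log2(n); an insert that would
// exceed it triggers a resize instead of an ever-longer search
func probeLimit(n uint64) int {
	return bits.Len64(n) - 1 // floor(log2(n)) for n > 0
}

func main() {
	fmt.Println(probeLimit(1024)) // 10
}
```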

It took me a bit longer to debug than I expected with the simple code. The insert and delete are a little trickier with robin hood hashing because elements get moved around. C++ templates also gave me a bit of grief, as usual.

Testing with (pseudo) random values I got the expected good results:
- average probes for lookup was 1.3
- average load at resize was .75
- average swaps per insert was 1.3 (although max was 55!)

An average load at resize of .75 means the overall average load is about .56: resizing doubles the table, which drops the load to .375, so over a table's lifetime the load averages roughly (.375 + .75) / 2, a little over half full. That's fairly typical for hash tables. It's the cost you pay for the performance. On the positive side, this design has only 1 byte of overhead per slot, and no overhead for pointers or padding or separate allocation, so it's better than many hash tables.

Here's a snapshot of the code. It'll end up in the cSuneido repo at some point. Note - it's not a full STL container since that requires a lot of extra code that I don't need. It also assumes that keys and values are simple objects (like integers or pointers) that are fast to assign and copy. See the article linked above for a more general purpose implementation.

Friday, January 05, 2018

A Bittersweet Triumph

I've been fascinated with the game of Go since I was young. As a kid I played a fair bit of chess, never seriously, but I could usually beat my peers. That was balanced by regularly getting thrashed by a 90 year old in my neighborhood that I was conscripted to play with. After I discovered Go I pretty much gave up on Chess. My business partner and I used to play a lot of Go in the early days of our company. It was a good way to occupy our time on business trips. We were never very good, and almost never played with anyone else, but it was a fun mental challenge. I still play regularly on my iPad against SmartGo Kifu and Crazy Stone (which claims to use deep learning), but only small 9x9 games. (and I'm still not any good!)

One of the things that attracted me to Go was that it wasn't amenable to the same techniques that were used in software to conquer Chess. Of course, I don't believe there is any magic in human brains, so presumably there was a way for computers to handle it. Until AlphaGo, most people believed it would be a long time, perhaps a decade, before computers could beat the best human players. So it was a bit of a shock to everyone when AlphaGo recently beat the world's best human players.

I highly recommend the full length (90 min) documentary about AlphaGo - a Go playing computer program developed by DeepMind, a company owned by Google. (Don't be scared off by "AI". The documentary isn't technical, it's more about the people.) It's available on Netflix, iTunes, Google Play, and Amazon.

For more of the technical details, see the Nature articles. They're behind a paywall on the Nature web site, but they seem to be available on Scribd - Mastering the game of Go with deep neural networks and tree search and Mastering the game of Go without human knowledge

I've never been involved in AI, but it's always fascinated me. It was one of the original reasons I got into computers. In the early days there was a lot of hype about neural networks. But they didn't live up to their hype and there was a backlash that meant they were downplayed for many years. Recently they've seen a huge comeback and are now living up to much of those original hopes. Partly that's because of advances in hardware like GPU's.

The first AlphaGo was trained partly on human games. Its successor, AlphaGo Zero, learned completely on its own, by playing itself. It surpassed the earlier version in a matter of days. Not long after, the AlphaGo project was ended and the team was disbanded. I guess it's a done deal.

Tuesday, December 19, 2017

Concurrency Again

It started with long update transactions. Mostly we ignored them - some of our customers have slow servers, and sometimes they try to do transactions that are too large.

jSuneido aborts long update transactions as a defense mechanism. Its optimistic concurrency control has to track overlapping transactions, so if a transaction takes a long time, it can lead to a lot of overlaps.

But some of these long update transactions didn't make much sense. They appeared to be slow in the final commit stage. This stage just writes to the memory mapped files and normally is very fast, nowhere near the 20 second transaction limit.

I started going through the code, trying to see what might explain this. I wondered if it might have something to do with the background (once a minute) task that calls force on the MappedByteBuffers to flush them to external storage. Trying to ensure that files are durably written is tricky these days with so many levels of caching etc. but it seems worth trying.

I would have thought that force itself would be fast since most of the work would be done at the operating system level. But the documentation says that it doesn't return till it finishes writing to storage. I added some logging and found that force is usually fast (a millisecond) but occasionally slow (tens of seconds). There didn't seem to be a particular reason for the slow ones. I assume Windows is doing something that blocks.

The slow force would be enough to cause long update transactions, but force was supposed to be happening in a background thread so it shouldn't have blocked anything.

Crap! The force method was synchronized. And so was a method that transactions needed to use to commit. So the force would block the commit and could cause the problem. The problem was relatively rare because the force was only once a minute and usually fast, and it would have to hit exactly when a transaction was committing.

Did force really need to be synchronized? Did the other methods? Back to the age old problems of concurrency with shared memory and locking.

I removed the synchronized. But then I worried about accessing the array of MappedByteBuffers without any locking. I could make the array reference volatile but not the array elements. So I changed it to use AtomicReferenceArray.

At first I thought it was working ok. I ran tests with a bunch of threads hammering on the database and it seemed to work ok. But then my database got corrupted. Was my change failing? I ran the tests again and everything was fine. Hmmm. I changed the force to run every 10 seconds instead of every minute. Sure enough, now the database would get corrupted every time I ran the test.

It still didn't make sense to me that removing synchronized from force would corrupt the data. I noticed that my force method was also writing a few zeros to the end of the file to force the last modified date of the file to get updated. I could see that having concurrency issues. I changed it to rewrite the first few bytes of the file header instead. That fixed it, no more corruption. Hopefully it's smart enough to see that nothing has actually changed and not physically write to storage.

Everything looked good according to my tests, but that's generally not sufficient when it comes to concurrency. I also feel like I should convince myself that it is "safe", even if only informally.

Writing to the database (the final stage of an update transaction) was protected by a lock. So there could only be a single writer at a time. (Yes, that's a bottleneck, but it's a small part of the overall process so it doesn't reduce concurrency too much.)

But there are multiple reader threads. So the question is visibility, not exclusion. The Java Memory Model makes no guarantees of visibility without either volatile or locking. To ensure visibility I needed writers to release a lock when they were finished, and readers to acquire that same lock when they were starting.

For database transactions, that seemed to be covered. Transactions called synchronized methods (on the same object) at the beginning and end.

But there were other readers besides database transactions. The database checking process was one. It also turned out to be ok because it acquired the commit lock briefly at the start. (And writers released it at the end.)

That left the force process. Part of the problem was that it's low level code. It doesn't (and shouldn't) "know" about transactions or commit locks etc. I needed something at the same level. Luckily, writers called a particular low level method when they ended so I could synchronize that method along with the force method. Oops! Just about repeated the original mistake of force holding a lock while it was working. It didn't need to hold it since the goal wasn't exclusion. I just needed to acquire (and then release) the lock (to ensure visibility). So I made a dummy "sync" method that was synchronized and called that at the start of the force method.
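The actual code is Java, but the dummy-sync pattern can be sketched in Go (a hypothetical store type): releasing the lock in the writer happens-before a later acquire in the forcing thread, which is all the visibility guarantee needed.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a low level file store. Writers call endWrite under the
// mutex when a commit finishes; force only needs to see their
// writes, not exclude them.
type store struct {
	mu   sync.Mutex
	data []byte
}

func (s *store) endWrite(b []byte) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data = append(s.data, b...)
}

// sync acquires and immediately releases the writers' lock - purely
// so this goroutine sees everything written before the last endWrite
func (s *store) sync() {
	s.mu.Lock()
	s.mu.Unlock()
}

func (s *store) force() {
	s.sync()
	// ... flush s.data to storage WITHOUT holding the lock,
	// so a slow flush can't block committing transactions ...
}

func main() {
	s := &store{}
	s.endWrite([]byte("hello"))
	s.force()
	fmt.Println(len(s.data)) // 5
}
```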

I think I've convinced myself that it is "safe". I wouldn't say I'm one hundred percent sure, but reasonably confident.

Another question is fragility. Would future changes be likely to break this? Minor changes seem pretty safe. The things I'm depending on are reasonably fundamental. If I made bigger changes it could be dangerous, but presumably if I'm making bigger changes I would reconsider concurrency. And I did document it.

So if visibility and exclusion are covered, do I need to use an AtomicReferenceArray? Presumably not, so I reverted those changes and went back to a plain old array.

I did another round of testing on both OS X and Windows and didn't encounter any problems.

The problem with this kind of concurrency issue is that you can't just throw locks at it. The problems are just as likely to come from too many locks as from not enough. I wish you could just lock everything and be done with it. But if you do that, you'll probably have deadlocks, and you'll be back to single threaded performance.

Thursday, November 23, 2017

Suneido Bind

In functional programming you have the idea of "partial application" of functions, where you supply some of the arguments to a function and the result is a new function that takes the remaining arguments. This is also referred to as "binding" some of the arguments, for example C++ std::bind.

We've had a version of this in Suneido for a long time, called "Curry". This is actually not the correct name, although it's related. Technically, currying f(a,b) would give you a function you could call as g(a)(b), where g(a) would give you a function you could call as h(b).

Suneido's Curry similarly allows you to supply some of the leading arguments, creating something you can call later with the remaining arguments. For example:

f = Curry(Print, 'hello')
f('world')
    => hello world

The code is fairly simple, using Suneido's variable number of argument facilities (similar to JavaScript's spread and rest).
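A rough Go equivalent, for illustration only (Suneido's actual Curry uses its own argument mechanisms): a closure captures the leading arguments and appends the rest at call time.

```go
package main

import "fmt"

// curry returns a function with some leading arguments pre-supplied;
// the remaining arguments are appended when the result is called
func curry(f func(...interface{}), pre ...interface{}) func(...interface{}) {
	return func(rest ...interface{}) {
		args := append(append([]interface{}{}, pre...), rest...)
		f(args...)
	}
}

func main() {
	f := curry(func(args ...interface{}) { fmt.Println(args...) }, "hello")
	f("world") // hello world
}
```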

Recently, I was thinking that it would be nice if you could simply write: (like Scala)

f = Print('hello', _)

That would require compiler changes, but I realized I could do something similar to try out the idea:

f = Bind(Print, 'hello', Bind)

I needed something to stand for unsupplied arguments and using Bind itself seemed suitable.

To make it fast, I had the idea that I could generate a custom class. That sounds expensive, but you only need a class per "shape" of bind, not the particular values. And the classes could be cached. The code is a bit more complicated but not bad.

As hoped, the functions returned by Bind were faster than the ones from Curry, about four times faster. That was nice. And although Bind didn't handle variable numbers of arguments like Curry, it didn't require the supplied arguments to be at the beginning.
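The placeholder idea can also be sketched in Go, with a sentinel type playing the role that Bind itself plays in the Suneido version (this is a simple closure-based sketch, not the generated-class implementation described above):

```go
package main

import "fmt"

// placeholder marks an argument slot to be filled in at call time
type placeholder struct{}

var X placeholder

// bind pre-supplies arguments; unlike curry, the supplied ones
// don't have to be at the beginning - placeholders can go anywhere
func bind(f func(...interface{}), pre ...interface{}) func(...interface{}) {
	return func(rest ...interface{}) {
		args := make([]interface{}, len(pre))
		j := 0
		for i, a := range pre {
			if _, ok := a.(placeholder); ok {
				args[i] = rest[j] // fill placeholder from call args
				j++
			} else {
				args[i] = a
			}
		}
		f(args...)
	}
}

func main() {
	f := bind(func(args ...interface{}) { fmt.Println(args...) }, "hello", X)
	f("world") // hello world
}
```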

Of course, the other way to do this is using Suneido blocks (like closures or lambdas in other languages but with a syntax that came from Smalltalk):

f = {|b| Print('hello', b) }

It turned out that was slightly faster than my fancy new Bind. Which means we shouldn't be using Curry either since it's even slower. (jSuneido is even smart enough to see that's not actually a closure and optimize it to just be a normal function, avoiding the overhead of capturing the environment.)

It would still be nicer to write: Print('hello', _) but that could simply be syntactic sugar for the equivalent block. That wouldn't be too hard to implement in jSuneido since it compiles via an AST but it would be harder in cSuneido which compiles in a single pass directly from source to byte code.

Of course, if you had that, then it would also be nice to be able to write e.g. Map(2 * _) as a shorthand for: Map({|x| 2 * x })

Saturday, September 02, 2017

Go Interfaces and Immediate Integers

One of the unique things about the Go language is how it handles "dynamic" references compared to a language like Java.

In Java, a reference is a single pointer to an object on the heap, and the object's header holds its type.

Whereas in Go, an interface value is two words: one holding the type and one pointing to the data.

So in Go an interface is a "fat pointer". Your first thought might be: six of one, half a dozen of the other - what difference does it make where you put the type? However, there are significant tradeoffs.

Next, you might think, I could have multiple references to the same value, in which case Go will use more space because it will duplicate the type. But if we have a large array of values (in which case we're unlikely to have references to all of them), then Go would save space because it wouldn't need the type on each value.

The other big saving is with statically (compile time) typed references in Go: they are just plain pointers to the data, with no type word at all.

And again, you save space because there's no need to store the type at runtime.

Given that background, the problem I was thinking about was how to handle integers in a Go version of Suneido. In Java you have to use boxed (on the heap) integers for generic references.

Older versions of Go were perfect for this purpose since they stored integers right in the interface in the same place as the pointer.

But sadly (for me) when they improved the garbage collector they removed immediate values:
The implementation of interface values has been modified. In earlier releases, the interface contained a word that was either a pointer or a one-word scalar value, depending on the type of the concrete object stored. This implementation was problematical for the garbage collector, so as of 1.4 interface values always hold a pointer. In running programs, most interface values were pointers anyway, so the effect is minimal, but programs that store integers (for example) in interfaces will see more allocations.
So now if you store an integer in an interface it gets allocated on the heap, effectively "boxing" it like Java does, although with less overhead.
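You can see the boxing with testing.AllocsPerRun. (The runtime does cache interface values for very small integers, so the example uses a value well above that range; the global sink keeps the compiler from optimizing the conversion away.)

```go
package main

import (
	"fmt"
	"testing"
)

var sink interface{}

func main() {
	v := 1 << 40 // too large for the runtime's small-integer cache
	allocs := testing.AllocsPerRun(100, func() {
		sink = v // storing an int in an interface heap-allocates it
	})
	fmt.Println(allocs >= 1) // true
}
```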

For something like implementing Suneido, this really sucks. It adds a huge amount of overhead (relatively) on simple integer operations.

You could make an "extra fat" value that had an interface and an integer, but that's ugly too.

In C++ you can use a union to handle immediate integers and that's what cSuneido does. But Go doesn't have unions in the C/C++ sense.

But the union in cSuneido basically reserves a portion of the address space for use by immediate integers. i.e. a pointer falling in that range of addresses is assumed to be and is treated as an integer.

So why not do something similar in Go? Here's my proof of concept:

var space [1024 * 1024 * 1024]int8
var base = uintptr(unsafe.Pointer(&space[0]))

func ExampleSmi() {
    n := 123456789
    x := &space[n]
    i := uintptr(unsafe.Pointer(x)) - base
    fmt.Println(i)
    // Output:
    // 123456789
}
It does use the unsafe package, but only in a read-only, relatively safe way. (IMO) Pointers will always point into valid memory so the garbage collector will be happy. And it should be fast since the unsafe functions are primarily a compile time feature, they don't do much at runtime.

This example handles 30 bit integers (1 gb). (Unsigned here, but easily modified to handle signed.) I tried to allocate a 4 gb array to handle 32 bit integers but the compiler appears to have a limit of 2 gb. (Note: 4gb is an insignificant portion of a 64 bit address space.) That's too bad because I could probably ignore overflow from math on 32 bit integers, but not 30 bit integers.

We want to declare the "space" statically so that "base" is a constant. One question I had was whether this would add 1gb to the executable. With Go 1.9 on OS X it doesn't. (It appears to go in SNOPTRBSS, which I translate as "no pointer uninitialized/zero memory".) My other question is whether it's smart enough to allocate zeroed virtual memory and avoid actually zeroing the memory. I'm not sure how to check this, but there doesn't seem to be any lag in startup. Also, looking at the process in the OS X Activity Monitor, it shows real memory usage under 2mb. Presumably if the virtual memory is never modified it will never use any real memory. Also, importantly, the array is of type int8 so it won't need to be scanned by the garbage collector.

Even better, we can make our own type based on int8 (e.g. type Smi int8) and then define methods on this type. As long as we don't make the Smi type public and are careful to only make instances within the array, then it will work like any other type. We basically have the old behavior of Go, albeit with a smaller range of integers.
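A sketch of such a type, scaled down to 1<<20 values to keep the example small (the mkSmi/val names are hypothetical): the pointer's offset from the base of the array IS the integer value.

```go
package main

import (
	"fmt"
	"unsafe"
)

// smi is a small immediate integer represented as a pointer into a
// reserved array - valid memory, so the garbage collector is happy,
// and int8 elements, so it never scans the array for pointers
type smi int8

var space [1 << 20]smi
var base = uintptr(unsafe.Pointer(&space[0]))

// mkSmi converts an integer to its pointer representation
func mkSmi(n int) *smi {
	return &space[n]
}

// val recovers the integer from the pointer's offset
func (p *smi) val() int {
	return int(uintptr(unsafe.Pointer(p)) - base)
}

func main() {
	x := mkSmi(123456)
	fmt.Println(x.val()) // 123456
}
```

A *smi can then be stored in an interface like any other pointer, without triggering the allocation that storing a plain int would.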

It's a bit of a kludge but it should work quite well, and it's all standard (albeit "unsafe") Go. And it's no worse than tagged pointers or NaN boxing.