Tuesday, October 02, 2018

Precedence Climbing Expression Parsing

Both cSuneido (the original C++ implementation) and jSuneido (the newer Java implementation) use classic recursive descent parsers. And that's what I implemented once again in gSuneido (the in-progress Go implementation).

But classic recursive descent is ugly for expressions, especially when you have lots of levels of precedence (because Suneido followed the C/C++ pattern). The code is verbose because there is a function for every level of precedence. And it's slow because during parsing you have to call through all those functions. For example, an expression consisting of a single identifier requires passing through 15 levels of precedence.
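
To make the shape of the problem concrete, here's a schematic sketch in Go (with invented names - this is not the actual gSuneido code) of what classic recursive descent looks like: one function per precedence level, each deferring to the next, so even a lone identifier walks the whole chain.

type Ast struct {
    op          string
    left, right *Ast
}

type parser struct {
    tokens []string
    pos    int
}

func (p *parser) match(tok string) bool {
    if p.pos < len(p.tokens) && p.tokens[p.pos] == tok {
        p.pos++
        return true
    }
    return false
}

func (p *parser) orExpr() *Ast {
    e := p.andExpr()
    for p.match("or") {
        e = &Ast{"or", e, p.andExpr()}
    }
    return e
}

// ... the real parser has many more levels in between:
// comparison, bitwise, shift, etc. ...
func (p *parser) andExpr() *Ast {
    e := p.addExpr()
    for p.match("and") {
        e = &Ast{"and", e, p.addExpr()}
    }
    return e
}

func (p *parser) addExpr() *Ast {
    e := p.unary()
    for p.match("+") {
        e = &Ast{"+", e, p.unary()}
    }
    return e
}

func (p *parser) unary() *Ast {
    if p.match("-") {
        return &Ast{op: "-", left: p.unary()}
    }
    tok := p.tokens[p.pos] // identifier or literal
    p.pos++
    return &Ast{op: tok}
}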

I've been aware of top down operator parsing (TDOP), aka Pratt parsing, for a while (see my previous blog post). And more recently I came across precedence climbing, which is basically a simplified version of TDOP. I started, and one of my programmers finished (thanks Hao!), a TDOP parser for the Suneido language, written in the Suneido language. We use it, for example, for search and replace by expression.

But I had always found TDOP somewhat hard to follow. The original Pratt paper is not the clearest. Crockford's version is better but still uses Pratt's cryptic "nud" (null denotation) and "led" (left denotation) terminology.

Recently I read Writing An Interpreter In Go and Writing A Compiler In Go by Thorsten Ball. (Recommended, although I found myself skipping over the masses of test code.) They use a TDOP parser, with the less opaque terminology of "prefix" instead of "nud" and "infix" instead of "led".

The main advantage of TDOP is that it is modular and easily extensible. But I had a more or less fixed target. So I decided to try rewriting the gSuneido expression parser using precedence climbing instead. (gSuneido is a good place to try out ideas because it's not in production and I don't have to worry about breaking anything.)

The basics are easy. It took a bit of experimentation to get things like assignment operators and ?: working but it wasn't too bad.

One thing that puzzles me is that assignment is normally thought of as having the lowest precedence. But I had to make it the highest precedence in order to get things like "false is x = f()" to work. In my old classic recursive descent implementation assignment was high precedence as well.

With my classic recursive descent parser, it was hard-coded into the parser that only certain expressions (lvalues, e.g. x or v[0] or a.b.c) were allowed on the left of an assignment (or as the operand of ++ or --). That wasn't the case with this version, so I had to add explicit checking for it.

The end result (initial version or latest) is smaller than the old code, and runs quite a bit faster. In some ways it's not as clear, for example, figuring out how the minimum precedence argument works. But in other ways it's clearer, for example, there's an explicit precedence table (instead of the order of recursive calls). However, I'm not sure if it's enough better to warrant going back and changing the cSuneido or jSuneido versions.

The parallels between precedence climbing and TDOP are easy to see in the code. The pcExpr method has a switch with the led/infix handling, and the atom method has a switch with the nud/prefix handling. It would be easy to refactor to a table driven style (like Pratt) or an object oriented style (like Crockford). I didn't do this because it would have made the code slower (due to the indirect function calls) and larger.
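
Here's a minimal sketch of that shape in Go (reusing the toy Ast/parser types from the sketch above, with invented helpers - not the actual gSuneido code). pcExpr keeps consuming operators as long as their precedence is at least the minimum; atom handles the prefix cases.

// precedence table - higher binds tighter
var prec = map[string]int{"or": 1, "and": 2, "is": 3, "+": 5, "-": 5, "*": 6, "/": 6}

func (p *parser) peek() string {
    if p.pos < len(p.tokens) {
        return p.tokens[p.pos]
    }
    return ""
}

func (p *parser) next() string {
    tok := p.peek()
    p.pos++
    return tok
}

func (p *parser) expr() *Ast {
    return p.pcExpr(1)
}

func (p *parser) pcExpr(minPrec int) *Ast {
    e := p.atom()
    for {
        pr, ok := prec[p.peek()]
        if !ok || pr < minPrec {
            return e
        }
        op := p.next()
        // left associative: the right side only takes tighter-binding operators
        e = &Ast{op, e, p.pcExpr(pr + 1)}
    }
}

func (p *parser) atom() *Ast {
    if p.match("-") { // prefix cases (nud)
        return &Ast{op: "-", left: p.atom()}
    }
    if p.match("(") {
        e := p.expr()
        p.match(")")
        return e
    }
    return &Ast{op: p.next()} // identifier or literal
}

Right associative operators would recurse with pr instead of pr + 1; assignment, ?:, and the lvalue checking mentioned above need extra handling beyond this sketch.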

References:
Top Down Operator Precedence, Vaughan Pratt
Top Down Operator Precedence, Douglas Crockford
Parsing Expressions by Recursive Descent, Theodore Norvell
From Precedence Climbing to Pratt Parsing, Theodore Norvell
Parsing expressions by precedence climbing, Eli Bendersky
Pratt Parsers: Expression Parsing Made Easy, Bob Nystrom

Thursday, September 13, 2018

A Better Hash Table

Reading "A new fast hash table in response to Google's new fast hash table" got me thinking about hash tables again. I wrote an implementation of his last hash table design back in January. It worked well, but memory usage was higher than I would like, and this new design promised to improve that.

However, this time he didn't explain the design, just shared the code. So it took me a little bit to figure out exactly how it worked. The basics are straightforward but there are a few tricky parts.

The data layout is the same as my last version - a flat array of blocks, each holding separate small arrays (e.g. 8 entries) for keys, data, and control information. The advantage of this layout is that it prevents wasting space for padding, while still maintaining locality. It can also be allocated with a single request.
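
In Go, the layout is roughly like this (the names and element types here are illustrative, not the actual code):

const blockSize = 8

type block struct {
    control [blockSize]byte   // per-slot meta data (direct/chain flag, jump index)
    keys    [blockSize]uint64 // keys packed together - no per-entry padding
    vals    [blockSize]uint64 // values packed together
}

type hmap struct {
    blocks []block // one flat allocation for the whole table
    size   int     // number of entries
}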

This hash table is chained (i.e. collisions are handled with a linked list), but the links lead to empty slots in the table itself rather than to separately allocated nodes. And the chains are not linked with a pointer or a full index, but with a small index into a set of jump sizes. So we can't link to an arbitrary slot, only to a particular set of slots relative to the origin. In practice this seems sufficient.
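
A hypothetical sketch of the idea (the distances below are made up; the real implementation has its own carefully chosen table): each slot stores a small jump index instead of a pointer, and the next chain element is found by adding the corresponding distance.

var jumpDistances = [16]int{0, 1, 2, 3, 4, 5, 6, 8, 11, 15, 21, 29, 40, 55, 76, 105}

// nextInChain returns the slot the chain continues at, or false at the end.
func nextInChain(slot int, jumpIndex byte, tableSize int) (int, bool) {
    if jumpIndex == 0 {
        return 0, false // jump index 0 marks the end of the chain
    }
    return (slot + jumpDistances[jumpIndex]) % tableSize, true
}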

Simple open addressing (e.g. linear probing) ends up with "chains" (probe sequences) containing different hash indexes. This leads to longer searches. We could do something similar, but it's better if we ensure that the chain starting at a certain index only contains entries with that same hash index. That means if you want to insert and the slot is part of another chain then you have to move the other chain from that point on. (If the chains were actual linked lists, you could just move the one element and then link to the rest of the chain. But since we only support certain specific jumps, we can't do that in the general case. And so we need to move the entire rest of the chain.)

I also used Fibonacci hashing, which adds an extra mixing step to spread out values more than a simple mod would.
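
Fibonacci hashing is just a multiply and a shift. A minimal sketch, assuming 64 bit hash values and a power-of-two table of 1<<shiftBits slots:

const fibMultiplier = 11400714819323198485 // 2^64 divided by the golden ratio

func fibIndex(hash uint64, shiftBits uint) uint64 {
    // the multiply scrambles the bits; the top bits select the slot
    return (hash * fibMultiplier) >> (64 - shiftBits)
}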

Here's an illustration of how it works. For simplicity I'll use numeric keys and take the hash index from the first (tens) digit (without the additional Fibonacci hashing).

Inserting non-colliding elements is simple.


If we insert a colliding element, we "chain" it.


The chaining might have to skip occupied slots. If we insert 11, 12, and 31 we'd get:


Note: Chaining starts out looking like linear probing but when the chain gets longer the jumps get larger. 

Note: 10 and 30 are "direct hits", whereas 11, 12, and 31 are "chain" slots. We use one bit of the meta data to record this.

When we insert 20, we find its slot is occupied by part of a chain, so we move the chain from that point on (11, 12), allowing us to insert 20 where it "belongs".


This allows lookups to take advantage of the direct/chain distinction. If we look for 60 we find a chain slot (11) and we know immediately that it's not in the table. Whereas, if we look for 15 (same hash as 10,11,12), we find a direct slot so then we need to follow the chain to determine if the value is present.

When we delete an item from anywhere but the end of the chain, we move the end of the chain into the deleted item's slot, shortening the chain by one without moving all the elements. So if we delete 10 we get:


Note: Deleting 20 would not undo the chain move that was done when we inserted 20. 

I started with a Go implementation, although I'll probably port it to C++ as well. The initial version of the code is on Github.

After some tweaking, the performance is about 10% faster than the previous version. That's not huge, but considering the previous version was very fast, it's not bad. The best part is that it uses about half as much memory - exactly the benefit that was promised.

Tuesday, March 20, 2018

Bit Twiddling

A while ago I wrote a new decimal floating point implementation when I was looking into a Go implementation of Suneido. The original cSuneido (C++) implementation is ok, but it definitely could be better. I ported the Go implementation to jSuneido (Java) but never got around to porting it to cSuneido, and never ended up putting it into production. I did port it to suneido.js (JavaScript) although that was harder because JavaScript doesn't have integers, let alone 64 bit ones. All it has are floating point numbers which you can use as integers but you only get 53 bits of precision. And the bitwise operations are only 32 bit.

Anyway, I decided to port the implementation to C++. Go and C++ aren't that different for this kind of code so it should have been a simple project. But, of course, I couldn't get into the code without thinking of improvements.

One thing I discovered was that although C++ has 64 bit integers, and CPUs are 64 bit, 32 bit programs (like cSuneido) don't have access to 64 bit instructions :-( On Visual C++ most of the operations turn into calls to runtime routines. (Comparison is inlined, but still takes multiple instructions.) One of the drawbacks of going 64 bit is that pointers double in size, which can make programs larger and slower. (Even Knuth complained about this.) Linux has an x32 mode which uses 32 bit pointers but runs with the 64 bit instruction set. But as far as I know there's nothing comparable on Windows or OS X. (If you're an old timer like me, you may remember the variety of memory models related to the transition from 16 bit to 32 bit and breaking the 640k limit.)

The first area where I could see room for improvement was "shifting". Sometimes you want the 64 bit coefficient minimized / shifted to the right, so there are no trailing (decimal) zeroes. Other times you want it maximized / shifted to the left so there is a digit in the most significant position. In the Go version I had simple iterative routines that multiplied or divided by 10 repeatedly (up to 20 times).

My first thought was that you could use the number of leading or trailing zero bits to at least estimate how much you had to multiply/divide. When dividing you can use the factor of 2 in 10. i.e. if you have no trailing zero bits, then you know it's not evenly divisible by 10. And if you have 3 trailing zero bits then you might be able to divide evenly by up to 1000. I combined that with a kind of a binary search to take larger steps initially.
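
Here's a rough sketch of that idea in Go (not the actual gSuneido code): since 10 = 2 * 5, the trailing zero bit count is an upper bound on the number of factors of 10, so the shift can start with a big step and shrink it rather than always dividing by 10 one digit at a time.

import "math/bits"

// pow10[i] is 10^i; 10^19 is the largest power of ten that fits in 64 bits
var pow10 = [20]uint64{1, 1e1, 1e2, 1e3, 1e4, 1e5, 1e6, 1e7, 1e8, 1e9,
    1e10, 1e11, 1e12, 1e13, 1e14, 1e15, 1e16, 1e17, 1e18, 1e19}

// shiftRight removes trailing decimal zeroes from a coefficient.
func shiftRight(coef uint64) uint64 {
    if coef == 0 {
        return 0
    }
    step := bits.TrailingZeros64(coef) // can't have more factors of 10 than of 2
    if step > 19 {
        step = 19
    }
    for step > 0 {
        if p := pow10[step]; coef%p == 0 {
            coef /= p // take a big step
        } else {
            step /= 2 // too big, try a smaller one
        }
    }
    return coef
}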

I had a harder time with the leading bits. 3 leading zero bits means you can multiply by 8. 4 leading zero bits means you can multiply by 16. From that I drew the incorrect conclusion that 4 leading zeroes were required to allow multiplying by 10. It took me way too long to discover my mistake. For example, with 8 bit values, 16 only has three leading zero bits, but 16 * 10 = 160, which safely fits within the maximum 8 bit value of 255.

Eventually I realized what I needed was integer log 10. That can be calculated efficiently using the number of leading zero bits (see Hacker's Delight). Which in turn can be calculated efficiently using compiler intrinsics like _BitScanReverse (Visual C++) or __builtin_clzll (GCC).
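
A minimal sketch of that in Go, using math/bits instead of the compiler intrinsics and reusing the pow10 table from the sketch above: estimate from the bit length (1233/4096 approximates log10(2)), then correct with a single table lookup.

// ilog10 returns floor(log10(x)) for x > 0.
func ilog10(x uint64) int {
    e := ((63 - bits.LeadingZeros64(x)) * 1233) >> 12
    if e < 19 && x >= pow10[e+1] {
        e++
    }
    return e
}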

I was also wrong in assuming I could improve the division implementation. I thought maybe I could use double floats to do the division, but converting back and forth isn't easy. I looked into division by reciprocals calculated with Newton-Raphson, and I looked into Goldschmidt division. But none of them seemed better than simple long division. Long division is normally slow because you only calculate one digit of the quotient per iteration. But if the divisor is less than the full 64 bit precision (the normal case), then you can get a bunch of digits per iteration. For example, the first iteration of 1/3 calculates 1e19 / 3, which gives 19 digits of precision in one iteration. Another trick I found was that even if the divisor has full precision, after a few iterations you can reduce the precision (and therefore the iterations) without losing precision in the result. (Although I have no formal proof of that.)

In comparison, multiplication should be easy - just split the numbers into "halves" and calculate partial products and add them up. But, as the saying goes, the devil is in the details. The first problem is that 19 digits doesn't divide evenly into halves. One approach is to drop one digit and split 9 - 9, but then you lose precision. You could split into three parts, but that adds a bunch of extra multiplies and adds. Unfortunately, 10 digits * 10 digits can overflow 64 bits. But 10 digits * 9 digits is safe, so I realized I could split one of the numbers 10-9 and the other 9-9. The asymmetry complicates the code a little, but gives a bit more precision. Then I realized I could split one 10-9 and the other 9-10. That does lead to one of the partial products being 10 digits * 10 digits, but you only need the most significant digits of the result so you can reduce it to 9 * 9 without losing precision.

The other trick to getting as much precision as possible is that if high * high is less than full precision, then you can use an additional digit of precision from the middle partial products. And then you can use the high digit from low * low.

Getting this all working was way more of a struggle than it should have been. I spent a lot of time staring at 19 digit numbers and trying to figure out why the results were wrong in the last few digits.

Along the way I discovered there were several bugs in the Go implementation. Which isn't surprising considering it was never used. But I obviously need better test cases.

One option I haven't explored is using "magic" numbers to replace division (e.g. by 10) with multiplication and shifting. GCC and Clang compilers do this automatically, Visual C++ does not. However they don't appear to do it in 32 bit code for 64 bit values. (Tip - you can use Compiler Explorer to see/get the magic numbers.)
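
For what it's worth, here's the 64 bit unsigned divide-by-10 case as a sketch in Go, where bits.Mul64 provides the 128 bit product (the constant is ceil(2^67 / 10), the same kind of magic number the compilers generate):

func div10(x uint64) uint64 {
    hi, _ := bits.Mul64(x, 0xCCCCCCCCCCCCCCCD)
    return hi >> 3 // x / 10
}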

When I'm debugging this kind of code, I often end up writing it with each sub-expression computed separately and assigned to a separate variable. (reminiscent of SSA) This makes it clearer in the debugger what all the intermediate values are. Unfortunately, it makes the code harder to read so I usually remove it once the code is working. Of course, then I find a new bug and wish I had it back. It would be nice if there were refactoring tools to easily switch between the two forms. Or if the debugger would automatically capture and display the intermediate results.

I also tend to add a bunch of debugging output to print all these intermediate results, which I also usually end up wishing I had back soon after removing it. It would be nice if IDEs could toggle showing or hiding this kind of debugging code.

Some tools I find useful for this kind of work:

Hacker's Delight - a great resource. Most of the material can be found on-line these days, but you have to search through a lot of junk to find it.

WolframAlpha - great for things like 2^64 - 1 or multiplying two 20 digit numbers, or converting between number bases. There are other ways to do this stuff, but I find Alpha the easiest. I've often wished I was a Mathematica whiz, but other than bit twiddling I'm not really into math these days.

Compiler Explorer - handy for getting an idea how different compilers translate your code

Big Integer Calculator - not as powerful as WolframAlpha but simpler for e.g. truncating integer division

Tuesday, January 16, 2018

Small Size Data Structure Optimization

The usual approach for an expandable "vector" data structure is to use a dynamically allocated array that doubles in size as necessary. That's how C++ std::vector, Java ArrayList, and Go slice/append work.

But that means the time and space overhead of a heap allocation. And it means indirection and poor locality for CPU caches. (The array itself is contiguous, but you have to chase a pointer to get to it.)

If you have a lot of small lists, one possible optimization is to "embed" a small array inside the list structure. That's not really feasible in Java, but easily done in C++ or Go. One example of this is the small string optimization (SSO) in C++.

But what happens when you grow beyond that small embedded array? The simplest solution is to just stop using it. But that seems wasteful, both in space and in locality.

Another option is to treat the embedded array as the start of the larger overall array, at the cost of complicating the indexing a little. That fixes the space wastage, but we still don't get much benefit from locality since the additions to a vector go at the end.

What if we treat the embedded array as following the used elements of the larger dynamically allocated array? Additions go into the embedded array until it fills up, and then we spill the contents into the larger dynamically allocated array.

If the embedded array holds e.g. 4 elements, then 3/4 of the adds only access the embedded array with good locality. Every 4th addition will block move the entire contents of the embedded array (4 elements) into the larger dynamic array (which will occasionally double in size). So the embedded array acts as a "buffer".
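
Here's a hypothetical sketch in Go of the buffered design described above (int elements for simplicity):

const embedSize = 4

type smallVec struct {
    embed [embedSize]int // most recent additions - good locality
    n     int            // number of elements currently in embed
    rest  []int          // older, spilled elements (logically precede embed)
}

func (v *smallVec) Add(x int) {
    if v.n == embedSize {
        v.rest = append(v.rest, v.embed[:]...) // spill the buffer
        v.n = 0
    }
    v.embed[v.n] = x
    v.n++
}

func (v *smallVec) Get(i int) int {
    if i < len(v.rest) {
        return v.rest[i]
    }
    return v.embed[i-len(v.rest)]
}

func (v *smallVec) Len() int {
    return len(v.rest) + v.n
}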


Merge trees and buffered B-trees have a somewhat similar pattern, in that most of the activity takes place in a small area which periodically spills into the rest of the data structure.

I did some quick benchmarking using Go. My initial result was that it was slightly slower, but did about half as many allocations (because of the embedded small array). The speed was a little disappointing, but I was comparing against a simple Go slice append so my code was admittedly more complex.

Of course, the locality part of it is harder to measure. A small test that fits entirely in cache will not benefit much from locality. Sure enough, if I made the test create many more vectors, then the new version beat a simple Go slice append by about 30%.

The allocation advantage is entirely dependent on how big the vectors are. If they are small enough to fit in the embedded array, then there are zero allocations. If the vectors are very large then the number of allocations approaches the same as the slice append. Of course, the whole reason for the embedded array is that you have a lot of little objects.

One drawback of the embedded array is that it makes the data structure larger. A vector can be as small as a pointer and two integers or pointers, which is reasonable to pass by value. Once you add an embedded array, you probably don't want to pass it by value or store it by value in other vectors or maps. If that means you have to allocate it on the heap, then you won't get as much advantage. However, if you're using it locally or embedding it in a larger data structure, then it's fine.

Sunday, January 07, 2018

A Tale of Two Hash Tables

For some reason I started thinking about hash tables again. I realize only a geek would think so, but I find them pretty cool. How can you not like a data structure whose lookup time is O(1), i.e. independent of size? The other fascinating part is that there are a huge number of different designs for hash tables, each with its own set of tradeoffs.

To beat the critics to the punch ... Did I have a pressing problem that I needed to solve? Nope. Did I benchmark the old code or the new code? Nope. Would it make more sense to just use std::unordered_map and quit reinventing the wheel? Probably. Was it a fun challenge that would give incremental improvements? I think so. I could say it's because I'm semi-retired and can afford to mess around, but I've been doing it all along. What's Suneido but the biggest example of reinventing the wheel? Right or wrong, it's what I enjoy and the way I roll, and I think it's worked out pretty well. I try not to take it too far. Originally I wrote my own garbage collectors for cSuneido. I'm not talking about yet-another-reference-counting-class; this was a full blown conservative bit mapped garbage collector. A fun challenge and probably some of my best work. But when the obviously better Boehm collector came along, I switched over and threw out my own version.

cSuneido uses a pretty standard traditional hash table design - an array of pointers to nodes with separate chaining for collisions. It has worked fine for 20 years. But it has some drawbacks - every insert requires an allocation, there's considerable space overhead (pointers, separately allocated blocks, padding), and all the indirection means it's not cache friendly.

The design of the Go hash table is quite interesting. It uses an array of buckets, each holding e.g. 8 entries. Within a bucket there are separate arrays for keys and values to eliminate padding. Although the initial buckets are allocated in one large array, overflow buckets are chained. It also caches the high byte of hashes to speed up searches.
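
Roughly, a bucket looks something like this (an illustrative sketch - the field names and types are not the actual runtime's):

const bucketEntries = 8

type bucket struct {
    tophash  [bucketEntries]uint8 // cached high byte of each entry's hash
    keys     [bucketEntries]uint64
    values   [bucketEntries]uint64
    overflow *bucket // chained when the bucket fills up
}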

I reimplemented the Go hash table when I was working on gSuneido. (I couldn't just use the built-in one because it doesn't allow arbitrary types for keys like Suneido needs.) At that point the Go one was written in C, nowadays it's written in Go (albeit unsafe).

So I thought I'd rewrite cSuneido's hash table using the Go design. It took me a day or two to code and debug and it worked well.

But I remembered an article I'd read with the provocative title of "I Wrote the Fastest Hashtable". It uses open addressing, linear probing, robin hood hashing, prime sizing, and a probe limit. One drawback was that it was using a traditional array-of-structs model, which means wasted space for padding, especially since it includes a one byte value per slot which is pretty much guaranteed to get padded. But that issue is easy to overcome with an approach similar to the Go design, by storing the table in "blocks", and within each block using separate arrays for the keys and values (and the one byte "distance"). However, unlike the Go design, the table is still treated as a single large flat array. The tradeoff is that it slows down indexing slightly.

So I took another day or two and implemented this design. I love the simplicity of the lookup code:

int i = index_for_key(k);
for (int d = 0; d <= distance(i); ++d, ++i)
    if (HE::keyeq(k, key(i)))
        return i;
return -1;

You can't get much simpler than that. (Another nice feature of this design is that bounds checks are eliminated by adding extra slots on the end, which is feasible because of the probe limit.)

This design trades off slightly more expensive inserts and deletes for faster lookups. For the usual case of more lookups than updates, that's a good tradeoff. One negative aspect is that resizing the hash table will be slightly more expensive. But that's probably not a large factor because when you're growing a hash table, the new table is sparse and therefore collisions are less frequent.

One feature that makes me a little nervous is that the table grows if an insert exceeds the probe limit, with the probe limit set to log2(n). In most hash tables, if all the keys hash to the same value you end up with a linear search. In this design the table would grow until you ran out of memory or hit a size limit (a hash table denial of service vulnerability). But in practice, the prime table sizes make it almost impossible for this to happen.

Despite the simplicity of the code, it took me a bit longer to debug than I expected. The insert and delete are a little trickier with robin hood hashing because elements get moved around. C++ templates also gave me a bit of grief, as usual.

Testing with (pseudo) random values I got the expected good results:
- average probes for lookup was 1.3
- average load at resize was .75
- average swaps per insert was 1.3 (although max was 55!)

An average load at resize of .75 means the overall average load is .56 (half full). That's fairly typical for hash tables. It's the cost you pay for the performance. On the positive side, this design has only 1 byte of overhead per slot, and no overhead for pointers or padding or separate allocation, so it's better than many hash tables.

Here's a snapshot of the code. It'll end up in the cSuneido repo at some point. Note - it's not a full STL container since that requires a lot of extra code that I don't need. It also assumes that keys and values are simple objects (like integers or pointers) that are fast to assign and copy. See the article linked above for a more general purpose implementation.

Friday, January 05, 2018

A Bittersweet Triumph

I've been fascinated with the game of Go since I was young. As a kid I played a fair bit of chess, never seriously, but I could usually beat my peers. That was balanced by regularly getting thrashed by a 90 year old in my neighborhood that I was conscripted to play with. After I discovered Go I pretty much gave up on Chess. My business partner and I used to play a lot of Go in the early days of our company. It was a good way to occupy our time on business trips. We were never very good, and almost never played with anyone else, but it was a fun mental challenge. I still play regularly on my iPad against SmartGo Kifu and Crazy Stone (which claims to use deep learning), but only small 9x9 games. (and I'm still not any good!)

One of the things that attracted me to Go was that it wasn't amenable to the same techniques that were used in software to conquer Chess. Of course, I don't believe there is any magic in human brains, so presumably there was a way for computers to handle it. Until AlphaGo, most people believed it would be a long time, perhaps a decade, before computers could beat the best human players. So it was a bit of a shock to everyone when AlphaGo recently beat the world's best human players.

I highly recommend the full length (90 min) documentary about AlphaGo - a Go playing computer program developed by DeepMind, a company owned by Google. (Don't be scared off by "AI". The documentary isn't technical, it's more about the people.) It's available on Netflix, iTunes, Google Play, and Amazon.



For more of the technical details, see the Nature articles. They're behind a paywall on the Nature web site, but they seem to be available on Scribd - "Mastering the game of Go with deep neural networks and tree search" and "Mastering the game of Go without human knowledge".

I've never been involved in AI, but it's always fascinated me. It was one of the original reasons I got into computers. In the early days there was a lot of hype about neural networks. But they didn't live up to their hype and there was a backlash that meant they were downplayed for many years. Recently they've seen a huge comeback and are now living up to much of those original hopes. Partly that's because of advances in hardware like GPUs.

The first AlphaGo was trained partly on human games. Its successor, AlphaGo Zero, learned completely on its own, by playing itself. It surpassed the earlier version in a matter of days. Not long after, the AlphaGo project was ended and the team was disbanded. I guess it's a done deal.