Saturday, April 19, 2014

CharMatcher in Go

The Guava library for Java has a CharMatcher that provides a way of composing character matching predicates plus functions that use those predicates.

For example, AnyOf("\r\n").Negate() creates a CharMatcher that matches any character except return or newline. You can then do things like cm.IndexIn(str) or cm.CountIn(str)

Some of this you can do directly with Go libraries. The unicode package provides some standard "matchers" like IsDigit and IsLetter. And the strings package has functions like IndexFunc and TrimFunc that take predicates.

But they don't do everything that CharMatcher does, so as an exercise I thought I'd try implementing something like CharMatcher in Go.

My first approach was basically an object-oriented style like I'd use in Java with CharMatch as an interface.

type CharMatch interface {
Match(c rune) bool
}
type is_cm struct {
c rune
}
func (cm is_cm) Match(c rune) bool {
return c == cm.c
}
func Is(c rune) CharMatch {
return is_cm{c}
}
view raw charmatch0.go hosted with ❤ by GitHub
But when I started adding more matchers it seemed excessive to have to define three pieces for each - a struct, a match method for the struct, and a function to construct the struct.

My next thought was to get rid of the interface and have a generic struct containing a matching function as a member. This uses closures to store the matcher parameters rather than structs.

type CharMatch struct {
fn func(rune) bool
}
func (cm CharMatch) Match(c rune) bool {
return cm.fn(c)
}
func Is(c rune) CharMatch {
return CharMatch{ func (c2 rune) bool { return c == c2 }}
}
view raw charmatch1.go hosted with ❤ by GitHub
I was stuck on the idea of a struct so that I could define methods like Negate and IndexIn on it. Then I realized that in Go I could make CharMatch just a function, and still define methods on it. That led to this version:

type CharMatch func(rune) bool
// Predefined CharMatch's
var (
SPACE CharMatch = AnyOf(" \t\r\n")
DIGIT CharMatch = InRange('0', '9')
LETTER CharMatch = unicode.IsLetter
LOWER CharMatch = unicode.IsLower
UPPER CharMatch = unicode.IsUpper
)
// Match returns true if the character matches, otherwise false
func (cm CharMatch) Match(c rune) bool {
return cm(c)
}
// Is returns a CharMatch that matches a specific character
func Is(c1 rune) CharMatch {
return func(c2 rune) bool { return c1 == c2 }
}
// AnyOf returns a CharMatch that matches any character in a string
func AnyOf(s string) CharMatch {
return func(c rune) bool { return strings.ContainsRune(s, c) }
}
// InRange returns a CharMatch that matches any character in the range (inclusive)
func InRange(from rune, to rune) CharMatch {
return func(c rune) bool { return from <= c && c <= to }
}
// Negate returns a CharMatch that matches any character not matched by this one
func (cm CharMatch) Negate() CharMatch {
return func(c rune) bool { return !cm.Match(c) }
}
// Or returns a CharMatch that matches any character that matches either CharMatch
func (cm1 CharMatch) Or(cm2 CharMatch) CharMatch {
return func(c rune) bool { return cm1.Match(c) || cm2.Match(c) }
}
// CountIn returns the number of characters in the string that match
func (cm CharMatch) CountIn(s string) int {
n := 0
for _, c := range s {
if cm.Match(c) {
n++
}
}
return n
}
// IndexIn returns the index of the first character that matches
// or -1 if no match is found
func (cm CharMatch) IndexIn(s string) int {
return strings.IndexFunc(s, cm)
}
view raw charmatch2.go hosted with ❤ by GitHub
I used InRange for DIGIT and AnyOf for SPACE as examples, these could also use the unicode package equivalents.

IndexIn is an example of a method that just wraps a strings package function, whereas CountIn has no strings equivalent.

The tests give some examples of how it's used.

One potential drawback of this approach is that the matcher parameters are "buried" in closures. This makes it impossible to do any processing or optimization (like the Guava CharMatcher precomputed method). For example, Is('a').Or(Is('b')) could be folded into AnyOf('ab'). If you wanted to do this, I think you'd have to go back to using structs (like my first approach).

Friday, April 18, 2014

More Concatenation

I did some quick benchmarks in Go. (I really like how that ability is part of the standard tools.) Here are some results. As with any benchmark, don't take them as exact. Changing the parameters of the benchmarks gives varying numbers but with the same overall result.

Buffer37322 ns/op51104 B/op10 allocs/op
Array49456 ns/op53632 B/op17 allocs/op
Linked122558 ns/op54047 B/op1010 allocs/op
Merge311005 ns/op323552 B/op1998 allocs/op
Naive2225408 ns/op5371680 B/op999 allocs/op


An "op" in this case was appending 10 characters, 1000 times.

Some observations:
  • Naive concatenation is indeed bad, both in speed and memory
  • Almost anything is much better than the naive approach
  • As expected, a buffer is the best in both speed and memory
  • An array of substrings (without any merging) does surprisingly well
  • For an immutable option, a linked list isn't too bad
  • A merge tree was not such a good idea

Thursday, April 17, 2014

Overlooking the Simple Solution

For no particular reason I've been thinking about concatenating strings. (I know, get a life, but this is at least part of my life.) It was partly prompted by thinking about Go and how it might work to implement different facets of Suneido. 

Most languages warn you about the poor performance of building a large string by repeated concatenation. They recommend using a StringBuilder (Java) or its equivalent. 

But Suneido explicitly optimizes repeated concatenation internally so the programmer doesn't have to worry about when to concatenate and when to switch to "building". 

In cSuneido (the original C++ implementation) concatenation just creates a linked list of substrings, deferring the actual allocation and copying. 

Originally, I ported that approach to jSuneido (the Java version). But I cut a few corners that I thought were safe to cut. That came back to haunt me. Rather than fix the problems I looked for a better solution. (There are some issues when the linked list gets too big.) I considered some kind of merge tree but decided that was more complex than necessary.

Instead of a linked list I used an array of pieces which was expanded as needed. If the array got big it would merge small pieces. That has been working fine. 

Analyzing the problem from a more theoretical basis I figured the worst approach is repeated naive concatenation and the best (?) is something like StringBuilder that uses a buffer that expands e.g. by doubling in size. My array approach is somewhere in between, as is a merge tree. 

At that point it struck me that I could just use a StringBuilder rather than my array approach. Duh!

That eliminated about a hundred lines of code and ran about 20% faster. 

I feel stupid for not thinking of this sooner. It seems so obvious (now!) But I was stuck on the idea of deferring concatenating. 

And now I feel even more stupid because I just searched my own blog and found that I did consider a buffer approach but decided it required too much copying. (String Building Internals) Using a StringBuilder does mean copying into it and then eventually copying the result back out. But considering that Java compiles string concatenation into using StringBuilder, the overhead can't be too big. (Go, on the other hand, compiles string concatenation into calls to a runtime function that allocates a new string and copies directly into it, without any intermediate buffer.)
One advantage of my original linked list approach is that everything is immutable and therefore threadsafe without locking. That's attractive. Both the array and the StringBuilder are mutable and require locking. That's not a big deal on Java since the locks will almost always be uncontested and therefore very fast. And the locking is at the leaves of the call tree so they should be safe from issues like deadlock. 

But in Go, locking is less acceptable and an immutable solution would be nice. I have an idea for an immutable merge tree approach - stay tuned :-)

Saturday, April 12, 2014

Java 8, Eclipse Kepler, and Infinitest

When I updated my Mac to Java 8, Infinitest quit working. I've been hoping it would update and start working but it didn't happen.

I went looking and I found that the Eclipse Marketplace version of Infinitest is several years old. Had the project been abandoned?

I found an Infinitest web site which linked to the Github page. The readme there gave an update site of http://update.improvingworks.com/ and when I installed/updated from that site Infinitest started working again.

The web page gives an update site of http://infinitest.github.io which appears to point to the same version (5.1.110). I'm not sure which is the "correct" choice.

Now I'm just waiting for Proguard to be updated for Java 8.

Wednesday, April 09, 2014

Lexing in Go

One of my standard exercises when I'm looking at a new language is to implement the lexical scanner for Suneido. I've done this in quite a variety of languages - C++, Java, C#, D, and now Go. The scanner is simple, and generally the implementations are similar.

The Go code is longer (in lines) than most of the other implementations for a couple of reasons. One is that Go doesn't have ?: and you have to use if-else. Another is that gofmt puts enum type constants one per line.

Go only supports simple numeric constants to implement enums. That works ok, but it's awkward for debugging because if you print them, you just get a number.

One interesting thing about the Go implementation is that it handles unicode without really worrying about it too much.

I debated over whether to return the results as a struct or just as multiple return values. But that really depends on which is easier for the calling code.

There is a good talk by Rob Pike about Lexical Scanning in Go. If you're not interested enough to watch the video, you can skim the slides. I didn't need to use his fancier concurrent state machine design but it's an interesting example of using Go. You can see a full implementation in the Go template package.

Here's the code: (or view it on GitHub)

// Package lex implements the lexical scanner for Suneido
package lex
import (
"bytes"
"strings"
"unicode"
"unicode/utf8"
)
type lexer struct {
src string
si int
start int
width int
value string
}
type Item struct {
token Token
keyword Token
value string
}
func Lexer(src string) *lexer {
return &lexer{src: src}
}
func (lxr *lexer) Next() Item {
token := lxr.next()
var value string
if lxr.value != "" {
value = lxr.value
} else {
value = lxr.src[lxr.start:lxr.si]
}
keyword := NIL
if token == IDENTIFIER && lxr.peek() != ':' {
keyword = Keyword(value)
}
return Item{token, keyword, value}
}
func (lxr *lexer) next() Token {
lxr.start = lxr.si
c := lxr.read()
switch c {
case eof:
return EOF
case '#':
return HASH
case '(':
return L_PAREN
case ')':
return R_PAREN
case ',':
return COMMA
case ';':
return SEMICOLON
case '?':
return Q_MARK
case '@':
return AT
case '[':
return L_BRACKET
case ']':
return R_BRACKET
case '{':
return L_CURLY
case '}':
return R_CURLY
case '~':
return BITNOT
case ':':
if lxr.match(':') {
return RANGELEN
} else {
return COLON
}
case '=':
if lxr.match('=') {
return IS
} else {
if lxr.match('~') {
return MATCH
} else {
return EQ
}
}
case '!':
if lxr.match('=') {
return ISNT
} else {
if lxr.match('~') {
return MATCHNOT
} else {
return NOT
}
}
case '<':
if lxr.match('<') {
if lxr.match('=') {
return LSHIFTEQ
} else {
return LSHIFT
}
} else if lxr.match('>') {
return ISNT
} else if lxr.match('=') {
return LTE
} else {
return LT
}
case '>':
if lxr.match('>') {
if lxr.match('=') {
return RSHIFTEQ
} else {
return RSHIFT
}
} else if lxr.match('=') {
return GTE
} else {
return GT
}
case '|':
if lxr.match('|') {
return OR
} else if lxr.match('=') {
return BITOREQ
} else {
return BITOR
}
case '&':
if lxr.match('&') {
return AND
} else if lxr.match('=') {
return BITANDEQ
} else {
return BITAND
}
case '^':
if lxr.match('=') {
return BITXOREQ
} else {
return BITXOR
}
case '-':
if lxr.match('-') {
return DEC
} else if lxr.match('=') {
return SUBEQ
} else {
return SUB
}
case '+':
if lxr.match('+') {
return INC
} else if lxr.match('=') {
return ADDEQ
} else {
return ADD
}
case '/':
if lxr.match('/') {
return lxr.lineComment()
} else if lxr.match('*') {
return lxr.spanComment()
} else if lxr.match('=') {
return DIVEQ
} else {
return DIV
}
case '*':
if lxr.match('=') {
return MULEQ
} else {
return MUL
}
case '%':
if lxr.match('=') {
return MODEQ
} else {
return MOD
}
case '$':
if lxr.match('=') {
return CATEQ
} else {
return CAT
}
case '`':
return lxr.rawString()
case '"':
case '\'':
return lxr.quotedString(c)
case '.':
if lxr.match('.') {
return RANGETO
} else if unicode.IsDigit(lxr.peek()) {
return lxr.number()
} else {
return DOT
}
case '0', '1', '2', '3', '4', '5', '6', '7', '8', '9':
return lxr.number()
default:
if isSpace(c) {
return lxr.whitespace(c)
} else if unicode.IsLetter(c) || c == '_' {
return lxr.identifier()
}
}
return ERROR
}
func (lxr *lexer) whitespace(c rune) Token {
result := WHITESPACE
for ; isSpace(c); c = lxr.read() {
if c == '\n' || c == '\r' {
result = NEWLINE
}
}
lxr.backup()
return result
}
func (lxr *lexer) lineComment() Token {
for c := lxr.read(); c != eof && c != '\n'; c = lxr.read() {
}
return COMMENT
}
func (lxr *lexer) spanComment() Token {
lxr.matchUntil(func() bool { return strings.HasSuffix(lxr.src[:lxr.si], "*/") })
return COMMENT
}
func (lxr *lexer) rawString() Token {
for c := lxr.read(); c != eof && c != '`'; c = lxr.read() {
}
lxr.value = lxr.src[lxr.start+1 : lxr.si-1]
return STRING
}
func (lxr *lexer) quotedString(quote rune) Token {
var buf bytes.Buffer
lxr.match(quote)
for c := lxr.read(); c != eof && c != quote; c = lxr.read() {
buf.WriteRune(lxr.doesc(c))
}
lxr.value = buf.String()
return STRING
}
func (lxr *lexer) doesc(c rune) rune {
if c != '\\' {
return c
}
save := lxr.si
c = lxr.read()
switch c {
case 'n':
return '\n'
case 't':
return '\t'
case 'r':
return '\r'
case 'x':
dig1 := digit(lxr.read(), 16)
dig2 := digit(lxr.read(), 16)
if dig1 != -1 && dig2 != -1 {
return rune(16*dig1 + dig2)
}
case '\\':
case '"':
case '\'':
return c
default:
dig1 := digit(lxr.read(), 8)
dig2 := digit(lxr.read(), 8)
dig3 := digit(lxr.read(), 8)
if dig1 != -1 && dig2 != -1 && dig3 != -1 {
return rune(64*dig1 + 8*dig2 + dig3)
}
}
lxr.si = save
return '\\'
}
func digit(c rune, radix int) int {
n := 99
if isDigit(c) {
n = int(c - '0')
} else if isHexDigit(c) {
n = int(10 + unicode.ToLower(c) - 'a')
}
if n < radix {
return n
} else {
return -1
}
}
func isDigit(r rune) bool {
return '0' <= r && r <= '9'
}
func isHexDigit(r rune) bool {
return strings.ContainsRune(hexDigits, r)
}
func (lxr *lexer) number() Token {
lxr.matchOneOf("+-")
// Is it hex?
digits := "0123456789"
if lxr.match('0') && lxr.matchOneOf("xX") {
digits = hexDigits
}
lxr.matchRunOf(digits)
if lxr.match('.') {
lxr.matchRunOf(digits)
}
if lxr.matchOneOf("eE") {
lxr.matchOneOf("+-")
lxr.matchRunOf("0123456789")
}
return NUMBER
}
func (lxr *lexer) identifier() Token {
lxr.matchWhile(isIdentChar)
if !lxr.match('?') {
lxr.match('!')
}
return IDENTIFIER
}
const eof = -1
func (lxr *lexer) read() rune {
if lxr.si >= len(lxr.src) {
lxr.width = 0
return eof
}
c, w := utf8.DecodeRuneInString(lxr.src[lxr.si:])
lxr.si += w
lxr.width = w
return c
}
func (lxr *lexer) backup() {
lxr.si -= lxr.width
}
func (lxr *lexer) peek() rune {
c := lxr.read()
lxr.backup()
return c
}
func (lxr *lexer) match(c rune) bool {
if c == lxr.read() {
return true
}
lxr.backup()
return false
}
func (lxr *lexer) matchOneOf(valid string) bool {
if strings.ContainsRune(valid, lxr.read()) {
return true
}
lxr.backup()
return false
}
func (lxr *lexer) matchRunOf(valid string) {
for strings.ContainsRune(valid, lxr.read()) {
}
lxr.backup()
}
func (lxr *lexer) matchWhile(f func(c rune) bool) {
for c := lxr.read(); f(c); c = lxr.read() {
}
lxr.backup()
}
func (lxr *lexer) matchUntil(f func() bool) {
for c := lxr.read(); c != eof && !f(); c = lxr.read() {
}
}
func isIdentChar(r rune) bool {
return r == '_' || unicode.IsLetter(r) || unicode.IsDigit(r)
}
const hexDigits = "0123456789abcdefABCDEF"
func isSpace(c rune) bool {
return c == ' ' || c == '\t' || c == '\r' || c == '\n'
}
view raw lex.go hosted with ❤ by GitHub
package lex
func Keyword(s string) Token {
return keywords[s]
}
func (t Token) IsInfix() bool {
return infix[t]
}
func (t Token) String() string {
return tostring[t]
}
var tostring = map[Token]string{
NIL: "NIL",
EOF: "EOF",
WHITESPACE: "WHITESPACE",
COMMENT: "COMMENT",
NEWLINE: "NEWLINE",
}
type Token int
const (
NIL Token = iota
EOF
ERROR
IDENTIFIER
NUMBER
STRING
WHITESPACE
COMMENT
NEWLINE
// operators and punctuation
HASH
COMMA
COLON
SEMICOLON
Q_MARK
AT
DOT
L_PAREN
R_PAREN
L_BRACKET
R_BRACKET
L_CURLY
R_CURLY
IS
ISNT
MATCH
MATCHNOT
LT
LTE
GT
GTE
NOT
INC
DEC
BITNOT
ADD
SUB
CAT
MUL
DIV
MOD
LSHIFT
RSHIFT
BITOR
BITAND
BITXOR
EQ
ADDEQ
SUBEQ
CATEQ
MULEQ
DIVEQ
MODEQ
LSHIFTEQ
RSHIFTEQ
BITOREQ
BITANDEQ
BITXOREQ
RANGETO
RANGELEN
// language keywords
AND
BOOL
BREAK
BUFFER
CALLBACK
CASE
CATCH
CHAR
CLASS
CONTINUE
CREATE
DEFAULT
DLL
DO
DOUBLE
ELSE
FALSE
FLOAT
FOR
FOREVER
FUNCTION
GDIOBJ
HANDLE
IF
IN
INT64
LONG
NEW
OR
RESOURCE
RETURN
SHORT
STRUCT
SWITCH
SUPER
THIS
THROW
TRUE
TRY
VOID
WHILE
// query keywords
ALTER
AVERAGE
CASCADE
COUNT
DELETE
DROP
ENSURE
EXTEND
HISTORY
INDEX
INSERT
INTERSECT
INTO
JOIN
KEY
LEFTJOIN
LIST
MAX
MIN
MINUS
PROJECT
REMOVE
RENAME
REVERSE
SET
SORT
SUMMARIZE
SVIEW
TIMES
TO
TOTAL
UNION
UNIQUE
UPDATE
UPDATES
VIEW
WHERE
// for AST
ARG
ASSIGNOP
BINARYOP
BLOCK
CALL
DATE
FOR_IN
MEMBER
METHOD
OBJECT
POSTINCDEC
PREINCDEC
RECORD
RVALUE
SELFREF
SUBSCRIPT
SYMBOL
)
var keywords = map[string]Token{
"and": AND,
"bool": BOOL,
"break": BREAK,
"buffer": BUFFER,
"callback": CALLBACK,
"case": CASE,
"catch": CATCH,
"char": CHAR,
"class": CLASS,
"continue": CONTINUE,
"default": DEFAULT,
"dll": DLL,
"do": DO,
"double": DOUBLE,
"else": ELSE,
"false": FALSE,
"float": FLOAT,
"for": FOR,
"forever": FOREVER,
"function": FUNCTION,
"gdiobj": GDIOBJ,
"handle": HANDLE,
"if": IF,
"in": IN,
"int64": INT64,
"is": IS,
"isnt": ISNT,
"long": LONG,
"new": NEW,
"not": NOT,
"or": OR,
"resource": RESOURCE,
"return": RETURN,
"short": SHORT,
"string": STRING,
"struct": STRUCT,
"super": SUPER,
"switch": SWITCH,
"this": THIS,
"throw": THROW,
"true": TRUE,
"try": TRY,
"void": VOID,
"while": WHILE,
"xor": ISNT,
}
var infix = map[Token]bool{
AND: true,
OR: true,
Q_MARK: true,
MATCH: true,
MATCHNOT: true,
ADD: true,
SUB: true,
CAT: true,
MUL: true,
DIV: true,
MOD: true,
LSHIFT: true,
RSHIFT: true,
BITOR: true,
BITAND: true,
BITXOR: true,
}
view raw tokens.go hosted with ❤ by GitHub

Saturday, April 05, 2014

Hamcrest Style Matchers in Go

I've grown quite accustomed to using Hamcrest style matchers in Java. So I looked for something similar in Go. I found github.com/rdrdr/hamcrest, but the last activity was three years ago and when I tried to use it I got errors. (Go has changed) I also found Gomega but for some reason it didn't attract me.

I started to write my own, then stopped myself from getting side tracked from what I was doing. But I kept thinking about it, and ended up writing something very simple.

What I came up with allows you to write assertions like:

Assert(t).That(..., Equals(expected))

and to add more information to the error messages with:

Assert(t).That(..., Equals(expected).Comment("..."))

Where t is the *testing.T that Go's testing framework supplies.

Equals returns a tester function (a closure capturing the expected value). A tester returns "" on success, or else an error message.

That is a method that takes a value of any type (i.e. interface{}) and a tester function, and calls the tester with the value. If the tester returns an error message it calls t.Error

Comment is a method on a tester (taking advantage of Go's ability to define methods on any type, not just on classes). It returns a new tester (a closure capturing the message) that passes success ("") through unchanged, but appends it's message to any error messages.

Taking advantage of Go interfaces, I didn't make the code depend on Go's testing.T type. Instead I defined my own interface with a single Error method (matching the one in testing.T) and made Assert wrap that. So it will work with anything that has a suitable Error method. I didn't have any particular usage in mind for that, but it's a nice way to avoid dependencies.

Initially I wrote Equals using "==". That worked for simple values but not for things like slices. I ended up using reflect.DeepEqual which seems to work. I'm not sure if this is the best approach. Obviously it won't work for things like less than or greater than.

One of the problems I had was that errors would be reported as always occurring on the same line of my hamcrest.go file where I called Error rather than the relevant line in my test. This is a more general problem whenever tests use any kind of helper function that ends up calling Error. Maybe that's not the normal style, but I tend to do it a lot. I found the code where it does this in the decorate method in testing.go but there doesn't appear to be any way to override it. It would be easy enough to modify testing.go, but I'm not sure how I'd get "go test" to use it, short of building a custom version of Go which doesn't seem like a good solution. Ideally Go test would report the portion of the call stack within my code.

I ended up just adding my own reporting. Rather than hard coding how far back in the call stack to go (as testing.go does), I looked for the first call after the testing framework, i.e. the actual top level test function. So errors have the correct location on the end:

hamcrest.go:38: expected 5790 but got 579 {dbldisp_test.go:16}

Obviously, this is not a complete implementation of Hamcrest style matchers. It was a good exercise to explore some of Go's features like interfaces, function value, and closures. I've been using it to write tests but I'm not sure if I'll do more on it and use it longer term or find something else to use.

UPDATE: Something I forgot to mention is that when I'm using this I'm doing:

import . "hamcrest"

The dot allows you to use the exported names from hamcrest (e.g. Assert and Equals) without requiring the package name prefix. This is discouraged, but in this case it seems preferable to writing:

hamcrest.Assert(t).That(..., hamcrest.Equals(expected))

Here's the code on GitHub:

/*
hamcrest implements very basic hamcrest style asserts
for example:
func TestStuff(t *testing.T) {
Assert(t).That(2 * 4, Equals(6))
}
*/
package hamcrest
import "fmt"
import "reflect"
import "runtime"
import "strings"
type testing interface {
Error(err ...interface{})
}
type Asserter struct {
t testing
}
func Assert(t testing) Asserter {
return Asserter{t}
}
type tester func(interface{}) string
func (a Asserter) That(actual interface{}, test tester) {
err := test(actual)
if err != "" {
a.Fail(err)
}
}
func (a Asserter) Fail(err string) {
file, line := getLocation()
a.t.Error(err + fmt.Sprintf(" {%s:%d}", file, line))
}
func getLocation() (file string, line int) {
i := 1
for ; i < 9; i++ {
_, file, _, ok := runtime.Caller(i)
if !ok || strings.Contains(file, "testing/testing.go") {
break
}
}
_, file, line, ok := runtime.Caller(i - 1)
if ok && i > 1 && i < 9 {
// Truncate file name at last file name separator.
if index := strings.LastIndex(file, "/"); index >= 0 {
file = file[index+1:]
} else if index = strings.LastIndex(file, "\\"); index >= 0 {
file = file[index+1:]
}
} else {
file = "???"
line = 1
}
return file, line
}
// Equals checks that the actual value is equal to the expected value
// using reflect.DeepEquals
func Equals(expected interface{}) tester {
return func(actual interface{}) string {
if reflect.DeepEqual(expected, actual) {
return ""
}
return fmt.Sprintf("expected %v but got %v", expected, actual)
}
}
// Comment decorates a tester to add extra text to error messages
func (test tester) Comment(comment string) tester {
return func(actual interface{}) string {
err := test(actual)
if err == "" {
return ""
} else {
return err + " (" + comment + ")"
}
}
}
view raw hamcrest.go hosted with ❤ by GitHub

Thursday, April 03, 2014

Decimal Floating Point Arithmetic

If you're doing things like financial calculations, you have to be careful about using conventional binary floating point because it can't represent decimal fractions exactly.

One approach is to use "scaled" numbers, e.g. represent your dollar amount in cents or hundredths of cents so you are always working in integers. And it requires big integers, 32 bits is only about 9 decimal digits and the 52 bits of double floats is about 15. You really need 64 bit integers which are about 19 digits. (10 bits ~ 3 decimal digits) But that still doesn't give you the ability to deal with general purpose floating point.

So Suneido has always had a decimal floating point numeric type. (Internally, for performance, it also uses plain integers when possible.) Another advantage of a decimal type is that it is simple and quick to convert to and from string form.

Back when I first wrote Suneido (~ 15 years ago) there were no 64 bit integers in C++ compilers ("long" was 32 bits) and 32 bits wasn't sufficient precision. So I had to use multiple values to hold the coefficient. Since I had to use multiple integers anyway, to simplify overflow (by using 32 bit ints for intermediate results) I used four 16 bit ints, each one holding four decimal digits for an overall precision of 16 decimal digits. (To simplify "shifting" the exponent is in terms of the 16 bit ints, i.e. it jumps 4 decimals at a time. This "granularity" causes problems with precision. Depending on the exponent, in the worst case you get as few as 10 decimal digits of precision.)

Of course, having to use multiple integers and trying to get decent performance complicated the code, especially division. I won't claim it's the greatest code, but nevertheless it's worked reasonably well for a long time.

When I implemented jSuneido, I used Java's BigDecimal. Because of the different implementation there were a few minor differences, but they were mostly edge cases that didn't matter in practical usage. (Unfortunately I had made the external dump format for numbers mirror cSuneido's internal representation so it's a little awkward converting to and from BigDecimals.)

Recently, we've started to run into issues with using API's that deal with 64 bit integers, because we don't have enough precision to store them. In jSuneido it would be easy to bump up the BigDecimal precision to 20 digits. In theory I could do the same with cSuneido, but unfortunately, the code is fairly specific to the current precision. e.g. loops are unrolled. The thought of making this change is not pleasant :-(

The other problem is that some of the code assumes that you can convert to and from 64 bit integers losslessly. But 20 decimal digits won't always fit in a 64 bit integer.

Now that we have 64 bit integer types, the obvious answer seems to be to use a 64 bit integer for the coefficient. This will be faster and simpler than using multiple small integers, and probably faster than BigDecimal since it handles arbitrary precision. And if I used the same approach in both cSuneido and jSuneido this would ensure consistent results.

Since I'm in the middle of playing with Go, I figured I'd try writing a Go version first. It should be relatively easy to port to C++ and Java if I decide to.

It took me a couple of days to write it. One of the challenges is detecting overflow when calculating with 64 bit integers, since you don't have a larger type to use for intermediate calculations. Hacker's Delight provided a few tips for this. Another useful reference was General Decimal Arithmetic.

It's about 500 lines for add, subtract, multiply, divide, and conversion to and from strings. (That's about half the size of the cSuneido C++ code.) Since I'm new to Go, it may not be the most idiomatic code. And I have only done basic testing and refactoring. "float10" isn't the greatest name. Maybe "decimal" or even "dec"? (in keeping with Go's predilection for short names) I'm open to suggestions...

I chose to pass and return by value rather than by pointer. I'm not sure if this is the best choice for a 10 byte struct. Would it be faster to pass by pointer? Returning by pointer forces heap allocation for intermediate results which isn't ideal. Pass and return by value is a good fit for immutable values which are my preference.

Go makes it really easy to benchmark so I checked the speed of division (the slowest operation). Micro-benchmarks are always dubious, but it gave me a rough idea. It showed about 400 ns per divide (on my iMac). I don't have comparable benchmarks for cSuneido or jSuneido, but that seems pretty good. I'm pretty sure it's better than cSuneido. (Of course, it's nowhere near as fast as native binary floating point done in hardware. The same benchmark with float64 gives about 7 ns per divide, although this is so small that it's even less likely to be accurate.)

As far as evaluating Go, so far I like it. Of course, it's well suited to low level code like this. Sublime Text + GoSublime works well. (The only issues have been with learning Sublime since I haven't used it much.) I might have broken out the debugger a couple of times if I'd been working in Java or C++, but I get the impression the debugger story for Go isn't that great. I managed easily enough with old school prints :-) I plan to give Eclipse + GoClipse a try at some point since I'm already familiar with Eclipse.

The code is embedded below but it's probably easier to read (or download) on GitHub.

// Package float10 implements decimal floating point numbers
// using uint64 to hold the coefficient.
package float10
import "strconv"
import "errors"
import "strings"
import "math"
// value is -1^sign * coef * 10^exp
// zeroed value = 0
type Float10 struct {
coef uint64
sign int8
exp int8
}
const (
POSITIVE = 0
NEGATIVE = 1
INF_EXP = math.MaxInt8
)
var (
Zero = Float10{}
Inf = Float10{exp: INF_EXP}
MinusInf = Float10{sign: NEGATIVE, exp: INF_EXP}
)
// Parse convert a string to a Float10
func Parse(s string) (Float10, error) {
if len(s) < 1 {
return Zero, errors.New("cannot convert empty string to Float10")
}
if s == "0" {
return Zero, nil
}
var f Float10
i := 0
if s[i] == '+' {
i++
} else if s[i] == '-' {
f.sign = NEGATIVE
i++
}
before := spanDigits(s[i:])
i += len(before)
after := ""
if i < len(s) && s[i] == '.' {
i++
after = spanDigits(s[i:])
i += len(after)
}
after = strings.TrimRight(after, "0")
coef, err := strconv.ParseUint(before+after, 10, 64)
if err != nil {
return Zero, errors.New("invalid number (" + s + ")")
}
f.coef = coef
exp := 0
if i < len(s) && (s[i] == 'e' || s[i] == 'E') {
i++
e, err := strconv.ParseInt(s[i:], 10, 8)
if err != nil {
return Zero, errors.New("invalid exponent (" + s + ")")
}
exp = int(e)
}
if coef == 0 {
return Zero, nil
}
exp -= len(after)
if exp < -127 || exp >= 127 {
return Zero, errors.New("exponent out of range (" + s + ")")
}
f.exp = int8(exp)
return f, nil
}
// spanDigits returns the number of leading digits
func spanDigits(s string) string {
i := 0
for i < len(s) && '0' <= s[i] && s[i] <= '9' {
i++
}
return s[:i]
}
// String converts a Float10 to a string.
// It will avoid scientific notation
// adding up to 4 zeroes at the end or 3 zeroes at the beginning.
// If the exponent is 0 it will print the number as an integer.
func (f Float10) String() string {
if f == Zero {
return "0"
}
sign := ""
if f.sign == NEGATIVE {
sign = "-"
}
if f.exp == INF_EXP {
return sign + "inf"
}
sexp := ""
exp := int(f.exp)
digits := strconv.FormatUint(f.coef, 10)
if 0 <= exp && exp <= 4 {
// decimal to the right
digits += strings.Repeat("0", exp)
} else if -len(digits)-4 < exp && exp <= -len(digits) {
// decimal to the left
digits = "." + strings.Repeat("0", -exp-len(digits)) + digits
digits = strings.TrimRight(digits, "0.")
} else if -len(digits) < exp && exp <= -1 {
// decimal within
i := len(digits) + exp
digits = digits[:i] + "." + digits[i:]
digits = strings.TrimRight(digits, "0.")
} else {
// use exponent
sexp = "e" + strconv.FormatInt(int64(exp), 10)
}
return sign + digits + sexp
}
// operations ------------------------------------------------------------------
func (f Float10) Neg() Float10 {
if f == Zero {
return Zero
} else {
return Float10{f.coef, f.sign ^ 1, f.exp}
}
}
func Cmp(x Float10, y Float10) int {
switch {
case x == y:
return 0
case x == MinusInf, y == Inf:
return -1
case x == Inf, y == MinusInf:
return 1
}
switch d := Sub(x, y); {
case d == Zero:
return 0
case d.sign == NEGATIVE:
return -1
default:
return +1
}
}
func Add(x Float10, y Float10) Float10 {
switch {
case x == Zero:
return y
case y == Zero:
return x
case x == Inf:
if y == MinusInf {
return Zero
} else {
return Inf
}
case x == MinusInf:
if y == Inf {
return Zero
} else {
return MinusInf
}
case y == Inf:
return Inf
case y == MinusInf:
return MinusInf
case x.sign != y.sign:
return usub(x, y)
default:
return uadd(x, y)
}
}
func Sub(x Float10, y Float10) Float10 {
switch {
case x == Zero:
return y.Neg()
case y == Zero:
return x
case x == Inf:
if y == Inf {
return Zero
} else {
return Inf
}
case x == MinusInf:
if y == MinusInf {
return Zero
} else {
return MinusInf
}
case y == Inf:
return MinusInf
case y == MinusInf:
return Inf
case x.sign != y.sign:
return uadd(x, y)
default:
return usub(x, y)
}
}
func uadd(x Float10, y Float10) Float10 {
sign := x.sign
align(&x, &y)
if x.coef == 0 {
return Float10{y.coef, sign, y.exp}
}
coef := x.coef + y.coef
if coef < x.coef || coef < y.coef { // overflow
x.shiftRight()
y.shiftRight()
coef = x.coef + y.coef
}
return result(coef, sign, int(x.exp))
}
func align(x *Float10, y *Float10) (flipped int8) {
if x.exp == y.exp {
return
}
if x.exp > y.exp {
*x, *y = *y, *x // swap
flipped = 1
}
for y.exp > x.exp && y.shiftLeft() {
}
for y.exp > x.exp && x.shiftRight() {
}
return
}
// returns true if it was able to shift (losslessly)
func (f *Float10) shiftLeft() bool {
if !mul10safe(f.coef) {
return false
}
f.coef *= 10
// don't decrement past min
if f.exp > math.MinInt8 {
f.exp--
}
return true
}
const HI4 = 0xf << 60
func mul10safe(n uint64) bool {
return (n & HI4) == 0
}
// NOTE: may lose precision and round
// returns false only if coef is 0
func (f *Float10) shiftRight() bool {
if f.coef == 0 {
return false
}
roundup := (f.coef % 10) >= 5
f.coef /= 10
if roundup {
f.coef++ // can't overflow
}
// don't increment past max
if f.exp < math.MaxInt8 {
f.exp++
}
return true
}
func result(coef uint64, sign int8, exp int) Float10 {
switch {
case exp >= INF_EXP:
return inf(sign)
case exp < math.MinInt8 || coef == 0:
return Zero
default:
return Float10{coef, sign, int8(exp)}
}
}
func usub(x Float10, y Float10) Float10 {
sign := x.sign
sign ^= align(&x, &y)
if x.coef < y.coef {
x, y = y, x
sign ^= 1 // flip sign
}
return result(x.coef-y.coef, sign, int(x.exp))
}
func Mul(x Float10, y Float10) Float10 {
sign := x.sign ^ y.sign
switch {
case x == Zero || y == Zero:
return Zero
case x.IsInf() || y.IsInf():
return result(0, sign, INF_EXP)
}
x.minCoef()
y.minCoef()
if bits.Nlz(x.coef)+bits.Nlz(y.coef) >= 64 {
// coef won't overflow
if int(x.exp)+int(y.exp) >= INF_EXP {
return result(0, sign, INF_EXP)
}
return result(x.coef*y.coef, sign, int(x.exp)+int(y.exp))
}
// drop 5 least significant bits off x and y
// 59 bits < 18 decimal digits
// 32 bits > 9 decmal digits
// so we can split x & y into two pieces
// and multiply separately guaranteed not to overflow
xlo, xhi := x.split()
ylo, yhi := y.split()
exp := x.exp + y.exp
r1 := result(xlo*ylo, sign, int(exp))
r2 := result(xlo*yhi, sign, int(exp)+9)
r3 := result(xhi*ylo, sign, int(exp)+9)
r4 := result(xhi*yhi, sign, int(exp)+18)
return Add(r1, Add(r2, Add(r3, r4)))
}
// makes coef as small as possible (losslessly)
// i.e. trim trailing zero decimal digits
func (f *Float10) minCoef() {
for f.coef > 0 && f.coef%10 == 0 {
f.shiftRight()
}
}
// makes coef as large as possible (losslessly)
// i.e. trim leading zero decimal digits
func (f *Float10) maxCoef() {
for f.shiftLeft() {
}
}
func (f *Float10) split() (lo, hi uint64) {
const HI5 = 0x1f << 59
for f.coef&HI5 != 0 {
f.shiftRight()
}
const NINE = 1000000000
return f.coef % NINE, f.coef / NINE
}
func Div(x Float10, y Float10) Float10 {
sign := x.sign ^ y.sign
switch {
case x == Zero:
return Zero
case y == Zero || x.IsInf():
return inf(sign)
case y.IsInf():
return Zero
}
if x.coef%y.coef == 0 {
// divides evenly
return result(x.coef/y.coef, sign, int(x.exp)-int(y.exp))
}
x.maxCoef()
y.minCoef()
if x.coef%y.coef == 0 {
// divides evenly
return result(x.coef/y.coef, sign, int(x.exp)-int(y.exp))
}
return longDiv(x, y)
}
func longDiv(x Float10, y Float10) Float10 {
// shift y so it is just less than x
xdiv10 := x.coef / 10
for y.coef < xdiv10 && y.shiftLeft() {
}
exp := int(x.exp) - int(y.exp)
rem := x.coef
ycoef := y.coef
coef := uint64(0)
// each iteration calculates one digit of the result
for rem != 0 && mul10safe(coef) {
// shift so y is less than the remainder
for ycoef > rem {
rem, ycoef = shift(rem, ycoef)
coef *= 10
exp--
}
if ycoef == 0 {
break
}
// subtract repeatedly
for rem >= ycoef {
rem -= ycoef
coef++
}
}
// round final digit
if 2*rem >= ycoef {
coef++
}
return result(coef, x.sign^y.sign, exp)
}
// shift x left (preferably) or y right
func shift(x, y uint64) (x2, y2 uint64) {
if mul10safe(x) {
x *= 10
} else {
roundup := (y % 10) >= 5
y /= 10
if roundup {
y++
}
}
return x, y
}
func (f Float10) IsInf() bool {
return f.exp == INF_EXP
}
func inf(sign int8) Float10 {
switch sign {
case POSITIVE:
return Inf
case NEGATIVE:
return MinusInf
default:
panic("invalid sign")
}
}
// Nlz returns the number of leading zero bits
// Algorithm from Hacker's Delight by Henry Warren
func Nlz(x uint64) int {
if x == 0 {
return 64
}
n := 1
if (x >> 32) == 0 {
n = n + 32
x = x << 32
}
if (x >> 48) == 0 {
n = n + 16
x = x << 16
}
if (x >> 56) == 0 {
n = n + 8
x = x << 8
}
if (x >> 60) == 0 {
n = n + 4
x = x << 4
}
if (x >> 62) == 0 {
n = n + 2
x = x << 2
}
n = n - int(x>>63)
return n
}
view raw float10.go hosted with ❤ by GitHub

Wednesday, April 02, 2014

TortoiseSVN + TortoiseHg Problem

I use Subversion (SVN) for cSuneido (for historical reasons) and Mercurial (Hg) for jSuneido, both on SourceForge (again for historical reasons).

On Windows I use TortoiseSVN (1.8.5) and TortoiseHg (2.11.2) with Pageant (part of PuTTY, but supplied with TortoiseHg) so I don't have to type a password all the time. This combination has worked well for a long time.

I came into work this morning and TortoiseSVN kept popping up a Plink dialog asking for my password. That's what Pageant is supposed to avoid, especially since SourceForge needs an SSH key, not a password.

TortoiseHg was working fine, which meant Pageant was ok.

I used TortoiseSVN a few days ago. As far as I can recall I didn't change anything since then. But possibly I updated it. There are so many updates going by these days that it's hard to remember.

I searched the web but didn't find anything that seemed to be related.

I tried rebooting. I tried changing my path to put TortoiseHg and TortoiseSVN in the opposite order. Didn't help.

After some digging I found TortoiseHg was using older versions of TortoisePlink and pageant (both from 2012) whereas TortoiseSVN had a new TortoisePlink (from 2014). I wasn't sure it was a good idea, but I tried replacing the new TortoisePlink with the old one, thinking that maybe it needed to match the version of pageant.

That worked! Or at least appears to work. (I even rebooted to make sure the problem wouldn't come back.) It's probably going to break next time I update TortoiseSVN, and I'll probably forget the fix, but at least I'll have this blog post to jog my memory :-) And hopefully in the long run this will get sorted out. I can't be the only person running both. I'm not sure why TortoiseHg has such old versions. There seem to have been similar version issues a few years ago.