Sunday, August 05, 2012

D-tours

I've been continuing to spend a little time here and there playing with the D language.  For a sample project I decided I would try implementing Suneido's lexical scanner. It's pretty simple, just a little string and character manipulation so I thought it would be easy.

I based the D version on the Java version since it's newer and cleaner than the C++ version.

The first thing I ran into was that the lexer returns "tokens", which in the Java code are not just integer enums, but actual enum classes with fields and methods. Java went from no enums in early versions, to quite sophisticated enum capabilities.

In typical new user fashion, I tried to reproduce the Java style token enum in D. (It's hard not to avoid "cutting against the grain" when learning a new language.) D does allow you to use structs for enums so it seemed this would work. But if you use structs then you lose the automatic assignment of consecutive integer values, and D struct's are value types so they aren't necessarily unique.

Then I went off an a tangent looking into D's "mixin" ability. D has the ability to run D code at compile time, generate a string, and insert that string into your source code. This is known as CTFE (compile time function evaluation). For example, you can write:
mixin(myfn(arguments...);
where myfn returns a string containing source code to be compiled into your code in place of the mixin "call". This is easier than trying to use C++ templates to do compile time calculations, but not as seamless as Lisp macros because you have to work at the level of strings.

An example of the power of this is the Pegged library for D. This allows you to include a grammar specification in your source file, have a parser generated from it, use it to parse DSL code also included in your source file, and then use the resulting parse tree to generate D code - all of this at compile time, without requiring any external tools.

It was fun looking at what you could do with mixin but it was overkill for what I was trying to do. I realized I could use another feature of D to use simple integer enum tokens, but still work with them as if they were objects with properties.
enum Token { IF, ELSE, AND, OR, ...
bool isKeyword(T token) { ...
D allows you to call functions like isKeyword as if they were methods on their first argument, so you can say token.isKeyword() even though token is an int, not an object.

This also led me to discover D doesn't have a "set" data structure, which surprised me a little. It does have hash maps built-in to the language, which can be used to emulate sets, and it has a red-black-tree implementation which could also be used. But it doesn't have an actual "Set" class. It was easy enough to throw together my own Set so I could make a set of keywords. (Which I didn't end up using because what I really needed was a map of keyword strings to tokens.)

What I initially wrote was:
static auto keywords = new Set!Token(IF, ELSE, ...);
But that won't work because static initializers can only be simple literals. However, you can write:
static Set!Token keywords;
static this() { keywords = new Set!Token(IF, ELSE, ...); }
I'm not sure why the compiler can't do this same translation as simple syntactic sugar. It's extra annoying because you can't use the "auto" type inference.

You could do it with a mixin like:
mixin(stat("keywords", "new Set!Token(...);");
but that's not exactly pretty.

Another minor annoyance was the lack of an equivalent to Java's static import. If I want to refer to the tokens as just IF or ELSE I could make them "anonymous". But then they don't have a specific type. Or I can give the enum a named type, but then I have to always reference them with Token.IF . In Java I could say import static Token.* and then just use the bare names.

I ended up going back to a mixin approach with each Token a static class. I used a class rather than a struct because I wanted to pass tokens around by reference, not by value. See token.d

I also ended up writing multiple versions of the lexer. As with most rewrites, I found a number of improvements I could make. Most of them aren't specific to D, I could apply them to the other versions as well.

After getting it working, I decided to go back and write it in a functional (rather than object oriented style). In this version the lexer is a function that takes a string and returns a struct with the token, its string value, and the remainder of the source string. This version should work with Unicode UTF8 strings, not just ASCII, because I only access the source string with front and popFront.

Along the way I also started a simple version of Hamcrest style asserts. I almost expected something like this to be available, but I couldn't find anything.

One of the features I was excited about with D was immutability. But when I tried to make Token immutable, I ran into all kinds of problems (as others have also). One of them being that you can't store immutable objects in a map! Hopefully this is something they'll figure out. I eventually gave up on immutable and managed to get const to work, although even that was a struggle.

I set up a dsuneido GitHub repository for my experimenting. I haven't used GitHub before, but I keep hearing how great it is so I thought I better give it a try. I used their Mac GUI app, which made it quite painless. However, I'm not sure it's the smartest idea to be using three different version control systems - Subversion for cSuneido, Mercurial for jSuneido, and Git for dSuneido. I chose Mercurial for jSuneido because they had better Windows support than Git at that time, but now that Eclipse comes with built-in Git support, I wonder if I made the right choice.

I'm not sure where I'm going with this, so far I'm just playing and learning D. So far, I like it, but it certainly has some rough edges.

6 comments:

Anonymous said...

Hi Andrew,
a interesting book regarding D templates . by Phillipe Sigaud (Pegged) I've contributed a small chapter mixin templates...
https://github.com/PhilippeSigaud/D-templates-tutorial/blob/master/dtemplates.pdf
Since you miss a container/collection library :
http://www.dsource.org/projects/dcollections

contains sets, arraylist, maps etc.
you should use the D2 branch.
Hope that helps a bit, Bjoern

Andrew McKinlay said...

Thanks! The book looks good. And I will check out the collection library.

Alex Rønne Petersen said...

> I'm not sure why the compiler can't do this same translation as simple syntactic sugar.

This is because the initializer would have to be CTFE'able. Initializers don't have to be simple literals, they just have to be executable at compile time.

Andrew McKinlay said...

I've seen that argument before, but I disagree - the compiler could translate to a static this() block the same way you can by hand. (The equivalent to what Java does.)

I can see this might not be ideal because it changes the semantics slightly, but it's not ideal having to write ugly static this blocks manually either.

Anonymous said...

> Another minor annoyance was the lack of an equivalent to Java's
> static import. If I want to refer to the tokens as just IF or
> ELSE I could make them "anonymous". But then they don't have a
> specific type. Or I can give the enum a named type, but then I
> have to always reference them with Token.IF . In Java I could
> say import static Token.* and then just use the bare names.<

This helps:

with(Token) {
// lot and lot of code here
}

>One of them being that you can't store immutable objects in a
>map! Hopefully this is something they'll figure out. I
>eventually gave up on immutable and managed to get const to
>work, although even that was a struggle.<

Maybe rebindable helps:


import std.typecons;
immutable class A {
this() immutable {}
}
void main() {
Rebindable!(immutable(A))[int] values;
values[0] = new A;
}
http://dlang.org/phobos/std_typecons.html#Rebindable
Hints by bearophile, a fellow D programmer..

Regarding your lexer implementation :
I think you'll enjoy this one :
https://github.com/dawgfoto/lexer
A lexer for the D programming language. The source should answere most of your questions regarding immutable. You'll further find some excellent examples on using D Ranges (extended C++ iterator).
HTH..
Bjoern
PS previous D blog .. of course the Windows API is available. COM programming is, compared to C++, much more comfortable.

Andrew McKinlay said...

Thanks again!