[Grace-core] Unicode

Michael Homer mwh at ecs.vuw.ac.nz
Thu May 5 23:48:24 PDT 2011


Hi,
As I mentioned in my previous post, I've been working on an
implementation of a Grace-like language, and will be working on an
implementation of Grace itself. In that process I've come up with some
questions about the specification from an implementor's point of view.
I was going to put them all into one post, but just this part is long
enough already.

The first question stems from the statement right near the start that
Grace programs are "written in Unicode". Which encoding? James has
suggested UTF-8, which is sensible, but is it mandated, or will UTF-16
or UTF-32 (or even UTF-7, or others) be possible as well?
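To make the encoding question concrete, here is a rough sketch in Python (not Grace, and nothing here comes from the specification). It encodes one short piece of source text under each of the encodings mentioned above and shows that the bytes on disk differ even though the abstract text is the same. The sample word is just an assumption for illustration.

```python
# Illustrative only: Python, not Grace. The sample word is an assumption.
text = "t\u012bmata"  # six code points; U+012B is i with macron, precomposed

# Encode the same six code points under four encodings of Unicode.
encoded = {name: text.encode(name)
           for name in ("utf-8", "utf-16-le", "utf-32-le", "utf-7")}

for name, raw in encoded.items():
    print(f"{name:10} {len(raw):2d} bytes  {raw.hex()}")

# Every encoding decodes back to the same abstract string.
round_trips = all(raw.decode(name) == text for name, raw in encoded.items())
print(round_trips)
```

The UTF-8, UTF-16 and UTF-32 forms come out at 7, 12 and 24 bytes respectively, so a compiler that is not told (or cannot detect) the encoding has no reliable way to read the file.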

Onwards from there, what are the semantics of strings? Are they
bytestrings in the surrounding text's encoding, or abstract Unicode
(i.e., char[] vs int[])? The simple answer is that you shouldn't be
able to tell, but the meaning of equality isn't clear. Are two strings
equal when they're bit-for-bit the same, so the input encoding is
important, or codepoint-for-codepoint the same? In the latter case
encoding doesn't matter, but normalisation may: does "ō" (U+014D LATIN
SMALL LETTER O WITH MACRON) equal "ō" (U+006F LATIN SMALL LETTER O +
U+0304 COMBINING MACRON)?

Consistency with the definition of equality via the EGAL predicate
suggests that it shouldn't, but that means that two strings with no
distinguishable difference either visibly or semantically are unequal.
If they are equal, then what are their lengths?
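As a point of comparison only, this is what a language with codepoint-for-codepoint equality does with those two strings. The sketch uses Python, whose str type compares code points and does no normalisation; I am not suggesting Grace should behave the same way.

```python
# Illustrative only: Python strings are sequences of code points.
composed = "\u014d"      # U+014D LATIN SMALL LETTER O WITH MACRON
decomposed = "o\u0304"   # U+006F followed by U+0304 COMBINING MACRON

# Codepoint-for-codepoint comparison: the two strings differ.
codepoints_equal = composed == decomposed

# The lengths differ too, since length counts code points here.
lengths = (len(composed), len(decomposed))

# Bit-for-bit comparison in a fixed encoding (UTF-8) also differs.
utf8_bytes = (composed.encode("utf-8"), decomposed.encode("utf-8"))

print(codepoints_equal, lengths, utf8_bytes)
```

Here the strings are unequal and have lengths 1 and 2, which is the behaviour that seems consistent with EGAL but surprising to anyone looking at the rendered text.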

Unicode provides two canonical normalisation forms, fully composed
(NFC) and fully decomposed (NFD), both of which put combining
characters into a fixed order, so text comparisons can be made
byte-for-byte between two strings normalised to the same form (and
stored in the same encoding). Are all strings normalised into one or
other of these forms automatically? If not, then what? If so, which,
and is it possible to express another form?
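For anyone who wants to try the two forms out, the following Python sketch uses the standard unicodedata module. It is only an illustration of NFC and NFD as Unicode defines them, not a proposal for how Grace should expose them; the second pair of strings (a macron and a dot below in both orders) is my own example.

```python
import unicodedata

composed = "\u014d"      # precomposed o with macron
decomposed = "o\u0304"   # o followed by combining macron

# After normalising both to the same form, the strings compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

# The length then depends on which form was chosen.
nfc_len = len(unicodedata.normalize("NFC", decomposed))
nfd_len = len(unicodedata.normalize("NFD", composed))

# Combining marks are put into a fixed order: macron above (U+0304)
# and dot below (U+0323) normalise identically whichever came first.
a = unicodedata.normalize("NFD", "o\u0304\u0323")
b = unicodedata.normalize("NFD", "o\u0323\u0304")
order_fixed = a == b

print(nfc_equal, nfd_equal, nfc_len, nfd_len, order_fixed)
```

So under automatic normalisation the two spellings of "o with macron" would be equal, with length 1 if the language picks NFC and 2 if it picks NFD.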

That's all very well for strings, but what about method names? Can I have two
methods apparently named "tīmata", where which one I call depends on
which form my editor put the macron in?
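A small sketch of the method-name problem, again in Python and purely hypothetical: the tables below stand in for whatever an implementation uses to look up methods, and canonical() is a made-up helper that assumes names are normalised to NFC.

```python
import unicodedata

def canonical(name):
    """Hypothetical helper: normalise a method name to NFC."""
    return unicodedata.normalize("NFC", name)

precomposed = "t\u012bmata"   # uses U+012B, i with macron
combining = "ti\u0304mata"    # uses i followed by U+0304 COMBINING MACRON

# Without normalisation these are two distinct method names.
raw_table = {precomposed: "first body", combining: "second body"}

# With normalisation they collide on a single name.
normalised_table = {}
for name, body in raw_table.items():
    normalised_table.setdefault(canonical(name), []).append(body)

print(len(raw_table), len(normalised_table))
```

Without normalisation there are two methods that look identical on screen; with it there is one name and the second definition would presumably have to be reported as a duplicate.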
-Michael

