[Grace-core] Unicode

Michael Homer mwh at ecs.vuw.ac.nz
Fri May 13 16:18:18 PDT 2011


On Sat, May 7, 2011 at 9:52 AM, Andrew P. Black <black at cs.pdx.edu> wrote:
>> On 5 May 2011, at 23:48, Michael Homer wrote:
>>
>>> Hi,
>>> As I mentioned in my previous post, I've been working on
>>> implementation of a Grace-like language, and will be working on an
>>> implementation of Grace itself. In that process I've come up with some
>>> questions about the specification from an implementor's point of view.
>>> I was going to put them all into one post, but just this part is long
>>> enough already.
>>>
>>> The first question stems from the statement right near the start that
>>> Grace programs are "written in Unicode". Which encoding? James has
>>> suggested UTF-8, which is sensible, but is it mandated, or will UTF-16
>>> or UTF-32 (or even UTF-7, or others) be possible as well?
>>
>> That's an implementation question, not a language question, but it's an important one in
>> practice.  Whatever IDE we have, there will need to be an agreed interchange format
>> whereby users can exchange Grace programs in files.
I think I was unclear about what I was asking in my original message.
The questions weren't, I don't think, ones of implementation but ones
of specification. "That's implementation-defined" is a perfectly
reasonable answer to many of them, though.

The thrust of this and some of the other questions was "So just how
much Unicode processing do I have to do, and when?". If the answer to
the source-encoding question is that it's implementation-defined, then
the answer here is virtually none, but "written in Unicode" is pretty
broad. Realistically, if UTF-8 compliance is mandatory then that's all
I'm going to support, but if something else is required then I will
have to handle that as well.
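
As a rough sketch of the difference (in Python, purely for
illustration; the reader function and its name are mine): if UTF-8 is
the mandated encoding, the reader's whole job is a decode that rejects
malformed byte sequences, while anything looser means sniffing or
converting encodings first.

    # Hypothetical source reader, assuming UTF-8 is the only permitted encoding.
    def read_source(path):
        with open(path, "rb") as f:
            data = f.read()
        # Raises UnicodeDecodeError on malformed input rather than guessing.
        return data.decode("utf-8")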

>>> Onwards from there, what are the semantics of strings?
>>
>> The /semantics/ of a string is a sequence of Unicode code points.
>>> Are they
>>> bytestrings in the surrounding text's encoding, or abstract Unicode
>>> (i.e., char[] vs int[])?
>>
>> I think that you are asking how you should implement them.  It's up to you.  I think that the
>> answer is as Cord trees, and you can grab my implementation of those, in Fortress, from the
>> fortress source tree.  Cords give us constant-time concatenation, which is important since
>> Strings are immutable.  The leaves of my implementation re-used Java strings, which was
>> reasonable since Fortress runs on the JVM.  If you are running on Parrot, I think that you
>> should probably use Parrot strings at the leaves.
I'm not concerned with the concrete implementation, but with whether,
conceptually, strings are sequences of codepoints (as in Python 3, and
to a large extent Java) or bytestrings with an attached encoding (as
in Ruby). I prefer codepoints (or a fixed internal encoding, which
works out the same), but there is an argument to be made the other way
as well.

So the question here is: can you get two strings in different
encodings (from different modules, or from user input), and if so, can
you tell from inside the program?

If strings are meant to be able to hold binary data as well then
codepoints don't cut it, but if they're purely textual then either
approach can work.
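
To illustrate the distinction I mean, in Python (used only because it
exposes both models directly): a codepoint string carries no encoding,
while a bytestring is only text relative to some encoding.

    s = "t\u012bmata"               # sequence of codepoints, no encoding attached
    b = s.encode("utf-8")           # bytes, meaningful only alongside an encoding
    assert s[1] == "\u012b"         # indexing yields a codepoint
    assert b != s.encode("utf-16")  # the "same" text as two different byte sequences
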
>>
>>> The simple answer is that you shouldn't be
>>> able to tell, but the meaning of equality isn't clear. Are two strings
>>> equal when they're bit-for-bit the same, so the input encoding is
>>> important, or codepoint-for-codepoint the same?
>>
>> The latter, of course.   Two trees representing the same sequence of Unicode code points
>> will not in general be the same, and in any case we need to allow tree-rebalancing.

Even with a tree implementation there is still a linear sequence that
the tree represents, and that sequence could be compared bit-for-bit
in the encoding of the string. The concrete implementation under the
hood doesn't need to make a difference so long as the operations are
defined the right way, which is what I'm trying to sort out. Certainly
my internal implementation is likely to change, but I don't expect
that to be visible to programs. I don't know what to do about multiple
implementations coexisting.
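
By "defined the right way" I mean something like this Python sketch,
with the two encodings standing in for two different internal
representations:

    text = "t\u012bmata"
    as_utf8  = text.encode("utf-8")
    as_utf16 = text.encode("utf-16-le")
    # Byte-for-byte the representations differ...
    assert as_utf8 != as_utf16
    # ...but mapping each back to codepoints gives equal strings, which is
    # the equality programs should be able to rely on.
    assert as_utf8.decode("utf-8") == as_utf16.decode("utf-16-le")
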
>>> That's well for strings, but what about method names? Can I have two
>>> methods apparently named "tīmata", where which one I call depends on
>>> which form my editor put the macron in?
>>
>> Yes you could.  You could also have two methods, one called ll and one called l1 and
>> whether you can tell the difference depends on what font you choose to render your text in.  We
>> can't legislate against stupidity; stupid programmers are too clever!
I'm not sure that's right. The two methods may come from different
sources: one from a library, instructor-provided code, or an example
copied off a website, and one from a student who can't know what form
the original was in without reading a hexdump. I don't think you can
chalk that up to programmer stupidity, unless it's stupidity to be
using non-ASCII identifiers in the first place. It's not the same as
l1 vs ll, which are readily distinguishable and which nobody is likely
to type in place of the other.
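
Concretely, the two spellings of "tīmata" I have in mind (precomposed
U+012B versus plain i followed by combining U+0304) render identically
but compare unequal as codepoint sequences, and only normalisation
makes them the same; a Python sketch:

    import unicodedata

    composed   = "t\u012bmata"    # i-with-macron as a single codepoint (NFC)
    decomposed = "ti\u0304mata"   # plain i plus a combining macron (NFD)
    assert composed != decomposed
    assert unicodedata.normalize("NFC", decomposed) == composed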

> Maybe the real question hiding here is: when comparing method names for equality in the
> dispatch mechanism, should the names be normalized first?  The answer to this is, perhaps,
> yes, whereas for string literals it seems clearly no: the programmer gets what the programmer
> puts.

That was my instinct, but the inconsistency there is what prompted the
question. Implicit normalisation of strings seems clearly the wrong
thing (though conceivably appropriate for a teaching language?), if
only because it makes the other form impossible to refer to at all;
but allowing distinct-yet-identical-looking method names seems like
asking for obfuscation and error. Having different behaviour for the
two, or different canonicalisation of the source code depending on
syntactic position, seems confusing. It may still be right, though.
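
Something like the following sketch is what that rule would amount to
(Python again, with NFC chosen arbitrarily as the canonical form; the
helper names are mine):

    import unicodedata

    def method_key(name):
        # Identifiers get normalised before they are compared or looked up...
        return unicodedata.normalize("NFC", name)

    def string_literal(text):
        # ...while literals keep exactly the codepoints the programmer wrote,
        # in whichever form the editor happened to produce.
        return text

    assert method_key("ti\u0304mata") == method_key("t\u012bmata")
    assert string_literal("ti\u0304mata") != string_literal("t\u012bmata")
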
-Michael
