[Grace-core] Unicode

James Noble kjx at ecs.vuw.ac.nz
Sun May 15 19:11:46 PDT 2011


here are my opinions on a few of Michael's questions:

> The thrust of this and some of the other questions was "So just how
> much Unicode processing do I have to do, and when?".

I think the answer to some of these is - unfortunately - that we need to find
a Unicode expert and ask them.

> I'm not concerned with the concrete implementation, but conceptually
> whether they're sequences of codepoints (as in Python 3, and to a
> large extent Java) or bytestrings with an attached encoding (as in
> Ruby). I prefer codepoints (or a fixed internal encoding, which works
> out the same), but there is an argument to be made the other way as
> well.

I think codepoints - or "characters" - make the most sense (are these the same?)
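
On the "are these the same?" question, a small sketch may help. It is in
Python 3 (not Grace) purely for illustration, and the helper name same_text
is made up for this example. It shows one user-visible character written
either as a single codepoint or as two, so counting codepoints is not quite
the same as counting "characters":

```python
import unicodedata

# The same visible character, spelled two ways.
composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE (one codepoint)
decomposed = "e\u0301"    # "e" + COMBINING ACUTE ACCENT (two codepoints)

def same_text(a, b):
    """Hypothetical helper: compare two strings after NFC normalisation."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(composed), len(decomposed))    # codepoint counts differ
print(composed == decomposed)            # raw codepoint comparison
print(same_text(composed, decomposed))   # comparison after normalisation
```

Whether Grace should expose codepoints, or something closer to what a reader
sees as one character, is exactly the kind of thing to ask an expert about.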

> So here: "Can you get two strings in different encodings (from
> different modules, or user input), and if so, can you tell from inside
> the program?".

I think the answer here should be: "not normally"
(if we want to do string processing in Grace, we have to allow
this kind of situation to arise, right?)

> If strings are meant to be able to hold binary data as well then
> codepoints don't cut it, but if they're purely textual then either
> approach can work.

I think Grace needs some other data structure(s) to handle this.
In the age of Unicode, a string is a character string;
a byte-sequence or byte-stream is something different.
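
As a rough illustration of that split, here is how Python 3 keeps the two
apart (again, this is Python, not a proposal for Grace syntax; the function
name is_utf8_text is invented for the example). Text is a sequence of
codepoints, bytes are a separate type, and an explicit encoding is needed to
move between them:

```python
text = "na\u00efve"            # a character string: 5 codepoints
data = text.encode("utf-8")    # a byte sequence: a different type

def is_utf8_text(raw):
    """Hypothetical helper: can these bytes be read as UTF-8 text?"""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(type(text).__name__, len(text))    # str, counted in codepoints
print(type(data).__name__, len(data))    # bytes, counted in bytes
print(data.decode("utf-8") == text)      # round trip with the same encoding
print(is_utf8_text(b"\xff\xfe"))         # arbitrary binary data is not text
```

The point is only that binary data never has to pretend to be a string.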

> I'm not sure that's right. The two methods may come from different
> sources, one in a library or instructor-provided code, or the
> copied-off-a-website example, and one from a student who can't know
> what form the original was in without reading a hexdump. I don't think
> you can chalk that up to programmer stupidity, unless it's
> stupidity to be using non-ASCII identifiers in the first place.

let's not open that can of worms! 

> That was my instinct, but the inconsistency there is what prompted the
> question. Implicit normalisation for strings seems clearly the wrong
> thing (but conceivably appropriate for a teaching language?),

I'm sure I must be missing something here...

> if only
> because addressing the other form becomes impossible, but allowing
> distinct-but-identical method names seems like asking for obfuscation
> and error.

absolutely! 
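
One way to rule out distinct-but-identical method names is to normalise
identifiers when they are defined and when they are looked up. The sketch
below is Python, and the MethodTable class and the choice of NFC are
assumptions made for illustration only, not a description of any Grace
implementation:

```python
import unicodedata

class MethodTable:
    """Hypothetical method table that treats canonically equivalent
    spellings of a name as the same name."""

    def __init__(self):
        self._methods = {}

    @staticmethod
    def _key(name):
        # Assumption: NFC is the chosen normal form for identifiers.
        return unicodedata.normalize("NFC", name)

    def define(self, name, fn):
        key = self._key(name)
        if key in self._methods:
            raise ValueError("method already defined: " + key)
        self._methods[key] = fn

    def lookup(self, name):
        return self._methods[self._key(name)]

table = MethodTable()
table.define("caf\u00e9", lambda: "hello")   # composed spelling
print(table.lookup("cafe\u0301")())          # decomposed spelling finds it
```

With this, a second define using the other spelling is rejected rather than
silently creating a look-alike method.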

> Having different behaviour for the two, or different
> canonicalisation for the source code depending on syntactic position,
> seems confusing. It may still be right, though.

I can't see why we'd want different behaviour for strings in different places.
Wouldn't strings just work as strings -
they are already odd because "characters" in Grace are single-element strings -
and for issues of different encodings, etc., we use another type?

James

