[Grace-core] Unicode

Andrew P. Black black at cs.pdx.edu
Fri May 6 14:52:36 PDT 2011


On 6 May 2011, at 14:08 , Michael Homer wrote:

> Did you mean to send this to the list? 

Yes, I should have done so.  I've taken the opportunity to fix some typos and add a sentence.

> 

On Fri, May 6, 2011 at 8:51 PM, Andrew P. Black <black at cs.pdx.edu> wrote:
> Michael,
> 
> It's good that you have done so much!  And your questions are provoking.  Here are my answers; others may answer differently!
> 
> 
> On 5 May 2011, at 23:48 , Michael Homer wrote:
> 
>> Hi,
>> As I mentioned in my previous post, I've been working on
>> implementation of a Grace-like language, and will be working on an
>> implementation of Grace itself. In that process I've come up with some
>> questions about the specification from an implementor's point of view.
>> I was going to put them all into one post, but just this part is long
>> enough already.
>> 
>> The first question stems from the statement right near the start that
>> Grace programs are "written in Unicode". Which encoding? James has
>> suggested UTF-8, which is sensible, but is it mandated, or will UTF-16
>> or UTF-32 (or even UTF-7, or others) be possible as well?
> 
> That's an implementation question, not a language question, but it's an important one in practice.  Whatever IDE we have, there will need to be an agreed interchange format whereby users can exchange Grace programs in files.
> 
> It seems to me that UTF-8 is the right interchange format, because I think that most of the characters in a Grace program will be in ISO-8859 — but then, I speak English!
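> 
> (For concreteness, reading a source file under that assumption is trivial in most host languages; a sketch in Python, with a made-up file name:)
> 
>     # Read a Grace source file assuming the UTF-8 interchange format;
>     # invalid byte sequences raise an error rather than being silently replaced.
>     with open("hello.grace", encoding="utf-8", errors="strict") as f:
>         source = f.read()   # now a sequence of Unicode code points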
> 
> Internally, the representation is your own choice, but KDE, .NET, Java, and MacOS all use 16-bit encodings.  My implementation, gestating in Pharo, uses Pharo strings, which use a variable-width representation.
> 
> A related question is how to represent line endings in the interchange format.  The obvious choice is to accept them all, but to output only U+2028 (LINE SEPARATOR).
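> 
> (As a sketch, in Python rather than Grace, assuming the input forms we accept are CR, LF, CR LF, and NEL:)
> 
>     import re
> 
>     # Accept the common line endings on input; emit U+2028 LINE SEPARATOR on output.
>     _LINE_END = re.compile(r"\r\n|\r|\n|\u0085")
> 
>     def normalize_line_endings(text):
>         return _LINE_END.sub("\u2028", text)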
> 
> 
>> 
>> Onwards from there, what are the semantics of strings?
> 
> The /semantics/ of a string is a sequence of Unicode code points.
> 
>> Are they
>> bytestrings in the surrounding text's encoding, or abstract Unicode
>> (i.e., char[] vs int[])?
> 
> I think that you are asking how you should implement them.  It's up to you.  I think that the answer is as Cord trees, and you can grab my implementation of those, in Fortress, from the Fortress source tree.  Cords give us constant-time concatenation, which is important since Strings are immutable.  The leaves of my implementation re-used Java strings, which was reasonable since Fortress runs on the JVM.  If you are running on Parrot, I think that you should probably use Parrot strings at the leaves.
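> 
> (A toy version of the idea, in Python rather than Grace, with names of my own choosing, just to show why concatenation is constant-time:)
> 
>     # Leaves hold ordinary host strings; concatenation allocates a single
>     # node and never copies the text, so it is O(1).
>     class Leaf:
>         def __init__(self, text):
>             self.text = text
>             self.length = len(text)
>         def flatten(self):
>             return self.text
> 
>     class Concat:
>         def __init__(self, left, right):
>             self.left = left
>             self.right = right
>             self.length = left.length + right.length
>         def flatten(self):
>             return self.left.flatten() + self.right.flatten()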
> 
> 
>> The simple answer is that you shouldn't be
>> able to tell, but the meaning of equality isn't clear. Are two strings
>> equal when they're bit-for-bit the same, so the input encoding is
>> important, or codepoint-for-codepoint the same?
> 
> The latter, of course.   Two trees representing the same sequence of Unicode code points will not in general be the same, and in any case we need to allow tree-rebalancing.
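> 
> (Continuing the toy cord sketch above: two differently shaped trees over the same code points should still compare equal.)
> 
>     a = Concat(Leaf("Hel"), Leaf("lo"))
>     b = Concat(Leaf("H"), Concat(Leaf("el"), Leaf("lo")))
>     assert a.flatten() == b.flatten()   # same code points, different tree shapes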
> 
> BTW, James was at one point arguing that the user should not be able to re-define =.  Strings are a good example of why object-specific definition of = is really necessary.
> 
>> In the latter case
>> encoding doesn't matter, but normalisation may: does "ō" (U+014D LATIN
>> SMALL LETTER O WITH MACRON) equal "ō" (U+006F LATIN SMALL LETTER O +
>> U+0304 COMBINING MACRON)?
> 
> No, because they are different code point sequences.  Of course, we will probably need a normalization operation on Strings.  I'm not a Unicode expert, though, and this may be a place to get advice from someone who is.  Still, the situation seems to me to be the same as one we have tolerated for a while: multiple representations of line endings.  \n and \r are not equal, but there is probably a normalization that converts them both to U+2028.
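> 
> (The distinction is easy to demonstrate with any Unicode library; in Python, for example:)
> 
>     import unicodedata
> 
>     composed   = "\u014d"      # LATIN SMALL LETTER O WITH MACRON
>     decomposed = "o\u0304"     # LATIN SMALL LETTER O + COMBINING MACRON
> 
>     assert composed != decomposed                                 # different code point sequences
>     assert unicodedata.normalize("NFC", decomposed) == composed   # equal once normalized
>     assert unicodedata.normalize("NFD", composed) == decomposed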
> 
>> 
>> Consistency with the definition of equality via the EGAL predicate
>> suggests that it shouldn't, but that means that two strings with no
>> distinguishable difference either visibly or semantically are unequal.
>> If they are equal, then what are their lengths?
> 
> As I said, they are not equal;  they have different lengths.
> 
>> 
>> Unicode provides two canonical normalisation forms, fully composed
>> (NFC) and fully decomposed (NFD), both of which put combining
>> characters into a fixed order, so text comparisons can be made
>> byte-for-byte between two normalised strings. Are all strings
>> normalised into one or other of these forms automatically? If not,
>> then what? If so, which, and is it possible to express another form?
> 
> I think that normalization, like any other conversion, should be explicit.  However, I may be wrong.  Note that while this may become an important point in the design of the String library as the language evolves, it's not a big issue in the language design itself.
> 
> The language design does have to say something about how to represent Unicode escapes in string literals.  My syntax does not at present do so.  The "obvious" escape to me is \U+<four to six hex digits>; because that is what the standard says, modulo the leading \ and the trailing ;.
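> 
> (A sketch of what recognizing that escape might look like, in Python; the exact syntax is of course not settled:)
> 
>     import re
> 
>     # \U+ followed by four to six hex digits and a terminating ';'
>     _ESCAPE = re.compile(r"\\U\+([0-9A-Fa-f]{4,6});")
> 
>     def expand_escapes(literal):
>         return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), literal)
> 
>     # expand_escapes(r"t\U+012B;mata") == "tīmata"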
> 
>> 
>> That's well for strings, but what about method names? Can I have two
>> methods apparently named "tīmata", where which one I call depends on
>> which form my editor put the macron in?
> 
> Yes, you could.  You could also have two methods, one called ll and one called l1, and whether you can tell the difference depends on the font you choose to render your text in.  We can't legislate against stupidity; stupid programmers are too clever!


Maybe the real question hiding here is: when comparing method names for equality in the dispatch mechanism, should the names be normalized first?  The answer to this is, perhaps, yes, whereas for string literals it seems clearly no: the programmer gets what the programmer puts.
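
To make that concrete: if dispatch normalizes names (assuming NFC were the form chosen, which is only a guess), both spellings of "tīmata" would find the same method.  A sketch in Python:

    import unicodedata

    def method_key(name):
        # Normalize identifiers to NFC before using them as lookup keys.
        return unicodedata.normalize("NFC", name)

    methods = {method_key("t\u012Bmata"): "<method body>"}
    assert method_key("ti\u0304mata") in methods   # the decomposed spelling finds it too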

> 
> A more interesting issue, to me, is whether we mandate an input method.  This is more important because, while it's really easy to convert a file from utf-16 to utf-8, it's really hard to retrain my fingers.   Presumably we will accept whatever the platform's conventions for Unicode input are, but we should probably have some of our own for characters that reasonable people might want to type in Grace programs, for example TeX-like escapes for Greek letters and math symbols.
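> 
> (The table of such escapes might look something like this; the entries are illustrative only, not a proposal for the actual set:)
> 
>     # TeX-like input escapes that an editor or IDE could expand as they are typed.
>     TEX_INPUT = {
>         r"\alpha":  "\u03b1",   # GREEK SMALL LETTER ALPHA
>         r"\lambda": "\u03bb",   # GREEK SMALL LETTER LAMBDA
>         r"\times":  "\u00d7",   # MULTIPLICATION SIGN
>     }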
> 
> Maybe input methods are what you are asking about in your last question?  It might be reasonable for us to advise IDE implementors to create programs in the fully-composed normalization everywhere EXCEPT in string literals.   However, I have to admit having no experience with this at all.  We really need some advice from people who work daily in accented character sets.
> 
>        Andrew



