Pre Delphi 2008-9 Unicode Do's and Dont's


  #21  
Old 07-21-2008, 05:39 PM
John Herbster
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

Remy, Thorsten, et al.,

> "Char" will now be an alias for WideChar, ...
> Thus SizeOf(Char) will now be 2.


Thanks for that info.

>> How can you encode 100,000 characters in only 2-bytes
>> when 2^(2*8) = 65536?


> The new UnicodeString type will use UTF-16 (just like WideString does) in
> order to match how Windows implements Unicode.


> In UTF-16, Unicode code points (logical characters) less than $10000 can be
> encoded using their original value as-is in a single WideChar. Unicode
> codepoints above $10000, inclusive, have to be encoded as two WideChars
> working together (known as a "surrogate pair"). The use of surrogate pairs
> allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
> than that requires UTF-32 instead. Which Tiburon will also support, via a
> separate UCS4String (and UCS4Char) data type, which are 32-bit.


Then for "surrogate pairs" which require two WideChars for their
representation, it seems to me that "exactly as before" character
indexing will sometimes require stepping over two WideChars instead
of one.

Are the individual WideChars stored big or little endian?
If little endian in Intel RAM, how are they stored in disk "text"
files and communicated over wires?

What about the surrogate pairs? Is the low or high part of the pair
at the lower address? And ditto for disk files and communications?

> UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 codepoints in Unicode.

Retrieved from "http://en.wikipedia.org/wiki/Variable-width_encoding"
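
The arithmetic in the quoted passage can be checked with a short sketch. It is written in Python purely for illustration (nothing in this thread uses Python), and the helper name `to_surrogates` is made up for this example.

```python
# Count the values available in each UTF-16 range quoted above.
singletons = (0xD7FF - 0x0000 + 1) + (0xFFFF - 0xE000 + 1)  # 16-bit values that are not surrogates
lead_units = 0xDBFF - 0xD800 + 1     # high surrogates
trail_units = 0xDFFF - 0xDC00 + 1    # low surrogates
pairs = lead_units * trail_units     # 1024 * 1024 combinations

# The Unicode codespace runs from U+0000 to U+10FFFF. This count includes
# the 2,048 surrogate values, which never stand for a character alone.
total_code_points = 0x10FFFF + 1


def to_surrogates(cp):
    """Split a code point in U+10000..U+10FFFF into (lead, trail)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000            # a 20-bit value
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


print(singletons, lead_units, trail_units, pairs, total_code_points)
print([hex(unit) for unit in to_surrogates(0x10401)])
```

The top 10 bits of the offset select the lead unit and the bottom 10 bits select the trail unit, which is why each range holds exactly 1,024 values.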

Does that mean that UTF-16 characters are limited to 4-bytes?

TIA for the education, JohnH
  #22  
Old 07-21-2008, 05:55 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John Herbster wrote:

> Then for "surrogate pairs" which require two WideChars for their
> representation, it seems to me that "exactly as before" character
> indexing will require sometimes stepping over two WideChars instead
> of one.


UTF-16 has the huge advantage that the value ranges for singletons and
for leading and trailing surrogates do not overlap:

"In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead
units the range D800-DBFF and trail units the range DC00-DFFF".

As a result of this, for code like "split this string into individual
strings at each \" and a lot of other string processing that's
happening on a per-character basis, you don't have to worry about the
surrogate pairs, because the trailing unit can never be mistaken for
some other valid character.
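
A minimal sketch of that point, written in Python only for illustration. It splits a string at each backslash while looking at one 16-bit unit at a time and never checks for surrogates. The helper names (`code_units`, `split_units`, `units_to_text`) are invented for this example.

```python
import struct


def code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def units_to_text(units):
    """Turn a list of UTF-16 code units back into a string."""
    return struct.pack("<%dH" % len(units), *units).decode("utf-16-le")


def split_units(units, delimiter):
    """Split at every unit equal to delimiter, ignoring surrogates entirely."""
    parts, current = [], []
    for unit in units:
        if unit == delimiter:
            parts.append(current)
            current = []
        else:
            current.append(unit)
    parts.append(current)
    return parts


# The middle component is U+10401, stored as the pair D801 DC01.
path = "dir\\\U00010401\\file"
parts = [units_to_text(p) for p in split_units(code_units(path), ord("\\"))]
print(parts)
```

Because neither D801 nor DC01 can equal the code unit for the backslash ($005C), the pair is carried through intact even though the loop knows nothing about it.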

> Are the individual WideChars stored big or little endian?

In memory, usually whatever your current hardware platform prefers.

> If little endian in Intel RAM, how are they stored in disk "text"
> files and communicated over wires?


That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark

All UTF16 strings that go "over the wire" or onto disk should be
prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the byte order
of the following data.

> What about the surrogate pairs? Is the low or high part of the pair
> at the lower address? And ditto for disk files and communications?

The order of the surrogate pairs always remains the same, the leading
one comes before the trailing one.

> Does that mean that UTF-16 characters are limited to 4-bytes?

That's why they are called "surrogate pairs" and not "surrogate
sequences" or something like that. You either have a singleton or a
pair of a leading and trailing surrogate.


  #23  
Old 07-21-2008, 06:01 PM
Pieter Zijlstra
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John Herbster wrote:

> Then for "surrogate pairs" which require two WideChars for their
> representation, it seems to me that "exactly as before" character
> indexing will require sometimes stepping over two WideChars instead
> of one.


It is the same as before, where multiple bytes were needed to display
one character in, for instance, Asian Windows versions. Most of the time
you don't care: you just read/write a number of bytes (with Unicode,
words) and leave it to the Windows API how this is displayed.

--
Pieter
  #24  
Old 07-21-2008, 06:43 PM
John Herbster
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

(Correction)

"Thorsten Engler [NexusDB]" <thorsten.engler@nexusdb.com> wrote

> UTF16 has the huge advantage that the values for singletons and leading
> and trailing surrogate pairs do not overlap:


I see!

>> Are the individual WideChars stored big or little endian?


> ... usually whatever your current hardware platform prefers.


Am I correct that "U+" is just a prefix indicating a Unicode
value written in hexadecimal? Is a surrogate pair written
U+D801, U+DC01?

>> If little endian in Intel RAM, how are they stored in disk
>> "text" files and communicated over wires?


> That's what a BOM is for: http://en.wikipedia.org/wiki/Byte_Order_Mark


> All UTF16 strings that go "over the wire" or onto disk should be
> prefixed by a BOM. Either U+FFFE or U+FEFF, depending on the
> byte order of the following data.


I do not understand this. If I have MyAnsiString = 'AB' and assign
it to MyWideString in RAM on a PC with an Intel CPU, then I presume
that I have in increasing memory addresses $41, $00, $42, and $00,
or if you please U+0041 and U+0042.

Now if I sent this to a file, is this byte sequence valid?
Big-endian: $FE, $FF, $00, $41, $00, $42
And this one valid, too?
Little-endian: $FF, $FE, $41, $00, $42, $00
And if so, wouldn't the U+ representation in either case be
U+FEFF, U+0041, U+0042?

TIA, JohnH
  #25  
Old 07-21-2008, 06:55 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John Herbster wrote:

> Am I correct that "U+" is just a prefix indicating a Unicode
> value written in hexadecimal? Is a surrogate pair written
> U+D801, U+DC01?

Yes.

> I do not understand this. If I have MyAnsiString = 'AB' and assign
> it to MyWideString in RAM on a PC with an Intel CPU, then I presume
> that I have in increasing memory addresses $41, $00, $42, and $00,
> or if you please U+0041 and U+0042.

Yes.

> Now if I sent this to a file, is this byte sequence valid?
> Big-endian: $FE, $FF, $00, $41, $00, $42

Yes.

> And this one valid, too?
> Little-endian: $FF, $FE, $41, $00, $42, $00

Yes.

> And if so, wouldn't the U+ representation in either case be
> U+FEFF, U+0041, U+0042.

Yes, I was mistaken. It is always U+FEFF, which can be FF FE or FE FF
depending on the endianness. I shouldn't have used U+FFFE except to say
that "The Unicode value U+FFFE is guaranteed never to be assigned as a
Unicode character; this implies that in a Unicode context the 0xFF,
0xFE byte pattern can only be interpreted as the U+FEFF character
expressed in little-endian byte order (since it could not be a U+FFFE
character expressed in big-endian byte order)."
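
As a rough illustration of how a reader can use the BOM, here is a small Python sketch (Python is used only as a neutral illustration language). The function name `decode_utf16_with_bom` is made up. `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` are standard-library constants holding the bytes FE FF and FF FE.

```python
import codecs


def decode_utf16_with_bom(data):
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_BE):       # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):       # FF FE
        return data[2:].decode("utf-16-le")
    raise ValueError("no UTF-16 BOM; the byte order has to be known some other way")


# The two byte sequences from John's question.
big = bytes([0xFE, 0xFF, 0x00, 0x41, 0x00, 0x42])
little = bytes([0xFF, 0xFE, 0x41, 0x00, 0x42, 0x00])

print(decode_utf16_with_bom(big), decode_utf16_with_bom(little))
```

Both inputs decode to the same two characters. The sketch assumes the data really is UTF-16: a UTF-32 little-endian stream also starts with FF FE (followed by 00 00), so a general-purpose detector would need to check for that first.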

  #26  
Old 07-21-2008, 07:10 PM
Ivan
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

> UTF16 has the huge advantage that the values for singletons and leading
> and trailing surrogate pairs do not overlap:


Now the advantage over utf8 is finally becoming clear. Thanks so much Thorsten, very helpful as usual.
  #27  
Old 07-21-2008, 07:35 PM
Remy Lebeau (TeamB)
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's


"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:48850208$1@newsgroups.borland.com...

> Then for "surrogate pairs" which require two WideChars for
> their representation, it seems to me that "exactly as before"
> character indexing will require sometimes stepping over
> two WideChars instead of one.


Potentially. But that requirement has existed since WideString was
introduced. It does not change now that UnicodeString is being added. If
you don't need to act on individual codepoints in your code, then you don't
have to worry about treating surrogates separately. Otherwise, you
generally would have to convert from UTF-16 to UTF-32 before you could work
with codepoints correctly anyway.
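
For readers who do need individual codepoints, the conversion mentioned here can be sketched as follows. This is an illustrative Python sketch, not Delphi RTL code, and the function name `code_points` is invented for the example.

```python
def code_points(units):
    """Turn a sequence of UTF-16 code units into a list of code points."""
    result = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_lead = 0xD800 <= unit <= 0xDBFF
        has_trail = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_lead and has_trail:
            # Lead followed by trail: combine the two into one code point.
            result.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # Singleton, or an unpaired surrogate passed through unchanged.
            result.append(unit)
            i += 1
    return result


units = [0x0041, 0xD801, 0xDC01, 0x0042]    # 'A', one surrogate pair, 'B'
print([hex(cp) for cp in code_points(units)])
```

Four code units go in and three codepoints come out, which is the "stepping over two WideChars" case from the question. How to treat an unpaired surrogate is a design choice; this sketch simply passes it through.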

> Are the individual WideChars stored big or little endian?


WideString and UnicodeString use Big Endian, as that is the default endian
for Intel platforms.

> If little endian in Intel RAM, how are they stored in disk "text"
> files and communicated over wires?


It is the coder's responsibility to handle endian issues in those cases.
That is nothing new.

> What about the surrogate pairs? Is the low or high part of the pair
> at the lower address?


The High surrogate always appears in front of the Low surrogate in the
string, but each individual surrogate in the pair is affected by the endian
used for the entire string. This is clearly described in RFC 2781.
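
A short sketch of what that means at the byte level, using Python's built-in UTF-16 codecs purely for illustration. The character U+10401 is chosen because it encodes to the pair D801 DC01 used elsewhere in this thread.

```python
ch = "\U00010401"                    # UTF-16 code units: D801 DC01

little = ch.encode("utf-16-le")      # each 16-bit unit stored low byte first
big = ch.encode("utf-16-be")         # each 16-bit unit stored high byte first

print(" ".join("%02X" % b for b in little))
print(" ".join("%02X" % b for b in big))
```

In both byte orders the D801 unit comes before the DC01 unit. Only the two bytes inside each unit change places.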

> And ditto for disk files and communications?


That is also the coder's responsibility to handle.

> Does that mean that UTF-16 characters are limited to 4-bytes?


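Unicode itself fits in 4 bytes per codepoint (encoded using UTF-32
and/or UCS4). The UCS4 form has room for values up to $7FFFFFFF, but the
Unicode codespace ends at $10FFFF.

A codepoint value up to $10FFFF fits in 3 bytes (21 bits), and $10FFFF is
the highest codepoint UTF-16 can handle. Encoded as UTF-16, a codepoint
takes either one WideChar (2 bytes) or a surrogate pair (4 bytes), never
more.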
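
To make the byte counts concrete, here is a small Python sketch (illustrative only; the helper name `encoded_sizes` is made up). For a few codepoints it reports how many bits the value needs and how many bytes it occupies once encoded as UTF-16 and as UTF-32.

```python
def encoded_sizes(cp):
    """Return (bits needed, UTF-16 bytes, UTF-32 bytes) for one code point."""
    ch = chr(cp)
    return cp.bit_length(), len(ch.encode("utf-16-le")), len(ch.encode("utf-32-le"))


for cp in (0x41, 0xFFFF, 0x10000, 0x10FFFF):
    print("U+%04X" % cp, encoded_sizes(cp))
```

The largest codepoint, U+10FFFF, needs 21 bits as a number, yet takes 4 bytes in UTF-16 (a surrogate pair) and 4 bytes in UTF-32.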


Gambit


  #28  
Old 07-21-2008, 07:44 PM
Remy Lebeau (TeamB)
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's


"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:488510ec$1@newsgroups.borland.com...
(Correction)

> Am I correct that "U+" is just a prefix indicating a Unicode
> value written in hexadecimal?


Yes.

> If I have MyAnsiString = 'AB' and assign it to MyWideString
> in RAM on a PC with an Intel CPU, then I presume that I have
> in increasing memory addresses $41, $00, $42, and $00


Yes. That would be UTF-16 in Little Endian.

> Now if I sent this to a file, is this byte sequence valid?
> Big-endian: $FE, $FF, $00, $41, $00, $42

Yes.

> And this one valid, too?
> Little-endian: $FF, $FE, $41, $00, $42, $00


Yes.

> And if so, wouldn't the U+ representation in either case be
> U+FEFF, U+0041, U+0042?


Yes, it would.


Gambit


  #29  
Old 07-21-2008, 07:56 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

> WideString and UnicodeString use Big Endian, as that is the default
> endian for Intel platforms.

Eh. Little-endian is default on x86. Lowest byte first.
Which is why U+0041 will be $41 $00 in memory. But it doesn't really
matter much either way because in most cases you are not going to
access unicode strings byte by byte.
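
The in-memory layout can be inspected with a few lines of Python, shown here only as an illustration. `struct.pack` with the `=` prefix uses the byte order of the machine running the script, so the output depends on the hardware.

```python
import struct
import sys

# "=2H" packs two unsigned 16-bit values in the machine's native byte order.
native = struct.pack("=2H", 0x0041, 0x0042)    # the code units of 'AB'
little = "AB".encode("utf-16-le")              # always 41 00 42 00
big = "AB".encode("utf-16-be")                 # always 00 41 00 42

print(sys.byteorder, " ".join("%02X" % b for b in native))
```

On an x86 machine `sys.byteorder` reports `little` and the native bytes match the little-endian encoding.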


  #30  
Old 07-22-2008, 12:13 AM
Chad Z. Hower aka Kudzu
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

http://www.kudzuworld.com/blogs/tech/20080722A.aspx

--
Keep up to date - read the IntraWeb blog!
http://www.atozed.com/intraweb/blog/


"Lee Jenkins" <lee@nospam.net> wrote in message
news:4884bbe9$1@newsgroups.borland.com...
>
> Has anyone posted information concerning do's and dont's for Unicode
> support in upcoming Delphi versions?
>
> In recent threads concerning Delphi/Unicode, I think the topic of being
> prepared for Unicode has not been addressed so much, at least as far as I
> can see.
>
> On one side, we have applications that have already been written whose
> authors are rightfully concerned about compatibility.
>
> On the other side, we have applications which are yet to be written and do
> not have much threat of being
>
> In the middle, we have applications which are currently being written
> (raises hand) which could benefit from some suggestions on best practices
> to give the applications currently being written to have a chance of being
> ported more easily when D2008/9 is finally released.
>
> --
> Warm Regards,
>
> Lee


