Pre Delphi 2008-9 Unicode Do's and Dont's

  #11  
Old 07-21-2008, 03:01 PM
John Herbster
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

>> What is the SizeOf() for a Unicode char variable?

"Nick Hodges (Embarcadero)" <nick.hodges@codegear.com> wrote
> SizeOf(Char) is now 2.


Not up to 4 bytes?
How can you encode 100,000 characters in only 2-bytes when 2^(2*8) = 65536?
See
http://en.wikipedia.org/wiki/Unicode

If you mean UTF-8, why not call it UTF-8?

--JohnH
  #12  
Old 07-21-2008, 03:16 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John Herbster wrote:

> May I presume like this?
> p := @MyString[1];
> Inc(p);
> where MyString: string; and p: PChar;
>
> And how expensive are these operations during CPU execution?


The Inc(p) used to add 1, now it adds 2 to the pointer. What difference
in performance compared to AnsiString/PAnsiChar do you expect?
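
To make the point concrete, here is a minimal sketch of walking a string with a typed PChar. The program name and the sample text are assumptions for illustration only. The same source compiles whether Char is AnsiChar or WideChar; only the byte step of Inc(p) changes.

```delphi
program PCharWalk;

{$APPTYPE CONSOLE}

var
  MyString: string;
  p: PChar;
  Count: Integer;
begin
  MyString := 'Delphi';  // sample text, chosen only for illustration

  // PChar(MyString) points at the first character and is still a valid
  // pointer to a #0 terminator when the string is empty, which is not
  // the case for @MyString[1].
  p := PChar(MyString);

  Count := 0;
  while p^ <> #0 do  // note: this stops at the first embedded #0
  begin
    Inc(p);          // steps SizeOf(Char) bytes: 1 for AnsiChar, 2 for WideChar
    Inc(Count);
  end;

  Writeln('Elements walked: ', Count);
  Writeln('SizeOf(Char) = ', SizeOf(Char));
end.
```

The loop body is the same one-instruction pointer increment in both cases, so the per-element cost should not differ in any way that matters.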

--

  #13  
Old 07-21-2008, 03:16 PM
Tim Young [Elevate Software]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John,

<< If you mean UTF-8, why not call it UTF-8? >>

It's UTF-16 (Word-sized characters), the same as with Windows 2000 and
later. A single Word covers most of the character sets out there, but
characters outside the Basic Multilingual Plane (code points at or above
$10000) require surrogate pairs, i.e. two Words each.

--
Tim Young
Elevate Software
www.elevatesoft.com


  #14  
Old 07-21-2008, 03:20 PM
John Herbster
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's


"Tim Young [Elevate Software]" <timyoung@elevatesoft.com> wrote

> It's UTF-16 (Word-sized characters), the same as with Windows 2000 and


Tim,

Let's try to pin some definitions down.

According to http://en.wikipedia.org/wiki/UTF-16

"UTF-16 (16-bit Unicode Transformation Format) is a variable-length
character encoding for Unicode"

If Windows and the new Delphi really do use UTF-16, how do they
handle the variable-length character encodings?

Rgds, JohnH
  #15  
Old 07-21-2008, 03:33 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

John Herbster wrote:

> "UTF-16 (16-bit Unicode Transformation Format) is a variable-length
> character encoding for Unicode"
>
> If Windows and the new Delphi really do use UTF-16, how do they
> handle the variable-length character encodings?


In pretty much the same way that Windows and Delphi handle MBCS ANSI
codepages currently.

See http://en.wikipedia.org/wiki/Multi-byte_character_set

"UTF-16 was devised to break free of the 65,536-character limit of the
original Unicode (1.x) without breaking compatibility with the 16-bit
encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF,
lead units the range D800-DBFF and trail units the range DC00-DFFF. The
lead and trail units, called in Unicode terminology high surrogates and
low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making
for a maximum of possible 1,114,112 codepoints in Unicode."
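
As a rough sketch of what "variable-length" means in practice, the following counts code points in a UTF-16 string by treating a lead unit followed by a trail unit as one character, using the ranges quoted above. CountCodePoints is a hypothetical helper written for this example, not an RTL routine, and WideString is used so the sketch does not depend on which compiler version defines string as Unicode.

```delphi
program CountCodePointsDemo;

{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points rather than 16-bit units.
// An unpaired surrogate is simply counted as one item.
function CountCodePoints(const S: WideString): Integer;
var
  I: Integer;
  W: Word;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    W := Ord(S[I]);
    // Lead unit $D800..$DBFF followed by trail unit $DC00..$DFFF
    // forms a single code point spanning two WideChars.
    if (W >= $D800) and (W <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)
    else
      Inc(I);
    Inc(Result);
  end;
end;

var
  S: WideString;
begin
  // 'A' followed by U+1D400, which UTF-16 stores as $D835 $DC00.
  S := 'A';
  S := S + WideChar($D835) + WideChar($DC00);

  Writeln('16-bit units: ', Length(S));          // 3
  Writeln('Code points:  ', CountCodePoints(S)); // 2
end.
```

Length() keeps reporting 16-bit units, so code that needs "characters" in the user's sense has to scan like this, much as MBCS code has to check for lead bytes.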

--

  #16  
Old 07-21-2008, 04:08 PM
Ian Boyd
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

>> Show us how to iterate through a string of characters with pointers.
> Exactly as before -- but don't assume a character is of size 1.


p: Pointer;
p := @MyString[1];
Inc(p);

?



  #17  
Old 07-21-2008, 04:34 PM
Nick Hodges (Embarcadero)
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

Ian Boyd wrote:

> p: Pointer;
> p := @MyString[1];
> Inc(p);


That will behave differently, since it appears to be assuming that
SizeOf(Char) = SizeOf(Pointer), which is no longer true.

--
Nick Hodges
Delphi Product Manager - Embarcadero
http://blogs.codegear.com/nickhodges
  #18  
Old 07-21-2008, 04:35 PM
Thorsten Engler [NexusDB]
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's

Ian Boyd wrote:

> p: Pointer;
> p := @MyString[1];
> Inc(p);


You can't do pointer math with untyped pointers. Never worked before.
Not going to start working suddenly. "Pointer" does not know how big
whatever it points to is.
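
A small sketch of the difference the pointer's type makes, assuming nothing beyond the standard PChar and PAnsiChar types (the variable names are made up for the example). The step taken by Inc depends on the declared element type, which is exactly the information an untyped Pointer lacks.

```delphi
program TypedPointerStep;

{$APPTYPE CONSOLE}

var
  S: string;
  Base, pc: PChar;
  pa: PAnsiChar;
begin
  S := 'AB';
  Base := PChar(S);

  pc := Base;
  Inc(pc);  // typed as PChar: steps SizeOf(Char) bytes

  pa := PAnsiChar(Pointer(Base));
  Inc(pa);  // typed as PAnsiChar: steps exactly 1 byte

  // Subtracting two PAnsiChar values yields the distance in bytes.
  Writeln('PChar step:     ',
    PAnsiChar(Pointer(pc)) - PAnsiChar(Pointer(Base)), ' byte(s)');
  Writeln('PAnsiChar step: ',
    pa - PAnsiChar(Pointer(Base)), ' byte(s)');
end.
```

Code that deliberately wants byte-wise stepping should say so with a byte-sized pointer type rather than relying on Char being one byte.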

--

  #19  
Old 07-21-2008, 04:58 PM
Remy Lebeau (TeamB)
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's


"John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message
news:4884dcfb$1@newsgroups.borland.com...

> Not up to 4 bytes?


No. "Char" will now be an alias for WideChar, whereas it was an alias for
AnsiChar in previous versions. Thus SizeOf(Char) will be 2 now.

> How can you encode 100,000 characters in only 2-bytes when
> 2^(2*8) = 65536?


The new UnicodeString type will use UTF-16 (just like WideString does) in
order to match how Windows implements Unicode.

In UTF-16, Unicode code points (logical characters) below $10000 can be
encoded using their original value as-is in a single WideChar. Unicode
code points at or above $10000 have to be encoded as two WideChars
working together (known as a "surrogate pair"). The use of surrogate pairs
allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more
than that requires UTF-32 instead, which Tiburon will also support via the
separate UCS4String and UCS4Char data types, which are 32-bit.
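
For illustration, the split into two WideChars can be sketched as below. EncodeSurrogatePair is a hypothetical helper written for this post, not an RTL function, and it assumes the caller passes a code point in the range $10000..$10FFFF.

```delphi
program SurrogatePairDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;  // for IntToHex

// Hypothetical helper; assumes $10000 <= CodePoint <= $10FFFF.
procedure EncodeSurrogatePair(CodePoint: Cardinal; out Lead, Trail: Word);
var
  V: Cardinal;
begin
  V := CodePoint - $10000;            // leaves a 20-bit value
  Lead := Word($D800 or (V shr 10));  // top 10 bits go into the lead unit
  Trail := Word($DC00 or (V and $3FF)); // low 10 bits go into the trail unit
end;

var
  Lead, Trail: Word;
begin
  // U+1D11E (MUSICAL SYMBOL G CLEF) lies above the 16-bit range.
  EncodeSurrogatePair($1D11E, Lead, Trail);
  Writeln('Lead:  $', IntToHex(Lead, 4));   // $D834
  Writeln('Trail: $', IntToHex(Trail, 4));  // $DD1E
end.
```

Each unit carries 10 bits of payload, which is where the 1024 x 1024 figure for surrogate pairs comes from.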


Gambit


  #20  
Old 07-21-2008, 05:23 PM
Remy Lebeau (TeamB)
Guest
 
Default Re: Pre Delphi 2008-9 Unicode Do's and Dont's


"Remy Lebeau (TeamB)" <no.spam@no.spam.com> wrote in message
news:4884f8ca$1@newsgroups.borland.com...

> The use of surrogate pairs allows UTF-16 to support up to
> 2,097,152 Unicode codepoints.


Correction: UTF-16 supports 1,112,064 Unicode codepoints ($00000000 -
$0010FFFF, minus $0000D800 - $0000DFFF which are reserved).
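
The arithmetic behind that figure, and behind the 1,114,112 quoted earlier in the thread, can be checked with a few constants. The constant names are made up for this sketch.

```delphi
program CodePointCount;

{$APPTYPE CONSOLE}

const
  // Every value from $000000 through $10FFFF.
  TotalCodePoints = $10FFFF + 1;                    // 1,114,112
  // The reserved surrogate range $D800..$DFFF.
  SurrogateCount  = $DFFF - $D800 + 1;              // 2,048
  // What is left for actual characters.
  UsableCodePoints = TotalCodePoints - SurrogateCount; // 1,112,064
begin
  Writeln('Total:      ', TotalCodePoints);
  Writeln('Surrogates: ', SurrogateCount);
  Writeln('Usable:     ', UsableCodePoints);
end.
```

So the two numbers seen in this thread differ only by whether the 2,048 surrogate values are counted.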


Gambit

