|
#11
| |||
| |||
| >> What is the SizeOf() for a Unicode char variable? "Nick Hodges (Embarcadero)" <nick.hodges@codegear.com> wrote > SizeOf(Char) is now 2. Not up to 4 bytes? How can you encode 100,000 characters in only 2 bytes when 2^(2*8) = 65536? See http://en.wikipedia.org/wiki/Unicode If you mean UTF-8, why not call it UTF-8? --JohnH |
|
#12
| |||
| |||
| John Herbster wrote: > May I presume like this? > p := @MyString[1]; > Inc(p); > where MyString: string; and p: PChar; > > And how expensive are these operations during CPU execution? The Inc(p) used to add 1, now it adds 2 to the pointer. What difference in performance compared to AnsiString/PAnsiChar do you expect? -- |
|
#13
| |||
| |||
| John, << If you mean UTF-8, why not call it UTF-8? >> It's UTF-16 (Word-sized characters), the same as with Windows 2000 and later. It covers most of the character sets out there, but requires surrogate pairs for more extensive character sets. -- Tim Young Elevate Software www.elevatesoft.com |
|
#14
| |||
| |||
| "Tim Young [Elevate Software]" <timyoung@elevatesoft.com> wrote > It's UTF-16 (Word-sized characters), the same as with Windows 2000 and Tim, Let's try to pin some definitions down. According to http://en.wikipedia.org/wiki/UTF-16 "UTF-16 (16-bit Unicode Transformation Format) is a variable-length character encoding for Unicode" If Windows and the new Delphi really do use UTF-16, how do they handle the variable-length character encodings? Rgds, JohnH |
|
#15
| |||
| |||
| John Herbster wrote: > "UTF-16 (16-bit Unicode Transformation Format) is a variable-length > character encoding for Unicode" > > If Windows and the new Delphi really do use UTF-16, how do they > handle the variable-length character encodings? In pretty much the same way that windows and delphi handle MBCS ANSI codepages currently. See http://en.wikipedia.org/wiki/Multi-byte_character_set "UTF-16 was devised to break free of the 65,536-character limit of the original Unicode (1.x) without breaking compatibility with the 16-bit encoding. In UTF-16, singletons have the range 0000-D7FF and E000-FFFF, lead units the range D800-DBFF and trail units the range DC00-DFFF. The lead and trail units, called in Unicode terminology high surrogates and low surrogates respectively, map 1024×1024 or 1,048,576 numbers, making for a maximum of possible 1,114,112 codepoints in Unicode." -- |
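The ranges quoted above are enough to walk a UTF-16 sequence one code point at a time. The following is a minimal sketch in Python rather than Delphi, and the names `classify_unit` and `count_code_points` are made up for illustration; it only shows the idea of treating a lead unit followed by a trail unit as one character.

```python
def classify_unit(unit: int) -> str:
    """Classify a 16-bit code unit using the ranges quoted above."""
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"       # high surrogate
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"      # low surrogate
    return "singleton"      # 0000-D7FF or E000-FFFF


def count_code_points(units: list[int]) -> int:
    """Count code points in a list of UTF-16 code units.

    A lead unit immediately followed by a trail unit counts as one
    code point; an unpaired surrogate is counted on its own.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            classify_unit(units[i]) == "lead"
            and i + 1 < len(units)
            and classify_unit(units[i + 1]) == "trail"
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# 'A', U+1D11E (encoded as D834 DD1E), 'B': four code units, three code points
units = [0x0041, 0xD834, 0xDD1E, 0x0042]
print(len(units), count_code_points(units))
```

The same pattern applies to MBCS code pages: the length in storage units and the length in characters are different numbers, and code that cares about the latter has to look at each unit.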
|
#16
| |||
| |||
| >> Show us how to iterate through a string of characters with pointers. > Exactly as before -- but don't assume a character is of size 1. p: Pointer; p := @MyString[1]; Inc(p); ? |
|
#17
| |||
| |||
| Ian Boyd wrote: > p: Pointer; > p := @MyString[1]; > Inc(p); That will behave differently, since it (appears) to be assuming the SizeOf(Char) = SizeOf(Pointer), which is no longer true. -- Nick Hodges Delphi Product Manager - Embarcadero http://blogs.codegear.com/nickhodges |
|
#18
| |||
| |||
| Ian Boyd wrote: > p: Pointer; > p := @MyString[1]; > Inc(p); You can't do pointer math with untyped pointers. Never worked before. Not going to start working suddenly. "Pointer" does not know how big whatever it points to is. -- |
|
#19
| |||
| |||
| "John Herbster" <herb-sci1_AT_sbcglobal.net> wrote in message news:4884dcfb$1@newsgroups.borland.com... > Not up to 4 bytes? No. "Char" will now be an alias for WideChar, whereas it was an alias for AnsiChar in previous versions. Thus SizeOf(Char) will be 2 now. > How can you encode 100,000 characters in only 2-bytes when > 2^(2*8) = 65536? The new UnicodeString type will use UTF-16 (just like WideString does) in order to match how Windows implements Unicode. In UTF-16, Unicode code points (logical characters) less than $10000 can be encoded using their original value as-is in a single WideChar. Unicode codepoints at or above $10000 have to be encoded as two WideChars working together (known as a "surrogate pair"). The use of surrogate pairs allows UTF-16 to support up to 2,097,152 Unicode codepoints. Anything more than that requires UTF-32 instead, which Tiburon will also support via the separate UCS4String and UCS4Char data types, which are 32-bit. Gambit |
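As a rough illustration of the surrogate-pair rule described in this post, here is a small Python sketch (not Delphi, and not taken from any Tiburon source). The function names `encode_utf16` and `decode_pair` are hypothetical; the arithmetic is the standard UTF-16 scheme.

```python
def encode_utf16(code_point: int) -> list[int]:
    """Encode one Unicode code point as one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not valid code points")
    if code_point < 0x10000:
        return [code_point]              # stored as-is in one unit
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


def decode_pair(high: int, low: int) -> int:
    """Combine a high and low surrogate back into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print([hex(u) for u in encode_utf16(0x41)])
print([hex(u) for u in encode_utf16(0x1D11E)])
```

Each surrogate carries 10 bits of the offset, which is why a pair adds exactly 20 bits of range on top of the single-unit values.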
|
#20
| |||
| |||
| "Remy Lebeau (TeamB)" <no.spam@no.spam.com> wrote in message news:4884f8ca$1@newsgroups.borland.com... > The use of surrogate pairs allows UTF-16 to support up to > 2,097,152 Unicode codepoints. Correction: UTF-16 supports 1,112,064 Unicode codepoints ($00000000 - $0010FFFF, minus $0000D800 - $0000DFFF which are reserved). Gambit |
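The figures in this correction can be checked with a few lines of arithmetic. This is only a sanity-check sketch in Python; the constant names are invented for the example.

```python
# Code point range is $000000 - $10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1

# $D800 - $DFFF are reserved for surrogates and never stand alone.
SURROGATE_VALUES = 0xDFFF - 0xD800 + 1

# What is left is what UTF-16 can actually represent.
ENCODABLE = TOTAL_CODE_POINTS - SURROGATE_VALUES

# Same total seen another way: 65,536 single-unit values plus
# 1024 lead units times 1024 trail units.
SINGLE_UNIT_VALUES = 0x10000
PAIR_COMBINATIONS = 1024 * 1024

print(TOTAL_CODE_POINTS, SURROGATE_VALUES, ENCODABLE)
print(SINGLE_UNIT_VALUES + PAIR_COMBINATIONS)
```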