Re: Re: Unicode Support
No, they cannot be coded in 2, 3, or 4 bytes. They can be coded in 2 or 4
bytes. The 4-byte encodings use "surrogate pairs" of 2-byte values within
ranges that are specifically designated for such use and that have no
meaning individually. Please read the Unicode standard on the subject
(search unicode.org for "surrogate pair"); it's very clearly explained
there.

Sincerely,

John Fultz
jfultz at wolfram.com
User Interface Group
Wolfram Research, Inc.

On Wed, 30 Mar 2005 03:22:09 -0500 (EST), dh wrote:
> Hello John,
>
> You write:
> "Saying that Mathematica uses 16-bit Unicode characters is equivalent to
> saying that Mathematica uses UTF-16."
> and Mathematica Help:
> "MathLink strings and symbols can contain characters with codes ranging
> from 0 to 65535, that is, characters that can be represented by unsigned
> 16-bit integers."
>
> Now, the Unicode standard defines 1'114'111 (hex 10ffff) characters.
> This is more than what is mentioned in the Help (65535). These 1'114'111
> characters, coded in UTF-16, need 2, 3, or 4 bytes.
>
> Has Wolfram truncated the available characters to those that are
> represented by 2 bytes in UTF-16?
>
> Please clarify.
>
> Sincerely, Daniel
>
>
> John Fultz wrote:
>> On Sat, 26 Mar 2005 02:39:43 -0500 (EST), Zhu Chongkai wrote:
>>
>>> Hi all,
>>>
>>> The Mathematica Book says that Mathematica supports Unicode
>>> characters. And MathLink says that a Unicode character in
>>> Mathematica is 16 bits. But the latest Unicode Standard uses 32 bits
>>> to encode a character. It seems to me that Mathematica's Unicode
>>> support is outdated, based on an old version of the Unicode Standard
>>> that contains fewer than 65536 characters. Will the next version of
>>> Mathematica use a 32-bit encoding? Or am I wrong?
>>>
>>> Cheers,
>>> Zhu Chongkai
>>> http://www.neilvandyke.org/mrmathematica/
>>>
>>
>> Saying that Mathematica uses 16-bit Unicode characters is equivalent to
>> saying that Mathematica uses UTF-16. UTF-16 can represent any Unicode
>> character, and has been able to do so since at least Unicode 2.0 (and
>> quite possibly earlier). It does so by using a reserved block of 16-bit
>> values to represent non-plane 0 Unicode characters as a pair of values
>> (known as a surrogate pair; see section 5.4 of the Unicode standard for
>> more info). So, there is no need to change from a 16-bit encoding in
>> order to support characters outside of the plane 0 range.
>>
>> MathLink supports this now. It's still just a stream of 16-bit
>> characters. Mathematica can also represent the characters as surrogate
>> pairs, but doesn't yet treat them as unitary characters for the purpose
>> of string manipulation and text drawing operations. That's something
>> we'll add to a future release.
>>
>> Sincerely,
>>
>> John Fultz
>> jfultz at wolfram.com
>> User Interface Group
>> Wolfram Research, Inc.
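
For readers who want to see the surrogate-pair arithmetic discussed in this thread, here is a minimal Python sketch. It is an illustration only, not Mathematica or MathLink code, and the function name `to_utf16_units` is hypothetical. It shows why a code point occupies either one 16-bit unit (2 bytes) or two (4 bytes), never 3 bytes.

```python
def to_utf16_units(code_point):
    """Return the list of 16-bit UTF-16 code units for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # This block is reserved for surrogates; its values are not
        # characters on their own.
        raise ValueError("surrogate values have no meaning individually")
    if code_point < 0x10000:
        # Plane 0: a single 16-bit unit, i.e. 2 bytes.
        return [code_point]
    # Other planes: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]                # two 16-bit units, i.e. 4 bytes


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1D11E, 0x10FFFF):
        units = to_utf16_units(cp)
        print(f"U+{cp:04X} ->", " ".join(f"{u:04X}" for u in units))
```

The result can be cross-checked against Python's built-in codec: `chr(0x1D11E).encode("utf-16-be")` yields the four bytes D8 34 DD 1E, matching the pair computed above.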