MathGroup Archive 2011

Re: Import, ReadList, and Unicode

  • To: mathgroup at smc.vnet.net
  • Subject: [mg115170] Re: Import, ReadList, and Unicode
  • From: "Hans Michel" <hmichel at cox.net>
  • Date: Mon, 3 Jan 2011 03:57:14 -0500 (EST)

EO:

Without more information it is difficult to examine your problem.

To set the CharacterEncoding and SystemCharacterEncoding to UTF-8, the
string value is "UTF8", with no dash.
I will assume this is what you did and that the dash in your message is just
an oversight.
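
As a minimal sketch of the point above (not a confirmed fix for this particular file), the lines below list the UTF-related encoding names that a given installation recognizes and then import the file with the undashed name. The file name "file.txt" is taken from the original post; the use of the "Text" format and the Select filter are assumptions made for illustration.

```mathematica
(* List the encoding names this installation recognizes that start with "UTF" *)
Select[$CharacterEncodings, StringMatchQ[#, "UTF" ~~ ___] &]

(* Import the file as text using the undashed encoding name *)
txt = Import["file.txt", "Text", CharacterEncoding -> "UTF8"]
```

Checking $CharacterEncodings first avoids guessing at the spelling, since the accepted names may differ between versions.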

Nevertheless, what else might be the source of the behavior you are seeing?

Without knowing what OS you are using I would say that you may have a Byte
Order Mark (BOM) issue.

http://msdn.microsoft.com/en-us/library/dd374101(VS.85).aspx

Some applications add a BOM to the beginning of a file or stream. It is up
to the consuming application to know how to handle the BOM. So including a
BOM is not technically wrong.
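
One way to test for a BOM, sketched here under the assumption that the file is the "file.txt" from the original post and is saved as UTF-8, is to read its first three bytes and compare them with the UTF-8 form of the BOM (hexadecimal EF BB BF). The variable names are hypothetical.

```mathematica
(* Read only the first three bytes of the file as integers in the range 0-255 *)
firstBytes = BinaryReadList["file.txt", "Byte", 3];

(* The UTF-8 encoding of the BOM is the byte sequence EF BB BF *)
hasUTF8BOM = (firstBytes === {239, 187, 191})
```

If this evaluates to True, the file starts with a BOM, and those bytes may show up as stray characters at the start of the first record when the file is read without the right encoding.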

Can you provide more information on the file: for example, how it was saved,
what structure you are expecting, and what the RecordSeparators are (the
default)?

Hans

-----Original Message-----
From: eros olmi [mailto:erosolmiz at hotmail.com] 
Sent: Sunday, January 02, 2011 5:23 AM
To: mathgroup at smc.vnet.net
Subject: [mg115170] [mg115151] Import, ReadList, and Unicode

In Mathematica v8 I am using this convoluted way to read the contents of a
Unicode file saved in UTF-8 format:
txt = Import["file.txt",CharacterEncoding -> "UTF-8"]
w = ReadList[StringToStream[txt], Record, RecordLists -> True]
The output looks like this:
{{unicode chars},{unicode chars},{unicode chars}}
The letters are displayed correctly even if I don't use CharacterEncoding ->
"UTF-8".
But using
ReadList["file.txt", Record]
returns the file as garbage characters, and setting
$SystemCharacterEncoding = "UTF-8"
$CharacterEncoding = $SystemCharacterEncoding
does not cure the problem, since ReadList, unlike Import, does not accept
CharacterEncoding -> "UTF-8" in its syntax.
Is there a cure for this?
Thanks,
eros


