Python - "Trouble saving unicode text to file" in Programming Languages


Old 05-07-2005   #1
..e..gle..
 
Default Trouble saving unicode text to file

I'm working on a program that is supposed to save
various information to text files.

Because the program is in Swedish, I have to use
Unicode text for the ÅÄÖ letters.

When I run the following testscript I get an error message.

# -*- coding: cp1252 -*-

titel = "åäö"
titel = unicode(titel)

print "Titel type", type(titel)

fil = open("testfil.txt", "w")
fil.write(titel)
fil.close()


Traceback (most recent call last):
  File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    titel = unicode(titel)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)


I need to have the titel variable in unicode format because when I
type åäö in an Entry box in Tkinter, the value is automatically
converted to unicode.

Does anyone know an easy way to save this unicode text to a file?

 
Old 05-08-2005   #2
.... ..ntana..
 
Default Re: Trouble saving unicode text to file


Svennglenn> Traceback (most recent call last):
Svennglenn> File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw",
Svennglenn> line 5, in ?
Svennglenn> titel = unicode(titel)
Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
Svennglenn> ordinal not in range(128)

Try:

import codecs

titel = "åäö"
titel = unicode(titel, "iso-8859-1")
fil = codecs.open("testfil.txt", "w", "iso-8859-1")
fil.write(titel)
fil.close()

Skip
 
Old 05-08-2005   #3
.... ..ch..
 
Default Re: Trouble saving unicode text to file

On 7 May 2005 14:22:56 -0700, "Svennglenn" <Danielnord15@yahoo.se>
wrote:

>I'm working on a program that is supposed to save
>various information to text files.
>
>Because the program is in Swedish, I have to use
>Unicode text for the ÅÄÖ letters.


"program is in Swedish": to the extent that this means "names of
variables are in Swedish", this is quite irrelevant. The variable
names could be in some other language, like Slovak, Slovenian, Swahili
or Strine. Your problem(s) (PLURAL) arise from the fact that your text
data is in Swedish, the representation of which uses a few non-ASCII
characters. Problem 1 is the representation of Swedish in text
constants in your program; this is causing the exception you show
below but curiously didn't ask for help with.
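
A minimal sketch of Problem 1 for readers on Python 3 (an assumption; the thread's own code is Python 2, where unicode(titel) with no encoding argument uses the default encoding, normally ASCII). The bytes object stands in for the string literal as stored in a cp1252 source file, and the names raw and error_message are illustrative only.

```python
# Python 3 sketch (assumption: the thread itself targets Python 2).
# "åäö" as stored in a cp1252-encoded source file: three bytes.
raw = b"\xe5\xe4\xf6"

try:
    raw.decode("ascii")            # roughly what unicode(titel) attempted
except UnicodeDecodeError as exc:
    error_message = str(exc)       # byte 0xe5 is outside range(128)
else:
    error_message = None

# Naming the real encoding makes the decode succeed.
titel = raw.decode("cp1252")

print(error_message)
print(titel == "\u00e5\u00e4\u00f6")
```

The same bytes fail or succeed depending only on which codec is named, which is why the literal should be Unicode from the start.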

>
>When I run the following testscript I get an error message.
>
># -*- coding: cp1252 -*-
>
>titel = "åäö"
>titel = unicode(titel)


You should use titel = u"åäö"
Works, and saves wear & tear on your typing fingers.

>
>print "Titel type", type(titel)
>
>fil = open("testfil.txt", "w")
>fil.write(titel)
>fil.close()
>
>
>Traceback (most recent call last):
> File "D:\Documents and
>Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw",
>line 5, in ?
> titel = unicode(titel)
>UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
>ordinal not in range(128)
>
>
>I need to have the titel variable in unicode format because when I
>type åäö in an Entry box in Tkinter, the value is automatically
>converted to unicode.


The general rule in working with Unicode can be expressed something
like "work in Unicode all the time i.e. decode legacy text as early as
possible; encode into legacy text (if absolutely required) as late as
possible (corollary: if forced to communicate with another
Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
cp666)"
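
One way to picture the rule, sketched in Python 3 (an assumption; the thread uses Python 2). The helper names read_text and write_bytes are hypothetical and exist only to mark the two boundaries where bytes and text meet.

```python
# Python 3 sketch; function and variable names are hypothetical.
def read_text(data: bytes, encoding: str) -> str:
    """Decode at the input boundary, as early as possible."""
    return data.decode(encoding)


def write_bytes(text: str, encoding: str = "utf-8") -> bytes:
    """Encode at the output boundary, as late as possible."""
    return text.encode(encoding)


legacy_input = b"\xe5\xe4\xf6"              # cp1252 bytes for "åäö"
text = read_text(legacy_input, "cp1252")    # Unicode text from here on
text = text.upper()                         # all processing happens on text
wire = write_bytes(text)                    # bytes again only at the edge
```

Everything between the two helper calls operates on Unicode text and never needs to know which legacy encoding the input arrived in.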

Applying this to Problem 1 is, as you've seen, trivial: To the extent
that you have text constants at all in your program, they should be in
Unicode.

Now after all that, Problem 2: how to save Unicode text to a file?

Which raises a question: who or what is going to read your file? If the
reader is a Unicode-aware application, and never a human, you might like
to consider encoding the text as utf-16. If it is a Unicode-aware app
plus an occasional human developer, or the text is not CJK and you want
to save space, try utf-8. For general use on Windows boxes in the Latin1
subset of the universe, you'll no doubt want to encode as cp1252.
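
To make the space trade-off concrete, here is a small Python 3 sketch (an assumption; the thread uses Python 2) that counts the bytes each encoding produces. The CJK sample string is an illustrative addition, not something from the original question.

```python
# Python 3 sketch; the sample strings are illustrative.
titel = "\u00e5\u00e4\u00f6"      # "åäö"
cjk = "\u65e5\u672c\u8a9e"        # three CJK characters, for comparison

# Byte counts for the Swedish sample. Python's "utf-16" codec writes a
# 2-byte byte order mark before the text.
sizes = {name: len(titel.encode(name))
         for name in ("utf-16", "utf-8", "cp1252")}

# For CJK text, utf-8 needs 3 bytes per character where utf-16 needs 2.
cjk_utf8 = len(cjk.encode("utf-8"))
cjk_utf16 = len(cjk.encode("utf-16-le"))    # "-le" variant writes no BOM

print(sizes)
print(cjk_utf8, cjk_utf16)
```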

>
>Does anyone know an easy way to save this unicode text to a file?


Read the docs of the codecs module -- skipping over how to register
codecs, just concentrate on using them.

Try this:

# -*- coding: cp1252 -*-
import codecs
titel = u"åäö"
print "Titel type", type(titel)
f1 = codecs.open('titel.u16', 'wb', 'utf_16')
f2 = codecs.open('titel.u8', 'w', 'utf_8')
f3 = codecs.open('titel.txt', 'w', 'cp1252')
# much later, maybe in a different function
# maybe even in a different module
f1.write(titel)
f2.write(titel)
f3.write(titel)
# much later
f1.close()
f2.close()
f3.close()

Note: doing it this way follows the "encode as late as possible" rule
and documents the encoding for the whole file, in one place. Other
approaches which might use the .encode() method of Unicode strings and
then write the 8-bit-string results at different times and in
different functions/modules are somewhat less clean and more prone to
mistakes.
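
For comparison, a rough Python 3 counterpart of the script above (an assumption; the thread targets Python 2, where codecs.open is the tool). In Python 3 the built-in open() accepts an encoding argument, so the encoding is still stated once, where the file is opened. The temporary directory and the round_trips dictionary are illustrative scaffolding.

```python
# Python 3 sketch; the temporary directory and dict names are illustrative.
import os
import tempfile

titel = "\u00e5\u00e4\u00f6"      # "åäö"
targets = {"titel.u16": "utf-16", "titel.u8": "utf-8", "titel.txt": "cp1252"}

round_trips = {}
with tempfile.TemporaryDirectory() as out_dir:
    for filename, encoding in targets.items():
        path = os.path.join(out_dir, filename)
        # The encoding is declared once, where the file is opened.
        with open(path, "w", encoding=encoding) as f:
            f.write(titel)          # write text; the file object encodes it
        with open(path, "r", encoding=encoding) as f:
            round_trips[filename] = f.read()
    # Look at the raw bytes of the cp1252 file for comparison.
    with open(os.path.join(out_dir, "titel.txt"), "rb") as f:
        raw_cp1252 = f.read()

print(all(value == titel for value in round_trips.values()))
```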

HTH,
John
 
Old 05-08-2005   #4
.... ..ch..
 
Default Re: Trouble saving unicode text to file

On Sat, 7 May 2005 17:25:28 -0500, Skip Montanaro <skip@pobox.com>
wrote:

>
> Svennglenn> Traceback (most recent call last):
> Svennglenn> File "D:\Documents and
> Svennglenn> Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw",
> Svennglenn> line 5, in ?
> Svennglenn> titel = unicode(titel)
> Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0:
> Svennglenn> ordinal not in range(128)
>
>Try:
>
> import codecs
>
> titel = "åäö"
> titel = unicode(titel, "iso-8859-1")
> fil = codecs.open("testfil.txt", "w", "iso-8859-1")
> fil.write(titel)
> fil.close()
>


I tried that, with this result:

C:\junk>python skip.py
sys:1: DeprecationWarning: Non-ASCII character '\xe5' in file skip.py
on line 3, but no encoding declared; see
http://www.python.org/peps/pep-0263.html for details

1. An explicit PEP 263 declaration (which the OP already had!) should
be used, rather than relying on the default, which doesn't work in
general if you substituted say Polish or Russian for Swedish.

2. My bet is that 'cp1252' is more likely to be appropriate for the OP
than 'iso-8859-1'. The encodings are quite different in range(0x80,
0xA0). They coincidentally give the same result for the OP's limited
sample. However if for example the OP needs to use the euro character
which is 0x80 in cp1252, it wouldn't show up as a problem in the
limited scripts we've been playing with so far, but 0x80 in the script
is sure not going to look like a euro in Tkinter if it's being decoded
via iso-8859-1. Your rationale for using iso-8859-1 when the OP had
already mentioned cp1252 was ... what?
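
The difference described in point 2 can be checked directly. This is a Python 3 sketch (an assumption; the thread uses Python 2) and the variable names are illustrative.

```python
# Python 3 sketch; variable names are illustrative.
sample = b"\xe5\xe4\xf6"          # "åäö" as single-byte text

# For the OP's sample, both codecs produce the same characters.
same_for_sample = sample.decode("cp1252") == sample.decode("iso-8859-1")

# Byte 0x80 is where they part ways.
euro_byte = b"\x80"
as_cp1252 = euro_byte.decode("cp1252")        # U+20AC, the euro sign
as_latin1 = euro_byte.decode("iso-8859-1")    # U+0080, a control character

print(same_for_sample)
print(hex(ord(as_cp1252)), hex(ord(as_latin1)))
```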



 
Old 05-08-2005   #5
.... ..n ..ningh..
 
Default Re: Trouble saving unicode text to file

Hi All--

John Machin wrote:
>
>
> The general rule in working with Unicode can be expressed something
> like "work in Unicode all the time i.e. decode legacy text as early as
> possible; encode into legacy text (if absolutely required) as late as
> possible (corollary: if forced to communicate with another
> Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
> cp666)"
>


+1 QOTW

And true, too.

<i-especially-like-the-cp666-part>-ly y'rs,
Ivan
----------------------------------------------
Ivan Van Laningham
God N Locomotive Works
http://www.andi-holmes.com/
http://www.foretec.com/python/worksh...oceedings.html
Army Signal Corps: Cu Chi, Class of '70
Author: Teach Yourself Python in 24 Hours
 
Old 05-08-2005   #6
..rt.. .. ..w..
 
Default Re: Trouble saving unicode text to file

Svennglenn wrote:
> # -*- coding: cp1252 -*-
>
> titel = "åäö"
> titel = unicode(titel)


Instead of this, just write

# -*- coding: cp1252 -*-

titel = u"åäö"

> fil = open("testfil.txt", "w")
> fil.write(titel)
> fil.close()


Instead of this, write

import codecs
fil = codecs.open("testfil.txt", "w", "cp1252")
fil.write(titel)
fil.close()

Instead of cp1252, consider using ISO-8859-1.

Regards,
Martin
 
Old 05-08-2005   #7
.... ..ch..
 
Default Re: Trouble saving unicode text to file

On Sun, 08 May 2005 11:23:49 +0200, "Martin v. Löwis"
<martin@v.loewis.de> wrote:

>Svennglenn wrote:
>> # -*- coding: cp1252 -*-
>>
>> titel = "åäö"
>> titel = unicode(titel)

>
>Instead of this, just write
>
># -*- coding: cp1252 -*-
>
>titel = u"åäö"
>
>> fil = open("testfil.txt", "w")
>> fil.write(titel)
>> fil.close()

>
>Instead of this, write
>
>import codecs
>fil = codecs.open("testfil.txt", "w", "cp1252")
>fil.write(titel)
>fil.close()
>
>Instead of cp1252, consider using ISO-8859-1.


Martin, I can't guess the reason for this last suggestion; why should
a Windows system use iso-8859-1 instead of cp1252?

Regards,
John


 
Old 05-08-2005   #8
..rt.. .. ..w..
 
Default Re: Trouble saving unicode text to file

John Machin wrote:
> Martin, I can't guess the reason for this last suggestion; why should
> a Windows system use iso-8859-1 instead of cp1252?


Windows users often think that windows-1252 is the same thing as
iso-8859-1, and then exchange data in windows-1252, but declare them
as iso-8859-1 (in particular, this is common for HTML files).
iso-8859-1 is more portable than windows-1252, so it should be
preferred when the data need to be exchanged across systems.

Regards,
Martin
 
Old 05-09-2005   #9
.... ..ch..
 
Default Re: Trouble saving unicode text to file

On Sun, 08 May 2005 19:49:42 +0200, "Martin v. Löwis"
<martin@v.loewis.de> wrote:

>John Machin wrote:
>> Martin, I can't guess the reason for this last suggestion; why should
>> a Windows system use iso-8859-1 instead of cp1252?

>
>Windows users often think that windows-1252 is the same thing as
>iso-8859-1, and then exchange data in windows-1252, but declare them
>as iso-8859-1 (in particular, this is common for HTML files).
>iso-8859-1 is more portable than windows-1252, so it should be
>preferred when the data need to be exchanged across systems.


Martin, it seems I'm still a long way short of enlightenment; please
bear with me:

Terminology disambiguation: what I call "users" wouldn't know what
'cp1252' and 'iso-8859-1' were. They're not expected to know. They
just type in whatever characters they can see on their keyboard or
find in the charmap utility. It's what I'd call 'admins' and
'developers' who should know better, but often don't.

1. When exchanging data across systems, should not utf-8 be
preferred???

2. If the Windows *users* have been using characters that are in
cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
will cause an exception.

>>> import unicodedata
>>> euro_win = chr(128)
>>> euro_uc = euro_win.decode('cp1252')
>>> euro_uc
u'\u20ac'
>>> unicodedata.name(euro_uc)
'EURO SIGN'
>>> euro_iso = euro_uc.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac' in position 0: ordinal not in range(256)
>>>
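
The same check as a Python 3 sketch (an assumption; the session above is Python 2). The helper can_encode is hypothetical: it just wraps the encode call so that a target encoding can be tested before any data is written.

```python
# Python 3 sketch; can_encode is a hypothetical helper.
import unicodedata


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


euro = b"\x80".decode("cp1252")
name = unicodedata.name(euro)

for encoding in ("cp1252", "iso-8859-1", "utf-8"):
    print(encoding, can_encode(euro, encoding))
```

Of the three, only iso-8859-1 rejects the euro sign.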


I find it a bit hard to imagine that the euro sign wouldn't get a fair
bit of usage in Swedish data processing even if it's not their own
currency.

3. How portable is a character set that doesn't include the euro sign?

Regards,
John
 
