encode/decode misunderstanding - Python
-
encode/decode misunderstanding
Hi, I'm beginning to understand the encode/decode string methods, but I'd
like confirmation that I'm still thinking in the right direction:
I have a file of latin1 encoded text. Let's say I put one line of that file
into a string variable 'tocline', as follows:
tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
import codecs
tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
tocline = tocline.decode('latin1','replace')
tocFile.write(tocline)
tocFile.close()
What I think is that tocFile is wrapped to ensure that anything written to
it is encoded as utf8.
I decode the latin1 string into Python's internal unicode representation, and
that gets written out as utf8.
Questions:
What exactly is tocline when it's read in with that \xe9 and \xf3 in the
string? A latin1 encoded string?
Is my method the right way to write such a line out to a file with utf8
encoding?
If I read in the latin1 file using
codecs.open(filename,encoding='latin1') and write out the utf8 file by
opening with
codecs.open(othername,encoding='utf8'), would I no longer have a problem --
I could just read in latin1 and write out utf8 with no more worries about
encoding?
thanks,
--Tim
-
Re: encode/decode misunderstanding
> If I read in the latin1 file using
> codecs.open(filename,encoding='latin1') and write out the utf8 file by
> opening with
> codecs.open(othername,encoding='utf8'), would I no longer have a
> problem -- I could just read in latin1 and write out utf8 with no more
> worries about encoding?
>
> thanks,
Replying to my own post, I feel so lonely! I guess that silence means I *am*
thinking correctly about the encoding/decoding stuff; I'll keep heading in
this direction unless someone out there sees it differently.....
--Tim
-
Re: encode/decode misunderstanding
Tim Arnold wrote:
> Hi, I'm beginning to understand the encode/decode string methods, but I'd
> like confirmation that I'm still thinking in the right direction:
>
> I have a file of latin1 encoded text. Let's say I put one line of that file
> into a string variable 'tocline', as follows:
> tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
>
> import codecs
> tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
> tocline = tocline.decode('latin1','replace')
> tocFile.write(tocline)
> tocFile.close()
>
> What I think is that tocFile is wrapped to ensure that anything written to
> it is encoded as utf8.
> I decode the latin1 string into Python's internal unicode representation, and
> that gets written out as utf8.
>
> Questions:
> What exactly is tocline when it's read in with that \xe9 and \xf3 in the
> string? A latin1 encoded string?
Yes. A simple, pure byte-string, that happens to contain bytes which
under the latin1-encoding are "correct".
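To make the byte-string versus text distinction concrete, here is a minimal sketch. It assumes Python 3, where `bytes` plays the role of the Python 2 byte string used in this thread and `str` plays the role of the Python 2 unicode object; the variable names are only for illustration.

```python
# Assumption: Python 3. bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'   # latin1-encoded bytes

# decode: bytes -> text. This is where the latin1 interpretation is applied.
text = raw.decode('latin1')

# encode: text -> bytes, this time in a different encoding.
utf8_bytes = text.encode('utf8')

print(type(raw).__name__, len(raw))      # one byte per character in latin1
print(type(text).__name__, len(text))    # same number of code points
print(len(utf8_bytes))                   # each accented letter takes two bytes in utf8
```

The bytes themselves carry no encoding label; calling them "latin1" is only a statement about how they should be decoded.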
> Is my method the right way to write such a line out to a file with utf8
> encoding?
Yes.
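For reference, a sketch of the same write in Python 3 (an assumption; the thread itself uses Python 2). The built-in `open` with an `encoding` argument does the job of the `codecs.open` wrapper here. The helper name `write_utf8` and the temporary directory are hypothetical, chosen so the example does not overwrite a real file.

```python
import os
import tempfile

def write_utf8(path, latin1_bytes):
    """Decode latin1 bytes to text, then let the file object encode it as utf8."""
    text = latin1_bytes.decode('latin1')
    # The text-mode file object encodes every str written to it as utf8.
    with open(path, 'w', encoding='utf8', newline='') as toc_file:
        toc_file.write(text)

# Hypothetical output location inside a fresh temporary directory.
toc_path = os.path.join(tempfile.mkdtemp(), 'mytoc.htm')
write_utf8(toc_path, b'Ficha Datos de p\xe9rdida AND acci\xf3n')

# Read the raw bytes back to see what actually landed on disk.
with open(toc_path, 'rb') as f:
    written = f.read()
print(written)
```

Every byte value is valid latin1, so the 'replace' error handler on the decode step in the original post never has anything to replace.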
> If I read in the latin1 file using
> codecs.open(filename,encoding='latin1') and write out the utf8 file by
> opening with
> codecs.open(othername,encoding='utf8'), would I no longer have a problem --
> I could just read in latin1 and write out utf8 with no more worries about
> encoding?
As long as you don't mix bytestrings and only use unicode-objects, you
should be fine, yes.
Diez
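The read-latin1, write-utf8 approach can be sketched as below, again assuming Python 3 and using the built-in `open` in place of `codecs.open`. The function name `transcode` and the file names are hypothetical. The last part shows what "mixing" looks like: Python 3 refuses to combine text and bytes outright, whereas Python 2 would try an implicit ASCII decode and fail on non-ASCII bytes.

```python
import os
import tempfile

def transcode(src_path, dst_path, src_encoding='latin1', dst_encoding='utf8'):
    """Copy a text file line by line, decoding on read and encoding on write."""
    with open(src_path, 'r', encoding=src_encoding, newline='') as src, \
         open(dst_path, 'w', encoding=dst_encoding, newline='') as dst:
        for line in src:       # each line is already text (str)
            dst.write(line)    # encoded as dst_encoding on the way out

# Build a small latin1 sample file in a temporary directory.
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'toc_latin1.htm')
dst_path = os.path.join(workdir, 'toc_utf8.htm')
with open(src_path, 'wb') as f:
    f.write(b'p\xe9rdida\nacci\xf3n\n')

transcode(src_path, dst_path)
with open(dst_path, 'rb') as f:
    result = f.read()
print(result)

# Mixing text and bytes is the thing to avoid.
try:
    'acci\xf3n' + b'!'
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```

Between the two `open` calls the program only ever handles text, which is why no explicit `encode` or `decode` call is needed.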
-
Re: encode/decode misunderstanding
"Diez B. Roggisch" <deets@nospam.web.de> wrote in message
news:5h3ih4F3il4p1U1@mid.uni-berlin.de...
> As long as you don't mix bytestrings and only use unicode-objects, you
> should be fine, yes.
>
> Diez
wow, I was thinking correctly about encoding! time for a beer!
Diez, thanks very much for confirming my thoughts.
--Tim Arnold