encode/decode misunderstanding - Python


  1. encode/decode misunderstanding

    Hi, I'm beginning to understand the encode/decode string methods, but I'd
    like confirmation that I'm still thinking in the right direction:

    I have a file of latin1 encoded text. Let's say I put one line of that file
    into a string variable 'tocline', as follows:
    tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'

    import codecs
    tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
    tocline = tocline.decode('latin1','replace')
    tocFile.write(tocline)
    tocFile.close()

    What I think is that tocFile is wrapped to ensure that anything written to
    it is in utf8.
    I decode the latin1 string into Python's internal unicode representation, and
    that gets written out as utf8.

    Questions:
    What exactly is tocline when it's read in with that \xe9 and \xf3 in the
    string? A latin1 encoded string?
    Is my method the right way to write such a line out to a file with utf8
    encoding?

    If I read in the latin1 file using
    codecs.open(filename,encoding='latin1') and write out the utf8 file by
    opening it with codecs.open(othername,'w',encoding='utf8'), would I no
    longer have a problem? Could I just read in latin1 and write out utf8 with
    no more worries about encoding?

    thanks,
    --Tim



  2. Re: encode/decode misunderstanding

    > If I read in the latin1 file using
    > codecs.open(filename,encoding='latin1') and write out the utf8 file by
    > opening with
    > codecs.open(othername,encoding='utf8'), would I no longer have a
    > problem -- I could just read in latin1 and write out utf8 with no more
    > worries about encoding?
    >
    > thanks,


    Replying to my own post, I feel so lonely! I guess that silence means I *am*
    thinking correctly about the encoding/decoding stuff; I'll keep heading in
    this direction unless someone out there sees it differently.....

    --Tim



  3. Re: encode/decode misunderstanding

    Tim Arnold wrote:
    > Hi, I'm beginning to understand the encode/decode string methods, but I'd
    > like confirmation that I'm still thinking in the right direction:
    >
    > I have a file of latin1 encoded text. Let's say I put one line of that file
    > into a string variable 'tocline', as follows:
    > tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
    >
    > import codecs
    > tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
    > tocline = tocline.decode('latin1','replace')
    > tocFile.write(tocline)
    > tocFile.close()
    >
    > What I think is that tocFile is wrapped to ensure that anything written to
    > it is in utf8
    > I decode the latin1 string into python's internal unicode encoding and that
    > gets written out as utf8.
    >
    > Questions:
    > what exactly is the tocline when it's read in with that \xe9 and \xf3 in the
    > string? A latin1 encoded string?


    Yes. A simple, pure byte-string, that happens to contain bytes which
    under the latin1-encoding are "correct".
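To make the byte-string point concrete, here is a minimal sketch using the sample line from the question. The `b''` prefix and the variable names `raw`, `text` and `utf8_bytes` are assumptions added for illustration; the prefix lets the same lines run on Python 2.6+ and Python 3, where a bare `'...'` literal would already be unicode.

```python
# A byte string: just bytes, with no encoding attached to it.
raw = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Decoding interprets those bytes as latin1 and yields a unicode object.
text = raw.decode('latin1')

# Encoding turns the unicode object back into bytes, this time as UTF-8.
utf8_bytes = text.encode('utf8')

# Each accented character is one byte in latin1 but two bytes in UTF-8.
print(len(raw), len(text), len(utf8_bytes))
```

The two accented characters are the only ones whose byte representation changes, which is why the UTF-8 result is two bytes longer than the latin1 input.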

    > Is my method the right way to write such a line out to a file with utf8
    > encoding?


    Yes.
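One way to check the method from the question is to write the decoded line and then read the raw bytes back. This is only a sketch under a few assumptions: it uses `io.open` in place of `codecs.open` (both accept an `encoding` argument, and `io.open` behaves the same on Python 2.6+ and 3), it writes to a temporary directory rather than the current one, and the names `path`, `toc_file` and `written` are made up for the example.

```python
import io
import os
import tempfile

tocline = b'Ficha Datos de p\xe9rdida AND acci\xf3n'

# Hypothetical output location; the thread writes to 'mytoc.htm' directly.
path = os.path.join(tempfile.mkdtemp(), 'mytoc.htm')

# Decode first, so that only a unicode object reaches the file wrapper.
text = tocline.decode('latin1')

# The wrapper encodes everything written to it as UTF-8.
with io.open(path, 'w', encoding='utf8') as toc_file:
    toc_file.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, 'rb') as check:
    written = check.read()

print(repr(written))
```

If the file holds the UTF-8 byte pairs for the accented characters rather than the single latin1 bytes, the decode-then-write approach did what the question expects.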

    > If I read in the latin1 file using
    > codecs.open(filename,encoding='latin1') and write out the utf8 file by
    > opening with
    > codecs.open(othername,encoding='utf8'), would I no longer have a problem --
    > I could just read in latin1 and write out utf8 with no more worries about
    > encoding?


    As long as you don't mix in bytestrings and work only with unicode objects,
    you should be fine, yes.

    Diez
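The read-latin1, write-utf8 approach confirmed above could look roughly like the following. It is a sketch, not the poster's actual script: the `convert` helper, the sample file names and the temporary directory are all hypothetical, and `io.open` stands in for `codecs.open` so that the example runs unchanged on Python 2.6+ and Python 3.

```python
import io
import os
import tempfile

def convert(src_path, dst_path, src_encoding='latin1', dst_encoding='utf8'):
    """Copy a text file, decoding on read and encoding on write."""
    # newline='' leaves line endings untouched in both directions.
    with io.open(src_path, 'r', encoding=src_encoding, newline='') as fin:
        with io.open(dst_path, 'w', encoding=dst_encoding, newline='') as fout:
            for line in fin:
                # 'line' is already a unicode object; no manual decode needed.
                fout.write(line)

# Build a small latin1 sample file to convert (hypothetical names).
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'toc_latin1.htm')
dst = os.path.join(workdir, 'toc_utf8.htm')
with open(src, 'wb') as f:
    f.write(b'p\xe9rdida\nacci\xf3n\n')

convert(src, dst)

with open(dst, 'rb') as f:
    result = f.read()
print(repr(result))
```

Because both files are opened with an explicit encoding, the loop only ever handles unicode objects, which is the "don't mix in bytestrings" condition from the reply.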

  4. Re: encode/decode misunderstanding

    "Diez B. Roggisch" <deets@nospam.web.de> wrote in message
    news:5h3ih4F3il4p1U1@mid.uni-berlin.de...
    > Tim Arnold wrote:
    >> Hi, I'm beginning to understand the encode/decode string methods, but I'd
    >> like confirmation that I'm still thinking in the right direction:
    >>
    >> I have a file of latin1 encoded text. Let's say I put one line of that
    >> file into a string variable 'tocline', as follows:
    >> tocline = 'Ficha Datos de p\xe9rdida AND acci\xf3n'
    >>
    >> import codecs
    >> tocFile = codecs.open('mytoc.htm','wb',encoding='utf8',errors='replace')
    >> tocline = tocline.decode('latin1','replace')
    >> tocFile.write(tocline)
    >> tocFile.close()
    >>
    >> What I think is that tocFile is wrapped to ensure that anything written
    >> to it is in utf8
    >> I decode the latin1 string into python's internal unicode encoding and
    >> that gets written out as utf8.
    >>
    >> Questions:
    >> what exactly is the tocline when it's read in with that \xe9 and \xf3 in
    >> the string? A latin1 encoded string?

    >
    > Yes. A simple, pure byte-string, that happens to contain bytes which under
    > the latin1-encoding are "correct".
    >
    >> Is my method the right way to write such a line out to a file with utf8
    >> encoding?

    >
    > Yes.
    >
    >> If I read in the latin1 file using
    >> codecs.open(filename,encoding='latin1') and write out the utf8 file by
    >> opening with
    >> codecs.open(othername,encoding='utf8'), would I no longer have a
    >> problem -- I could just read in latin1 and write out utf8 with no more
    >> worries about encoding?

    >
    > As long as you don't mix in bytestrings and work only with unicode objects,
    > you should be fine, yes.
    >
    > Diez


    wow, I was thinking correctly about encoding! time for a beer!
    Diez, thanks very much for confirming my thoughts.

    --Tim Arnold


