Python 3.0b2 cannot map '\u12b' - Python


  1. Python 3.0b2 cannot map '\u12b'

    Hello,

    I am using Python 3.0b2.
    I have an XML file that has the unicode character '\u012b' in it,
    which, when parsed, causes a UnicodeEncodeError:

    'charmap' codec can't encode character '\u012b' in position 26:
    character maps to <undefined>

    This happens even when I assign this character to a reference in the
    interpreter:

    Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit (Intel)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s = '\u012b'
    >>> s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Python30\lib\io.py", line 1428, in write
        b = encoder.encode(s)
      File "C:\Python30\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in position 1: character maps to <undefined>

    Is this a known issue, or am I doing something wrong?
    Here is a link to the XML file. The character is on line 600, char 54

    http://rubyquiz.com/SongLibrary.xml.gz

  2. Re: Python 3.0b2 cannot map '\u12b'

    josh logan <dear.jay.logan@gmail.com> wrote:
    >
    >I am using Python 3.0b2.
    >I have an XML file that has the unicode character '\u012b' in it,
    >which, when parsed, causes a UnicodeEncodeError:
    >
    >'charmap' codec can't encode character '\u012b' in position 26:
    >character maps to <undefined>
    >
    >This happens even when I assign this character to a reference in the
    >interpreter:
    >
    >Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit
    >(Intel)] on
    > win32
    >Type "help", "copyright", "credits" or "license" for more information.
    >>>> s = '\u012b'
    >>>> s

    >Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > File "C:\Python30\lib\io.py", line 1428, in write
    > b = encoder.encode(s)
    > File "C:\Python30\lib\encodings\cp437.py", line 19, in encode
    > return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    >UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in
    >position
    >1: character maps to <undefined>
    >
    >Is this a known issue, or am I doing something wrong?


    Both. U+012B is the Latin lower-case i with macron (i with a bar instead
    of a dot). That character does not exist in the 8-bit character set CP437,
    which you are trying to use.

    If you choose a character set that includes i-with-macron, then it
    will work. UTF-8 would be a good choice. Among the 8-bit character
    sets, ISO-8859-10 includes it.
    --
    Tim Roberts, timr@probo.com
    Providenza & Boekelheide, Inc.
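
The point about character sets can be checked directly. The following is a minimal sketch: the helper name `can_encode` is made up for illustration, and the encoding names are the ones Python's standard codecs use. It tests whether U+012B can be represented in a few encodings without printing the character itself.

```python
I_MACRON = "\u012b"  # LATIN SMALL LETTER I WITH MACRON


def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# cp437 is the console code page from the traceback above; the others are
# the candidates mentioned in this thread.
for name in ("cp437", "cp1252", "iso8859_10", "utf-8"):
    print(name, can_encode(I_MACRON, name))
```

Only the encodings that actually contain the character report `True`, which is why the choice of output encoding decides whether the error appears.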

  3. Re: Python 3.0b2 cannot map '\u12b'



    Tim Roberts wrote:
    > josh logan <dear.jay.logan@gmail.com> wrote:
    >> I am using Python 3.0b2.
    >> I have an XML file that has the unicode character '\u012b' in it,
    >> which, when parsed, causes a UnicodeEncodeError:
    >>
    >> 'charmap' codec can't encode character '\u012b' in position 26:
    >> character maps to <undefined>
    >>
    >> This happens even when I assign this character to a reference in the
    >> interpreter:
    >>
    >> Python 3.0b2 (r30b2:65106, Jul 18 2008, 18:44:17) [MSC v.1500 32 bit
    >> (Intel)] on
    >> win32
    >> Type "help", "copyright", "credits" or "license" for more information.
    >>>>> s = '\u012b'
    >>>>> s

    >> Traceback (most recent call last):
    >> File "<stdin>", line 1, in <module>
    >> File "C:\Python30\lib\io.py", line 1428, in write
    >> b = encoder.encode(s)
    >> File "C:\Python30\lib\encodings\cp437.py", line 19, in encode
    >> return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    >> UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in
    >> position
    >> 1: character maps to <undefined>
    >>
    >> Is this a known issue, or am I doing something wrong?

    >
    > Both. U+012B is the Latin lower-case i with macron (i with a bar instead
    > of a dot). That character does not exist in the 8-bit character set CP437,
    > which you are trying to use.
    >
    > If you choose an 8-bit character set that includes i-with-macron, then it
    > will work. UTF-8 would be a good choice. It's in ISO-8859-10.


    I doubt the OP 'chose' cp437. Why is Python using cp437 even when the
    default encoding is utf-8?

    On WinXP
    >>> sys.getdefaultencoding()
    'utf-8'
    >>> s='\u012b'
    >>> s
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Program Files\Python30\lib\io.py", line 1428, in write
        b = encoder.encode(s)
      File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in position 1: character maps to <undefined>

    To put it another way, how can one 'choose' utf-8 for display to screen?

    Using IDLE, display works fine.

    IDLE 3.0b2
    >>> s='\u012b'
    >>> s
    'ī' # i macron
    >>> import sys
    >>> sys.getdefaultencoding()
    'utf-8'

    I ran across this in a different context and mentioned it on the bug
    tracker, but the Windows interpreter seems broken here.

    I will send this in UTF-8 so the i-macron will hopefully show up.

    tjr


  4. Re: Python 3.0b2 cannot map '\u12b'

    On Mon, 01 Sep 2008 02:27:54 -0400, Terry Reedy wrote:

    > I doubt the OP 'chose' cp437. Why is Python using cp437 even when the
    > default encoding is utf-8?
    >
    > On WinXP
    > >>> sys.getdefaultencoding()

    > 'utf-8'
    > >>> s='\u012b'
    > >>> s

    > Traceback (most recent call last):
    > File "<stdin>", line 1, in <module>
    > File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    > b = encoder.encode(s)
    > File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in
    > encode
    > return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    > UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in
    > position
    > 1: character maps to <undefined>


    Most likely because Python figured out that the terminal expects cp437.
    What does `sys.stdout.encoding` say?

    > To put it another way, how can one 'choose' utf-8 for display to screen?


    If the terminal expects cp437 then displaying utf-8 might give some
    problems.

    Ciao,
    Marc 'BlackJack' Rintsch
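
To see the difference between the two settings being discussed, a small sketch like the one below may help. The function name `describe_encodings` and the dictionary keys are illustrative only. It reports the interpreter's default string encoding next to the encoding attached to standard output, which is the one that matters when the interactive prompt echoes a value.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings that are easy to confuse with one another."""
    return {
        # Used for implicit str/bytes conversions inside the interpreter.
        "default": sys.getdefaultencoding(),
        # Used when text is written to the terminal; may differ per window.
        "stdout": getattr(sys.stdout, "encoding", None),
        # What the operating system locale suggests for text files.
        "locale": locale.getpreferredencoding(False),
    }


for label, value in describe_encodings().items():
    print(f"{label}: {value}")
```

On the setup described in this thread, the first value would be utf-8 while the second would be cp437, which is consistent with the traceback going through `cp437.py`.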

  5. Re: Python 3.0b2 cannot map '\u12b'

    On Sep 1, 8:19 am, Marc 'BlackJack' Rintsch <bj_...@gmx.net> wrote:
    > On Mon, 01 Sep 2008 02:27:54 -0400, Terry Reedy wrote:
    > > I doubt the OP 'chose' cp437.  Why is Python using cp437 even when the
    > > default encoding is utf-8?

    >
    > > On WinXP
    > >  >>> sys.getdefaultencoding()
    > > 'utf-8'
    > >  >>> s='\u012b'
    > >  >>> s
    > > Traceback (most recent call last):
    > >    File "<stdin>", line 1, in <module>
    > >    File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    > >      b = encoder.encode(s)
    > >    File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in
    > > encode
    > >      return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    > > UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in
    > > position
    > > 1: character maps to <undefined>

    >
    > Most likely because Python figured out that the terminal expects cp437.  
    > What does `sys.stdout.encoding` say?
    >
    > > To put it another way, how can one 'choose' utf-8 for display to screen?

    >
    > If the terminal expects cp437 then displaying utf-8 might give some
    > problems.
    >
    > Ciao,
    >         Marc 'BlackJack' Rintsch


    So, it is not a problem with the program, but a problem when I print
    it out.
    sys.stdout.encoding does say cp437.

    Now, when I don't print anything out, the program hangs. I will try
    this again and let the board know the results.

    Thanks for all of your insight.
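
The distinction drawn here (parsing works, printing fails) can be reproduced without the real file. This is a sketch under assumptions: `SAMPLE_XML` is a made-up one-element document, not the song library linked above, and `parse_name` is a hypothetical helper. It parses the text, inspects the result with `ascii()`, and only then tries the cp437 encoding step that the console performs.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real XML file; it only needs to contain U+012B.
SAMPLE_XML = '<Song name="\u012b"/>'


def parse_name(xml_text):
    """Parse the document and return the value of its 'name' attribute."""
    return ET.fromstring(xml_text).get("name")


name = parse_name(SAMPLE_XML)  # parsing succeeds; no console is involved
print(ascii(name))             # ASCII-only repr, safe on any terminal

try:
    name.encode("cp437")       # the step the cp437 console performs on output
except UnicodeEncodeError as exc:
    print("encoding for the console fails:", exc.reason)
```

Using `ascii()` while debugging keeps the output printable regardless of the terminal's code page.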

  6. Re: Python 3.0b2 cannot map '\u12b'



    Marc 'BlackJack' Rintsch wrote:
    > On Mon, 01 Sep 2008 02:27:54 -0400, Terry Reedy wrote:
    >
    >> I doubt the OP 'chose' cp437. Why is Python using cp437 even when the
    >> default encoding is utf-8?
    >>
    >> On WinXP
    >> >>> sys.getdefaultencoding()

    >> 'utf-8'
    >> >>> s='\u012b'
    >> >>> s

    >> Traceback (most recent call last):
    >> File "<stdin>", line 1, in <module>
    >> File "C:\Program Files\Python30\lib\io.py", line 1428, in write
    >> b = encoder.encode(s)
    >> File "C:\Program Files\Python30\lib\encodings\cp437.py", line 19, in
    >> encode
    >> return codecs.charmap_encode(input,self.errors,encoding_map)[0]
    >> UnicodeEncodeError: 'charmap' codec can't encode character '\u012b' in
    >> position
    >> 1: character maps to <undefined>

    >
    > Most likely because Python figured out that the terminal expects cp437.
    > What does `sys.stdout.encoding` say?


    The interpreter in the command prompt window says CP437.
    The IDLE Window says 'cp1252', and it handles the character fine.
    Given that Windows OS can handle the character, why is Python/Command
    Prompt limiting output?

    Characters the IDLE window cannot display (like surrogate pairs) it
    displays as boxes. But if I cut '[][]' (4 chars) and paste into
    Firefox, I get 3 chars. '[]' where [] has some digits instead of being
    empty. It is really confusing when every window on 'unicode-based'
    Windows handles a different subset. Is this the fault of Windows or of
    Python and IDLE (those two being more limited than Firefox)?

    >> To put it another way, how can one 'choose' utf-8 for display to screen?

    >
    > If the terminal expects cp437 then displaying utf-8 might give some
    > problems.


    My screen displays whatever Windows tells the graphics card to tell the
    screen to display. In OpenOffice, I can select a unicode font that
    displays at least everything in the BasicMultilingualPlane (BMP).

    Terry Jan Reedy
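
For readers unfamiliar with the term, the sketch below shows what a surrogate pair is. It does not try to explain the exact character counts reported above; the helper name `utf16_code_units` and the choice of example character are illustrative assumptions. A character outside the Basic Multilingual Plane needs two 16-bit code units in UTF-16, while a BMP character such as U+012B needs one.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


BMP_CHAR = "\u012b"         # inside the Basic Multilingual Plane
ASTRAL_CHAR = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

print(utf16_code_units(BMP_CHAR))     # one code unit
print(utf16_code_units(ASTRAL_CHAR))  # two code units: a surrogate pair
```

Programs that count UTF-16 code units and programs that count code points can therefore disagree about the length of the same text.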


  7. Re: Python 3.0b2 cannot map '\u12b'

    On Mon, 01 Sep 2008 14:25:01 -0400, Terry Reedy wrote:

    > Marc 'BlackJack' Rintsch wrote:
    >> On Mon, 01 Sep 2008 02:27:54 -0400, Terry Reedy wrote:
    >>
    >> Most likely because Python figured out that the terminal expects cp437.
    >> What does `sys.stdout.encoding` say?

    >
    > The interpreter in the command prompt window says CP437. The IDLE Window
    > says 'cp1252', and it handles the character fine. Given that Windows OS
    > can handle the character, why is Python/Command Prompt limiting output?


    The Windows command prompt expects cp437 because that's what old DOS
    programs print to it.

    > Characters the IDLE window cannot display (like surrogate pairs) it
    > displays as boxes. But if I cut '[][]' (4 chars) and paste into
    > Firefox, I get 3 chars. '[]' where [] has some digits instead of being
    > empty. It is really confusing when every window on 'unicode-based'
    > Windows handles a different subset.


    That's because it is not 'unicode-based'. Communication between those
    programs has to be done with bytes, so the sender has to encode unicode
    characters in the encoding the receiver expects.

    > Is this the fault of Windows or of Python and IDLE (those two being
    > more limited than Firefox)?


    It's nobody's fault. That's simply how the encoding stuff works.

    >>> To put it another way, how can one 'choose' utf-8 for display to
    >>> screen?

    >>
    >> If the terminal expects cp437 then displaying utf-8 might give some
    >> problems.

    >
    > My screen displays whatever Windows tells the graphics card to tell the
    > screen to display.


    But the terminal gets bytes and expects them to be cp437 encoded
    characters and not utf-8. So you can't send whatever unicode character
    you want, at least not without changing the encoding of the terminal.

    > In OpenOffice, I can select a unicode font that displays at least
    > everything in the BasicMultilingualPlane (BMP).


    But OOo works with unicode internally, so there's no communication with
    outside programs involved here.

    Ciao,
    Marc 'BlackJack' Rintsch
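
The byte-level mismatch described in this post can be simulated in a few lines. This is only a sketch: `misread` is a made-up helper, and it models the terminal as "decode the received bytes with the code page it expects". Sending UTF-8 bytes to a receiver that assumes cp437 or cp1252 yields two unrelated characters instead of one i-macron.

```python
def misread(text, sent_as, read_as):
    """Encode text the way the sender does, decode it the way the receiver does."""
    return text.encode(sent_as).decode(read_as)


I_MACRON = "\u012b"

# UTF-8 stores U+012B as the two bytes C4 AB.
print(I_MACRON.encode("utf-8"))

# A receiver expecting an 8-bit code page turns each byte into its own character.
print(ascii(misread(I_MACRON, "utf-8", "cp437")))
print(ascii(misread(I_MACRON, "utf-8", "cp1252")))

# When both sides agree on the encoding, the text survives the round trip.
print(ascii(misread(I_MACRON, "utf-8", "utf-8")))
```

The output is passed through `ascii()` so that the demonstration itself does not depend on the terminal's code page.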

  8. Re: Python 3.0b2 cannot map '\u12b'



    Marc 'BlackJack' Rintsch wrote:

    First, thank you for the informative responses.

    > The windows command prompt expects cp437 because that's what old DOS
    > programs print to it.


    Grrr. When the interpreter runs, it opens the command prompt window
    with Python running, and the window closes when Python exits, so there
    are no other programs involved. I don't suppose there is any way to tell
    Command Prompt to accept something better.

    > But OOo works with unicode internally, so there's no communication with
    > outside programs involved here.


    Python 3 uses unicode internally also, but I gather Command Prompt is an
    outside program used as a quick substitute for coding a plain window
    with MFC, for instance.

    ----------------------
    I did some experiments.

    I added the /u flag after cmd.exe in the Command Prompt shortcut and
    changed the font to Lucida Console (which people on the web say handles
    unicode).

    I opened the prompt window and entered 'chcp 1252', the same codepage as
    IDLE, then started Python 3.
    >>> import sys
    >>> sys.stdout.encoding
    'cp1252'
    >>> '\u012b'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "C:\Program Files\Python30\lib\io.py", line 1428, in
        b = encoder.encode(s)
      File "C:\Program Files\Python30\lib\encodings\cp1252.py", <etc>

    The result is the same with the raster font choice.

    chcp 65001, which supposedly is UTF-8, disables all output. Perhaps
    Python does not recognize it as a synonym for UTF-8.

    The same on IDLE (with codepage 1252) gives i macron (bar on top). So
    something else is going on other than just codepage.

    I tried a second time and instead got "'\u012b'" and no error. Hooray,
    I thought, but I closed and tried again the same way, as best I know,
    but got the same error as before. Cp65001 also did and then did not
    work. Python does notice the code page change.

    tjr


  9. Re: Python 3.0b2 cannot map '\u12b'

    Terry Reedy wrote:

    >> If the terminal expects cp437 then displaying utf-8 might give some
    >> problems.


    > My screen displays whatever Windows tells the graphics card to tell
    > the screen to display. In OpenOffice, I can select a unicode font
    > that displays at least everything in the BasicMultilingualPlane (BMP).


    It would appear that the Windows port of Python is probably just not
    forcing the Win32 console into the Unicode mode or using the Unicode
    APIs. (If this holds true, it could be a leftover from the Windows
    95/98/ME days, I suppose...)

    <http://en.wikipedia.org/wiki/Win32_console>

    As a workaround - for the time being - you might want to try something
    similar to what is described in the thread "Changing the (codec) error
    handler for the stdout/stderr streams in Python 3.0".

    The approach described in there will not let you print characters
    outside the codepage 437 repertoire - any such characters will still
    need to be substituted with something else - but at least this
    substitution should happen automatically; i.e. you can keep using the
    normal print() function the normal way - even for the fancier
    characters - and your program will no longer crash.

    It would be nice to see proper Unicode Win32 console support in Python,
    of course, if at all possible.

    --
    znark

