Normalize a polish L - Python
-
Normalize a polish L
In Unicode, \u0141 is a capital L with a little dash through it, as can be
seen in this image:
http://static.peterbe.com/lukasz.png
I tried this:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
''
I was hoping it would convert it to 'L' because that's what it
visually looks like. And I've seen it becoming a normal ascii L before
in other programs such as Thunderbird.
I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
none of them helped.
What am I doing wrong?
-
Re: Normalize a polish L
* Peter Bengtsson (Mon, 15 Oct 2007 16:33:26 -0000)
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
The 'L' is actually pronounced like the English "w"...
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}')
'0043 0327'
>>> unicodedata.normalize('NFKD', u'\N{LATIN CAPITAL LETTER C WITH CEDILLA}').encode('ascii','ignore')
'C'
>>> unicodedata.decomposition(u'\N{LATIN CAPITAL LETTER L WITH STROKE}')
''
-
Re: Normalize a polish L
Thorsten Kampe wrote:
> The 'L' is actually pronounced like the English "w"...
'Ł' originally comes from "L" (<http://en.wikipedia.org/wiki/Ł>) and
is AFAIK transcribed so.
Also, a friend of mine writes himself "Lukas" (pronounced L-) even
though in Polish his name is Łukas (short Wh-).
Regards,
Björn
--
BOFH excuse #126:
it has Intel Inside
-
Re: Normalize a polish L
Peter Bengtsson <peterbe@gmail.com> writes:
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image:
> http://static.peterbe.com/lukasz.png
>
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?
I had the same problem, and a little research showed that it is
caused by the Unicode standard itself. I don't know why,
but characters with a stroke don't have a canonical decomposition.
I looked into this file:
http://unicode.org/Public/UNIDATA/UnicodeData.txt
and compared two positions:
1.
<UnicodeData.txt>
0142;LATIN SMALL LETTER L WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER L SLASH \
;;0141;;0141
0141;LATIN CAPITAL LETTER L WITH STROKE;Lu;0;L;;;;;N;LATIN CAPITAL LETTER L SLASH \
;;;0142;
</UnicodeData.txt>
2.
<UnicodeData.txt>
0105;LATIN SMALL LETTER A WITH OGONEK;Ll;0;L;0061 0328;;;;N;LATIN SMALL LETTER A OGONEK \
;;0104;;0104
</UnicodeData.txt>
In the second case the sixth field holds the canonical decomposition
(0061 0328), but in the first case that field is empty. I don't know
what the justification is, but presumably there is one.
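The same comparison can be made from Python without downloading the file. This is only a sketch: it assumes Python 3 and relies on the Unicode database bundled with the interpreter, which may be a different version from the UnicodeData.txt linked above. The name `mappings` is just illustrative.

```python
import unicodedata

# Look up the decomposition mapping (the sixth field of UnicodeData.txt)
# for a-ogonek and for both cases of l-with-stroke.
mappings = {
    unicodedata.name(ch): unicodedata.decomposition(ch)
    for ch in ("\u0105", "\u0141", "\u0142")
}

for name, mapping in mappings.items():
    # An empty string means the character has no decomposition at all.
    print(f"{name}: {mapping!r}")
```

An empty mapping is why normalize() leaves the character untouched and encode('ascii', 'ignore') then drops it.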
Regards,
Rob
-
Re: Normalize a polish L
* Bjoern Schliessmann (Mon, 15 Oct 2007 21:51:54 +0200)
> Thorsten Kampe wrote:
> > The 'L' is actually pronounced like the English "w"...
>
> 'Ł' originally comes from "L" (<http://en.wikipedia.org/wiki/Ł>) and
> is AFAIK transcribed so.
There are lots of possible transcriptions for "LATIN CAPITAL LETTER L
WITH STROKE". Transcription is language dependent so the English and
German transcriptions of Polish names are different.
> Also, a friend of mine writes himself "Lukas" (pronounced L-) even
> though in Polish his name is Łukas (short Wh-).
Why do you try to use characters in a character set that does not
contain these characters? That doesn't make any sense.
Thorsten
-
Re: Normalize a polish L
On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:
> In UTF8, \u0141 is a capital L with a little dash through it as can be
> seen in this image: http://static.peterbe.com/lukasz.png
>
> I tried this:
> >>> import unicodedata
> >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> ''
>
> I was hoping it would convert it to 'L' because that's what it
> visually looks like. And I've seen it becoming a normal ascii L before
> in other programs such as Thunderbird.
>
> I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> none of them helped.
>
> What am I doing wrong?
The character in question is NOT composed (in the way that Unicode
means) of an 'L' and a little slash; hence the concepts of
"normalization" and "decomposition" don't apply.
To "asciify" such text, you need to build a look-up table that suits
your purpose. unicodedata.decomposition() is (accidentally) useful in
providing *some* of the entries for such a table.
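A minimal sketch of that idea, assuming Python 3: a small hand-made table covers characters that have no decomposition, and NFKD plus encode('ascii', 'ignore') handles the rest. The table contents and the names `ASCII_FALLBACKS` and `asciify` are hypothetical; which replacements are right depends on your purpose.

```python
import unicodedata

# Hypothetical hand-made entries for letters that Unicode does not decompose.
ASCII_FALLBACKS = {
    "\u0141": "L",  # LATIN CAPITAL LETTER L WITH STROKE
    "\u0142": "l",  # LATIN SMALL LETTER L WITH STROKE
    "\u00d8": "O",  # LATIN CAPITAL LETTER O WITH STROKE
    "\u00f8": "o",  # LATIN SMALL LETTER O WITH STROKE
    "\u0110": "D",  # LATIN CAPITAL LETTER D WITH STROKE
    "\u0111": "d",  # LATIN SMALL LETTER D WITH STROKE
}

def asciify(text):
    """Replace table entries first, then strip combining marks via NFKD."""
    pieces = []
    for ch in text:
        if ch in ASCII_FALLBACKS:
            pieces.append(ASCII_FALLBACKS[ch])
        else:
            decomposed = unicodedata.normalize("NFKD", ch)
            # Anything still outside ASCII (e.g. combining marks) is dropped.
            pieces.append(decomposed.encode("ascii", "ignore").decode("ascii"))
    return "".join(pieces)

print(asciify("\u0141ukasz"))
```

Characters that are neither in the table nor decomposable are silently dropped here, so the table has to grow with the text you actually see.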
-
Re: Normalize a polish L
Thorsten Kampe wrote:
> Why do you try to use characters in a character set that does not
> contain these characters? That doesn't make any sense.
I thought KNode was smart enough to switch to UTF-8; obviously, it
isn't.
Regards,
Björn
--
BOFH excuse #121:
halon system went off and killed the operators.
-
Re: Normalize a polish L
On Oct 15, 10:57 pm, John Machin <sjmac...@lexicon.net> wrote:
> On Oct 16, 2:33 am, Peter Bengtsson <pete...@gmail.com> wrote:
>
> > In UTF8, \u0141 is a capital L with a little dash through it as can be
> > seen in this image: http://static.peterbe.com/lukasz.png
>
> > I tried this:
> > >>> import unicodedata
> > >>> unicodedata.normalize('NFKD', u'\u0141').encode('ascii','ignore')
> > ''
>
> > I was hoping it would convert it to 'L' because that's what it
> > visually looks like. And I've seen it becoming a normal ascii L before
> > in other programs such as Thunderbird.
>
> > I also tried the other forms: 'NFC', 'NFKC', 'NFD', and 'NFKD' but
> > none of them helped.
>
> > What am I doing wrong?
>
> The character in question is NOT composed (in the way that Unicode
> means) of an 'L' and a little slash; hence the concepts of
> "normalization" and "decomposition" don't apply.
>
> To "asciify" such text, you need to build a look-up table that suits
> your purpose. unicodedata.decomposition() is (accidentally) useful in
> providing *some* of the entries for such a table.
Thank you! That explains it.
-
Re: Normalize a polish L
On Oct 15, 6:57 pm, John Machin <sjmac...@lexicon.net> wrote:
> To "asciify" such text, you need to build a look-up table that suits
> your purpose. unicodedata.decomposition() is (accidentally) useful in
> providing *some* of the entries for such a table.
This is the only approach that can actually work, because every
language has different conventions on how to represent text without
diacritics.
For example, in Spanish, "ü" (u with umlaut) should be represented as
"u", but in German, it should be represented as "ue".
pingüino -> pinguino
Frühstück -> Fruehstueck
I wish web applications (e.g. blogs) took these conventions into
account when creating URLs from the title of an article.
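Here is a rough sketch of how a language-aware version could look, assuming Python 3. The per-language tables and the names `LANGUAGE_TABLES` and `asciify_for` are made up for illustration and are far from complete; anything a table does not cover falls back to plain NFKD stripping.

```python
import unicodedata

# Hypothetical per-language replacement tables (incomplete on purpose).
LANGUAGE_TABLES = {
    "de": {
        "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",  # ä ö ü
        "\u00c4": "Ae", "\u00d6": "Oe", "\u00dc": "Ue",  # Ä Ö Ü
        "\u00df": "ss",                                  # ß
    },
    "es": {},  # Spanish simply drops the diacritic
}

def asciify_for(text, lang):
    """Apply the table for lang, then strip remaining diacritics via NFKD."""
    table = LANGUAGE_TABLES.get(lang, {})
    # Compose first so that "u" + combining diaeresis matches the table keys.
    composed = unicodedata.normalize("NFC", text)
    replaced = "".join(table.get(ch, ch) for ch in composed)
    decomposed = unicodedata.normalize("NFKD", replaced)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(asciify_for("ping\u00fcino", "es"))
print(asciify_for("Fr\u00fchst\u00fcck", "de"))
```

The language has to be supplied by the caller; nothing in the text itself says whether a "ü" is Spanish or German.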
--
Roberto Bonvallet