Should '\w' class match non ascii letters? - Java
-
Should '\w' class match non ascii letters?
I've been working with Patterns for a while, and the following thing
baffled me: the \w class doesn't seem to include non-ASCII letters (well, at
least not Polish ones).
The Javadoc seems to say nothing about it.
Here's the test:
import java.util.regex.*;

class rTest {
    public static void main(String[] args) {
        System.out.println("Regexp: '\\w+'");
        Pattern pat = Pattern.compile("\\w+");
        Matcher m;
        // ASCII letter
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        // Polish letter written as a Unicode escape
        m = pat.matcher("\u015b");
        System.out.println("Matches '\u015b' " + m.matches());
        m = pat.matcher("a");
        System.out.println("Matches 'a' " + m.matches());
        // The same Polish letter typed literally
        m = pat.matcher("ś");
        System.out.println("Matches 'ś' " + m.matches());
    }
}
It prints (on my system):
Regexp: '\w+'
Matches 'a' true
Matches 'ś' false
Matches 'a' true
Matches 'ś' false
The question is: is this a bug, or is it according to spec? And is there
an elegant way to include all (Polish) letters in a class?
-
Re: Should '\w' class match non ascii letters?
jb wrote:
> I've been working with Patterns for a while, and following thing
> baffled: \w class doesnt seem to include non ascii letters (well at
> least not polish ones).
>
> Javadoc seems to say nothing about it.
The Javadoc for 1.6 says
Predefined character classes
...
\w A word character: [a-zA-Z_0-9]
.... which seems explicit enough: There are only 63 characters that
match a \w.
> The question is: whether it is buggy behaviour or is it according to
> specs,
According to spec, I'd say.
> and is there any way to include all (polish) letters in a class
> in an elegant way?
My knowledge of Polish is non-existent (I'm unpolished), but
perhaps the \p{prop} construct might help? It looks promising
enough to investigate, anyhow.
--
Eric.Sosman@sun.com
-
Re: Should '\w' class match non ascii letters?
Eric Sosman writes:
> jb wrote:
>> I've been working with Patterns for a while, and following thing
>> baffled: \w class doesnt seem to include non ascii letters (well at
>> least not polish ones).
>> Javadoc seems to say nothing about it.
>
> The Javadoc for 1.6 says
>
> Predefined character classes
> ...
> \w A word character: [a-zA-Z_0-9]
It says the same for 1.4.2 already.
I've never tried \p{prop} before, but I did now, and \p{L} appears to
match Finnish non-ASCII letters, so I guess it would work for Polish,
too. It is described in Javadoc for Pattern in 1.4.2 under headings
"Classes for Unicode blocks and categories" and "Unicode support".
-
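A minimal sketch of the \p{L} suggestion above, for anyone who wants to try it. The class and method names are made up for illustration, and the "word" class [\p{L}\p{Nd}_] is only one assumed way to approximate a Unicode-aware \w; it is not something the Javadoc prescribes. The last line uses Pattern.UNICODE_CHARACTER_CLASS, which only exists in Java 7 and later, so it would not have been available on the 1.4.2 and 1.6 versions discussed here.

```java
import java.util.regex.Pattern;

public class UnicodeLetterDemo {
    // \p{L} matches any code point in the Unicode "Letter" general category.
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Assumed stand-in for \w: letters, decimal digits and underscore.
    static final Pattern WORD = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    // Java 7+ only: makes \w itself Unicode-aware.
    static final Pattern UNICODE_W =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);

    static boolean isLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    static boolean isWord(String s) {
        return WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        String sAcute = "\u015b";                // Polish s with acute, U+015B
        String turtle = "\u017c\u00f3\u0142w_1"; // "zolw_1" with Polish letters

        System.out.println(Pattern.matches("\\w+", sAcute));  // false
        System.out.println(isLetters(sAcute));                // true
        System.out.println(isWord(turtle));                   // true
        System.out.println(UNICODE_W.matcher(sAcute).matches()); // true
    }
}
```

The strings are written with \u escapes so the result does not depend on the encoding the source file is compiled with.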
-
Re: Should '\w' class match non ascii letters?
jb wrote:
> Well I assumed that my chars are between a-z, in alphabet they are.
Brief note: the range a-z refers to all characters c such that 'a' <= c
and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
match all uppercase letters, all digits, the lowercase letters 'a', 'b',
'c', 'd', 'e', and 'f', as well as the punctuation characters in the
following string ":;<=>?@[\\]^_`", which is probably not what would be
intended.
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
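
The range arithmetic in this post can be checked with a short sketch. The class and helper names are hypothetical; the code simply asks the regex engine which of the 128 ASCII characters fall inside a given character class.

```java
import java.util.regex.Pattern;

public class RangeDemo {
    static final Pattern ZERO_TO_F = Pattern.compile("[0-f]");
    static final Pattern A_TO_Z = Pattern.compile("[a-z]");

    // Collects every ASCII character (code 0..127) that the
    // single-character pattern p matches.
    static String asciiMatches(Pattern p) {
        StringBuilder sb = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Digits, then :;<=>?@, then A-Z, then [\]^_`, then a-f
        System.out.println(asciiMatches(ZERO_TO_F));

        // U+00E1 (a with acute) is numerically above 'z' (U+007A)
        System.out.println(A_TO_Z.matcher("\u00e1").matches()); // false
    }
}
```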
-
Re: Should '\w' class match non ascii letters?
Joshua Cranmer wrote:
> jb wrote:
>> Well I assumed that my chars are between a-z, in alphabet they are.
>
> Brief note: the range a-z refers to all characters c such that 'a' <= c
> and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
> match all uppercase letters, all digits, the lowercase letters 'a', 'b',
> 'c', 'd', 'e', and 'f', as well as the punctuation characters in the
> following string ":;<=>?@[\\]^_`", which is probably not what would be
> intended.
>
Agreed. ASCII tricks (for which ASCII was, in part, designed)
don't work well in the new world of UNICODE, or even Latin-1
BugBear
-
Re: Should '\w' class match non ascii letters?
bugbear wrote:
> Joshua Cranmer wrote:
>> jb wrote:
>>> Well I assumed that my chars are between a-z, in alphabet they are.
>>
>> Brief note: the range a-z refers to all characters c such that 'a'
>> <= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range
>> 0-f will match all uppercase letters, all digits, the lowercase
>> letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
>> characters in the following string ":;<=>?@[\\]^_`", which is
>> probably not what would be intended.
>>
>
> Agreed. ASCII tricks (for which ASCII was, in part, designed)
> don't work well in the new world of UNICODE, or even Latin-1
Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
the same thing in all of them.
-
Re: Should '\w' class match non ascii letters?
On Wed, 27 Aug 2008, Mike Schilling wrote:
> bugbear wrote:
>> Joshua Cranmer wrote:
>>> jb wrote:
>>>> Well I assumed that my chars are between a-z, in alphabet they are.
>>>
>>> Brief note: the range a-z refers to all characters c such that 'a' <=
>>> c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f
>>> will match all uppercase letters, all digits, the lowercase letters
>>> 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
>>> characters in the following string ":;<=>?@[\\]^_`", which is probably
>>> not what would be intended.
>>
>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>> don't work well in the new world of UNICODE, or even Latin-1
>
> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
> mean the same thing in all of them.
Exactly. In ASCII, the numerical order of the codepoints is the same as
the collating sequence of the letters, so things like a-z mean what they
look like. In Latin-1 and Unicode, this is no longer true: a-z looks like
it should include á, but it actually doesn't.
tom
--
I have been trying to find a way of framing this but yes, a light meal is
probably preferable to a heavy one under the circumstances. -- ninebelow
-
Re: Should '\w' class match non ascii letters?
Mike Schilling wrote:
> bugbear wrote:
>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>> don't work well in the new world of UNICODE, or even Latin-1
>
> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
> the same thing in all of them.
I think bugbear was referring to the fact that in the English language
as defined by ASCII (excluding borrowed accents), the statements "char
is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
many scripts, that is not true (e.g., à).
--
Beware of bugs in the above code; I have only proved it correct, not
tried it. -- Donald E. Knuth
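
To make the equivalence (and where it breaks) concrete, here is a small sketch comparing the ASCII range test with the java.lang.Character queries. The class and method names are hypothetical, and combining isLetter with isLowerCase is just one assumed reading of "char is a lowercase letter".

```java
public class LowercaseDemo {
    // The ASCII-era test: a pure numeric range check.
    static boolean inAsciiRange(char c) {
        return 'a' <= c && c <= 'z';
    }

    // Asks the Unicode character database instead of comparing code points.
    static boolean isLowercaseLetter(char c) {
        return Character.isLetter(c) && Character.isLowerCase(c);
    }

    public static void main(String[] args) {
        // 'a', 'z', a with grave (U+00E0), Polish s with acute (U+015B)
        char[] samples = { 'a', 'z', '\u00e0', '\u015b' };
        for (char c : samples) {
            System.out.printf("U+%04X range=%b unicode=%b%n",
                    (int) c, inAsciiRange(c), isLowercaseLetter(c));
        }
    }
}
```

For 'a' and 'z' the two tests agree; for U+00E0 and U+015B the range test says false while the Character-based test says true.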
-
Re: Should '\w' class match non ascii letters?
Joshua Cranmer wrote:
> Mike Schilling wrote:
>> bugbear wrote:
>>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>>> don't work well in the new world of UNICODE, or even Latin-1
>>
>> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
>> mean the same thing in all of them.
>
> I think bugbear was referring to the fact that in the English language
> as defined by ASCII (excluding borrowed accents), the statements "char
> is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
> many scripts, that is not true (e.g., à).
>
Heh. How about changing case by flipping a bit?
It works in ASCII, for English.
It fails quite miserably in Unicode for Thai, Arabic,
Polish, Chinese, Japanese...
BugBear
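
The bit-flipping trick can be sketched as follows, assuming the bit in question is 0x20 (the one that separates 'A' from 'a' in ASCII). The class and method names are made up for illustration; Character.toUpperCase is the standard library call for comparison.

```java
public class CaseFlipDemo {
    // ASCII trick: upper- and lower-case English letters differ only in bit 0x20.
    static char flipCaseBit(char c) {
        return (char) (c ^ 0x20);
    }

    public static void main(String[] args) {
        // Works for English letters
        System.out.println(flipCaseBit('a')); // A

        char sAcute = '\u015b'; // Polish lowercase s with acute

        // Flipping the bit lands on U+017B (capital Z with dot above),
        // which is a different letter entirely
        System.out.printf("U+%04X%n", (int) flipCaseBit(sAcute)); // U+017B

        // The library lookup gives the real upper-case form, U+015A
        System.out.printf("U+%04X%n", (int) Character.toUpperCase(sAcute)); // U+015A
    }
}
```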