Should '\w' class match non ascii letters? - Java


  1. Should '\w' class match non ascii letters?

    I've been working with Patterns for a while, and the following thing
    baffled me: the \w class doesn't seem to include non-ASCII letters
    (well, at least not Polish ones).

    The Javadoc seems to say nothing about it.

    Here's the test:

    import java.util.regex.*;

    class rTest {
        public static void main(String[] args) {
            System.out.println("Regexp: '\\w+'");
            Pattern pat = Pattern.compile("\\w+");
            Matcher m;
            // Plain ASCII letter
            m = pat.matcher("a");
            System.out.println("Matches 'a' " + m.matches());
            // Polish letter written as a Unicode escape
            m = pat.matcher("\u015b");
            System.out.println("Matches '\u015b' " + m.matches());
            m = pat.matcher("a");
            System.out.println("Matches 'a' " + m.matches());
            // The same Polish letter typed literally
            m = pat.matcher("ś");
            System.out.println("Matches 'ś' " + m.matches());
        }
    }

    It prints (on my system):
    Regexp: '\w+'
    Matches 'a' true
    Matches 'ś' false
    Matches 'a' true
    Matches 'ś' false

    The question is: is this buggy behaviour, or is it according to the
    specs? And is there an elegant way to include all (Polish) letters in
    a class?

  2. Re: Should '\w' class match non ascii letters?

    jb wrote:
    > I've been working with Patterns for a while, and following thing
    > baffled: \w class doesnt seem to include non ascii letters (well at
    > least not polish ones ).
    >
    > Javadoc seems to say nothing about it.


    The Javadoc for 1.6 says

    Predefined character classes
    ...
    \w A word character: [a-zA-Z_0-9]

    ... which seems explicit enough: there are only 63 characters that
    match \w.
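
    For anyone who wants to check that count, here is a minimal sketch
    that tries every char in the Basic Multilingual Plane against \w
    with no flags set. The class and method names are made up for
    illustration.

```java
import java.util.regex.Pattern;

class WordClassCount {
    // Counts the chars in U+0000..U+FFFF that \w matches when the
    // pattern is compiled without any flags.
    static int countWordChars() {
        Pattern word = Pattern.compile("\\w");
        int count = 0;
        for (int c = 0; c <= 0xFFFF; c++) {
            if (word.matcher(String.valueOf((char) c)).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 26 lowercase + 26 uppercase + 10 digits + underscore
        System.out.println(countWordChars());
    }
}
```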

    > The question is: whether it is buggy behaviour or is it according to
    > specs,


    According to spec, I'd say.

    > and is there any way to include all (polish) letters in a class
    > in an elegant way?


    My knowledge of Polish is non-existent (I'm unpolished), but
    perhaps the \p{prop} construct might help? It looks promising
    enough to investigate, anyhow.

    --
    Eric.Sosman@sun.com

  3. Re: Should '\w' class match non ascii letters?

    Eric Sosman writes:
    > jb wrote:
    >> I've been working with Patterns for a while, and following thing
    >> baffled: \w class doesnt seem to include non ascii letters (well at
    >> least not polish ones ).
    >> Javadoc seems to say nothing about it.

    >
    > The Javadoc for 1.6 says
    >
    > Predefined character classes
    > ...
    > \w A word character: [a-zA-Z_0-9]


    It says the same for 1.4.2 already.

    I've never tried \p{prop} before, but I did now, and \p{L} appears to
    match Finnish non-ASCII letters, so I guess it would work for Polish,
    too. It is described in Javadoc for Pattern in 1.4.2 under headings
    "Classes for Unicode blocks and categories" and "Unicode support".
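
    A small sketch of what that could look like, using Unicode escapes
    so the source file encoding doesn't matter. The helper names are
    hypothetical. Note that \p{L} covers letters only, so digits and
    the underscore have to be added separately if you want a \w
    lookalike. The second helper assumes Java 7 or later, where the
    Pattern.UNICODE_CHARACTER_CLASS flag exists; it is not available in
    the 1.4.2 and 1.6 versions mentioned above.

```java
import java.util.regex.Pattern;

class UnicodeLetters {
    // True if the whole string consists of Unicode letters (category L).
    static boolean isAllLetters(String s) {
        return Pattern.matches("\\p{L}+", s);
    }

    // Assumption: Java 7+. With this flag, \w follows the Unicode
    // definition of a word character instead of [a-zA-Z_0-9].
    static boolean isUnicodeWord(String s) {
        return Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(s).matches();
    }

    public static void main(String[] args) {
        String polish = "za\u017c\u00f3\u0142\u0107"; // Polish letters as escapes
        System.out.println(isAllLetters(polish));
        System.out.println(isAllLetters("a1"));   // a digit is not a letter
        System.out.println(isUnicodeWord(polish));
    }
}
```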

  4. Re: Should '\w' class match non ascii letters?



    Eric Sosman wrote:

    > > Javadoc seems to say nothing about it.

    >
    > The Javadoc for 1.6 says
    >
    > Predefined character classes
    > ...
    > \w A word character: [a-zA-Z_0-9]


    Well, I assumed that my chars are between a and z; in the alphabet they are.

    Jussi Piitulainen wrote:

    > I've never tried \p{prop} before, but I did now, and \p{L} appears to
    > match Finnish non-ASCII letters, so I guess it would work for Polish,
    > too. It is described in Javadoc for Pattern in 1.4.2 under headings
    > "Classes for Unicode blocks and categories" and "Unicode support".


    Thanks, it works.

  5. Re: Should '\w' class match non ascii letters?

    jb wrote:
    > Well I assumed that my chars are between a-z, in alphabet they are .


    Brief note: the range a-z refers to all characters c such that 'a' <= c
    and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
    match all uppercase letters, all digits, the lowercase letters 'a', 'b',
    'c', 'd', 'e', and 'f', as well as the punctuation characters in the
    following string ":;<=>?@[\\]^_`", which is probably not what would be
    intended.
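
    A quick sketch to see this for yourself: it collects the ASCII
    characters that [0-f] matches but that are neither letters nor
    digits. The class and method names are only for illustration.

```java
import java.util.regex.Pattern;

class RangeDemo {
    // Returns the ASCII characters matched by [0-f] that are not
    // letters or digits, in code point order.
    static String extrasInRange() {
        Pattern range = Pattern.compile("[0-f]");
        StringBuilder extras = new StringBuilder();
        for (char c = 0; c < 128; c++) {
            if (range.matcher(String.valueOf(c)).matches()
                    && !Character.isLetterOrDigit(c)) {
                extras.append(c);
            }
        }
        return extras.toString();
    }

    public static void main(String[] args) {
        System.out.println(extrasInRange());
        // U+00E1 lies above 'z', so [a-z] does not match it.
        System.out.println(Pattern.matches("[a-z]", "\u00e1"));
    }
}
```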

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth

  6. Re: Should '\w' class match non ascii letters?

    Joshua Cranmer wrote:
    > jb wrote:
    >> Well I assumed that my chars are between a-z, in alphabet they are .

    >
    > Brief note: the range a-z refers to all characters c such that 'a' <= c
    > and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
    > match all uppercase letters, all digits, the lowercase letters 'a', 'b',
    > 'c', 'd', 'e', and 'f', as well as the punctuation characters in the
    > following string ":;<=>?@[\\]^_`", which is probably not what would be
    > intended.
    >


    Agreed. ASCII tricks (for which ASCII was, in part, designed)
    don't work well in the new world of Unicode, or even Latin-1.

    BugBear

  7. Re: Should '\w' class match non ascii letters?

    bugbear wrote:
    > Joshua Cranmer wrote:
    >> jb wrote:
    >>> Well I assumed that my chars are between a-z, in alphabet they are
    >>> .

    >>
    >> Brief note: the range a-z refers to all characters c such that 'a'
    >> <= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range
    >> 0-f will match all uppercase letters, all digits, the lowercase
    >> letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
    >> characters in the following string ":;<=>?@[\\]^_`", which is
    >> probably not what would be intended.
    >>

    >
    > Agreed. ASCII tricks (for which ASCII was, in part, designed)
    > don't work well in the new world of UNICODE, or even Latin-1


    Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
    the same thing in all of them.



  8. Re: Should '\w' class match non ascii letters?

    On Wed, 27 Aug 2008, Mike Schilling wrote:

    > bugbear wrote:
    >> Joshua Cranmer wrote:
    >>> jb wrote:
    >>>> Well I assumed that my chars are between a-z, in alphabet they are
    >>>> .
    >>>
    >>> Brief note: the range a-z refers to all characters c such that 'a' <=
    >>> c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f
    >>> will match all uppercase letters, all digits, the lowercase letters
    >>> 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
    >>> characters in the following string ":;<=>?@[\\]^_`", which is probably
    >>> not what would be intended.

    >>
    >> Agreed. ASCII tricks (for which ASCII was, in part, designed)
    >> don't work well in the new world of UNICODE, or even Latin-1

    >
    > Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
    > mean the same thing in all of them.


    Exactly. In ASCII, the numerical order of the code points is the same as
    the collating sequence of the letters, so things like a-z mean what they
    look like. In Latin-1 and Unicode, this is no longer true: a-z looks like
    it should include á, but it actually doesn't.
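
    To make the difference concrete, here is a rough sketch comparing
    raw code point order with a locale-aware java.text.Collator. The
    class and method names are hypothetical, and the choice of the
    pl-PL locale is only an assumption to fit this thread.

```java
import java.text.Collator;
import java.util.Locale;

class OrderDemo {
    // Code point order: what a regex range like a-z relies on.
    static boolean beforeZByCodePoint(char c) {
        return c <= 'z';
    }

    // Collation order: what a human reader of the alphabet expects.
    static boolean beforeZByCollation(String s) {
        Collator collator = Collator.getInstance(Locale.forLanguageTag("pl-PL"));
        return collator.compare(s, "z") < 0;
    }

    public static void main(String[] args) {
        System.out.println(beforeZByCodePoint('\u00e1'));   // a with acute
        System.out.println(beforeZByCollation("\u00e1"));
    }
}
```

    The regex engine only ever sees the first kind of ordering, which
    is why the range misses accented letters.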

    tom

    --
    I have been trying to find a way of framing this but yes, a light meal is
    probably preferable to a heavy one under the circumstances. -- ninebelow

  9. Re: Should '\w' class match non ascii letters?

    Mike Schilling wrote:
    > bugbear wrote:
    >> Agreed. ASCII tricks (for which ASCII was, in part, designed)
    >> don't work well in the new world of UNICODE, or even Latin-1

    >
    > Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
    > the same thing in all of them.


    I think bugbear was referring to the fact that in the English language
    as defined by ASCII (excluding borrowed accents), the statements "char
    is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
    many scripts, that is not true (e.g., à).

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth

  10. Re: Should '\w' class match non ascii letters?

    Joshua Cranmer wrote:
    > Mike Schilling wrote:
    >> bugbear wrote:
    >>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
    >>> don't work well in the new world of UNICODE, or even Latin-1

    >>
    >> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
    >> mean the same thing in all of them.

    >
    > I think bugbear was referring to the fact that in the English language
    > as defined by ASCII (excluding borrowed accents), the statements "char
    > is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
    > many scripts, that is not true (e.g., à).
    >


    Heh. How about changing case by flipping a bit?

    It works in ASCII, for English.

    It fails quite miserably in Unicode for Thai, Arabic,
    Polish, Chinese, Japanese...
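
    A tiny sketch of the trick and where it breaks, with a made-up
    helper name. Flipping bit 0x20 turns 'a' into 'A', but for the
    Polish letter U+015B it lands on U+017B, a different letter
    altogether, while Character.toUpperCase gives U+015A.

```java
class CaseFlip {
    // The old ASCII trick: toggle bit 5 to switch case.
    static char flipBit(char c) {
        return (char) (c ^ 0x20);
    }

    public static void main(String[] args) {
        // Works for ASCII letters.
        System.out.println(flipBit('a') == Character.toUpperCase('a'));
        // Fails for U+015B (Polish s with acute).
        char polish = '\u015b';
        System.out.println(flipBit(polish) == Character.toUpperCase(polish));
        System.out.println(Integer.toHexString(flipBit(polish)));
        System.out.println(Integer.toHexString(Character.toUpperCase(polish)));
    }
}
```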

    BugBear
