|
#1
| |||
| |||
| I've been working with Patterns for a while, and the following baffled me: the \w class doesn't seem to include non-ASCII letters (well, at least not Polish ones). The Javadoc seems to say nothing about it. Here's the test:

    import java.util.regex.*;

    class rTest {
        public static void main(String[] args) {
            System.out.println("Regexp: '\\w+'");
            Pattern pat;
            Matcher m;
            pat = Pattern.compile("\\w+");
            m = pat.matcher("a");
            System.out.println("Matches 'a' " + m.matches());
            m = pat.matcher("\u015b");
            System.out.println("Matches '\u015b' " + m.matches());
            m = pat.matcher("a");
            System.out.println("Matches 'a' " + m.matches());
            m = pat.matcher("ś");
            System.out.println("Matches 'ś' " + m.matches());
        }
    }

It prints (on my system):

    Regexp: '\w+'
    Matches 'a' true
    Matches 'ś' false
    Matches 'a' true
    Matches 'ś' false

The question is: is this buggy behaviour or is it according to the spec, and is there any way to include all (Polish) letters in a class in an elegant way? |
|
#2
| |||
| |||
| jb wrote:
> I've been working with Patterns for a while, and following thing
> baffled: \w class doesnt seem to include non ascii letters (well at
> least not polish ones ).
>
> Javadoc seems to say nothing about it.

The Javadoc for 1.6 says

    Predefined character classes
    ...
    \w    A word character: [a-zA-Z_0-9]
    ....

which seems explicit enough: There are only 63 characters that match a \w.

> The question is: whether it is buggy behaviour or is it according to
> specs,

According to spec, I'd say.

> and is there any way to include all (polish) letters in a class
> in an elegant way?

My knowledge of Polish is non-existent (I'm unpolished), but perhaps the \p{prop} construct might help? It looks promising enough to investigate, anyhow.

-- Eric.Sosman@sun.com |
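The count of 63 quoted above (26 + 26 letters, the underscore, and 10 digits) can be checked by brute force. The following is a minimal sketch, assuming the default `Pattern` flags; the class and method names (`WordClassCount`, `countMatching`) are made up for illustration and do not come from the thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WordClassCount {
    // Counts how many Unicode code points a single-character pattern matches.
    // Hypothetical helper for illustration only.
    static int countMatching(String regex) {
        Matcher m = Pattern.compile(regex).matcher("");
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            // Build a one-code-point string and reuse the same matcher.
            m.reset(new String(Character.toChars(cp)));
            if (m.matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // With default flags, \w is the ASCII class [a-zA-Z_0-9].
        System.out.println(countMatching("\\w")); // prints 63
    }
}
```

Running the same helper with `"\\p{L}"` should report a far larger number, which is the gap the original poster ran into.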
|
#3
| |||
| |||
| Eric Sosman writes:
> jb wrote:
>> I've been working with Patterns for a while, and following thing
>> baffled: \w class doesnt seem to include non ascii letters (well at
>> least not polish ones ).
>>
>> Javadoc seems to say nothing about it.
>
> The Javadoc for 1.6 says
>
> Predefined character classes
> ...
> \w A word character: [a-zA-Z_0-9]

It says the same for 1.4.2 already.

I've never tried \p{prop} before, but I did now, and \p{L} appears to match Finnish non-ASCII letters, so I guess it would work for Polish, too. It is described in Javadoc for Pattern in 1.4.2 under headings "Classes for Unicode blocks and categories" and "Unicode support". |
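A minimal sketch of the \p{L} suggestion, assuming default `Pattern` flags. The class name `UnicodeWordDemo`, its helper methods, and the sample word are illustrative choices, not anything from the thread. The third pattern is one possible Unicode-aware stand-in for \w (letters, decimal digits, underscore); whether that is the right definition of "word character" depends on the application.

```java
import java.util.regex.Pattern;

public class UnicodeWordDemo {
    // ASCII-only word class, documented as [a-zA-Z_0-9].
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // Any Unicode letter (general category L).
    static final Pattern LETTERS = Pattern.compile("\\p{L}+");
    // An assumed \w-like class: letters, decimal digits, underscore.
    static final Pattern UNICODE_WORD = Pattern.compile("[\\p{L}\\p{Nd}_]+");

    static boolean isAsciiWord(String s) {
        return ASCII_WORD.matcher(s).matches();
    }

    static boolean isLetters(String s) {
        return LETTERS.matcher(s).matches();
    }

    static boolean isUnicodeWord(String s) {
        return UNICODE_WORD.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Polish word written with escapes so the source file encoding does not matter.
        String word = "\u015bwi\u0119to";
        System.out.println(isAsciiWord(word));          // false
        System.out.println(isLetters(word));            // true
        System.out.println(isUnicodeWord(word + "_1")); // true
    }
}
```

On Java 7 and later, compiling with the `Pattern.UNICODE_CHARACTER_CLASS` flag is another option that makes \w itself Unicode-aware; that flag postdates the 1.4.2 and 1.6 versions discussed here.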
|
#4
| |||
| |||
| Eric Sosman wrote:
> > Javadoc seems to say nothing about it.
>
> The Javadoc for 1.6 says
>
> Predefined character classes
> ...
> \w A word character: [a-zA-Z_0-9]

Well, I assumed that my chars are between a and z; in the alphabet they are.

Jussi Piitulainen wrote:
> I've never tried \p{prop} before, but I did now, and \p{L} appears to
> match Finnish non-ASCII letters, so I guess it would work for Polish,
> too. It is described in Javadoc for Pattern in 1.4.2 under headings
> "Classes for Unicode blocks and categories" and "Unicode support".

Thanks, it works. |
|
#5
| |||
| |||
| jb wrote:
> Well I assumed that my chars are between a-z, in alphabet they are .

Brief note: the range a-z refers to all characters c such that 'a' <= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will match all uppercase letters, all digits, the lowercase letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation characters in the following string ":;<=>?@[\\]^_`", which is probably not what would be intended.

-- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth |
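The point about ranges being defined by code-point order can be seen by listing what a range actually matches. This is a small sketch; `RangeDemo`, `matchedBy`, and `matchesClass` are hypothetical names introduced for the example, and the scan is limited to printable ASCII on purpose.

```java
import java.util.regex.Pattern;

public class RangeDemo {
    // Returns every printable ASCII character matched by the given character class.
    static String matchedBy(String charClass) {
        Pattern p = Pattern.compile(charClass);
        StringBuilder sb = new StringBuilder();
        for (char c = 0x20; c < 0x7f; c++) {
            if (p.matcher(String.valueOf(c)).matches()) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Convenience wrapper: does the whole string match the character class?
    static boolean matchesClass(String charClass, String s) {
        return Pattern.matches(charClass, s);
    }

    public static void main(String[] args) {
        // Everything from '0' (0x30) through 'f' (0x66): 55 characters,
        // including digits, punctuation, all uppercase letters, and a-f.
        String hits = matchedBy("[0-f]");
        System.out.println(hits);
        System.out.println(hits.length());

        // U+00E1 (a with acute) lies above 'z' (U+007A), so [a-z] rejects it.
        System.out.println(matchesClass("[a-z]", "\u00e1")); // false
    }
}
```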
|
#6
| |||
| |||
| Joshua Cranmer wrote:
> jb wrote:
>> Well I assumed that my chars are between a-z, in alphabet they are .
>
> Brief note: the range a-z refers to all characters c such that 'a' <= c
> and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f will
> match all uppercase letters, all digits, the lowercase letters 'a', 'b',
> 'c', 'd', 'e', and 'f', as well as the punctuation characters in the
> following string ":;<=>?@[\\]^_`", which is probably not what would be
> intended.

Agreed. ASCII tricks (for which ASCII was, in part, designed) don't work well in the new world of UNICODE, or even Latin-1

BugBear |
|
#7
| |||
| |||
| bugbear wrote:
> Joshua Cranmer wrote:
>> jb wrote:
>>> Well I assumed that my chars are between a-z, in alphabet they are .
>>
>> Brief note: the range a-z refers to all characters c such that 'a'
>> <= c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range
>> 0-f will match all uppercase letters, all digits, the lowercase
>> letters 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
>> characters in the following string ":;<=>?@[\\]^_`", which is
>> probably not what would be intended.
>
> Agreed. ASCII tricks (for which ASCII was, in part, designed)
> don't work well in the new world of UNICODE, or even Latin-1

Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean the same thing in all of them. |
|
#8
| |||
| |||
| On Wed, 27 Aug 2008, Mike Schilling wrote:
> bugbear wrote:
>> Joshua Cranmer wrote:
>>> jb wrote:
>>>> Well I assumed that my chars are between a-z, in alphabet they are .
>>>
>>> Brief note: the range a-z refers to all characters c such that 'a' <=
>>> c and c <= 'z'. 'á' is > 'z', ergo it doesn't match. The range 0-f
>>> will match all uppercase letters, all digits, the lowercase letters
>>> 'a', 'b', 'c', 'd', 'e', and 'f', as well as the punctuation
>>> characters in the following string ":;<=>?@[\\]^_`", which is probably
>>> not what would be intended.
>>
>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>> don't work well in the new world of UNICODE, or even Latin-1
>
> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
> mean the same thing in all of them.

Exactly. In ASCII, the numerical order of the code points is the same as the collating sequence of the letters, so things like a-z mean what they look like. In Latin-1 and Unicode, this is no longer true: a-z looks like it should include á, but it actually doesn't.

tom

-- I have been trying to find a way of framing this but yes, a light meal is probably preferable to a heavy one under the circumstances. -- ninebelow |
|
#9
| |||
| |||
| Mike Schilling wrote:
> bugbear wrote:
>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>> don't work well in the new world of UNICODE, or even Latin-1
>
> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would mean
> the same thing in all of them.

I think bugbear was referring to the fact that in the English language as defined by ASCII (excluding borrowed accents), the statements "char is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in many scripts, that is not true (e.g., à).

-- Beware of bugs in the above code; I have only proved it correct, not tried it. -- Donald E. Knuth |
|
#10
| |||
| |||
| Joshua Cranmer wrote:
> Mike Schilling wrote:
>> bugbear wrote:
>>> Agreed. ASCII tricks (for which ASCII was, in part, designed)
>>> don't work well in the new world of UNICODE, or even Latin-1
>>
>> Huh? Both Latin-1 and Unicode are supersets of ASCII, so "0-f" would
>> mean the same thing in all of them.
>
> I think bugbear was referring to the fact that in the English language
> as defined by ASCII (excluding borrowed accents), the statements "char
> is a lowercase letter" and |'a' <= char <= 'z'| are equivalent, but in
> many scripts, that is not true (e.g., à).

Heh. How about changing case by flipping a bit? It works in ASCII, for English. It fails quite miserably in Unicode for Thai, Arabic, Polish, Chinese, Japanese...

BugBear |
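The bit-flipping trick mentioned above can be tried directly. A minimal sketch, where `CaseFlipDemo`, `flipCaseBit`, and `hex` are hypothetical names for this example; it toggles bit 0x20, which is the bit that separates upper and lower case for ASCII letters, and compares the result with `Character.toUpperCase`.

```java
public class CaseFlipDemo {
    // The classic ASCII trick: toggle bit 0x20 to switch letter case.
    static char flipCaseBit(char c) {
        return (char) (c ^ 0x20);
    }

    // Formats a char as a code point label so console encoding does not matter.
    static String hex(char c) {
        return String.format("U+%04X", (int) c);
    }

    public static void main(String[] args) {
        // ASCII: 'a' (U+0061) becomes 'A' (U+0041), as intended.
        System.out.println(hex(flipCaseBit('a')));

        // Polish s with acute (U+015B): the flip lands on U+017B,
        // which is capital Z with dot above, a different letter.
        System.out.println(hex(flipCaseBit('\u015b')));

        // The library mapping gives the real uppercase form, U+015A.
        System.out.println(hex(Character.toUpperCase('\u015b')));
    }
}
```

For scripts without case, such as Thai or Chinese, there is no uppercase form to land on at all, so `Character.toUpperCase` and `String.toUpperCase` are the safer route whenever input may go beyond ASCII.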