regex question

#1 Jeff (Guest), 08-27-2008, 01:40 PM

I'm parsing this:

name="value"

and sometimes it looks like this:

name2="value2

without the closing '"'. I don't want to capture the end quote.

So, I did this:

'/\s*(.*?)="(.*)"*/' // preg_match

and that worked, but I thought, there's only going to be zero or one
'"', so:
'/\s*(.*?)="(.*)"+/'

and that didn't.

It's a little mysterious to me what the rules are for how the matches
are made. With the * it looks like it looks first for the multiple " and
then if there isn't one, while with the + it looks for the 0 match first.

Now, I suppose I can capture everything and trim the end quote off.
But I wonder what the rules are.

Jeff

#2 Lars Eighner (Guest), 08-27-2008, 04:25 PM

In our last episode, <fLCdncKkpdI8DCjVnZ2dnUVZ_hGdnZ2d@earthlink.com> , the
lovely and talented Jeff broadcast on comp.lang.php:

> I'm parsing this:
>
> name="value"
>
> and sometimes it looks like this:
>
> name2="value2
>
> without the closing '"'. I don't want to capture the end quote.
>
> So, I did this:
>
> '/\s*(.*?)="(.*)"*/' // preg_match
>
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
>
> and that didn't.


As you are using preg_match, you should know that + means match one or
more times, so the second " can only match when it actually appears. You
want to match 0 or 1 times (the closing quote may or may not appear).
That's ?, not +. * works because it means 0 or more times, but it would
also match "value""""""". If you know there will always be only 0 or 1 ",
then * works as well as ?, but ? is more exact and excludes the
possibility of "value""""""".
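
To make the difference between the three quantifiers concrete, here is a minimal sketch. The ^ and $ anchors and the [^"]* value class are additions for illustration (they are not in Jeff's pattern), chosen so that only the quantifier on the closing quote varies. The third input is an invented example of stray extra quotes.

```php
<?php
// Illustrative inputs: closed, unclosed, and with stray extra quotes.
$inputs = ['name="value"', 'name2="value2', 'name="value"""'];

// Same pattern three times; only the quantifier after the closing quote differs.
$patterns = [
    '?' => '/^(\w+)="([^"]*)"?$/',
    '*' => '/^(\w+)="([^"]*)"*$/',
    '+' => '/^(\w+)="([^"]*)"+$/',
];

$results = [];
foreach ($patterns as $quantifier => $pattern) {
    foreach ($inputs as $input) {
        // preg_match returns 1 on a match and 0 on no match.
        $results[$quantifier][$input] = preg_match($pattern, $input) === 1;
    }
}

foreach ($results as $quantifier => $row) {
    foreach ($row as $input => $matched) {
        printf("%s  %-16s %s\n", $quantifier, $input, $matched ? 'match' : 'no match');
    }
}
```

With these inputs, ? accepts the first two and rejects the one with extra quotes, * accepts all three, and + rejects only the unclosed one. Note that ? only "excludes" the extra quotes because the pattern is anchored with $; without the anchor it would simply stop matching after the first quote.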

> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.


+ can't look for zero match. That isn't what + means. + means 'at least
once.'
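
As for the rules Jeff asks about: quantifiers are greedy by default, so (.*) first takes as much as it can and gives characters back only when the rest of the pattern cannot match otherwise. The sketch below runs the two patterns from the original post against the sample inputs. The "? and [^"]* variants are additions for illustration, not something posted in this thread.

```php
<?php
$closed   = 'name="value"';
$unclosed = 'name2="value2';

// Greedy (.*) runs to the end of the line and swallows the closing quote;
// "* and "? are then satisfied by matching nothing at all.
preg_match('/\s*(.*?)="(.*)"*/', $closed, $star);
preg_match('/\s*(.*?)="(.*)"?/', $closed, $optional);

// "+ needs at least one quote, so (.*) is forced to give the last one back.
preg_match('/\s*(.*?)="(.*)"+/', $closed, $plus);
$plusUnclosed = preg_match('/\s*(.*?)="(.*)"+/', $unclosed);

// Excluding the quote from the value removes the dependence on backtracking.
preg_match('/\s*(.*?)="([^"]*)"?/', $closed, $fixedClosed);
preg_match('/\s*(.*?)="([^"]*)"?/', $unclosed, $fixedUnclosed);

var_dump($star[2], $optional[2], $plus[2], $plusUnclosed);
var_dump($fixedClosed[2], $fixedUnclosed[2]);
```

With * or ? after a greedy (.*), group 2 comes back as value" with the quote included, because a trailing quantifier that is allowed to match nothing never forces (.*) to back off. With + the capture is value, but the unclosed input does not match at all. The [^"]* variant yields value and value2.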

> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.


I'm not sure whether PHP's preg_match completely and accurately handles
all possible perl regexes, but since the p in preg advertises perl
compatibility, the perl regex documentation would be a good place to start.

If you have perl installed: perldoc perlre . Otherwise the perl
documentation is available in several formats.

--
Lars Eighner <http://larseighner.com/> usenet@larseighner.com
"I believe in God and I believe in free markets,"
-Kenneth Lay, CEO for the now defunct Enron, whose loss of some 50
billion dollars represents the largest corporate bankruptcy in the
history of the US.

#3 Curtis (Guest), 08-27-2008, 11:36 PM

Lars Eighner wrote:
> In our last episode, <fLCdncKkpdI8DCjVnZ2dnUVZ_hGdnZ2d@earthlink.com> , the
> lovely and talented Jeff broadcast on comp.lang.php:
>
>> I'm parsing this:
>>
>> name="value"
>>
>> and sometimes it looks like this:
>>
>> name2="value2
>>
>> without the closing '"'. I don't want to capture the end quote.
>>
>> So, I did this:
>>
>> '/\s*(.*?)="(.*)"*/' // preg_match
>>
>> and that worked, but I thought, there's only going to be zero or one
>> '"', so:
>> '/\s*(.*?)="(.*)"+/'
>>
>> and that didn't.
>
> as you are using preg_match, you should know that + means match one or more
> times. The second " can only match when it appears. You want to match 0 or
> 1 times (the closing quote may or may not appear). That's ? not +. * works
> because it means 0 or more times. But it would also match "value""""""" .
> If you know there will always be only 0 or 1 " then * works as well as ?,
> but ? is more exact and excludes the possibility of "value"""""".


Yeah, exactly right. Also, using the ? quantifier is less work for the
regex engine.

>> It's a little mysterious to me what the rules are for how the matches
>> are made. With the * it looks like it looks first for the multiple " and
>> then if there isn't one, while with the + it looks for the 0 match first.
>
> + can't look for zero match. That isn't what + means. + means 'at least
> once.'
>
>> Now, I suppose I can capture everything and trim the end quote off.
>> But I wonder what the rules are.
>
> I'm not sure whether PHP preg match really completely and accurately parses
> all possible perl regexes, but since it advertises p in preg means perl,
> the perl regex documentation would be a good place to start.
>
> If you have perl installed: perldoc perlre . Otherwise the perl
> documentation is available in several formats.


PHP's preg_* functions use the PCRE library, which is not exactly the
same as Perl's regex. The differences can be found in PHP's documentation.

--
Curtis

#4 Curtis (Guest), 08-27-2008, 11:46 PM

Jeff wrote:
> I'm parsing this:
>
> name="value"
>
> and sometimes it looks like this:
>
> name2="value2
>
> without the closing '"'. I don't want to capture the end quote.
>
> So, I did this:
>
> '/\s*(.*?)="(.*)"*/' // preg_match
>
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
>
> and that didn't.
>
> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.
>
> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.
>
> Jeff


You probably don't need to match the whitespace before the
attribute. When starting out with regex, it is tempting to match more
than necessary. You might try this alternative (the x modifier, extended
regex, allows whitespace and easy insertion of comments):

<?php

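$re = '/
(?: ^ | \b )       # start of string, or word boundary
( \w+ ) =          # attribute name: word characters only
"                  # opening quote
( (?> [^"]+ ) )    # value: atomic group of non-quote characters
"?                 # optional closing quote
(?: \b | $ )       # word boundary, or end of string
/x';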

?>

I added a check for word boundaries and anchors. Note that ^ and $
anchor to the start and end of the subject string; add the m modifier
if they should also match at line breaks. I changed the name matching
to just allow word characters. The (?>...) syntax is an atomic group:
it prevents backtracking into the group, which potentially speeds up
the regex significantly. It isn't the most portable, but will work in
PHP, Perl, and maybe Python. Check perlre for more examples on this.
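
A quick way to check the pattern is to run it against both inputs from the original question. This is a minimal sketch using the pattern from this post with the comments stripped; the $samples and $parsed names are illustrative only.

```php
<?php
$re = '/
(?: ^ | \b )
( \w+ ) =
"
( (?> [^"]+ ) )
"?
(?: \b | $ )
/x';

$samples = ['name="value"', 'name2="value2'];
$parsed = [];
foreach ($samples as $sample) {
    if (preg_match($re, $sample, $m) === 1) {
        // $m[1] is the attribute name, $m[2] the value without quotes.
        $parsed[$m[1]] = $m[2];
    }
}
print_r($parsed);
```

Both inputs match, and in both cases the captured value excludes the quote: name maps to value and name2 maps to value2.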

As the regex is now, since you want to allow space and other things in
your value, the value will always contain everything to the end of the
string, unless a closing quote appears. You would have to add
something to the negative character class for the value to better
specify what it cannot contain.
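
To illustrate that caveat, the sketch below feeds a three-line input (an invented example, with the second line missing its closing quote) through two patterns. The first is the pattern above written on one line; the second adds \n to the negated class, which is just one possible restriction and assumes values never span lines. parse_pairs is a hypothetical helper, not a PHP built-in.

```php
<?php
// Hypothetical helper: collect name => value pairs from every match.
function parse_pairs($pattern, $text)
{
    $pairs = [];
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $pairs[$m[1]] = $m[2];
    }
    return $pairs;
}

// Three lines; the second is missing its closing quote.
$text = "name=\"value\"\nname2=\"value2\nname3=\"value3\"";

// Value may contain anything except a quote.
$loose = '/(?:^|\b)(\w+)="((?>[^"]+))"?(?:\b|$)/';
// Assumed restriction: value may not contain a quote or a newline.
$strict = '/(?:^|\b)(\w+)="((?>[^"\n]+))"?(?:\b|$)/';

$looseResult  = parse_pairs($loose, $text);
$strictResult = parse_pairs($strict, $text);

var_dump($looseResult, $strictResult);
```

With the loose pattern the unclosed value2 runs on across the line break up to the next quote, so name3 is swallowed into the value of name2 and only two pairs come back. With the newline excluded, all three pairs are found.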

--
Curtis

#5 King Isaac (Guest), 08-28-2008, 12:13 AM

Jeff wrote:
> I'm parsing this:
>
> name="value"
>
> and sometimes it looks like this:
>
> name2="value2
>
> without the closing '"'. I don't want to capture the end quote.
>
> So, I did this:
>
> '/\s*(.*?)="(.*)"*/' // preg_match
>
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
>
> and that didn't.
>
> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.
>
> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.
>
> Jeff


You should use: http://www.rexv.org/

It's a great tool for writing and testing regex in realtime.

btw i think what you are looking for is:

/\s*([-a-zA-Z0-9]+?)\ ?= ?('|")(.*?)(\2|\s)/
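
A quick check of this pattern against the two inputs from the question, as a hedged sketch. Note that the value lands in group 3 here, since group 2 holds the opening quote. The $withEnd variant, which also accepts end of string as a terminator, is an assumption added for illustration and not part of the pattern as posted.

```php
<?php
// Pattern as posted (the single quote is escaped for the PHP string literal).
$posted = '/\s*([-a-zA-Z0-9]+?)\ ?= ?(\'|")(.*?)(\2|\s)/';

// Assumed variant: also accept end of string as a terminator.
$withEnd = '/\s*([-a-zA-Z0-9]+?)\ ?= ?(\'|")(.*?)(\2|\s|$)/';

$closed   = 'name="value"';
$unclosed = 'name2="value2';

$postedClosed    = preg_match($posted, $closed, $a);    // value is in $a[3]
$postedUnclosed  = preg_match($posted, $unclosed);
$variantUnclosed = preg_match($withEnd, $unclosed, $b); // value is in $b[3]

var_dump($postedClosed, $a[3], $postedUnclosed, $variantUnclosed, $b[3]);
```

As posted, the pattern captures value from the closed input but does not match the unclosed one when nothing follows it, because it requires either the matching quote or a whitespace character after the value. The variant with $ matches and captures value2.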

isaac