#1
I'm parsing this:

name="value"

and sometimes it looks like this:

name2="value2

without the closing '"'. I don't want to capture the end quote.

So, I did this:

'/\s*(.*?)="(.*)"*/' // preg_match

and that worked, but I thought, there's only going to be zero or one '"', so:

'/\s*(.*?)="(.*)"+/'

and that didn't.

It's a little mysterious to me what the rules are for how the matches are made. With the * it looks like it looks first for the multiple " and then if there isn't one, while with the + it looks for the 0 match first.

Now, I suppose I can capture everything and trim the end quote off. But I wonder what the rules are.

Jeff
#2
In our last episode, <fLCdncKkpdI8DCjVnZ2dnUVZ_hGdnZ2d@earthlink.com>, the lovely and talented Jeff broadcast on comp.lang.php:

> I'm parsing this:
> name="value"
> and sometimes it looks like this:
> name2="value2
> without the closing '"'. I don't want to capture the end quote.
> So, I did this:
> '/\s*(.*?)="(.*)"*/' // preg_match
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
> and that didn't.

As you are using preg_match, you should know that + means match one or more times. The second " can only match when it appears. You want to match 0 or 1 times (the closing quote may or may not appear). That's ? not +. * works because it means 0 or more times. But it would also match "value""""""" . If you know there will always be only 0 or 1 " then * works as well as ?, but ? is more exact and excludes the possibility of "value"""""".

> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.

+ can't look for a zero match. That isn't what + means. + means 'at least once.'

> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.

I'm not sure whether PHP's preg_match completely and accurately parses all possible Perl regexes, but since the p in preg is advertised as standing for Perl, the Perl regex documentation would be a good place to start.

If you have Perl installed: perldoc perlre . Otherwise the Perl documentation is available in several formats.

--
Lars Eighner <http://larseighner.com/> usenet@larseighner.com
"I believe in God and I believe in free markets," -Kenneth Lay, CEO for the now defunct Enron, whose loss of some 50 billion dollars represents the largest corporate bankruptcy in the history of the US.
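A minimal sketch, assuming PHP 7 or later, of how the quantifiers discussed above behave on the two inputs from the question. One detail worth checking for yourself: because `(.*)` is greedy, it consumes the closing quote before `"*` or `"?` gets a chance to, so those patterns match but leave the quote inside the capture. The `optional` and `negated` patterns are illustrative variants, not something posted in the thread, and the variable names are hypothetical.

```php
<?php
// Inputs taken from the question: one with a closing quote, one without.
$inputs = ['name="value"', 'name2="value2'];

// 'star' and 'plus' are the patterns from the question.
// 'optional' and 'negated' are illustrative variants (assumptions).
$patterns = [
    'star'     => '/\s*(.*?)="(.*)"*/',
    'plus'     => '/\s*(.*?)="(.*)"+/',
    'optional' => '/\s*(.*?)="(.*)"?/',
    'negated'  => '/\s*(.*?)="([^"]*)"?/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    foreach ($inputs as $input) {
        // preg_match returns 1 on a match and 0 when nothing matches.
        $results[$label][$input] = preg_match($pattern, $input, $m) === 1 ? $m[2] : null;
    }
}

foreach ($results as $label => $byInput) {
    foreach ($byInput as $input => $value) {
        printf("%-8s %-15s => %s\n", $label, $input, $value ?? '(no match)');
    }
}
```

With `+`, the engine has to backtrack out of the greedy `(.*)` to find a quote, which is why it strips the quote when one exists and fails outright when none does. Excluding the quote from the value with `[^"]*` avoids relying on backtracking at all.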
#3
Lars Eighner wrote:
> In our last episode, <fLCdncKkpdI8DCjVnZ2dnUVZ_hGdnZ2d@earthlink.com>, the
> lovely and talented Jeff broadcast on comp.lang.php:
>
>> I'm parsing this:
>>
>> name="value"
>>
>> and sometimes it looks like this:
>>
>> name2="value2
>>
>> without the closing '"'. I don't want to capture the end quote.
>>
>> So, I did this:
>>
>> '/\s*(.*?)="(.*)"*/' // preg_match
>>
>> and that worked, but I thought, there's only going to be zero or one
>> '"', so:
>> '/\s*(.*?)="(.*)"+/'
>>
>> and that didn't.
>
> as you are using preg_match, you should know that + means match one or more
> times. The second " can only match when it appears. You want to match 0 or
> 1 times (the closing quote may or may not appear). That's ? not +. * works
> because it means 0 or more times. But it would also match "value""""""" .
> If you know there will always be only 0 or 1 " then * works as well as ?,
> but ? is more exact and excludes the possibility of "value"""""".

Yeah, exactly right. Also, using the ? quantifier is less work for the regex engine.

>> It's a little mysterious to me what the rules are for how the matches
>> are made. With the * it looks like it looks first for the multiple " and
>> then if there isn't one, while with the + it looks for the 0 match first.
>
> + can't look for zero match. That isn't what + means. + means 'at least
> once.'
>
>> Now, I suppose I can capture everything and trim the end quote off.
>> But I wonder what the rules are.
>
> I'm not sure whether PHP preg match really completely and accurately parses
> all possible perl regexes, but since it advertises p in preg means perl,
> the perl regex documentation would be a good place to start.
>
> If you have perl installed: perldoc perlre . Otherwise the perl
> documentation is available in several formats.

PHP's preg_* functions use the PCRE library, which is not exactly the same as Perl's regex. The differences can be found in PHP's documentation.

--
Curtis
#4
Jeff wrote:
> I'm parsing this:
>
> name="value"
>
> and sometimes it looks like this:
>
> name2="value2
>
> without the closing '"'. I don't want to capture the end quote.
>
> So, I did this:
>
> '/\s*(.*?)="(.*)"*/' // preg_match
>
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
>
> and that didn't.
>
> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.
>
> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.
>
> Jeff

You probably don't need to capture the whitespace before the attribute. When starting out with regex, it is tempting to match more than necessary. You might try this alternative (extended regex allows whitespace and easy insertion of comments):

<?php
$re = '/
(?: ^ | \b ) # beginning of string or line, or word boundary
( \w+ ) = " ( (?> [^"]+ ) ) "?
(?: \b | $ ) # end of string or line, or word boundary
/x';
?>

I added a check for word boundaries and anchors, which check for the beginning/end of the string/new line. I changed the name matching to just allow word characters.

The (?>...) syntax is to prevent backtracking, which potentially speeds up the regex significantly (called atomic matching). It isn't the most portable, but will work in PHP, Perl, and maybe Python. Check perlre for more examples on this.

As the regex is now, since you want to allow space and other things in your value, the value will always contain everything to the end of the string, unless a closing quote appears. You would have to add something to the negative character class for the value to better specify what it cannot contain.

--
Curtis
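A short sketch, assuming PHP 7.1 or later, of how the extended pattern above could be applied to the two inputs from the question. The pattern is copied from the post, with the comments shortened because only single-line inputs are tried here. The helper `parse_attribute` is a hypothetical name introduced for illustration.

```php
<?php
// Pattern as posted in the thread; the x modifier permits whitespace
// and # comments, so the pattern has to keep its line breaks.
$re = '/
    (?: ^ | \b )                    # beginning of string, or word boundary
    ( \w+ ) = " ( (?> [^"]+ ) ) "?
    (?: \b | $ )                    # end of string, or word boundary
/x';

// Hypothetical helper (not from the thread): returns [name, value],
// or null when the pattern does not match.
function parse_attribute(string $re, string $input): ?array
{
    if (preg_match($re, $input, $m) !== 1) {
        return null;
    }
    return [$m[1], $m[2]];
}

var_dump(parse_attribute($re, 'name="value"'));
var_dump(parse_attribute($re, 'name2="value2'));
```

In both cases the second capture group holds the value without a trailing quote, because the negated class `[^"]+` can never consume one.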
#5
Jeff wrote:
> I'm parsing this:
>
> name="value"
>
> and sometimes it looks like this:
>
> name2="value2
>
> without the closing '"'. I don't want to capture the end quote.
>
> So, I did this:
>
> '/\s*(.*?)="(.*)"*/' // preg_match
>
> and that worked, but I thought, there's only going to be zero or one
> '"', so:
> '/\s*(.*?)="(.*)"+/'
>
> and that didn't.
>
> It's a little mysterious to me what the rules are for how the matches
> are made. With the * it looks like it looks first for the multiple " and
> then if there isn't one, while with the + it looks for the 0 match first.
>
> Now, I suppose I can capture everything and trim the end quote off.
> But I wonder what the rules are.
>
> Jeff

You should use: http://www.rexv.org/

It's a great tool for writing and testing regex in realtime.

btw i think what you are looking for is:

/\s*([-a-zA-Z0-9]+?)\ ?= ?('|")(.*?)(\2|\s)/

isaac
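A rough sketch, assuming PHP 7.1 or later, of how the pattern suggested above behaves on a few inputs. Group 3 is the value. As written, the final group accepts any whitespace as a terminator, so a value containing a space is cut short, and an unclosed value with nothing after it does not match at all. The `$variant` pattern, which ends on the same quote or the end of input, is an assumption added for comparison and was not posted in the thread; `value_of` is a hypothetical helper.

```php
<?php
// Pattern as posted: group 1 = name, group 2 = opening quote, group 3 = value.
$posted = '/\s*([-a-zA-Z0-9]+?)\ ?= ?(\'|")(.*?)(\2|\s)/';

// Illustrative variant (an assumption, not from the thread):
// stop at the matching quote, or at the end of the input.
$variant = '/\s*([-a-zA-Z0-9]+?)\ ?= ?(\'|")(.*?)(\2|$)/';

// Hypothetical helper: returns the captured value, or null on no match.
function value_of(string $pattern, string $input): ?string
{
    return preg_match($pattern, $input, $m) === 1 ? $m[3] : null;
}

$inputs = ['name="value"', 'name="hello world"', 'name2="value2'];
foreach ($inputs as $input) {
    printf(
        "%-20s posted: %-12s variant: %s\n",
        $input,
        value_of($posted, $input) ?? '(no match)',
        value_of($variant, $input) ?? '(no match)'
    );
}
```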