Using a regexp as field separator does not work!

  #1  
Old 07-10-2008, 10:01 AM
Ronny
Guest
 
Default Using a regexp as field separator does not work!

I'm using gawk 3.1.0 (Windows native) and 3.1.6 (Cygwin). My input file
contains fields which are separated by a vertical bar, optionally followed
by spaces. Here is a small test program for this file format:

BEGIN {
print "run starts"
FS="| *"
}

{
print "processing line with",NF,"fields {",$0,"}"
}

When using the following 2-line input file:

# set art
0,1,4|set art

I get as output:

run starts
processing line with 3 fields { # set art }
processing line with 2 fields { 0,1,4|set art }

Still, a space is taken as field separator!

What am I doing wrong?

Ronald

  #2  
Old 07-10-2008, 10:22 AM
Loki Harfagr
Guest
 
Default Re: Using a regexp as field separator does not work!

Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :

> I'm using gawk 3.1.0 (Windows native) and 3.1.6 (Cygwin). My input file
> contains fields which are separated by a vertical, optionally followed by
> spaces. Here is a small test program for this file format:
>
> BEGIN {
> print "run starts"
> FS="| *"
> }
>
> {
> print "processing line with",NF,"fields {",$0,"}"
> }
>
> When using the following 2-line input file:
>
> # set art
> 0,1,4|set art
>
> I get as output:
>
> run starts
> processing line with 3 fields { # set art }
> processing line with 2 fields { 0,1,4|set art }
>
> Still, a space is taken as field separator!
>
> What am I doing wrong?


You want a regexp but you described a string, change:
FS="| *"
to be:
FS=/| */

  #3  
Old 07-10-2008, 10:40 AM
Ronny
Guest
 
Default Re: Using a regexp as field separator does not work!

On 10 Jul., 16:22, Loki Harfagr <l...@thedarkdesign.free.fr.INVALID>
wrote:
> Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :
> > BEGIN {
> > print "run starts"
> > FS="| *"
> > }
> > Still, a space is taken as field separator!

> You want a regexp but you described a string, change:
> FS="| *"
> to be:
> FS=/| */


Agreed, your solution works. Now I reread the man pages more carefully, and
there is one thing which puzzles me:

... If FS is a single character, fields are separated by that character.
If FS is the null string, then each individual character becomes a
separate field. Otherwise, FS is expected to be a full regular
expression...

I agree that this explains why I have to use // for describing my regexp.
But now I wonder what the semantics of my original code are. Taking the man
page literally, FS can be either the null string, or a string of length one,
or a regexp. It doesn't say anything about FS being a string of length > 1.
If this is forbidden, I would have expected an error message, but it was
accepted...

Ronald
  #4  
Old 07-10-2008, 10:56 AM
Ronny
Guest
 
Default Re: Using a regexp as field separator does not work!

Plus, I also just noticed an error in my regexp. Since the
vertical bar is a metacharacter, I should write it as

/\| */

Ronald
  #5  
Old 07-10-2008, 11:04 AM
Ronny
Guest
 
Default Re: Using a regexp as field separator does not work!

Hmmmm... still does not work. Here is my modified program:

BEGIN {
print "run starts"
FS=/[|] */
}

{
print "processing line with",NF,"fields {STATUS=",$1," CMD=",$2,"}"
}

My input file contains:

0,1,4|set art

and I get as result:

run starts
processing line with 2 fields {STATUS= CMD= ,1,4|set art }

This means it does get two fields, but splits at the beginning of the line!

Playing around, I found that the correct way to write the FS assignment
goes like this:

FS="[|] *"

So the problem was not the usage of a string instead of a regexp, but that
my regexp was wrong....
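
A note on why the regexp-literal form split at the start of the line. This is a sketch based on standard awk semantics, not something stated in the thread: a bare /regexp/ used as a value is shorthand for $0 ~ /regexp/, so the assignment stores the match result (0 or 1) rather than the pattern. In BEGIN, $0 is empty, so /[|] */ cannot match and FS ends up as "0". The shell variable names below are illustrative only.

```shell
# In BEGIN, $0 is empty, so /[|] */ does not match and the assignment stores 0.
fs_value=$(awk 'BEGIN { FS = /[|] */; print "FS is [" FS "]" }')
echo "$fs_value"

# With FS effectively "0", the record is split at its leading "0" character.
fields=$(printf '0,1,4|set art\n' |
    awk 'BEGIN { FS = /[|] */ } { print NF ": [" $1 "] [" $2 "]" }')
echo "$fields"
```

By the same reasoning, FS=/| */ from the earlier reply would presumably store 1, because that pattern can match the empty string. The sample line would then be split at the character "1", which still yields two fields and may be why it looked as if it worked.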

Ronald
  #6  
Old 07-10-2008, 11:45 AM
Dave B
Guest
 
Default Re: Using a regexp as field separator does not work!

Ronny wrote:

> Agreed, your solution works. Now I reread the man pages more
> carefully, and
> there is one thing which puzzles me:
>
> ... If FS is a single
> character, fields are separated by that character. If FS is
> the null string, then each individual character becomes a separate
> field. Otherwise, FS is expected to be a full regular expression...
>
> I agree that this explains why I have to use // for describing my
> regexp.


No, you don't. The following works:

awk -v FS='\\| *'

The problem is that "|" must be escaped twice.
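
A minimal sketch of the double escaping, assuming the sample line from the original post. The string assigned to FS goes through awk's escape processing first, which turns \\ into a single backslash, and only then is the result compiled as a regexp in which \| means a literal vertical bar. The shell variable names are made up for the example.

```shell
line='0,1,4|set art'

# Command line: awk processes escapes in -v assignments, so \\| arrives as \|.
via_v=$(printf '%s\n' "$line" |
    awk -v FS='\\| *' '{ print NF ": [" $1 "] [" $2 "]" }')
echo "$via_v"

# Inside the program, a string constant goes through the same two stages.
via_begin=$(printf '%s\n' "$line" |
    awk 'BEGIN { FS = "\\| *" } { print NF ": [" $1 "] [" $2 "]" }')
echo "$via_begin"
```

Both commands should report two fields, with the spaces after the bar consumed as part of the separator.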

> But now I wonder what is the semantics of my original code?


FS="| *"

This uses as field separator either nothing, or *zero* or more spaces. An
empty FS is undefined by the standard, and is allowed as a special case by
GNU awk, but only through the special syntax FS='', or FS=, or -F ''.

In your case, I'm not sure how the regex is parsed. It seems to behave as if
runs of space are used as separator, but not with awk's default semantics:

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f

But I'm not sure about what goes on behind the scenes here. Hopefully
someone will shed some light here.


> Taking the man page literally, FS can be either the null string, or a
> string of length one, or a regexp. It doesn't say anything about FS
> being a string of length > 1.


In that case, it's taken as a regex.

> If this is forbidden, I would have expected an error message, but it
> was accepted...


Because it's perfectly valid.

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'
  #7  
Old 07-10-2008, 12:03 PM
Ed Morton
Guest
 
Default Re: Using a regexp as field separator does not work!



On 7/10/2008 10:45 AM, Dave B wrote:
> Ronny wrote:
>
>
>>Agreed, your solution works. Now I reread the man pages more
>>carefully, and
>>there is one thing which puzzles me:
>>
>> ... If FS is a single
>> character, fields are separated by that character. If FS is
>>the null string, then each individual character becomes a separate
>>field. Otherwise, FS is expected to be a full regular expression...
>>
>>I agree that this explains why I have to use // for describing my
>>regexp.

>
>
> No, you don't. The following works:
>
> awk -v FS='\\| *'
>
> The problem is that "|" must be escaped twice.
>
>
>>But now I wonder what is the semantics of my original code?

>
>
> FS="| *"
>
> This uses as field separator either nothing, or *zero* or more spaces. An
> empty FS is undefined by the standard, and is allowed as a special case by
> GNU awk, but only through the special syntax FS='', or FS=, or -F ''.
>
> In your case, I'm not sure how the regex is parsed. It seems to behave as if
> runs of space are used as separator, but not with awk's default semantics:
>
> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
> 4
>
> abc
> de
> f
>
> But I'm not sure about what goes on behind the scenes here. Hopefully
> someone will shed some light here.


An FS that's a single blank character is a special case that treats contiguous
sequences of any white space as a single separator and, importantly, strips off
leading white space from the record. You specified an FS that's an RE instead of
a single blank character so you do not get the "special" behavior, just like you
wouldn't if you specified a literal blank character:

-------------
$ echo ' abc de  f' |awk -v FS=' ' '{print NF;for(i=1;i<=NF;i++)print$i}'
3
abc
de
f
$ echo ' abc de  f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f
$ echo ' abc de  f' |awk -v FS='[ ]' '{print NF;for(i=1;i<=NF;i++)print$i}'
5

abc
de

f
--------------

Note the leading white space above when you don't use the single blank character FS.

>
>
>>Taking the man page literally, FS can be either the null string, or a
>>string of length one, or a regexp. It doesn't say anything about FS
>>being a string of length > 1.

>
>
> In that case, it's taken as a regex.
>
>
>>If this is forbidden, I would have expected an error message, but it
>>was accepted...

>
>
> Because it's perfectly valid.
>


For the OP, this is what you want:

BEGIN {
print "run starts"
FS="[|] *"
}

{
print "processing line with",NF,"fields {",$0,"}"
}
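
To check this against the two-line input from the first post, something like the following could be run. It is only a sketch: the input is fed through printf instead of a file, and the shell variable name is arbitrary.

```shell
# Feed the original two-line sample to the corrected program.
out=$(printf '# set art\n0,1,4|set art\n' | awk '
BEGIN {
    print "run starts"
    FS = "[|] *"    # a literal vertical bar, then any number of spaces
}
{
    print "processing line with", NF, "fields {", $0, "}"
}')
echo "$out"
```

The comment line contains no vertical bar, so it should now be reported as a single field, and the data line as two.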

Regards,

Ed.

  #8  
Old 07-10-2008, 12:27 PM
Dave B
Guest
 
Default Re: Using a regexp as field separator does not work!

Ed Morton wrote:

>> FS="| *"
>>[snip]

> An FS that's a single blank character is a special case that treats contiguous
> sequences of any white space as a single separator and, importantly, strips off
> leading white space from the record. You specified an FS that's an RE instead of
> a single blank character so you do not get the "special" behavior, just like you
> wouldn't if you specified a literal blank character:


Yes, I know that...but I'm not sure how awk determines it has encountered a
"field separator" if the regex '| *' is used as FS. What part of the
alternation is used? It seems that it's effectively treated like ' *', but
the corner case where a "nothingness" matches (which is allowed by ' *', and
which would thus make it behave similarly as if FS='') never happens. These
seem to be equivalent:

$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'

But I don't know why.


> For the OP, this is what you want:
>
> BEGIN {
> print "run starts"
> FS="[|] *"


or FS='\\| *'

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'
  #9  
Old 07-10-2008, 12:46 PM
Ed Morton
Guest
 
Default Re: Using a regexp as field separator does not work!



On 7/10/2008 11:27 AM, Dave B wrote:
> Ed Morton wrote:
>
>
>>>FS="| *"
>>>[snip]

>>
>>An FS that's a single blank character is a special case that treats contiguous
>>sequences of any white space as a single separator and, importantly, strips off
>>leading white space from the record. You specified an FS that's an RE instead of
>>a single blank character so you do not get the "special" behavior, just like you
>>wouldn't if you specified a literal blank character:

>
>
> Yes, I know that...but I'm not sure how awk determines it has encountered a
> "field separator" if the regex '| *' is used as FS. What part of the
> alternation is used? It seems that it's effectively treated like ' *', but
> the corner case where a "nothingness" matches (which is allowed by ' *', and
> which would thus make it behave similarly as if FS='') never happens. These
> seem to be equivalent:
>
> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> But I don't know why.


I'm not sure I understand the question. An FS of a null character is another
special case, just like an FS of a single blank character is a special case. A
null character appearing as part of an FS that's an RE isn't treated the same as
a null character that IS an FS, just like a single blank character appearing as
part of an FS that's an RE isn't treated the same as a blank character that IS
an FS.

So, when you write FS='| *' you're saying the FS is either nothing at all OR a
sequence of zero or more blanks. Yes, that doesn't make sense so you can
obviously optimize it to ' *' but there's plenty of REs we see people write that
could be optimized and awk doesn't try to analyze and warn you about any of them
other than a useless backslash.

Ed.

  #10  
Old 07-10-2008, 01:13 PM
Dave B
Guest
 
Default Re: Using a regexp as field separator does not work!

Ed Morton wrote:

>> Yes, I know that...but I'm not sure how awk determines it has encountered a
>> "field separator" if the regex '| *' is used as FS. What part of the
>> alternation is used? It seems that it's effectively treated like ' *', but
>> the corner case where a "nothingness" matches (which is allowed by ' *', and
>> which would thus make it behave similarly as if FS='') never happens. These
>> seem to be equivalent:
>>
>> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> But I don't know why.

>
> I'm not sure I understand the question. An FS of a null character is another
> special case, just like an FS of a single blank character is a special case. A
> null character appearing as part of an FS that's an RE isn't treated the same as
> a null character that IS an FS, just like a single blank character appearing as
> part of an FS that's an RE isn't treated the same as a blank character that IS
> an FS.


Agreed, although strictly speaking we don't have "null characters", but
rather "null" or "empty" regexes here.

> So, when you write FS='| *' you're saying the FS is either nothing at all OR a
> sequence of zero or more blanks. Yes, that doesn't make sense so you can
> obviously optimize it to ' *' but there's plenty of REs we see people write that
> could be optimized and awk doesn't try to analyze and warn you about any of them
> other than a useless backslash.


Ok, I'll try to explain better. Suppose we have FS='x|y', and the input is

fooxbarybaz

We know that, when awk encounters 'x', FS matches, so awk decides that that
'x' is a field separator. The same happens when awk gets to 'y', later.
Every time, awk (actually, awk's regex engine) has used a certain part of
the alternation in the FS regex to try a match and decide if a piece of
input was to be considered a field separator (of course, this is a simple
regex, but it can be more complex, with each part matching longer strings,
or with a different structure. Also, I'm using a regex of the form x|y
because it's similar to the case at hand).

Now, if FS='| *' (again an alternation), and the input is

abc

in theory, awk should immediately find a match for FS, since the part to the
left of "|" is an empty regex, which matches at the beginning of the string,
at the end, and between any two characters. And, the part to the right of
the "|" (" *") also allows matching an empty string, although awk should
not get that far, since the part to the left of the "|" already matches. But
this does not happen. Also, as each character is examined, awk should find a
match for the empty regex between any two characters, but again that doesn't
happen. But awk DOES know how to do that, because if you do

a="abc"; gsub(/| */,"X",a)

you correctly get

XaXbXcX

So, my doubt was: why isn't awk matching the null regex (using either part
of the alternation appearing in FS)? I guess the answer is: because FS is
special and does not work that way, unless FS is explicitly set to '' (GNU
awk only). Ok. But then, how does it work? Why does awk choose to treat a
nonsense FS like '| *' as if it were ' *'? What's the logic behind that?
Hope this was clearer.
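
One possible reading, offered as an assumption rather than as documented behaviour: when splitting fields, awk seems to count only non-empty matches of the separator regexp, whereas gsub() also replaces empty matches. The sketch below contrasts the two. It uses / */ instead of /| */ because POSIX leaves a leading "|" in an extended regular expression undefined, so the empty alternative may not be accepted by every awk. The shell variable names are illustrative.

```shell
# gsub() replaces every match of the pattern, including zero-length ones.
subbed=$(awk 'BEGIN { a = "abc"; gsub(/ */, "X", a); print a }')
echo "$subbed"

# split() follows the field-splitting rules: with the same pattern as the
# separator, only the non-empty match (the space) appears to split the string.
parts=$(awk 'BEGIN { n = split("abc de", p, / */); print n ": [" p[1] "] [" p[2] "]" }')
echo "$parts"
```

If that reading is right, '| *' and ' *' end up equivalent as field separators simply because their non-empty matches are the same: runs of one or more spaces.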

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'