#1
I'm using gawk 3.1.0 (Windows native) and 3.1.6 (Cygwin). My input file contains fields which are separated by a vertical bar, optionally followed by spaces. Here is a small test program for this file format:

BEGIN {
    print "run starts"
    FS="| *"
}

{
    print "processing line with",NF,"fields {",$0,"}"
}

When using the following 2-line input file:

# set art
0,1,4|set art

I get as output:

run starts
processing line with 3 fields { # set art }
processing line with 2 fields { 0,1,4|set art }

Still, a space is taken as field separator!

What am I doing wrong?

Ronald

#2

Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :

> I'm using gawk 3.1.0 (Windows native) and 3.1.6 (Cygwin). My input file
> contains fields which are separated by a vertical, optionally followed
> by spaces. Here is a small test program for this file format:
>
> BEGIN {
> print "run starts"
> FS="| *"
> }
>
> {
> print "processing line with",NF,"fields {",$0,"}"
> }
>
> When using the following 2-line input file:
>
> # set art
> 0,1,4|set art
>
> I get as output:
>
> run starts
> processing line with 3 fields { # set art }
> processing line with 2 fields { 0,1,4|set art }
>
> Still, a space is taken as field separator!
>
> What am I doing wrong?

You want a regexp but you described a string, change:

FS="| *"

to be:

FS=/| */

#3

On 10 Jul., 16:22, Loki Harfagr <l...@thedarkdesign.free.fr.INVALID> wrote:
> Thu, 10 Jul 2008 07:01:03 -0700, Ronny did cat :
> > BEGIN {
> > print "run starts"
> > FS="| *"
> > }
> > Still, a space is taken as field separator!
> You want a regexp but you described a string, change:
> FS="| *"
> to be:
> FS=/| */

Agreed, your solution works. Now I reread the man pages more carefully, and there is one thing which puzzles me:

    ... If FS is a single character, fields are separated by that character. If FS is the null string, then each individual character becomes a separate field. Otherwise, FS is expected to be a full regular expression...

I agree that this explains why I have to use // for describing my regexp.

But now I wonder what is the semantics of my original code? Taking the man page literally, FS can be either the null string, or a string of length one, or a regexp. It doesn't say anything about FS being a string of length > 1.

If this is forbidden, I would have expected an error message, but it was accepted...

Ronald
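
A small sketch of the man page cases quoted above, offered as an illustration rather than as part of the original exchange. It assumes a POSIX-style awk on the PATH. The sample line is a hypothetical variant of the poster's input with one space after the bar, and the shell variable names are made up for the example. A one-character FS of "|" is used literally, while a longer FS is handled as a regular expression.

```shell
# Hypothetical sample line: a bar followed by one space.
line='0,1,4| set art'

# Single-character FS: "|" is taken literally, so the space stays in $2.
single=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "|" } { print NF ":" $2 }')

# Longer FS: treated as a regular expression. [|] matches a literal bar
# and " *" swallows any spaces that follow it.
regex=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "[|] *" } { print NF ":" $2 }')

printf '%s\n%s\n' "$single" "$regex"
```

Both runs report two fields. The only difference is whether the space after the bar ends up at the start of the second field.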

#4

Plus, I also just noticed an error in my regexp. Since the vertical bar is a metacharacter, I should write it as

/\| */

Ronald

#5

Hmmmm... still does not work. Here is my modified program:

BEGIN {
    print "run starts"
    FS=/[|] */
}

{
    print "processing line with",NF,"fields {STATUS=",$1," CMD=",$2,"}"
}

My input file contains:

0,1,4|set art

and I get as result:

run starts
processing line with 2 fields {STATUS= CMD= ,1,4|set art }

This means it does get two fields, but splits at the beginning of the line!

Playing around, I found that the correct way to write the FS assignment goes like this:

FS="[|] *"

So the problem was not the usage of a string instead of a regexp, but that my regexp was wrong....

Ronald
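
A possible explanation for the split at the start of the line, sketched here as a side note rather than something the thread itself states: in awk, a bare regex constant used as a value is shorthand for a match against $0, so it yields 1 or 0. Inside BEGIN there is no record yet, so FS=/[|] */ would assign 0, and a record that starts with the character 0 is then split right there. The shell variable names below are illustrative only.

```shell
# In BEGIN, $0 is empty, so /[|] */ (short for $0 ~ /[|] */) evaluates to 0.
fs_value=$(awk 'BEGIN { FS = /[|] */; print "[" FS "]" }')

# With FS holding "0", the record is split at its leading 0:
# $1 is empty and $2 is the rest of the line.
fields=$(printf '0,1,4|set art\n' |
    awk 'BEGIN { FS = /[|] */ } { print NF ": [" $1 "] [" $2 "]" }')

printf '%s\n%s\n' "$fs_value" "$fields"
```

If this reading is right, the string form FS="[|] *" works simply because a string is assigned as-is and only later interpreted as a regular expression.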

#6

Ronny wrote:

> Agreed, your solution works. Now I reread the man pages more
> carefully, and there is one thing which puzzles me:
>
> ... If FS is a single
> character, fields are separated by that character. If FS is
> the null string, then each individual character becomes a separate
> field. Otherwise, FS is expected to be a full regular expression...
>
> I agree that this explains why I have to use // for describing my
> regexp.

No, you don't. The following works:

awk -v FS='\\| *'

The problem is that "|" must be escaped twice.

> But now I wonder what is the semantics of my original code?

FS="| *"

This uses as field separator either nothing, or *zero* or more spaces. An empty FS is undefined by the standard, and is allowed as a special case by GNU awk, but only through the special syntax FS='', or FS=, or -F ''.

In your case, I'm not sure how the regex is parsed. It seems to behave as if runs of space are used as separator, but not with awk's default semantics:

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f

But I'm not sure about what goes on behind the scenes here. Hopefully someone will shed some light here.

> Taking the man page literally, FS can be either the null string, or a
> string of length one, or a regexp. It doesn't say anything about FS
> being a string of length > 1.

In that case, it's taken as a regex.

> If this is forbidden, I would have expected an error message, but it
> was accepted...

Because it's perfectly valid.

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'
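
To illustrate the "escaped twice" remark above, here is a minimal sketch, assuming a POSIX-style awk. A string used as FS is first processed for escape sequences (turning \\ into a single backslash) and only then read as a regular expression, so the regex engine ends up seeing \| followed by " *". A bracket expression sidesteps the double escaping. The shell variable names are illustrative.

```shell
line='0,1,4|set art'

# String constant in the program text: "\\| *" reaches the regex engine as \| *
in_program=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\\| *" } { print NF ":" $2 }')

# A -v assignment goes through the same escape processing.
with_v=$(printf '%s\n' "$line" | awk -v FS='\\| *' '{ print NF ":" $2 }')

# A bracket expression needs no backslashes at all.
bracket=$(printf '%s\n' "$line" | awk -F '[|] *' '{ print NF ":" $2 }')

printf '%s\n%s\n%s\n' "$in_program" "$with_v" "$bracket"
```

All three spellings should split the sample line into the same two fields. The bracket form is probably the easiest to read and to get right.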

#7

On 7/10/2008 10:45 AM, Dave B wrote:
> Ronny wrote:
>
>>Agreed, your solution works. Now I reread the man pages more
>>carefully, and there is one thing which puzzles me:
>>
>> ... If FS is a single
>> character, fields are separated by that character. If FS is
>>the null string, then each individual character becomes a separate
>>field. Otherwise, FS is expected to be a full regular expression...
>>
>>I agree that this explains why I have to use // for describing my
>>regexp.
>
> No, you don't. The following works:
>
> awk -v FS='\\| *'
>
> The problem is that "|" must be escaped twice.
>
>>But now I wonder what is the semantics of my original code?
>
> FS="| *"
>
> This uses as field separator either nothing, or *zero* or more spaces. An
> empty FS is undefined by the standard, and is allowed as a special case by
> GNU awk, but only through the special syntax FS='', or FS=, or -F ''.
>
> In your case, I'm not sure how the regex is parsed. It seems to behave as if
> runs of space are used as separator, but not with awk's default semantics:
>
> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
> 4
>
> abc
> de
> f
>
> But I'm not sure about what goes on behind the scenes here. Hopefully
> someone will shed some light here.

An FS that's a single blank character is a special case that treats contiguous sequences of any white space as a single separator and, importantly, strips off leading white space from the record. You specified an FS that's an RE instead of a single blank character so you do not get the "special" behavior, just like you wouldn't if you specified a literal blank character:

-------------
$ echo ' abc de f' |awk -v FS=' ' '{print NF;for(i=1;i<=NF;i++)print$i}'
3
abc
de
f
$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
4

abc
de
f
$ echo ' abc de f' |awk -v FS='[ ]' '{print NF;for(i=1;i<=NF;i++)print$i}'
5

abc
de
f
--------------

Note the leading white space above when you don't use the single blank character FS.

>>Taking the man page literally, FS can be either the null string, or a
>>string of length one, or a regexp. It doesn't say anything about FS
>>being a string of length > 1.
>
> In that case, it's taken as a regex.
>
>>If this is forbidden, I would have expected an error message, but it
>>was accepted...
>
> Because it's perfectly valid.

For the OP, this is what you want:

BEGIN {
    print "run starts"
    FS="[|] *"
}

{
    print "processing line with",NF,"fields {",$0,"}"
}

Regards,

Ed.
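
The single-blank special case described above can be reproduced with explicit spacing. This is only a sketch: the input line (one leading space, two spaces before "de") is an assumption chosen so that the spacing is unambiguous, and FS=' +' is swapped in for the thread's FS=' *' to avoid relying on how a given awk treats separators that can match the empty string.

```shell
# Assumed input: one leading space, and two spaces between "abc" and "de".
line=' abc  de f'

# Single blank: leading white space is stripped and runs of blanks collapse.
blank=$(printf '%s\n' "$line" | awk -v FS=' ' '{ print NF }')

# One or more blanks as a regex: the leading blank now yields an empty $1.
plus=$(printf '%s\n' "$line" | awk -v FS=' +' '{ print NF }')

# Exactly one blank as a regex: every single blank separates two fields.
bracket=$(printf '%s\n' "$line" | awk -v FS='[ ]' '{ print NF }')

printf '%s %s %s\n' "$blank" "$plus" "$bracket"
```

With this input the three counts differ only because of how leading and repeated blanks are treated, which is the distinction the post is drawing.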

#8

Ed Morton wrote:

>> FS="| *"
>>[snip]
> An FS that's a single blank character is a special case that treats contiguous
> sequences of any white space as a single separator and, importantly, strips off
> leading white space from the record. You specified an FS that's an RE instead of
> a single blank character so you do not get the "special" behavior, just like you
> wouldn't if you specified a literal blank character:

Yes, I know that...but I'm not sure how awk determines it has encountered a "field separator" if the regex '| *' is used as FS. What part of the alternation is used? It seems that it's effectively treated like ' *', but the corner case where a "nothingness" matches (which is allowed by ' *', and which would thus make it behave similarly as if FS='') never happens. These seem to be equivalent:

$ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'

$ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'

But I don't know why.

> For the OP, this is what you want:
>
> BEGIN {
> print "run starts"
> FS="[|] *"

or FS='\\| *'

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'

#9

On 7/10/2008 11:27 AM, Dave B wrote:
> Ed Morton wrote:
>
>>>FS="| *"
>>>[snip]
>>
>>An FS that's a single blank character is a special case that treats contiguous
>>sequences of any white space as a single separator and, importantly, strips off
>>leading white space from the record. You specified an FS that's an RE instead of
>>a single blank character so you do not get the "special" behavior, just like you
>>wouldn't if you specified a literal blank character:
>
> Yes, I know that...but I'm not sure how awk determines it has encountered a
> "field separator" if the regex '| *' is used as FS. What part of the
> alternation is used? It seems that it's effectively treated like ' *', but
> the corner case where a "nothingness" matches (which is allowed by ' *', and
> which would thus make it behave similarly as if FS='') never happens. These
> seem to be equivalent:
>
> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>
> But I don't know why.

I'm not sure I understand the question. An FS of a null character is another special case, just like an FS of a single blank character is a special case. A null character appearing as part of an FS that's an RE isn't treated the same as a null character that IS an FS, just like a single blank character appearing as part of an FS that's an RE isn't treated the same as a blank character that IS an FS.

So, when you write FS='| *' you're saying the FS is either nothing at all OR a sequence of zero or more blanks. Yes, that doesn't make sense so you can obviously optimize it to ' *' but there's plenty of REs we see people write that could be optimized and awk doesn't try to analyze and warn you about any of them other than a useless backslash.

Ed.

#10

Ed Morton wrote:

>> Yes, I know that...but I'm not sure how awk determines it has encountered a
>> "field separator" if the regex '| *' is used as FS. What part of the
>> alternation is used? It seems that it's effectively treated like ' *', but
>> the corner case where a "nothingness" matches (which is allowed by ' *', and
>> which would thus make it behave similarly as if FS='') never happens. These
>> seem to be equivalent:
>>
>> $ echo ' abc de f' |awk -v FS=' *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> $ echo ' abc de f' |awk -v FS='| *' '{print NF;for(i=1;i<=NF;i++)print$i}'
>>
>> But I don't know why.
>
> I'm not sure I understand the question. An FS of a null character is another
> special case, just like an FS of a single blank character is a special case. A
> null character appearing as part of an FS that's an RE isn't treated the same as
> a null character that IS an FS, just like a single blank character appearing as
> part of an FS that's an RE isn't treated the same as a blank character that IS
> an FS.

Agreed, although strictly speaking we don't have "null characters", but rather "null" or "empty" regexes here.

> So, when you write FS='| *' you're saying the FS is either nothing at all OR a
> sequence of zero or more blanks. Yes, that doesn't make sense so you can
> obviously optimize it to ' *' but there's plenty of REs we see people write that
> could be optimized and awk doesn't try to analyze and warn you about any of them
> other than a useless backslash.

Ok, I'll try to explain better. Suppose we have FS='x|y', and the input is

fooxbarybaz

We know that, when awk encounters 'x', FS matches, so awk decides that that 'x' is a field separator. The same happens when awk gets to 'y', later. Every time, awk (actually, awk's regex engine) has used a certain part of the alternation in the FS regex to try a match and decide if a piece of input was to be considered a field separator (of course, this is a simple regex, but it can be more complex, with each part matching longer strings, or with a different structure. Also, I'm using a regex of the form x|y because it's similar to the case at hand).

Now, if FS='| *' (again an alternation), and the input is

abc

in theory, awk should immediately find a match for FS, since the part to the left of "|" is an empty regex, which matches at the beginning of the string, at the end, and between any two characters. And, the part to the right of the "|" (" *") also allows matching an empty string, although awk should not get that far, since the part to the left of the "|" already matches. But this does not happen. Also, as each character is examined, awk should find a match for the empty regex between any two characters, but again that doesn't happen. But awk DOES know how to do that, because if you do

a="abc"; gsub(/| */,"X",a)

you correctly get

XaXbXcX

So, my doubt was: why isn't awk matching the null regex (using either part of the alternation appearing in FS)? I guess the answer is: because FS is special and does not work that way, unless FS is explicitly set to '' (GNU awk only). Ok. But then, how does it work? Why does awk choose to treat a nonsense FS like '| *' as if it were ' *'? What's the logic behind that?

Hope this was clearer.

--
awk 'BEGIN{O="~"~"~";o="=="=="==";o+=+o;x=o""o;while(X ++<x-o-O)c=c"%c";
X=O""O;printf c,O+x*o*o+X,(X+x)*(O+o)-o,+X*X-o-O,o+x*o*o+X,x*o*o+X-o-o,
x*(o+o)+X-O,+X*X-X+o+o,x+x+x-o,o+X+O+o+x*o*o,x+O+x*o*o,x*o*o+x+O+o+o+O,
x+o+x*o*o,x+x*o*o+O,o+x+x*o*o,o+X*o*o,X+x*o*o,x*o* o+O+x,x+x*o*o-O,X-O}'
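
The contrast raised above between gsub() and field splitting can be tried directly. This is a hedged sketch, not an answer to the question: it uses x* instead of the thread's '| *' as a regex that can match the empty string, because an empty alternative is not portable across regex implementations. Only the gsub() result is stated with confidence. What a given awk does with such an FS is printed but deliberately not predicted here.

```shell
# gsub() accepts empty matches: x* matches the empty string before each
# character and at the end of the string.
subst=$(awk 'BEGIN { a = "abc"; gsub(/x*/, "X", a); print a }')

# The same regex used as FS. The field count is printed for inspection only;
# no particular value is assumed, since implementations may differ.
nf=$(printf 'abc\n' | awk -v FS='x*' '{ print NF }')

printf 'gsub result: %s\nNF with FS=x*: %s\n' "$subst" "$nf"
```

If the second line reports a single field, that would match the observation in the thread that empty matches do not act as field separators, though the reason would still have to be confirmed against the documentation or source of the awk in use.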