#1
I have been trying to do this instead of placing everything in a hash/array and comparing in the END block.

For example, if I have a file like this:

111
2222
333
333
4445
3434

Notice there is a duplicate "333". How can I test if the next line is the same as the current line? I suppose I can use getline() but is there another clever way of achieving this?

Also, how can I check for the previous line?

TIA

#2
On Saturday 26 July 2008 21:02, Mag Gam wrote:

> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is
> there another clever way of achieving this?

I don't know if that can be considered more clever, however you can just save the value of the previous line:

    awk '{if ($0==prev) {
            # ... this line is the same as previous line
          }
          prev=$0}' file

What are you trying to do? What's the underlying problem?

If you just want to remove duplicates, you can do

    awk '!a[$0]++' file

> Also, how can I check for previous line?

See above.

--
All the commands are tested with bash and GNU tools, so they may use nonstandard features. I try to mention when something is nonstandard (if I'm aware of that), but I may miss something. Corrections are welcome.

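A minimal sketch of the saved-previous-line approach from the reply above, filled in so that it reports each adjacent repeat with its line number. The function name `report_adjacent_dups`, the message format, and the `NR > 1` guard (which keeps an empty first line from matching the still-unset `prev`) are assumptions added for illustration, not part of the original command.

```sh
# Hypothetical helper: report every line that repeats the line directly above it.
# Only one previous line is held in memory, whatever the input size.
report_adjacent_dups() {
    awk 'NR > 1 && $0 == prev { print "line " NR " repeats line " (NR - 1) ": " $0 }
         { prev = $0 }' "$@"
}

# Sample data from the question, fed on standard input.
printf '%s\n' 111 2222 333 333 4445 3434 | report_adjacent_dups
# prints: line 4 repeats line 3: 333
```

Because the comparison happens when the second of two equal lines is read, the same test answers both "is the next line the same?" and "is the previous line the same?" without calling getline.
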
#3
Thanks for the response.

The underlying problem is, the file is huge; it's close to 15g and I would like to compare.

What I am trying to do is compare the current line to the next line (or vice versa, 2nd line to 1st).

With the hash solution, I was able to get the answer. However, my sysadmin is complaining I am taking up too much memory.

On Jul 26, 3:14 pm, pk <p...@pk.invalid> wrote:
> On Saturday 26 July 2008 21:02, Mag Gam wrote:
>
> > I have been trying to do this instead of placing everything in a hash/
> > array and compare in the END block.
> >
> > For example, if I have a file like this
> >
> > 111
> > 2222
> > 333
> > 333
> > 4445
> > 3434
> >
> > Notice there is a duplicate "333". How can I test if the next line is
> > the same as the current line? I suppose I can use getline() but is
> > there another clever way of achieving this?
>
> I don't know if that can be considered more clever, however you can just
> save the value of the previous line:
>
> awk '{if ($0==prev) {
>         # ... this line is the same as previous line
>       }
>       prev=$0}' file
>
> What are you trying to do? What's the underlying problem?
>
> If you just want to remove duplicates, you can do
>
> awk '!a[$0]++' file
>
> > Also, how can I check for previous line?
>
> See above.
>
> --
> All the commands are tested with bash and GNU tools, so they may use
> nonstandard features. I try to mention when something is nonstandard (if
> I'm aware of that), but I may miss something. Corrections are welcome.

#4
Mag Gam wrote:
> Thanks for the response.

[Please don't top-post!]

> The underlying problem is, the file is huge; it's close to 15g and I
> would like to compare.

(Never measured files in grams, so I can't help you here.)

> What I am trying to do is compare the current line to the next line
> (or vice versa, 2nd line to 1st).

Have you tried pk's proposal? It solves what you've asked for.

> With the hash solution, I was able to get the answer. However, my
> sysadmin is complaining I am taking up too much memory.

You already told us that your own hash solution doesn't fit your needs. So just use pk's solution. What's the problem?

Janis

#5
On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:

> Mag Gam wrote:
>> Thanks for the response.
>
> [Please don't top-post!]
>
>> The underlying problem is, the file is huge; it's close to 15g and I
>> would like to compare.
>
> (Never measured files in grams, so I can't help you here.)

Ah Janis, the poor OP didn't mean grams but gravitational levels, and under 15 g it's certainly difficult to cure any file ~;O)

>> What I am trying to do is compare the current line to the next line
>> (or vice versa, 2nd line to 1st).
>
> Have you tried pk's proposal? It solves what you've asked for.
>
>> With the hash solution, I was able to get the answer. However, my
>> sysadmin is complaining I am taking up too much memory.
>
> You already told us that your own hash solution doesn't fit your needs.
> So just use pk's solution. What's the problem?

I suspect pk's solution (though very good) may, in the OP's case, still consume a lot of memory in the a[] buffer if by 'chance' the overgravitated input file has a lot of different lines ;-)

If that's the point, I propose here a possible way to drastically reduce the memory usage, certainly not the golf contest winner of the month but quite close to making the list of obfuscated style samples ;D)
Anyway:

    $ awk '{n++;n%=1;a[n]=a[n+1];a[n+1]=$0;if(a[n+1]==a[n]){print "Mind the gap";next}}1'

That way, if the OP's sysadmin still has a problem with memory usage, that leaves us with a few hypotheses: the server might need an upgrade from a Z80-MSX-16KB towards power machines like an AtariST520 or even a PDP-20, or the sysadmin has to be seen as human and may need some vacation time (like the week I just had ;-), or maybe the file has extremely looong records...

(to OP: Replace the ``Londoner'' message by whatever you need, other msg or action)

--
have space suit : "VMSBUX:B0N1@GOHH.GO" will travel : tr "MLKJHGFDSQNBVCXWPOIUYTREZA" "a-z"

#6
On Sat, 26 Jul 2008 12:02:48 -0700, Mag Gam wrote:

> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is there
> another clever way of achieving this?
>
> Also, how can I check for previous line?

Functionally, this is the same as pk's suggestion; it's just written out in a fuller (C-like) and, hopefully, clearer form. Since you didn't say what you want to do with the lines after suppressing adjacent duplicates, I wrote it to print the non-duplicate lines as it encounters them. This should not be sensitive to the file size because it stores only one line at a time.

    {
        if( $0 != Prev ) print $0
        Prev = $0
    }

In minimalist awk format, that's

    $0 != Prev {print}
    {Prev = $0}

As a command line program that could be (minimalist format)

    awk '$0!=Prev{print}{Prev=$0}' source > target

(tested under Fedora and XP (as a script file - all variations tested under Linux) with your sample data)

BTW, "gigabytes" is usually abbreviated GB (Gb would be "gigabits"). Abbreviations for SI prefixes larger than kilo are all upper case; all those smaller than mega are in lower case. The full prefix names are in lower case unless the language requires initial capitals (k and K have an unofficial byte/bit context usage: k = 1000; K = 1024).

--
T.E.D. (tdavis@mst.edu)
MST (Missouri University of Science and Technology) used to be UMR (University of Missouri - Rolla).

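A small sketch contrasting the two kinds of duplicate removal discussed so far, since they differ in both result and memory use. The function names `dedupe_adjacent` and `dedupe_global`, the array name `seen`, and the `NR == 1` guard (so an empty first line is still printed) are assumptions for illustration; the awk bodies otherwise follow the commands quoted in the thread.

```sh
# Drop a line only when it equals the line directly before it.
# Memory use is one saved line, regardless of file size.
dedupe_adjacent() {
    awk 'NR == 1 || $0 != prev { print } { prev = $0 }' "$@"
}

# Drop every repeat anywhere in the input.
# Memory use grows with the number of distinct lines.
dedupe_global() {
    awk '!seen[$0]++' "$@"
}

# Input with a non-adjacent repeat of 333 at the end.
printf '%s\n' 333 333 4445 333 | dedupe_adjacent   # prints 333, 4445, 333
printf '%s\n' 333 333 4445 333 | dedupe_global     # prints 333, 4445
```

If the repeats in the real file are always adjacent (for example because the file is sorted), the first form gives the same result as the second without the array.
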
#7
loki harfagr wrote:
> On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:
>
>> Mag Gam wrote:
>>
>>> Thanks for the response.
>>
>> [Please don't top-post!]
>>
>>> The underlying problem is, the file is huge; it's close to 15g and I
>>> would like to compare.
>>
>> (Never measured files in grams, so I can't help you here.)
>
> Ah Janis, the poor OP didn't mean grams but gravitational levels,
> and under 15 g it's certainly difficult to cure any file ~;O)

:-) Frankly, I wasn't sure whether he could have meant gravity ;-)

>>> What I am trying to do is compare the current line to the next line
>>> (or vice versa, 2nd line to 1st).
>>
>> Have you tried pk's proposal? It solves what you've asked for.
>>
>>> With the hash solution, I was able to get the answer. However, my
>>> sysadmin is complaining I am taking up too much memory.
>>
>> You already told us that your own hash solution doesn't fit your needs.
>> So just use pk's solution. What's the problem?
>
> I suspect pk's solution (though very good) may, in the OP's case,
> still consume a lot of memory in the a[] buffer if by 'chance'
> the overgravitated input file has a lot of different lines ;-)

Oh, I meant his first proposal, the one without a[]...

    awk '{if ($0==prev) {
            # ... this line is the same as previous line
          }
          prev=$0}' file

Janis

#8
Ted Davis wrote:
> On Sat, 26 Jul 2008 12:02:48 -0700, Mag Gam wrote:
>
>> I have been trying to do this instead of placing everything in a hash/
>> array and compare in the END block.
>>
>> For example, if I have a file like this
>>
>> 111
>> 2222
>> 333
>> 333
>> 4445
>> 3434
>>
>> Notice there is a duplicate "333". How can I test if the next line is
>> the same as the current line? I suppose I can use getline() but is there
>> another clever way of achieving this?
>>
>> Also, how can I check for previous line?
>
> Functionally, this is the same as PK's suggestion, it's just written out
> in a fuller (C-like), and hopefully, clearer, form - since you didn't say
> what you want to do with the lines after suppressing adjacent duplicates,
> I wrote it to print the non-duplicate lines as it encounters them. This
> should not be sensitive to the file size because it stores only one line
> at a time.
>
> {
>     if( $0 != Prev ) print $0
>     Prev = $0
> }
>
> In minimalist awk format, that's
> $0 != Prev {print}
> {Prev = $0}
>
> As a command line program that could be (minimalist format)
>
> awk '$0!=Prev{print}{Prev=$0}' source > target

If we're going to go minimalist, maybe even...

    awk '$0!=prev;{prev=$0}' source > target

Janis

#9
> If you just want to remove duplicates, you can do
> awk '!a[$0]++' file

Typical wizardry in awk.
Can someone please explain why/how this works?

Thanks,
Sashi

#10
On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smalladi@gmail.com> wrote:

>> If you just want to remove duplicates, you can do
>> awk '!a[$0]++' file
>
>Typical wizardry in awk.
>Can someone please explain why/how this works?

    awk '!($0 in a) {   # if not seen
        a[$0]++         # add $0 to seen list a[]
        print           # and print $0
    }' file

Grant.
--
http://bugsplatter.mine.nu/
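
To watch the one-liner's test evaluate line by line, here is a small tracing sketch. The function name `trace_seen`, the array name `seen`, and the output wording are assumptions for illustration. It relies on two standard awk behaviours: an unset array element counts as 0, and `x++` yields the value from before the increment, so `!seen[$0]++` is 1 (true) only the first time a given line appears.

```sh
# Hypothetical tracer: show the count before each line and the value of the test.
trace_seen() {
    awk '{ before = seen[$0] + 0    # occurrences so far; 0 for a new line
           keep = !seen[$0]++       # same expression as the one-liner
           print $0, "seen before:", before, "keep:", keep }' "$@"
}

printf '%s\n' 111 333 333 111 | trace_seen
# 111 seen before: 0 keep: 1
# 333 seen before: 0 keep: 1
# 333 seen before: 1 keep: 0
# 111 seen before: 1 keep: 0
```

In the original one-liner the expression stands alone as a pattern with no action, so awk applies its default action and prints exactly the lines where the value is 1.
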