read ahead or before

  #1  
Old 07-26-2008, 03:02 PM
Mag Gam
Guest
 
Default read ahead or before

I have been trying to do this instead of placing everything in a hash/
array and compare in the END block.

For example, if I have a file like this

111
2222
333
333
4445
3434

Notice there is a duplicate "333". How can I test if the next line is
the same as the current line? I suppose I can use getline() but is
there another clever way of achieving this?

Also, how can I check for previous line?

TIA
  #2  
Old 07-26-2008, 03:14 PM
pk
Guest
 
Default Re: read ahead or before

On Saturday 26 July 2008 21:02, Mag Gam wrote:

> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is
> there another clever way of achieving this?


I don't know if that can be considered more clever, however you can just
save the value of the previous line:

awk '{if ($0==prev) {
  # ... this line is the same as previous line
}
prev=$0}' file
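
A minimal sketch of the saved-previous-line idea, filled in so that it does something visible: it reports each line that repeats the one directly before it, together with its line number. The function name `adjacent_dups` and the output format are made up for illustration, and the `NR > 1` guard is an addition so that an empty first line is not compared against the still-unset `prev`.

```shell
# adjacent_dups is a hypothetical helper name, not part of awk.
adjacent_dups() {
  awk 'NR > 1 && $0 == prev { print NR ": " $0 }   # same as the line before
       { prev = $0 }                                # remember for the next record
      ' "$@"
}

# Sample data from the original post, fed on stdin.
printf '111\n2222\n333\n333\n4445\n3434\n' | adjacent_dups
```

Only one line is held in memory at any time, so the size of the input file should not matter.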

What are you trying to do? What's the underlying problem?

If you just want to remove duplicates, you can do

awk '!a[$0]++' file

> Also, how can I check for previous line?


See above.
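
For the "next line" half of the question, one way to avoid getline is to delay processing by one record: when record N+1 arrives, record N is still in a variable and its successor is now known. A rough sketch, assuming only one line of lookahead is needed; the name `mark_next_same`, the variable `held` and the trailing tag are invented for illustration.

```shell
# mark_next_same is a hypothetical helper name.
mark_next_same() {
  awk 'NR > 1 {
         # held is the previous record; $0 is its successor
         suffix = ($0 == held) ? "  <- next line is the same" : ""
         print held suffix
       }
       { held = $0 }                    # keep the current record for one more step
       END { if (NR > 0) print held }   # the last record has no successor
      ' "$@"
}

printf '111\n333\n333\n4445\n' | mark_next_same
```

The END rule is needed because the last record is still waiting in `held` when the input runs out.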

--
All the commands are tested with bash and GNU tools, so they may use
nonstandard features. I try to mention when something is nonstandard (if
I'm aware of that), but I may miss something. Corrections are welcome.
  #3  
Old 07-27-2008, 10:05 AM
Mag Gam
Guest
 
Default Re: read ahead or before

Thanks for the response.

The underlying problem is, the file is huge; it's close to 15g and I
would like to compare.

What I am trying to do is compare the current line to the next line
(or vice versa, the 2nd line to the 1st).

With the hash solution, I was able to get the answer. However my
sysadmin is complaining I am taking up too much memory.



  #4  
Old 07-27-2008, 12:11 PM
Janis Papanagnou
Guest
 
Default Re: read ahead or before

Mag Gam wrote:
> Thanks for the response.


[Please don't top-post!]

>
> The underlying problem is, the file is huge; it's close to 15g and I
> would like to compare.


(Never measured files in gram, so I can't help you here.)

>
> What I am trying to do is compare the current line to the next line
> (or vice versa, the 2nd line to the 1st).


Have you tried pk's proposal? - Which solves what you've asked for.

>
> With the hash solution, I was able to get the answer. However my
> sysadmin is complaining I am taking up too much memory.


You already told us that your own hash solution doesn't fit your
needs. So just use pk's solution. What's the problem?

Janis

  #5  
Old 07-27-2008, 01:48 PM
loki harfagr
Guest
 
Default Re: read ahead or before

On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:

> Mag Gam wrote:
>> Thanks for the response.

>
> [Please don't top-post!]
>
>
>> The underlying problem is, the file is huge; it's close to 15g and I
>> would like to compare.

>
> (Never measured files in gram, so I can't help you here.)


Ah Janis, the poor OP wasn't meaning grams but gravitational levels
and under 15 g that's certainly difficult to cure any file ~;O)

>> What I am trying to do is compare the current line to the next line
>> (or vice versa, the 2nd line to the 1st).

>
> Have you tried pk's proposal? - Which solves what you've asked for.
>
>
>> With the hash solution, I was able to get the answer. However my
>> sysadmin is complaining I am taking up too much memory.

>
> You already told us that your own hash solution doesn't fit your needs.
> So just use pk's solution. What's the problem?


I suspect pk's solution (though very good) may, in the OP case,
still consume a lot of memory in the a[] buffer if by 'chance'
the input overgravitated file has a lot of different lines ;-)

If that's the point, I propose here a possible way to
drastically reduce the memory usage, certainly not the
golf contest winner of the month but quite close to list
in obfuscating style samples ;D)
Anyway:

$ awk '{n++;n%=1;a[n]=a[n+1];a[n+1]=$0;if(a[n+1]==a[n]){print "Mind the gap";next}}1'

That way, if the OP's sysadmin still has a problem with memory usage, that leaves us
with a few hypotheses: the server might need an upgrade from a Z80-MSX-16KB to
power machines like an AtariST520 or even a PDP-20, or the sysadmin has to
be seen as human and may need some vacation time (like the week I
just had ;-), or maybe the file has extremely looong records...

(to OP: Replace the ``Londoner'' message by whatever you need, other msg or action)
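
Unobfuscated, the one-liner above reduces to a two-slot buffer: `n++;n%=1` always leaves `n` at 0, so `a[0]` holds the previous line and `a[1]` the current one. A sketch of what appears to be the same logic with plain variables; the names `prev` and `cur` and the wrapper `mind_the_gap` are illustrative only.

```shell
# mind_the_gap is a hypothetical wrapper name.
mind_the_gap() {
  awk '{ prev = cur; cur = $0 }                     # shift the two-line window
       cur == prev { print "Mind the gap"; next }   # repeat: the message replaces the line
       { print }                                    # otherwise pass the line through
      ' "$@"
}

printf '111\n333\n333\n4445\n' | mind_the_gap
```

Memory use stays at two lines regardless of how many distinct lines the file contains.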


--
have space suit : "VMSBUX:B0N1@GOHH.GO"
will travel : tr "MLKJHGFDSQNBVCXWPOIUYTREZA" "a-z"
  #6  
Old 07-27-2008, 04:01 PM
Ted Davis
Guest
 
Default Re: read ahead or before

On Sat, 26 Jul 2008 12:02:48 -0700, Mag Gam wrote:

> I have been trying to do this instead of placing everything in a hash/
> array and compare in the END block.
>
> For example, if I have a file like this
>
> 111
> 2222
> 333
> 333
> 4445
> 3434
>
> Notice there is a duplicate "333". How can I test if the next line is
> the same as the current line? I suppose I can use getline() but is there
> another clever way of achieving this?
>
> Also, how can I check for previous line?


Functionally, this is the same as PK's suggestion, it's just written out
in a fuller (C-like), and hopefully, clearer, form - since you didn't say
what you want to do with the lines after suppressing adjacent duplicates,
I wrote it to print the non-duplicate lines as it encounters them. This
should not be sensitive to the file size because it stores only one line
at a time.

{
if( $0 != Prev ) print $0
Prev = $0
}

In minimalist awk format, that's
$0 != Prev {print}
{Prev = $0}


As a command line program that could be (minimalist format)

awk '$0!=Prev{print}{Prev=$0}' source > target

(tested under Fedora and XP (as a script file - all variations tested
under Linux) with your sample data)
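
As a quick check, here is roughly what that filter does with the sample data from the original post. The wrapper name `squeeze_adjacent` is made up, and the `NR == 1 ||` guard is an addition (it is not in the command above) so that an empty first line is still printed. For plain removal of adjacent duplicates, the standard `uniq` utility should give the same result.

```shell
# squeeze_adjacent is a hypothetical wrapper name.
squeeze_adjacent() {
  awk 'NR == 1 || $0 != prev { print }   # first line, or differs from the one before
       { prev = $0 }                     # only this one line is kept in memory
      ' "$@"
}

# Prints 111, 2222, 333, 4445, 3434 (the second 333 is dropped).
printf '111\n2222\n333\n333\n4445\n3434\n' | squeeze_adjacent
```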

BTW, "gigabytes" is usually abbreviated GB (Gb would be "gigabits").
Abbreviations for SI prefixes for units larger than kilo are all upper
case - all those smaller than mega are in lower case - the full prefixes
are in lower case unless the language requires initial capitals (k and K
have an unofficial byte/bit context usage: k = 1000; K = 1024).

--

T.E.D. (tdavis@mst.edu) MST (Missouri University of Science and Technology)
used to be UMR (University of Missouri - Rolla).
  #7  
Old 07-27-2008, 05:08 PM
Janis Papanagnou
Guest
 
Default Re: read ahead or before

loki harfagr wrote:
> On Sun, 27 Jul 2008 18:11:17 +0200, Janis Papanagnou wrote:
>
>
>>Mag Gam wrote:
>>
>>>Thanks for the response.

>>
>>[Please don't top-post!]
>>
>>
>>
>>>The underlying problem is, the file is huge; it's close to 15g and I
>>>would like to compare.

>>
>>(Never measured files in gram, so I can't help you here.)

>
>
> Ah Janis, the poor OP wasn't meaning grams but gravitational levels
> and under 15 g that's certainly difficult to cure any file ~;O)


:-) Frankly, I wasn't sure whether he could have meant gravity ;-)

>
>
>>>What I am trying to do is compare the current line to the next line
>>>(or vice versa, the 2nd line to the 1st).

>>
>>Have you tried pk's proposal? - Which solves what you've asked for.
>>
>>
>>
>>>With the hash solution, I was able to get the answer. However my
>>>sysadmin is complaining I am taking up too much memory.

>>
>>You already told us that your own hash solution doesn't fit your needs.
>>So just use pk's solution. What's the problem?

>
>
> I suspect pk's solution (though very good) may, in the OP case,
> still consume a lot of memory in the a[] buffer if by 'chance'
> the input overgravitated file has a lot of different lines ;-)


Oh, I meant his first proposal, the one without a[]...

awk '{if ($0==prev) {
  # ... this line is the same as previous line
}
prev=$0}' file


Janis

  #8  
Old 07-27-2008, 05:40 PM
Janis Papanagnou
Guest
 
Default Re: read ahead or before

Ted Davis wrote:
> On Sat, 26 Jul 2008 12:02:48 -0700, Mag Gam wrote:
>
>
>>I have been trying to do this instead of placing everything in a hash/
>>array and compare in the END block.
>>
>>For example, if I have a file like this
>>
>>111
>>2222
>>333
>>333
>>4445
>>3434
>>
>>Notice there is a duplicate "333". How can I test if the next line is
>>the same as the current line? I suppose I can use getline() but is there
>>another clever way of achieving this?
>>
>>Also, how can I check for previous line?

>
>
> Functionally, this is the same as PK's suggestion, it's just written out
> in a fuller (C-like), and hopefully, clearer, form - since you didn't say
> what you want to do with the lines after suppressing adjacent duplicates,
> I wrote it to print the non-duplicate lines as it encounters them. This
> should not be sensitive to the file size because it stores only one line
> at a time.
>
> {
> if( $0 != Prev ) print $0
> Prev = $0
> }
>
> In minimalist awk format, that's
> $0 != Prev {print}
> {Prev = $0}
>
>
> As a command line program that could be (minimalist format)
>
> awk '$0!=Prev{print}{Prev=$0}' source > target


If we're going to go minimalist, maybe even...

awk '$0!=prev;{prev=$0}' source > target


Janis

  #9  
Old 07-30-2008, 09:31 PM
Sashi
Guest
 
Default Re: read ahead or before

> If you just want to remove duplicates, you can do
> awk '!a[$0]++' file


Typical wizardry in awk.
Can someone please explain why/how this works?

Thanks,
Sashi

  #10  
Old 07-30-2008, 10:18 PM
Grant
Guest
 
Default Re: read ahead or before

On Wed, 30 Jul 2008 18:31:41 -0700 (PDT), Sashi <smalladi@gmail.com> wrote:

>> If you just want to remove duplicates, you can do
>> awk '!a[$0]++' file

>
>Typical wizardry in awk.
>Can someone please explain why/how this works?


awk '!($0 in a) { # if not seen
a[$0]++ # add $0 to seen list a[]
print # and print $0
}' file
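
Taken apart step by step, `!a[$0]++` is a pattern with no action, so awk prints the record whenever the pattern is true. `a[$0]++` yields the count from before the increment (0 the first time a line is seen), and `!` turns that 0 into true. A sketch with the steps separated; `seen`, `count` and `first_seen_only` are illustrative names. Note that the array keeps one entry per distinct line, which is where the memory goes on a very large file.

```shell
# first_seen_only is a hypothetical wrapper name.
first_seen_only() {
  awk '{
         count = seen[$0]       # how often this line appeared before (0 if never)
         seen[$0] = count + 1   # the post-increment part of a[$0]++
         if (count == 0)        # the "!" part: true only on first sight
           print
       }' "$@"
}

# Prints 333, 111, 4445: later repeats are dropped even when not adjacent.
printf '333\n111\n333\n111\n4445\n' | first_seen_only
```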

Grant.
--
http://bugsplatter.mine.nu/