Partial sorting

This is a discussion on Partial sorting within the Forth forums in Programming Languages category; Hi all, What would be a good algorithm for finding the top 256 of a million items? I suppose the "top 256" list would remain sorted and any candidates greater than the minimum would be inserted into the list. Would a linear array be the best representation for this list? -Brad...

Go Back   Application Development Forum > Programming Languages > Forth

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 09-03-2008, 10:05 AM
Brad Eckert
Guest
 
Default Partial sorting

Hi all,

What would be a good algorithm for finding the top 256 of a million
items?

I suppose the "top 256" list would remain sorted and any candidates
greater than the minimum would be inserted into the list. Would a
linear array be the best representation for this list?

-Brad
Reply With Quote
  #2  
Old 09-03-2008, 01:20 PM
Icarus Sparry
Guest
 
Default Re: Partial sorting

On Wed, 03 Sep 2008 07:05:21 -0700, Brad Eckert wrote:

> Hi all,
>
> What would be a good algorithm for finding the top 256 of a million
> items?
>
> I suppose the "top 256" list would remain sorted and any candidates
> greater than the minimum would be inserted into the list. Would a linear
> array be the best representation for this list?
>
> -Brad


Tony Hoare has a good algorithm, which is problem 9 in chapter 11 of Jon
Bently's Programming Pearls 2nd edition. It does require that you can
permute the elements, but it has expected O(n) running time.

It is a varient of quicksort.

// precondition l<=k<=u
// postcondition x[l..k-1]<= x[k] <=x[k+1..u]
// So if called select1(0,n,k) this will get the top k elements into the
// first k elements of the array.
void select1(l, u, k)
{
if (l >= u)
return;
swap(l, randint(l, u));
t = x[l];
i = l;
j = u+1;
loop
do i++; while (i <= u && x[i] < t);
do j--; while (x[j] > t);
if (i > j)
break;
swap(i, j);

swap(l, j);
if (j < k)
select1(j+1, u, k);
else if (j > k)
select1(l, j-1, k);
}

To find out more, you want to look at "selection" algorithms.
Reply With Quote
  #3  
Old 09-03-2008, 01:23 PM
Marcel Hendrix
Guest
 
Default Re: Partial sorting

Brad Eckert <nospaambrad1@tinyboot.com> writes Re: Partial sorting

> What would be a good algorithm for finding the top 256 of a million items?


> I suppose the "top 256" list would remain sorted and any candidates
> greater than the minimum would be inserted into the list. Would a
> linear array be the best representation for this list?


I don't think the top 256 need to be sorted, as a just-in-time linear
search for the minimum might be more efficient. A CS type should be
able to spew Forth the relevant O(n) numbers right away.

-marcel


Reply With Quote
  #4  
Old 09-03-2008, 03:06 PM
Jonah Thomas
Guest
 
Default Re: Partial sorting

mhx@iae.nl (Marcel Hendrix) wrote:
> Brad Eckert <nospaambrad1@tinyboot.com> writes
>
> > What would be a good algorithm for finding the top 256 of a million
> > items?

>
> > I suppose the "top 256" list would remain sorted and any candidates
> > greater than the minimum would be inserted into the list. Would a
> > linear array be the best representation for this list?

>
> I don't think the top 256 need to be sorted, as a just-in-time linear
> search for the minimum might be more efficient. A CS type should be
> able to spew Forth the relevant O(n) numbers right away.


Yes. By the time you've checked the first 25000 records then the ones
you have will be around the top 1%. From then on you'll be replacing
items less than 1% of the time. So if you can do something to speed up
the case when you compare-and-discard it will probably have more effect
than speeding up the compare-and-replace.
Reply With Quote
  #5  
Old 09-03-2008, 03:34 PM
Albert van der Horst
Guest
 
Default Re: Partial sorting

In article <0e738bad-3556-447e-821f-56e9b4bf7ad2@59g2000hsb.googlegroups.com>,
Brad Eckert <nospaambrad1@tinyboot.com> wrote:
>Hi all,
>
>What would be a good algorithm for finding the top 256 of a million
>items?
>
>I suppose the "top 256" list would remain sorted and any candidates
>greater than the minimum would be inserted into the list. Would a
>linear array be the best representation for this list?


A heap would be better, allowing for 8 instead of 128 moves.
But your algorithm would be order of a million anyway, and
the "top 256" list would affect only a small factor.
Most of the time you would be skipping numbers that would
not beat the worst of the "top 256" list anyway.

If you have a good sorting algorithm available to just sort
the million items, you may discover that it is hard to beat
it, even if it does more than necessary.

>
>-Brad



--
--
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- like all pyramid schemes -- ultimately falters.
albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst

Reply With Quote
  #6  
Old 09-03-2008, 04:04 PM
Marcel Hendrix
Guest
 
Default Re: Partial sorting

Jonah Thomas <jethomas5@gmail.com> writes Re: Partial sorting

> mhx@iae.nl (Marcel Hendrix) wrote:
>> Brad Eckert <nospaambrad1@tinyboot.com> writes


>> > What would be a good algorithm for finding the top 256 of a million
>> > items?


>> > I suppose the "top 256" list would remain sorted and any candidates
>> > greater than the minimum would be inserted into the list. Would a
>> > linear array be the best representation for this list?


>> I don't think the top 256 need to be sorted, as a just-in-time linear
>> search for the minimum might be more efficient. A CS type should be
>> able to spew Forth the relevant O(n) numbers right away.


> Yes. By the time you've checked the first 25000 records then the ones
> you have will be around the top 1%. From then on you'll be replacing
> items less than 1% of the time. So if you can do something to speed up
> the case when you compare-and-discard it will probably have more effect
> than speeding up the compare-and-replace.


Let's see. Both schemes (a = keep sorted, b = search) can cache the
current minimum, so searching candidates is equally fast.

If scheme a keeps a sorted array, it will need 4 compares and about
128 memory moves to insert a new element. If it uses a linked list,
there would be no moves, but 128 compares and maybe 3 writes (are
there algorithms to do a binary search in a linked list?). I guess
the linked list will be faster than an array.

With scheme b about 128 compares are needed to find the element to
replace and 1 write to do it.

Unless it is possible to insert quickly in a linked list, I think
both schemes will be equally fast. It is trivially simple to
implement scheme b, so I'd chose that one (unless measurement
proves me wrong :-).

-marcel

Reply With Quote
  #7  
Old 09-03-2008, 07:09 PM
Jonah Thomas
Guest
 
Default Re: Partial sorting

mhx@iae.nl (Marcel Hendrix) wrote:

> Let's see. Both schemes (a = keep sorted, b = search) can cache the
> current minimum, so searching candidates is equally fast.


Yes, and if you can find any way to speed up that part it can
potentially make a big difference. It happens about a million times and
the rest happens much fewer.

> If scheme a keeps a sorted array, it will need 4 compares and about
> 128 memory moves to insert a new element.


Are you figuring a maximum of 8 compares and so 4 compares on average?
But half the time it takes 8 compares, 1/4 the time it takes 7, 1/8 the
time it takes 6, etc.

> If it uses a linked list,
> there would be no moves, but 128 compares and maybe 3 writes (are
> there algorithms to do a binary search in a linked list?).


You can build a binary tree with linked lists but then it isn't just a
linked list -- you have to rebuild the tree.

3 writes? Because you have the list ordered. That looks like a good
trade, 2 extra writes to avoid on average 128 compares.

> I guess the linked list will be faster than an array.


How fast can you do a block move? It will be a primitive, written in
assembly or C, using whatever tricks the hardware can help you with? But
128 compares on a linked list will be done in high-level code, which
your system might optimise for you. It will take at least two reads each
time, one to traverse the link and the other to read the data. (Maybe 2@
is fast?) But then it will take a lot of fiddling to
do a binary search on the array. You don't have to rebalance an array
that gets treated as a tree, but the fiddling will slow it down.

I'm not ready to guess which is faster, I'd have to measure it.

> With scheme b about 128 compares are needed to find the element to
> replace and 1 write to do it.


If the array isn't sorted, you can replace the current minimum, and then
won't you need 255 compares to find the new minimum? I may have missed
what you're doing.

> Unless it is possible to insert quickly in a linked list, I think
> both schemes will be equally fast. It is trivially simple to
> implement scheme b, so I'd chose that one (unless measurement
> proves me wrong :-).


It looks to me like you avoid the overhead of the linked list, at the
expense of doubling the search. You search the ordered linked list to
find where to insert your new item, and on average it should take less
than half the time because the larger the numbers are, the rarer they
are. So the chance that a new number is in the upper half of the list
should be less than the chance that it's in the lower half.

The overhead of the linked list is that to traverse it you need an extra
read instead of an increment, and to link in you need two extra writes.
I'm assuming you won't have cache problems with a linked list with only
256 items. Hard to believe that's more important than searching the
whole array each time.

But I still think the 99.9+% of the time that the comparison fails and
nothing much happens will take so much time that the differences among
the successes will be lessened. A simple approach that's only a little
bit slower might be the best.
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 06:03 AM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.