| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| Hi all, What would be a good algorithm for finding the top 256 of a million items? I suppose the "top 256" list would remain sorted and any candidates greater than the minimum would be inserted into the list. Would a linear array be the best representation for this list? -Brad |
|
#2
| |||
| |||
| On Wed, 03 Sep 2008 07:05:21 -0700, Brad Eckert wrote: > Hi all, > > What would be a good algorithm for finding the top 256 of a million > items? > > I suppose the "top 256" list would remain sorted and any candidates > greater than the minimum would be inserted into the list. Would a linear > array be the best representation for this list? > > -Brad Tony Hoare has a good algorithm, which is problem 9 in chapter 11 of Jon Bently's Programming Pearls 2nd edition. It does require that you can permute the elements, but it has expected O(n) running time. It is a varient of quicksort. // precondition l<=k<=u // postcondition x[l..k-1]<= x[k] <=x[k+1..u] // So if called select1(0,n,k) this will get the top k elements into the // first k elements of the array. void select1(l, u, k) { if (l >= u) return; swap(l, randint(l, u)); t = x[l]; i = l; j = u+1; loop do i++; while (i <= u && x[i] < t); do j--; while (x[j] > t); if (i > j) break; swap(i, j); swap(l, j); if (j < k) select1(j+1, u, k); else if (j > k) select1(l, j-1, k); } To find out more, you want to look at "selection" algorithms. |
|
#3
| |||
| |||
| Brad Eckert <nospaambrad1@tinyboot.com> writes Re: Partial sorting > What would be a good algorithm for finding the top 256 of a million items? > I suppose the "top 256" list would remain sorted and any candidates > greater than the minimum would be inserted into the list. Would a > linear array be the best representation for this list? I don't think the top 256 need to be sorted, as a just-in-time linear search for the minimum might be more efficient. A CS type should be able to spew Forth the relevant O(n) numbers right away. -marcel |
|
#4
| |||
| |||
| mhx@iae.nl (Marcel Hendrix) wrote: > Brad Eckert <nospaambrad1@tinyboot.com> writes > > > What would be a good algorithm for finding the top 256 of a million > > items? > > > I suppose the "top 256" list would remain sorted and any candidates > > greater than the minimum would be inserted into the list. Would a > > linear array be the best representation for this list? > > I don't think the top 256 need to be sorted, as a just-in-time linear > search for the minimum might be more efficient. A CS type should be > able to spew Forth the relevant O(n) numbers right away. Yes. By the time you've checked the first 25000 records then the ones you have will be around the top 1%. From then on you'll be replacing items less than 1% of the time. So if you can do something to speed up the case when you compare-and-discard it will probably have more effect than speeding up the compare-and-replace. |
|
#5
| |||
| |||
| In article <0e738bad-3556-447e-821f-56e9b4bf7ad2@59g2000hsb.googlegroups.com>, Brad Eckert <nospaambrad1@tinyboot.com> wrote: >Hi all, > >What would be a good algorithm for finding the top 256 of a million >items? > >I suppose the "top 256" list would remain sorted and any candidates >greater than the minimum would be inserted into the list. Would a >linear array be the best representation for this list? A heap would be better, allowing for 8 instead of 128 moves. But your algorithm would be order of a million anyway, and the "top 256" list would affect only a small factor. Most of the time you would be skipping numbers that would not beat the worst of the "top 256" list anyway. If you have a good sorting algorithm available to just sort the million items, you may discover that it is hard to beat it, even if it does more than necessary. > >-Brad -- -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- like all pyramid schemes -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst |
|
#6
| |||
| |||
| Jonah Thomas <jethomas5@gmail.com> writes Re: Partial sorting > mhx@iae.nl (Marcel Hendrix) wrote: >> Brad Eckert <nospaambrad1@tinyboot.com> writes >> > What would be a good algorithm for finding the top 256 of a million >> > items? >> > I suppose the "top 256" list would remain sorted and any candidates >> > greater than the minimum would be inserted into the list. Would a >> > linear array be the best representation for this list? >> I don't think the top 256 need to be sorted, as a just-in-time linear >> search for the minimum might be more efficient. A CS type should be >> able to spew Forth the relevant O(n) numbers right away. > Yes. By the time you've checked the first 25000 records then the ones > you have will be around the top 1%. From then on you'll be replacing > items less than 1% of the time. So if you can do something to speed up > the case when you compare-and-discard it will probably have more effect > than speeding up the compare-and-replace. Let's see. Both schemes (a = keep sorted, b = search) can cache the current minimum, so searching candidates is equally fast. If scheme a keeps a sorted array, it will need 4 compares and about 128 memory moves to insert a new element. If it uses a linked list, there would be no moves, but 128 compares and maybe 3 writes (are there algorithms to do a binary search in a linked list?). I guess the linked list will be faster than an array. With scheme b about 128 compares are needed to find the element to replace and 1 write to do it. Unless it is possible to insert quickly in a linked list, I think both schemes will be equally fast. It is trivially simple to implement scheme b, so I'd chose that one (unless measurement proves me wrong :-). -marcel |
|
#7
| |||
| |||
| mhx@iae.nl (Marcel Hendrix) wrote: > Let's see. Both schemes (a = keep sorted, b = search) can cache the > current minimum, so searching candidates is equally fast. Yes, and if you can find any way to speed up that part it can potentially make a big difference. It happens about a million times and the rest happens much fewer. > If scheme a keeps a sorted array, it will need 4 compares and about > 128 memory moves to insert a new element. Are you figuring a maximum of 8 compares and so 4 compares on average? But half the time it takes 8 compares, 1/4 the time it takes 7, 1/8 the time it takes 6, etc. > If it uses a linked list, > there would be no moves, but 128 compares and maybe 3 writes (are > there algorithms to do a binary search in a linked list?). You can build a binary tree with linked lists but then it isn't just a linked list -- you have to rebuild the tree. 3 writes? Because you have the list ordered. That looks like a good trade, 2 extra writes to avoid on average 128 compares. > I guess the linked list will be faster than an array. How fast can you do a block move? It will be a primitive, written in assembly or C, using whatever tricks the hardware can help you with? But 128 compares on a linked list will be done in high-level code, which your system might optimise for you. It will take at least two reads each time, one to traverse the link and the other to read the data. (Maybe 2@ is fast?) But then it will take a lot of fiddling to do a binary search on the array. You don't have to rebalance an array that gets treated as a tree, but the fiddling will slow it down. I'm not ready to guess which is faster, I'd have to measure it. > With scheme b about 128 compares are needed to find the element to > replace and 1 write to do it. If the array isn't sorted, you can replace the current minimum, and then won't you need 255 compares to find the new minimum? I may have missed what you're doing. > Unless it is possible to insert quickly in a linked list, I think > both schemes will be equally fast. It is trivially simple to > implement scheme b, so I'd chose that one (unless measurement > proves me wrong :-). It looks to me like you avoid the overhead of the linked list, at the expense of doubling the search. You search the ordered linked list to find where to insert your new item, and on average it should take less than half the time because the larger the numbers are, the rarer they are. So the chance that a new number is in the upper half of the list should be less than the chance that it's in the lower half. The overhead of the linked list is that to traverse it you need an extra read instead of an increment, and to link in you need two extra writes. I'm assuming you won't have cache problems with a linked list with only 256 items. Hard to believe that's more important than searching the whole array each time. But I still think the 99.9+% of the time that the comparison fails and nothing much happens will take so much time that the differences among the successes will be lessened. A simple approach that's only a little bit slower might be the best. |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.