Sorting a file with many already-sorted chunks

This is a discussion on Sorting a file with many already-sorted chunks within the Programming Languages forums in category; What is the best sorting algorithm if the file consists of many chunks of various sizes, each chunk is already ordered, but they are out of order? Example: 38013 38014 38015 33016 33017 33018 44034 44035 44036 44037 44038 44039 44040 44041 44042 44043 44044 44045 44046 44047 34892 34893 34894 34895 34896 34897 34898 34899 34900 34901 34902 34903 34904 34905 34906 38016 38017 38018 38019 38020 Here we have five already-ordered chunks, length 3, 3, 14, 15 and 5. I would think QuickSort is the best in such situation, but I am not sure....

Go Back   Application Development Forum > Programming Languages

Object Mix

Register FAQ Calendar Search Today's Posts Mark Forums Read
  #1  
Old 08-21-2008, 10:56 AM
ilya2@rcn.com
Guest
 
Default Sorting a file with many already-sorted chunks

What is the best sorting algorithm if the file consists of many chunks
of various sizes, each chunk is already ordered, but they are out of
order? Example:

38013
38014
38015
33016
33017
33018
44034
44035
44036
44037
44038
44039
44040
44041
44042
44043
44044
44045
44046
44047
34892
34893
34894
34895
34896
34897
34898
34899
34900
34901
34902
34903
34904
34905
34906
38016
38017
38018
38019
38020

Here we have five already-ordered chunks, length 3, 3, 14, 15 and 5. I
would think QuickSort is the best in such situation, but I am not sure.
Reply With Quote
  #2  
Old 08-21-2008, 11:04 AM
Richard Heathfield
Guest
 
Default Re: Sorting a file with many already-sorted chunks

ilya2@rcn.com said:

> What is the best sorting algorithm if the file consists of many chunks
> of various sizes, each chunk is already ordered, but they are out of
> order?


Expect improvements from others on this idea:

start a new group
while the read of the next datum succeeds
compare it to the previous datum (if there is one) in this group
if it's greater, add it to this group
otherwise, start a new group, and add this datum to the new group,
which becomes the "current" group, the "this" group
wend
merge the groups in the canonical way

--
Richard Heathfield <http://www.cpax.org.uk>
Email: -http://www. +rjh@
Google users: <http://www.cpax.org.uk/prg/writings/googly.php>
"Usenet is a strange place" - dmr 29 July 1999
Reply With Quote
  #3  
Old 08-21-2008, 12:13 PM
Juha Nieminen
Guest
 
Default Re: Sorting a file with many already-sorted chunks

ilya2@rcn.com wrote:
> What is the best sorting algorithm if the file consists of many chunks
> of various sizes, each chunk is already ordered, but they are out of
> order?


Read all the sorted chunks (detecting where a chunk ends and a new
starts is a trivial conditional) and then merge-sort all the chunks
together. (In other words, merge pairs of chunks, then merge pairs of
the resulting chunks, etc. until you have only one left.)
Reply With Quote
  #4  
Old 08-21-2008, 12:35 PM
stan
Guest
 
Default Re: Sorting a file with many already-sorted chunks

Richard Heathfield wrote:
> ilya2@rcn.com said:
>
>> What is the best sorting algorithm if the file consists of many chunks
>> of various sizes, each chunk is already ordered, but they are out of
>> order?

>
> Expect improvements from others on this idea:
>
> start a new group
> while the read of the next datum succeeds
> compare it to the previous datum (if there is one) in this group
> if it's greater, add it to this group
> otherwise, start a new group, and add this datum to the new group,
> which becomes the "current" group, the "this" group
> wend
> merge the groups in the canonical way


I think this application would require great care. The scan to
find where a new sorted group starts and ends will most likely cost more
than a simple straightforward sort. There are too many unknowns to give
much advice. How large is the file? How large and how many sorted groups
are there? Are any elements repeated?

This sure sounds like one of those cases where it could pay to make sure
you are solving the right problem before jumping on what appears to be
an obvious solution. On top of that this really seems like one of those
times where it will be real easy to generate an O(n^2) solution when a
standard library qsort or a standard mergesort would probably win
without even considering any partial sort of the input.

How about more information about the input?
Reply With Quote
  #5  
Old 08-21-2008, 01:54 PM
ilya2@rcn.com
Guest
 
Default Re: Sorting a file with many already-sorted chunks

On Aug 21, 12:13*pm, Juha Nieminen <nos...@thanks.invalid> wrote:
> il...@rcn.com wrote:
> > What is the best sorting algorithm if the file consists of many chunks
> > of various sizes, each chunk is already ordered, but they are out of
> > order?

>
> * Read all the sorted chunks (detecting where a chunk ends and a new
> starts is a trivial conditional) and then merge-sort all the chunks
> together. (In other words, merge pairs of chunks, then merge pairs of
> the resulting chunks, etc. until you have only one left.)


Thanks -- it worked quite well.
Reply With Quote
  #6  
Old 08-21-2008, 02:32 PM
Pascal J. Bourguignon
Guest
 
Default Re: Sorting a file with many already-sorted chunks

Juha Nieminen <nospam@thanks.invalid> writes:

> ilya2@rcn.com wrote:
>> What is the best sorting algorithm if the file consists of many chunks
>> of various sizes, each chunk is already ordered, but they are out of
>> order?

>
> Read all the sorted chunks (detecting where a chunk ends and a new
> starts is a trivial conditional) and then merge-sort all the chunks
> together. (In other words, merge pairs of chunks, then merge pairs of
> the resulting chunks, etc. until you have only one left.)


That's O(N^2).

You must do the merge of all chunks at once to reduce it to O(N).

As long as the number of chunks is not so big that at least one
element from each chunk can be held in memory, it can be done in O(N).

Read the file once, keeping in memory the start and end position of
each chunk, and the first element of each chunk.

Then scan all the chunks for the smallest element, and output it,
advancing the chunk (read the next element, incrementing the current
position for the chunk). Repeat until all chunks are empty.

--
__Pascal Bourguignon__ http://www.informatimago.com/

HANDLE WITH EXTREME CARE: This product contains minute electrically
charged particles moving at velocities in excess of five hundred
million miles per hour.
Reply With Quote
  #7  
Old 08-21-2008, 02:53 PM
Willem
Guest
 
Default Re: Sorting a file with many already-sorted chunks

Pascal J. Bourguignon wrote:
) Juha Nieminen <nospam@thanks.invalid> writes:
)> Read all the sorted chunks (detecting where a chunk ends and a new
)> starts is a trivial conditional) and then merge-sort all the chunks
)> together. (In other words, merge pairs of chunks, then merge pairs of
)> the resulting chunks, etc. until you have only one left.)
)
) That's O(N^2).

Nope, it's O(N log M) where M is the number of chunks.

) You must do the merge of all chunks at once to reduce it to O(N).

That would imply that it is O(1) to find the smallest element each time.
But it is actually O(log M) where M is the number of elements.

) As long as the number of chunks is not so big that at least one
) element from each chunk can be held in memory, it can be done in O(N).

Because you're using 'O(N)' unqualified, I assume you're talking about
time complexity. In which case, you're wrong, as shown above.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Reply With Quote
  #8  
Old 08-21-2008, 02:55 PM
Willem
Guest
 
Default Re: Sorting a file with many already-sorted chunks

stan wrote:
) Richard Heathfield wrote:
)> start a new group
)> while the read of the next datum succeeds
)> compare it to the previous datum (if there is one) in this group
)> if it's greater, add it to this group
)> otherwise, start a new group, and add this datum to the new group,
)> which becomes the "current" group, the "this" group
)> wend
)> merge the groups in the canonical way
)
) I think this application would require great care. The scan to
) find where a new sorted group starts and ends will most likely cost more
) than a simple straightforward sort.

Don't be silly. Just above, Richard explains very simply how to fin
d the start of each of the groups in O(N) time.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
Reply With Quote
  #9  
Old 08-21-2008, 03:16 PM
Richard Harter
Guest
 
Default Re: Sorting a file with many already-sorted chunks

On Thu, 21 Aug 2008 20:32:10 +0200, pjb@informatimago.com (Pascal
J. Bourguignon) wrote:

>Juha Nieminen <nospam@thanks.invalid> writes:
>
>> ilya2@rcn.com wrote:
>>> What is the best sorting algorithm if the file consists of many chunks
>>> of various sizes, each chunk is already ordered, but they are out of
>>> order?

>>
>> Read all the sorted chunks (detecting where a chunk ends and a new
>> starts is a trivial conditional) and then merge-sort all the chunks
>> together. (In other words, merge pairs of chunks, then merge pairs of
>> the resulting chunks, etc. until you have only one left.)

>
>That's O(N^2).


No, its O(N*log2(M)) where M is the number of chunks. It's
log2(M) passes with each pass being O(N).

>
>You must do the merge of all chunks at once to reduce it to O(N).


It's still O(N*log2(M)). You need a data structure to hold
headers for M chunks. Searching that data structure for the next
item to merge in is O(log2(M)).

>
>As long as the number of chunks is not so big that at least one
>element from each chunk can be held in memory, it can be done in O(N).
>
>Read the file once, keeping in memory the start and end position of
>each chunk, and the first element of each chunk.
>
>Then scan all the chunks for the smallest element, and output it,
>advancing the chunk (read the next element, incrementing the current
>position for the chunk). Repeat until all chunks are empty.



Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
Reply With Quote
  #10  
Old 08-21-2008, 03:23 PM
Richard Harter
Guest
 
Default Re: Sorting a file with many already-sorted chunks

On Thu, 21 Aug 2008 16:13:56 GMT, Juha Nieminen
<nospam@thanks.invalid> wrote:

>ilya2@rcn.com wrote:
>> What is the best sorting algorithm if the file consists of many chunks
>> of various sizes, each chunk is already ordered, but they are out of
>> order?

>
> Read all the sorted chunks (detecting where a chunk ends and a new
>starts is a trivial conditional) and then merge-sort all the chunks
>together. (In other words, merge pairs of chunks, then merge pairs of
>the resulting chunks, etc. until you have only one left.)


If the chunks are of significantly unequal size it is more
efficient to sort the chunks by size and then successively merge
the two smallest. The principle is the same as in Huffman
encoding.



Richard Harter, cri@tiac.net
http://home.tiac.net/~cri, http://www.varinoma.com
Save the Earth now!!
It's the only planet with chocolate.
Reply With Quote
Reply


Thread Tools
Display Modes


All times are GMT -5. The time now is 05:13 PM.


Powered by vBulletin® Version 3.7.2
Copyright ©2000 - 2008, Jelsoft Enterprises Ltd.
Search Engine Optimization by vBSEO 3.2.0
vB Ad Management by =RedTyger=

In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.