| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#11
| |||
| |||
| On Aug 4, 2:14*pm, jzakiya <jzak...@mail.com> wrote: > Marcel > > I hope this answers your questions. > > >I had some problems understanding what (or better *why*) your code > >tried to do things. > >The way a prime sieve gets used, e.g. in the Euler Project, I normally > >allocate a 1 megabyte array, initialize it with boolean flags (true: > >the value corresponding with this index is a prime, where the integer > >to index code is "3 - 1 rshift"), and then use this boolean > >array for the rest of the program. The setting up time of the Sieve is > >truly insignificant compared to the run-time of the actual code that > >uses the Sieve to determine if a given integer is prime. > >(Of course, I uses Albert's van der Horst benchpin code to do the real > >searching, the sieve array is only to weed out the common candidates. > >If benchpin is not used, the Sieve might need to be hundredths of > >megabytes in size for Euler Project programs.) > > The Forth versions also set up a boolean SIEVE array set to FALSE. > Then for each prime generator (P3,P5,P7, etc) it sets to non-FALSE > only potential prime candidates, and the initial set of seed primes. > Fewer prime candidates are selected as the prime generators get > bigger, thus they are more efficient at selecting primes for large N. > > >An interesting improvement in this area would be when many more (than > >currently done) prime booleans could be compressed into the allocated > >array. Or in other words: a better compressor than the standard > >: int->index 3 - 1 rshift ; . > >However, this is not what your code does. > > If memory use efficiency is more important than speed you can do this. > The modulus value tells you the span of numbers examined by each pass > of the prime generators, and the residue values tells you the specific > values that can possibly be prime over that span. > > So for P5=30*k+(1,7,11,13,17,19,23,29) for each pass (k=0,1,2,3..n) up > to N 30 is added to (1,..29) each pass (0,30,60,90..), to identify > possible primes. These 30*k residue values can be a bit of a byte of > memory (lsb low value) starting at k=1 (k=0 gives you the residue, > which you don't have to store). > byte01 = 59,53,49,47,43,41,37,31 would be 1101111, (49 not prime) > byte02 = 89,83,79,77,73,71,67,61 would be 1110111, (77 not prime) > ....... etc > > For larger Pn's you'd get greater compression. P11 has a modulus of > 2310 and 480 residues. The 480 residues generate 480 possible primes > for each span of 2310 numbers. Thus, you would need only 480/8 = 60 > bytes to represent the primes over each span of 2310. So you would > need only about 60*(N/2310) bytes of memory to store prime > representatives up to a value of N. > > >I did not really understand why you called it the 'Ultimate' prime sieve.. > >It will be faster to store precomputed boolean flags in a diskfile and > >read them with mmap. Of course, one would only do that when the prime > >setup time was limiting the run time of the actual code (?) or when the > >size of the array is larger than what can be conveniently allocated. Your > >code ALLOTs an array, which for common Forths limits the Sieve size to > >some 64 MByte (tiny compared to main memory or disk space). > > The word "Ultimate" was an inside joke to myself, and a hyperbalistic > way to get people to scrutinize my math/methodology. As I said in my > paper, I started this because I was looking at ways to generate primes > to solve Project Euler problems. In my search for code I found ways to > do the Sieve of Eratosthenes, then the Sieve of Atkin, and wheel > sieves. So when I came up with my sieving > methodology (as compared to non-sieving approaches) I labeled it as > the "ultimate" prime sieve as a way of saying "here's another one" and > to draw attention to it. BUT, mathematically, the generators I've > identified, and the process to identify additional generators of my > form, CAN produce actual implementations that generate primes faster > than what now exists, especially when implemented in a parallel. So in > that sense they could be "ultimate" too. > > >After these deliberations I tried to run your code. As the point (probably) > >is to compare the Sieve setup time of primesP3, primesP4 and primesP7, > >I came up with a test rig which sets the sieve to its maximum size (num > >and lndx), erases it, and then runs the words. To have a basic reference, > > Thanks for benchmarking my code. *I've found Forth restricts the use > of my method because I can't do really big arrays (gforth lets me have > an array up to about 8 million, other forths much lower). Other > languages let you get into the billions, and even trillions, for array > sizes much easier. > > >I timed the way *benchpin *initializes its Sieve (453 ms). Furthermore, > >I tested the speed of your base word *primz? *which finds primes without > >using the sieve, and of the word *primz2? *which one would normally use > >to access the Sieve in practice as > > As Albert has stated, benchpin isn't a sieve, and doesn't generate the > primes (though it can be modified to to so), it only counts them up to > N. primz? is a proof-of-concept implementation of a deterministic > primality tester I did to show how the reverse prime generation > methodology can be used to check numbers for primality. It is NOT the > fastest general purpose way to deterministically check for primality, > though it is deterministic. > > FYI, here are the most current timings from someone who is > benchmarking my methods against the Sieve of Eratoshenes/Atkin. He's > using an overclocked 3.67 GHz Intel quadcore 9650 system, coding in C. > Times are in seconds. > > Case * * * * nPrime7x * nPrime7x * nPrime7x > * * * * * * * Atkin * * *Zakiya * * *Erat. > > 1 trillion * 459.97 * * 462.58 * * 527.03 > > 2 trillion * 943.66 * * 951.47 * *1065.09 > > 3 trillion *1465.52 * *1496.38 * *1636.63 > > So, he's got the Sieve of Atkin and my P7 generator able to generate > all the primes up to N=1,000,000,000,000 (1 Trillion) in about 7 min, > 40 secs. Impressive!! > > >I hope I understood the point of your work correctly. > >-marcel > > I hope this helps to explain my work more clearly. ;-) > > Jabari > > PS: Albert, I've run your benchpin.f code, as is, on 6 different > forths. For straight Linux, it runs on gforth, pfe, and kforth, and > using wine on Linux, it runs on Swiftforth, MPE, and Win32Forth. I received these new results for my P60 included in the above list. Atkin ZakiyaP7 ZakiyaP60 Erat. 1 trillion 459.97 462.58 463.08 527.03 2 trilion 943.66 951.47 929.88 1065.09 3 trillion 1465.52 1496.38 1431.38 1636.63 Again, I believe my higher order generators, P11 on up, should be able to be implemented to be even MUCH faster. Jabari |
|
#12
| |||
| |||
| In article <ad56b0a7-b7c4-4933-a267-cdfdf1039590@a1g2000hsb.googlegroups.com>, jzakiya <jzakiya@mail.com> wrote: <SNIP> > >Jabari > >PS: Albert, I've run your benchpin.f code, as is, on 6 different >forths. For straight Linux, it runs on gforth, pfe, and kforth, and >using wine on Linux, it runs on Swiftforth, MPE, and Win32Forth. As it is intended as an ISO benchmark, it would be truely embarassing to find an ISO Forth it, that doesn't run it :-). If you want to use gforth with even larger memory, use the -m option. "gforth --help" gives a lot of information about options. lina has a similar option, -g (grow). The lina64 can have sieves in the gigabytes, as does 64-bit gforth. Groetjes Albert -- -- Albert van der Horst, UTRECHT,THE NETHERLANDS Economic growth -- like all pyramid schemes -- ultimately falters. albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst |
|
#13
| |||
| |||
| On Aug 5, 1:42*pm, Albert van der Horst <alb...@spenarnc.xs4all.nl> wrote: > In article <ad56b0a7-b7c4-4933-a267-cdfdf1039...@a1g2000hsb.googlegroups.com>, > > jzakiya *<jzak...@mail.com> wrote: > > <SNIP> > > > > >Jabari > > >PS: Albert, I've run your benchpin.f code, as is, on 6 different > >forths. For straight Linux, it runs on gforth, pfe, and kforth, and > >using wine on Linux, it runs on Swiftforth, MPE, and Win32Forth. > > As it is intended as an ISO benchmark, it would be truely embarassing > to find an ISO Forth it, that doesn't run it :-). > > If you want to use gforth with even larger memory, use the -m option. > "gforth --help" *gives a lot of information about options. > > lina has a similar option, -g (grow). The lina64 can have > sieves in the gigabytes, as does 64-bit gforth. > > Groetjes Albert > > -- > -- > Albert van der Horst, UTRECHT,THE NETHERLANDS > Economic growth -- like all pyramid schemes -- ultimately falters. > albert@spe&ar&c.xs4all.nl &=nhttp://home.hccnet.nl/a.w.m.van.der.horst Hey Albert, Thanks for telling me how too increase gforth's memory capacity. Once I started gforth like this: gforth -m 1000M I was able do to much bigger arrays. Then using this gforth timing word: : SECS TIME&DATE 2DROP DROP 60 * + 60 * + ; I could do some rough timings. Here are some results of my P3, P5, and P7 prime generators up to the N values given (time in seconds). Examples benchmark process. num . 900000007 secs num primesp7 secs swap - . nprimes 61 46009215 I also then bechnmarked Albert's Pi generator as follows: secs num pi swap secs swap - . . 30 46009215 Here are results on my P4 2.8 GHz, 1GB laptop. num 500,000,007 700,000,007 900,000,007 1,000,000,007 ----------------------------------------------------------------- Pi(N) 26,355,868 36,252,932 46,009,215 50,847,535 ----------------------------------------------------------------- P3 48 68 91 98 ----------------------------------------------------------------- P5 38 55 71 80 ----------------------------------------------------------------- P7 34 47 61 69 ----------------------------------------------------------------- pi 16 23 30 33 Times (in secs) are the same for gforth and gforth-fast. This is pretty good for a single core/threaded implementation. Again, the ultimate performance is based on the generator function, and its implementation. I would really like to see a multi-threaded Forth version on a multi-core system to compare to the C versions. Jabari |
|
#14
| |||
| |||
| jzakiya <jzakiya@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) [..] > Here are results on my P4 2.8 GHz, 1GB laptop. > > num 500,000,007 700,000,007 900,000,007 1,000,000,007 > ----------------------------------------------------------------- > Pi(N) 26,355,868 36,252,932 46,009,215 50,847,535 > ----------------------------------------------------------------- > P3 48 68 91 98 > ----------------------------------------------------------------- > P5 38 55 71 80 > ----------------------------------------------------------------- > P7 34 47 61 69 > ----------------------------------------------------------------- > pi 16 23 30 33 > Times (in secs) are the same for gforth and gforth-fast. > This is pretty good for a single core/threaded implementation. iForth32 on 3 GHz PIV: FORTH> largetest sieve of 400000007 bytes. primesP3 : 12.317 seconds elapsed. (21336326 primes in sieve) primesP5 : 10.141 seconds elapsed. (21336326 primes in sieve) primesP7 : 9.023 seconds elapsed. (21336326 primes in sieve) ok FORTH> in ok FORTH> largetest sieve of 700000007 bytes. primesP3 : 22.357 seconds elapsed. (36252932 primes in sieve) primesP5 : 18.418 seconds elapsed. (36252932 primes in sieve) primesP7 : 16.438 seconds elapsed. (36252932 primes in sieve) ok FORTH> in ok FORTH> largetest sieve of 1000000007 bytes. primesP3 : 32.530 seconds elapsed. (50847535 primes in sieve) primesP5 : 26.776 seconds elapsed. (50847535 primes in sieve) primesP7 : 23.799 seconds elapsed. (50847535 primes in sieve) ok iForth64 on 3 GHz AMD X2: FORTH> largetest sieve of 1000000007 bytes. primesP3 : 17.585 seconds elapsed. (50847535 primes in sieve) primesP5 : 14.743 seconds elapsed. (50847535 primes in sieve) primesP7 : 13.303 seconds elapsed. (50847535 primes in sieve) ok -marcel |
|
#15
| |||
| |||
| On Aug 6, 2:01*pm, m...@iae.nl (Marcel Hendrix) wrote: > jzakiya <jzak...@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > [..] > > > Here are results on my P4 2.8 GHz, 1GB laptop. > > > num * * 500,000,007 * 700,000,007 * 900,000,007 * 1,000,000,007 > > ----------------------------------------------------------------- > > Pi(N) * *26,355,868 * *36,252,932 * *46,009,215 * * 50,847,535 > > ----------------------------------------------------------------- > > P3 * * * * * 48 * * * * * *68 * * * * **91 * * * * * *98 > > ----------------------------------------------------------------- > > P5 * * * * * 38 * * * * * *55 * * * * **71 * * * * * *80 > > ----------------------------------------------------------------- > > P7 * * * * * 34 * * * * * *47 * * * * **61 * * * * * *69 > > ----------------------------------------------------------------- > > pi * * * * * 16 * * * * * *23 * * * * **30 * * * * * *33 > > Times (in secs) are the same for gforth and gforth-fast. > > This is pretty good for a single core/threaded implementation. > > iForth32 on 3 GHz PIV: > > FORTH> largetest > sieve of 400000007 bytes. > primesP3 *: 12.317 seconds elapsed. (21336326 primes in sieve) > primesP5 *: 10.141 seconds elapsed. (21336326 primes in sieve) > primesP7 *: 9.023 seconds elapsed. (21336326 primes in sieve) ok > FORTH> in *ok > FORTH> largetest > sieve of 700000007 bytes. > primesP3 *: 22.357 seconds elapsed. (36252932 primes in sieve) > primesP5 *: 18.418 seconds elapsed. (36252932 primes in sieve) > primesP7 *: 16.438 seconds elapsed. (36252932 primes in sieve) ok > FORTH> in *ok > FORTH> largetest > sieve of 1000000007 bytes. > primesP3 *: 32.530 seconds elapsed. (50847535 primes in sieve) > primesP5 *: 26.776 seconds elapsed. (50847535 primes in sieve) > primesP7 *: 23.799 seconds elapsed. (50847535 primes in sieve) ok > > iForth64 on 3 GHz AMD X2: > > FORTH> largetest > sieve of 1000000007 bytes. > primesP3 *: 17.585 seconds elapsed. (50847535 primes in sieve) > primesP5 *: 14.743 seconds elapsed. (50847535 primes in sieve) > primesP7 *: 13.303 seconds elapsed. (50847535 primes in sieve) ok > > -marcel Uuuuh Marcal, you got better toys to play with than me. :-) Was the version on the AMD X2 a multi-threaded version that used both cores in parallel? Did you have to modify the code much? Do you have enough memory to do numbers into the trillions? If/when I ever code P11 in Forth (did it for Ruby & Python) I'll post it so you can try that too. Jabari |
|
#16
| |||
| |||
| jzakiya <jzakiya@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) [..] >> iForth64 on 3 GHz AMD X2: >> FORTH> largetest >> sieve of 1000000007 bytes. >> primesP3 : 17.585 seconds elapsed. (50847535 primes in sieve) >> primesP5 : 14.743 seconds elapsed. (50847535 primes in sieve) >> primesP7 : 13.303 seconds elapsed. (50847535 primes in sieve) ok > Uuuuh Marcal, you got better toys to play with than me. :-) > Was the version on the AMD X2 a multi-threaded version that used both > cores in parallel? Did you have to modify the code much? No, it is not multi-threaded yet. The inner loop looks straightforward enough, but I did not put in the time to analyze if the eliminations really run completely independently. > Do you have enough memory to do numbers into the trillions? No, of course not. But again, it looks as if you can Sieve in batches of 0.5 GByte with little overhead, especially if the threads for p1 .. p8 run in parallel. > If/when I ever code P11 in Forth (did it for Ruby & Python) I'll post > it so you can try that too. You can already predict from the above that it will be around 11 to 12 seconds. -marcel |
|
#17
| |||
| |||
| mhx@iae.nl (Marcel Hendrix) wrote Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > jzakiya <jzakiya@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) [..] >> Was the version on the AMD X2 a multi-threaded version that used both >> cores in parallel? Did you have to modify the code much? > No, it is not multi-threaded yet. The inner loop looks straightforward > enough, but I did not put in the time to analyze if the eliminations > really run completely independently. This is a pretty awful parallellization, but anyway. FORTH> largetest sieve of 1000000007 bytes. primesP3 : 18.222 seconds. (50,847,535 primes in sieve) primesP5 : 15.068 seconds. (50,847,535 primes in sieve) primesP5p : 14.064 seconds. (50,847,535 primes in sieve) primesP7 : 13.450 seconds. (50,847,535 primes in sieve) In primesP5p, one thread/CPU does the even values of I, the other thread/CPU does the odd ones. I felt that the most straightforward and fastest way would be to split the Sieve in 2 parts and assign each part to a thread. However, (1) I would need to understand exactly how your code works, (2) when I tried it with two equal (but different) half-size Sieves in parallel, the code was not very much faster than primesP5p: 2 * primesP5 : 12.899 seconds. (26,355,868 primes in sieve, 26,355,868 primes in sieve2) ok FWIW. -marcel |
|
#18
| |||
| |||
| On Aug 8, 5:18*pm, m...@iae.nl (Marcel Hendrix) wrote: > m...@iae.nl (Marcel Hendrix) wrote Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > > > > > jzakiya <jzak...@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve ofZakiya (SoZ) > [..] > >> Was the version on the AMD X2 a multi-threaded version that used both > >> cores in parallel? Did you have to modify the code much? > > No, it is not multi-threaded yet. The inner loop looks straightforward > > enough, but I did not put in the time to analyze if the eliminations > > really run completely independently. > > This is a pretty awful parallellization, but anyway. > > FORTH> largetest > sieve of 1000000007 bytes. > primesP3 * * : 18.222 seconds. (50,847,535 primes in sieve) > primesP5 * * : 15.068 seconds. (50,847,535 primes in sieve) > primesP5p * *: 14.064 seconds. (50,847,535 primes in sieve) > primesP7 * * : 13.450 seconds. (50,847,535 primes in sieve) > > In primesP5p, one thread/CPU does the even values of I, the other thread/CPU > does the odd ones. > > I felt that the most straightforward and fastest way would be to split > the Sieve in 2 parts and assign each part to a thread. However, (1) I > would need to understand exactly how your code works, (2) when I tried > it with two equal (but different) half-size Sieves in parallel, the code > was not very much faster than primesP5p: > > 2 * primesP5 : 12.899 seconds. (26,355,868 primes in sieve, 26,355,868 primes in sieve2) ok > > FWIW. > > -marcel Marcel, Here's my suggestion for parallellization. Here's an explanation of the math. The Sieve of Eratosthenes (SoE) is what I call a brute force sieve (bfs). It starts with 2 and then eliminates as prime candidates all multiples of 2. Then it takes the next integer not eliminated (3) and eliminates all multiples of that, and repeats the process to whatever number you want to find the primes up to. A bit more efficient version of the SoE starts at 3, and only eliminates odd number multiples up to the sqrt(N), since 2 is the only even prime. I employ number theory to a) identify a smaller pool of candidate primes and then b) do a sieve only on this pool of prime candidates. To identify the prime candidates for a given prime generator function, which consists of a modulus value and a list of prime residues (vectors), you just add multiples of the modulus to each residue up to the value N. Then to do the sieve, you multiply each residue value by the consecutive primes (starting with the prime determined by the generator function) up to the sqrt(N) and eliminate these values from the prime candidates pool. Whatever is left are the primes up to N. One beautiful thing about this method is that all the operations are independent. You can perform them in any order, as long as you do all of them. So to create the prime candidates each operation is of the form: (for all residues ni, modulus k, and array length lndx) begin ni < lndx while k ni + to ni 1 ni sieve c! repeat The sieve operations are of the form: (pi = p*ni; x = p*k, where p are the primes from Pn < p < sqrt(N)) begin pi < lndx while 0 pi sieve c! x pi + to pi repeat For example P7 has 48 residues. If you had 48 cores you could do the prime candidates generation, and then sieve operations, by just writing "1"s and then "0"s to a single sieve array, all in parallel. But since you have 2 cores (AMD X2) here's how I would do it, using P7 = 30*k+(1,7,11,13,17,19,23,29) as an example. To produce the prime candidates, just split the residues in half and perform them with each core. Core1: ni = (1,7,11,13); Core2: ni = (17,19,23,29); k =15 begin ni < lndx while k ni + to ni 1 ni sieve c! repeat But to perform the sieve, load balance the cores by partitioning the residues like this: Core1: ni = (7,11,29,31); Core2: ni = (13,17,19,23) I suggest this because the small numbers will take the most iterations to do, and the larger numbers the fewest. So take the 2 smallest and 2 largest and do them in one core, and the middle values in the 2nd core. This should make each core do about the same number of total operations. I hope this is clear, and makes sense. See if this will produce faster results. Jabari |
|
#19
| |||
| |||
| On Aug 11, 2:37*pm, jzakiya <jzak...@mail.com> wrote: > On Aug 8, 5:18*pm, m...@iae.nl (Marcel Hendrix) wrote: > > > > > m...@iae.nl (Marcel Hendrix) wrote Re: Ultimate Prime Sieve -- Sieve ofZakiya (SoZ) > > > > jzakiya <jzak...@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > > [..] > > >> Was the version on the AMD X2 a multi-threaded version that used both > > >> cores in parallel? Did you have to modify the code much? > > > No, it is not multi-threaded yet. The inner loop looks straightforward > > > enough, but I did not put in the time to analyze if the eliminations > > > really run completely independently. > > > This is a pretty awful parallellization, but anyway. > > > FORTH> largetest > > sieve of 1000000007 bytes. > > primesP3 * * : 18.222 seconds. (50,847,535 primes in sieve) > > primesP5 * * : 15.068 seconds. (50,847,535 primes in sieve) > > primesP5p * *: 14.064 seconds. (50,847,535 primes in sieve) > > primesP7 * * : 13.450 seconds. (50,847,535 primes in sieve) > > > In primesP5p, one thread/CPU does the even values of I, the other thread/CPU > > does the odd ones. > > > I felt that the most straightforward and fastest way would be to split > > the Sieve in 2 parts and assign each part to a thread. However, (1) I > > would need to understand exactly how your code works, (2) when I tried > > it with two equal (but different) half-size Sieves in parallel, the code > > was not very much faster than primesP5p: > > > 2 * primesP5 : 12.899 seconds. (26,355,868 primes in sieve, 26,355,868 primes in sieve2) ok > > > FWIW. > > > -marcel > > Marcel, > > Here's my suggestion for parallellization. > > Here's an explanation of the math. > > The Sieve of Eratosthenes (SoE) is what I call a brute force sieve > (bfs). It starts with 2 and then eliminates as prime candidates all > multiples of 2. Then it takes the next integer not eliminated (3) and > eliminates all multiples of that, and repeats the process to whatever > number you want to find the primes up to. *A bit more efficient > version of the SoE starts at 3, and only eliminates > odd number multiples up to the sqrt(N), since 2 is the only even > prime. > > I employ number theory to a) identify a smaller pool of candidate > primes and then b) do a sieve only on this pool of prime candidates. > > To identify the prime candidates for a given prime generator function, > which consists of a modulus value and a list of prime residues > (vectors), you just add multiples of the modulus to each residue up to > the value N. Then to do the sieve, you multiply each residue value by > the consecutive primes (starting with the prime > determined by the generator function) up to the sqrt(N) and eliminate > these values from the prime candidates pool. Whatever is left are the > primes up to N. > > One beautiful thing about this method is that all the operations are > independent. You can perform them in any order, as long as you do all > of them. > > So to create the prime candidates each operation is of the form: > > (for all residues ni, modulus k, and array length lndx) > > begin ni < lndx while *k ni + to ni *1 ni sieve c! repeat > > The sieve operations are of the form: > > (pi = p*ni; x = p*k, where p are the primes from Pn < p < sqrt(N)) > > begin pi < lndx while *0 pi sieve c! *x pi + to pi *repeat > > For example P7 has 48 residues. If you had 48 cores you could do > the prime candidates generation, and then sieve operations, by just > writing "1"s and then "0"s to a single sieve array, all in parallel. > > But since you have 2 cores (AMD X2) here's how I would do it, using > P7 = 30*k+(1,7,11,13,17,19,23,29) as an example. > > To produce the prime candidates, just split the residues in half and > perform them with each core. > > Core1: ni = (1,7,11,13); Core2: ni = (17,19,23,29); k =15 > > begin ni < lndx while *k ni + to ni *1 ni sieve c! repeat > > But to perform the sieve, load balance the cores by partitioning the > residues like this: > > Core1: ni = (7,11,29,31); Core2: ni = (13,17,19,23) > > I suggest this because the small numbers will take the most iterations > to do, and the larger numbers the fewest. So take the 2 smallest and 2 > largest and do them in one core, and the middle values in the 2nd > core. This should make each core do about the same number of total > operations. > > I hope this is clear, and makes sense. > See if this will produce faster results. > > Jabari CORRECTION: The example was for P5 NOT P7. Let me make sure I explain this more clearly. Using P5 as an example, to do the sieve part in parallel you'd do: num sqrt 3 - 2/ 1+ 2 do I sieve c@ if \ p1=7*i, p2=11*i, p3=13*i --- p7=29*i, p8=31*i, x=30*i \ j i 7i 11i 13i 17i 19i 23i 29i 31i 30i \ 7->2 23 37 44 58 65 79 100 107 105 \ p1 p2 p3 p4 p5 p6 p7 p8 x1 \ Compute p1--p8 and X1 for I \ Core1 threads begin p1 lndx < while 0 p1 sieve c! x1 p1 + to p1 repeat begin p2 lndx < while 0 p2 sieve c! x1 p2 + to p2 repeat begin p7 lndx < while 0 p7 sieve c! x1 p7 + to p7 repeat begin p8 lndx < while 0 p8 sieve c! x1 p8 + to p8 repeat \ Core2 threads begin p3 lndx < while 0 p3 sieve c! x1 p3 + to p3 repeat begin p4 lndx < while 0 p4 sieve c! x1 p4 + to p4 repeat begin p5 lndx < while 0 p5 sieve c! x1 p5 + to p5 repeat begin p6 lndx < while 0 p6 sieve c! x1 p6 + to p6 repeat then loop You could also have each thread be exactly the same and have them compute pi and x1 independently in each thread compute pi,x1 begin pi lndx < while ..... repeat then you'd just feed the residue, modulus, and I to each thread. When all the threads return for each I (primes Pn < p < sqrt(N)) then do the next I. Repeat for all I (prime factors). A supper parallel version would unroll the loop (I values) and do all these threads at once in parallel. All you need are simple cores too, since all you're doing in each thread is adding numbers and writing "0"s to sieve array (memory). I hope this is conceptually clearer. Jabari |
|
#20
| |||
| |||
| On Aug 11, 4:43*pm, jzakiya <jzak...@mail.com> wrote: > On Aug 11, 2:37*pm, jzakiya <jzak...@mail.com> wrote: > > > > > On Aug 8, 5:18*pm, m...@iae.nl (Marcel Hendrix) wrote: > > > > m...@iae.nl (Marcel Hendrix) wrote Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > > > > > jzakiya <jzak...@gmail.com> writes Re: Ultimate Prime Sieve -- Sieve of Zakiya (SoZ) > > > [..] > > > >> Was the version on the AMD X2 a multi-threaded version that used both > > > >> cores in parallel? Did you have to modify the code much? > > > > No, it is not multi-threaded yet. The inner loop looks straightforward > > > > enough, but I did not put in the time to analyze if the eliminations > > > > really run completely independently. > > > > This is a pretty awful parallellization, but anyway. > > > > FORTH> largetest > > > sieve of 1000000007 bytes. > > > primesP3 * * : 18.222 seconds. (50,847,535 primes in sieve) > > > primesP5 * * : 15.068 seconds. (50,847,535 primes in sieve) > > > primesP5p * *: 14.064 seconds. (50,847,535 primes in sieve) > > > primesP7 * * : 13.450 seconds. (50,847,535 primes in sieve) > > > > In primesP5p, one thread/CPU does the even values of I, the other thread/CPU > > > does the odd ones. > > > > I felt that the most straightforward and fastest way would be to split > > > the Sieve in 2 parts and assign each part to a thread. However, (1) I > > > would need to understand exactly how your code works, (2) when I tried > > > it with two equal (but different) half-size Sieves in parallel, the code > > > was not very much faster than primesP5p: > > > > 2 * primesP5 : 12.899 seconds. (26,355,868 primes in sieve, 26,355,868 primes in sieve2) ok > > > > FWIW. > > > > -marcel > > > Marcel, > > > Here's my suggestion for parallellization. > > > Here's an explanation of the math. > > > The Sieve of Eratosthenes (SoE) is what I call a brute force sieve > > (bfs). It starts with 2 and then eliminates as prime candidates all > > multiples of 2. Then it takes the next integer not eliminated (3) and > > eliminates all multiples of that, and repeats the process to whatever > > number you want to find the primes up to. *A bit more efficient > > version of the SoE starts at 3, and only eliminates > > odd number multiples up to the sqrt(N), since 2 is the only even > > prime. > > > I employ number theory to a) identify a smaller pool of candidate > > primes and then b) do a sieve only on this pool of prime candidates. > > > To identify the prime candidates for a given prime generator function, > > which consists of a modulus value and a list of prime residues > > (vectors), you just add multiples of the modulus to each residue up to > > the value N. Then to do the sieve, you multiply each residue value by > > the consecutive primes (starting with the prime > > determined by the generator function) up to the sqrt(N) and eliminate > > these values from the prime candidates pool. Whatever is left are the > > primes up to N. > > > One beautiful thing about this method is that all the operations are > > independent. You can perform them in any order, as long as you do all > > of them. > > > So to create the prime candidates each operation is of the form: > > > (for all residues ni, modulus k, and array length lndx) > > > begin ni < lndx while *k ni + to ni *1 ni sieve c! repeat > > > The sieve operations are of the form: > > > (pi = p*ni; x = p*k, where p are the primes from Pn < p < sqrt(N)) > > > begin pi < lndx while *0 pi sieve c! *x pi + to pi *repeat > > > For example P7 has 48 residues. If you had 48 cores you could do > > the prime candidates generation, and then sieve operations, by just > > writing "1"s and then "0"s to a single sieve array, all in parallel. > > > But since you have 2 cores (AMD X2) here's how I would do it, using > > P7 = 30*k+(1,7,11,13,17,19,23,29) as an example. > > > To produce the prime candidates, just split the residues in half and > > perform them with each core. > > > Core1: ni = (1,7,11,13); Core2: ni = (17,19,23,29); k =15 > > > begin ni < lndx while *k ni + to ni *1 ni sieve c! repeat > > > But to perform the sieve, load balance the cores by partitioning the > > residues like this: > > > Core1: ni = (7,11,29,31); Core2: ni = (13,17,19,23) > > > I suggest this because the small numbers will take the most iterations > > to do, and the larger numbers the fewest. So take the 2 smallest and 2 > > largest and do them in one core, and the middle values in the 2nd > > core. This should make each core do about the same number of total > > operations. > > > I hope this is clear, and makes sense. > > See if this will produce faster results. > > > Jabari > > CORRECTION: The example was for P5 NOT P7. > > Let me make sure I explain this more clearly. > > Using P5 as an example, to do the sieve part in parallel you'd do: > > * num sqrt 3 - 2/ 1+ 2 do > * * * *I sieve c@ if > * * * * * \ p1=7*i, p2=11*i, p3=13*i --- p7=29*i, p8=31*i, x=30*i > * * * * * \ j *i *7i *11i * 13i * 17i * 19i * 23i* 29i * 31i * 30i > * * * * * \ 7->2 *23 * 37 * *44 * *58 * *65 **79 * 100 * 107 * 105 > * * * * * \ * * * p1 * p2 * *p3 * *p4 * *p5* *p6 * *p7 * *p8 * *x1 > > * * * * * \ Compute p1--p8 and X1 for I > > * * * * * \ Core1 threads > * * * * * begin p1 lndx < while *0 p1 sieve c! *x1 p1 + to p1 *repeat > * * * * * begin p2 lndx < while *0 p2 sieve c! *x1 p2 + to p2 *repeat > * * * * * begin p7 lndx < while *0 p7 sieve c! *x1 p7 + to p7 *repeat > * * * * * begin p8 lndx < while *0 p8 sieve c! *x1 p8 + to p8 *repeat > > * * * * * \ Core2 threads > * * * * * begin p3 lndx < while *0 p3 sieve c! *x1 p3 + to p3 *repeat > * * * * * begin p4 lndx < while *0 p4 sieve c! *x1 p4 + to p4 *repeat > * * * * * begin p5 lndx < while *0 p5 sieve c! *x1 p5 + to p5 *repeat > * * * * * begin p6 lndx < while *0 p6 sieve c! *x1 p6 + to p6 *repeat > > * * * * then > * *loop > > You could also have each thread be exactly the same and have them > compute pi and x1 independently in each thread > > * * * * *compute pi,x1 *begin pi lndx < while ..... repeat > > then you'd just feed the residue, modulus, and I to each thread. > When all the threads return for each I (primes Pn < p < sqrt(N)) then > do the next I. Repeat for all I (prime factors). > > A supper parallel version would unroll the loop (I values) and do all > these threads at once in parallel. All you need are simple cores too, > since all you're doing in each thread is adding numbers and writing > "0"s to sieve array (memory). > > I hope this is conceptually clearer. > > Jabari A specific implementation may be something like this for each core \ Core 1 num sqrt 3 - 2/ 1+ 2 do I sieve c@ if \ Compute p1,p2,p7,p8 and X1 for I begin p8 lndx < while 0 p1 sieve c! 0 p2 sieve c! 0 p7 sieve c! 0 p8 sieve c! x1 p1 + to p1 x1 p2 + to p2 x1 p7 + to p7 x1 p8 + to p8 repeat p7 lndx < if 0 p7 sieve c! then p2 lndx < if 0 p2 sieve c! then p1 lndx < if 0 p1 sieve c! then then loop \ Core 2 num sqrt 3 - 2/ 1+ 2 do I sieve c@ if \ Compute p3,p4,p5,p6 and X1 for I begin p6 lndx < while 0 p3 sieve c! 0 p4 sieve c! 0 p5 sieve c! 0 p5 sieve c! x1 p3 + to p3 x1 p4 + to p4 x1 p5 + to p5 x1 p6 + to p6 repeat p5 lndx < if 0 p5 sieve c! then p4 lndx < if 0 p4 sieve c! then p3 lndx < if 0 p3 sieve c! then then loop Not knowing the AMD X2 architecture, these are generic concepts. I hope they give you some good ideas on how to partition the threading. Jabari |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.