| Register | FAQ | Calendar | Search | Today's Posts | Mark Forums Read |
|
#1
| |||
| |||
| I have to write to disk some rather large matrices (N=10000 or more) of double precision numbers. The output file is opened with FORM='UNFORMATTED'. At the moment the writing is done passing the matrix to a subroutine which treats the matrix as large vector and writes it down as such. In its minimal form it's something like: !========================================= program myprogram real(dp), dimension(Nmax,Nmax) :: matrix .... call write_vector( matrix , size(matrix), iunit) .... end program myprogram !=========================== subroutine write_vector( vector , ilength, iunit) real(dp), dimension(ilength) :: vector write(iunit) vector end subroutine write_vector !========================================= This way of doing things has two advantages (as far as I understand): 1) it's quite fast 2) it minimizes the number of record-length descriptor written down (and hence, the file size) My problem is that, if the total size of the matrix as given by size(matrix) is greater than 2**31 I get several kinds of problems, one of which being that the record-length descriptor overflows. I think I have two ways of dealing with the situation. The first way is using the STREAM (i.e., pure binary) format. This would solve the record-length overflow problem (NB: I am using the Intel compiler v9.1 under Linux 32bit or Linux 64bit). Note that on Linux64 I can specify the use of large integers but the record-length descriptor always is a 32-bit integer. The second way is keeping the UNFORMATTED format but splitting up the writing process. For several reasons (going to STREAM would imply changing too many other things), I'm trying to pursue this strategy. The first thing I tried was to write down the matrix column-by-column using a piece of code of the kind !========================================= ilength = size(matrix, 1) do i=1, Nmax call write_vector( matrix(:,i) , ilength, iunit) enddo !========================================= This implies writing an overhead of Nmax record-descriptors, which however is negligible for big runs. Now, from some benchmarks I ran (see later) ***it seems*** that the "column-by-column" (CBC) way is always slower, and sometimes very much so, than the "all at once" (AAO) one. Now my questions are: Is there any reason to expect CBC-writing to be much slower than AAO? Is this a too much system-dependent thing to say anything at all? Are there better ways of doing what I need? Perhaps using pointers? Any suggestion would be welcome and thank you for your attention. Lorenzo DETAILS ON THE BENCHMARKS: I did some tests writing to disk a matrix of doubleprecision random numbers using the two CBC and AAO formats. Tests were done under Linux32 on a Pentium IV 3GHz with 3GiB of ram and a 50GB disk (I don't know if IDE or SATA). The seed for the random number generator was taken from /dev/random and so was different at each run. The benchmark program was structured as follows: 1)build the matrix, 2) write it to disk with AAO 3) write it to disk with CBC The program was compiled with no optimizations(-O0 ) and intel compiler v9.1. Times were clocked using the function SYSTEM_CLOCK I repeated each run for three times to see if the results were similar. Here are some representative results (times in seconds, N is the matrix size). N , AAO , CBC , FileSize 5000, 0.743s , 0.885s , 200MB 8000, 2.110s , 15.870s , 512MB 10000, 3.245s , 41.230s , 800MB 15000, 97.723s ,113.670s , 1800MB Unfortunately, I'm afraid that various kinds of hardware or software disk cache mechanisms mask the real performances. For example, I can hardly believe in the N=10000 case that my disk can really write 800MB in 3.245 seconds (it must be some kind of cache/buffer effect...). Until about N=7000 AAO and CBC are about equally fast, then there is an interval where AAO is much faster, and then again their performance seems to even out. I also did some tests at N=10000 using full compiler optimization (-O3) but results didn't seem to change appreciably from the no-optimization case. |
|
#2
| |||
| |||
| Lorents wrote: > I have to write to disk some rather large matrices (N=10000 or more) of > double precision numbers. > The output file is opened with FORM='UNFORMATTED'. (snip) > Now, from some benchmarks I ran (see later) ***it seems*** that the > "column-by-column" (CBC) way is always slower, and sometimes very much > so, than the "all at once" (AAO) one. > Now my questions are: > Is there any reason to expect CBC-writing to be much slower than AAO? > Is this a too much system-dependent thing to say anything at all? > Are there better ways of doing what I need? Perhaps using pointers? For reasonable N, I would not expect a large difference. Also, some systems might want a buffer big enough for a single WRITE, which would give CBC a big advantage... (snip) > Unfortunately, I'm afraid that various kinds of hardware or software > disk cache mechanisms mask the real performances. > For example, I can hardly believe in the N=10000 case that my disk can > really write 800MB in 3.245 seconds (it must be some kind of > cache/buffer effect...). > Until about N=7000 AAO and CBC are about equally fast, then there is an > interval where AAO is much faster, and then again their performance > seems to even out. Yes, it sounds like you are filling the disk cache with the first case, and then for the second it actually waits until it is written to disk. Try exchanging the order of AAO and CBC. I think you will then see that CBC is faster when it fits in cache and AAO doesn't. > I also did some tests at N=10000 using full compiler optimization (-O3) > but results didn't seem to change appreciably from the no-optimization > case. Most of the work is in library routines that don't change with optimization level. Disk drive speeds don't change, either. -- glen |
|
#3
| |||
| |||
| Writing unformatted data in Fortran includes binary record-size marks at each end. Depending on the compiler, this can cause a large overhead for each write. The last time I used g77, it would write unformatted records by doing a seek() to step over the first size mark, write data, seek backwards to fill in the missing size mark, then seek forward again. When writing to an NFS-mounted file, performance could easily be about 100X slower when writing multiple small records. Even though the total record size should be easy to calculate to avoid g77's method, there may be other issues where inserting the file-size marks are reducing the buffering efficiency, depending on how smart the compiler does I/O. If your compiler has a raw binary I/O extension (some have access='BINARY'), you could try raw binary output, and see if these issues go away. |
|
#4
| |||
| |||
| Hi! Without any intent of being rude (this is a serious consideration): when you're handling such large matrices, isn't it time to ask yourself the question: "Are all the data that I want to store still meaningful?" Nobody is able to grasp the significance of 10000*10000 numbers up to the last double precision digit behind the comma. Maybe this is the time to cut out the relevant part of your matrix and just save that? Is your matrix sparse? Is the condition number very large? In these cases (and probably many others) you can significantly reduce the size of the data to store! Regards, Arjan |
|
#5
| |||
| |||
| Arjan wrote: > Without any intent of being rude (this is a serious > consideration): when you're handling such large matrices, > isn't it time to ask yourself the question: "Are all the > data that I want to store still meaningful?" > Nobody is able to grasp the significance of 10000*10000 > numbers up to the last double precision digit behind the comma. That would be true for printing them out, but not always for intermediate storage. For most large matrix calculations you need double precision to get any significance at all in the result. (and likely more than double) > Maybe this is the time to cut out the relevant part > of your matrix and just save that? > Is your matrix sparse? Is the condition number very large? > In these cases (and probably many others) you can > significantly reduce the size of the data to store! Certainly worth thinking about, but often you want to write it out so you can decide later how much is relevant. -- glen |
|
#6
| |||
| |||
| glen herrmannsfeldt wrote: [...] > For reasonable N, I would not expect a large difference. > Also, some systems might want a buffer big enough for a single > WRITE, which would give CBC a big advantage... > (snip) Ok, good to know :-) > > Yes, it sounds like you are filling the disk cache with > the first case, and then for the second it actually waits > until it is written to disk. I did a few more tests and yes, the (software, I presume) disk cache / write behind makes my rudimentary benchmark strategy almost useless... For very big runs clocked times increase a lot probably because all the memory is used to store the matrix and there is no memory left for the disk cache. > > Most of the work is in library routines that don't change > with optimization level. Disk drive speeds don't change, > either. > Yes, I thought as much too. Thank you, Lorenzo |
|
#7
| |||
| |||
| Arjan wrote: > Hi! > > Without any intent of being rude (this is a serious consideration): > when you're handling such large matrices, > isn't it time to ask yourself the question: "Are all the data that I > want to store still meaningful?" Hi, unfortunately the answer is "yes" as these data are intermediate for computing other quantities. In my particular case, they are the nuclear-motion wave functions for a triatomic molecule. What is stored are the values of each wave function on a 3D grid. A typical grid is about 100x100x50 (in some coordinate system) while the number of wave functions (states) is of the order of 500-1000, so it's not too difficult to get a huge number of values. The wave functions are needed later to calculate transition intensities and other quantities. Lorenzo |
|
#8
| |||
| |||
| > In my particular case, they are the nuclear-motion > wave functions for a triatomic molecule. > What is stored are the values of > each wave function on a 3D grid. > A typical grid is about 100x100x50 (in some coordinate > system) while the number of wave functions (states) > is of the order of 500-1000, so it's not too difficult > to get a huge number of values. > The wave functions are needed later > to calculate transition intensities and other quantities. This sounds like a PhD-project, am I right? I'll continue to play the devil's advocate (still with all due respect!): > (in some coordinate system) What's the grid? Which base do you take? It must be a very inefficient base if you need that many nodes! In e.g. a Proper Orthogonal Decomposition, you'll get the relevant eigenstates. The ones with relatively very small eigenvalues are insignificant. Therefore, their transitions are irrelevant. Throw them away and reduce the space required for your problem! Storing 100*100*50 nodes in double precision cannot be meaningful. This is not a programming/computer issue, but a lack of insight in the physics! Talk to someone who knows about the physics, not to a programmer! Sorry if the above sounds blunt (which it does now that I read it back), but I strongly feel that you'll gain time if you take a totally different approach. Sorry for the inconvenience. Best regards and good luck with your project, Arjan |
|
#9
| |||
| |||
| Arjan wrote: >> In my particular case, they are the nuclear-motion >> wave functions for a triatomic molecule. [...] > > This sounds like a PhD-project, am I right? Well, you are right in thinking I'm a PhD student (fortunately, towards the end). However, it's not I who wrote this program (actually, it's a suite of programs). What I am doing is maintenance, fixing bugs and adding a few functionalities. The original program suite is called DVR3D and was written over a period of 20 years in my research group, firstly but my supervisor and then by many others. For the record, the full program suite can be found here: http://www.tampa.phys.ucl.ac.uk/ftp/vr/cpc03/ The file cpc.pdf contains the paper describing the program (it was later published in 2004 in Computer Physics Communications). If you have a look at this file you'll perhaps get an idea of the level of complexity of this whole thing... > > I'll continue to play the devil's advocate (still with all due > respect!): > Please, do :-) >> (in some coordinate system) > > What's the grid? Which base do you take? It must be a very inefficient > base if you need > that many nodes! Ok, in fact I simplified a little. The program is not "grid based" in the "naive" way, it used a numerical technique know as DVR, discrete variable representation, which also involves generating a grid. The grid points are the Gaussian quadrature points associated with the basis functions chosen for the three coordinates. The coordinate system used in my case are Radau coordinates... If you're really interested you can have a look at the paper, unfortunately I'm afraid the details are far too involved to be explained here on the newsgroup... I can only say that for "ordinary" runs the grid size is much smaller, about 45x45x35 but in the particular calculations at hand we have to use a much larger one (we are investigated very high energy states close to dissociation which are very delocalised). Also, even if a cleverer way of dealing with this particular situation existed, writing an ad hoc modification for this is not an option, it'd be way to complicated and time consusiming for me to do now. > In e.g. a Proper Orthogonal Decomposition, you'll get > the relevant eigenstates. > The ones with relatively very small eigenvalues are insignificant. > Therefore, their transitions are irrelevant. > Throw them away and reduce the space required for your problem! > Storing 100*100*50 nodes in double precision cannot be meaningful. > This is not a programming/computer issue, but a lack of insight in the > physics! > Talk to someone who knows about the physics, not to a programmer! I think you're underestimating the difficulty of the problem at hand. A three-body system may look small but it can still be very, very difficult in some cases. Many people on my group work on the molecular ion H3+ and even with present-day computers/ computer clusters computations are heavily limited by computational limitations... Lorenzo |
![]() |
| Thread Tools | |
| Display Modes | |
In an effort to better serve ads to our visitors, cookies are used on objectmix.com. For more information, check out our Privacy Policy.