| Title: | Digital Fortran |
| Notice: | Read notes 1.* for important information |
| Moderator: | QUARK::LIONEL |
| Created: | Thu Jun 01 1995 |
| Last Modified: | Fri Jun 06 1997 |
| Last Successful Update: | Fri Jun 06 1997 |
| Number of topics: | 1333 |
| Total number of notes: | 6734 |
UNIX 4.0 564
f77 4.1 -92

A Fortran program doing Fourier transforms runs with different timings:
- on an AlphaServer 8400 5/440, 4 MB cache; 4 GByte memory
- on a Rawhide 4xev56@400 MHz; 500 megs memory

On the Rawhide: nuwigres runs in 43' with 54% cpu due to paging
On the TL: nuwigres runs in 62' with 99% cpu ???

Same software, same output.

Why is the TurboLaser slower despite faster CPUs and full cpu utilization?

/Joseph
| T.R | Title | User | Personal Name | Date | Lines |
|---|---|---|---|---|---|
| 1195.1 | Rawhide has lower memory latency | WIDTH::MDAVIS | Mark Davis - compiler maniac | Tue Feb 25 1997 11:39 | 10 |
but not 3 times lower!?
[Were the 43' and 62' elapsed time => Rawhide 3x faster, or cpu time =>
Rawhide 33% faster?]
Was anything else running on the TL when you ran this?
I assume that since you need more than 500 meg of physical memory, memory
speed is key, and the 10% faster cpu has little relevance to the overall
speed....
I hope the FFT doesn't have power-of-2 cache thrashing problems.
BTW, what's the size of the Rawhide cache (I assume it's also 4 meg)?
| |||||
| 1195.2 | Is it always like this? | PERFOM::HENNING | | Thu Feb 27 1997 05:48 | 24 |
Rawhide/400 has a 4 MB cache.
Mark mentioned latency. Ditto bandwidth. If the job is only using a
single CPU, Rawhide is better at delivering memory bandwidth than is
Turbolaser...but not 3x better!
http://www.cs.virginia.edu/stream/standard/Bandwidth.html
(STREAM bandwidth, MB/s)
Machine ID        ncpus   COPY   SCALE    ADD   TRIAD
DEC_8400_5-350        1  215.7   207.5  219.6   234.2
DEC_4100_5-400-       1  247.8   243.9  264.9   268.0
Of course, TL has more "headroom" as you add CPUs.
If the job is prone to power-of-two problems, then it will show great
variability from run-to-run. It will tend to sometimes seem to be
stuck in a slow mode, and sometimes seem to run quite nicely.
Rebooting the system just before running will either be the best thing
you can do for it or the worst. It might like having its pages
randomized - and although expensive, letting the vm page reclaim
routines run around and muck up your working set would be one way of
randomizing. Having other jobs compete for the memory would randomize.
Conversely, having plenty of memory lying around idle would tend to let
a bad mapping stay bad and never budge.
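To illustrate why power-of-two spacing can hurt, here is a minimal sketch of the index arithmetic for a direct-mapped cache. The 4 MB size comes from this thread; the 64-byte line size, the direct mapping, and all names in the program are assumptions made for illustration only.

```fortran
      PROGRAM CONFLT
C     Illustrative only: shows why addresses spaced by a power of two
C     can collide in a direct-mapped cache.
C     ASSUMPTIONS (not from the thread): 64-byte lines, direct-mapped.
      INTEGER CSIZE, LSIZE, NLINES
      PARAMETER (CSIZE = 4*1024*1024, LSIZE = 64)
      PARAMETER (NLINES = CSIZE/LSIZE)
      INTEGER I, ADDR1, ADDR2
      DO I = 0, 3
C        Stride equal to the cache size: same line index every time
         ADDR1 = I*CSIZE
C        Stride padded by one line: a different line index each time
         ADDR2 = I*(CSIZE + LSIZE)
         WRITE (*,*) I, MOD(ADDR1/LSIZE, NLINES),
     &                  MOD(ADDR2/LSIZE, NLINES)
      END DO
      END
```

In this toy model the unpadded addresses all compete for one cache line, while the padded ones spread out. Whether the real job behaves this way depends on how its pages happen to be mapped, which is the run-to-run variability described above.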
| |||||
| 1195.4 | The divide is killing you | WIDTH::MDAVIS | Mark Davis - compiler maniac | Thu Feb 27 1997 16:23 | 93 |
I assume you're compiling at least -fast, which permits the
divide-by-loop-invariant to become
compute inverse outside loop
multiply by inverse inside loop.
Therefore in your inner loop:
DO IEP=1,NRPA
EFINL=WPRPA(IEP,KK)
EAVER=(EINIT+EFINL)/2.D0
EDEN(IAB,ICD,IQ)=EDEN(IAB,ICD,IQ)+
& XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
& FQQ/(ABSCIS(IQ)+EAVER)
ENDDO
the divide by 2.0 becomes mult by .5, but the divide by (ABSCIS(IQ)+EAVER)
is the dominant cost in the loop. The divide takes ~ 25 cycles.
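The divide-by-loop-invariant rewrite described above can be sketched by hand as follows. This is a toy program, not the poster's code: the arrays A, B, C and the scalars D and RD are hypothetical names chosen for the example.

```fortran
      PROGRAM INVDIV
C     Illustrative only; array names and sizes are hypothetical.
      INTEGER N, I
      PARAMETER (N = 4)
      DOUBLE PRECISION A(N), B(N), C(N), D, RD
      D = 3.D0
      DO I = 1, N
         A(I) = DBLE(I)
      END DO
C     As written: one divide per iteration
      DO I = 1, N
         B(I) = A(I)/D
      END DO
C     As transformed: one divide in total, then multiplies
      RD = 1.D0/D
      DO I = 1, N
         C(I) = A(I)*RD
      END DO
C     The two forms can differ in the last bit, so compare with a
C     tolerance instead of testing for equality
      DO I = 1, N
         WRITE (*,*) I, ABS(B(I) - C(I)) .LE. 1.D-15*ABS(B(I))
      END DO
      END
```

Because the results are not guaranteed to be bit-identical, a compiler typically applies this rewrite only when a relaxed-accuracy option allows it. The same trick does not help when the divisor changes on every iteration, as it does in the loop quoted above.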
Compiling with f77 -fast -tune ev5 and different switches generates
different numbers of cycles per iteration:
-unroll 1 52
<normal unroll> 33
-O5 (pipelining) 25
The pipelining case shows that the optimizer does the best job possible:
it has to do at least 1 divide per iteration, and since divides can't
overlap, then 25 cycles/iteration is the minimum.
HOWEVER, the function can be recoded to save LOTS of time. The expression
"FQQ/(ABSCIS(IQ)+EAVER)" depends only on the inner 3 loop indices, and
the arrays involved (WPRPA, ABSCIS, WRPA, WEIGHT) are not modified
anywhere in the routine. Therefore it is possible to precompute this
expression for all values of IQ, IE, IEP outside of the deep loop nest
and store it in a 3-dimensional array. The original routine is
computing these values 20*20*20*20 times more often than necessary.
The other thing to do is to use a local scalar var: EDEN1
to accumulate the values into, instead of EDEN(IAB,ICD,IQ), since
the indices for this array element don't change inside the inner 2 loops.
So at the beginning of the routine you add:
dimension fq_ab_eav_inv(325, 325, 64)
DO IQ=1,64
QQ=ABSCIS(IQ)/XLAMB
FQQ=1.D0/(1.D0+QQ*QQ)**2
FQQ=FQQ*FQQ*ABSCIS(IQ)*WEIGHT(IQ)
DO IE=1,NRPA
EINIT=WRPA(IE,KK)
DO IEP=1,NRPA
EFINL=WPRPA(IEP,KK)
EAVER=(EINIT+EFINL)/2.D0
fq_ab_eav_inv(iep,ie,iq) = FQQ/(ABSCIS(IQ)+EAVER)
enddo
enddo
enddo
and replace the inner 3 loops with:
DO IQ=1,64
C
EDEN1=0.D0
DO IE=1,NRPA
DO IEP=1,NRPA
EDEN1=EDEN1+
& XLEFT(IIA,IIC,IEP,KK)*XRIGHT(IIB,IID,IE,KK)*OES(IEP,IE,KK)*
& fq_ab_eav_inv(iep,ie,iq)
ENDDO
ENDDO
EDEN(IAB,ICD,IQ)=EDEN1*2.D0/(197.328*PI)
ENDDO
Now the inner loop has 4 loads, 3 multiplies, and 1 add. The software
pipeliner (-fast -O5) does its job and reduces the number of cycles
per iteration to 6.5 !
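For anyone who wants to convince themselves that the recoding preserves the results, here is a scaled-down, self-contained sketch that runs both forms and compares them. Sizes and data are made up, the names FQ, W, WP, TAB, E1, E2, ACC and IREP are hypothetical, and IREP merely stands in for the outer loops of the real routine.

```fortran
      PROGRAM PRECMP
C     Scaled-down illustration. Sizes, data and the names FQ, TAB,
C     E1, E2, ACC and IREP are made up; IREP stands in for the outer
C     loops of the real routine.
      INTEGER NQ, NR
      PARAMETER (NQ = 3, NR = 4)
      DOUBLE PRECISION ABSCIS(NQ), FQ(NQ), W(NR), WP(NR)
      DOUBLE PRECISION OES(NR,NR), TAB(NR,NR,NQ)
      DOUBLE PRECISION E1(NQ), E2(NQ), ACC
      INTEGER IQ, IE, IEP, IREP
C     Fill the read-only inputs with arbitrary positive values
      DO IQ = 1, NQ
         ABSCIS(IQ) = 0.5D0*IQ
         FQ(IQ) = 1.D0/(1.D0 + IQ)
         E1(IQ) = 0.D0
         E2(IQ) = 0.D0
      END DO
      DO IE = 1, NR
         W(IE) = 1.D0 + 0.1D0*IE
         WP(IE) = 2.D0 + 0.2D0*IE
         DO IEP = 1, NR
            OES(IEP,IE) = 1.D0/(IEP + IE)
         END DO
      END DO
C     Original form: the divide is redone on every repetition
      DO IREP = 1, 5
         DO IQ = 1, NQ
            DO IE = 1, NR
               DO IEP = 1, NR
                  E1(IQ) = E1(IQ) + OES(IEP,IE)*
     &               FQ(IQ)/(ABSCIS(IQ) + (W(IE) + WP(IEP))/2.D0)
               END DO
            END DO
         END DO
      END DO
C     Rewritten form: build the table of quotients once ...
      DO IQ = 1, NQ
         DO IE = 1, NR
            DO IEP = 1, NR
               TAB(IEP,IE,IQ) =
     &            FQ(IQ)/(ABSCIS(IQ) + (W(IE) + WP(IEP))/2.D0)
            END DO
         END DO
      END DO
C     ... then reuse it, accumulating into a local scalar
      DO IREP = 1, 5
         DO IQ = 1, NQ
            ACC = 0.D0
            DO IE = 1, NR
               DO IEP = 1, NR
                  ACC = ACC + OES(IEP,IE)*TAB(IEP,IE,IQ)
               END DO
            END DO
            E2(IQ) = E2(IQ) + ACC
         END DO
      END DO
C     Rounding order differs, so compare with a tolerance
      DO IQ = 1, NQ
         WRITE (*,*) IQ, ABS(E1(IQ) - E2(IQ)) .LE. 1.D-12*ABS(E1(IQ))
      END DO
      END
```

The comparison uses a relative tolerance because the two forms round in a different order. A similar side-by-side check on the real routine, with real data, would be a reasonable safeguard before adopting the change.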
Unfortunately, this transformation will speed up the program on ANYONE's
machine, even our competitors' (though our divide is slower in terms of
number of cycles, because our cycle speed is so fast, so we'll see a
larger gain).
Mark
| |||||
| 1195.5 | re: .4: slightly safer transformation | WIDTH::MDAVIS | Mark Davis - compiler maniac | Thu Feb 27 1997 16:40 | 17 |
I moved the precomputation of the inverse expression outside of all the
loops. However, there's a bunch of conditional code, so it IS possible
that the inner loops will never be executed (though I assume this case is
unlikely). To avoid doing the precomp unnecessarily, you can instead move
the precomp loop inside the fourth loop, after all the tests for whether to
skip the inner loops, and then put:
if (initialized .eq. 0) then
<precomp loop>
initialized = 1
endif
and of course put "initialized = 0" at the beginning of the routine...
| |||||
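A minimal sketch of the guarded precomputation suggested in .5, under stated assumptions: the routine WORK, the SKIP flags, the table TAB, and the flag name INITED are hypothetical stand-ins for the real routine, its skip tests, the inverse table, and "initialized".

```fortran
      PROGRAM LAZY
C     Illustrative only; WORK, SKIP and TAB are hypothetical names.
      LOGICAL SKIP(3)
      DOUBLE PRECISION TOTAL
      DATA SKIP /.TRUE., .FALSE., .FALSE./
      CALL WORK(3, SKIP, TOTAL)
      WRITE (*,*) TOTAL
      END

      SUBROUTINE WORK(N, SKIP, TOTAL)
      INTEGER N, I, J, INITED
      LOGICAL SKIP(N)
      DOUBLE PRECISION TOTAL, TAB(4)
C     Reset on entry, so the table is rebuilt at most once per call
      INITED = 0
      TOTAL = 0.D0
      DO I = 1, N
         IF (.NOT. SKIP(I)) THEN
C           First time the inner work is really needed: build table
            IF (INITED .EQ. 0) THEN
               DO J = 1, 4
                  TAB(J) = 1.D0/DBLE(J)
               END DO
               INITED = 1
            END IF
            DO J = 1, 4
               TOTAL = TOTAL + TAB(J)
            END DO
         END IF
      END DO
      END
```

If every pass is skipped, the table is never built and the precomputation costs nothing. The flag test itself is one integer compare per outer iteration, which should be negligible next to the inner loops.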