

19990727: same as 19990709, but with longer equilibration period

GOAL

Influence of the equilibration period: run on t3e.cineca.it exactly the same calculation as in 19990709, except that the equilibration time is ten times longer (50000 steps here, 5000 steps in 19990709).

(19990709 is an exact copy of 19990707, except that it was run on t3e.cineca.it instead of utsj90.univ.trieste.it.)

Runtime problems on t3e

As on utsj90, I found DL_POLY hanging on t3e: it apparently consumes CPU time without producing any output until the CPU time limit is exceeded. I suspect a problem with the MPI calls, since the traceback in the qsub log reports:

...
SIGNAL: CPU limit exceeded ( from process 237038 )

 Beginning of Traceback (PE 4):
  Interrupt at address 0x800252748 in routine '???'.
  Called from line 1726 (address 0x8000f1bbc) in routine 'MPI_Allreduce'.
  Called from line 2508 (address 0x80016a018) in routine 'MPI_ALLREDUCE'.
  Called from line 82 (address 0x800042ce0) in routine 'GISUM'.
  Called from line 79 (address 0x80008e778) in routine 'VERTEST'.
  Called from line 542 (address 0x800016924) in routine 'DLPOLY'.
  Called from line 475 (address 0x800000c98) in routine '$START$'.
 End of Traceback.
...
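A hang of this kind (CPU time consumed, nothing written) could in principle be spotted before the CPU limit is hit by watching whether an output file keeps growing. The following is only a minimal sketch: `file_size` and `has_grown` are hypothetical helper names, and the file name `OUTPUT` and the 300 s polling interval in the commented loop are assumptions, not something taken from the runs described here.

```sh
#!/bin/sh
# Hypothetical helper: print the size of a file in bytes (0 if it does not exist).
file_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Hypothetical helper: succeed if file $1 is now larger than the
# previously recorded size $2 (in bytes).
has_grown() {
    [ "$(file_size "$1")" -gt "$2" ]
}

# Example polling loop (left commented out; file name and interval are assumptions):
# prev=0
# while sleep 300; do
#     has_grown OUTPUT "$prev" || echo "no new output in the last 300 s"
#     prev=$(file_size OUTPUT)
# done
```

Such a check only says that nothing new was written; it cannot tell a genuine hang in an MPI call from a long step that simply produces no output.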

Anyway, I took the same input and ran it on 8 PEs instead of 32, and things were fine. The only change in the DL_POLY input was the job time directive:

sn6320:265> diff -c CONTROL.START ../19990727/CONTROL.START
*** CONTROL.START       Wed Jul 28 16:16:58 1999
--- ../19990727/CONTROL.START     Tue Jul 27 19:58:07 1999
***************
*** 66,72 ****
  # (e.g. terminating the current cycle)
  # When CPU_time>(job_time-close_time), the program will start closing
  # operations and will stop
! job time  2000
  close time  100
  
  finish
--- 66,72 ----
  # (e.g. terminating the current cycle)
  # When CPU_time>(job_time-close_time), the program will start closing
  # operations and will stop
! job time  1000
  close time  100
  
  finish

Here is the diff for the qsub scripts:

sn6320:266> diff -c qsub.sh ../19990727/qsub.sh
*** qsub.sh     Wed Jul 28 14:46:01 1999
--- ../19990727/qsub.sh   Wed Jul 28 12:44:26 1999
***************
*** 4,13 ****
  
  # assign this name to the request
  #QSUB -r balducci
! # request this maximum number of PE's
! #QSUB -l mpp_p=8
  # request this per-process connect time (s)
! #QSUB -l p_mpp_t=2500
  
  # output and error log go here
  #QSUB -eo -o qsub.log
--- 4,13 ----
  
  # assign this name to the request
  #QSUB -r balducci
! # request this number of PE's
! #QSUB -l mpp_p=32
  # request this per-process connect time (s)
! #QSUB -l p_mpp_t=2000
  
  # output and error log go here
  #QSUB -eo -o qsub.log
***************
*** 23,29 ****
--- 23,34 ----
  
  set -x       
  
+ SCACHE_I_STREAMS=1
+ SCACHE_D_STREAMS=1
+ export SCACHE_I_STREAMS
+ export SCACHE_D_STREAMS
  
+
  # enable job accounting
  ja
  
***************
*** 36,42 ****
  # mark current position of the job accounting file
  m1=`ja -m`
  
! mpprun -n 8 /u/tritsb43/c_develop/bin/dlpoly
  
  # mark current position of the job accounting file
  m2=`ja -m`
--- 41,47 ----
  # mark current position of the job accounting file
  m1=`ja -m`
  
! mpprun -n 32 /u/tritsb43/c_develop/bin/dlpoly
  
  # mark current position of the job accounting file
  m2=`ja -m`
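As a quick sanity check on the time settings in the two diffs, here is a minimal shell sketch of the arithmetic described in the CONTROL comments (closing starts when CPU_time > job_time - close_time). The helper names `closing_start` and `fits_queue_limit` are hypothetical, and comparing the DL_POLY job time directly with the `p_mpp_t` request is an assumption that the two clocks are comparable.

```sh
#!/bin/sh
# Hypothetical helper: CPU time (s) at which DL_POLY starts its closing
# operations, i.e. job_time - close_time.
closing_start() {
    echo $(( $1 - $2 ))
}

# Hypothetical helper: succeed if the DL_POLY job time ($1) does not exceed
# the per-process time requested from the queue ($2, the p_mpp_t value).
fits_queue_limit() {
    [ "$1" -le "$2" ]
}

closing_start 2000 100    # 8-PE run: prints 1900
closing_start 1000 100    # 19990727 (32-PE) run: prints 900

fits_queue_limit 2000 2500 && echo "8-PE run: job time within p_mpp_t"
fits_queue_limit 1000 2000 && echo "32-PE run: job time within p_mpp_t"
```

With the values from the diffs, both runs leave a margin between the DL_POLY job time and the queue request, so a job that reaches the CPU limit (as in the traceback above) never got to its own closing phase.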

This unreliability of the code (now emerging also on t3e) is exhausting.

Results

The tenfold increase in equilibration time appears to change the results dramatically.

[Figure: 19990727-01.pslatex]

The "long equilibrated" 19990727 result compares very well with the "short equilibrated" 19990702 one. The configurations for these two runs were "standardized" with the dlsub code, and I did not find them in good agreement at all (see 19990702 and 19990707). I want to re-run the 19990702 calculation with the same long (50000 steps) equilibration time to see whether the "long equilibrated" results still agree.

NOTE, however, that 19990727 was run on t3e while 19990702 was run on utsj90. I am still cautious about comparing data from different hardware (19990709 was not completely satisfactory in this regard).

[Figure: 19990727-02.pslatex]

