|  |     As it happens, I've been working on a related problem: the sum of
    of k integers drawn uniformly from the range 0..N-1 *with replacement*.
    This has application to a statistical procedure used widely in
    parapsychology in the last decade or so -- the sum of ranks statistic
    (I'll be glad to describe the statistic and its application if anyone
    is interested).
    Anyway if you let S be a particular sum, the probability of that
    sum is:
		       k  S/N     E  / k \   / S - N*E + k - 1 \
	    p(S) = (1/N ) SUM (-1)  (     ) (                   )
			  E=0        \ E /   \      k - 1      /
    And the cumulative probability is obviously just:
		       k   S    M/N     E  / k \   / M - N*E + k - 1 \
	    P(S) = (1/N ) SUM   SUM (-1)  (     ) (                   )
			  M=0   E=0        \ E /   \      k - 1      /
    If S1 is the sum of k integers drawn uniformly from the range 1..N,
    then simply replace S by S1-k in the above equations.
    What is the effect of non-replacement on the above?
    My intuition is as follows:
	1) The extremes of the lower tail will be "pushed up" towards the
	middle since much of the lower tail is created by repetitions of
	small values (for example, the smallest value with non-zero
	probability in the with replacement case is 0, with probability
	1/N**k; the smallest value with non-zero probability in the
	without replacement case is (k**2 - k)/2 with probability
	k!*(N-k)!/N! ).
	2) The extremes of the upper tail will be "pushed down" towards
	the middle.  The curve will continue to be perfectly symmetric.
	3) The middle of the curve will retain roughly the same relative
	shape, but will become "rougher" in fine texture (the with
	replacement case is piecewise rather "smooth").  Various
	combinations are pulled out rather randomly in relationship
	to position, roughening it all up.
    I am unsure as to whether the without replacement case is more or
    less "normal"; I rather suspect that it depends on your measure of
    similarity.  The major interference with normality in the with
    replacement case for small k is that the extreme tails are too
    heavy, which would argue for the without replacement case being
    "closer to normal".  But this is only in the extreme tails, which
    only concerns very rare events, and so is rarely applicable.  The
    roughening in the middle which I speculate on, would tend to make
    the highly probable center portion less normal.  If you measure
    "normality" in terms of overall shape, therefore, I would guess that
    the without replacement case would be perhaps a bit more normal.
    If, on the other hand, you measure "normality" by something like
    the *expected* deviation then the with replacement would probably
    win.
					Topher
 | 
|  | 	I brute forced the solution as .5 suggested.
	I picked every possible sum (1402410240 of them!) and figured out
	the frequency of each. Now, since this is a distribution, the area
	under the curve should=1. So, I normalized  the frequencies by dividing
	each frequency  by 1402410240 to make their sum =1.
    	(note: although there are only 1947792 different combinations,  )
    	( they can be made in 1402410240 ways...I think! please correct )
        (me if I'm wrong. )
    
	Now, if this where a normal distribution, then its mean would be
	111, and the peak at the mean is
		peak= 1/(sqrt(2*pi)*sigma)
	but we know from the histogram that the peak frequency (which occurs
	at sum=111), is 23136480.
	From this information we can conclude that if the distribution 
        where normal, it would be N(111,24.2)
	So I plotted both the normalized histogram, and the Guassian with
	mean 111, and sigma=24.2. 
	Guess what? they look identical except at the extremes.It is close 
	enough to call it  Gaussian, at least for my purposes!
    
	I've enclosed the histogram of the sums, and the program to plot
	the curves. The program also creates a norm_hist.out which contains
        each sum, it's frequency, the normalized histogram, and the
    	corresponding Guassian. Make sure when you extract the histogram 
        to call it hist.out...the program takes this data and does the
    	rest. Enjoy.
			yousif.
    	p.s. the program is in Fortran...not commented, and could be
             better. But, it works.
    
--------------------histogram data------------------------------------------
 sum     total
   1         0
   2         0
   3         0
   4         0
   5         0
   6         0
   7         0
   8         0
   9         0
  10         0
  11         0
  12         0
  13         0
  14         0
  15         0
  16         0
  17         0
  18         0
  19         0
  20         0
  21       720
  22       720
  23      1440
  24      2160
  25      3600
  26      5040
  27      7920
  28     10080
  29     14400
  30     18720
  31     25200
  32     31680
  33     41760
  34     51120
  35     64800
  36     79200
  37     97920
  38    117360
  39    143280
  40    169200
  41    203040
  42    238320
  43    281520
  44    326880
  45    383040
  46    440640
  47    510480
  48    583920
  49    670320
  50    761040
  51    868320
  52    978480
  53   1107360
  54   1242000
  55   1395360
  56   1555200
  57   1737360
  58   1924560
  59   2136240
  60   2355120
  61   2598480
  62   2849040
  63   3127680
  64   3411360
  65   3724560
  66   4044960
  67   4393440
  68   4748400
  69   5135040
  70   5524560
  71   5946480
  72   6372000
  73   6827760
  74   7285680
  75   7776000
  76   8263440
  77   8782560
  78   9298800
  79   9843120
  80  10380960
  81  10948320
  82  11502000
  83  12083040
  84  12649680
  85  13237920
  86  13807440
  87  14398560
  88  14963040
  89  15546240
  90  16101360
  91  16668720
  92  17202960
  93  17749440
  94  18254880
  95  18768960
  96  19241280
  97  19715040
  98  20142720
  99  20572560
 100  20948400
 101  21323520
 102  21644640
 103  21958560
 104  22215600
 105  22466880
 106  22654800
 107  22834800
 108  22953600
 109  23059440
 110  23102640
 111  23136480
 112  23102640
 113  23059440
 114  22953600
 115  22834800
 116  22654800
 117  22466880
 118  22215600
 119  21958560
 120  21644640
 121  21323520
 122  20948400
 123  20572560
 124  20142720
 125  19715040
 126  19241280
 127  18768960
 128  18254880
 129  17749440
 130  17202960
 131  16668720
 132  16101360
 133  15546240
 134  14963040
 135  14398560
 136  13807440
 137  13237920
 138  12649680
 139  12083040
 140  11502000
 141  10948320
 142  10380960
 143   9843120
 144   9298800
 145   8782560
 146   8263440
 147   7776000
 148   7285680
 149   6827760
 150   6372000
 151   5946480
 152   5524560
 153   5135040
 154   4748400
 155   4393440
 156   4044960
 157   3724560
 158   3411360
 159   3127680
 160   2849040
 161   2598480
 162   2355120
 163   2136240
 164   1924560
 165   1737360
 166   1555200
 167   1395360
 168   1242000
 169   1107360
 170    978480
 171    868320
 172    761040
 173    670320
 174    583920
 175    510480
 176    440640
 177    383040
 178    326880
 179    281520
 180    238320
 181    203040
 182    169200
 183    143280
 184    117360
 185     97920
 186     79200
 187     64800
 188     51120
 189     41760
 190     31680
 191     25200
 192     18720
 193     14400
 194     10080
 195      7920
 196      5040
 197      3600
 198      2160
 199      1440
 200       720
 201       720
 202         0
 203         0
 204         0
 205         0
 206         0
 207         0
 208         0
 209         0
 210         0
----------------------program to do the rest---------------------------------
	include 'sys$library:uisentry'
	include 'sys$library:uisusrdef'
	integer ihist(210),sum
	real    nhist(210),guass(210)
	character*40 ha
	sigma=24.18175709  !calculated as described in note
	sum=0
	open(unit=1,file='hist.out',status='old')
	open(unit=2,file='norm_hist.out',status='new')
	read (1,2)ha
2	format(a40)
	write(2,4)
4	format('  sum  histogram   normalized hist  guassian ')
	do 10 i=1,210
10		read(1,1)k,ihist(i)
1	format(' ',i3,i10)
	do 20 i=1,210
	    nhist(i)=real(ihist(i))/1402410240.0	
	    guass(i)=1/(2.506628274*sigma)*
     *          exp(-((111.0-real(i))/(sigma))**2/2.0)
20          write(2,3)i,ihist(i),nhist(i),guass(i)
3	format(' ',i3,3x,i10,3x,f12.10,3x,f12.10)
	close (2)
	close(1)
	ivd=uis$create_display(0.0,0.0,210.0,0.02,30.0,30.0)
	iwd=uis$create_window(ivd,'sys$workstation')
	
        	call uis$plot(ivd,0,0.0,0.0)
	do 30 i=1,210
		call uis$plot(ivd,0,real(i-1),nhist(i-1),
     *                               real(i),nhist(i))
30	continue
        	call uis$plot(ivd,0,0.0,0.0)
	do 40 i=1,210
		call uis$plot(ivd,0,real(i-1),guass(i-1),
     *                                    real(i),guass(i))
40	continue
	pause
	end
    
 | 
|  |     
    re .7
    
    If I were to count the combinations and not the permutations,
    that would mean that each frequency would really be divided by
    6! =720 (since it would occur 720 times less)
    
    But, the total number of combinations would be 1947792. So to normalize
    the new frequencies, I would divide by 1947792...
    
    So, in effect I've divided the old frequeny by 720*1947792=1402410240
    and should get the same normalized histogram, with the same mean
    at 111 and the same peak of 0.01649...
    
    Using the equation in .5, yields the same sigma, too. We're back
    to the N(111,24.2)
    
    So, nothing really changes, except the effort in obtaining the original
    histogram (which, I admit, is a significant effort).
    
    yousif.
 |