When you generate a set of correlated random numbers and then calculate the correlation of those samples, you’ll notice that the calculated “sample correlation” can deviate from the correlation value used to generate the samples. This is because of the randomness of the samples and the limited number of samples you use.
Note: You can now download an Excel worksheet containing the model
The plot below shows the effect of the sample size on the distribution of the calculated correlation. As the sample size increases, the distribution of the calculated correlation becomes narrower, and the estimate of the correlation becomes more precise.
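The narrowing can be reproduced with a small simulation. The following is a minimal Python sketch, not the Excel model itself. It assumes the samples are bivariate normal, and the function names (correlated_pairs, sample_correlation, spread_of_sample_correlation) are made up for this illustration. It draws many sample sets of a given size and reports the standard deviation of the calculated correlations.

```python
import math
import random

def correlated_pairs(n, rho, rng):
    """Draw n pairs of standard normals with population correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        # y = rho*x + sqrt(1 - rho^2)*e has unit variance and correlation rho with x
        pairs.append((x, rho * x + math.sqrt(1.0 - rho * rho) * e))
    return pairs

def sample_correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

def spread_of_sample_correlation(n, rho, trials, rng):
    """Standard deviation of the sample correlation over repeated experiments."""
    rs = [sample_correlation(correlated_pairs(n, rho, rng)) for _ in range(trials)]
    mean = sum(rs) / trials
    return math.sqrt(sum((r - mean) ** 2 for r in rs) / (trials - 1))

rng = random.Random(1)  # fixed seed so the run is repeatable
for n in (10, 100, 1000):
    # The printed spread should shrink as the sample size n grows
    print(n, round(spread_of_sample_correlation(n, 0.5, 500, rng), 3))
```

Each printed row pairs a sample size with the spread of the calculated correlation around the true value of 0.5 used here.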
Fisher derived a formula for this sample correlation distribution in Fisher, R.A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population”. Biometrika 10 (4): 507–521.
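The distribution of the calculated correlation coefficients is given by the rather complicated equation:

$$ f(r) = \frac{(N-2)\,\Gamma(N-1)\,(1-\rho^2)^{(N-1)/2}\,(1-r^2)^{(N-4)/2}}{\sqrt{2\pi}\,\Gamma\!\left(N-\tfrac{1}{2}\right)(1-\rho r)^{N-3/2}}\;{}_2F_1\!\left(\tfrac{1}{2},\tfrac{1}{2};\,N-\tfrac{1}{2};\,\tfrac{\rho r+1}{2}\right) $$

where $N$ is the number of samples used to calculate the correlation, $r$ is the calculated correlation, and $\rho$ is the actual correlation used to generate the samples (N, r, and p in the VBA code below).
$\Gamma$ is the gamma function, an extension of the factorial function to real and complex numbers.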
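${}_2F_1$ is a hypergeometric function defined as:

$$ {}_2F_1(a,b;c;z) = \sum_{n=0}^{\infty} \frac{(a)_n\,(b)_n}{(c)_n}\,\frac{z^n}{n!} $$

with $n!$ the factorial, and $(x)_n = x(x+1)\cdots(x+n-1)$ the Pochhammer symbol, where $(x)_0 = 1$.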
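As a cross-check on these definitions, the following Python sketch evaluates the hypergeometric series term by term and plugs it into the density. It is an illustration under assumptions rather than a port of the Excel model: the names hyp2f1_series, correlation_density, and integrate are hypothetical, and the choices of 50 series terms and a midpoint-rule integration are arbitrary. A valid density should integrate to about 1 over the interval from -1 to 1.

```python
import math

def hyp2f1_series(a, b, c, z, terms=50):
    """Truncated series: sum over n = 0..terms of (a)_n (b)_n / (c)_n * z^n / n!."""
    total = 1.0  # the n = 0 term
    term = 1.0
    for n in range(terms):
        # Ratio of term n+1 to term n: (a+n)(b+n) / ((c+n)(n+1)) * z
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total

def correlation_density(r, n, rho):
    """Density of the sample correlation r for n samples and true correlation rho."""
    # Gamma(n-1) / Gamma(n-1/2), evaluated in log space to avoid overflow for large n
    log_gamma_ratio = math.lgamma(n - 1) - math.lgamma(n - 0.5)
    front = (n - 2) * math.exp(log_gamma_ratio) / math.sqrt(2 * math.pi)
    shape = ((1 - rho * rho) ** ((n - 1) / 2)
             * (1 - r * r) ** ((n - 4) / 2)
             / (1 - rho * r) ** (n - 1.5))
    return front * shape * hyp2f1_series(0.5, 0.5, n - 0.5, (rho * r + 1) / 2)

def integrate(f, lo, hi, steps=20000):
    """Midpoint rule, which never evaluates f at the end points r = -1 and r = 1."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

# Total probability for 10 samples and a true correlation of 0.5
area = integrate(lambda r: correlation_density(r, 10, 0.5), -1.0, 1.0)
print(area)
```

Updating each series term from the previous one avoids computing large factorials and Pochhammer products separately.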
VBA (Visual Basic for Applications) code
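```vb
Const PI As Double = 3.14159265358979

' Truncated Gauss hypergeometric series 2F1(a, b; c; z), summed from n = 0 to n = N.
Function HyperGeom2F1(a As Double, b As Double, c As Double, z As Double, N As Integer) As Double
    Dim an As Double   ' Pochhammer symbol (a)_i
    Dim bn As Double   ' Pochhammer symbol (b)_i
    Dim cn As Double   ' Pochhammer symbol (c)_i
    Dim zn As Double   ' z ^ i
    Dim fn As Double   ' i!
    Dim i As Integer

    HyperGeom2F1 = 1   ' n = 0 term
    an = a
    bn = b
    cn = c
    zn = z
    fn = 1
    For i = 1 To N
        HyperGeom2F1 = HyperGeom2F1 + an * bn / cn / fn * zn
        an = an * (a + i)
        bn = bn * (b + i)
        cn = cn * (c + i)
        fn = fn * (1 + i)
        zn = zn * z
    Next i
End Function

' Density of the sample correlation r for N samples and actual correlation p.
Public Function CorrelationSampleDist(N As Double, p As Double, r As Double) As Double
    Dim CD As Double
    ' Gamma(N - 1) / Gamma(N - 1/2), taken in log space so that large N does not overflow
    CD = (N - 2) * Exp(WorksheetFunction.GammaLn(N - 1) - WorksheetFunction.GammaLn(N - 0.5))
    CD = CD * (1 - p * p) ^ ((N - 1) / 2) * (1 - r * r) ^ ((N - 4) / 2)
    CD = CD / (Sqr(2 * PI) * (1 - p * r) ^ (N - 3 / 2))
    ' Only the first 5 terms of the series beyond the leading 1 are used
    CD = CD * HyperGeom2F1(0.5, 0.5, N - 0.5, (p * r + 1) / 2, 5)
    CorrelationSampleDist = CD
End Function
```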