### The Distribution of the Sample Correlation

When you generate a set of correlated random numbers and then calculate the correlation of those samples, you’ll notice that the calculated “sample correlation” can deviate from the correlation value used to generate the samples. This is because the samples are random and only a limited number of them is used.

Note: You can download an Excel worksheet containing the model: sample_correlation.xls

The plot below shows the effect of the sample size on the distribution of the calculated correlation. As the sample size increases, the distribution of the calculated correlation becomes narrower, so the estimate of the correlation becomes more precise.

Distribution of the calculated sample correlation for sample sizes of 10, 20 and 100, all generated with a correlation of 0.50.
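
The narrowing with sample size can also be checked by simulation. The following is a minimal sketch in Python (standard library only, rather than the VBA used for the model), assuming the samples are bivariate normal with unit variances. The function names, the seed and the number of trials are illustrative choices, not part of the original worksheet.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Sample (Pearson) correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_correlations(n, rho, trials, rng):
    """Draw `trials` samples of size n with true correlation rho and
    return the sample correlation calculated from each one."""
    results = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # y = rho*x + sqrt(1 - rho^2)*z has unit variance and
        # correlation rho with x when x and z are independent N(0, 1).
        ys = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
              for x in xs]
        results.append(pearson(xs, ys))
    return results


rng = random.Random(12345)  # arbitrary fixed seed so runs are repeatable
spread = {}
for n in (10, 20, 100):
    rs = sample_correlations(n, 0.5, 2000, rng)
    spread[n] = statistics.stdev(rs)
    print(f"n={n:3d}  mean r={statistics.fmean(rs):.3f}  "
          f"std dev of r={spread[n]:.3f}")
```

The standard deviation of the calculated correlation should shrink as the sample size grows from 10 to 100, mirroring the narrowing of the distributions in the plot.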

Fisher derived a formula for this sample correlation distribution in Fisher, R.A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population”. Biometrika 10 (4): 507–521.

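The distribution of the calculated correlation coefficient is given by the rather complicated equation:

$$
f(r) = \frac{(N-2)\,\Gamma(N-1)\,(1-\rho^2)^{\frac{N-1}{2}}\,(1-r^2)^{\frac{N-4}{2}}}{\sqrt{2\pi}\,\Gamma\!\left(N-\tfrac{1}{2}\right)\,(1-\rho r)^{N-\frac{3}{2}}}\;{}_2F_1\!\left(\tfrac{1}{2},\tfrac{1}{2};\,N-\tfrac{1}{2};\,\tfrac{\rho r+1}{2}\right)
$$

where $N$ is the number of samples used to calculate the correlation, $r$ is the calculated correlation, and $\rho$ is the actual correlation used to generate the samples (written `p` in the VBA code).

$\Gamma$ is the gamma function, an extension of the factorial function to real and complex numbers.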

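${}_2F_1$ is the Gauss hypergeometric function, defined as:

$$
{}_2F_1(a, b; c; z) = \sum_{n=0}^{\infty} \frac{(a)_n\,(b)_n}{(c)_n}\,\frac{z^n}{n!}
$$

with $n!$ the factorial, and $(x)_n = x(x+1)\cdots(x+n-1)$ the Pochhammer symbol (rising factorial), where $(x)_0 = 1$.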
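
As a sanity check on the formula, the density can be evaluated numerically and integrated over $-1 < r < 1$, where it should give an area of 1. The sketch below is in Python (standard library only) rather than VBA. The names `hyp2f1`, `corr_density` and `integrate` are illustrative, and the series is summed until its terms become negligible, which assumes $|z| < 1$.

```python
import math


def hyp2f1(a, b, c, z, tol=1e-15, max_terms=10_000):
    """Gauss hypergeometric series sum_n (a)_n (b)_n / (c)_n * z^n / n!,
    summed term by term; assumes |z| < 1 so the series converges."""
    term = 1.0   # the n = 0 term
    total = 1.0
    for n in range(max_terms):
        # Ratio of consecutive terms: (a+n)(b+n) / ((c+n)(n+1)) * z
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total


def corr_density(r, n, rho):
    """Density of the sample correlation r for sample size n and
    true correlation rho, following the equation above."""
    # Use log-gamma so the ratio of gamma functions cannot overflow.
    log_gamma_ratio = math.lgamma(n - 1) - math.lgamma(n - 0.5)
    front = (n - 2) * math.exp(log_gamma_ratio) / math.sqrt(2.0 * math.pi)
    shape = (1 - rho * rho) ** ((n - 1) / 2) * (1 - r * r) ** ((n - 4) / 2)
    shape /= (1 - rho * r) ** (n - 1.5)
    return front * shape * hyp2f1(0.5, 0.5, n - 0.5, (rho * r + 1) / 2)


def integrate(f, lo, hi, steps=4000):
    """Midpoint rule; never evaluates f at the end points r = -1 or r = 1."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


area = integrate(lambda r: corr_density(r, 10, 0.5), -1.0, 1.0)
mean_r = integrate(lambda r: r * corr_density(r, 10, 0.5), -1.0, 1.0)
print(f"area under density = {area:.6f}, mean of r = {mean_r:.4f}")
```

For a sample size of 10 and a correlation of 0.50, the mean of the calculated correlation comes out slightly below 0.50, because the distribution is skewed towards smaller values.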

Distribution of the calculated sample correlation for a sample size of 10, with correlations of 0.00, 0.25 and 0.50 used to generate the samples.

## VBA (Visual Basic for Applications) code

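    Const PI = 3.14159265358979

    ' Partial sum of the Gauss hypergeometric series 2F1(a, b; c; z):
    ' the leading 1 plus N further terms.
    Function HyperGeom2F1(a As Double, b As Double, c As Double, z As Double, N As Integer) As Double
        Dim an As Double    ' Pochhammer symbol (a)_i
        Dim bn As Double    ' Pochhammer symbol (b)_i
        Dim cn As Double    ' Pochhammer symbol (c)_i
        Dim zn As Double    ' z ^ i
        Dim fn As Double    ' i!
        Dim i As Integer

        HyperGeom2F1 = 1
        an = a
        bn = b
        cn = c
        zn = z
        fn = 1

        For i = 1 To N
            HyperGeom2F1 = HyperGeom2F1 + an * bn / cn / fn * zn
            an = an * (a + i)
            bn = bn * (b + i)
            cn = cn * (c + i)
            fn = fn * (1 + i)
            zn = zn * z
        Next i
    End Function

    ' Density of the sample correlation r for N samples generated with correlation p.
    Public Function CorrelationSampleDist(N As Double, p As Double, r As Double) As Double
        Dim CD As Double
        ' Subtract the log-gammas before exponentiating so large N does not overflow.
        CD = (N - 2) * Exp(WorksheetFunction.GammaLn(N - 1) - WorksheetFunction.GammaLn(N - 0.5))
        CD = CD * (1 - p * p) ^ ((N - 1) / 2) * (1 - r * r) ^ ((N - 4) / 2)
        CD = CD / (Sqr(2 * PI) * (1 - p * r) ^ (N - 3 / 2))
        ' The hypergeometric series is truncated after 5 terms.
        CD = CD * HyperGeom2F1(0.5, 0.5, N - 0.5, (p * r + 1) / 2, 5)
        CorrelationSampleDist = CD
    End Function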
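
The VBA code truncates the hypergeometric series after 5 terms. One way to judge whether that is enough is to compare the truncated sum against a much longer partial sum. The sketch below does this in Python (standard library only). The helper names and the choice of 2000 reference terms are assumptions made for illustration.

```python
def hyp2f1_terms(a, b, c, z, n_terms):
    """Partial sum of the Gauss series: the leading 1 plus n_terms further
    terms, the same truncation scheme as the VBA HyperGeom2F1."""
    total, term = 1.0, 1.0
    for n in range(n_terms):
        term *= (a + n) * (b + n) / ((c + n) * (n + 1)) * z
        total += term
    return total


def truncation_error(n, rho, r, n_terms=5, reference_terms=2000):
    """Relative error of the n_terms-term series against a long partial sum."""
    z = (rho * r + 1) / 2
    approx = hyp2f1_terms(0.5, 0.5, n - 0.5, z, n_terms)
    reference = hyp2f1_terms(0.5, 0.5, n - 0.5, z, reference_terms)
    return abs(approx - reference) / reference


for n in (3, 10, 100):
    err = truncation_error(n, rho=0.5, r=0.9)
    print(f"N={n:3d}: relative error of 5-term series = {err:.2e}")
```

The truncation error is largest for very small sample sizes and when the product of the two correlations is close to 1, so more terms may be needed in those cases.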