The Distribution of the Sample Correlation

When you generate a set of correlated random numbers and then calculate the correlation of those samples, you'll notice that the calculated "sample correlation" can deviate from the correlation value used to generate the samples. This is because the samples are random and only a limited number of them is used.
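The effect is easy to reproduce outside Excel. The following is a minimal Python sketch, not part of the worksheet: it assumes the samples are standard normal pairs (the text above does not say which distribution is used), and the helper names `correlated_pairs` and `sample_correlation` are made up for this illustration.

```python
import math
import random

def correlated_pairs(n, rho, rng):
    """Draw n pairs (x, y) of standard normals with correlation rho (assumed normal)."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        e = rng.gauss(0.0, 1.0)
        # Mix x with independent noise so that corr(x, y) equals rho.
        y = rho * x + math.sqrt(1.0 - rho * rho) * e
        pairs.append((x, y))
    return pairs

def sample_correlation(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

# Five independent experiments, each with only 10 samples and a true correlation of 0.5.
rng = random.Random(1)
for _ in range(5):
    print(round(sample_correlation(correlated_pairs(10, 0.5, rng)), 3))
```

Each printed value is one "sample correlation". With only 10 samples per experiment the values scatter noticeably around 0.5.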


Note: You can now download an Excel worksheet containing the model: sample_correlation.xls


The plot below shows the effect of the sample size N on the distribution of the calculated correlation. As the sample size increases, the distribution becomes narrower, so the calculated correlation is a more precise estimate of the actual correlation.

Distribution of the calculated sample correlation for sample sizes of 10, 20 and 100, all generated with a correlation of 0.50.
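
As a rough guide to how quickly the distribution narrows, a commonly quoted large-sample approximation for normally distributed data (an assumption here, and only a first-order one, so it is crude for N = 10) puts the standard deviation of the calculated correlation at about

    \begin{align*} \sigma_c &\approx \frac{1-\rho^2}{\sqrt{N-1}} \end{align*}

With \rho = 0.50 this gives roughly 0.25 for N = 10, 0.17 for N = 20 and 0.075 for N = 100.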

Fisher derived a formula for this sample correlation distribution in Fisher, R.A. (1915) “Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population”. Biometrika 10 (4): 507–521.

The distribution of the calculated correlation coefficients is given by the rather complicated equation:

    \begin{align*} Pr(c) &= \frac{ (N-2) \Gamma(N-1) (1-\rho^2)^{\frac{N-1}{2}} (1-c^2)^{\frac{N-4}{2}} }{ \sqrt{2\pi} \Gamma(N-\frac{1}{2}) (1-\rho c)^{N-\frac{3}{2}}} \\ &\times {}_2 F_1\left(\frac{1}{2}, \frac{1}{2}, \frac{2N-1}{2}, \frac{\rho c+1}{2}\right) \end{align*}

where N is the number of samples used to calculate the correlation, c is the calculated correlation, and \rho is the actual correlation used to generate the samples.

\Gamma() is the gamma function, an extension of the factorial function to real and complex numbers, with \Gamma(n) = (n-1)! for positive integers n.

{}_2 F_1() is the Gauss hypergeometric function, defined as:

    \begin{align*} {}_2 F_1(a,b,c,z) &= \sum^{\infty}_{n=0} \frac{(a)_n (b)_n}{(c)_n} \frac{z^n}{n!}\\ &= 1 + \frac{a b}{c} \frac{z}{1!}  + \frac{a(a+1) b(b+1)}{c(c+1)} \frac{z^2}{2!}  + \ldots \end{align*}

with n! = 1 \times 2 \times 3 \ldots n the factorial, and (a)_n = a (a+1) (a+2) \ldots (a+n-1) the Pochhammer symbol (rising factorial), where (a)_0 = 1.
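
To get a feel for the series, here are its first terms worked out for the parameters used in the distribution, taking N = 10 as an example so that the third parameter is \frac{2N-1}{2} = \frac{19}{2}:

    \begin{align*} {}_2 F_1\left(\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{19}{2}, z\right) &= 1 + \frac{\frac{1}{2} \cdot \frac{1}{2}}{\frac{19}{2}} \frac{z}{1!} + \frac{\frac{1}{2} \cdot \frac{3}{2} \cdot \frac{1}{2} \cdot \frac{3}{2}}{\frac{19}{2} \cdot \frac{21}{2}} \frac{z^2}{2!} + \ldots \\ &\approx 1 + 0.02632\, z + 0.00282\, z^2 + 0.00051\, z^3 + \ldots \end{align*}

Because z = \frac{\rho c+1}{2} always lies between 0 and 1, the terms shrink quickly for this N, which suggests why a truncated sum such as the five terms used in the VBA code can be adequate. For small N the third parameter is smaller and the terms decay more slowly, so more terms may be needed.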

Distribution of the calculated sample correlation for a sample size of 10, with correlations of 0.00, 0.25 and 0.50 used to generate the samples.
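
The \rho = 0.00 curve offers a handy check, assuming normally distributed samples: in that case the density of the sample correlation is known to reduce to a simpler form without the hypergeometric function,

    \begin{align*} Pr(c) &= \frac{\Gamma\left(\frac{N-1}{2}\right)}{\sqrt{\pi}\, \Gamma\left(\frac{N-2}{2}\right)} (1-c^2)^{\frac{N-4}{2}} \end{align*}

For N = 10 this is \frac{35}{32} (1-c^2)^3, a symmetric curve that peaks at about 1.09 at c = 0, which the general formula should reproduce.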

VBA (Visual Basic for Applications) code

Const PI = 3.14159265358979
 
' Truncated Gauss hypergeometric series 2F1(a, b; c; z).
' nTerms is the number of terms added after the leading 1.
Function HyperGeom2F1(a As Double, b As Double, c As Double, z As Double, nTerms As Integer) As Double
    Dim i As Integer
    Dim term As Double
    Dim total As Double

    term = 1    ' n = 0 term of the series
    total = 1

    For i = 1 To nTerms
        ' Ratio of consecutive terms: (a+n-1)(b+n-1) / ((c+n-1) n) * z
        term = term * (a + i - 1) * (b + i - 1) / ((c + i - 1) * i) * z
        total = total + term
    Next i

    HyperGeom2F1 = total
End Function
 
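' Density of the calculated correlation r for N samples generated with correlation p.
Public Function CorrelationSampleDist(N As Double, p As Double, r As Double) As Double
    Dim CD As Double
    ' Gamma(N - 1) / Gamma(N - 1/2), combined in log space so that a large N does not overflow
    CD = (N - 2) * Exp(WorksheetFunction.GammaLn(N - 1) - WorksheetFunction.GammaLn(N - 0.5))
    CD = CD * (1 - p * p) ^ ((N - 1) / 2) * (1 - r * r) ^ ((N - 4) / 2)
    CD = CD / (Sqr(2 * PI) * (1 - p * r) ^ (N - 3 / 2))
    ' Hypergeometric series truncated after five terms beyond the leading 1
    CD = CD * HyperGeom2F1(0.5, 0.5, N - 0.5, (p * r + 1) / 2, 5)
    CorrelationSampleDist = CD
End Function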
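
For readers who want to cross-check the formula outside Excel, here is a hedged Python sketch of the same calculation. It is not a translation of the worksheet: the names `hyp2f1`, `correlation_sample_density` and `integrate` are made up for this example, the series is summed until its terms become negligible rather than stopped after five terms, and the midpoint-rule integration is only a simple way to confirm that the density integrates to about 1.

```python
import math

def hyp2f1(a, b, c, z, tol=1e-12, max_terms=500):
    """Gauss hypergeometric series, summed until the terms fall below tol."""
    term = 1.0
    total = 1.0
    for n in range(1, max_terms + 1):
        # Ratio of consecutive terms: (a+n-1)(b+n-1) / ((c+n-1) n) * z
        term *= (a + n - 1) * (b + n - 1) / ((c + n - 1) * n) * z
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def correlation_sample_density(n, rho, r):
    """Density of the sample correlation r for n samples and true correlation rho."""
    # Gamma(n - 1) / Gamma(n - 1/2), evaluated in log space to avoid overflow.
    log_gamma_ratio = math.lgamma(n - 1) - math.lgamma(n - 0.5)
    density = (n - 2) * math.exp(log_gamma_ratio)
    density *= (1 - rho * rho) ** ((n - 1) / 2) * (1 - r * r) ** ((n - 4) / 2)
    density /= math.sqrt(2 * math.pi) * (1 - rho * r) ** (n - 1.5)
    return density * hyp2f1(0.5, 0.5, n - 0.5, (rho * r + 1) / 2)

def integrate(f, lo, hi, steps=2000):
    """Composite midpoint rule, good enough for a sanity check."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))

# The density for N = 10 and rho = 0.5 should integrate to approximately 1 over [-1, 1].
area = integrate(lambda r: correlation_sample_density(10, 0.5, r), -1.0, 1.0)
print(area)
```

If the printed area is far from 1, that usually points to a typo in the formula or to a series that was truncated too early.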