Correlation and Regression Module Reference

Functions for correlations and regressions over AForm objects.

#include <correlate.h>

Structures
Routines
Detailed Description

For all the routines in this module, the shape of the AForms passed to them is not important. Each AForm is treated as a linear vector (linearized in innermost-first order) whose length equals the number of elements in the array form. So, for example, when computing A_Correlation between two arrays a1 and a2, a1 and a2 can have different shapes and types, but they must have the same number of elements.

One can compute simple correlation, covariance (correlation with the mean bias removed), and normalized correlation, or "Pearson correlation" (covariance normalized by the standard deviations of the series). For each of these three correlations, there is a routine that takes two AForms and returns the desired quantity as a double value (A_Correlation, A_Covariance, A_Pearson_Correlation). There is also a routine that takes any number, n, of AForms and computes the given correlation between all pairs of them, generating and returning an n x n Double_Matrix of the pairwise quantities (Correlations, Covariances, Pearson_Correlations).

Linear_Regression is a routine for performing a standard linear regression, i.e. finding the set of weights that gives the best least-squares fit between a linear, weighted combination of any number of input vectors and an observation vector. The routine can optionally fill in a bundle of standard statistics about the nature of the fit (see Regression_Bundle below). There is also a specialized routine, Simple_Regression, for the common case where an observation is to be explained as an affine function of a single input.

Structure Documentation
A bundle of this kind, when passed to either Linear_Regression or Simple_Regression, is filled in with various statistics that describe the goodness of the fit of the regression. The deeper statistical meaning of each field is deferred to a good text on statistics; here a simple and hopefully operative description of each statistic is given. We assume that the observation vector is obs, the n input vectors are a1, a2, ... an, all the vectors have N elements, and the computed linear regression coefficients are c[0], c[1], ... c[n-1]. The vector of predicted values will be denoted by est, where est[t] = c[0]*a1[t] + ... + c[n-1]*an[t].

tss: stands for "total sum of squares" and is the variance in the observation vector obs about its mean, i.e. tss = Σt (obs[t] - μ)², where μ is the mean of the elements in obs.

mss: stands for "model sum of squares" and is the variance of the estimate est of the observation about the mean of obs, i.e. mss = Σt (est[t] - μ)², where μ is the mean of the elements in obs.

rss: stands for "residual sum of squares" and is the sum of squared differences between the observation vector obs and the predicted values est, i.e. rss = Σt (obs[t] - est[t])². It is this quantity that is minimized by the linear regression.

ser: is the "standard error of the estimate", which is just the standard deviation of the distribution of obs-est, i.e. ser = (rss/(N-n))^0.5.

R2: is "R-squared" or the "coefficient of determination". It ranges from 0 to 1 and indicates how well the observation is explained or predicted by a linear combination of the input factors, where 0 implies no predictive power and 1 implies a perfect prediction. Mathematically, it is 1 - rss/tss, where rss and tss are explained above.

aR2: is the "adjusted R-squared" coefficient, equal to 1 - (rss*(N-1)) / (tss*(N-n)). Note that aR2 = R2 when n = 1 and aR2 < R2 when n > 1.
The idea is that the adjustment for the number of degrees of freedom n compensates for the greater "predictive" power expected the more input vectors there are.

std_err: is an n-element vector that gives the standard error for each coefficient c[i]. Under the assumption that N is "large enough", by the central limit theorem the error in each coefficient approaches a normal distribution with 0 mean and a standard deviation of std_err[i] = ser·(Q⁻¹[i,i])^0.5, where Q = Correlations(n,a1,...,an) and Q⁻¹[i,i] is the i-th diagonal element of the inverse of Q.

t_stat: is an n-element vector that gives the t-statistic for each coefficient c[i] under the assumption of asymptotically large N. This statistic, which is t_stat[i] = c[i]/std_err[i], follows the Student t-distribution and can thus be used to compute the p-value that c[i] is non-zero.

Routine Documentation
Returns the correlation, Σt a1[t]a2[t], between a1 and a2. The only constraint on a1 and a2 is that they have the same number of elements (but not necessarily the same shape or type).
Returns the covariance, Σt (a1[t] - μ1)(a2[t] - μ2) / A, between a1 and a2, where μi is the mean of the elements of ai, and A is the number of elements in both a1 and a2. The only constraint on a1 and a2 is that they have the same number of elements (but not necessarily the same shape or type).
Returns the Pearson- or normalized-correlation, Σt (a1[t] - μ1)(a2[t] - μ2) / (σ1σ2A), where μi is the mean of the elements of ai, σi is the standard deviation of the elements in ai, and A is the number of elements in both a1 and a2. The only constraint on a1 and a2 is that they have the same number of elements (but not necessarily the same shape or type).
Generates an n x n correlation matrix, c, where c[i,j] = A_Correlation(ai,aj). The only constraint on the ai's is that they all have the same number of elements (but not necessarily the same shape or type).
Generates an n x n covariance matrix, c, where c[i,j] = A_Covariance(ai,aj). The only constraint on the ai's is that they all have the same number of elements (but not necessarily the same shape or type).
Generates an n x n normalized correlation matrix, c, where c[i,j] = A_Pearson_Correlation(ai,aj). The only constraint on the ai's is that they all have the same number of elements (but not necessarily the same shape or type).
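The pairwise-matrix routines all have the same shape; the covariance case, for example, can be sketched in plain C as below. The row-major double matrix and the helper names are illustrative assumptions, standing in for the library's Double_Matrix and AForm types.

```c
/* Fill an n-by-n row-major matrix c with all pairwise covariances of the
   n series a[0..n-1], each of length A, mirroring what Covariances
   produces: c[i*n+j] = covariance of series i and series j. */

static double cov_pair(const double *x, const double *y, int A)
{ double mx = 0.0, my = 0.0, s = 0.0;
  for (int t = 0; t < A; t++)
    { mx += x[t]; my += y[t]; }
  mx /= A;  my /= A;
  for (int t = 0; t < A; t++)
    s += (x[t]-mx)*(y[t]-my);
  return (s / A);
}

void covariance_matrix(double *c, const double **a, int n, int A)
{ for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
      c[i*n+j] = cov_pair(a[i],a[j],A);   /* the result is symmetric */
}
```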
Linear_Regression generates an n-element vector of coefficients, c, such that the least-squares distance between obs and the c-weighted sum of the ai's is minimal, that is, Σt (obs[t] - (c[0]*a1[t] + ... + c[n-1]*an[t]))² is minimized. The routine will return statistics for the regression in the Regression_Bundle pointed at by stats if and only if it is not NULL. If the arrays pointed at by the fields std_err and/or t_stat within this bundle are NULL on entry, then vectors of the appropriate size are created; otherwise the arrays they point to are converted to n-element Double_Vectors if they are not already, and then filled in with the desired values. The caller is responsible for freeing or killing these arrays.
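The minimization above can be sketched by solving the normal equations Q c = b, where Q[i][j] = Σt ai[t]·aj[t] and b[i] = Σt ai[t]·obs[t]. The code below is a plain-C illustration with ordinary double arrays and a naive Gaussian elimination, not the library's actual implementation; it assumes n ≤ 16 and a non-singular system.

```c
#include <math.h>

/* Solve for the least-squares coefficients c[0..n-1] given n input
   vectors a[0..n-1] and an observation vector obs, all of length N. */

void linear_regression(double *c, const double **a, const double *obs,
                       int n, int N)
{ double Q[16][17];                  /* augmented system [Q | b], n <= 16 */

  for (int i = 0; i < n; i++)        /* build the normal equations        */
    { for (int j = 0; j < n; j++)
        { double s = 0.0;
          for (int t = 0; t < N; t++)
            s += a[i][t]*a[j][t];
          Q[i][j] = s;
        }
      { double s = 0.0;
        for (int t = 0; t < N; t++)
          s += a[i][t]*obs[t];
        Q[i][n] = s;
      }
    }

  for (int i = 0; i < n; i++)        /* elimination with partial pivoting */
    { int p = i;
      for (int k = i+1; k < n; k++)
        if (fabs(Q[k][i]) > fabs(Q[p][i]))
          p = k;
      if (p != i)
        for (int j = i; j <= n; j++)
          { double t = Q[i][j]; Q[i][j] = Q[p][j]; Q[p][j] = t; }
      for (int k = i+1; k < n; k++)
        { double f = Q[k][i]/Q[i][i];
          for (int j = i; j <= n; j++)
            Q[k][j] -= f*Q[i][j];
        }
    }

  for (int i = n-1; i >= 0; i--)     /* back substitution                 */
    { double s = Q[i][n];
      for (int j = i+1; j < n; j++)
        s -= Q[i][j]*c[j];
      c[i] = s/Q[i][i];
    }
}
```

For instance, if obs is exactly 2*a1 + 3*a2, the routine recovers c = {2, 3} with zero residual.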
Simple_Regression returns a pointer to a 2-element Double_Vector, c, that minimizes the sum Σt (obs[t] - (c[0] + c[1]*inp[t]))². If vector is NULL then a 2-element Double_Vector is generated and returned as c; otherwise vector is converted to a 2-element Double_Vector if it is not already one and then returned as c. The point of passing vector rather than always generating a Double_Vector containing c is that if one wants to solve, say, thousands of such regressions, then one can generate a 2-element double vector at the start and reuse it for each call. This saves time in the regression routine as an Array object need neither be created nor re-shaped and re-typed. The routine will return statistics for the regression in the Regression_Bundle pointed at by stats if and only if it is not NULL. If the arrays pointed at by the fields std_err and/or t_stat within this bundle are NULL on entry, then 2-element Double_Vectors are generated; otherwise they are converted to 2-element Double_Vectors if they are not already, and then filled in with the desired values. The caller is responsible for freeing or killing these arrays.
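For the single-input affine case, the minimizing coefficients have the familiar closed form c[1] = cov(inp,obs)/var(inp) and c[0] = mean(obs) - c[1]*mean(inp). A plain-C sketch (ordinary double arrays in place of the AForm arguments, hypothetical function name):

```c
/* Fit obs[t] ≈ c[0] + c[1]*inp[t] in the least-squares sense for two
   vectors of length N, using the closed-form solution. */

void simple_regression(double c[2], const double *inp, const double *obs,
                       int N)
{ double mi = 0.0, mo = 0.0, sio = 0.0, sii = 0.0;

  for (int t = 0; t < N; t++)
    { mi += inp[t]; mo += obs[t]; }
  mi /= N;  mo /= N;
  for (int t = 0; t < N; t++)
    { sio += (inp[t]-mi)*(obs[t]-mo);    /* covariance numerator */
      sii += (inp[t]-mi)*(inp[t]-mi);    /* variance numerator   */
    }
  c[1] = sio/sii;                        /* slope                */
  c[0] = mo - c[1]*mi;                   /* intercept            */
}
```

As with the caller-supplied vector argument described above, writing into a caller-owned 2-element array avoids any allocation per call, which is the point of the reuse idiom.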