Anova

This class implements an Anova procedure as derived from 'Mathematical Statistics with Applications', 7th edition, by Wackerly et al.

That resource provides a full explanation; a summary is given below.

The goal of Anova is to "identify the important independent variables and determine how they affect the response".

Each independent variable is called a factor, and the intensity setting of a factor is called its level.

The procedure partitions the Total Sum of Squares, which is the sum of the squared deviations of each observation from the overall mean, into a portion attributable to each independent variable and a remainder attributed to random error.

Under the null hypothesis the independent variables are assumed to be unrelated to the response variable, and each portion of the total sum of squares, divided by an appropriate constant, provides an independent and unbiased estimator of the experimental error variance $\sigma^2$.

If a variable is highly related to the response its contribution to the total sum of squares will be large.

The sum of squares for treatments $SST$ is compared with the sum of squares for error $SSE$.

An F test is used to determine whether the null hypothesis should be rejected.

This implementation addresses the one-way case with $k$ samples, and is used to compute the F test for the null hypothesis $\mu_1 = \mu_2 = \dots = \mu_k$, assuming a common variance $\sigma^2$.

At this time the implementation works with matrices provided by the breeze library. Because every column of a matrix has the same number of rows, each sample has the same length in matrix form; it would be useful to later change the implementation to use another structure that allows samples of uneven length.

The formulas below do, however, allow an unequal sample size $n_i$ for each $i$th sample.

An AnovaTable is used to contain the variables for the anova process, providing a "one way layout" that comprises the following elements.

The total sum of squares is the sum of $SST$ and $SSE$, and is defined as

$$ TotalSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 $$

Equivalently, it is the sum of the squares of all observations minus the correction for the mean $CM$:

$$ TotalSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}^2 - CM $$

where the correction for the mean is the square of the total of all observations, divided by $n$:

$$ CM = \frac{1}{n} \left( \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij} \right)^2 $$
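The two expressions for the total sum of squares can be compared numerically. The following is a minimal sketch in plain Scala, not the class's own code: the object name `TotalSumOfSquares`, the `Samples` type alias and the method names are hypothetical, and the samples are held as a ragged `Seq[Seq[Double]]` rather than a breeze matrix so that unequal $n_i$ are allowed.

```scala
object TotalSumOfSquares {
  // Hypothetical ragged layout: one inner Seq per sample, sizes may differ.
  type Samples = Seq[Seq[Double]]

  /** Correction for the mean: (sum of all observations)^2 / n. */
  def cm(samples: Samples): Double = {
    val all = samples.flatten
    val total = all.sum
    total * total / all.size
  }

  /** Total SS by definition: squared deviations from the overall mean. */
  def totalSSByDefinition(samples: Samples): Double = {
    val all = samples.flatten
    val mean = all.sum / all.size
    all.map(y => (y - mean) * (y - mean)).sum
  }

  /** Total SS by the shortcut: sum of squared observations minus CM. */
  def totalSSByShortcut(samples: Samples): Double =
    samples.flatten.map(y => y * y).sum - cm(samples)
}
```

Both methods should agree up to floating-point rounding; the shortcut needs only the running totals of $Y_{ij}$ and $Y_{ij}^2$.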

The total of each sample set is defined as $Y_{i.}$:

$$ Y_{i.} = \sum_{j=1}^{n_i} Y_{ij} $$

and the mean of each sample set is estimated as $\bar{Y_{i.}}$:

$$ \bar{Y_{i.}} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij} = \frac{1}{n_i} Y_{i.} $$

This is used in calculating the sum of squares for treatments, which will be large if the differences between the treatment means are large.

$$ SST = \sum_{i=1}^{k} n_i (\bar{Y_{i.}} - \bar{Y})^2 $$

Note also that, under the null hypothesis, $\frac{SST}{\sigma^2}$ has a $\chi^2$ distribution with $k-1$ degrees of freedom for $k$ treatments.

For computation, $SST$ can equivalently be written as

$$ SST = \sum_{i=1}^{k} \frac{Y_{i.}^2}{n_i} - CM $$
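As an illustration, the definition of $SST$ and its computational form can be checked against each other. This is a sketch under the same assumptions as a ragged `Seq[Seq[Double]]` layout; the object and method names are hypothetical and are not part of the Anova class.

```scala
object TreatmentSumOfSquares {
  // Hypothetical ragged layout: one inner Seq per sample, sizes may differ.
  type Samples = Seq[Seq[Double]]

  /** Correction for the mean: (sum of all observations)^2 / n. */
  private def cm(samples: Samples): Double = {
    val all = samples.flatten
    all.sum * all.sum / all.size
  }

  /** SST by definition: sum over samples of n_i * (sample mean - overall mean)^2. */
  def sstByDefinition(samples: Samples): Double = {
    val all = samples.flatten
    val overallMean = all.sum / all.size
    samples.map { s =>
      val d = s.sum / s.size - overallMean
      s.size * d * d
    }.sum
  }

  /** SST by the shortcut: sum over samples of Y_i.^2 / n_i, minus CM. */
  def sstByShortcut(samples: Samples): Double =
    samples.map(s => s.sum * s.sum / s.size).sum - cm(samples)
}
```

Each sample contributes through its own $n_i$, so neither form requires the samples to have equal length.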

The second part, the sum of squares for error, is computed as $$ SSE = TotalSS - SST $$

It can also be written as the sum of the sample variances, each multiplied by its degrees of freedom, as shown in Wackerly.

$$ SSE = \sum_{i=1}^{k} (n_i - 1) S_i^2 $$

where the sample variance $S_i^2$ of the $i$th sample is

$$ S_i^2 = \frac{1}{n_i-1} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y_{i.}})^2 $$

which is an unbiased estimator of $\sigma_i^2 = \sigma^2$.
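The variance form of $SSE$ can be sketched directly from the two formulas above. The names `ErrorSumOfSquares`, `sampleVariance` and `sse` are hypothetical, and the sketch assumes every sample has at least two observations so that $n_i - 1 > 0$.

```scala
object ErrorSumOfSquares {
  // Hypothetical ragged layout: one inner Seq per sample, sizes may differ.
  type Samples = Seq[Seq[Double]]

  /** Unbiased sample variance S_i^2 of one sample (assumes at least 2 observations). */
  def sampleVariance(s: Seq[Double]): Double = {
    val mean = s.sum / s.size
    s.map(y => (y - mean) * (y - mean)).sum / (s.size - 1)
  }

  /** SSE as the sum over samples of (n_i - 1) * S_i^2. */
  def sse(samples: Samples): Double =
    samples.map(s => (s.size - 1) * sampleVariance(s)).sum
}
```

For the same data this should agree, up to rounding, with $TotalSS - SST$.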

The mean square for error is the pooled variance estimator $S^2$ of $\sigma^2$, with $n-k$ degrees of freedom.

$$ MSE = \frac{SSE}{n-k} $$

The mean square for treatments is computed from the sample means, with $k-1$ degrees of freedom. $$ MST = \frac{SST}{k-1} $$

Once the anova table is calculated, the F test is used to test the null hypothesis $\mu_1 = \mu_2 = \dots = \mu_k$, assuming equal variances. The null hypothesis is rejected at significance level $\alpha$ when

$$ F = \frac{MST}{MSE} > F_\alpha $$ Under the null hypothesis the statistic has an F distribution with $k-1$ numerator and $n-k$ denominator degrees of freedom.
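The steps from the sums of squares to the observed F statistic can be put together as follows. This is a minimal sketch, not the class's implementation: `FStatisticSketch` and `Table` are hypothetical names, the input is a ragged `Seq[Seq[Double]]`, and the critical value $F_\alpha$ is not computed here because that requires an F distribution quantile.

```scala
object FStatisticSketch {
  // Hypothetical ragged layout: one inner Seq per sample, sizes may differ.
  type Samples = Seq[Seq[Double]]

  /** Hypothetical container for the main entries of a one-way anova table. */
  final case class Table(sst: Double, sse: Double, mst: Double, mse: Double, f: Double)

  def table(samples: Samples): Table = {
    val all = samples.flatten
    val n = all.size                      // total number of observations
    val k = samples.size                  // number of samples (treatments)
    val cm = all.sum * all.sum / n        // correction for the mean
    val totalSS = all.map(y => y * y).sum - cm
    val sst = samples.map(s => s.sum * s.sum / s.size).sum - cm
    val sse = totalSS - sst
    val mst = sst / (k - 1)               // k - 1 numerator degrees of freedom
    val mse = sse / (n - k)               // n - k denominator degrees of freedom
    Table(sst, sse, mst, mse, mst / mse)
  }
}
```

The returned `f` would then be compared with the upper-tail critical value of the F distribution with $k-1$ and $n-k$ degrees of freedom.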

The key assumptions are that the $k$ samples are independent and drawn from normal populations with a common variance; equality of the means is the hypothesis being tested.

Example Usage

The example is derived from the test case TestAnovaInference, which is in turn derived from an example in Wackerly on page 671.

The test data for the example is the following matrix with $k = 4$ sets of observations.

import breeze.linalg.DenseMatrix

/**
 * Columns correspond to the k samples.
 * Rows correspond to the observations Y_ij of each sample.
 */
val table = DenseMatrix(
  (65.0, 75.0, 59.0, 94.0),
  (87.0, 69.0, 78.0, 89.0),
  (73.0, 83.0, 67.0, 80.0),
  (79.0, 81.0, 62.0, 88.0),
  (81.0, 72.0, 83.0, 0.0),
  (69.0, 79.0, 76.0, 0.0),
  (0.0, 90.0, 0.0, 0.0))

Note that the shorter samples have been padded with 0. Because the implementation takes $n$ as rows * cols, the padding zeros are counted as observations, so it is best to use a "balanced" set of samples in which every sample has the same number of observations.

The Anova table is created as

val anova = Anova(table)

A test of the null hypothesis at significance level $\alpha = 0.05$ is performed using

val testResult = anova.test(0.05)

The test result will contain the anova table which can be printed or inspected for each of the table values. Inspecting the table will give a report for example:

NumeratorDF: 3
DenominatorDF: 24
SST: 2876.107142857145
SSE: 23604.857142857145
MSE: 983.5357142857143
MST: 958.7023809523816
TotalDF: 27
TotalSS: 26480.96428571429
F-stat (observed statistic): 0.9747509592456765
F-alpha (critical value): 3.0000000000000013
P-Value: 0.4219003172019309
Observed-Prob 0.44613162513336035
alpha (significance level):0.05

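The null hypothesis is rejected when F-stat > F-alpha. In the report above the F-stat (about 0.97) is below F-alpha (about 3.00), so the test does not reject the null hypothesis for this padded data.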
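The figures in the report can be cross-checked by hand from the example matrix. The sketch below is plain Scala and does not call the Anova class; `PaddedReportCheck`, `summary` and `unpadded` are hypothetical names. It treats each matrix column as a sample with the padding zeros kept as observations, which gives $n = 28$ and $k = 4$ as in the reported degrees of freedom. Dropping the zeros to recover the unpadded samples assumes, for illustration only, that no genuine observation equals 0.0.

```scala
object PaddedReportCheck {
  // Columns of the example matrix; the padding zeros are kept as observations.
  val padded: Seq[Seq[Double]] = Seq(
    Seq(65.0, 87.0, 73.0, 79.0, 81.0, 69.0, 0.0),
    Seq(75.0, 69.0, 83.0, 81.0, 72.0, 79.0, 90.0),
    Seq(59.0, 78.0, 67.0, 62.0, 83.0, 76.0, 0.0),
    Seq(94.0, 89.0, 80.0, 88.0, 0.0, 0.0, 0.0)
  )

  // Assumption for illustration only: no genuine observation equals 0.0,
  // so filtering zeros recovers the unpadded samples.
  val unpadded: Seq[Seq[Double]] = padded.map(_.filter(_ != 0.0))

  /** Returns (SST, SSE, F) for a one-way layout given as one Seq per sample. */
  def summary(samples: Seq[Seq[Double]]): (Double, Double, Double) = {
    val all = samples.flatten
    val n = all.size
    val k = samples.size
    val cm = all.sum * all.sum / n
    val totalSS = all.map(y => y * y).sum - cm
    val sst = samples.map(s => s.sum * s.sum / s.size).sum - cm
    val sse = totalSS - sst
    (sst, sse, (sst / (k - 1)) / (sse / (n - k)))
  }
}
```

Calling `PaddedReportCheck.summary(PaddedReportCheck.padded)` should reproduce the SST, SSE and F-stat lines of the report up to rounding, while `summary(unpadded)` shows how much the padding zeros change the result.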

The test result also has the "rejected" flag which indicates whether the null hypothesis is rejected. It is possible to use a $k$ fold approach to determine which of the samples may be rejected after the initial test.

The critical value is approximated using the trait CriticalValue for the UpperTail of the FDistribution.

Created by cd on 17/09/2014.

Linear Supertypes

StatisticalTest, AnyRef, Any

Instance Constructors

new Anova(X: DenseMatrix[Double])

Type Members

class Intermediate extends AnyRef

internal class for intermediate results

Value Members

final def !=(arg0: Any): Boolean

Definition Classes
AnyRef → Any
final def ##(): Int

Definition Classes
AnyRef → Any
final def ==(arg0: Any): Boolean

Definition Classes
AnyRef → Any
val X: DenseMatrix[Double]
final def asInstanceOf[T0]: T0

Definition Classes
Any
def clone(): AnyRef

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( ... )
def cm(): Intermediate

the correction for the mean
val criticalVal: (Seq[Double]) ⇒ CriticalValue
final def eq(arg0: AnyRef): Boolean

Definition Classes
AnyRef
def equals(arg0: Any): Boolean

Definition Classes
AnyRef → Any
val fdist: FDistribution
def finalize(): Unit

Attributes
protected[java.lang]
Definition Classes
AnyRef
Annotations
@throws( classOf[java.lang.Throwable] )
final def getClass(): Class[_]

Definition Classes
AnyRef → Any
def hashCode(): Int

Definition Classes
AnyRef → Any
final def isInstanceOf[T0]: Boolean

Definition Classes
Any
val k: Int
def mse(accum: Intermediate): Intermediate

the mean square for error, $MSE = SSE/(n-k)$
def mst(accum: Intermediate): Intermediate

the mean square for treatments, $MST = SST/(k-1)$
val n: Int

f distribution: n = rows * cols, k = cols, df = (k-1), (n-k)
final def ne(arg0: AnyRef): Boolean

Definition Classes
AnyRef
final def notify(): Unit

Definition Classes
AnyRef
final def notifyAll(): Unit

Definition Classes
AnyRef
def ssTreatment(accum: Intermediate): Intermediate

sum of squares treatment
$$ SST = \sum_{i=1}^{k} n_i (\bar{Y_{i.}} - \bar{Y})^2 = \sum_{i=1}^{k} \frac{Y_{i.}^2}{n_i} - CM $$
def sse(accum: Intermediate): Intermediate

the sum of squares error
def statistic(): (Double, Intermediate)

compute the F-statistic for testing $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ vs $H_a$: at least one of the means differs from the others.
final def synchronized[T0](arg0: ⇒ T0): T0

Definition Classes
AnyRef
def test(alpha: Double): TestResult

perform the anova test at the supplied critical level

Definition Classes
Anova → StatisticalTest
def toString(): String

Definition Classes
AnyRef → Any
def totalSS(accum: Intermediate): Intermediate

compute the total SS.
$$ TotalSS = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - \bar{Y})^2 = \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij}^2 - CM $$
$$ CM = \frac{1}{n} \left( \sum_{i=1}^{k} \sum_{j=1}^{n_i} Y_{ij} \right)^2 $$
final def wait(): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long, arg1: Int): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
final def wait(arg0: Long): Unit

Definition Classes
AnyRef
Annotations
@throws( ... )
