1. Introduction to ANOVAs

tags: #statistics/inferential/anova

What are ANOVAs?

Analysis of Variance (ANOVA) is an inferential statistical technique used to compare the means of three or more groups. Two types:

  1. One-way ANOVA
  2. Two-way ANOVA

Problems with ANOVA

ANOVA only tells us whether a statistically significant difference exists somewhere among the group means. It does NOT tell us where the differences lie (i.e., which groups differ). Therefore, should a significant result be produced, we need to follow up with a post-hoc test, e.g., Tukey HSD, to determine which groups differ.
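The one-way ANOVA F-statistic can be computed by hand to see where the "analysis of variance" name comes from: it is the ratio of between-group variance to within-group variance. A minimal stdlib-only sketch, using made-up data for three hypothetical groups:

```python
from statistics import mean

# Three hypothetical groups (e.g., scores under three teaching methods)
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]

k = len(groups)                     # number of groups
n = sum(len(g) for g in groups)     # total number of observations
grand_mean = mean(x for g in groups for x in g)

# Between-group sum of squares: how far each group mean sits from the grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

ms_between = ss_between / (k - 1)   # mean square between (df = k - 1)
ms_within = ss_within / (n - k)     # mean square within  (df = n - k)
f_stat = ms_between / ms_within

print(f"F({k - 1}, {n - k}) = {f_stat:.2f}")  # F(2, 6) = 19.00
```

In practice you would get the p-value (and a post-hoc Tukey HSD) from a library, e.g. `scipy.stats.f_oneway` for the omnibus test and `statsmodels.stats.multicomp.pairwise_tukeyhsd` for the follow-up, rather than comparing F against tables by hand.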


Assumptions and Conditions

To use ANOVA, the following assumptions and conditions about the population from which the sample was taken must be satisfied:

Assumptions Check

  1. The dependent variable must be measured at a continuous level (e.g., years of education, score, salary)

  2. The independent variable must consist of 3 or more categorical, INDEPENDENT groups (note: when there are TWO independent variables -> two-way ANOVA)

  3. Independence of observations i.e., observations are mutually independent of each other, such that there is no relationship between the participants in any of the groups (e.g., the selection of participants in the control group has no effect on individuals in the treatment group; each group is INDEPENDENT OF EACH OTHER)[1]

The remaining assumptions relate to the shape of the population distribution and how the data fits ANOVA:

  1. No significant outliers in the groups of your IDV in terms of the DV

  2. Values of the DV should approximate a NORMAL DISTRIBUTION (if violated, consider: Transformation for Normality)

  3. Homogeneity of variances (homoscedasticity) i.e., the variance (spread) of the DV in EACH CATEGORICAL GROUP of the IDV is the SAME. Strictly, this applies to the residuals (actual - predicted) for each group[2], but it can be checked on the raw data beforehand. Alternative if violated: use Welch's ANOVA for one-way ANOVA

  4. Residuals are IID (independent and identically distributed) i.e., each residual has the same probability distribution as the others, and all are mutually independent of each other.
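A quick informal check of the homogeneity assumption is to compare the sample variances across groups. A minimal sketch with hypothetical data; the "ratio under ~2" threshold is a common rule of thumb, not a formal test (Levene's test, e.g. `scipy.stats.levene`, is the usual formal check):

```python
from statistics import variance

# Hypothetical DV values for each categorical group of the IDV
groups = {"A": [4, 5, 6], "B": [6, 7, 9], "C": [9, 10, 11]}

# Sample variance per group, then the largest-to-smallest ratio
variances = {name: variance(g) for name, g in groups.items()}
ratio = max(variances.values()) / min(variances.values())

print(variances)
print(f"max/min variance ratio: {ratio:.2f}")  # large ratios suggest a violation
```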

How to check for i.i.d.

We can check the i.i.d. assumption by creating a scatterplot of the residuals against the predicted values. The points should look random, with NO discernible pattern.
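For one-way ANOVA the predicted value of each observation is simply its group mean, so the residuals are easy to compute by hand. A small sketch with hypothetical data, producing the (predicted, residual) pairs you would scatter:

```python
from statistics import mean

# Hypothetical groups; prediction for each observation = its group mean
groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]

pairs = []  # (predicted, residual) pairs for the diagnostic scatterplot
for g in groups:
    g_mean = mean(g)
    for x in g:
        pairs.append((g_mean, x - g_mean))

residuals = [r for _, r in pairs]
print(pairs)  # plot: predicted on the x-axis, residual on the y-axis
print(f"residuals sum to {sum(residuals):.1f}")  # ~0 by construction
```

Within-group residuals always sum to zero by construction; what the plot is for is spotting structure, e.g. residual spread growing with the predicted value, which would suggest a homogeneity violation.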


The Multiple Hypothesis Problem

The multiple hypothesis problem in ANOVA (Analysis of Variance) refers to the issue of conducting multiple pairwise comparisons between group means.[3]

Why is this the case?

  • With each hypothesis test, there is an inherent risk of rejecting the null when it is true (Type I Error).

  • The probability of making a Type I error for a single test is equal to the significance level alpha (e.g., 0.05), but the more tests you perform, the greater the chance of making at least one Type I error.

Example:

If you perform 10 independent tests at the 0.05 significance level, the overall probability of making at least one Type I error is:

P(at least one Type I error) = 1 − (1 − 0.05)^10 ≈ 0.401

This means that there is a 40.1% chance of falsely rejecting at least one null hypothesis, even if all the null hypotheses are true.
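The family-wise error rate above is a one-liner to verify:

```python
alpha = 0.05   # per-test significance level
m = 10         # number of independent tests

# Family-wise error rate: P(at least one Type I error across all m tests)
fwer = 1 - (1 - alpha) ** m
print(f"FWER for {m} tests at alpha={alpha}: {fwer:.3f}")  # 0.401
```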

Solution

  1. Adjust the alpha significance level
  2. Correction methods to account for multiple comparisons

Example: Bonferroni correction, which tests each comparison at alpha/m, where m is the number of comparisons.
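Continuing the 10-test example, the Bonferroni correction divides alpha by the number of comparisons, which pulls the family-wise error rate back below the nominal level:

```python
alpha = 0.05
m = 10

adjusted_alpha = alpha / m              # Bonferroni: per-test level alpha/m
fwer = 1 - (1 - adjusted_alpha) ** m    # resulting family-wise error rate

print(f"adjusted per-test alpha: {adjusted_alpha}")   # 0.005
print(f"FWER after correction: {fwer:.3f}")           # just under 0.05
```

The correction is conservative: it guarantees FWER <= alpha, at the cost of reduced power per comparison.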



  1. This can otherwise lead to confounding variables ↩︎

  2. The Shapiro-Wilk test or other normality tests can be used to assess whether the residuals (i.e., the differences between the observed values and the model's predicted values) are normally distributed. ↩︎

  3. The multiple hypothesis problem in ANOVA is often referred to as a "cumulative error" or "cumulative effect" problem because the probability of making a Type I error increases with each additional hypothesis tested. ↩︎
