An Introduction to CoDa


Starting Definition

CoDa is a common abbreviation of Compositional Data Analysis.

Compositions are typically sets of values or amounts that quantify the breakdown of a whole into its constituent parts. For example, consider the split of the waking day between different activity types: (Sedentary, Light Activity, Moderate to Vigorous Activity) = (60%, 30%, 10%).

Such amounts are inherently interdependent: if you do an extra hour of vigorous activity, you must do an hour less of some other activity type(s). This induces negative correlation between the parts and means that they convey only relative information. Conventional multivariate analysis techniques must be adjusted to allow for this. Compositional data analysis is a formal body of knowledge providing theory and methods to deal with compositional data in a sound and principled way.

Broadening the Scope

The scope of compositional analysis extends beyond this original motivating problem, however. CoDa has been profitably applied to a variety of problems where the constraint of a fixed total does not apply, but where the relative amounts of the components, rather than the absolute amounts, are the quantities of concern.

For example, the qualities of a chemical solution may depend more on the relative concentrations of a subset of the chemicals than on the absolute amounts of these chemicals in the solution.

Fundamentals of CoDa

CoDa was essentially pioneered by John Aitchison in the 1980s, although earlier statisticians (notably Karl Pearson) had observed the difficulties arising from the analysis of compositional data.

Aitchison's approach rested on defining three fundamental properties that a "good" method of handling compositional data should satisfy.

  1. Scale invariance: only the relative size (ratio) of the components matters, e.g. the percentage breakdown (25%, 15%, 60%) and the time breakdown (6 hrs, 3.6 hrs, 14.4 hrs) of the 24-hour day are equivalent compositions.
  2. Subcompositional coherence: results obtained from the full composition should not contradict results obtained from a subcomposition, e.g. results obtained from examining the breakdown of the waking day should be consistent with results obtained from examining the breakdown of the full 24-hour day.
  3. Permutation invariance: the order of the components doesn't matter, e.g. (SB=60%, LIPA=25%, MVPA=15%) and (LIPA=25%, MVPA=15%, SB=60%) are equivalent, where SB, LIPA and MVPA denote sedentary behaviour, light-intensity physical activity and moderate-to-vigorous physical activity.
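
As a rough illustration of the first two properties, the sketch below uses plain Python. The helper name closure and the sample numbers are assumptions made for this example and do not come from any particular CoDa library.

```python
import math

def closure(parts, total=1.0):
    """Rescale positive parts so they sum to `total` (hypothetical helper)."""
    s = sum(parts)
    return [total * p / s for p in parts]

# Scale invariance: percentages and hours describe the same composition
# once both are rescaled to a common total.
percent = [25.0, 15.0, 60.0]
hours = [6.0, 3.6, 14.4]
same_composition = all(
    math.isclose(a, b) for a, b in zip(closure(percent), closure(hours))
)

# Ratios between parts are unchanged when we restrict to a subcomposition.
full = closure([60.0, 25.0, 15.0])   # (SB, LIPA, MVPA)
sub = closure(full[:2])              # keep only (SB, LIPA) and re-close
ratio_full = full[0] / full[1]
ratio_sub = sub[0] / sub[1]
ratio_preserved = math.isclose(ratio_full, ratio_sub)

print(same_composition, ratio_preserved)
```

Because only ratios carry information, any method built on ratios of parts inherits scale invariance automatically, which is one motivation for the logratio approach.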

These properties are not satisfied by ordinary statistical techniques defined for unconstrained real data, which led Aitchison to propose applying logratio transformations to the original data so that ordinary statistical methods can then be used. In particular, he defined the additive logratio transform (alr) and the centred logratio transform (clr), each having advantages and drawbacks and each useful depending on the particular statistical problem at hand. Generally, the alr is useful for parametric model-based methods like regression, whereas the clr is more suitable for non-parametric methods like ordinary clustering analysis.

Examples of the two transforms are shown below:

alr(x1, x2, x3) = (ln(x1/x3), ln(x2/x3))

clr(x1, x2, x3) = (ln(x1/g(x)), ln(x2/g(x)), ln(x3/g(x)))

where g(x) = ( x1 * x2 * x3 ) ^ (1/3) is the geometric mean of the components.
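
The two formulas above can be sketched in a few lines of Python. This is a minimal version assuming strictly positive parts and taking the last component as the alr reference, as in the formula; the function names and the sample composition are illustrative choices, not a library API.

```python
import math

def alr(x):
    """Additive logratio: log of each part relative to the last part."""
    ref = x[-1]
    return [math.log(xi / ref) for xi in x[:-1]]

def clr(x):
    """Centred logratio: log of each part relative to the geometric mean."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)   # equals ln(g(x))
    return [lx - mean_log for lx in logs]

x = [0.6, 0.3, 0.1]   # illustrative composition (assumed values)
print(alr(x))         # two coordinates
print(clr(x))         # three coordinates
```

Two things are worth checking when experimenting with this sketch: the clr coordinates always sum to zero, and both transforms give the same result for x and for any positive multiple of x, such as (60, 30, 10).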

Egozcue et al. later defined the isometric logratio transform (ilr), which addresses some of the issues with the alr and clr and has become the most popular choice today. The ilr depends on a choice of basis; for three components, one such choice gives:

ilr(x1, x2, x3) = ( (2/3)^(1/2) * ln(x1 / (x2 * x3)^(1/2)), (1/2)^(1/2) * ln(x2 / x3) )
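
The three-part ilr formula above can be sketched as follows. The function name ilr3 and the sample values are assumptions for illustration. The sketch also compares the Euclidean length of the ilr vector with that of the clr vector, since preserving such lengths is what "isometric" refers to.

```python
import math

def clr(x):
    """Centred logratio, used here only for the length comparison."""
    logs = [math.log(xi) for xi in x]
    mean_log = sum(logs) / len(logs)
    return [lx - mean_log for lx in logs]

def ilr3(x1, x2, x3):
    """Three-part ilr for the specific basis used in the formula above."""
    z1 = math.sqrt(2.0 / 3.0) * math.log(x1 / math.sqrt(x2 * x3))
    z2 = math.sqrt(1.0 / 2.0) * math.log(x2 / x3)
    return (z1, z2)

x = (0.6, 0.3, 0.1)   # illustrative composition (assumed values)
z = ilr3(*x)

# The ilr and clr vectors should have the same Euclidean length.
norm_ilr = math.hypot(z[0], z[1])
norm_clr = math.sqrt(sum(c * c for c in clr(x)))
print(z, norm_ilr, norm_clr)
```

Unlike the clr, the ilr yields only two coordinates for three parts, with no sum-to-zero constraint, so the result can be passed to standard multivariate methods without a redundant dimension.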

Applying CoDa to Physical Activity

Data accounting for the breakdown of the day by behaviour type are compositional data. It is therefore not reasonable to assess the impact of variation in one behaviour type in isolation: allowance should be made for the behaviour type(s) replaced. Isotemporal substitution is currently the most popular approach for this, but compositional analysis makes the full range of multivariate data analysis techniques available for physical activity problems and adequately accounts for the co-dependencies between the behaviour types.
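
To make the point about replaced behaviours concrete, here is a minimal sketch of a one-for-one time reallocation. It is not an implementation of isotemporal substitution modelling; it only shows that, under a fixed total, adding time to one behaviour requires naming the behaviour it comes from. The function name reallocate, the 16-hour (960-minute) waking day and the 60/30/10 split are assumptions for illustration.

```python
def reallocate(day, source, target, minutes):
    """Move `minutes` from `source` to `target`, keeping the total fixed.

    Hypothetical helper: every part is kept strictly positive so that
    logratio transforms remain defined afterwards.
    """
    if minutes >= day[source]:
        raise ValueError("source behaviour must keep a positive amount of time")
    new_day = dict(day)          # copy so the original day is left untouched
    new_day[source] -= minutes
    new_day[target] += minutes
    return new_day

# Assumed 960-minute waking day split 60% / 30% / 10%.
waking_day = {"SB": 576.0, "LIPA": 288.0, "MVPA": 96.0}
new_day = reallocate(waking_day, "SB", "MVPA", 30.0)

print(new_day, sum(new_day.values()))
```

Reallocating the same 30 minutes from LIPA instead of SB yields a different composition, which is why the effect of "more MVPA" cannot be assessed without specifying what it displaces.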