Building contrasts#

This section explains how to build contrasts with a model that implements the formulaic-contrasts API. We start with the simple example of a pairwise comparison, then moving into more complex contrasts featuring interaction terms.

All examples shown here are still relatively simple and the contrast vectors could be reasonably built manually. However, the approach still works for design matrices that have dozens or even hundreds of columns due to many variables and/or categories.

Note

formulaic-contrasts doesn’t implement any statistcal tests, it just provides tools for conveniently building contrast vectors that can be used with various statistical models.

Let’s consider the following toy dataset that mimicks a 2-arm clinical trial (drugA vs. drugB) with Responders (responder) and Non-responders (non_responder):

import formulaic_contrasts

df = formulaic_contrasts.datasets.treatment_response()
df

	treatment	response	biomarker
0	drugA	non_responder	6.595490
1	drugA	non_responder	7.071509
2	drugA	non_responder	8.537421
3	drugA	non_responder	6.787991
4	drugA	non_responder	10.109717
...	...	...	...
75	drugB	responder	11.167627
76	drugB	responder	9.493773
77	drugB	responder	5.027817
78	drugB	responder	9.800762
79	drugB	responder	9.945963

80 rows × 3 columns

The measured biomarker has different means for responders and non-responders for each treatment, which suggests the biomarker could have predictive value.

df.groupby(["treatment", "response"]).agg("mean")

		biomarker
treatment	response
drugA	non_responder	6.791881
drugA	responder	5.142661
drugB	non_responder	4.720502
drugB	responder	10.282279

Simple pairwise comparison#

Arguably the most commonly used contrast is to compare between two levels of a categorical variable. For instance, we could investigate differences between responders and non-responders, independent of treatment by fitting the model ~ response + treatment and then comparing the category "responder" in the column response with the category "non_responder".

Given the data frame from above and the model ~ response + treatment, the design matrix contains the following distinct entries, encoding the different combinations of response and drug.

from formulaic import model_matrix

model_matrix("~ response + treatment", df).drop_duplicates()

	Intercept	response[T.responder]	treatment[T.drugB]
0	1.0	0	0
10	1.0	1	0
40	1.0	0	1
70	1.0	1	1

The response[T.responder] column encodes "responder" as 1 and "non_responder" as 0. The intercept is always 1 and the other column is irrelevant for our desired comparison. The entries a contrast vector always correspond to the columns of the design matrix. We therefore need a contrast vector that compares (1, 1, 0) vs. (1, 0, 0):

import numpy as np

contrast = np.array((1, 1, 0)) - np.array((1, 0, 0))
contrast

array([0, 1, 0])

Using formulaic-contrast’s cond() function, we can build the same contrast vector by specifying the categories of interest:

from formulaic_contrasts import FormulaicContrasts

mod = FormulaicContrasts(df, "~ response + treatment")

contrast = mod.cond(response="responder") - mod.cond(response="non_responder")
contrast

Intercept                0.0
response[T.responder]    1.0
treatment[T.drugB]       0.0
Name: 0, dtype: float64

For this very common case of comparing two categories of the same variable, contrast() provides a convenient shortcut for building the same contrast:

contrast = mod.contrast(
    column="response",
    baseline="non_responder",
    group_to_compare="responder",
)
contrast

Intercept                0.0
response[T.responder]    1.0
treatment[T.drugB]       0.0
Name: 0, dtype: float64

Comparison within a subset#

Additionally, we could be interested in differences between responders and non-responders in drugB only. While we could subset the data before fitting the model, we could also fit the full model including an interaction term ~ response * treatment and define the contrast such that it compares within drugB only.

mod = FormulaicContrasts(df, "~ response * treatment")

contrast = mod.cond(treatment="drugB", response="responder") - mod.cond(
    treatment="drugB", response="non_responder"
)
contrast

Intercept                                   0.0
response[T.responder]                       1.0
treatment[T.drugB]                          0.0
response[T.responder]:treatment[T.drugB]    1.0
Name: 0, dtype: float64

Drug/response interaction#

Now, we are interested in the drug/response interaction, i.e. the difference of the differences between responders and non-responders in drugA and drugB, respectively. This is captured by the response[T.responder]:treatment[T.drugB] coefficient in the design matrix. This is how we derive this contrast vector using .cond:

mod = FormulaicContrasts(df, "~ response * treatment")

contrast = (
    mod.cond(treatment="drugB", response="responder")
    - mod.cond(treatment="drugB", response="non_responder")
) - (
    mod.cond(treatment="drugA", response="responder")
    - mod.cond(treatment="drugA", response="non_responder")
)
contrast

Intercept                                   0.0
response[T.responder]                       0.0
treatment[T.drugB]                          0.0
response[T.responder]:treatment[T.drugB]    1.0
Name: 0, dtype: float64

Building contrasts

Contents

Building contrasts#

Simple pairwise comparison#

Comparison within a subset#

Drug/response interaction#