Building contrasts#
This section explains how to build contrasts with a model that implements the formulaic-contrasts API. We start
with the simple example of a pairwise comparison, then moving into more complex contrasts featuring
interaction terms.
All examples shown here are still relatively simple and the contrast vectors could be reasonably built manually. However, the approach still works for design matrices that have dozens or even hundreds of columns due to many variables and/or categories.
Note
formulaic-contrasts doesn’t implement any statistcal tests, it just provides tools for conveniently
building contrast vectors that can be used with various statistical models.
Let’s consider the following toy dataset that mimicks a 2-arm clinical trial (drugA vs. drugB)
with Responders (responder) and Non-responders (non_responder):
import formulaic_contrasts
df = formulaic_contrasts.datasets.treatment_response()
df
| treatment | response | biomarker | |
|---|---|---|---|
| 0 | drugA | non_responder | 6.595490 |
| 1 | drugA | non_responder | 7.071509 |
| 2 | drugA | non_responder | 8.537421 |
| 3 | drugA | non_responder | 6.787991 |
| 4 | drugA | non_responder | 10.109717 |
| ... | ... | ... | ... |
| 75 | drugB | responder | 11.167627 |
| 76 | drugB | responder | 9.493773 |
| 77 | drugB | responder | 5.027817 |
| 78 | drugB | responder | 9.800762 |
| 79 | drugB | responder | 9.945963 |
80 rows × 3 columns
The measured biomarker has different means for responders and non-responders for each treatment, which suggests the biomarker could have predictive value.
df.groupby(["treatment", "response"]).agg("mean")
| biomarker | ||
|---|---|---|
| treatment | response | |
| drugA | non_responder | 6.791881 |
| responder | 5.142661 | |
| drugB | non_responder | 4.720502 |
| responder | 10.282279 |
Simple pairwise comparison#
Arguably the most commonly used contrast is to compare between two levels of a categorical variable.
For instance, we could
investigate differences between responders and non-responders, independent of treatment by fitting the model
~ response + treatment and then comparing the category "responder" in the column response with the category "non_responder".
Given the data frame from above and the model ~ response + treatment, the design matrix contains the following distinct
entries, encoding the different combinations of response and drug.
from formulaic import model_matrix
model_matrix("~ response + treatment", df).drop_duplicates()
| Intercept | response[T.responder] | treatment[T.drugB] | |
|---|---|---|---|
| 0 | 1.0 | 0 | 0 |
| 10 | 1.0 | 1 | 0 |
| 40 | 1.0 | 0 | 1 |
| 70 | 1.0 | 1 | 1 |
The response[T.responder] column encodes "responder" as 1 and "non_responder" as 0. The
intercept is always 1 and the other column is irrelevant for our desired comparison. The entries a contrast vector
always correspond to the columns of the design matrix. We therefore need a contrast vector
that compares (1, 1, 0) vs. (1, 0, 0):
import numpy as np
contrast = np.array((1, 1, 0)) - np.array((1, 0, 0))
contrast
array([0, 1, 0])
Using formulaic-contrast’s cond() function, we can build the same
contrast vector by specifying the categories of interest:
from formulaic_contrasts import FormulaicContrasts
mod = FormulaicContrasts(df, "~ response + treatment")
contrast = mod.cond(response="responder") - mod.cond(response="non_responder")
contrast
Intercept 0.0
response[T.responder] 1.0
treatment[T.drugB] 0.0
Name: 0, dtype: float64
For this very common case of comparing two categories of the same variable, contrast()
provides a convenient shortcut for building the same contrast:
contrast = mod.contrast(
column="response",
baseline="non_responder",
group_to_compare="responder",
)
contrast
Intercept 0.0
response[T.responder] 1.0
treatment[T.drugB] 0.0
Name: 0, dtype: float64
Comparison within a subset#
Additionally, we could be interested in differences between responders and non-responders in drugB only.
While we could subset the data before fitting the model, we could also fit the full model including an interaction term ~ response * treatment and define the contrast such that it compares within drugB only.
mod = FormulaicContrasts(df, "~ response * treatment")
contrast = mod.cond(treatment="drugB", response="responder") - mod.cond(
treatment="drugB", response="non_responder"
)
contrast
Intercept 0.0
response[T.responder] 1.0
treatment[T.drugB] 0.0
response[T.responder]:treatment[T.drugB] 1.0
Name: 0, dtype: float64
Drug/response interaction#
Now, we are interested in the drug/response interaction, i.e. the difference of the differences between responders and non-responders
in drugA and drugB, respectively. This is captured by the response[T.responder]:treatment[T.drugB] coefficient
in the design matrix. This is how we derive this contrast vector using .cond:
mod = FormulaicContrasts(df, "~ response * treatment")
contrast = (
mod.cond(treatment="drugB", response="responder")
- mod.cond(treatment="drugB", response="non_responder")
) - (
mod.cond(treatment="drugA", response="responder")
- mod.cond(treatment="drugA", response="non_responder")
)
contrast
Intercept 0.0
response[T.responder] 0.0
treatment[T.drugB] 0.0
response[T.responder]:treatment[T.drugB] 1.0
Name: 0, dtype: float64