Building contrasts#

This section explains how to build contrasts with a model that implements the formulaic-contrasts API. We start with the simple example of a pairwise comparison, then moving into more complex contrasts featuring interaction terms.

All examples shown here are still relatively simple and the contrast vectors could be reasonably built manually. However, the approach still works for design matrices that have dozens or even hundreds of columns due to many variables and/or categories.

Note

formulaic-contrasts doesn’t implement any statistcal tests, it just provides tools for conveniently building contrast vectors that can be used with various statistical models.

Let’s consider the following toy dataset that mimicks a 2-arm clinical trial (drugA vs. drugB) with Responders (responder) and Non-responders (non_responder):

import formulaic_contrasts

df = formulaic_contrasts.datasets.treatment_response()
df
treatment response biomarker
0 drugA non_responder 6.595490
1 drugA non_responder 7.071509
2 drugA non_responder 8.537421
3 drugA non_responder 6.787991
4 drugA non_responder 10.109717
... ... ... ...
75 drugB responder 11.167627
76 drugB responder 9.493773
77 drugB responder 5.027817
78 drugB responder 9.800762
79 drugB responder 9.945963

80 rows × 3 columns

The measured biomarker has different means for responders and non-responders for each treatment, which suggests the biomarker could have predictive value.

df.groupby(["treatment", "response"]).agg("mean")
biomarker
treatment response
drugA non_responder 6.791881
responder 5.142661
drugB non_responder 4.720502
responder 10.282279

Simple pairwise comparison#

Arguably the most commonly used contrast is to compare between two levels of a categorical variable. For instance, we could investigate differences between responders and non-responders, independent of treatment by fitting the model ~ response + treatment and then comparing the category "responder" in the column response with the category "non_responder".

Given the data frame from above and the model ~ response + treatment, the design matrix contains the following distinct entries, encoding the different combinations of response and drug.

from formulaic import model_matrix

model_matrix("~ response + treatment", df).drop_duplicates()
Intercept response[T.responder] treatment[T.drugB]
0 1.0 0 0
10 1.0 1 0
40 1.0 0 1
70 1.0 1 1

The response[T.responder] column encodes "responder" as 1 and "non_responder" as 0. The intercept is always 1 and the other column is irrelevant for our desired comparison. The entries a contrast vector always correspond to the columns of the design matrix. We therefore need a contrast vector that compares (1, 1, 0) vs. (1, 0, 0):

import numpy as np

contrast = np.array((1, 1, 0)) - np.array((1, 0, 0))
contrast
array([0, 1, 0])

Using formulaic-contrast’s cond() function, we can build the same contrast vector by specifying the categories of interest:

from formulaic_contrasts import FormulaicContrasts

mod = FormulaicContrasts(df, "~ response + treatment")

contrast = mod.cond(response="responder") - mod.cond(response="non_responder")
contrast
Intercept                0.0
response[T.responder]    1.0
treatment[T.drugB]       0.0
Name: 0, dtype: float64

For this very common case of comparing two categories of the same variable, contrast() provides a convenient shortcut for building the same contrast:

contrast = mod.contrast(
    column="response",
    baseline="non_responder",
    group_to_compare="responder",
)
contrast
Intercept                0.0
response[T.responder]    1.0
treatment[T.drugB]       0.0
Name: 0, dtype: float64

Comparison within a subset#

Additionally, we could be interested in differences between responders and non-responders in drugB only. While we could subset the data before fitting the model, we could also fit the full model including an interaction term ~ response * treatment and define the contrast such that it compares within drugB only.

mod = FormulaicContrasts(df, "~ response * treatment")

contrast = mod.cond(treatment="drugB", response="responder") - mod.cond(
    treatment="drugB", response="non_responder"
)
contrast
Intercept                                   0.0
response[T.responder]                       1.0
treatment[T.drugB]                          0.0
response[T.responder]:treatment[T.drugB]    1.0
Name: 0, dtype: float64

Drug/response interaction#

Now, we are interested in the drug/response interaction, i.e. the difference of the differences between responders and non-responders in drugA and drugB, respectively. This is captured by the response[T.responder]:treatment[T.drugB] coefficient in the design matrix. This is how we derive this contrast vector using .cond:

mod = FormulaicContrasts(df, "~ response * treatment")

contrast = (
    mod.cond(treatment="drugB", response="responder")
    - mod.cond(treatment="drugB", response="non_responder")
) - (
    mod.cond(treatment="drugA", response="responder")
    - mod.cond(treatment="drugA", response="non_responder")
)
contrast
Intercept                                   0.0
response[T.responder]                       0.0
treatment[T.drugB]                          0.0
response[T.responder]:treatment[T.drugB]    1.0
Name: 0, dtype: float64