Usage in custom model#
Via Inheritance#
The most straightforward way to use formulaic-contrasts with a custom model is to use FormulaicContrasts
as a base class or mixin class.
As an example, let’s wrap an Ordinary Least Squares (OLS) linear model into a custom class
for the use with formulaic-contrasts. The aim is to build a model that takes a pandas DataFrame and a formulaic formula as input
allows to fit the model to a continuous variable from the dataframe and perform a statistical test for a given contrast.
This can be achieved with the following class definition. The constructor, the contrast() and cond() methods are inherited from the FormulaicContrasts
base class:
import formulaic_contrasts
import numpy as np
import statsmodels.api as sm
import pandas as pd
class StatsmodelsOLS(formulaic_contrasts.FormulaicContrasts):
def __init__(self, data: pd.DataFrame, design: str):
super().__init__(data, design)
def fit(self, variable: str):
self.mod = sm.OLS(self.data[variable], self.design_matrix)
self.mod = self.mod.fit()
def t_test(self, contrast: np.ndarray):
return self.mod.t_test(contrast)
Let’s apply our model to an example dataset. The toy data mimicks a 2-arm clinical trial (drugA vs. drugB)
with Responders (responder) and Non-responders (non_responder) and a continuous biomarker that can protentially
predict response of the different treatments.
df = formulaic_contrasts.datasets.treatment_response()
df
| treatment | response | biomarker | |
|---|---|---|---|
| 0 | drugA | non_responder | 6.595490 |
| 1 | drugA | non_responder | 7.071509 |
| 2 | drugA | non_responder | 8.537421 |
| 3 | drugA | non_responder | 6.787991 |
| 4 | drugA | non_responder | 10.109717 |
| ... | ... | ... | ... |
| 75 | drugB | responder | 11.167627 |
| 76 | drugB | responder | 9.493773 |
| 77 | drugB | responder | 5.027817 |
| 78 | drugB | responder | 9.800762 |
| 79 | drugB | responder | 9.945963 |
80 rows × 3 columns
Let’s fit the model an perform the statistical test
model = StatsmodelsOLS(df, "~ treatment + response")
model.fit("biomarker")
model.t_test(
model.contrast("response", baseline="non_responder", group_to_compare="responder")
)
<class 'statsmodels.stats.contrast.ContrastResults'>
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 1.9563 0.775 2.525 0.014 0.413 3.499
==============================================================================
Via Composition#
Alternatively, if you prefer to work without inheritance, you can use FormulaicContrast as an attribute.
In this case, you need to define cond()/contrast() yourself, or provide a custom way to define contrasts, calling cond() internally. A minimal implementation could look like:
import pandas as pd
class StatsmodelsOLS:
def __init__(self, data: pd.DataFrame, design: str) -> None:
self.data = data
self.contrast_builder = formulaic_contrasts.FormulaicContrasts(data, design)
def fit(self, variable: str):
self.mod = sm.OLS(self.data[variable], self.contrast_builder.design_matrix)
self.mod = self.mod.fit()
def cond(self, **kwargs):
return self.contrast_builder.cond(**kwargs)
def contrast(self, *args, **kwargs):
return self.contrast_builder.contrast(*args, **kwargs)
def t_test(self, contrast: np.ndarray):
return self.mod.t_test(contrast)
The result is the same:
model = StatsmodelsOLS(df, "~ treatment + response")
model.fit("biomarker")
model.t_test(
model.contrast("response", baseline="non_responder", group_to_compare="responder")
)
<class 'statsmodels.stats.contrast.ContrastResults'>
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 1.9563 0.775 2.525 0.014 0.413 3.499
==============================================================================
Manual usage#
Hint
Some helpful definitions for working with formulaic formulas (e.g. ~ 0 + C(donor):treatment + np.log1p(continuous)):
A term refers to an expression in the formula, separated by
+, e.g.C(donor):treatment, ornp.log1p(continuous).A variable refers to a column of the data frame passed to formulaic, e.g.
donor.A factor is the specification of how a certain variable is represented in the design matrix, e.g. treatment coding with base level “A” and reduced rank.
You can use the lower-level interface get_factor_storage_and_materializer() to
introspect formulaic models if the FormulaicContrasts class doesn’t fit your needs.
get_factor_storage_and_materializer will generate a custom materializer
that, while generating the design matrix from the input data and formula, keeps track of certain metadata that are
useful for defining contrasts.
factor_storage, variables_to_factors, materializer_class = (
formulaic_contrasts.get_factor_storage_and_materializer()
)
If we apply the materializer to the example dataset from above…
design_mat = materializer_class(df, record_factor_metadata=True).get_model_matrix(
"~ treatment + response"
)
… factor_storage will keep track of factors used in the formula, while variables_to_factors will keep
track of variables used in the formula. The FactorMetadata objects contain information
that is useful for building custom contrasts, such as the list of categories or the base()
level of categorical variables.
from pprint import pprint
pprint(factor_storage)
defaultdict(<class 'list'>,
{'response': [FactorMetadata(name='response',
reduced_rank=True,
custom_encoder=False,
categories=('non_responder',
'responder'),
kind=<Kind.CATEGORICAL: 'categorical'>,
drop_field='non_responder',
column_names=('non_responder',
'responder'),
colname_format='{name}[{field}]')],
'treatment': [FactorMetadata(name='treatment',
reduced_rank=True,
custom_encoder=False,
categories=('drugA', 'drugB'),
kind=<Kind.CATEGORICAL: 'categorical'>,
drop_field='drugA',
column_names=('drugA', 'drugB'),
colname_format='{name}[{field}]')]})
variables_to_factors
defaultdict(set, {'treatment': {'treatment'}, 'response': {'response'}})
In a slightly more complex (and argueably useless) example, the result would look like this:
factor_storage, variables_to_factors, materializer_class = (
formulaic_contrasts.get_factor_storage_and_materializer()
)
design_mat = materializer_class(df, record_factor_metadata=True).get_model_matrix(
"~ np.log(biomarker) + C(treatment, contr.treatment(base='drugB')) * C(response)"
)
pprint(factor_storage)
defaultdict(<class 'list'>,
{'C(response)': [FactorMetadata(name='C(response)',
reduced_rank=True,
custom_encoder=True,
categories=('non_responder',
'responder'),
kind=<Kind.CATEGORICAL: 'categorical'>,
drop_field=None,
column_names=('responder',),
colname_format='{name}[T.{field}]')],
"C(treatment, contr.treatment(base='drugB'))": [FactorMetadata(name='C(treatment, '
"contr.treatment(base='drugB'))",
reduced_rank=True,
custom_encoder=True,
categories=('drugA',
'drugB'),
kind=<Kind.CATEGORICAL: 'categorical'>,
drop_field=None,
column_names=('drugA',),
colname_format='{name}[T.{field}]')],
'np.log(biomarker)': [FactorMetadata(name='np.log(biomarker)',
reduced_rank=False,
custom_encoder=False,
categories=(np.float64(-0.994572491384816),
np.float64(-0.32292867440331624),
np.float64(-0.241787230648226),
np.float64(-0.07184946402122966),
np.float64(0.5402788527682045),
np.float64(0.5588358549592866),
np.float64(0.5976024872603584),
np.float64(0.6985411914333322),
np.float64(0.8014619184523657),
np.float64(0.9353144690102372),
np.float64(0.9357602734122806),
np.float64(0.9508095585482432),
np.float64(1.0587960677139694),
np.float64(1.1984152459673467),
np.float64(1.250031921525277),
np.float64(1.288736956185549),
np.float64(1.3055160592797277),
np.float64(1.318699316319654),
np.float64(1.3691528193898392),
np.float64(1.4011909677280152),
np.float64(1.4696755716009422),
np.float64(1.4848614510372413),
np.float64(1.5042452056588211),
np.float64(1.5347612081374633),
np.float64(1.5586765895960253),
np.float64(1.5915854505280886),
np.float64(1.6133268972915464),
np.float64(1.6149859295681703),
np.float64(1.6427701026668886),
np.float64(1.6541909597093103),
np.float64(1.6706857016933832),
np.float64(1.6800363836451575),
np.float64(1.6874414702471792),
np.float64(1.7114697163161487),
np.float64(1.712291418836503),
np.float64(1.7302545130901745),
np.float64(1.744639763009614),
np.float64(1.7498086372286121),
np.float64(1.7842160982734327),
np.float64(1.7871408509738518),
np.float64(1.8077571345863106),
np.float64(1.8126986617691452),
np.float64(1.8332062854487814),
np.float64(1.8502652368191275),
np.float64(1.8819102495764126),
np.float64(1.8863860690197658),
np.float64(1.897341561353762),
np.float64(1.9117530958321944),
np.float64(1.9151549833413493),
np.float64(1.918061067186584),
np.float64(1.9188752897607209),
np.float64(1.919357972842738),
np.float64(1.9274811395788238),
np.float64(1.9290675404526416),
np.float64(1.9292525771899476),
np.float64(1.956073906066531),
np.float64(1.9839526394392006),
np.float64(1.9906039715512525),
np.float64(1.9942917406466534),
np.float64(2.027629774683785),
np.float64(2.0416152416364257),
np.float64(2.0646278053535907),
np.float64(2.1167677837306025),
np.float64(2.123298339614755),
np.float64(2.1407653917376552),
np.float64(2.1444589627935478),
np.float64(2.1600468598671436),
np.float64(2.19915225165721),
np.float64(2.250636108272607),
np.float64(2.254466878298736),
np.float64(2.2824601250615855),
np.float64(2.29716673886693),
np.float64(2.3134970542142574),
np.float64(2.413019186969858),
np.float64(2.4438604074304093),
np.float64(2.5154775572854935),
np.float64(2.516131763095562),
np.float64(2.770430770362043)),
kind=<Kind.NUMERICAL: 'numerical'>,
drop_field=None,
column_names=None,
colname_format='{name}[{field}]')]})
variables_to_factors
defaultdict(set,
{'np.log': {'np.log(biomarker)'},
'biomarker': {'np.log(biomarker)'},
'treatment': {"C(treatment, contr.treatment(base='drugB'))"},
'contr.treatment': {"C(treatment, contr.treatment(base='drugB'))"},
'C': {'C(response)',
"C(treatment, contr.treatment(base='drugB'))"},
'response': {'C(response)'}})