epistasis
¶
A Python API for modeling statistical, high-order epistasis in genotype-phenotype maps. You can use this library to:
- Decompose genotype-phenotype maps into high-order epistatic interactions
- Find nonlinear scales in the genotype-phenotype map
- Calculate the contributions of different epistatic orders and
- Estimate the uncertainty in the epistatic coefficients and
For more information about the epistasis models in this library, see our Genetics paper:
This library is built on top of well known tools in the scientific Python stack. It uses packages like matplotlib, numpy, scipy, scikit-learn, and pandas. We strive to follow similar inferface designs present in this ecosytem. If you notice ways we can improve, please open an issue on Github! We’d love to hear your feedback.
Currently, this package works only as an API. There is no command-line interface. Instead, we encourage you use this package inside Jupyter notebooks .
Basic Example¶
Fit an epistasis model to genotype-phenotype map data.
# Import a model and the plotting module
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
from epistasis.pyplot import plot_coefs
# Genotype-phenotype map data.
wildtype = "AAA"
genotypes = ["ATT", "AAT", "ATA", "TAA", "ATT", "TAT", "TTA", "TTT"]
phenotypes = [0.1, 0.2, 0.4, 0.3, 0.3, 0.6, 0.8, 1.0]
# Create genotype-phenotype map object.
gpm = GenotypePhenotypeMap(wildtype=wildtype,
genotypes=genotypes,
phenotypes=phenotypes)
# Initialize an epistasis model.
model = EpistasisLinearRegression(order=3)
# Add the genotype phenotype map.
model.add_gpm(gpm)
# Fit model to given genotype-phenotype map.
model.fit()
# Plot coefs.
fig, axes = plot_coefs(model, figsize=(2,4))

Documentation¶
We are still working hard on the Docs! You may notice blank spots in various places. We appreciate your patience as we try to catch up on docs.
Quick Guide¶
Introduction¶
epistasis
is a Python library that includes models to estimate statistical, high-order epistasis in genotype-phenotype maps. Using this library, you can
- Decompose genotype-phenotype maps into high-order epistatic interactions
- Find nonlinear scales in the genotype-phenotype map
- Calculate the contributions of different epistatic orders and
- Estimate the uncertainty in the epistatic coefficients and
For more information about the epistasis models in this library, see our Genetics paper:
Simple Example¶
Follow these five steps for all epistasis models in this library:
- Import a model. There many models available in the
epistasis.models
module. See the full list in the next section.
from epistasis.models import EpistasisLinearRegression
- Initialize a model. Set the order, choose the type of model (see Anatomy of an epistasis model for more info), and set any other parameters in the model.
model = EpistasisLinearRegression(order=3, model_type='global')
- Add some data. There are three basic ways to do this. 1. Pass data directly to the epistasis model using the
add_data
method. 2. Read data from a separate file using one of theread_
methods. 3. (The best option.) load data into a GenotypePhenotypeMap object from the GPMap library and add it to the epistasis model.
from gpmap import GenotypePhenotypeMap
datafile = 'data.csv'
gpm = GenotypePhenotypeMap.read_csv(datafile)
# Add the data.
model.add_gpm(gpm)
# model now has a `gpm` attribute.
- Fit the model. Each model has a simple fit method. Call this to estimate epistatic coefficients. The results are stored the
epistasis
attribute.
# Call fit method
model.fit()
# model now has an ``epistasis`` attribute
- Plot the results. The epistasis library has a
pyplot
module (powered by matplotlib) with a few builtin plotting functions.
from epistasis.pyplot import plot_coefs
fig, ax = plot_coefs(model.epistasis.sites, model.epistasis.values)

Install and dependencies¶
For users¶
This library is now available on PyPi, so it can be installed using pip.
pip install epistasis
For developers¶
For the latest version of the package, you can also clone from Github and install a development version using pip.
git clone https://github.com/harmslab/epistasis
cd epistasis
pip install -e .
Dependencies¶
The following dependencies are required for the epistasis package.
- gpmap: Module for constructing powerful genotype-phenotype map python data-structures.
- Scikit-learn: Simple to use machine-learning API.
- Numpy: Python’s array manipulation package.
- Scipy: Efficient scientific array manipulations and fitting.
- Pandas: High-performance, easy-to-use data structures and data analysis tools.
There are also some additional dependencies for extra features included in the package.
- matplotlib: Python plotting API.
- ipython: interactive python kernel.
- jupyter notebook: interactive notebook application for running python kernels interactively.
- ipywidgets: interactive widgets in python.
Running tests¶
The epistasis package comes with a suite of tests. Running the tests require pytest, so make sure it is installed.
pip install -U pytest
Once pytest is installed, run the tests from the base directory of the epistasis package using the following command.
pytest
Detailed list of models¶
This page lists all models included in the Epistasis Package.
- EpistasisLinearRegression: estimate epistatic coefficents in a linear genotype-phenotype map.
- EpistasisRidge: estimate sparse epistatic coefficients in a linear genotype-phenotype map
- EpistasisLasso: estimate sparse epistatic coefficients in a linear genotype-phenotype map
- EpistasisNonlinearRegression: estimates high-order epistatic coefficients in a nonlinear genotype-phenotype map.
- EpistasisNonlinearLasso: estimate sparse epistatic coefficients in a nonlinear genotype-phenotype map.
- EpistasisPowerTransform: use a power transform function to fit a nonlinear genotype-phenotype map and estimate epistasis.
- EpistasisPowerLasso: use a power transform function to fit a nonlinear genotype-phenotype map and estimate sparse epistasis.
- EpistasisLogisticRegression: use logistic regression to classify phenotypes as dead/alive.
- EpistasisEnsembleRegression: use a statistical ensemble of “states” to decompose variation in a genotype-phenotype map.
EpistasisLinearRegression¶
A linear, high-order epistasis model. This uses an ordinary least-squares regression to estimate high-order, epistatic coefficients in an arbitrary genotype-phenotype map. Simple define the order of the model.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisLinearRegression(order=2)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit()
EpistasisRidge¶
A L2-norm epistasis model for estimating sparse epistatic coefficients. The optimization function imposes a penalty on the number of coefficients and finds the model that maximally explains the data while using the fewest coefficients.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisRidge
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisRidge(order=2)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit()
EpistasisLasso¶
A L1-norm epistasis model for estimating sparse epistatic coefficients. The optimization function imposes a penalty on the number of coefficients and finds the model that maximally explains the data while using the fewest coefficients.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLasso
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisLasso(order=2)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit()
EpistasisNonlinearRegression¶
A nonlinear, high-order epistasis model. This uses nonlinear, least-squares
regression (provided by lmfit
) to estimate high-order, epistatic
coefficients in an arbitrary genotype-phenotype map.
- This models has three steps:
- Fit an additive, linear regression to approximate the average effect of individual mutations.
- Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1. This function is defined by the user as a callable python function
- Transform the phenotypes to this linear scale and fit leftover variation with high-order epistasis model.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
def func(x, A):
return np.exp(A * x)
def reverse(y, A):
return np.log(x) / A
# Initialize the data.
model = EpistasisNonlinearRegression(order=2, function=func, reverse=reverse)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit(A=1)
EpistasisNonlinearLasso¶
A nonlinear, high-order epistasis model. This uses nonlinear, least-squares
regression (provided by lmfit
) to estimate high-order, epistatic
coefficients in an arbitrary genotype-phenotype map.
- This models has three steps:
- Fit an additive, linear regression to approximate the average effect of individual mutations.
- Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1. This function is defined by the user as a callable python function
- Transform the phenotypes to this linear scale and fit leftover variation with an EpistasisLasso.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
def func(x, A):
return np.exp(A * x)
def reverse(y, A):
return np.log(x) / A
# Initialize the data.
model = EpistasisNonlinearLasso(order=3, function=func, reverse=reverse)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit(A=1)
EpistasisPowerTransform¶
Use power-transform function, via nonlinear least-squares regression, to estimate epistatic coefficients and the nonlinear scale in a nonlinear genotype-phenotype map.
- Like the nonlinear model, this model has three steps:
- Fit an additive, linear regression to approximate the average effect of individual mutations.
- Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1.
- Transform the phenotypes to this linear scale and fit leftover variation with high-order epistasis model.
Methods are described in the following publication:
Sailer, Z. R. & Harms, M. J. ‘Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps’. Genetics 205, 1079-1088 (2017).
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisPowerTransform(order=3)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit(lmbda=1, A=1, B=1)
EpistasisPowerLasso¶
Use power-transform function, via nonlinear least-squares regression, to estimate epistatic coefficients and the nonlinear scale in a nonlinear genotype-phenotype map.
- Like the nonlinear model, this model has three steps:
- Fit an additive, linear regression to approximate the average effect of individual mutations.
- Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1.
- Transform the phenotypes to this linear scale and fit leftover variation with an EpistasisLasso.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisPowerTransformLasso(order=3)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit(lmbda=1, A=1, B=1)
EpistasisLogisticRegression¶
A high-order epistasis regression that classifies genotypes as viable/nonviable (given some threshold).
from epistasis.models import EpistasisLogisticRegression
wildtype = '00'
genotypes = ['00', '01', '10', '11']
phenotypes = [0, .2, .1, 1]
# Initialize the data.
model = EpistasisLogisticRegression(order=1, threshold=.1)
# Add Genotype-phenotype map data.
model.add_data(wildtype, genotypes, phenotypes)
# Fit the model.
model.fit()
EpistasisEnsembleRegression¶
A regression object that models phenotypes as a statistical (Boltmann-weighted) average of “states”. Mutations are modeled as having different effects in each state.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisEnsembleRegression
wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]
# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize the data.
model = EpistasisEnsembleRegression(order=1, nstates=1)
# Add Genotype-phenotype map data.
model.add_gpm(gpm)
# Fit the model.
model.fit()
# Print effects in state A.
print(model.state_A.epistasis.values)
Anatomy of an epistasis model¶
The X matrix¶
The most critical piece of an epistasis models is the X (model) matrix. This matrix maps genotypes to epistatic coefficients. You can read about this matrix in this paper.
There are two popular X matrices that exist in the epistasis literature, the
global
model (a.k.a. background-averaged model) and local
model (a.k.a. biochemical model).
All epistasis models in this API takes a model_type
keyword argument
that tells the model which matrix to use. Read the paper mentioned
above for more information on which model to use.
Constructing these matrices for your dataset is no easy task, so each epistasis model can handle this construction internally. Most methods automatically infer X from the genotype-phenotype map.
Any X matrix used by an epistasis model is also stored in the Xbuilt
attribute.
This speeds up fitting algorithms that may need resample fitting many times.
Methods in every epistasis model¶
Every epistasis model includes the following methods:
- fit : fit the model to an attached genotype-phenotype map, or X and y data.
- predict : predict phenotypes using the X matrix or keywords (listed above). If a keyword is used, the phenotypes are in the same order as the genotypes to the corresponding keyword.
- score : the pearson coefficients between the predicted phenotypes and the given data (X/y data or attached genotype-phenotype map).
- thetas : flattened array of 1) nonlinear parameters and 2) epistatic-coefficients estimated by model.
- hypothesis : computes the phenotypes for X data given a
thetas
array.- lnlike_of_data : returns an array of log-likelihoods for each data point given a
thetas
array.- lnlikelihood : returns the total log-likelihood for X/y data given a
thetas
array.
Methods in nonlinear models¶
The extra attributes below are attached to nonlinear epistasis models.
- Additive : a first-order EpistasisLinearRegression used to approximate the additive effects of mutations
- Nonlinear : a
lmfit.MinizerResults
object returned by thelmfit.minimize
function for estimating the nonlinear scale in a genotype-phenotype map.
Building an epistasis pipeline¶
The EpistasisPipeline
object allows you to link epistasis models in series.
Define each mode and add them to a pipeline. When fit
is called,
this object runs a cascade of fit_transforms.
This is particularly useful if you need to remove nonlinearity in a genotype-phenotype map before fitting high-order epistasis (see this paper).
EpistasisPipeline
inherits Python’s list
type. This means you can
append, prepend, pop, etc. from the pipeline after initialization. Each model is
fit in the order it appears in the pipeline.
Simple Example¶
In the example below, the power transform linearizes the map, then fits specific high-order epistasis on the linear scale. The fitted model is then used to predict the phenotype of an unknown genotype.
from epistasis import EpistasisPipeline
from epistasis.models import (EpistasisPowerTransform,
EpistasisLinearRegression)
# Define genotype-phenotype map.
gpm = GenotypePhenotyeMap(
wildtype='AA'
genotypes=['AA', 'AV','VV'], # Note that we're missing the 'VA' genotype
phenotypes=[0, .5, 1]
)
# Construct pipeline.
model = EpistasisPipeline([
EpistasisPowerTransform(lmbda=1, A=0, B=0),
EpistasisLinearRegression(order=2)
])
# Fit pipeline.
model.fit()
# Predict missing phenotype of missing genotype.
model.predict(['VA'])
Simulating epistatic genotype-phenotype maps¶
Simulate rough, epistatic genotype-phenotype maps using the simulate
module.
LinearSimulation¶
The following examples show a variety ways to simulate a genotype-phenotype map with linear, high-order epistatic interactions. The simulation interface provides methods to easily dictate the construction of a simulated genotype-phenotype map.
from epistasis.simulate import LinearSimulation
# Define the wildtype sequence and possible mutations at each site.
wildtype = "0000"
mutations = {
0: ["0", "1"], # mutation alphabet for site 0
1: ["0", "1"],
2: ["0", "1"],
3: ["0", "1"]
}
# Initialize a simulation
gpm = LinearSimulation(wildtype, mutations)
# Set the order of epistasis
gpm.set_coefs_order(4)
# Generate random epistatic coefficients
coef_range = (-1, 1) # coefs between -1 and 1
gpm.set_coefs_random(coef_range)
Alternatively, you can quickly simulate a binary genotype-phenotype map if you’re fine with a simple, binary alphabet at each site.
# define the length of genotypes and the order of epistasis
length = 4
gpm = LinearSimulation.from_length(length)
# Generate random epistatic coefs
gpm.set_coefs_order(4)
gpm.set_coefs_random(coef_range)
For all simulated genotype-phenotype maps, one can initialize a genotype-phenotype
map from an existing dataset. Scroll through class methods that start with from_
to
see all options for initializing simulated genotype-phenotype maps.
NonlinearSimulation¶
Simulate a nonlinear, epistatic genotype-phenotype map using NonlinearSimulation
.
Simply define a function which transforms a linear genotype-phenotype map onto
a nonlinear scale. Note, the function must have x
as the first argument. This
argument represents the linearized phenotypes to be transformed.
from epistasis.simulate import NonlinearSimulation
def saturating_scale(x, K):
return ((K+1)*x)/(K+x)
# Define the initial value for the paramter, K
p0 = [2]
gpm = NonlinearSimulation.from_length(4, function=saturating_scale, p0=p0)
gpm.set_coefs_order(4)
gpm.set_coefs_random((0,1))
Multiplicative Example
Multiplicative epistasis is a common nonlinear, phenotypic scale. Simulate this
type of map using the NonlinearSimulation
class.
Using the epistasis
package, this looks like the following example. First, define
the exponential function as the nonlinear scale passed into the Simulation class.
import numpy as np
from epistasis.simulation import NonlinearSimulation
def multiplicative(x):
return np.exp(x)
gpm = NonlinearSimulation.from_length(4, function=multiplicative)
Then, define the epistatic coefficients, take their log, and pass them into the simulation object.
# Set the order of epistasis
gpm.set_coefs_order(4)
# generate random coefs
coefs = np.random.uniform(0,3, size=len(gpm.epistasis.labels))
# Take the log of the coefs
log_coefs = np.log(coefs)
# Pass coefs into the simulation class.
gpm.set_coefs_values(log_coefs)
Plotting in the epistasis package¶
The epistasis
package comes with a few functions to plot epistasis data.
Plot epistatic coefficients¶
The plotting module comes with a default function for plotting epistatic coefficients. It plots the value of the coefficient as bar graphs, the label as a box plot (see example below), and signficicance as stars using a t-test.
import matplotlib.pyplot as plt
from gpmap.simulate import MountFujiSimulation
from epistasis.models.linear import EpistasisLinearRegression
from epistasis.pyplot.coefs import plot_coefs
gpm = MountFujiSimulation.from_length(4, field_strength=-1, roughness=(-2,2))
model = EpistasisLinearRegression(order=4)
model.add_gpm(gpm)
model.fit()
# Plot coefs
fig, axes = plot_coefs(model, figsize=(4,5))
plt.show()

Plot nonlinear scale¶
Plot a nonlinear scale using the epistasis.pyplot.nonlinear
module.
# Import matplotlib
import matplotlib.pyplot as plt
# Import epistasis package
from gpmap.simulate import MountFujiSimulation
from epistasis.models.nonlinear import EpistasisPowerTransform
from epistasis.pyplot.nonlinear import plot_power_transform
# Simulate a Mt. Fuji fitness landscape
gpm = MountFujiSimulation.from_length(4, field_strength=-1, roughness=(-2,2))
# Fit Power transform
model = EpistasisPowerTransform(lmbda=1, A=0, B=0)
model.add_gpm(gpm)
model.fit()
# Create plot
fig, ax = plt.subplots(figsize=(3,3))
plot_power_transform(model, cmap='plasma', ax=ax, yerr=0.6)
ax.set_xlabel('Padd')
ax.set_ylabel('Pobs')
plt.show()

Advanced topics¶
Setting bounds on nonlinear fits¶
All nonlinear epistasis models use lmfit to estimate a nonlinear scale in an
arbitrary genotype-phenotype map. Each model creates lmfit.Parameter
objects
for each parameter in the input function and contains them in the parameters
attribute as lmfit.Parameters
object. Thus, you can set the bounds, initial
guesses, etc. on the parameters following lmfit’s API. The model, then, minimizes
the squared residuals using the lmfit.minimize
function. The results are
stored in the Nonlinear
object.
In the example below, we use a EpistasisPowerTransform
to demonstrate how to
access the lmfit API.
# Import a nonlinear model (this case, Power transform)
from epistasis.models import NonlinearPowerTransform
model = NonlinearPowerTransform(order=1)
model.parameters['lmbda'].set(value=1, min=0, max=10)
model.parameters['A'].set(value=10, min=0, max=100)
model.fit()
Access information about the minimizer results using the Nonlinear
attribute.
# Pretty print the results!
model.Nonlinear.params.pretty_print()
Large genotype-phenotype maps¶
We have not tested the epistasis
package on large genotype-phenotype maps (>5000 genotypes). In principle,
it should be no problem as long as you have the resources (i.e. tons of RAM and time). However, it’s possible there may be issues with convergence
and numerical rounding for these large spaces.
Reach out!¶
If you have a large dataset, please get in touch! We’d love to hear from you. Feel free to try cranking our models on your large dataset and let us know if you run into issues.
My nonlinear fit is slow and does not converge.¶
Try fitting the scale of your map using a fraction of your data. We’ve found that you can typically estimate the nonlinear scale of a genotype-phenotype map from a small fraction of the genotypes. Choose a random subset of your data and fit it using a first order nonlinear model. Then use that model to linearize all your phenotype
from gpmap import GenotypePhenotypeMap
from epistasis.models import (EpistasisPowerTransform,
EpistasisLinearRegression)
# Load data.
gpm = GenotypePhenotypeMap.read_csv('data.csv')
# Subset the data
data_subset = gpm.data.sample(frac=0.5)
gpm_subset = GenotypePhenotypeMap.read_dataframe(data_subset)
# Fit the subset
nonlinear = EpistasisPowerTransform(order=1, lmbda=1, A=0, B=0)
nonlinear.add_gpm(gpm_subset).fit()
# Linearize the original phenotypes to estimate epistasis.
#
# Note: power transform calculate the geometric mean of the additive
# phenotypes, so we need to pass those phenotypes to the reverse transform.
padd = nonlinear.Additive.predict(X='fit')
linear_phenotypes = nonlinear.reverse(gpm.phenotypes,
*nonlinear.parameters.values(),
data=padd)
# Change phenotypes (note this changes the original dataframe)
gpm.data.phenotypes = linear_phenotypes
model = EpistasisLinearRegression(order=10)
model.add_gpm(gpm)
# Fit the model
model.fit()
Estimating model uncertainty¶
The epistasis package includes a sampling
module for estimating uncertainty in
all coefficients in (Non)linear epistasis models. It follows a Bayesian approach,
and uses the emcee python package to approximate the posterior distributions
for each coefficient.
Basic example¶
Use the BayesianSampler
object to sample your epistasis model. The sampler
stores an MCMC chain
(The plot below was created using the corner
package.)
# Imports
import matplotlib.pyplot as plt
import numpy as np
import corner
from epistasis.simulate import LinearSimulation
from epistasis.models import EpistasisLinearRegression
from epistasis.sampling.bayesian import BayesianSampler
# Create a simulated genotype-phenotype map with epistasis.
sim = LinearSimulation.from_length(3, model_type="local")
sim.set_coefs_order(3)
sim.set_coefs_random((-1,1))
sim.set_stdeviations([0.01])
# Initialize an epistasis model and fit a ML model.
model = EpistasisLinearRegression(order=3, model_type="local")
model.add_gpm(sim)
model.fit()
# Initialize a sampler.
sampler = BayesianSampler(model)
samples, rstate = sampler.sample(500)
# Plot the Posterior
fig = corner.corner(samples, truths=sim.epistasis.values)

Defining a prior¶
The default prior for a BayesianSampler is a flat prior (BayesianSampler.lnprior()
returns a log-prior equal to 0). To set your own prior, define your own function
that called lnprior
that returns a log prior for a set of coefs and reset
the BayesianSampler static method:
def lnprior(coefs):
# Set bound on the first coefficient.
if coefs[0] < 0:
return -np.inf
return 0
# Apply to fitter from above
fitter.lnprior = lnprior
Saving an epistasis model¶
All epistasis models/simulators store epistatic coefficients in Pandas Series/DataFrames, so the coefficients can be written to various file formats. This page lists a few.
Pickle¶
The recommended way to save an epistasis model to be used again is by pickling the model(See Python’s pickle
library).
# Import pickle module
import pickle
from epistasis.models import EpistasisLinearRegression
# Simple model object
model = EpistasisLinearRegression(model=1)
# Save to disk to open later.
with open('model.pickle', 'wb') as f:
pickle.dump(f, model)
To load the saved model,
# Import pickle module
import pickle
# Read from file.
with open('model.pickle', 'rb') as f:
model = pickle.load(f)
Warning
Pickled models will only work with the same version of the epistasis
package that created it. If you save a model and upgrade the library, you likely
won’t be able to use the model anymore.
Excel Spreadsheet¶
Epistatic coefficients can be written to an excel file using the to_excel
method
model.to_excel('data.xlsx')
sites | values | |
---|---|---|
0 | [0] | 0.501191 |
1 | [1] | -0.600019 |
2 | [2] | 0.064983 |
3 | [3] | 0.609166 |
4 | [4] | 0.242095 |
5 | [1, 2] | 0.286914 |
6 | [1, 3] | -0.264455 |
7 | [1, 4] | -0.464212 |
8 | [2, 3] | 0.638260 |
9 | [2, 4] | 0.235989 |
10 | [3, 4] | 0.717954 |
11 | [1, 2, 3] | -0.473122 |
12 | [1, 2, 4] | -0.041919 |
13 | [1, 3, 4] | -0.309124 |
14 | [2, 3, 4] | 0.606674 |
15 | [1, 2, 3, 4] | -0.268982 |
CSV File¶
Epistatic coefficients can be written to a csv file using the to_csv
method
model.epistasis.to_csv('epistasis.csv')
,sites,values
0,[0],0.5011910655025966
1,[1],-0.6000186681513706
2,[2],0.06498276930060931
3,[3],0.6091656756721153
4,[4],0.24209508436556937
5,"[1, 2]",0.2869142038187855
6,"[1, 3]",-0.26445514455225094
7,"[1, 4]",-0.4642116520437949
8,"[2, 3]",0.638260262428922
9,"[2, 4]",0.23598864236123118
10,"[3, 4]",0.7179538630349485
11,"[1, 2, 3]",-0.47312160287366267
12,"[1, 2, 4]",-0.04191888437610514
13,"[1, 3, 4]",-0.30912353449573415
14,"[2, 3, 4]",0.6066739725656609
15,"[1, 2, 3, 4]",-0.2689818206753505
API documentation¶
This page lists all modules in the epitasis library. If you find missing documentation, feel free to open an issue or (better yet) submit a pull request!
epistasis.models package¶
See subpackages for specific models.
epistasis.models.linear package¶
epistasis.models.linear.lasso module¶
-
class
epistasis.models.linear.lasso.
EpistasisLasso
(order=1, model_type='global', alpha=1.0, precompute=False, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic', **kwargs)¶ Bases:
sklearn.linear_model.coordinate_descent.Lasso
,epistasis.models.base.AbstractModel
A scikit-learn Lasso Regression class for discovering sparse epistatic coefficients.
- Methods are described in the following publication:
- Poelwijk FJ, Socolich M, and Ranganathan R. ‘Learning the pattern of epistasis linking enotype and phenotype in a protein’. bioRxiv. (2017).
Parameters: - order (int) – order of epistasis
- model_type (str (default="global")) – model matrix type. See publication above for more information
- alpha (float) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the EpistasisLinearRegression object.
- precompute – Whether to use a precomputed Gram matrix to speed up calculations. If set to ‘auto’ let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
- max_iter (int) – The maximum number of iterations.
- tol (float) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
- warm_start (bool) – When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
- positive (bool) – When set to True, forces the coefficients to be positive.
- random_state (int) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
- selection (str) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
-
compression_ratio
()¶ Compute the compression ratio for the Lasso regression
-
fit
(X=None, y=None, **kwargs)¶
-
fit_transform
(X=None, y=None, **kwargs)¶
-
hypothesis
(X=None, thetas=None)¶
-
hypothesis_transform
(X=None, y=None, thetas=None)¶
-
lnlike_of_data
(X=None, y=None, yerr=None, thetas=None)¶
-
lnlike_transform
(X=None, y=None, yerr=None, lnprior=None, thetas=None)¶
-
num_of_params
¶
-
predict
(X=None)¶
-
predict_transform
(X=None, y=None)¶
-
score
(X=None, y=None)¶
-
thetas
¶
epistasis.models.linear.ordinary module¶
-
class
epistasis.models.linear.ordinary.
EpistasisLinearRegression
(order=1, model_type='global', n_jobs=1, **kwargs)¶ Bases:
sklearn.linear_model.base.LinearRegression
,epistasis.models.base.AbstractModel
Ordinary least-squares regression for estimating high-order, epistatic interactions in a genotype-phenotype map.
- Methods are described in the following publication:
- Sailer, Z. R. & Harms, M. J. ‘Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps’. Genetics 205, 1079-1088 (2017).
Parameters: - order (int) – order of epistasis
- model_type (str (default="global")) – model matrix type. See publication above for more information
-
fit
(X=None, y=None, **kwargs)¶
-
fit_transform
(X=None, y=None, **kwargs)¶
-
hypothesis
(X=None, thetas=None)¶
-
hypothesis_transform
(X=None, y=None, thetas=None)¶
-
lnlike_of_data
(X=None, y=None, yerr=None, thetas=None)¶
-
lnlike_transform
(X=None, y=None, yerr=None, lnprior=None, thetas=None)¶
-
num_of_params
¶
-
predict
(X=None)¶
-
predict_transform
(X=None, y=None)¶
-
score
(X=None, y=None)¶
-
thetas
¶
epistasis.models.linear.ridge module¶
-
class
epistasis.models.linear.ridge.
EpistasisRidge
(order=1, model_type='global', alpha=1.0, max_iter=1000, tol=0.0001, random_state=None, solver='auto', **kwargs)¶ Bases:
sklearn.linear_model.ridge.Ridge
,epistasis.models.base.AbstractModel
A scikit-learn Ridge Regression class for discovering sparse epistatic coefficients.
Parameters: - order (int) – order of epistasis
- model_type (str (default="global")) – model matrix type. See publication above for more information
- alpha (float) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the EpistasisLinearRegression object.
- max_iter (int) – The maximum number of iterations.
- tol (float) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
- random_state (int) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
- solver (str) – See scikit learn docs for Ridge.
-
compression_ratio
()¶ Compute the compression ratio for the Lasso regression
-
fit
(X=None, y=None, **kwargs)¶
-
fit_transform
(X=None, y=None, **kwargs)¶
-
hypothesis
(X=None, thetas=None)¶
-
hypothesis_transform
(X=None, y=None, thetas=None)¶
-
lnlike_of_data
(X=None, y=None, yerr=None, thetas=None)¶
-
lnlike_transform
(X=None, y=None, yerr=None, lnprior=None, thetas=None)¶
-
num_of_params
¶
-
predict
(X=None)¶
-
predict_transform
(X=None, y=None)¶
-
score
(X=None, y=None)¶
-
thetas
¶
epistasis.models.base module¶
-
class
epistasis.models.base.
AbstractModel
¶ Bases:
abc.ABC
Abstract Base Class for all epistasis models.
This class sets all docstrings not given in subclasses.
-
static
__new__
(*args, **kwargs)¶ Replace the docstrings of a subclass with docstrings in this base class.
-
add_X
(X=None, key=None)¶ Add X to Xbuilt
Keyword arguments for X:
- None :
- Uses
gpm.binary
to construct X. If genotypes are missing they will not be included in fit. At the end of fitting, an epistasis map attribute is attached to the model class.
Parameters: - X – see above for details.
- key (str) – name for storing the matrix.
Returns: Xbuilt – newly built 2d array matrix
Return type: numpy.ndarray
-
add_gpm
(gpm)¶ Add a GenotypePhenotypeMap object to the epistasis model.
-
fit
(X=None, y=None, **kwargs)¶ Fit model to data.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (None or ndarray (default=None)) – array of phenotypes. If None, the phenotypes in the attached genotype-phenotype map is used.
Returns: The model is returned. Allows chaining methods.
Return type: self
-
fit_transform
(X=None, y=None, **kwargs)¶ Fit model to data and transform output according to model.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (None or ndarray (default=None)) – array of phenotypes. If None, the phenotypes in the attached genotype-phenotype map is used.
Returns: gpm – The genotype-phenotype map object with transformed genotypes.
Return type: GenotypePhenotypeMap
-
gpm
¶ Data stored in a GenotypePhenotypeMap object.
-
hypothesis
(X=None, thetas=None)¶ Compute phenotypes from given model parameters.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns: y – array of phenotypes predicted by model parameters.
Return type: ndarray
-
hypothesis_transform
(X=None, y=None, thetas=None)¶ Transform phenotypes with given model parameters.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (ndarray) – An array of phenotypes to transform.
- thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns: y – array of phenotypes predicted by model parameters.
Return type: ndarray
-
lnlike_of_data
(X=None, y=None, yerr=None, thetas=None)¶ Compute the individUal log-likelihoods for each datapoint from a set of model parameters.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (ndarray) – An array of phenotypes to transform.
- yerr (ndarray) – An array of the measured phenotypes standard deviations.
- thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns: y – array of phenotypes predicted by model parameters.
Return type: ndarray
-
lnlike_transform
(X=None, y=None, yerr=None, lnprior=None, thetas=None)¶ Compute the individual log-likelihoods for each datapoint from a set of model parameters and a prior.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (ndarray) – An array of phenotypes to transform.
- yerr (ndarray) – An array of the measured phenotypes standard deviations.
- lnprior (ndarray) – An array of priors for a given datapoint.
- thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns: y – array of phenotypes predicted by model parameters.
Return type: ndarray
-
lnlikelihood
(X=None, y=None, yerr=None, thetas=None)¶ Compute the individal log-likelihoods for each datapoint from a set of model parameters.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (ndarray) – An array of phenotypes to transform.
- yerr (ndarray) – An array of the measured phenotypes standard deviations.
- thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns: lnlike – log-likelihood of the model parameters.
Return type: float
-
num_of_params
¶ Number of parameters in model.
-
predict
(X=None)¶ Use model to predict phenotypes for a given list of genotypes.
Parameters: X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model. Returns: y – array of phenotypes. Return type: ndarray
-
predict_transform
(X=None, y=None, **kwargs)¶ Transform a set of phenotypes according to the model.
Parameters: - X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
- y (ndarray) – An array of phenotypes to transform.
Returns: y_transform – array of phenotypes.
Return type: ndarray
-
static
-
class
epistasis.models.base.
BaseModel
¶ Bases:
epistasis.models.base.AbstractModel
,sklearn.base.RegressorMixin
,sklearn.base.BaseEstimator
Base model for defining an epistasis model.
-
exception
epistasis.models.base.
SubclassException
¶ Bases:
Exception
Subclass Exception for parent classes.
-
epistasis.models.base.
use_sklearn
(sklearn_class)¶ Swap out base classes in an Epistasis model class with a sklearn_class + AbstractModel.
epistasis.models.classifiers module¶
epistasis.models.ensemble module¶
epistasis.models.pipeline module¶
epistasis.models.utils module¶
-
exception
epistasis.models.utils.
FittingError
¶ Bases:
Exception
Exception Subclass for X matrix errors.
-
exception
epistasis.models.utils.
XMatrixException
¶ Bases:
Exception
Exception Subclass for X matrix errors.
-
epistasis.models.utils.
arghandler
(method)¶ Points methods to argument handlers. Assumes each argument has a corresponding method attached to the object named “_{argument}”. These methods given default values to arguments.
Ignores self and kwargs
Module contents¶
epistasis.pyplot package¶
epistasis.pyplot.coefs module¶
-
epistasis.pyplot.coefs.
plot_coefs
(model=None, sites=None, values=None, errors=None, **kwargs)¶ Create a barplot with the values from model, drawing the x-axis as a grid of boxes indicating the coordinate of the epistatic parameter. Should automatically generate an almost publication-quality figure.
Parameters: - model (BaseModel object) – epistasis model.
- sites (array) – array of epistatic indices/sites.
- values (array) – an array of epistatic coefficients
- errors (2d array or list) – upper and lower bounds for each beta.
Keyword Arguments: - logbase (numpy.ufunc (default=np.log10)) – function to transform into log space
- log_transform (bool (default=False)) – transform the values if true.
- order_colors – list/tuple of colors for each order (rgb,html string-like)
- significance – how to treat signifiance. should be 1. “bon” -> Bonferroni corrected p-values (default) 2. “p” -> raw p-values 3. None -> ignore significance
- significance_cutoff – value above which to consider a term significant
- sigmas – number of sigmas to show for each error bar
- y_scalar – how much to scale the y-axis above and beyond y-max
- y_axis_name – what to put on the y-axis of the barplot
- figsize – tuple of figure width,height
- height_ratio – how much to scale barplot relative to xbox
- star_cutoffs –
- signifiance cutoffs for star stack. should go from highest
- p to lowest p (least to most significant)
- star_spacer – constant that scales how closely stacked stars are from one another
- ybounds (tuple (default=None)) –
- bar_borders (bool (default=True)) –
- xgrid (bool (default=True)) –
- ecolor (color (default='black')) –
- elinewidth (float (default=1)) –
- capthick (float (default=1)) –
- capsize (float (default=1)) –
- gridlines (float (default=1)) – x grid linewidth
Returns: - fig (matplotlib.pyplot.Figure) – Figure object
- ax (matplotlib.pyplot.Axes) – Axes object
epistasis.pyplot.nonlinear module¶
-
epistasis.pyplot.nonlinear.
plot_power_transform
(model=None, yadd=None, yobs=None, yerr=None, function=None, cmap=None, color=None, s=50, ax=None, function_line=True, line_color='r', **kwargs)¶ Plot a Y_obs vs. Y_add showing the nonlinear scale in a genotype-phenotype map.
Parameters: - model ((default=None)) – Epistasis model.
- yadd (array (default=None)) – x-axis data. The additive model phenotypes.
- yobs (array (default=None)) – y-axis data. The observed phenotypes.
- yerr (array (default=None)) – y-axis error. Error in observed phenotypes.
- function (callable) – Nonlinear function.
- cmap (str) – Colormap name to map onto scatter points.
- color (str, array,) – color of phenotypes.
- s (int) – size of scatter points.
- ax (Axes) – Axes object to plot points on.
- function_line (bool) – If true, plots nonlinear function on top of points.
- line_color (matplotlib color.) – color of function line.
Returns: ax – Axes object with plot.
Return type: matplotlib.Axes
-
epistasis.pyplot.nonlinear.
plot_scale
(model=None, yadd=None, yobs=None, yerr=None, function=None, cmap=None, color=None, s=50, ax=None, function_line=True, line_color='r', **kwargs)¶ Plot a Y_obs vs. Y_add showing the nonlinear scale in a genotype-phenotype map.
Parameters: - model ((default=None)) – Epistasis model.
- yadd (array (default=None)) – x-axis data. The additive model phenotypes.
- yobs (array (default=None)) – y-axis data. The observed phenotypes.
- yerr (array (default=None)) – y-axis error. Error in observed phenotypes.
- function (callable) – Nonlinear function.
- cmap (str) – Colormap name to map onto scatter points.
- color (str, array,) – color of phenotypes.
- s (int) – size of scatter points.
- ax (Axes) – Axes object to plot points on.
- function_line (bool) – If true, plots nonlinear function on top of points.
- line_color (matplotlib color.) – color of function line.
Returns: ax – Axes object with plot.
Return type: matplotlib.Axes
epistasis.sampling package¶
epistasis.sampling.bayesian module¶
-
class
epistasis.sampling.bayesian.
BayesianSampler
(model, lnprior=None)¶ Bases:
object
A sampling class to estimate the uncertainties in an epistasis model’s coefficients using a Bayesian approach. This object samples from the experimental uncertainty in the phenotypes to estimate confidence intervals for the coefficients in an epistasis model according to Bayes Theorem:
\[P(H|E) = \frac{ P(E|H) \cdot P(H) }{ P(E) }\]This reads: “the probability of epistasis model \(H\) given the data \(E\) is equal to the probability of the data given the model times the probability of the model.”
Parameters: model – Epistasis model to run a bootstrap calculation. -
get_initial_walkers
(relative_widths=0.01)¶ Place the walkers in Gaussian balls in parameter space around the ML values for each coefficient.
-
static
lnprior
(thetas)¶ Prior probability for the given set of model parameters.
-
static
lnprob
(thetas, lnlike)¶ The posterior probability of a given set of model parameters and likelihood function.
-
sample
(n_steps=100, n_burn=0, previous_state=None)¶ Sample the likelihood of the model by walking n_steps with each walker.
-
epistasis.simulate package¶
epistasis.simulate.base module¶
epistasis.simulate.linear module¶
-
class
epistasis.simulate.linear.
LinearSimulation
(wildtype, mutations, model_type='global', **kwargs)¶ Bases:
epistasis.simulate.base.BaseSimulation
Construct an genotype-phenotype from linear building blocks and epistatic coefficients.
Example
Phenotype = b0 + b1 + b2 + b3 + b12 + b13 + b13 + b123
Parameters: - wildtype (strß) – Wildtype genotype
- mutations (dict) – Mapping for each site to its alphabet
- order (int) – Order of epistasis in simulated genotype-phenotype map
- betas (array-like) – values of epistatic coefficients (must be positive for this function to work. Log is taken)
- model_type (str) – Use a local or global (i.e. Walsh space) epistasis model to construct phenotypes
-
build
()¶ Build the phenotype map from epistatic interactions.
epistasis.simulate.power module¶
epistasis.mapping module¶
-
class
epistasis.mapping.
EpistasisMap
(sites, order=1, values=None, model_type='global')¶ Bases:
object
DataFrame for epistatic interactions.
-
get_orders
(*orders)¶ Get epistasis of a given order.
-
index
¶ Get the interaction index in interaction matrix.
-
map
(attr1, attr2)¶ Dictionary that maps attr1 to attr2.
-
model_type
¶ Type of epistasis model used to get coefficients.
-
n
¶ Return the number of Interactions.
-
order
¶ Get order of epistasis in system.
-
classmethod
read_dataframe
(df, model_type='global')¶ Create an epistasis model from dataframe
-
sites
¶ Get the interaction sites, which describe the position of interacting mutations in the genotypes. (type==list of lists, see self._build_interaction_sites)
-
to_csv
(filename)¶ Write data to a csv file.
-
to_dict
()¶ Get data as dictionary.
-
to_excel
(filename)¶ Write data to excel file.
-
values
¶ Get the values of the interaction in the system
-
-
class
epistasis.mapping.
Orders
(epistasismap, orders)¶ Bases:
map
An object that provides API for easily calling epistasis of a given order in an epistasis map.
-
__call__
()¶ return a dictionary
-
df
¶ Dataframe for orders object.
-
index
¶ Get indices of epistasis from this order.
-
sites
¶ Get epistatic sites
-
values
¶ Get values of epistasis for this order.
-
-
epistasis.mapping.
assert_epistasis
(method)¶ Assert that an epistasis map has been attached to the object.
-
epistasis.mapping.
genotype_coeffs
(genotype, order=None)¶ List the possible epistatic coefficients (as label form) for a binary genotype up to a given order.
-
epistasis.mapping.
key_to_site
(key)¶ Convert an interaction key to label.
-
epistasis.mapping.
mutations_to_sites
(order, mutations, start_order=0)¶ Build interaction sites up to nth order given a mutation alphabet.
Parameters: - order (int) – order of interactions
- mutations (dict) – mutations = { site_number : [“mutation-1”, “mutation-2”] }. If the site alphabet is note included, the model will assume binary between wildtype and derived.
Example
mutations = { 0: ["A", "V"], 1: ["A", "V"], ... }
Returns: sites – list of all interaction sites for system with sequences of a given length and epistasis with given order. Return type: list
-
epistasis.mapping.
site_to_key
(site, state='')¶ Convert site to key. state is added to end of key.
epistasis.matrix module¶
-
epistasis.matrix.
encode_vectors
(binary_genotypes, model_type='global')¶ Encode a set of binary genotypes is input vectors for the given model
-
epistasis.matrix.
get_model_matrix
(binary_genotypes, sites, model_type='global')¶ Get a model matrix for a given set of genotypes and coefficients.
epistasis.stats module¶
Submodule with useful statistics functions for epistasis model.
-
epistasis.stats.
aic
(model)¶ Given a model, calculates an AIC score.
-
epistasis.stats.
chi_squared
(y_obs, y_pred)¶ Calculate the chi squared between observed and predicted y.
-
epistasis.stats.
explained_variance
(y_obs, y_pred)¶ Returns the explained variance
-
epistasis.stats.
false_negative_rate
(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)¶ Calculate the false negative rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false negatives found in the predicted given the number of samples and sigmas.
- The defined bounds are:
- (number of sigmas) * errors / sqrt(number of samples)
Parameters: - known (array-like) – Known values for comparing false negatives
- predicted (array-like) – Predicted values
- errors (array-like) – Standard error from model
- n_samples (int) – number of replicate samples
- sigma (int (default=2)) – How many standard errors away (2 == 0.05 false negative rate)
Returns: rate – False negative rate in data
Return type: float
-
epistasis.stats.
false_positive_rate
(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)¶ Calculate the false positive rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false positives found in the predicted given the number of samples and sigmas.
- The defined bounds are:
- (number of sigmas) * errors / sqrt(number of samples)
Parameters: - known (array-like) – Known values for comparing false positives
- predicted (array-like) – Predicted values
- errors (array-like) – Standard error from model
- n_samples (int) – number of replicate samples
- sigma (int (default=2)) – How many standard errors away (2 == 0.05 false positive rate)
Returns: rate – False positive rate in data
Return type: float
-
epistasis.stats.
generalized_r2
(y_obs, y_pred)¶ Calculate the rquared between the observed and predicted y. See wikipedia definition of coefficient of determination.
-
epistasis.stats.
gmean
(x)¶ Calculate a geometric mean with zero and negative values.
Following the gmean calculation from this paper:
Habib, Elsayed AE. “Geometric mean for negative and zero values.” International Journal of Research and Reviews in Applied Sciences 11 (2012): 419-432.
-
epistasis.stats.
incremental_mean
(old_mean, samples, M, N)¶ Calculate an incremental running mean.
Parameters: - old_mean (float or array) – current running mean(s) before adding samples
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean
-
epistasis.stats.
incremental_std
(old_mean, old_std, new_mean, samples, M, N)¶ Calculate an incremental standard deviation.
Parameters: - old_mean (float or array) – current running mean(s) before adding samples
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean
-
epistasis.stats.
incremental_var
(old_mean, old_var, new_mean, samples, M, N)¶ Calculate an incremental variance.
Parameters: - old_mean (float or array) – current running mean(s) before adding samples
- old_var (float or array) – current running variance(s) before adding samples
- new_mean (float) – updated mean
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean
-
epistasis.stats.
pearson
(y_obs, y_pred)¶ Calculate pearson coefficient between two variables.
-
epistasis.stats.
rmsd
(yobs, ypred)¶ Calculate the root mean squared deviation of an estimator.
-
epistasis.stats.
split_data
(data, fraction=1.0)¶ Split DataFrame into two sets, a training and a test set.
Parameters: - data (pandas.DataFrame) – full dataset to split.
- fraction (float) – fraction in training set.
Returns: - train_set (pandas.DataFrame) – training set.
- test_set (pandas.DataFrame) – test set.
-
epistasis.stats.
ss_residuals
(y_obs, y_pred)¶ calculate residuals
Example Gallery¶
Fitting nonlinear genotype-phenotype maps¶
Use a linear, logistic regression model to estimate the positive/negative effects of mutations.

# Imports
import matplotlib.pyplot as plt
from gpmap.simulate import MountFujiSimulation
from epistasis.models import EpistasisPowerTransform
from epistasis.pyplot import plot_power_transform
# The data
gpm = MountFujiSimulation.from_length(4, field_strength=-1, roughness=(-1,1))
# Initialize a model
model = EpistasisPowerTransform(lmbda=1, A=0, B=0)
model.add_gpm(gpm)
# Fit the model
model.fit()
fig, ax = plt.subplots(figsize=(2.5,2.5))
ax = plot_power_transform(model, ax=ax)
plt.show()
Total running time of the script: ( 0 minutes 1.979 seconds)
High-order epistasis model¶
Estimate high-order epistatic coefficients in arbitrary genotype-phenotype maps. A linear epistasis model fits high-order interaction terms to capture variation in phenotype.

# Imports
import matplotlib.pyplot as plt
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
from epistasis.pyplot import plot_coefs
# The data
wildtype = "000"
genotypes = ['000', '001', '010', '011', '100', '101', '110', '111']
phenotypes = [ 0.366, -0.593, 1.595, -0.753, 0.38 , 1.296, 1.025, -0.519]
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Initialize a model
model = EpistasisLinearRegression(order=3)
model.add_gpm(gpm)
# Fit the model
model.fit()
fig, ax = plot_coefs(model, figsize=(2,3))
plt.show()
Total running time of the script: ( 0 minutes 0.719 seconds)
Handling dead phenotypes¶
Use a linear, logistic regression model to estimate the positive/negative effects of mutations.

# Imports
import matplotlib.pyplot as plt
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLogisticRegression
from epistasis.pyplot import plot_coefs
# The data
wildtype = "000"
genotypes = ['000', '001', '010', '011', '100', '101', '110', '111']
phenotypes = [ 0.366, -0.593, 1.595, -0.753, 0.38 , 1.296, 1.025, -0.519]
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)
# Threshold
threshold = 1.0
# Initialize a model
model = EpistasisLogisticRegression(threshold=threshold)
model.add_gpm(gpm)
# Fit the model
model.fit()
fig, ax = plot_coefs(model, figsize=(1,3))
plt.show()
Total running time of the script: ( 0 minutes 0.085 seconds)