Advanced topics¶
Building your own X matrix¶
The X matrix in an epistasis model maps genotypes to epistatic coefficients.
The rows represent the binary representation of genotypes, and the columns
represent the epistatic coefficients. In a “local” epistasis model, an element
in the matrix is 1
if the genotype in that element’s row has the epistatic
interaction in that element’s column. Otherwise, the element is a 0
(see the example below).
If you want to generate your own X matrix, you’ll need a list of genotypes (in their
binary representation) and a list of epistatic coefficients you’d like to fit.
Then, you can generate X using the get_model_matrix
function. Here is an example:
Input
# imports
from epistasis.matrix import get_model_matrix
# Epistasis coefficients
coefs = [
[0], # Reference state
[1], [2], [3], # Additive coefficients
[1, 2], [1, 3], [2, 3], # Pairwise coefficients
[1, 2, 3] # Third order coefficients
]
# List of binary coefficients
binary_genotypes = [
'000',
'001',
'010',
'100',
'011',
'101',
'110',
'111'
]
# Build X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
Output
X = array([
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 1, 0],
[1, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 1, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1]
])
Input
You can easily generate the coefficient list using a few helper functions:
from epistasis.mapping import mutations_to_sites
mutations = {
0: ['0', '1'],
1: ['0', '1'],
2: ['0', '1']
}
# Same as the list above!
coefs = mutations_to_sites(order=3, mutations=mutations)
Output
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
Write your own linear model¶
A linear, high-order epistasis model is a linear transformation of phenotypes \(\vec{P}\) (length L) to (Lth order) epistatic coefficients \(\vec{\beta}\) using the \(\mathbf{X}\) matrix described in the previous section.
In Python 3, this looks like:
from epistasis.matrix import get_model_matrix
# See the section above to get `coefs`
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
# Numerical values for each coefficient
beta = [
1.0,
0.2, 0.3, 0.1,
0.1, 0.05, 0.01,
0.05
]
# Build the X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
# Do the dot product
phenotypes = X @ beta
To compute linear epistatic coefficients from phenotypes, take the inverse of the equation above:
In Python 3, this looks like:
import numpy as np
from epistasis.matrix import get_model_matrix
# See the section above to get `coefs`
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
# Binary genotypes
binary_genotypes = ['000','001','010','100','011','101','110', '111']
# Quantitative phenotype values.
phenotypes = [1, 1.1, 1.2, 1.2, 1.3, 1.5, 1.7, 1.8]
# Build the X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
# Take the inverse
X_inv = np.inv(X)
# Do the dot product
beta = X_inv @ phenotypes
Setting bounds on nonlinear fits¶
All nonlinear epistasis models use lmfit to estimate a nonlinear scale in an
arbitrary genotype-phenotype map. Each model creates lmfit.Parameter
objects
for each parameter in the input function and contains them in the parameters
attribute as lmfit.Parameters
object. Thus, you can set the bounds, initial
guesses, etc. on the parameters following lmfit’s API. The model, then, minimizes
the squared residuals using the lmfit.minimize
function. The results are
stored in the Nonlinear
object.
In the example below, we use a EpistasisPowerTransform
to demonstrate how to
access the lmfit API.
# Import a nonlinear model (this case, Power transform)
from epistasis.models import NonlinearPowerTransform
model = NonlinearPowerTransform(order=1)
model.parameters['lmbda'].set(value=1, min=0, max=10)
model.parameters['A'].set(value=10, min=0, max=100)
model.fit()
Access information about the minimizer results using the Nonlinear
attribute.
# Pretty print the results!
model.Nonlinear.params.pretty_print()
Large genotype-phenotype maps¶
We have not tested the epistasis
package on large genotype-phenotype maps (>5000 genotypes). In principle,
it should be no problem as long as you have the resources (i.e. tons of RAM and time). However, it’s possible there may be issues with convergence
and numerical rounding for these large spaces.
Reach out!¶
If you have a large dataset, please get in touch! We’d love to hear from you. Feel free to try cranking our models on your large dataset and let us know if you run into issues.
My nonlinear fit is slow and does not converge.¶
Try fitting the scale of your map using a fraction of your data. We’ve found that you can typically estimate the nonlinear scale of a genotype-phenotype map from a small fraction of the genotypes. Choose a random subset of your data and fit it using a first order nonlinear model. Then use that model to linearize all your phenotype
from gpmap import GenotypePhenotypeMap
from epistasis.models import (EpistasisPowerTransform,
EpistasisLinearRegression)
# Load data.
gpm = GenotypePhenotypeMap.read_csv('data.csv')
# Subset the data
data_subset = gpm.data.sample(frac=0.5)
gpm_subset = GenotypePhenotypeMap.read_dataframe(data_subset)
# Fit the subset
nonlinear = EpistasisPowerTransform(order=1, lmbda=1, A=0, B=0)
nonlinear.add_gpm(gpm_subset).fit()
# Linearize the original phenotypes to estimate epistasis.
#
# Note: power transform calculate the geometric mean of the additive
# phenotypes, so we need to pass those phenotypes to the reverse transform.
padd = nonlinear.Additive.predict(X='fit')
linear_phenotypes = nonlinear.reverse(gpm.phenotypes,
*nonlinear.parameters.values(),
data=padd)
# Change phenotypes (note this changes the original dataframe)
gpm.data.phenotypes = linear_phenotypes
model = EpistasisLinearRegression(order=10)
model.add_gpm(gpm)
# Fit the model
model.fit()
Estimating model uncertainty¶
The epistasis package includes a sampling
module for estimating uncertainty in
all coefficients in (Non)linear epistasis models. It follows a Bayesian approach,
and uses the emcee python package to approximate the posterior distributions
for each coefficient.
Basic example¶
Use the BayesianSampler
object to sample your epistasis model. The sampler
stores an MCMC chain
(The plot below was created using the corner
package.)
# Imports
import matplotlib.pyplot as plt
import numpy as np
import corner
from epistasis.simulate import LinearSimulation
from epistasis.models import EpistasisLinearRegression
from epistasis.sampling.bayesian import BayesianSampler
# Create a simulated genotype-phenotype map with epistasis.
sim = LinearSimulation.from_length(3, model_type="local")
sim.set_coefs_order(3)
sim.set_coefs_random((-1,1))
sim.set_stdeviations([0.01])
# Initialize an epistasis model and fit a ML model.
model = EpistasisLinearRegression(order=3, model_type="local")
model.add_gpm(sim)
model.fit()
# Initialize a sampler.
sampler = BayesianSampler(model)
samples, rstate = sampler.sample(500)
# Plot the Posterior
fig = corner.corner(samples, truths=sim.epistasis.values)

Defining a prior¶
The default prior for a BayesianSampler is a flat prior (BayesianSampler.lnprior()
returns a log-prior equal to 0). To set your own prior, define your own function
that called lnprior
that returns a log prior for a set of coefs and reset
the BayesianSampler static method:
def lnprior(coefs):
# Set bound on the first coefficient.
if coefs[0] < 0:
return -np.inf
return 0
# Apply to fitter from above
fitter.lnprior = lnprior