# Advanced topics¶

## Building your own X matrix¶

The X matrix in an epistasis model maps genotypes to epistatic coefficients.
The rows represent the binary representation of genotypes, and the columns
represent the epistatic coefficients. In a “local” epistasis model, an element
in the matrix is `1`

if the genotype in that element’s row has the epistatic
interaction in that element’s column. Otherwise, the element is a `0`

(see the example below).

If you want to generate your own X matrix, you’ll need a list of genotypes (in their
binary representation) and a list of epistatic coefficients you’d like to fit.
Then, you can generate X using the `get_model_matrix`

function. Here is an example:

**Input**

```
# imports
from epistasis.matrix import get_model_matrix
# Epistasis coefficients
coefs = [
[0], # Reference state
[1], [2], [3], # Additive coefficients
[1, 2], [1, 3], [2, 3], # Pairwise coefficients
[1, 2, 3] # Third order coefficients
]
# List of binary coefficients
binary_genotypes = [
'000',
'001',
'010',
'100',
'011',
'101',
'110',
'111'
]
# Build X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
```

**Output**

```
X = array([
[1, 0, 0, 0, 0, 0, 0, 0],
[1, 0, 0, 1, 0, 0, 0, 0],
[1, 0, 1, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0],
[1, 0, 1, 1, 0, 0, 1, 0],
[1, 1, 0, 1, 0, 1, 0, 0],
[1, 1, 1, 0, 1, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1]
])
```

**Input**

You can easily generate the coefficient list using a few helper functions:

```
from epistasis.mapping import mutations_to_sites
mutations = {
0: ['0', '1'],
1: ['0', '1'],
2: ['0', '1']
}
# Same as the list above!
coefs = mutations_to_sites(order=3, mutations=mutations)
```

**Output**

```
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
```

## Write your own linear model¶

A linear, high-order epistasis model is a linear transformation of phenotypes \(\vec{P}\) (length L) to (Lth order) epistatic coefficients \(\vec{\beta}\) using the \(\mathbf{X}\) matrix described in the previous section.

In Python 3, this looks like:

```
from epistasis.matrix import get_model_matrix
# See the section above to get `coefs`
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
# Numerical values for each coefficient
beta = [
1.0,
0.2, 0.3, 0.1,
0.1, 0.05, 0.01,
0.05
]
# Build the X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
# Do the dot product
phenotypes = X @ beta
```

To compute linear epistatic coefficients from phenotypes, take the inverse of the equation above:

In Python 3, this looks like:

```
import numpy as np
from epistasis.matrix import get_model_matrix
# See the section above to get `coefs`
coefs = [
[0],
[1], [2], [3],
[1, 2], [1, 3], [2, 3],
[1, 2, 3]
]
# Binary genotypes
binary_genotypes = ['000','001','010','100','011','101','110', '111']
# Quantitative phenotype values.
phenotypes = [1, 1.1, 1.2, 1.2, 1.3, 1.5, 1.7, 1.8]
# Build the X matrix
X = get_model_matrix(binary_genotypes, coefs, model_type='local')
# Take the inverse
X_inv = np.inv(X)
# Do the dot product
beta = X_inv @ phenotypes
```

## Setting bounds on nonlinear fits¶

All nonlinear epistasis models use lmfit to estimate a nonlinear scale in an
arbitrary genotype-phenotype map. Each model creates `lmfit.Parameter`

objects
for each parameter in the input function and contains them in the `parameters`

attribute as `lmfit.Parameters`

object. Thus, you can set the bounds, initial
guesses, etc. on the parameters following lmfit’s API. The model, then, minimizes
the squared residuals using the `lmfit.minimize`

function. The results are
stored in the `Nonlinear`

object.

In the example below, we use a `EpistasisPowerTransform`

to demonstrate how to
access the lmfit API.

```
# Import a nonlinear model (this case, Power transform)
from epistasis.models import NonlinearPowerTransform
model = NonlinearPowerTransform(order=1)
model.parameters['lmbda'].set(value=1, min=0, max=10)
model.parameters['A'].set(value=10, min=0, max=100)
model.fit()
```

Access information about the minimizer results using the `Nonlinear`

attribute.

```
# Pretty print the results!
model.Nonlinear.params.pretty_print()
```

## Large genotype-phenotype maps¶

We have not tested the `epistasis`

package on large genotype-phenotype maps (>5000 genotypes). In principle,
it should be no problem as long as you have the resources (i.e. tons of RAM and time). However, it’s possible there may be issues with convergence
and numerical rounding for these large spaces.

### Reach out!¶

If you have a large dataset, please get in touch! We’d love to hear from you. Feel free to try cranking our models on your large dataset and let us know if you run into issues.

### My nonlinear fit is slow and does not converge.¶

Try fitting the scale of your map using a fraction of your data. We’ve found that you can typically estimate the nonlinear scale of a genotype-phenotype map from a small fraction of the genotypes. Choose a random subset of your data and fit it using a first order nonlinear model. Then use that model to linearize all your phenotype

```
from gpmap import GenotypePhenotypeMap
from epistasis.models import (EpistasisPowerTransform,
EpistasisLinearRegression)
# Load data.
gpm = GenotypePhenotypeMap.read_csv('data.csv')
# Subset the data
data_subset = gpm.data.sample(frac=0.5)
gpm_subset = GenotypePhenotypeMap.read_dataframe(data_subset)
# Fit the subset
nonlinear = EpistasisPowerTransform(order=1, lmbda=1, A=0, B=0)
nonlinear.add_gpm(gpm_subset).fit()
# Linearize the original phenotypes to estimate epistasis.
#
# Note: power transform calculate the geometric mean of the additive
# phenotypes, so we need to pass those phenotypes to the reverse transform.
padd = nonlinear.Additive.predict(X='fit')
linear_phenotypes = nonlinear.reverse(gpm.phenotypes,
*nonlinear.parameters.values(),
data=padd)
# Change phenotypes (note this changes the original dataframe)
gpm.data.phenotypes = linear_phenotypes
model = EpistasisLinearRegression(order=10)
model.add_gpm(gpm)
# Fit the model
model.fit()
```

## Estimating model uncertainty¶

The epistasis package includes a `sampling`

module for estimating uncertainty in
all coefficients in (Non)linear epistasis models. It follows a Bayesian approach,
and uses the emcee python package to approximate the posterior distributions
for each coefficient.

### Basic example¶

Use the `BayesianSampler`

object to sample your epistasis model. The sampler
stores an MCMC chain

(The plot below was created using the `corner`

package.)

```
# Imports
import matplotlib.pyplot as plt
import numpy as np
import corner
from epistasis.simulate import LinearSimulation
from epistasis.models import EpistasisLinearRegression
from epistasis.sampling.bayesian import BayesianSampler
# Create a simulated genotype-phenotype map with epistasis.
sim = LinearSimulation.from_length(3, model_type="local")
sim.set_coefs_order(3)
sim.set_coefs_random((-1,1))
sim.set_stdeviations([0.01])
# Initialize an epistasis model and fit a ML model.
model = EpistasisLinearRegression(order=3, model_type="local")
model.add_gpm(sim)
model.fit()
# Initialize a sampler.
sampler = BayesianSampler(model)
samples, rstate = sampler.sample(500)
# Plot the Posterior
fig = corner.corner(samples, truths=sim.epistasis.values)
```

### Defining a prior¶

The default prior for a BayesianSampler is a flat prior (`BayesianSampler.lnprior()`

returns a log-prior equal to 0). To set your own prior, define your own function
that called `lnprior`

that returns a log prior for a set of coefs and reset
the BayesianSampler static method:

```
def lnprior(coefs):
# Set bound on the first coefficient.
if coefs[0] < 0:
return -np.inf
return 0
# Apply to fitter from above
fitter.lnprior = lnprior
```