epistasis

A Python API for modeling statistical, high-order epistasis in genotype-phenotype maps. You can use this library to:

  1. Decompose genotype-phenotype maps into high-order epistatic interactions
  2. Find nonlinear scales in the genotype-phenotype map
  3. Calculate the contributions of different epistatic orders and
  4. Estimate the uncertainty in the epistatic coefficients and

For more information about the epistasis models in this library, see our Genetics paper:

This library is built on top of well known tools in the scientific Python stack. It uses packages like matplotlib, numpy, scipy, scikit-learn, and pandas. We strive to follow similar inferface designs present in this ecosytem. If you notice ways we can improve, please open an issue on Github! We’d love to hear your feedback.

Currently, this package works only as an API. There is no command-line interface. Instead, we encourage you use this package inside Jupyter notebooks .

Basic Example

Fit an epistasis model to genotype-phenotype map data.

# Import a model and the plotting module
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression
from epistasis.pyplot import plot_coefs

# Genotype-phenotype map data.
wildtype = "AAA"
genotypes = ["ATT", "AAT", "ATA", "TAA", "ATT", "TAT", "TTA", "TTT"]
phenotypes = [0.1, 0.2, 0.4, 0.3, 0.3, 0.6, 0.8, 1.0]

# Create genotype-phenotype map object.
gpm = GenotypePhenotypeMap(wildtype=wildtype,
                           genotypes=genotypes,
                           phenotypes=phenotypes)

# Initialize an epistasis model.
model = EpistasisLinearRegression(order=3)

# Add the genotype phenotype map.
model.add_gpm(gpm)

# Fit model to given genotype-phenotype map.
model.fit()

# Plot coefs.
fig, axes = plot_coefs(model, figsize=(2,4))
_images/basic-example.png

Documentation

We are still working hard on the Docs! You may notice blank spots in various places. We appreciate your patience as we try to catch up on docs.

Quick Guide

Introduction

epistasis is a Python library that includes models to estimate statistical, high-order epistasis in genotype-phenotype maps. Using this library, you can

  1. Decompose genotype-phenotype maps into high-order epistatic interactions
  2. Find nonlinear scales in the genotype-phenotype map
  3. Calculate the contributions of different epistatic orders and
  4. Estimate the uncertainty in the epistatic coefficients and

For more information about the epistasis models in this library, see our Genetics paper:

Simple Example

Follow these five steps for all epistasis models in this library:

  1. Import a model. There many models available in the epistasis.models module. See the full list in the next section.
from epistasis.models import EpistasisLinearRegression
  1. Initialize a model. Set the order, choose the type of model (see Anatomy of an epistasis model for more info), and set any other parameters in the model.
model = EpistasisLinearRegression(order=3, model_type='global')
  1. Add some data. There are three basic ways to do this. 1. Pass data directly to the epistasis model using the add_data method. 2. Read data from a separate file using one of the read_ methods. 3. (The best option.) load data into a GenotypePhenotypeMap object from the GPMap library and add it to the epistasis model.
from gpmap import GenotypePhenotypeMap

datafile = 'data.csv'
gpm = GenotypePhenotypeMap.read_csv(datafile)

# Add the data.
model.add_gpm(gpm)

# model now has a `gpm` attribute.
  1. Fit the model. Each model has a simple fit method. Call this to estimate epistatic coefficients. The results are stored the epistasis attribute.
# Call fit method
model.fit()

# model now has an ``epistasis`` attribute
  1. Plot the results. The epistasis library has a pyplot module (powered by matplotlib) with a few builtin plotting functions.
from epistasis.pyplot import plot_coefs

fig, ax = plot_coefs(model.epistasis.sites, model.epistasis.values)
_images/basic-example1.png

Install and dependencies

For users

This library is now available on PyPi, so it can be installed using pip.

pip install epistasis
For developers

For the latest version of the package, you can also clone from Github and install a development version using pip.

git clone https://github.com/harmslab/epistasis
cd epistasis
pip install -e .
Dependencies

The following dependencies are required for the epistasis package.

  • gpmap: Module for constructing powerful genotype-phenotype map python data-structures.
  • Scikit-learn: Simple to use machine-learning API.
  • Numpy: Python’s array manipulation package.
  • Scipy: Efficient scientific array manipulations and fitting.
  • Pandas: High-performance, easy-to-use data structures and data analysis tools.

There are also some additional dependencies for extra features included in the package.

Running tests

The epistasis package comes with a suite of tests. Running the tests require pytest, so make sure it is installed.

pip install -U pytest

Once pytest is installed, run the tests from the base directory of the epistasis package using the following command.

pytest

Detailed list of models

This page lists all models included in the Epistasis Package.

EpistasisLinearRegression

A linear, high-order epistasis model. This uses an ordinary least-squares regression to estimate high-order, epistatic coefficients in an arbitrary genotype-phenotype map. Simple define the order of the model.

from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisLinearRegression(order=2)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit()

EpistasisRidge

A L2-norm epistasis model for estimating sparse epistatic coefficients. The optimization function imposes a penalty on the number of coefficients and finds the model that maximally explains the data while using the fewest coefficients.

from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisRidge

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisRidge(order=2)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit()

EpistasisLasso

A L1-norm epistasis model for estimating sparse epistatic coefficients. The optimization function imposes a penalty on the number of coefficients and finds the model that maximally explains the data while using the fewest coefficients.

from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLasso

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisLasso(order=2)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit()

EpistasisNonlinearRegression

A nonlinear, high-order epistasis model. This uses nonlinear, least-squares regression (provided by lmfit) to estimate high-order, epistatic coefficients in an arbitrary genotype-phenotype map.

This models has three steps:
  1. Fit an additive, linear regression to approximate the average effect of individual mutations.
  2. Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1. This function is defined by the user as a callable python function
  3. Transform the phenotypes to this linear scale and fit leftover variation with high-order epistasis model.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

def func(x, A):
    return np.exp(A * x)

def reverse(y, A):
    return np.log(x) / A

# Initialize the data.
model = EpistasisNonlinearRegression(order=2, function=func, reverse=reverse)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit(A=1)

EpistasisNonlinearLasso

A nonlinear, high-order epistasis model. This uses nonlinear, least-squares regression (provided by lmfit) to estimate high-order, epistatic coefficients in an arbitrary genotype-phenotype map.

This models has three steps:
  1. Fit an additive, linear regression to approximate the average effect of individual mutations.
  2. Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1. This function is defined by the user as a callable python function
  3. Transform the phenotypes to this linear scale and fit leftover variation with an EpistasisLasso.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

  def func(x, A):
      return np.exp(A * x)

  def reverse(y, A):
      return np.log(x) / A

  # Initialize the data.
  model = EpistasisNonlinearLasso(order=3, function=func, reverse=reverse)

  # Add Genotype-phenotype map data.
  model.add_gpm(gpm)

  # Fit the model.
  model.fit(A=1)

EpistasisPowerTransform

Use power-transform function, via nonlinear least-squares regression, to estimate epistatic coefficients and the nonlinear scale in a nonlinear genotype-phenotype map.

Like the nonlinear model, this model has three steps:
  1. Fit an additive, linear regression to approximate the average effect of individual mutations.
  2. Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1.
  3. Transform the phenotypes to this linear scale and fit leftover variation with high-order epistasis model.

Methods are described in the following publication:

Sailer, Z. R. & Harms, M. J. ‘Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps’. Genetics 205, 1079-1088 (2017).
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisPowerTransform(order=3)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit(lmbda=1, A=1, B=1)

EpistasisPowerLasso

Use power-transform function, via nonlinear least-squares regression, to estimate epistatic coefficients and the nonlinear scale in a nonlinear genotype-phenotype map.

Like the nonlinear model, this model has three steps:
  1. Fit an additive, linear regression to approximate the average effect of individual mutations.
  2. Fit the nonlinear function to the observed phenotypes vs. the additive phenotypes estimated in step 1.
  3. Transform the phenotypes to this linear scale and fit leftover variation with an EpistasisLasso.
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisLinearRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisPowerTransformLasso(order=3)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit(lmbda=1, A=1, B=1)

EpistasisLogisticRegression

A high-order epistasis regression that classifies genotypes as viable/nonviable (given some threshold).

from epistasis.models import EpistasisLogisticRegression

wildtype = '00'
genotypes = ['00', '01', '10', '11']
phenotypes = [0, .2, .1, 1]

# Initialize the data.
model = EpistasisLogisticRegression(order=1, threshold=.1)

# Add Genotype-phenotype map data.
model.add_data(wildtype, genotypes, phenotypes)

# Fit the model.
model.fit()

EpistasisEnsembleRegression

A regression object that models phenotypes as a statistical (Boltmann-weighted) average of “states”. Mutations are modeled as having different effects in each state.

\[P = \text{ln} ( \sum_{x=\{\text{A,B,...}\}} - \text{exp}(\beta_{0; x} + \beta_{1; x} + ... + \beta_{1,2; x}+ ...) )\]
from gpmap import GenotypePhenotypeMap
from epistasis.models import EpistasisEnsembleRegression

wildtype = 'AA'
genotypes = ['AA', 'AT', 'TA', 'TT']
phenotypes = [0.1, 0.2, 0.7, 1.2]

# Read genotype-phenotype map.
gpm = GenotypePhenotypeMap(wildtype, genotypes, phenotypes)

# Initialize the data.
model = EpistasisEnsembleRegression(order=1, nstates=1)

# Add Genotype-phenotype map data.
model.add_gpm(gpm)

# Fit the model.
model.fit()

# Print effects in state A.
print(model.state_A.epistasis.values)

Anatomy of an epistasis model

The X matrix

The most critical piece of an epistasis models is the X (model) matrix. This matrix maps genotypes to epistatic coefficients. You can read about this matrix in this paper.

There are two popular X matrices that exist in the epistasis literature, the global model (a.k.a. background-averaged model) and local model (a.k.a. biochemical model). All epistasis models in this API takes a model_type keyword argument that tells the model which matrix to use. Read the paper mentioned above for more information on which model to use.

Constructing these matrices for your dataset is no easy task, so each epistasis model can handle this construction internally. Most methods automatically infer X from the genotype-phenotype map.

Any X matrix used by an epistasis model is also stored in the Xbuilt attribute. This speeds up fitting algorithms that may need resample fitting many times.

Methods in every epistasis model

Every epistasis model includes the following methods:

  • fit : fit the model to an attached genotype-phenotype map, or X and y data.
  • predict : predict phenotypes using the X matrix or keywords (listed above). If a keyword is used, the phenotypes are in the same order as the genotypes to the corresponding keyword.
  • score : the pearson coefficients between the predicted phenotypes and the given data (X/y data or attached genotype-phenotype map).
  • thetas : flattened array of 1) nonlinear parameters and 2) epistatic-coefficients estimated by model.
  • hypothesis : computes the phenotypes for X data given a thetas array.
  • lnlike_of_data : returns an array of log-likelihoods for each data point given a thetas array.
  • lnlikelihood : returns the total log-likelihood for X/y data given a thetas array.

Methods in nonlinear models

The extra attributes below are attached to nonlinear epistasis models.

  • Additive : a first-order EpistasisLinearRegression used to approximate the additive effects of mutations
  • Nonlinear : a lmfit.MinizerResults object returned by the lmfit.minimize function for estimating the nonlinear scale in a genotype-phenotype map.

Building an epistasis pipeline

The EpistasisPipeline object allows you to link epistasis models in series. Define each mode and add them to a pipeline. When fit is called, this object runs a cascade of fit_transforms.

This is particularly useful if you need to remove nonlinearity in a genotype-phenotype map before fitting high-order epistasis (see this paper).

EpistasisPipeline inherits Python’s list type. This means you can append, prepend, pop, etc. from the pipeline after initialization. Each model is fit in the order it appears in the pipeline.

Simple Example

In the example below, the power transform linearizes the map, then fits specific high-order epistasis on the linear scale. The fitted model is then used to predict the phenotype of an unknown genotype.

from epistasis import EpistasisPipeline
from epistasis.models import (EpistasisPowerTransform,
                              EpistasisLinearRegression)

# Define genotype-phenotype map.
gpm = GenotypePhenotyeMap(
  wildtype='AA'
  genotypes=['AA', 'AV','VV'],  # Note that we're missing the 'VA' genotype
  phenotypes=[0, .5, 1]
)

# Construct pipeline.
model = EpistasisPipeline([
  EpistasisPowerTransform(lmbda=1, A=0, B=0),
  EpistasisLinearRegression(order=2)
])

# Fit pipeline.
model.fit()

# Predict missing phenotype of missing genotype.
model.predict(['VA'])

Simulating epistatic genotype-phenotype maps

Simulate rough, epistatic genotype-phenotype maps using the simulate module.

LinearSimulation

The following examples show a variety ways to simulate a genotype-phenotype map with linear, high-order epistatic interactions. The simulation interface provides methods to easily dictate the construction of a simulated genotype-phenotype map.

from epistasis.simulate import LinearSimulation

# Define the wildtype sequence and possible mutations at each site.
wildtype = "0000"
mutations = {
    0: ["0", "1"], # mutation alphabet for site 0
    1: ["0", "1"],
    2: ["0", "1"],
    3: ["0", "1"]
}
# Initialize a simulation
gpm = LinearSimulation(wildtype, mutations)

# Set the order of epistasis
gpm.set_coefs_order(4)

# Generate random epistatic coefficients
coef_range = (-1, 1) # coefs between -1 and 1
gpm.set_coefs_random(coef_range)

Alternatively, you can quickly simulate a binary genotype-phenotype map if you’re fine with a simple, binary alphabet at each site.

# define the length of genotypes and the order of epistasis
length = 4
gpm = LinearSimulation.from_length(length)

# Generate random epistatic coefs
gpm.set_coefs_order(4)
gpm.set_coefs_random(coef_range)

For all simulated genotype-phenotype maps, one can initialize a genotype-phenotype map from an existing dataset. Scroll through class methods that start with from_ to see all options for initializing simulated genotype-phenotype maps.

NonlinearSimulation

Simulate a nonlinear, epistatic genotype-phenotype map using NonlinearSimulation. Simply define a function which transforms a linear genotype-phenotype map onto a nonlinear scale. Note, the function must have x as the first argument. This argument represents the linearized phenotypes to be transformed.

from epistasis.simulate import NonlinearSimulation

def saturating_scale(x, K):
    return ((K+1)*x)/(K+x)

# Define the initial value for the paramter, K
p0 = [2]

gpm = NonlinearSimulation.from_length(4, function=saturating_scale, p0=p0)
gpm.set_coefs_order(4)
gpm.set_coefs_random((0,1))

Multiplicative Example

Multiplicative epistasis is a common nonlinear, phenotypic scale. Simulate this type of map using the NonlinearSimulation class.

\[\begin{split}\begin{eqnarray} p & = & \beta_1 \beta_2 \beta_{1,2} \\ p & = & e^{ln(\beta_1) + ln(\beta_2) + ln(\beta_{1,2})} \end{eqnarray}\end{split}\]

Using the epistasis package, this looks like the following example. First, define the exponential function as the nonlinear scale passed into the Simulation class.

import numpy as np
from epistasis.simulation import NonlinearSimulation

def multiplicative(x):
    return np.exp(x)

gpm = NonlinearSimulation.from_length(4, function=multiplicative)

Then, define the epistatic coefficients, take their log, and pass them into the simulation object.

# Set the order of epistasis
gpm.set_coefs_order(4)

# generate random coefs
coefs = np.random.uniform(0,3, size=len(gpm.epistasis.labels))

# Take the log of the coefs
log_coefs = np.log(coefs)

# Pass coefs into the simulation class.
gpm.set_coefs_values(log_coefs)

Plotting in the epistasis package

The epistasis package comes with a few functions to plot epistasis data.

Plot epistatic coefficients

The plotting module comes with a default function for plotting epistatic coefficients. It plots the value of the coefficient as bar graphs, the label as a box plot (see example below), and signficicance as stars using a t-test.

import matplotlib.pyplot as plt

from gpmap.simulate import MountFujiSimulation
from epistasis.models.linear import EpistasisLinearRegression
from epistasis.pyplot.coefs import plot_coefs

gpm = MountFujiSimulation.from_length(4, field_strength=-1, roughness=(-2,2))

model = EpistasisLinearRegression(order=4)
model.add_gpm(gpm)
model.fit()

# Plot coefs
fig, axes = plot_coefs(model, figsize=(4,5))
plt.show()
_images/plot_coefs.png

Plot nonlinear scale

Plot a nonlinear scale using the epistasis.pyplot.nonlinear module.

# Import matplotlib
import matplotlib.pyplot as plt

# Import epistasis package
from gpmap.simulate import MountFujiSimulation
from epistasis.models.nonlinear import EpistasisPowerTransform
from epistasis.pyplot.nonlinear import plot_power_transform

# Simulate a Mt. Fuji fitness landscape
gpm = MountFujiSimulation.from_length(4, field_strength=-1, roughness=(-2,2))

# Fit Power transform
model = EpistasisPowerTransform(lmbda=1, A=0, B=0)
model.add_gpm(gpm)
model.fit()

# Create plot
fig, ax = plt.subplots(figsize=(3,3))
plot_power_transform(model, cmap='plasma', ax=ax, yerr=0.6)

ax.set_xlabel('Padd')
ax.set_ylabel('Pobs')
plt.show()
_images/plot_scale.png

Advanced topics

Setting bounds on nonlinear fits

All nonlinear epistasis models use lmfit to estimate a nonlinear scale in an arbitrary genotype-phenotype map. Each model creates lmfit.Parameter objects for each parameter in the input function and contains them in the parameters attribute as lmfit.Parameters object. Thus, you can set the bounds, initial guesses, etc. on the parameters following lmfit’s API. The model, then, minimizes the squared residuals using the lmfit.minimize function. The results are stored in the Nonlinear object.

In the example below, we use a EpistasisPowerTransform to demonstrate how to access the lmfit API.

# Import a nonlinear model (this case, Power transform)
from epistasis.models import NonlinearPowerTransform

model = NonlinearPowerTransform(order=1)
model.parameters['lmbda'].set(value=1, min=0, max=10)
model.parameters['A'].set(value=10, min=0, max=100)

model.fit()

Access information about the minimizer results using the Nonlinear attribute.

# Pretty print the results!
model.Nonlinear.params.pretty_print()

Large genotype-phenotype maps

We have not tested the epistasis package on large genotype-phenotype maps (>5000 genotypes). In principle, it should be no problem as long as you have the resources (i.e. tons of RAM and time). However, it’s possible there may be issues with convergence and numerical rounding for these large spaces.

Reach out!

If you have a large dataset, please get in touch! We’d love to hear from you. Feel free to try cranking our models on your large dataset and let us know if you run into issues.

My nonlinear fit is slow and does not converge.

Try fitting the scale of your map using a fraction of your data. We’ve found that you can typically estimate the nonlinear scale of a genotype-phenotype map from a small fraction of the genotypes. Choose a random subset of your data and fit it using a first order nonlinear model. Then use that model to linearize all your phenotype

from gpmap import GenotypePhenotypeMap
from epistasis.models import (EpistasisPowerTransform,
                              EpistasisLinearRegression)

# Load data.
gpm = GenotypePhenotypeMap.read_csv('data.csv')

# Subset the data
data_subset = gpm.data.sample(frac=0.5)
gpm_subset = GenotypePhenotypeMap.read_dataframe(data_subset)

# Fit the subset
nonlinear = EpistasisPowerTransform(order=1, lmbda=1, A=0, B=0)
nonlinear.add_gpm(gpm_subset).fit()

# Linearize the original phenotypes to estimate epistasis.
#
# Note: power transform calculate the geometric mean of the additive
# phenotypes, so we need to pass those phenotypes to the reverse transform.
padd = nonlinear.Additive.predict(X='fit')
linear_phenotypes = nonlinear.reverse(gpm.phenotypes,
                                      *nonlinear.parameters.values(),
                                      data=padd)

# Change phenotypes (note this changes the original dataframe)
gpm.data.phenotypes = linear_phenotypes
model = EpistasisLinearRegression(order=10)
model.add_gpm(gpm)

# Fit the model
model.fit()

Estimating model uncertainty

The epistasis package includes a sampling module for estimating uncertainty in all coefficients in (Non)linear epistasis models. It follows a Bayesian approach, and uses the emcee python package to approximate the posterior distributions for each coefficient.

Basic example

Use the BayesianSampler object to sample your epistasis model. The sampler stores an MCMC chain

(The plot below was created using the corner package.)

# Imports
import matplotlib.pyplot as plt
import numpy as np
import corner

from epistasis.simulate import LinearSimulation
from epistasis.models import EpistasisLinearRegression
from epistasis.sampling.bayesian import BayesianSampler

# Create a simulated genotype-phenotype map with epistasis.
sim = LinearSimulation.from_length(3, model_type="local")
sim.set_coefs_order(3)
sim.set_coefs_random((-1,1))
sim.set_stdeviations([0.01])

# Initialize an epistasis model and fit a ML model.
model = EpistasisLinearRegression(order=3, model_type="local")
model.add_gpm(sim)
model.fit()

# Initialize a sampler.
sampler = BayesianSampler(model)
samples, rstate = sampler.sample(500)

# Plot the Posterior
fig = corner.corner(samples, truths=sim.epistasis.values)
_images/bayes-estimate-uncertainty.png
Defining a prior

The default prior for a BayesianSampler is a flat prior (BayesianSampler.lnprior() returns a log-prior equal to 0). To set your own prior, define your own function that called lnprior that returns a log prior for a set of coefs and reset the BayesianSampler static method:

def lnprior(coefs):
    # Set bound on the first coefficient.
    if coefs[0] < 0:
        return -np.inf
    return 0

# Apply to fitter from above
fitter.lnprior = lnprior

Saving an epistasis model

All epistasis models/simulators store epistatic coefficients in Pandas Series/DataFrames, so the coefficients can be written to various file formats. This page lists a few.

Pickle

The recommended way to save an epistasis model to be used again is by pickling the model(See Python’s pickle library).

# Import pickle module
import pickle
from epistasis.models import EpistasisLinearRegression

# Simple model object
model = EpistasisLinearRegression(model=1)

# Save to disk to open later.
with open('model.pickle', 'wb') as f:
    pickle.dump(f, model)

To load the saved model,

# Import pickle module
import pickle

# Read from file.
with open('model.pickle', 'rb') as f:
    model = pickle.load(f)

Warning

Pickled models will only work with the same version of the epistasis package that created it. If you save a model and upgrade the library, you likely won’t be able to use the model anymore.

Excel Spreadsheet

Epistatic coefficients can be written to an excel file using the to_excel method

model.to_excel('data.xlsx')
sites values
0 [0] 0.501191
1 [1] -0.600019
2 [2] 0.064983
3 [3] 0.609166
4 [4] 0.242095
5 [1, 2] 0.286914
6 [1, 3] -0.264455
7 [1, 4] -0.464212
8 [2, 3] 0.638260
9 [2, 4] 0.235989
10 [3, 4] 0.717954
11 [1, 2, 3] -0.473122
12 [1, 2, 4] -0.041919
13 [1, 3, 4] -0.309124
14 [2, 3, 4] 0.606674
15 [1, 2, 3, 4] -0.268982

CSV File

Epistatic coefficients can be written to a csv file using the to_csv method

model.epistasis.to_csv('epistasis.csv')
,sites,values
0,[0],0.5011910655025966
1,[1],-0.6000186681513706
2,[2],0.06498276930060931
3,[3],0.6091656756721153
4,[4],0.24209508436556937
5,"[1, 2]",0.2869142038187855
6,"[1, 3]",-0.26445514455225094
7,"[1, 4]",-0.4642116520437949
8,"[2, 3]",0.638260262428922
9,"[2, 4]",0.23598864236123118
10,"[3, 4]",0.7179538630349485
11,"[1, 2, 3]",-0.47312160287366267
12,"[1, 2, 4]",-0.04191888437610514
13,"[1, 3, 4]",-0.30912353449573415
14,"[2, 3, 4]",0.6066739725656609
15,"[1, 2, 3, 4]",-0.2689818206753505

API documentation

This page lists all modules in the epitasis library. If you find missing documentation, feel free to open an issue or (better yet) submit a pull request!

epistasis.models package

See subpackages for specific models.

epistasis.models.linear package
epistasis.models.linear.lasso module
class epistasis.models.linear.lasso.EpistasisLasso(order=1, model_type='global', alpha=1.0, precompute=False, max_iter=1000, tol=0.0001, warm_start=False, positive=False, random_state=None, selection='cyclic', **kwargs)

Bases: sklearn.linear_model.coordinate_descent.Lasso, epistasis.models.base.AbstractModel

A scikit-learn Lasso Regression class for discovering sparse epistatic coefficients.

Methods are described in the following publication:
Poelwijk FJ, Socolich M, and Ranganathan R. ‘Learning the pattern of epistasis linking enotype and phenotype in a protein’. bioRxiv. (2017).
Parameters:
  • order (int) – order of epistasis
  • model_type (str (default="global")) – model matrix type. See publication above for more information
  • alpha (float) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the EpistasisLinearRegression object.
  • precompute – Whether to use a precomputed Gram matrix to speed up calculations. If set to ‘auto’ let us decide. The Gram matrix can also be passed as argument. For sparse input this option is always True to preserve sparsity.
  • max_iter (int) – The maximum number of iterations.
  • tol (float) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
  • warm_start (bool) – When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
  • positive (bool) – When set to True, forces the coefficients to be positive.
  • random_state (int) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
  • selection (str) – If set to ‘random’, a random coefficient is updated every iteration rather than looping over features sequentially by default. This (setting to ‘random’) often leads to significantly faster convergence especially when tol is higher than 1e-4.
compression_ratio()

Compute the compression ratio for the Lasso regression

fit(X=None, y=None, **kwargs)
fit_transform(X=None, y=None, **kwargs)
hypothesis(X=None, thetas=None)
hypothesis_transform(X=None, y=None, thetas=None)
lnlike_of_data(X=None, y=None, yerr=None, thetas=None)
lnlike_transform(X=None, y=None, yerr=None, lnprior=None, thetas=None)
num_of_params
predict(X=None)
predict_transform(X=None, y=None)
score(X=None, y=None)
thetas
epistasis.models.linear.ordinary module
class epistasis.models.linear.ordinary.EpistasisLinearRegression(order=1, model_type='global', n_jobs=1, **kwargs)

Bases: sklearn.linear_model.base.LinearRegression, epistasis.models.base.AbstractModel

Ordinary least-squares regression for estimating high-order, epistatic interactions in a genotype-phenotype map.

Methods are described in the following publication:
Sailer, Z. R. & Harms, M. J. ‘Detecting High-Order Epistasis in Nonlinear Genotype-Phenotype Maps’. Genetics 205, 1079-1088 (2017).
Parameters:
  • order (int) – order of epistasis
  • model_type (str (default="global")) – model matrix type. See publication above for more information
fit(X=None, y=None, **kwargs)
fit_transform(X=None, y=None, **kwargs)
hypothesis(X=None, thetas=None)
hypothesis_transform(X=None, y=None, thetas=None)
lnlike_of_data(X=None, y=None, yerr=None, thetas=None)
lnlike_transform(X=None, y=None, yerr=None, lnprior=None, thetas=None)
num_of_params
predict(X=None)
predict_transform(X=None, y=None)
score(X=None, y=None)
thetas
epistasis.models.linear.ridge module
class epistasis.models.linear.ridge.EpistasisRidge(order=1, model_type='global', alpha=1.0, max_iter=1000, tol=0.0001, random_state=None, solver='auto', **kwargs)

Bases: sklearn.linear_model.ridge.Ridge, epistasis.models.base.AbstractModel

A scikit-learn Ridge Regression class for discovering sparse epistatic coefficients.

Parameters:
  • order (int) – order of epistasis
  • model_type (str (default="global")) – model matrix type. See publication above for more information
  • alpha (float) – Constant that multiplies the L1 term. Defaults to 1.0. alpha = 0 is equivalent to an ordinary least square, solved by the EpistasisLinearRegression object.
  • max_iter (int) – The maximum number of iterations.
  • tol (float) – The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
  • random_state (int) – The seed of the pseudo random number generator that selects a random feature to update. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random. Used when selection == ‘random’.
  • solver (str) – See scikit learn docs for Ridge.
compression_ratio()

Compute the compression ratio for the Lasso regression

fit(X=None, y=None, **kwargs)
fit_transform(X=None, y=None, **kwargs)
hypothesis(X=None, thetas=None)
hypothesis_transform(X=None, y=None, thetas=None)
lnlike_of_data(X=None, y=None, yerr=None, thetas=None)
lnlike_transform(X=None, y=None, yerr=None, lnprior=None, thetas=None)
num_of_params
predict(X=None)
predict_transform(X=None, y=None)
score(X=None, y=None)
thetas
epistasis.models.nonlinear package
epistasis.models.nonlinear.ordinary module
epistasis.models.nonlinear.power module
Module contents
epistasis.models.base module
class epistasis.models.base.AbstractModel

Bases: abc.ABC

Abstract Base Class for all epistasis models.

This class sets all docstrings not given in subclasses.

static __new__(*args, **kwargs)

Replace the docstrings of a subclass with docstrings in this base class.

add_X(X=None, key=None)

Add X to Xbuilt

Keyword arguments for X:

  • None :
    Uses gpm.binary to construct X. If genotypes are missing they will not be included in fit. At the end of fitting, an epistasis map attribute is attached to the model class.
Parameters:
  • X – see above for details.
  • key (str) – name for storing the matrix.
Returns:

Xbuilt – newly built 2d array matrix

Return type:

numpy.ndarray

add_gpm(gpm)

Add a GenotypePhenotypeMap object to the epistasis model.

fit(X=None, y=None, **kwargs)

Fit model to data.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (None or ndarray (default=None)) – array of phenotypes. If None, the phenotypes in the attached genotype-phenotype map is used.
Returns:

The model is returned. Allows chaining methods.

Return type:

self

fit_transform(X=None, y=None, **kwargs)

Fit model to data and transform output according to model.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (None or ndarray (default=None)) – array of phenotypes. If None, the phenotypes in the attached genotype-phenotype map is used.
Returns:

gpm – The genotype-phenotype map object with transformed genotypes.

Return type:

GenotypePhenotypeMap

gpm

Data stored in a GenotypePhenotypeMap object.

hypothesis(X=None, thetas=None)

Compute phenotypes from given model parameters.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns:

y – array of phenotypes predicted by model parameters.

Return type:

ndarray

hypothesis_transform(X=None, y=None, thetas=None)

Transform phenotypes with given model parameters.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (ndarray) – An array of phenotypes to transform.
  • thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns:

y – array of phenotypes predicted by model parameters.

Return type:

ndarray

lnlike_of_data(X=None, y=None, yerr=None, thetas=None)

Compute the individUal log-likelihoods for each datapoint from a set of model parameters.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (ndarray) – An array of phenotypes to transform.
  • yerr (ndarray) – An array of the measured phenotypes standard deviations.
  • thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns:

y – array of phenotypes predicted by model parameters.

Return type:

ndarray

lnlike_transform(X=None, y=None, yerr=None, lnprior=None, thetas=None)

Compute the individual log-likelihoods for each datapoint from a set of model parameters and a prior.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (ndarray) – An array of phenotypes to transform.
  • yerr (ndarray) – An array of the measured phenotypes standard deviations.
  • lnprior (ndarray) – An array of priors for a given datapoint.
  • thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns:

y – array of phenotypes predicted by model parameters.

Return type:

ndarray

lnlikelihood(X=None, y=None, yerr=None, thetas=None)

Compute the individal log-likelihoods for each datapoint from a set of model parameters.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (ndarray) – An array of phenotypes to transform.
  • yerr (ndarray) – An array of the measured phenotypes standard deviations.
  • thetas (ndarray) – array of model parameters. See thetas property for specifics.
Returns:

lnlike – log-likelihood of the model parameters.

Return type:

float

num_of_params

Number of parameters in model.

predict(X=None)

Use model to predict phenotypes for a given list of genotypes.

Parameters:X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
Returns:y – array of phenotypes.
Return type:ndarray
predict_transform(X=None, y=None, **kwargs)

Transform a set of phenotypes according to the model.

Parameters:
  • X (None, ndarray, or list of genotypes. (default=None)) – data used to construct X matrix that maps genotypes to model coefficients. If None, the model uses genotypes in the attached genotype-phenotype map. If a list of strings, the strings are genotypes that will be converted to an X matrix. If ndarray, the function assumes X is the X matrix used by the epistasis model.
  • y (ndarray) – An array of phenotypes to transform.
Returns:

y_transform – array of phenotypes.

Return type:

ndarray

class epistasis.models.base.BaseModel

Bases: epistasis.models.base.AbstractModel, sklearn.base.RegressorMixin, sklearn.base.BaseEstimator

Base model for defining an epistasis model.

exception epistasis.models.base.SubclassException

Bases: Exception

Subclass Exception for parent classes.

epistasis.models.base.use_sklearn(sklearn_class)

Swap out base classes in an Epistasis model class with a sklearn_class + AbstractModel.

epistasis.models.classifiers module
epistasis.models.ensemble module
epistasis.models.pipeline module
epistasis.models.utils module
exception epistasis.models.utils.FittingError

Bases: Exception

Exception Subclass for X matrix errors.

exception epistasis.models.utils.XMatrixException

Bases: Exception

Exception Subclass for X matrix errors.

epistasis.models.utils.arghandler(method)

Points methods to argument handlers. Assumes each argument has a corresponding method attached to the object named “_{argument}”. These methods given default values to arguments.

Ignores self and kwargs

Module contents

epistasis.pyplot package

epistasis.pyplot.coefs module
epistasis.pyplot.coefs.plot_coefs(model=None, sites=None, values=None, errors=None, **kwargs)

Create a barplot with the values from model, drawing the x-axis as a grid of boxes indicating the coordinate of the epistatic parameter. Should automatically generate an almost publication-quality figure.

Parameters:
  • model (BaseModel object) – epistasis model.
  • sites (array) – array of epistatic indices/sites.
  • values (array) – an array of epistatic coefficients
  • errors (2d array or list) – upper and lower bounds for each beta.
Keyword Arguments:
 
  • logbase (numpy.ufunc (default=np.log10)) – function to transform into log space
  • log_transform (bool (default=False)) – transform the values if true.
  • order_colors – list/tuple of colors for each order (rgb,html string-like)
  • significance – how to treat signifiance. should be 1. “bon” -> Bonferroni corrected p-values (default) 2. “p” -> raw p-values 3. None -> ignore significance
  • significance_cutoff – value above which to consider a term significant
  • sigmas – number of sigmas to show for each error bar
  • y_scalar – how much to scale the y-axis above and beyond y-max
  • y_axis_name – what to put on the y-axis of the barplot
  • figsize – tuple of figure width,height
  • height_ratio – how much to scale barplot relative to xbox
  • star_cutoffs
    signifiance cutoffs for star stack. should go from highest
    p to lowest p (least to most significant)
  • star_spacer – constant that scales how closely stacked stars are from one another
  • ybounds (tuple (default=None)) –
  • bar_borders (bool (default=True)) –
  • xgrid (bool (default=True)) –
  • ecolor (color (default='black')) –
  • elinewidth (float (default=1)) –
  • capthick (float (default=1)) –
  • capsize (float (default=1)) –
  • gridlines (float (default=1)) – x grid linewidth
Returns:

  • fig (matplotlib.pyplot.Figure) – Figure object
  • ax (matplotlib.pyplot.Axes) – Axes object

epistasis.pyplot.nonlinear module
epistasis.pyplot.nonlinear.plot_power_transform(model=None, yadd=None, yobs=None, yerr=None, function=None, cmap=None, color=None, s=50, ax=None, function_line=True, line_color='r', **kwargs)

Plot a Y_obs vs. Y_add showing the nonlinear scale in a genotype-phenotype map.

Parameters:
  • model ((default=None)) – Epistasis model.
  • yadd (array (default=None)) – x-axis data. The additive model phenotypes.
  • yobs (array (default=None)) – y-axis data. The observed phenotypes.
  • yerr (array (default=None)) – y-axis error. Error in observed phenotypes.
  • function (callable) – Nonlinear function.
  • cmap (str) – Colormap name to map onto scatter points.
  • color (str, array,) – color of phenotypes.
  • s (int) – size of scatter points.
  • ax (Axes) – Axes object to plot points on.
  • function_line (bool) – If true, plots nonlinear function on top of points.
  • line_color (matplotlib color.) – color of function line.
Returns:

ax – Axes object with plot.

Return type:

matplotlib.Axes

epistasis.pyplot.nonlinear.plot_scale(model=None, yadd=None, yobs=None, yerr=None, function=None, cmap=None, color=None, s=50, ax=None, function_line=True, line_color='r', **kwargs)

Plot a Y_obs vs. Y_add showing the nonlinear scale in a genotype-phenotype map.

Parameters:
  • model ((default=None)) – Epistasis model.
  • yadd (array (default=None)) – x-axis data. The additive model phenotypes.
  • yobs (array (default=None)) – y-axis data. The observed phenotypes.
  • yerr (array (default=None)) – y-axis error. Error in observed phenotypes.
  • function (callable) – Nonlinear function.
  • cmap (str) – Colormap name to map onto scatter points.
  • color (str, array,) – color of phenotypes.
  • s (int) – size of scatter points.
  • ax (Axes) – Axes object to plot points on.
  • function_line (bool) – If true, plots nonlinear function on top of points.
  • line_color (matplotlib color.) – color of function line.
Returns:

ax – Axes object with plot.

Return type:

matplotlib.Axes

epistasis.sampling package

epistasis.sampling.bayesian module
class epistasis.sampling.bayesian.BayesianSampler(model, lnprior=None)

Bases: object

A sampling class to estimate the uncertainties in an epistasis model’s coefficients using a Bayesian approach. This object samples from the experimental uncertainty in the phenotypes to estimate confidence intervals for the coefficients in an epistasis model according to Bayes Theorem:

\[P(H|E) = \frac{ P(E|H) \cdot P(H) }{ P(E) }\]

This reads: “the probability of epistasis model \(H\) given the data \(E\) is equal to the probability of the data given the model times the probability of the model.”

Parameters:model – Epistasis model to run a bootstrap calculation.
get_initial_walkers(relative_widths=0.01)

Place the walkers in Gaussian balls in parameter space around the ML values for each coefficient.

static lnprior(thetas)

Prior probability for the given set of model parameters.

static lnprob(thetas, lnlike)

The posterior probability of a given set of model parameters and likelihood function.

sample(n_steps=100, n_burn=0, previous_state=None)

Sample the likelihood of the model by walking n_steps with each walker.

epistasis.simulate package

epistasis.simulate.base module
epistasis.simulate.linear module
class epistasis.simulate.linear.LinearSimulation(wildtype, mutations, model_type='global', **kwargs)

Bases: epistasis.simulate.base.BaseSimulation

Construct an genotype-phenotype from linear building blocks and epistatic coefficients.

Example

Phenotype = b0 + b1 + b2 + b3 + b12 + b13 + b13 + b123

Parameters:
  • wildtype (strß) – Wildtype genotype
  • mutations (dict) – Mapping for each site to its alphabet
  • order (int) – Order of epistasis in simulated genotype-phenotype map
  • betas (array-like) – values of epistatic coefficients (must be positive for this function to work. Log is taken)
  • model_type (str) – Use a local or global (i.e. Walsh space) epistasis model to construct phenotypes
build()

Build the phenotype map from epistatic interactions.

epistasis.simulate.power module

epistasis.mapping module

class epistasis.mapping.EpistasisMap(sites, order=1, values=None, model_type='global')

Bases: object

DataFrame for epistatic interactions.

get_orders(*orders)

Get epistasis of a given order.

index

Get the interaction index in interaction matrix.

map(attr1, attr2)

Dictionary that maps attr1 to attr2.

model_type

Type of epistasis model used to get coefficients.

n

Return the number of Interactions.

order

Get order of epistasis in system.

classmethod read_dataframe(df, model_type='global')

Create an epistasis model from dataframe

sites

Get the interaction sites, which describe the position of interacting mutations in the genotypes. (type==list of lists, see self._build_interaction_sites)

to_csv(filename)

Write data to a csv file.

to_dict()

Get data as dictionary.

to_excel(filename)

Write data to excel file.

values

Get the values of the interaction in the system

class epistasis.mapping.Orders(epistasismap, orders)

Bases: map

An object that provides API for easily calling epistasis of a given order in an epistasis map.

__call__()

return a dictionary

df

Dataframe for orders object.

index

Get indices of epistasis from this order.

sites

Get epistatic sites

values

Get values of epistasis for this order.

epistasis.mapping.assert_epistasis(method)

Assert that an epistasis map has been attached to the object.

epistasis.mapping.genotype_coeffs(genotype, order=None)

List the possible epistatic coefficients (as label form) for a binary genotype up to a given order.

epistasis.mapping.key_to_site(key)

Convert an interaction key to label.

epistasis.mapping.mutations_to_sites(order, mutations, start_order=0)

Build interaction sites up to nth order given a mutation alphabet.

Parameters:
  • order (int) – order of interactions
  • mutations (dict) – mutations = { site_number : [“mutation-1”, “mutation-2”] }. If the site alphabet is note included, the model will assume binary between wildtype and derived.

Example

mutations = {
    0: ["A", "V"],
    1: ["A", "V"],
    ...
}
Returns:sites – list of all interaction sites for system with sequences of a given length and epistasis with given order.
Return type:list
epistasis.mapping.site_to_key(site, state='')

Convert site to key. state is added to end of key.

epistasis.matrix module

epistasis.matrix.encode_vectors(binary_genotypes, model_type='global')

Encode a set of binary genotypes is input vectors for the given model

epistasis.matrix.get_model_matrix(binary_genotypes, sites, model_type='global')

Get a model matrix for a given set of genotypes and coefficients.

epistasis.stats module

Submodule with useful statistics functions for epistasis model.

epistasis.stats.aic(model)

Given a model, calculates an AIC score.

epistasis.stats.chi_squared(y_obs, y_pred)

Calculate the chi squared between observed and predicted y.

epistasis.stats.explained_variance(y_obs, y_pred)

Returns the explained variance

epistasis.stats.false_negative_rate(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)

Calculate the false negative rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false negatives found in the predicted given the number of samples and sigmas.

The defined bounds are:
(number of sigmas) * errors / sqrt(number of samples)
Parameters:
  • known (array-like) – Known values for comparing false negatives
  • predicted (array-like) – Predicted values
  • errors (array-like) – Standard error from model
  • n_samples (int) – number of replicate samples
  • sigma (int (default=2)) – How many standard errors away (2 == 0.05 false negative rate)
Returns:

rate – False negative rate in data

Return type:

float

epistasis.stats.false_positive_rate(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)

Calculate the false positive rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false positives found in the predicted given the number of samples and sigmas.

The defined bounds are:
(number of sigmas) * errors / sqrt(number of samples)
Parameters:
  • known (array-like) – Known values for comparing false positives
  • predicted (array-like) – Predicted values
  • errors (array-like) – Standard error from model
  • n_samples (int) – number of replicate samples
  • sigma (int (default=2)) – How many standard errors away (2 == 0.05 false positive rate)
Returns:

rate – False positive rate in data

Return type:

float

epistasis.stats.generalized_r2(y_obs, y_pred)

Calculate the rquared between the observed and predicted y. See wikipedia definition of coefficient of determination.

epistasis.stats.gmean(x)

Calculate a geometric mean with zero and negative values.

Following the gmean calculation from this paper:

Habib, Elsayed AE. “Geometric mean for negative and zero values.” International Journal of Research and Reviews in Applied Sciences 11 (2012): 419-432.

epistasis.stats.incremental_mean(old_mean, samples, M, N)

Calculate an incremental running mean.

Parameters:
  • old_mean (float or array) – current running mean(s) before adding samples
  • samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
  • M (int) – number of samples in new chunk
  • N (int) – number of previous samples in old mean
epistasis.stats.incremental_std(old_mean, old_std, new_mean, samples, M, N)

Calculate an incremental standard deviation.

Parameters:
  • old_mean (float or array) – current running mean(s) before adding samples
  • samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
  • M (int) – number of samples in new chunk
  • N (int) – number of previous samples in old mean
epistasis.stats.incremental_var(old_mean, old_var, new_mean, samples, M, N)

Calculate an incremental variance.

Parameters:
  • old_mean (float or array) – current running mean(s) before adding samples
  • old_var (float or array) – current running variance(s) before adding samples
  • new_mean (float) – updated mean
  • samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
  • M (int) – number of samples in new chunk
  • N (int) – number of previous samples in old mean
epistasis.stats.pearson(y_obs, y_pred)

Calculate pearson coefficient between two variables.

epistasis.stats.rmsd(yobs, ypred)

Calculate the root mean squared deviation of an estimator.

epistasis.stats.split_data(data, fraction=1.0)

Split DataFrame into two sets, a training and a test set.

Parameters:
  • data (pandas.DataFrame) – full dataset to split.
  • fraction (float) – fraction in training set.
Returns:

  • train_set (pandas.DataFrame) – training set.
  • test_set (pandas.DataFrame) – test set.

epistasis.stats.ss_residuals(y_obs, y_pred)

calculate residuals

Indices and tables