Advanced topics =============== Building your own X matrix -------------------------- The X matrix in an epistasis model maps genotypes to epistatic coefficients. The rows represent the binary representation of genotypes, and the columns represent the epistatic coefficients. In a "local" epistasis model, an element in the matrix is ``1`` if the genotype in that element's row has the epistatic interaction in that element's column. Otherwise, the element is a ``0`` (see the example below). If you want to generate your own X matrix, you'll need a list of genotypes (in their binary representation) and a list of epistatic coefficients you'd like to fit. Then, you can generate X using the ``get_model_matrix`` function. Here is an example: **Input** .. code-block:: python # imports from epistasis.matrix import get_model_matrix # Epistasis coefficients coefs = [ [0], # Reference state [1], [2], [3], # Additive coefficients [1, 2], [1, 3], [2, 3], # Pairwise coefficients [1, 2, 3] # Third order coefficients ] # List of binary coefficients binary_genotypes = [ '000', '001', '010', '100', '011', '101', '110', '111' ] # Build X matrix X = get_model_matrix(binary_genotypes, coefs, model_type='local') **Output** .. code-block:: python X = array([ [1, 0, 0, 0, 0, 0, 0, 0], [1, 0, 0, 1, 0, 0, 0, 0], [1, 0, 1, 0, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0], [1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 1, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1] ]) **Input** You can easily generate the coefficient list using a few helper functions: .. code-block:: python from epistasis.mapping import mutations_to_sites mutations = { 0: ['0', '1'], 1: ['0', '1'], 2: ['0', '1'] } # Same as the list above! coefs = mutations_to_sites(order=3, mutations=mutations) **Output** .. code-block:: python coefs = [ [0], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3] ] Write your own linear model --------------------------- A linear, high-order epistasis model is a linear transformation of phenotypes :math:`\vec{P}` (length L) to (Lth order) epistatic coefficients :math:`\vec{\beta}` using the :math:`\mathbf{X}` matrix described in the previous section. .. math:: \vec{P} = \mathbf{X} \cdot \vec{\beta} In Python 3, this looks like: .. code-block:: python3 from epistasis.matrix import get_model_matrix # See the section above to get `coefs` coefs = [ [0], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3] ] # Numerical values for each coefficient beta = [ 1.0, 0.2, 0.3, 0.1, 0.1, 0.05, 0.01, 0.05 ] # Build the X matrix X = get_model_matrix(binary_genotypes, coefs, model_type='local') # Do the dot product phenotypes = X @ beta To compute linear epistatic coefficients from phenotypes, take the inverse of the equation above: .. math:: \vec{\beta} = \mathbf{X^{-1}} \cdot \vec{P} In Python 3, this looks like: .. code-block:: python3 import numpy as np from epistasis.matrix import get_model_matrix # See the section above to get `coefs` coefs = [ [0], [1], [2], [3], [1, 2], [1, 3], [2, 3], [1, 2, 3] ] # Binary genotypes binary_genotypes = ['000','001','010','100','011','101','110', '111'] # Quantitative phenotype values. phenotypes = [1, 1.1, 1.2, 1.2, 1.3, 1.5, 1.7, 1.8] # Build the X matrix X = get_model_matrix(binary_genotypes, coefs, model_type='local') # Take the inverse X_inv = np.inv(X) # Do the dot product beta = X_inv @ phenotypes Setting bounds on nonlinear fits -------------------------------- All nonlinear epistasis models use lmfit_ to estimate a nonlinear scale in an arbitrary genotype-phenotype map. Each model creates ``lmfit.Parameter`` objects for each parameter in the input function and contains them in the ``parameters`` attribute as ``lmfit.Parameters`` object. Thus, you can set the bounds, initial guesses, etc. on the parameters following lmfit's API. The model, then, minimizes the squared residuals using the ``lmfit.minimize`` function. The results are stored in the ``Nonlinear`` object. In the example below, we use a ``EpistasisPowerTransform`` to demonstrate how to access the lmfit API. .. _lmfit: https://lmfit.github.io/lmfit-py/ .. code-block:: python # Import a nonlinear model (this case, Power transform) from epistasis.models import NonlinearPowerTransform model = NonlinearPowerTransform(order=1) model.parameters['lmbda'].set(value=1, min=0, max=10) model.parameters['A'].set(value=10, min=0, max=100) model.fit() Access information about the minimizer results using the ``Nonlinear`` attribute. .. code-block:: python # Pretty print the results! model.Nonlinear.params.pretty_print() Large genotype-phenotype maps ----------------------------- We have not tested the ``epistasis`` package on large genotype-phenotype maps (>5000 genotypes). In principle, it should be no problem as long as you have the resources (i.e. tons of RAM and time). However, it's possible there may be issues with convergence and numerical rounding for these large spaces. Reach out! ~~~~~~~~~~ If you have a large dataset, please get in touch! We'd love to hear from you. Feel free to try cranking our models on your large dataset and let us know if you run into issues. My nonlinear fit is slow and does not converge. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Try fitting the scale of your map using a fraction of your data. We've found that you can typically estimate the nonlinear scale of a genotype-phenotype map from a small fraction of the genotypes. Choose a random subset of your data and fit it using a first order nonlinear model. Then use that model to linearize all your phenotype .. code-block:: python from gpmap import GenotypePhenotypeMap from epistasis.models import (EpistasisPowerTransform, EpistasisLinearRegression) # Load data. gpm = GenotypePhenotypeMap.read_csv('data.csv') # Subset the data data_subset = gpm.data.sample(frac=0.5) gpm_subset = GenotypePhenotypeMap.read_dataframe(data_subset) # Fit the subset nonlinear = EpistasisPowerTransform(order=1, lmbda=1, A=0, B=0) nonlinear.add_gpm(gpm_subset).fit() # Linearize the original phenotypes to estimate epistasis. # # Note: power transform calculate the geometric mean of the additive # phenotypes, so we need to pass those phenotypes to the reverse transform. padd = nonlinear.Additive.predict(X='fit') linear_phenotypes = nonlinear.reverse(gpm.phenotypes, *nonlinear.parameters.values(), data=padd) # Change phenotypes (note this changes the original dataframe) gpm.data.phenotypes = linear_phenotypes model = EpistasisLinearRegression(order=10) model.add_gpm(gpm) # Fit the model model.fit() Estimating model uncertainty ---------------------------- The epistasis package includes a ``sampling`` module for estimating uncertainty in all coefficients in (Non)linear epistasis models. It follows a Bayesian approach, and uses the `emcee` python package to approximate the posterior distributions for each coefficient. Basic example ~~~~~~~~~~~~~ Use the ``BayesianSampler`` object to sample your epistasis model. The sampler stores an MCMC chain (The plot below was created using the ``corner`` package.) .. code-block:: python # Imports import matplotlib.pyplot as plt import numpy as np import corner from epistasis.simulate import LinearSimulation from epistasis.models import EpistasisLinearRegression from epistasis.sampling.bayesian import BayesianSampler # Create a simulated genotype-phenotype map with epistasis. sim = LinearSimulation.from_length(3, model_type="local") sim.set_coefs_order(3) sim.set_coefs_random((-1,1)) sim.set_stdeviations([0.01]) # Initialize an epistasis model and fit a ML model. model = EpistasisLinearRegression(order=3, model_type="local") model.add_gpm(sim) model.fit() # Initialize a sampler. sampler = BayesianSampler(model) samples, rstate = sampler.sample(500) # Plot the Posterior fig = corner.corner(samples, truths=sim.epistasis.values) .. image:: ../img/bayes-estimate-uncertainty.png Defining a prior ~~~~~~~~~~~~~~~~ The default prior for a BayesianSampler is a flat prior (``BayesianSampler.lnprior()`` returns a log-prior equal to 0). To set your own prior, define your own function that called ``lnprior`` that returns a log prior for a set of `coefs` and reset the BayesianSampler static method: .. code-block:: python def lnprior(coefs): # Set bound on the first coefficient. if coefs[0] < 0: return -np.inf return 0 # Apply to fitter from above fitter.lnprior = lnprior