# epistasis.stats module

Submodule with useful statistics functions for epistasis models.

epistasis.stats.aic(model)

Given a model, calculates an AIC score.
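
For orientation, the textbook Akaike information criterion is AIC = 2k - 2 ln(L), where k is the number of fitted parameters and L is the maximized likelihood. The sketch below evaluates that formula from plain numbers. The names `aic_sketch`, `n_params`, and `log_likelihood` are illustrative only; how `epistasis.stats.aic` obtains these quantities from a model object is not described on this page.

```python
def aic_sketch(n_params, log_likelihood):
    """Textbook AIC, 2k - 2 ln(L). Hypothetical helper, not the library function."""
    return 2 * n_params - 2 * log_likelihood


# Example: 3 fitted parameters and a maximized log-likelihood of -10.0.
score = aic_sketch(3, -10.0)
```

Lower scores indicate a better trade-off between fit and number of parameters when comparing models fitted to the same data.
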

epistasis.stats.chi_squared(y_obs, y_pred)

Calculate the chi squared between observed and predicted y.
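
As a rough illustration, the sketch below computes the standard Pearson chi-squared statistic, the sum of (observed - predicted)^2 / predicted. It assumes that definition; the library may normalize differently, and `chi_squared_sketch` is a hypothetical name.

```python
def chi_squared_sketch(y_obs, y_pred):
    """Pearson chi-squared statistic (assumed definition, not the library code).

    Every predicted value must be nonzero, since it appears in the denominator.
    """
    return sum((o - p) ** 2 / p for o, p in zip(y_obs, y_pred))


stat = chi_squared_sketch([10, 20, 30], [12, 18, 30])
```
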

epistasis.stats.explained_variance(y_obs, y_pred)

Return the explained variance between observed and predicted y.
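
One common definition of the explained variance score is 1 - Var(y_obs - y_pred) / Var(y_obs). The sketch below assumes that definition and uses only the standard library; `explained_variance_sketch` is an illustrative name and may not match the library's implementation.

```python
from statistics import pvariance


def explained_variance_sketch(y_obs, y_pred):
    """1 - Var(residuals) / Var(y_obs), using population variances (assumed)."""
    residuals = [o - p for o, p in zip(y_obs, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_obs)


# A constant offset leaves the residual variance at zero.
score = explained_variance_sketch([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
```

Under this definition, a prediction that is off by a constant still scores 1.0, because only the spread of the residuals is penalized.
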

epistasis.stats.false_negative_rate(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)

Calculate the false negative rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false negatives found in the predicted given the number of samples and sigmas.

The defined bounds are:
(number of sigmas) * errors / sqrt(number of samples)

Parameters:

- known (array-like) – Known values for comparing false negatives
- predicted (array-like) – Predicted values
- errors (array-like) – Standard error from model
- n_samples (int) – number of replicate samples
- sigma (int (default=2)) – How many standard errors away (2 == 0.05 false negative rate)

Returns:

- rate (float) – False negative rate in data

epistasis.stats.false_positive_rate(y_obs, y_pred, upper_ci, lower_ci, sigmas=2)

Calculate the false positive rate of predicted values. Finds all values that equal zero in the known array and calculates the number of false positives found in the predicted given the number of samples and sigmas.

The defined bounds are:
(number of sigmas) * errors / sqrt(number of samples)

Parameters:

- known (array-like) – Known values for comparing false positives
- predicted (array-like) – Predicted values
- errors (array-like) – Standard error from model
- n_samples (int) – number of replicate samples
- sigma (int (default=2)) – How many standard errors away (2 == 0.05 false positive rate)

Returns:

- rate (float) – False positive rate in data

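
The description above can be sketched as follows. This is a minimal reading of the documented parameters (`known`, `predicted`, `errors`, `n_samples`), not of the `upper_ci`/`lower_ci` signature, and it assumes a false positive is a known-zero value whose predicted interval excludes zero. The name `false_positive_rate_sketch` and the choice to return 0.0 when there are no known zeros are assumptions.

```python
import math


def false_positive_rate_sketch(known, predicted, errors, n_samples, sigmas=2):
    """Fraction of known-zero values whose predicted interval excludes zero."""
    # Bound from the text: (number of sigmas) * errors / sqrt(number of samples).
    scale = sigmas / math.sqrt(n_samples)
    known_zeros = 0
    false_positives = 0
    for k, p, e in zip(known, predicted, errors):
        if k != 0:
            continue  # only known zeros can produce false positives
        known_zeros += 1
        bound = scale * e
        if not (p - bound <= 0 <= p + bound):
            false_positives += 1
    # Assumption: report 0.0 when the known array has no zeros.
    return false_positives / known_zeros if known_zeros else 0.0


rate = false_positive_rate_sketch(
    known=[0, 0, 1, 0],
    predicted=[0.1, 1.0, 2.0, -0.05],
    errors=[0.2, 0.2, 0.2, 0.2],
    n_samples=4,
)
```
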
epistasis.stats.generalized_r2(y_obs, y_pred)

Calculate the R-squared between the observed and predicted y. See the Wikipedia definition of the coefficient of determination.
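
The coefficient of determination referred to here is R² = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observations from their mean. A minimal standalone sketch, with `r_squared_sketch` as an illustrative name rather than the library function:

```python
def r_squared_sketch(y_obs, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1 - ss_res / ss_tot


r2 = r_squared_sketch([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
```
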

epistasis.stats.gmean(x)

Calculate a geometric mean that handles zero and negative values.

Following the gmean calculation from this paper:

Habib, Elsayed AE. “Geometric mean for negative and zero values.” International Journal of Research and Reviews in Applied Sciences 11 (2012): 419-432.
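
One way to extend the geometric mean to mixed-sign data is to take the geometric mean of the positive values and of the absolute negative values separately, treat zeros as contributing zero, and combine the groups weighted by their share of the data. The sketch below assumes that weighted form; whether it matches the cited paper or the library in every detail is not confirmed here, and `signed_gmean_sketch` is a hypothetical name.

```python
import math


def signed_gmean_sketch(values):
    """Weighted combination of per-sign geometric means (assumed form)."""

    def _gmean(xs):
        # Geometric mean of strictly positive numbers via logs.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    n = len(values)
    positives = [v for v in values if v > 0]
    negatives = [-v for v in values if v < 0]  # absolute values
    total = 0.0
    if positives:
        total += len(positives) / n * _gmean(positives)
    if negatives:
        total -= len(negatives) / n * _gmean(negatives)
    # Zeros add nothing to the sum but still count toward n.
    return total


g = signed_gmean_sketch([-4, -1, 0, 2, 8])
```
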

epistasis.stats.incremental_mean(old_mean, samples, M, N)

Calculate an incremental running mean.

Parameters:

- old_mean (float or array) – current running mean(s) before adding samples
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean

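
The update behind a running mean is new_mean = (N * old_mean + sum of the M new samples) / (N + M). The sketch below applies it to a single quantity, with `samples` as a flat list instead of the 2-D array described above; `incremental_mean_sketch` is an illustrative name, not the library function.

```python
def incremental_mean_sketch(old_mean, samples, M, N):
    """Fold M new samples into a mean that was computed from N earlier samples."""
    return (N * old_mean + sum(samples)) / (N + M)


# The earlier samples 1, 2, 3 have mean 2.0; now fold in 4 and 5.
new_mean = incremental_mean_sketch(2.0, [4, 5], M=2, N=3)
```
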
epistasis.stats.incremental_std(old_mean, old_std, new_mean, samples, M, N)

Calculate an incremental standard deviation.

Parameters:

- old_mean (float or array) – current running mean(s) before adding samples
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean

epistasis.stats.incremental_var(old_mean, old_var, new_mean, samples, M, N)

Calculate an incremental variance.

Parameters:

- old_mean (float or array) – current running mean(s) before adding samples
- old_var (float or array) – current running variance(s) before adding samples
- new_mean (float) – updated mean
- samples (ndarray) – array containing the samples. Each column is a sample. Rows are independent values. Mean is taken across row.
- M (int) – number of samples in new chunk
- N (int) – number of previous samples in old mean

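
A running variance can be updated without revisiting the old samples: the old sum of squares about the new mean is N * (old_var + (old_mean - new_mean)^2), and the new samples contribute their squared deviations from the new mean. The sketch below assumes population variance (divide by N + M) and a single quantity with `samples` as a flat list; the library may use a different convention, and `incremental_var_sketch` is a hypothetical name.

```python
def incremental_var_sketch(old_mean, old_var, new_mean, samples, M, N):
    """Update a population variance after adding M samples to N earlier ones."""
    # Sum of squares of the old samples, re-centered on the new mean.
    old_ss = N * (old_var + (old_mean - new_mean) ** 2)
    # Sum of squares of the new samples about the new mean.
    new_ss = sum((x - new_mean) ** 2 for x in samples)
    return (old_ss + new_ss) / (N + M)


# The earlier samples 1, 2, 3 have mean 2 and population variance 2/3.
# After adding 4 and 5 the mean is 3.
new_var = incremental_var_sketch(2.0, 2 / 3, 3.0, [4, 5], M=2, N=3)
```

Taking the square root of the updated variance gives the corresponding running standard deviation.
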
epistasis.stats.pearson(y_obs, y_pred)

Calculate the Pearson correlation coefficient between two variables.
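
The Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal standard-library sketch, with `pearson_sketch` as an illustrative name rather than the library function:

```python
import math


def pearson_sketch(x, y):
    """Pearson correlation: covariance over the product of the spreads."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


r = pearson_sketch([1, 2, 3, 4], [2, 4, 6, 8])
```
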

epistasis.stats.rmsd(yobs, ypred)

Calculate the root mean squared deviation of an estimator.
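
The root mean squared deviation is the square root of the mean squared difference between observed and predicted values. A minimal sketch, with `rmsd_sketch` as an illustrative name rather than the library function:

```python
import math


def rmsd_sketch(y_obs, y_pred):
    """Square root of the mean squared difference between two sequences."""
    squared = [(o - p) ** 2 for o, p in zip(y_obs, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


error = rmsd_sketch([1, 2, 3, 4], [1.5, 2.5, 3.5, 4.5])
```
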

epistasis.stats.split_data(data, idx=None, nobs=None, fraction=None)

Split DataFrame into two sets, a training and a test set.

Parameters:

- data (pandas.DataFrame) – full dataset to split.
- idx (list) – List of indices to include in training set
- nobs (int) – number of observations in training. If nobs is given, fraction is ignored.
- fraction (float) – fraction in training set.

Returns:

- train_set (pandas.DataFrame) – training set.
- test_set (pandas.DataFrame) – test set.

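
The way the three options interact can be sketched on a plain list of rows. This sketch assumes that an explicit `idx` takes precedence, that `nobs` overrides `fraction` as stated above, and that rows are otherwise chosen at random. The `seed` argument and the name `split_rows_sketch` are hypothetical, and the real function operates on a `pandas.DataFrame`.

```python
import random


def split_rows_sketch(rows, idx=None, nobs=None, fraction=None, seed=0):
    """Split a list of rows into (train, test) lists."""
    n = len(rows)
    if idx is None:
        if nobs is None:
            if fraction is None:
                raise ValueError("give one of idx, nobs, or fraction")
            nobs = int(n * fraction)  # fraction is used only without nobs
        # Assumption: training rows are drawn at random without replacement.
        idx = random.Random(seed).sample(range(n), nobs)
    chosen = set(idx)
    train = [rows[i] for i in range(n) if i in chosen]
    test = [rows[i] for i in range(n) if i not in chosen]
    return train, test


rows = list(range(10))
train, test = split_rows_sketch(rows, nobs=3, fraction=0.9)  # nobs wins
```
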
epistasis.stats.split_gpm(gpm, idx=None, nobs=None, fraction=None)

Split GenotypePhenotypeMap into two sets, a training and a test set.

Parameters:

- data (pandas.DataFrame) – full dataset to split.
- idx (list) – List of indices to include in training set
- nobs (int) – number of observations in training.
- fraction (float) – fraction in training set.

Returns:

- train_gpm (GenotypePhenotypeMap) – training set.
- test_gpm (GenotypePhenotypeMap) – test set.

epistasis.stats.ss_residuals(y_obs, y_pred)

Calculate the sum of squared residuals between observed and predicted y.