SegmentQTL documentation

cis

Cis Objects

class Cis()

load_and_validate_file

def load_and_validate_file(file_path: str, index_col: int)

Load a CSV file and validate its existence and content.

Arguments:

  • file_path: Path to file
  • index_col: Position of the column to use as the row index

Returns:

  • Dataframe from contents of the CSV file

Raises:

  • FileNotFoundError: If the file does not exist at the given path.
  • ValueError: If the CSV file is empty (i.e., has no rows).
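
The error contract above can be illustrated with a minimal sketch. It is not the package's implementation: the function name `load_rows_checked` is hypothetical, and the standard-library `csv` module stands in for the dataframe loader so that the snippet has no third-party dependencies.

```python
import csv
import os
import tempfile


def load_rows_checked(file_path, index_col=0):
    """Hypothetical stand-in that returns {row_label: row_dict}, not a dataframe."""
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"No file at {file_path}")
    with open(file_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader, None)
        rows = [row for row in reader if row]
    if header is None or not rows:
        raise ValueError(f"{file_path} has no rows")
    # The column at position index_col supplies the row labels.
    return {row[index_col]: dict(zip(header, row)) for row in rows}


# Demonstration on temporary files (file names are made up for the example).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quant.csv")
    with open(path, "w", newline="") as handle:
        handle.write("gene,s1,s2\nG1,1.5,2.0\n")
    table = load_rows_checked(path)

    empty_path = os.path.join(tmp, "empty.csv")
    open(empty_path, "w").close()
    try:
        load_rows_checked(empty_path)
        empty_raised = False
    except ValueError:
        empty_raised = True

    try:
        load_rows_checked(os.path.join(tmp, "missing.csv"))
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```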

start_end_gene_window

def start_end_gene_window(gene_index: int)

Find position of the window of a given gene.

Arguments:

  • gene_index: Index of the desired gene on the quantification file

Returns:

  • Tuple of window_start and window_end, which define the start and end positions of the window

get_variants_for_gene_window

def get_variants_for_gene_window(current_start: int, current_end: int)

Find all the variants inside a window of a gene.

Arguments:

  • current_start: Start position of a window
  • current_end: End position of a window

Returns:

  • variants: Subset of genotype dataframe that contains only those variants that are inside the given window

gene_variants_common_segment

def gene_variants_common_segment(start: int, end: int, variants: pd.DataFrame)

Filter variants so that the gene and the variants in the same window are also on the same segment.

Arguments:

  • start: Start position of a window
  • end: End position of a window
  • variants: Subset of the genotype dataframe that contains only the variants in the same window as the gene of interest

Returns:

  • variants: Subset of genotype dataframe that is filtered and masked by segmentation and window.

gene_variant_regressions_permutations

def gene_variant_regressions_permutations(gene_index: int,
                                          transf_variants: pd.DataFrame,
                                          variant: str,
                                          regression_data: pd.DataFrame)

Perform permutations to obtain adjusted p-values. With 0 permutations, only the nominal pass is performed.

Arguments:

  • gene_index: Index of a gene of interest on the quantification file.
  • transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
  • variant: Variant id
  • regression_data: Dataframe with current gene expression levels, genotypes, and covariates

Returns:

  • actual_associations: Dataframe of association testing results for a gene. When more than 0 permutations are used, adjusted p-values are also provided.

permutation_data

def permutation_data(gene_index: int, perm_index: int,
                     transf_variants: pd.DataFrame, variant: str)

Build the data for association testing on permutations. The genotype (independent variable) values stay fixed; only the phenotype levels are permuted.

Arguments:

  • gene_index: Index of the actual gene on the quantification file.
  • perm_index: Index of a gene on the quantification file that is used for permutation.
  • transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
  • variant: Variant ID

Returns:

  • perm_gex, perm_genotypes: Arrays of residualized permuted phenotype levels and residualized fixed genotype values

check_grouping

def check_grouping(cur_genotypes_filtered: np.ndarray)

Check whether the genotype dosages show adequate variation in the data.

Arguments:

  • cur_genotypes_filtered: Array of genotype dosages

Returns:

  • Boolean value showing if there are enough instances in the different genotype groups.

filter_arrays

def filter_arrays(GEX: np.ndarray, CN: np.ndarray, cur_genotypes: np.ndarray,
                  cov_values: np.ndarray)

Filter data arrays and do validity checks.

Arguments:

  • GEX: Gene expression levels
  • CN: Gene copy numbers
  • cur_genotypes: Genotype dosages
  • cov_values: All other covariate values
  • group_check: Whether to check representation of values in the middle and in the tails in genotypes

Returns:

Tuple of:

  • GEX_filtered: Filtered gene expression values
  • CN_filtered: Filtered copy numbers
  • cur_genotypes_filtered: Filtered genotype dosages
  • cov_values_filtered: Filtered covariate values

best_variant_data

def best_variant_data(gene_index: int, transf_variants: pd.DataFrame,
                      quantifications: pd.DataFrame)

For a given gene, find the variant that has the strongest Pearson correlation with the independent variable, together with the data linked to that variant.

Arguments:

  • gene_index: Index of a gene of interest on the quantification file.
  • transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
  • quantifications: Dataframe of quantifications.

Returns:

  • best_variant: ID of the variant with the strongest correlation
  • data_best_corr: Dataframe of data linked with the chosen variant

data_all_variants

def data_all_variants(GEX: np.ndarray, CN: np.ndarray, cov_values: np.ndarray,
                      cur_genotypes: np.ndarray)

Process data for association testing in all-variants mode.

Arguments:

  • GEX: Gene expression levels.
  • CN: Gene copy numbers
  • cov_values: All other covariate values
  • cur_genotypes: Genotype dosages

Returns:

  • Dataframe of filtered regression data.

process_all_variants

def process_all_variants(gene_index: int, transf_variants: pd.DataFrame)

Conduct association testing for all variants in a window instead of selecting only the best-correlated variant. Construct the regression data and then run the regressions.

Arguments:

  • gene_index: Index of a gene of interest on the quantification file.
  • transf_variants: Dataframe of transformed variants that are processed for window and segmentation.

Returns:

  • Dataframe with all association testing results for a gene.

gene_variant_regressions

def gene_variant_regressions(gene_index: int, quantifications: pd.DataFrame,
                             variant: str, regression_data: pd.DataFrame)

Find associations between the expression values of a gene and its variants by performing regressions. Uses ordinary least squares regression, log-likelihood calculations, and a likelihood ratio test to pinpoint the effect of genotypes.

Arguments:

  • gene_index: Index of a gene of interest on the quantification file.
  • quantifications: Dataframe of quantifications.
  • variant: Variant ID
  • regression_data: Regression data for current gene variant pair including covariates

Returns:

  • associations: Dataframe with statistics on the strengths of the associations
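
To make the likelihood ratio idea concrete, here is a minimal sketch under strong simplifying assumptions: the reduced model contains only an intercept, the full model adds the genotype, and copy number and other covariates are left out. The function names are hypothetical and the code is not taken from the package.

```python
import math


def gaussian_loglik(rss, n):
    # Maximised Gaussian log-likelihood of an OLS fit with residual sum of squares rss.
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def lrt_single_predictor(y, x):
    n = len(y)
    mean_y = sum(y) / n
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Reduced model: intercept only. Full model: intercept + genotype.
    rss_reduced = sum((yi - mean_y) ** 2 for yi in y)
    rss_full = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    lr_stat = 2 * (gaussian_loglik(rss_full, n) - gaussian_loglik(rss_reduced, n))
    # Chi-square survival function with 1 degree of freedom: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(lr_stat / 2))
    return slope, lr_stat, p_value


# Toy data: expression rises by about one unit per allele.
expression = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
dosages = [0, 0, 1, 1, 2, 2]
slope, lr_stat, p_value = lrt_single_predictor(expression, dosages)
```

With more predictors the same comparison applies: fit the model with and without the genotype term and compare twice the log-likelihood difference to a chi-square distribution with one degree of freedom.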

calculate_associations

def calculate_associations()

Calculate associations for gene indices using multiprocessing.

Steps:

  1. Initializes the multiprocessing pool with the specified number of cores.
  2. Maps gene indices to the helper function using the pool.
  3. Closes the pool and waits for the processes to complete.
  4. Concatenates the resulting DataFrames from each process into one DataFrame.

Returns:

  • full_associations: A concatenated dataframe containing the association results for all gene indices.
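
The four steps can be sketched as follows. This is an illustration only: `associations_for_gene` is a hypothetical per-gene helper, plain lists of dictionaries stand in for dataframes, and `ThreadPool` is used in place of a process pool so that the snippet runs in any environment. `multiprocessing.Pool` exposes the same `map`, `close`, and `join` methods.

```python
from multiprocessing.pool import ThreadPool


def associations_for_gene(gene_index):
    # Hypothetical per-gene helper: returns a list of result rows.
    return [{"gene_index": gene_index, "variant": f"var{v}"} for v in range(2)]


def calculate_all(gene_indices, num_cores=2):
    pool = ThreadPool(num_cores)                               # 1. initialise the pool
    per_gene = pool.map(associations_for_gene, gene_indices)   # 2. map gene indices
    pool.close()                                               # 3. close the pool ...
    pool.join()                                                #    ... and wait for workers
    # 4. concatenate the per-gene results into one table
    return [row for rows in per_gene for row in rows]


full_associations = calculate_all(range(3))
```

`map` returns results in the order of the input indices, so the concatenated table keeps genes in their original order.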

calculate_associations_helper

def calculate_associations_helper(gene_index: int)

Helper function to calculate associations for a single gene index.

This function performs several steps to calculate the associations for a specific gene index:

  1. Prints the current progress of the calculation.
  2. Determines the start and end positions for the gene window.
  3. Retrieves the variants within the gene window.
  4. Transforms the variants based on a common segment.
  5. Performs regressions to calculate associations.

Arguments:

  • gene_index (int): The index of the gene for which associations are being calculated.

Returns:

  • A dataframe containing the association results for the specified gene index.

fdr_correction

combine_chromosome

def combine_chromosome(outdir: str)

Combine all csv files from the given directory.

Arguments:

  • outdir: Directory to which the mapping results have been saved.

Returns:

  • combined_df: Dataframe with data from all csv files from the folder.
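
As a rough illustration of combining per-chromosome result files, the sketch below reads every `.csv` file in a directory and appends the rows to one list. It assumes all files share the same header; the function name, file names, and columns are made up, and the standard-library `csv` module stands in for dataframe concatenation.

```python
import csv
import glob
import os
import tempfile


def combine_csv_rows(outdir):
    combined = []
    # Sort the paths so the combined order is deterministic.
    for path in sorted(glob.glob(os.path.join(outdir, "*.csv"))):
        with open(path, newline="") as handle:
            combined.extend(csv.DictReader(handle))
    return combined


# Demonstration with two small, made-up result files.
with tempfile.TemporaryDirectory() as outdir:
    for chrom, pval in (("chr1", "0.01"), ("chr2", "0.20")):
        with open(os.path.join(outdir, f"{chrom}.csv"), "w", newline="") as handle:
            writer = csv.writer(handle)
            writer.writerow(["gene", "pval"])
            writer.writerow([f"gene_{chrom}", pval])
    combined_rows = combine_csv_rows(outdir)
```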

fdr

def fdr(outdir: str)

Perform Benjamini-Hochberg false discovery rate correction on the mapping results.

Arguments:

  • outdir: Directory to which the mapping results have been saved.
  • threshold: Cutoff value for fdr correction.

Returns:

  • full_res: Dataframe with all mapping results including a column for fdr corrected p-values.
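
For reference, the Benjamini-Hochberg adjustment itself can be written in a few lines. This is a generic sketch of the procedure operating on a plain list of p-values, not the package's code; the function name is hypothetical.

```python
def benjamini_hochberg(pvalues):
    m = len(pvalues)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
qvals = benjamini_hochberg(pvals)
```

Each adjusted value is the smallest false discovery rate at which the corresponding test would be called significant.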

plotting_utils

box_and_whisker

def box_and_whisker(df: pd.DataFrame, gene_name: str, variant: str,
                    output_folder: str)

Create a box-and-whisker plot with significance bars and Kruskal-Wallis test for grouped data.

Arguments:

  • df: Dataframe containing 'GEX' and 'cur_genotypes' columns.
  • gene_name: Name of the gene that is used for the plot title and file name.
  • output_folder: Path to the folder where the plot should be saved.

statistical_utils

residualize

def residualize(regression_data: pd.DataFrame)

Residualize the GEX and cur_genotypes columns by removing the variance explained by covariates.

Arguments:

  • regression_data: The input dataframe with GEX, cur_genotypes, and covariates.

Returns:

  • Residualized GEX and genotypes.
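
Residualization can be illustrated with a deliberately small sketch that regresses each variable on a single covariate and keeps the residuals. The real function works on a dataframe with several covariates; the single-covariate restriction and the function name are assumptions made to keep the example dependency-free.

```python
def residuals_on_covariate(values, covariate):
    n = len(values)
    mean_v = sum(values) / n
    mean_c = sum(covariate) / n
    scc = sum((c - mean_c) ** 2 for c in covariate)
    svc = sum((c - mean_c) * (v - mean_v) for c, v in zip(covariate, values))
    slope = svc / scc
    # Residual = observed value minus the value fitted from the covariate.
    return [v - mean_v - slope * (c - mean_c) for c, v in zip(covariate, values)]


gex = [2.0, 3.1, 3.9, 5.2]
cur_genotypes = [0, 1, 1, 2]
covariate = [1.0, 2.0, 3.0, 4.0]

gex_resid = residuals_on_covariate(gex, covariate)
genotype_resid = residuals_on_covariate(cur_genotypes, covariate)
```

By construction the residuals sum to zero and are uncorrelated with the covariate, so any remaining correlation between the two residual vectors is not explained by it.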

get_tstat2

def get_tstat2(corr: float, df: int)

Calculate t-statistic squared from correlation and degrees of freedom.

Arguments:

  • corr: Pearson correlation
  • df: Degrees of freedom

Returns:

  • t-statistic squared
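
The standard relation between a Pearson correlation r and the squared t-statistic is t² = df · r² / (1 − r²). A minimal sketch, assuming this is the relation used here (the function name is hypothetical):

```python
def tstat2_from_corr(corr, dof):
    r2 = corr * corr
    # t^2 = dof * r^2 / (1 - r^2)
    return dof * r2 / (1.0 - r2)


example_tstat2 = tstat2_from_corr(0.5, 30)
```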

get_pvalue_from_tstat2

def get_pvalue_from_tstat2(tstat2: float, df: int)

Calculate the p-value from the squared t-statistic and degrees of freedom.

Arguments:

  • tstat2: t-statistic squared
  • df: Degrees of freedom

Returns:

  • p-value

get_pvalue_from_corr

def get_pvalue_from_corr(r2: float, df: int)

Calculate p-value from correlation r2 and degrees of freedom.

Arguments:

  • r2: R² value
  • dof: Degrees of freedom

Returns:

  • p-value

beta_shape1_from_dof

def beta_shape1_from_dof(r2_values: np.ndarray, dof: float)

Estimate Beta shape1 parameter from moment matching.

Arguments:

  • r2_perm: Array of permutation R² values
  • dof: Optimized degrees of freedom
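
Moment matching for a Beta distribution sets the sample mean and variance equal to their theoretical counterparts and solves for the shape parameters. The sketch below shows only that step, applied directly to values in (0, 1); converting permutation R² values to p-values for a given number of degrees of freedom is omitted. The function name and the use of the population variance are assumptions.

```python
import statistics


def beta_shapes_by_moments(values):
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    # For Beta(a, b): mean = a / (a + b), var = mean * (1 - mean) / (a + b + 1).
    common = mean * (1.0 - mean) / var - 1.0  # equals a + b
    return mean * common, (1.0 - mean) * common


shape1, shape2 = beta_shapes_by_moments([0.2, 0.4, 0.6, 0.8])
```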

beta_log_likelihood

def beta_log_likelihood(pvals: np.ndarray, shape1: float, shape2: float)

Negative log-likelihood for the Beta distribution.

Arguments:

  • pvals: Array of permutation p-values
  • shape1: Beta shape parameter 1
  • shape2: Beta shape parameter 2

Returns:

  • The negative log-likelihood of the observed p-values given the specified Beta distribution parameters
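
The negative log-likelihood of independent observations under Beta(shape1, shape2) follows directly from the Beta density. A minimal standard-library sketch (the function name is hypothetical):

```python
import math


def beta_neg_log_likelihood(pvals, shape1, shape2):
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_beta = math.lgamma(shape1) + math.lgamma(shape2) - math.lgamma(shape1 + shape2)
    log_lik = sum(
        (shape1 - 1.0) * math.log(p) + (shape2 - 1.0) * math.log(1.0 - p) - log_beta
        for p in pvals
    )
    return -log_lik


# Beta(1, 1) is the uniform distribution, so every observation has density 1.
nll_uniform = beta_neg_log_likelihood([0.1, 0.5, 0.9], 1.0, 1.0)
# Beta(2, 2) has density 1.5 at p = 0.5.
nll_peak = beta_neg_log_likelihood([0.5], 2.0, 2.0)
```

Minimising this quantity over the shape parameters gives their maximum-likelihood estimates.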

optimize_dof

def optimize_dof(r2_perm: np.ndarray, dof_init: int, tol=1e-4)

Optimize degrees of freedom such that Beta shape1 ≈ 1.

Arguments:

  • r2_perm: Array of permutation R² values
  • dof_init: Initial value of degrees of freedom
  • tol: Tolerance level

Returns:

  • Optimized degrees of freedom

fit_beta_parameters

def fit_beta_parameters(r2_perm: np.ndarray, dof: float)

Fit Beta distribution parameters to permutation p-values.

Arguments:

  • r2_perm: Array of permutation R² values
  • dof: Optimized degrees of freedom

Returns:

  • Beta shape parameters 1 and 2

adjust_p_values

def adjust_p_values(r2_perm: np.ndarray,
                    r2_nominal: float,
                    dof_init=10,
                    tol=1e-4)

Calculate Beta-approximated p-values from permutation results.

Arguments:

  • r2_perm: Array of permutation R² values
  • r2_nominal: The nominal R² value
  • dof_init: Initial value of degrees of freedom
  • tol: Tolerance level

Returns:

  • The permutation adjusted p-value

get_slope

def get_slope(corr: float, phenotype_sd: np.ndarray, genotype_sd: np.ndarray)

Calculate the regression slope from the correlation and the standard deviations of phenotypes and genotypes.

Arguments:

  • corr: Pearson correlation
  • phenotype_sd: Standard deviation of phenotypes
  • genotype_sd: Standard deviation of genotypes

Returns:

  • slope
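
For simple linear regression, the slope equals the correlation scaled by the ratio of standard deviations. A one-line sketch, assuming that is the relation used here (the function name is hypothetical):

```python
def slope_from_corr(corr, phenotype_sd, genotype_sd):
    # slope = r * sd(phenotype) / sd(genotype)
    return corr * phenotype_sd / genotype_sd


example_slope = slope_from_corr(0.5, 2.0, 0.5)
```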

calculate_slope_and_se

def calculate_slope_and_se(regression_data: pd.DataFrame, corr: float)

Calculate the slope and its standard error.

Arguments:

  • regression_data: A dataframe with residualized "GEX" and "cur_genotypes" columns.
  • corr: The correlation between residualized "GEX" and "cur_genotypes".

Returns:

  • slope: The slope of the linear relationship.
  • slope_se: The standard error of the slope.
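
One common way to obtain the standard error from the correlation is to divide the absolute slope by the t-statistic, with t² = dof · r² / (1 − r²). The sketch below assumes that approach and takes the degrees of freedom as an explicit argument, since the value used by the package (for example, after accounting for covariates) is not stated here. Plain lists stand in for the dataframe columns and all function names are hypothetical.

```python
import math
import statistics


def pearson_corr(x, y):
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def slope_and_se(gex, genotypes, corr, dof):
    slope = corr * statistics.stdev(gex) / statistics.stdev(genotypes)
    tstat2 = dof * corr ** 2 / (1.0 - corr ** 2)
    # |slope| / |t| is the standard error of the slope.
    return slope, abs(slope) / math.sqrt(tstat2)


gex = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]
genotypes = [0, 0, 1, 1, 2, 2]
corr = pearson_corr(genotypes, gex)
slope, slope_se = slope_and_se(gex, genotypes, corr, dof=len(gex) - 2)
```

With `dof = n - 2` this reproduces the textbook standard error of a simple linear regression slope, sqrt(RSS / ((n − 2) · Sxx)).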

calculate_pvalue

def calculate_pvalue(df: pd.DataFrame, corr: float)

Calculate the p-value using the residualized data and correlation.

Arguments:

  • df: A dataframe with residualized "GEX" and "cur_genotypes" columns.
  • corr: The correlation between residualized "GEX" and "cur_genotypes".

Returns:

  • pval: The p-value for testing whether the slope is different from 0.