class Cis()
def load_and_validate_file(file_path: str, index_col: int)
Load a CSV file and validate its existence and content.
Arguments:
- file_path: Path to the CSV file
- index_col: Position of the column to use as the row index
Returns:
- Dataframe from contents of the CSV file
Raises:
- FileNotFoundError: If the file does not exist at the given path.
- ValueError: If the CSV file is empty (i.e., has no rows).
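A minimal sketch of the checks described above, assuming the file is read with pandas and that `index_col` is passed straight through to `pandas.read_csv`. The function body is illustrative, not the class's actual implementation.

```python
import os

import pandas as pd


def load_and_validate_file(file_path: str, index_col: int) -> pd.DataFrame:
    """Load a CSV file, failing early if it is missing or has no rows."""
    if not os.path.isfile(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")
    # Assumption: index_col selects the column used as the row index.
    df = pd.read_csv(file_path, index_col=index_col)
    if df.empty:
        raise ValueError(f"File has no rows: {file_path}")
    return df
```

A header-only file parses to an empty dataframe and therefore triggers the `ValueError` branch.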
def start_end_gene_window(gene_index: int)
Find the start and end positions of the window for a given gene.
Arguments:
- gene_index: Index of the desired gene on the quantification file
Returns:
- Tuple of window_start and window_end, which define the start and end positions of the window
def get_variants_for_gene_window(current_start: int, current_end: int)
Find all the variants inside a window of a gene.
Arguments:
- current_start: Start position of a window
- current_end: End position of a window
Returns:
- variants: Subset of genotype dataframe that contains only those variants that are inside the given window
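The window filter can be sketched as a boolean mask over a position column. The column name `POS`, the inclusive bounds, and the function name `variants_in_window` are assumptions made for illustration; the actual genotype layout is not specified here.

```python
import pandas as pd


def variants_in_window(genotypes: pd.DataFrame, start: int, end: int,
                       pos_col: str = "POS") -> pd.DataFrame:
    """Return the rows whose position lies in the closed interval [start, end]."""
    # Series.between is inclusive on both ends by default.
    in_window = genotypes[pos_col].between(start, end)
    return genotypes.loc[in_window]
```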
def gene_variants_common_segment(start: int, end: int, variants: pd.DataFrame)
Filter variants to ensure that the gene and the variants in the same window are also on the same segment.
Arguments:
- start: Start position of a window
- end: End position of a window
- variants: Subset of the genotype file containing only the variants that are in the same window as the gene of interest
Returns:
- variants: Subset of genotype dataframe that is filtered and masked by segmentation and window.
def gene_variant_regressions_permutations(gene_index: int,
transf_variants: pd.DataFrame,
variant: str,
regression_data: pd.DataFrame)
Perform permutations to obtain adjusted p-values. With 0 permutations, only the nominal pass is run.
Arguments:
- gene_index: Index of a gene of interest on the quantification file.
- transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
- variant: Variant id
- regression_data: Dataframe with current gene expression levels, genotypes, and covariates
Returns:
- actual_associations: Dataframe of association testing results for a gene. When more than 0 permutations are used, adjusted p-values are also provided.
def permutation_data(gene_index: int, perm_index: int,
transf_variants: pd.DataFrame, variant: str)
Find data for association testing for permutations. All independent variable values (genotypes) are kept fixed; only the phenotype levels are permuted.
Arguments:
- gene_index: Index of the actual gene on the quantification file.
- perm_index: Index of a gene on the quantification file that is used for permutation.
- transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
- variant: Variant ID
Returns:
- perm_gex, perm_genotypes: Arrays of residualized permuted phenotype levels and residualized fixed genotype values
def check_grouping(cur_genotypes_filtered: np.ndarray)
Find if the genotype dosages have adequate variation in the data.
Arguments:
- cur_genotypes_filtered: Array of genotype dosages
Returns:
- Boolean value showing if there are enough instances in the different genotype groups.
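One way such a check could look is sketched below. The rounding of dosages to the nearest integer group and both thresholds (`min_per_group`, `min_groups`) are hypothetical choices for illustration; the criteria actually used by `check_grouping` are not documented here.

```python
import numpy as np


def has_enough_groups(dosages: np.ndarray, min_per_group: int = 3,
                      min_groups: int = 2) -> bool:
    """Check that enough genotype groups are each represented by enough samples."""
    # Assumption: dosages are rounded to the nearest integer group (0, 1, 2).
    groups = np.rint(dosages).astype(int)
    _, counts = np.unique(groups, return_counts=True)
    return bool(np.sum(counts >= min_per_group) >= min_groups)
```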
def filter_arrays(GEX: np.ndarray, CN: np.ndarray, cur_genotypes: np.ndarray,
cov_values: np.ndarray)
Filter data arrays and do validity checks.
Arguments:
- GEX: Gene expression levels
- CN: Gene copy numbers
- cur_genotypes: Genotype dosages
- cov_values: All other covariate values
- group_check: Whether to check representation of values in the middle and in the tails in genotypes
Returns:
Tuple of:
- GEX_filtered: Filtered gene expression values
- CN_filtered: Filtered copy numbers
- cur_genotypes_filtered: Filtered genotype dosages
- cov_values_filtered: Filtered covariate values
def best_variant_data(gene_index: int, transf_variants: pd.DataFrame,
quantifications: pd.DataFrame)
For a gene, find the variant that has the strongest Pearson correlation with the independent variable, together with its linked data.
Arguments:
- gene_index: Index of a gene of interest on the quantification file.
- transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
- quantifications: Dataframe of quantifications.
Returns:
- best_variant: ID of the variant with the strongest correlation
- data_best_corr: Dataframe of data linked with the chosen variant
def data_all_variants(GEX: np.ndarray, CN: np.ndarray, cov_values: np.ndarray,
cur_genotypes: np.ndarray)
Process data for association testing in all-variants mode.
Arguments:
- GEX: Gene expression levels.
- CN: Gene copy numbers
- cov_values: All other covariate values
- cur_genotypes: Genotype dosages
Returns:
- Dataframe of filtered regression data.
def process_all_variants(gene_index: int, transf_variants: pd.DataFrame)
Conduct association testing for all variants in a window instead of selecting only the best-correlated variant. Construct the regression data and then run the regressions.
Arguments:
- gene_index: Index of a gene of interest on the quantification file.
- transf_variants: Dataframe of transformed variants that are processed for window and segmentation.
Returns:
- Dataframe with all association testing results for a gene.
def gene_variant_regressions(gene_index: int, quantifications: pd.DataFrame,
variant: str, regression_data: pd.DataFrame)
Find associations between the expression values of a gene and variants by performing regressions. Uses ordinary least squares regression, log-likelihood calculations, and a likelihood ratio test to pinpoint the effect of genotypes.
Arguments:
- gene_index: Index of a gene of interest on the quantification file.
- quantifications: Dataframe of quantifications.
- variant: Variant ID
- regression_data: Regression data for the current gene-variant pair, including covariates
Returns:
- Associations dataframe with statistics on the strengths of the associations
def calculate_associations()
Calculate associations for gene indices using multiprocessing.
Steps:
- Initializes the multiprocessing pool with the specified number of cores.
- Maps gene indices to the helper function using the pool.
- Closes the pool and waits for the processes to complete.
- Concatenates the resulting DataFrames from each process into one DataFrame.
Returns:
- full_associations: A concatenated dataframe containing the association results for all gene indices.
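The steps above can be sketched with `multiprocessing.Pool`. The standalone function signature, the `gene_indices` and `n_cores` parameters, and the placeholder `associations_for_gene` worker are assumptions for illustration; in the class these values presumably come from instance state.

```python
from multiprocessing import Pool

import pandas as pd


def associations_for_gene(gene_index: int) -> pd.DataFrame:
    """Hypothetical stand-in for the per-gene helper; returns a one-row result."""
    return pd.DataFrame({"gene_index": [gene_index], "pvalue": [1.0]})


def calculate_associations(gene_indices, n_cores: int) -> pd.DataFrame:
    """Map gene indices over a worker pool and concatenate the per-gene results."""
    pool = Pool(processes=n_cores)
    try:
        per_gene = pool.map(associations_for_gene, gene_indices)
    finally:
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
    return pd.concat(per_gene, ignore_index=True)
```

On platforms that start workers with `spawn`, the call to `calculate_associations` should sit under an `if __name__ == "__main__":` guard.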
def calculate_associations_helper(gene_index: int)
Helper function to calculate associations for a single gene index.
This function performs several steps to calculate the associations for a specific gene index:
- Prints the current progress of the calculation.
- Determines the start and end positions for the gene window.
- Retrieves the variants within the gene window.
- Transforms the variants based on a common segment.
- Performs regressions to calculate associations.
Arguments:
- gene_index (int): The index of the gene for which associations are being calculated.
Returns:
- A dataframe containing the association results for the specified gene index.
def combine_chromosome(outdir: str)
Combine all CSV files from the given directory.
Arguments:
- outdir: Directory to which the mapping results have been saved.
Returns:
- combined_df: Dataframe with data from all CSV files in the folder.
def fdr(outdir: str)
Apply Benjamini-Hochberg false discovery rate correction to the mapping results.
Arguments:
- outdir: Directory to which the mapping results have been saved.
- threshold: Cutoff value for fdr correction.
Returns:
- full_res: Dataframe with all mapping results including a column for fdr corrected p-values.
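For reference, the Benjamini-Hochberg adjustment itself can be sketched in a few lines of NumPy. The function name `benjamini_hochberg` is hypothetical, and this sketch works on a plain array of p-values rather than on the result files that `fdr` reads.

```python
import numpy as np


def benjamini_hochberg(pvalues) -> np.ndarray:
    """Return BH-adjusted p-values in the same order as the input."""
    p = np.asarray(pvalues, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted
```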
def box_and_whisker(df: pd.DataFrame, gene_name: str, variant: str,
output_folder: str)
Create a box-and-whisker plot with significance bars and a Kruskal-Wallis test for grouped data.
Arguments:
- df: Dataframe containing 'GEX' and 'cur_genotypes' columns.
- gene_name: Name of the gene that is used for the plot title and file name.
- variant: Variant ID
- output_folder: Path to the folder where the plot should be saved.
def residualize(regression_data: pd.DataFrame)
Residualize the GEX and cur_genotypes columns by removing the variance explained by covariates.
Arguments:
- regression_data: The input dataframe with GEX, cur_genotypes, and covariates.
Returns:
- Residualized GEX and genotypes.
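Residualization can be illustrated as an ordinary least squares fit on the covariates followed by subtraction of the fitted values. The array-based helper `residualize_array` and the added intercept column are assumptions; the documented method works on a dataframe and may differ in detail.

```python
import numpy as np


def residualize_array(y: np.ndarray, covariates: np.ndarray) -> np.ndarray:
    """Remove from y the part that is linearly explained by the covariates."""
    # Assumption: an intercept column is added to the covariate matrix.
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef
```

Applying the same helper to both the expression values and the genotype dosages gives the two residualized vectors that the later correlation-based statistics operate on.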
def get_tstat2(corr: float, df: int)
Calculate t-statistic squared from correlation and degrees of freedom.
Arguments:
- corr: Pearson correlation
- df: Degrees of freedom
Returns:
- t-statistic squared
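The standard identity relating a Pearson correlation r to the squared t-statistic is t² = df · r² / (1 − r²). A minimal sketch follows; the function name `tstat2_from_corr` is illustrative.

```python
def tstat2_from_corr(corr: float, df: int) -> float:
    """Squared t-statistic for a Pearson correlation with df degrees of freedom."""
    r2 = corr * corr
    return df * r2 / (1.0 - r2)
```

For example, a correlation of 0.5 with 30 degrees of freedom gives 30 · 0.25 / 0.75 = 10.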
def get_pvalue_from_tstat2(tstat2: float, df: int)
Calculate the p-value from the squared t-statistic and degrees of freedom.
Arguments:
- tstat2: t-statistic squared
- df: Degrees of freedom
Returns:
- p-value
def get_pvalue_from_corr(r2: float, df: int)
Calculate the p-value from the squared correlation (R²) and degrees of freedom.
Arguments:
- r2: R² value
- df: Degrees of freedom
Returns:
- p-value
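A sketch of the R² to p-value chain, assuming a two-sided test against Student's t distribution from SciPy. The function name `pvalue_from_r2` is hypothetical, and the documented methods may compute the tail probability differently.

```python
import math

from scipy import stats


def pvalue_from_r2(r2: float, df: int) -> float:
    """Two-sided p-value for a squared correlation with df degrees of freedom."""
    tstat2 = df * r2 / (1.0 - r2)
    # Two-sided tail probability of |t| under Student's t distribution.
    return float(2.0 * stats.t.sf(math.sqrt(tstat2), df))
```

Because t² with df degrees of freedom follows an F(1, df) distribution, the same value can be obtained from `stats.f.sf(tstat2, 1, df)`.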
def beta_shape1_from_dof(r2_values: np.ndarray, dof: float)
Estimate the Beta shape1 parameter by moment matching.
Arguments:
- r2_values: Array of permutation R² values
- dof: Optimized degrees of freedom
Returns:
- Estimated Beta shape1 parameter
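The moment-matching step can be illustrated on its own. For a Beta distribution with mean m and variance v, the first shape parameter is m · (m(1 − m)/v − 1). The sketch below takes p-values directly; converting permutation R² values to p-values for a given `dof` is assumed to happen beforehand, and the helper name is hypothetical.

```python
import numpy as np


def beta_shape1_from_moments(pvals: np.ndarray) -> float:
    """Method-of-moments estimate of the first Beta shape parameter."""
    mean = np.mean(pvals)
    var = np.var(pvals)
    return float(mean * (mean * (1.0 - mean) / var - 1.0))
```

For uniformly distributed p-values the estimate is close to 1, which is the target used when the degrees of freedom are tuned.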
def beta_log_likelihood(pvals: np.ndarray, shape1: float, shape2: float)
Negative log-likelihood for the Beta distribution.
Arguments:
- pvals: Array of permutation p-values
- shape1: Beta shape parameter 1
- shape2: Beta shape parameter 2
Returns:
- The negative log-likelihood of the observed p-values given the specified Beta distribution parameters
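Written out, the negative log-likelihood of n p-values under Beta(a, b) is (1 − a) Σ log p + (1 − b) Σ log(1 − p) + n · log B(a, b). A standard-library sketch, with an illustrative function name:

```python
import math


def beta_negative_log_likelihood(pvals, shape1: float, shape2: float) -> float:
    """Negative log-likelihood of the p-values under Beta(shape1, shape2)."""
    # log B(a, b) = lgamma(a) + lgamma(b) - lgamma(a + b)
    log_beta = (math.lgamma(shape1) + math.lgamma(shape2)
                - math.lgamma(shape1 + shape2))
    sum_log_p = sum(math.log(p) for p in pvals)
    sum_log_1mp = sum(math.log(1.0 - p) for p in pvals)
    return ((1.0 - shape1) * sum_log_p
            + (1.0 - shape2) * sum_log_1mp
            + len(pvals) * log_beta)
```

With both shape parameters equal to 1 the Beta distribution is uniform, so the value is 0 for any input.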
def optimize_dof(r2_perm: np.ndarray, dof_init: int, tol=1e-4)
Optimize degrees of freedom such that Beta shape1 ≈ 1.
Arguments:
- r2_perm: Array of permutation R² values
- dof_init: Initial value of degrees of freedom
- tol: Tolerance level
Returns:
- Optimized degrees of freedom
def fit_beta_parameters(r2_perm: np.ndarray, dof: float)
Fit Beta distribution parameters to permutation p-values.
Arguments:
- r2_perm: Array of permutation R² values
- dof: Optimized degrees of freedom
Returns:
- Beta shape parameters 1 and 2
def adjust_p_values(r2_perm: np.ndarray,
r2_nominal: float,
dof_init=10,
tol=1e-4)
Calculate Beta-approximated p-values from permutation results.
Arguments:
- r2_perm: Array of permutation R² values
- r2_nominal: The nominal R² value
- dof_init: Initial value of degrees of freedom
- tol: Tolerance level
Returns:
- The permutation adjusted p-value
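The final step of a Beta approximation can be sketched as evaluating the fitted Beta CDF at the nominal p-value. This simplified sketch swaps in method-of-moments estimates for both shape parameters, in place of the degrees-of-freedom optimization and likelihood fit described above. It also takes p-values rather than R² values. The function name is hypothetical.

```python
import numpy as np
from scipy import stats


def beta_adjusted_pvalue(perm_pvals: np.ndarray, nominal_pval: float) -> float:
    """Adjusted p-value from a Beta fit to the permutation p-values."""
    mean = np.mean(perm_pvals)
    var = np.var(perm_pvals)
    # Method-of-moments estimates (a simplification, not a likelihood fit).
    common = mean * (1.0 - mean) / var - 1.0
    shape1 = mean * common
    shape2 = (1.0 - mean) * common
    return float(stats.beta.cdf(nominal_pval, shape1, shape2))
```

When the permutation p-values are uniform, the adjusted value stays close to the nominal one. When they are skewed towards 0, the adjusted value is larger.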
def get_slope(corr: float, phenotype_sd: np.ndarray, genotype_sd: np.ndarray)
Calculate the regression slope from the correlation and the standard deviations of the phenotypes and genotypes.
Arguments:
- corr: Pearson correlation
- phenotype_sd: Standard deviation of phenotypes
- genotype_sd: Standard deviation of genotypes
Returns:
- slope
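For a simple linear regression, the slope equals the correlation multiplied by the ratio of the standard deviations, slope = r · sd(phenotype) / sd(genotype). A minimal sketch with an illustrative function name:

```python
import numpy as np


def slope_from_corr(corr: float, phenotype_sd: float, genotype_sd: float) -> float:
    """Regression slope of phenotype on genotype from their correlation and SDs."""
    return corr * phenotype_sd / genotype_sd


# Usage with made-up numbers: dosages x and expression values y.
x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 2.0])
y = np.array([1.0, 2.1, 2.9, 1.8, 1.2, 3.3])
slope = slope_from_corr(np.corrcoef(x, y)[0, 1], y.std(), x.std())
```

The result agrees with the slope of an ordinary least squares fit of `y` on `x`.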
def calculate_slope_and_se(regression_data: pd.DataFrame, corr: float)
Calculate the slope and its standard error.
Arguments:
- regression_data: A dataframe with residualized "GEX" and "cur_genotypes" columns.
- corr: The correlation between residualized "GEX" and "cur_genotypes".
Returns:
- slope: The slope of the linear relationship.
- slope_se: The standard error of the slope.
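Since t = slope / SE, the standard error can be recovered as |slope| / sqrt(t²). The sketch below works on plain arrays and takes the degrees of freedom as an explicit argument, both of which are assumptions; the documented method receives a dataframe and a precomputed correlation.

```python
import math

import numpy as np


def slope_and_se(x: np.ndarray, y: np.ndarray, df: int):
    """Slope of y on x and its standard error via the correlation identity."""
    corr = np.corrcoef(x, y)[0, 1]
    slope = corr * np.std(y) / np.std(x)
    tstat2 = df * corr ** 2 / (1.0 - corr ** 2)
    slope_se = abs(slope) / math.sqrt(tstat2)
    return float(slope), float(slope_se)
```

With no covariates, `df` is the sample size minus 2; when covariates have been regressed out, the degrees of freedom would need to be reduced accordingly.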
def calculate_pvalue(df: pd.DataFrame, corr: float)
Calculate the p-value using the residualized data and correlation.
Arguments:
- df: A dataframe with residualized "GEX" and "cur_genotypes" columns.
- corr: The correlation between residualized "GEX" and "cur_genotypes".
Returns:
- pval: The p-value for testing whether the slope is different from 0.