The results of GWAS are represented as summary statistics, with the following columns:
- SNP - marker name, for example rs#
- CHR - chromosome label
- BP - base-pair position
- A1 - effect allele for Z and BETA columns
- A2 - other allele
- N - sample size
- CaseN, ControlN - sample size for cases and controls (logistic regression only)
- FRQ - frequency of A1 allele
- Z - z-score (or t-score) of association
- BETA - effect size; for logistic regression, this contains log(OR)
- SE - standard error of the BETA column
- L95, U95 - lower and upper 95% confidence interval of the BETA
- P - p-value
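The signed and uncertainty columns above are related to each other. The sketch below assumes the usual Wald relationships (Z = BETA / SE, a symmetric 95% interval of BETA ± 1.96·SE, and a two-sided normal p-value). This is an assumption for illustration only: plink2 and regenie may use a t-distribution or other tests, so reported values can differ slightly. The function names are hypothetical and are not part of the pipeline.

```python
import math

# 97.5th percentile of the standard normal distribution
Z_975 = 1.959964


def z_score(beta, se):
    """Wald statistic, assuming Z = BETA / SE."""
    return beta / se


def ci95(beta, se):
    """Symmetric 95% interval on the BETA scale (assumed form of L95, U95)."""
    return beta - Z_975 * se, beta + Z_975 * se


def p_value(z):
    """Two-sided p-value under a standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

Such helpers can serve as a sanity check on a row, for example by comparing `z_score(BETA, SE)` with the reported Z column.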
For SNP, CHR, BP, A1 and A2 columns the scripts/gwas/gwas.py script will simply copy over the information from the genetic file, i.e. from .bgen or .bim files. This means that SNP is likely to be a dbSNP rs#, or some other form of identifier such as CHR:BP:A1:A2.
For CHR and BP, we don't enforce a specific genomic build; it depends on which build was used for the genotype data.
Finally, A1 and A2 are not guaranteed to be minor or major alleles, but A1 will be used as the effect allele for signed summary statistics (i.e. Z and BETA columns).
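Because A1 is the effect allele, the signed columns need to be flipped when summary statistics are compared against a reference that lists the alleles in the opposite order. The following is a minimal sketch of that idea. `align_to_reference` is a hypothetical helper, and it deliberately ignores strand flips and ambiguous (A/T, C/G) markers.

```python
def align_to_reference(a1, a2, beta, z, frq, ref_a1, ref_a2):
    """Return (beta, z, frq) expressed with ref_a1 as the effect allele.

    Assumes alleles are reported on the same strand in both sources.
    """
    if (a1, a2) == (ref_a1, ref_a2):
        # Same orientation: nothing to change.
        return beta, z, frq
    if (a1, a2) == (ref_a2, ref_a1):
        # Swapped orientation: signed statistics change sign,
        # and the frequency refers to the other allele.
        return -beta, -z, 1.0 - frq
    raise ValueError("alleles do not match the reference")
```

For example, a row with A1=A, A2=G and BETA=0.1 becomes BETA=-0.1 when the reference uses G as the effect allele.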
The sample size N is as reported by the software (plink2 or regenie). For case-control traits, this appears to be a sum of cases and controls (not the effective sample size which would take into account imbalance between cases and controls).
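For comparison, one commonly used definition of the effective sample size for a case-control trait is 4 / (1/CaseN + 1/ControlN). The sketch below shows how it differs from the plain sum. The pipeline does not compute this value, and `effective_n` is a hypothetical helper name.

```python
def effective_n(case_n, control_n):
    """Effective sample size, assuming the 4 / (1/Ncase + 1/Ncontrol) definition."""
    return 4.0 / (1.0 / case_n + 1.0 / control_n)


def total_n(case_n, control_n):
    """Plain sum of cases and controls, which is what N appears to hold."""
    return case_n + control_n
```

With 1000 cases and 9000 controls, the plain sum is 10000 while the effective sample size under this definition is about 3600. For a balanced design the two values coincide.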
L95 and U95 columns are only provided for plink2 results.
CaseN and ControlN columns are only provided for plink2 results for logistic regression.
If you need these columns for a regenie analysis, consider also running a plink2 analysis and copying the columns over into your regenie output.
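Copying columns between the two outputs could look like the sketch below. It assumes both results have already been loaded as lists of dictionaries (for example with `csv.DictReader`) and that the SNP column identifies the same marker in both files. `copy_columns` is a hypothetical helper, not part of scripts/gwas/gwas.py.

```python
def copy_columns(regenie_rows, plink2_rows, columns=("CaseN", "ControlN"), key="SNP"):
    """Return regenie rows with the selected plink2 columns attached.

    Markers that are absent from the plink2 results get "NA".
    """
    lookup = {row[key]: row for row in plink2_rows}
    merged = []
    for row in regenie_rows:
        out = dict(row)  # copy, so the input rows are left unchanged
        source = lookup.get(row[key])
        for col in columns:
            if source is not None and col in source:
                out[col] = source[col]
            else:
                out[col] = "NA"
        merged.append(out)
    return merged
```

L95 and U95 can be passed in `columns` as well, but note that they would then describe the plink2 estimate rather than the regenie one.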
- CoMorMent: this file
- LDSC: https://github.com/precimed/ldsc/blob/master/munge_sumstats.py
- BioPsyk: https://github.com/BioPsyk/cleansumstats/blob/dev/assets/schemas/cleaned-sumstats.yaml
- NORMENT: https://github.com/precimed/python_convert/blob/master/sumstats_utils.py
| CoMorMent | daner | LDSC | BioPsyk | NORMENT | Description |
|---|---|---|---|---|---|
| missing | ? | missing | 0 | missing | good idea to provide this column, referencing a line in the .bim file |
| CHR | CHR | CHR | CHR | CHR | OK |
| BP | BP | BP | POS | BP | keep BP which is more informative ("POS" could also stand for genomic position) |
| SNP | SNP | SNP | RSID | SNP | keep SNP which makes more sense as we copy over marker name from genetic file |
| A1 | A1 | A1 | EffectAllele | A1 | keep A1 for consistency with LDSC even though EffectAllele is more informative |
| A2 | A2 | A2 | OtherAllele | A2 | keep A2 for consistency with LDSC even though OtherAllele is more informative |
| P | P | P | P | PVAL | OK |
| SE | SE | SE | SE | SE | OK |
| L95 | ? | missing | ORL95 | missing | keep "L95" as confidence interval may also be for the BETA or LOG(OR) |
| U95 | ? | missing | ORU95 | missing | keep "U95" |
| N | ? | N | N | N | OK |
| CaseN | Nca | N_CAS | CaseN | NCASE | OK |
| ControlN | Nco | N_CON | ControlN | NCONTROL | OK |
| INFO | INFO | INFO | INFO | INFO | OK |
| Direction | Direction | missing | Direction | DIRECTION | OK |
| BETA | BETA or OR | BETA | B | BETA or OR | keep "BETA" for consistency with LDSC (and also BETA is more informative) |
| Z | ? | Z | Z | Z | OK |
| FRQ | FRQ_A_NNN | FRQ | EAF | FRQ | keep "FRQ" which makes more sense for non-EUR populations |
| missing | ? | missing | EAF_1KG | missing | not needed |
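The table above can be turned into a simple header translation. The sketch below only covers the unambiguous one-to-one renames. Entries such as "BETA or OR", FRQ_A_NNN and ORL95/ORU95 are left out on purpose because they may need a pattern match or a change of scale, not just a rename. `RENAME_TO_COMORMENT` and `rename_header` are hypothetical names used for illustration.

```python
# Column renames taken from the comparison table (other format -> CoMorMent).
RENAME_TO_COMORMENT = {
    "daner": {"Nca": "CaseN", "Nco": "ControlN"},
    "LDSC": {"N_CAS": "CaseN", "N_CON": "ControlN"},
    "BioPsyk": {
        "POS": "BP",
        "RSID": "SNP",
        "EffectAllele": "A1",
        "OtherAllele": "A2",
        "B": "BETA",
        "EAF": "FRQ",
    },
    "NORMENT": {
        "PVAL": "P",
        "NCASE": "CaseN",
        "NCONTROL": "ControlN",
        "DIRECTION": "Direction",
    },
}


def rename_header(header, source):
    """Translate a list of column names from `source` naming to CoMorMent naming."""
    mapping = RENAME_TO_COMORMENT[source]
    # Columns without an entry in the mapping are kept unchanged.
    return [mapping.get(name, name) for name in header]
```

For example, `rename_header(["CHR", "POS", "RSID", "B"], "BioPsyk")` gives `["CHR", "BP", "SNP", "BETA"]`.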
v0.9 - first version of this document