8 changes: 8 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -692,6 +692,14 @@ The earth sciences folder contain subfolders for different data formats encounte
- 1000GP.chr*.chunks.txt: chunks of the chromosome obtain with GLIMPSE_chunk
- AFR.gwas: Study locus file. From [SuShiE](https://github.com/mancusolab/sushie).
- AFR.ld: LD matrix file. From [SuShiE](https://github.com/mancusolab/sushie).
- hdl/reference/chr1.1_toy.bim: Synthetic toy HDL-format BIM sidecar for chunk 1.1, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.1_toy.rda: Synthetic toy HDL-format LD reference payload for chunk 1.1, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.2_toy.bim: Synthetic toy HDL-format BIM sidecar for chunk 1.2, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/chr1.2_toy.rda: Synthetic toy HDL-format LD reference payload for chunk 1.2, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/toy_snp_counter.RData: Synthetic toy HDL-format SNP count metadata, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- hdl/reference/toy_snp_list.RData: Synthetic toy HDL-format SNP list metadata, generated by `hdl/generate_toy_hdl_data.R` for HDL-compatible inputs.
- sumstats/trait1_canonical.tsv: Synthetic canonical toy summary statistics for trait 1, generated by `hdl/generate_toy_hdl_data.R` for small GWAS-style module inputs.
- sumstats/trait2_canonical.tsv: Synthetic canonical toy summary statistics for trait 2, generated by `hdl/generate_toy_hdl_data.R` for small GWAS-style module inputs.
- svsig:

- NA03697B2_new.pbmm2.repeats.svsig.gz: structural variant file for NA03697B2_new.pbmm2.repeats.bam, created with PBSV discover version (2.9.0 default settings)
41 changes: 41 additions & 0 deletions data/genomics/homo_sapiens/popgen/hdl/README.md
@@ -0,0 +1,41 @@
# HDL Toy Test Dataset

These files are synthetic toy fixtures for testing the HDL module added in the
companion `nf-core/modules` work (`nf-core/modules#10912`). They exist only to
exercise [HDL](https://github.com/zhenin/HDL) inputs in tests. They are neither
a scientific LD reference panel nor a redistributed upstream reference bundle.

## Layout

- `reference/`: toy HDL LD reference chunks and metadata sidecars
- `../sumstats/`: canonical toy summary-statistics tables aligned to the toy SNPs

## Regeneration

From this directory:

```bash
Rscript generate_toy_hdl_data.R
```

From the root of the `nf-core/test-datasets` worktree:

```bash
Rscript data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
```

## R Objects

The `.bim` sidecars, both canonical `../sumstats/*.tsv` files, and the R binary
payloads are all generated locally by `generate_toy_hdl_data.R` in this
directory, from fully synthetic constants hard-coded in that script.

- `reference/chr1.1_toy.rda` and `reference/chr1.2_toy.rda` each contain
synthetic `LDsc`, `lam`, and `V` objects for one toy HDL chunk.
- `reference/toy_snp_counter.RData` contains `nsnps.list` and
`nsnps.list.imputed`, each as a named one-element list with the toy chunk SNP
counts.
- `reference/toy_snp_list.RData` contains `snps.list.imputed.vector`, the four
synthetic SNP IDs shared by the toy fixtures.
- `../sumstats/trait1_canonical.tsv` and `../sumstats/trait2_canonical.tsv` are
tiny canonical summary-statistics tables keyed to those synthetic SNP IDs.
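
As a rough illustration of what "keyed to those synthetic SNP IDs" means, the
sketch below checks that the SNP IDs in the two `.bim` sidecars match the IDs in
a summary-statistics table. It is not part of the fixture generator. To stay
self-contained it recreates the toy file contents in a scratch directory
(`workdir` is a hypothetical name), and it assumes the TSV is tab-separated with
the SNP ID in the first column.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Recreate the toy fixtures in a scratch directory so the sketch runs anywhere.
# In the repository, point these paths at reference/ and ../sumstats/ instead.
workdir="$(mktemp -d)"
printf '1 rs1 0 101 A G\n1 rs2 0 102 C T\n' > "$workdir/chr1.1_toy.bim"
printf '1 rs3 0 201 A C\n1 rs4 0 202 G T\n' > "$workdir/chr1.2_toy.bim"
{
  printf 'SNP\tA1\tA2\tCHR\tPOS\tRSID\tEffectAllele\tOtherAllele\tN\tZ\n'
  printf 'rs1\tA\tG\t1\t101\trs1\tA\tG\t10000\t0.5\n'
  printf 'rs2\tC\tT\t1\t102\trs2\tC\tT\t10000\t-0.2\n'
  printf 'rs3\tA\tC\t1\t201\trs3\tA\tC\t10000\t0.4\n'
  printf 'rs4\tG\tT\t1\t202\trs4\tG\tT\t10000\t-0.1\n'
} > "$workdir/trait1_canonical.tsv"

# SNP IDs are column 2 of a BIM file and column 1 of the TSV (header skipped).
bim_ids="$(cat "$workdir"/chr1.*_toy.bim | awk '{print $2}' | sort)"
tsv_ids="$(awk -F'\t' 'NR > 1 {print $1}' "$workdir/trait1_canonical.tsv" | sort)"

if [ "$bim_ids" = "$tsv_ids" ]; then
  status="match"
else
  status="mismatch"
fi
echo "SNP IDs: $status"

rm -rf "$workdir"
```

The comparison works on sorted ID lists, so it does not depend on row order in
either file.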
110 changes: 110 additions & 0 deletions data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
@@ -0,0 +1,110 @@
#!/usr/bin/env Rscript

# Locate this script on disk so that outputs are written relative to it,
# regardless of the caller's working directory.
args <- commandArgs(trailingOnly = FALSE)
file_arg <- "--file="
script_path <- sub(file_arg, "", args[grep(file_arg, args)])

if (length(script_path) != 1 || script_path == "") {
  stop("Unable to determine the script path from commandArgs().")
}

script_dir <- dirname(normalizePath(script_path))
reference_dir <- file.path(script_dir, "reference")
sumstats_dir <- file.path(script_dir, "..", "sumstats")

dir.create(reference_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(sumstats_dir, recursive = TRUE, showWarnings = FALSE)

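# BIM sidecar for chunk 1.1. Columns: chromosome, SNP ID, genetic distance,
# base-pair position, allele 1, allele 2.
writeLines(
  c(
    "1 rs1 0 101 A G",
    "1 rs2 0 102 C T"
  ),
  file.path(reference_dir, "chr1.1_toy.bim")
)

# BIM sidecar for chunk 1.2.
writeLines(
  c(
    "1 rs3 0 201 A C",
    "1 rs4 0 202 G T"
  ),
  file.path(reference_dir, "chr1.2_toy.bim")
)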

# Synthetic LD objects (LDsc, lam, V) for chunk 1.1.
lam <- c(1.3, 0.85)
LDsc <- c(1.1, 1.4)
V <- diag(2)
save(
  LDsc,
  lam,
  V,
  file = file.path(reference_dir, "chr1.1_toy.rda"),
  compress = "gzip"
)

# Synthetic LD objects (LDsc, lam, V) for chunk 1.2.
lam <- c(1.25, 0.9)
LDsc <- c(1.2, 1.35)
V <- diag(2)
save(
  LDsc,
  lam,
  V,
  file = file.path(reference_dir, "chr1.2_toy.rda"),
  compress = "gzip"
)

# Per-chromosome SNP counts: one entry for chromosome 1, two SNPs per chunk.
nsnps.list <- list("1" = c(2, 2))
nsnps.list.imputed <- list("1" = c(2, 2))
save(
  nsnps.list.imputed,
  nsnps.list,
  file = file.path(reference_dir, "toy_snp_counter.RData"),
  compress = "gzip"
)

# The four synthetic SNP IDs shared by all toy fixtures.
snps.list.imputed.vector <- c("rs1", "rs2", "rs3", "rs4")
save(
  snps.list.imputed.vector,
  file = file.path(reference_dir, "toy_snp_list.RData"),
  compress = "gzip"
)

# Trait 1 summary statistics, keyed to the four toy SNP IDs.
trait1 <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  A1 = c("A", "C", "A", "G"),
  A2 = c("G", "T", "C", "T"),
  CHR = c(1, 1, 1, 1),
  POS = c(101, 102, 201, 202),
  RSID = c("rs1", "rs2", "rs3", "rs4"),
  EffectAllele = c("A", "C", "A", "G"),
  OtherAllele = c("G", "T", "C", "T"),
  N = c(10000, 10000, 10000, 10000),
  Z = c(0.5, -0.2, 0.4, -0.1)
)
write.table(
  trait1,
  file.path(sumstats_dir, "trait1_canonical.tsv"),
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)

# Trait 2 summary statistics: same SNPs and alleles, different N and Z.
trait2 <- data.frame(
  SNP = c("rs1", "rs2", "rs3", "rs4"),
  A1 = c("A", "C", "A", "G"),
  A2 = c("G", "T", "C", "T"),
  CHR = c(1, 1, 1, 1),
  POS = c(101, 102, 201, 202),
  RSID = c("rs1", "rs2", "rs3", "rs4"),
  EffectAllele = c("A", "C", "A", "G"),
  OtherAllele = c("G", "T", "C", "T"),
  N = c(12000, 12000, 12000, 12000),
  Z = c(0.3, -0.4, 0.2, -0.2)
)
write.table(
  trait2,
  file.path(sumstats_dir, "trait2_canonical.tsv"),
  sep = "\t",
  quote = FALSE,
  row.names = FALSE
)
@@ -0,0 +1,2 @@
1 rs1 0 101 A G
1 rs2 0 102 C T
Binary file not shown.
@@ -0,0 +1,2 @@
1 rs3 0 201 A C
1 rs4 0 202 G T
Binary file not shown.
Binary file not shown.
Binary file not shown.
31 changes: 31 additions & 0 deletions data/genomics/homo_sapiens/popgen/sumstats/README.md
@@ -0,0 +1,31 @@
# Toy Population-Genetics Summary Statistics

These files are tiny synthetic GWAS-style summary-statistics tables intended for
module testing. They are generated from fixed constants by the companion HDL
fixture generator at `../hdl/generate_toy_hdl_data.R`.

## Layout

- `trait1_canonical.tsv`: synthetic canonical summary statistics for trait 1
- `trait2_canonical.tsv`: synthetic canonical summary statistics for trait 2

## Regeneration

From the `hdl/` directory:

```bash
Rscript generate_toy_hdl_data.R
```

From the root of the `nf-core/test-datasets` worktree:

```bash
Rscript data/genomics/homo_sapiens/popgen/hdl/generate_toy_hdl_data.R
```

## Notes

These tables are not HDL-specific at the file-format level. They are kept under
`popgen/sumstats/` so they can be reused by modules that consume small
GWAS-style tabular inputs, while the HDL reference panel assets remain grouped
under `popgen/hdl/reference/`.
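
For modules that reuse these tables, a quick structural check can be sketched as
follows. The invariants it tests (ten tab-separated columns per row, `SNP` equal
to `RSID`, `A1` equal to `EffectAllele`, `A2` equal to `OtherAllele`) are
observations about the toy data, not a documented format contract. The sketch
recreates `trait2_canonical.tsv` in a temporary file (`tsv` is a hypothetical
variable name) so that it runs on its own.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Recreate the toy table in a temp file. In the repository, set tsv to
# trait1_canonical.tsv or trait2_canonical.tsv instead.
tsv="$(mktemp)"
{
  printf 'SNP\tA1\tA2\tCHR\tPOS\tRSID\tEffectAllele\tOtherAllele\tN\tZ\n'
  printf 'rs1\tA\tG\t1\t101\trs1\tA\tG\t12000\t0.3\n'
  printf 'rs2\tC\tT\t1\t102\trs2\tC\tT\t12000\t-0.4\n'
  printf 'rs3\tA\tC\t1\t201\trs3\tA\tC\t12000\t0.2\n'
  printf 'rs4\tG\tT\t1\t202\trs4\tG\tT\t12000\t-0.2\n'
} > "$tsv"

# Count data rows whose column count or duplicated fields are inconsistent.
bad_rows="$(awk -F'\t' '
  NR == 1 { next }
  NF != 10 || $1 != $6 || $2 != $7 || $3 != $8 { bad++ }
  END { print bad + 0 }
' "$tsv")"

# Number of data rows, excluding the header line.
data_rows="$(awk 'END { print NR - 1 }' "$tsv")"

echo "data rows: $data_rows, inconsistent rows: $bad_rows"
rm -f "$tsv"
```

For the toy tables this reports four data rows and no inconsistent rows.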
@@ -0,0 +1,5 @@
SNP A1 A2 CHR POS RSID EffectAllele OtherAllele N Z
rs1 A G 1 101 rs1 A G 10000 0.5
rs2 C T 1 102 rs2 C T 10000 -0.2
rs3 A C 1 201 rs3 A C 10000 0.4
rs4 G T 1 202 rs4 G T 10000 -0.1
@@ -0,0 +1,5 @@
SNP A1 A2 CHR POS RSID EffectAllele OtherAllele N Z
rs1 A G 1 101 rs1 A G 12000 0.3
rs2 C T 1 102 rs2 C T 12000 -0.4
rs3 A C 1 201 rs3 A C 12000 0.2
rs4 G T 1 202 rs4 G T 12000 -0.2