LOEUF compatibility with gnomAD 4.1.1 #837

Open
likhitha-surapaneni wants to merge 1 commit into Ensembl:postreleasefix/116 from likhitha-surapaneni:update/LOEUF

Conversation

@likhitha-surapaneni
Contributor

@jamie-m-a jamie-m-a self-requested a review May 7, 2026 12:01
@jamie-m-a jamie-m-a self-assigned this May 7, 2026
Contributor

@jamie-m-a jamie-m-a left a comment

There are some issues with the file around RefSeq transcripts, and the current sort drops a line.

Comment thread LOEUF.pm
These files can be tabix-processed by:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (head -n 1 && tail -n +2 | sort -t$'\t' -k 9,9 -k 10,10n ) > loeuf_temp.tsv
sed '1s/.*/#&/' loeuf_temp.tsv > loeuf_dataset.tsv
bgzip loeuf_dataset.tsv
Contributor

OK, first up: the sort and zip can be combined, and the current sort is losing a line. This is better and a bit faster:
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (sed -u 1q; sort -k 9,9 -k 10,10n) | sed '1s/.*/#&/' | bgzip -c - > loeuf_dataset.tsv.bgz
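
For anyone following along, here is a minimal sketch of the same header-preserving sort on a tiny synthetic TSV. When `head -n 1` reads from a pipe it may buffer more than one line, so a following `tail` never sees those rows; the shell builtin `read` consumes only the header line. The file name, column layout and values below are made up for illustration (the real file sorts on columns 9 and 10).

```shell
# Synthetic stand-in for the constraint file (hypothetical data, 3 columns).
tmp=$(mktemp -d)
printf 'id\tchrom\tstart\n' > "$tmp/demo.tsv"
printf 'a\tchr2\t50\nb\tchr1\t200\nc\tchr1\t30\n' >> "$tmp/demo.tsv"

tab=$(printf '\t')

# `read` takes exactly one line (the header); sort then receives every data row.
# The header is re-emitted with a leading '#' so tabix can treat it as a comment.
sorted=$(
  {
    IFS= read -r header
    printf '#%s\n' "$header"
    sort -t "$tab" -k 2,2 -k 3,3n
  } < "$tmp/demo.tsv"
)

printf '%s\n' "$sorted"
```

All three data rows should survive, ordered by chromosome and then numerically by start.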

Comment thread LOEUF.pm
zcat gnomad.v4.1.1.constraint_metrics.tsv.bgz | (head -n 1 && tail -n +2 | sort -t$'\t' -k 9,9 -k 10,10n ) > loeuf_temp.tsv
sed '1s/.*/#&/' loeuf_temp.tsv > loeuf_dataset.tsv
bgzip loeuf_dataset.tsv
tabix -f -s 9 -b 10 -e 11 loeuf_dataset.tsv.gz
Contributor

However, when you try to tabix either file you run into a bunch of errors, because not every row has chromosome and coordinate data. Basically only the ENSG rows have that; the RefSeq rows have NA, which breaks tabix. We either have to skip the RefSeq entries (grep -v NM* on transcript_id) or insert the correct coordinates for them.
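
A rough sketch of the "skip" option, filtering on the chromosome column rather than on the transcript ID prefix. As a regular expression, `NM*` matches an `N` followed by zero or more `M`s, so it would also hit any line that merely contains an `N`; testing the coordinate column for `NA` sidesteps that. The data and the 4-column layout here are invented; for the real file the chromosome column would be 9, going by the tabix command above.

```shell
# Synthetic input (hypothetical rows): the RefSeq row has NA coordinates.
tmp=$(mktemp -d)
printf '#transcript\tchrom\tstart\tend\n' > "$tmp/in.tsv"
printf 'ENST0001\tchr1\t100\t200\n' >> "$tmp/in.tsv"
printf 'NM_0001\tNA\tNA\tNA\n' >> "$tmp/in.tsv"
printf 'ENST0002\tchr2\t5\t50\n' >> "$tmp/in.tsv"

chrom_col=2   # assumption for this demo; 9 in the gnomAD file per the tabix call

# Keep the header plus every row whose chromosome field is not the literal NA.
awk -F '\t' -v c="$chrom_col" 'NR == 1 || $c != "NA"' "$tmp/in.tsv" > "$tmp/with_coords.tsv"

# Set the dropped rows aside so they can be reviewed or given coordinates later.
awk -F '\t' -v c="$chrom_col" 'NR > 1 && $c == "NA"' "$tmp/in.tsv" > "$tmp/no_coords.tsv"

kept=$(wc -l < "$tmp/with_coords.tsv" | tr -d ' ')
dropped=$(wc -l < "$tmp/no_coords.tsv" | tr -d ' ')
echo "kept lines (including header): $kept, dropped rows: $dropped"
```

Writing the dropped rows to their own file keeps the second option (inserting coordinates) open.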

Contributor Author

> However, when you try to tabix either file you run into a bunch of errors, because not every row has chromosome and coordinate data. Basically only the ENSG rows have that; the RefSeq rows have NA, which breaks tabix. We either have to skip the RefSeq entries (grep -v NM* on transcript_id) or insert the correct coordinates for them.

Hi @jamie-m-a, I compared the RefSeq transcripts against the Ensembl IDs to see whether any are unique, and it looks like about 282 transcripts are unique to RefSeq.
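
One possible way to reproduce this kind of check, not necessarily how the number above was obtained: treat a RefSeq row as unique when its gene has no Ensembl transcript row, and compare the two gene lists with `comm`. The two-column layout, gene names and the `ENST` prefix test are assumptions for this toy example.

```shell
# Synthetic input (hypothetical): column 1 = gene, column 2 = transcript ID.
tmp=$(mktemp -d)
printf 'gene\ttranscript\n' > "$tmp/in.tsv"
printf 'GENE1\tENST0001\nGENE1\tNM_0001\nGENE2\tNM_0002\nGENE3\tENST0003\n' >> "$tmp/in.tsv"

# Genes that have at least one Ensembl transcript row.
awk -F '\t' 'NR > 1 && $2 ~ /^ENST/ {print $1}' "$tmp/in.tsv" | sort -u > "$tmp/ensembl_genes.txt"

# Genes that have at least one non-Ensembl (here: RefSeq) transcript row.
awk -F '\t' 'NR > 1 && $2 !~ /^ENST/ {print $1}' "$tmp/in.tsv" | sort -u > "$tmp/refseq_genes.txt"

# comm -23 prints lines found only in the first (sorted) file.
refseq_only=$(comm -23 "$tmp/refseq_genes.txt" "$tmp/ensembl_genes.txt")
echo "genes with RefSeq rows only: $refseq_only"
```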

2 participants