Round 3 (final challenge): DNA sequence matcher

Problem

Given a FASTA-like file (genome.fasta) containing DNA sequences using only A, C, G, T, find every record whose sequence contains a target pattern and return the positions of each occurrence inside that record.

A FASTA record starts with a > header line containing the record id, followed by one or more lines of sequence data. The file packs many records back-to-back:

>seq_000001
ACGTACGTACGT
ACGTACGTACGT
>seq_000002
TGCATGCATGCA
  • Input: data/genome.fasta (default ~512 MB; scale with --size-mb).
  • Target pattern: b"AGTCCGTA" (recorded in data/truth.json).
  • Output: list[tuple[record_id, list[int]]] in file order, where each list[int] holds the positions of the occurrences inside that record.
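To make the input and output shapes concrete, here is a minimal sketch of a straightforward matcher. It is not the contents of baseline.py. The names `find_all` and `match_records` are hypothetical, and the sketch assumes several things the problem statement leaves open: positions are 0-based offsets into the record's sequence with line breaks removed, a match may span a line break, overlapping occurrences are all reported, record ids are ASCII, and records without a match are omitted from the output.

```python
from typing import Iterable


def find_all(seq: bytes, pattern: bytes) -> list[int]:
    """Return the 0-based start offset of every occurrence, overlaps included."""
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        # Advance by one (not by len(pattern)) so overlapping hits are kept.
        start = seq.find(pattern, start + 1)
    return positions


def match_records(lines: Iterable[bytes], pattern: bytes) -> list[tuple[str, list[int]]]:
    """Scan FASTA-like lines and collect (record_id, positions) in file order."""
    results: list[tuple[str, list[int]]] = []
    record_id = None
    chunks: list[bytes] = []

    def flush() -> None:
        # Join the record's sequence lines so matches may span line breaks
        # (an assumption, see the lead-in).
        if record_id is None:
            return
        positions = find_all(b"".join(chunks), pattern)
        if positions:
            results.append((record_id, positions))

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(b">"):
            flush()  # finish the previous record before starting a new one
            record_id = line[1:].decode("ascii")
            chunks = []
        else:
            chunks.append(line)
    flush()  # the last record has no following header to trigger a flush
    return results
```

A file opened in binary mode can be passed directly, for example `match_records(open("data/genome.fasta", "rb"), b"AGTCCGTA")`. The authoritative conventions for positions and for records without matches are whatever data/truth.json and test_dna.py encode, so check them before relying on the assumptions above.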

You are encouraged to combine techniques from rounds 1 and 2.
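This README does not list which techniques rounds 1 and 2 covered, so the following is only one possible direction and not the intended solution. The sketch memory-maps the file and lets the C-level `find` and `replace` routines do the scanning instead of iterating line by line in Python. The name `match_file` is hypothetical, and the same assumptions apply as for any simple matcher here: `\n` line endings, 0-based positions in the sequence with newlines removed, overlapping matches reported, ASCII record ids, no `>` characters inside sequence data, and a non-empty input file (`mmap` rejects empty files).

```python
import mmap


def match_file(path: str, pattern: bytes) -> list[tuple[str, list[int]]]:
    """Return (record_id, positions) for every record containing pattern."""
    results: list[tuple[str, list[int]]] = []
    with open(path, "rb") as fh, mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        header = mm.find(b">")
        while header != -1:
            eol = mm.find(b"\n", header)
            if eol == -1:
                break  # trailing header line with no sequence data
            nxt = mm.find(b">", eol)
            end = size if nxt == -1 else nxt
            # Copy out one record's sequence and drop the line breaks.
            seq = mm[eol + 1:end].replace(b"\n", b"")
            pos = seq.find(pattern)
            if pos != -1:
                positions = []
                while pos != -1:
                    positions.append(pos)
                    pos = seq.find(pattern, pos + 1)
                record_id = mm[header + 1:eol].decode("ascii")
                results.append((record_id, positions))
            header = nxt
    return results
```

Whether this beats the baseline depends on how baseline.py is written, so measure with the benchmark rather than assuming a speedup.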

Files

| File | Purpose |
| --- | --- |
| baseline.py | Intentionally slow starting point. Don't edit: it is the reference for the comparison. |
| solution.py | Edit this. Starts out delegating to baseline.py; replace with your faster implementation. |
| gen_data.py | Generates the FASTA file and a truth.json with expected matches. |
| test_dna.py | Correctness tests and the pytest-codspeed benchmark. Every test is parametrized over both the baseline and your solution. |

Generate the data

uv run rounds/3_dna/gen_data.py             # default ~512 MB.
uv run rounds/3_dna/gen_data.py --size-mb 100

Or run uv run scripts/setup.py to generate every round's data in one go.

Verify correctness

uv run pytest rounds/3_dna/

Benchmark

Walltime, locally:

uv run pytest --codspeed rounds/3_dna/

Same benchmarks, run through the CodSpeed CLI for low-noise instrumented measurements:

codspeed run --mode walltime -- uv run pytest --codspeed rounds/3_dna/