This repository was archived by the owner on Apr 4, 2026. It is now read-only.

Sequence Finder Methodology: Scientific Validation

Core Algorithmic Strengths

1. Exhaustive Sliding Window Search

  • Guarantees completeness: Identifies every match within specified parameters
  • Deterministic & reproducible: No probabilistic elements or training data required
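  • Precedent: FIMO (MEME Suite) and other standard motif scanners likewise evaluate every window position, although FIMO scores each window against a PWM rather than counting mismatches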

Implementation (sequence_finder.py:130):

# Exhaustive search with Hamming distance calculation
for i in range(len(sequence) - pattern_length + 1):
    subseq = sequence[i:i + pattern_length]
    mismatch_count, mismatch_positions = count_mismatches(subseq, full_pattern)
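
To make the loop above concrete, here is a minimal, self-contained sketch of an exhaustive sliding-window scan. It is an illustration under stated assumptions, not the code in sequence_finder.py: the function name `scan_sequence`, the `max_mismatches` parameter, and the returned tuple layout are all hypothetical, and plain Hamming distance is used here without wildcard handling.

```python
from typing import List, Tuple


def scan_sequence(sequence: str, pattern: str,
                  max_mismatches: int = 2) -> List[Tuple[int, str, int]]:
    """Return (start, window, distance) for every window within the threshold.

    Hypothetical helper: every start position is visited, so no window
    that satisfies the threshold can be missed.
    """
    sequence = sequence.upper()
    pattern = pattern.upper()
    width = len(pattern)
    hits = []
    # range() is empty when the pattern is longer than the sequence
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Hamming distance: number of positions where the bases differ
        distance = sum(1 for a, b in zip(window, pattern) if a != b)
        if distance <= max_mismatches:
            hits.append((start, window, distance))
    return hits


hits = scan_sequence("TTACGTGACGTCAA", "ACGTG", max_mismatches=1)
print(hits)  # [(2, 'ACGTG', 0), (7, 'ACGTC', 1)]
```

Because the scan is a plain loop over every offset, running it twice on the same input yields the same hits, which is the reproducibility property claimed above.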

2. Biologically-Informed Mismatch Handling

Degenerate Base Support (sequence_finder.py:76-78):

if pat_char.lower() == 'n':
    # Wildcard always matches - does NOT count as mismatch
    continue

Why critical:

  • Reflects real TF binding site flexibility (positional tolerance)
  • Follows IUPAC degenerate-base notation, in which N stands for any nucleotide (the wildcard handled in the excerpt above)
  • 2 mismatches / 15bp = 13.3% tolerance (conservative vs. typical 15-20%)
  • Red-highlighted mismatches in Excel enable manual validation of biological plausibility

Position Tracking (sequence_finder.py:58-83):

  • Tracks WHICH bases mismatch for manual validation
  • Distinguishes critical from peripheral mismatches
  • Allows researchers to assess biological plausibility
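
The wildcard rule and the position tracking described above can be sketched together as follows. This is a hedged reconstruction rather than the actual body of `count_mismatches`: the return type, the 0-based position indices, and the case-insensitive comparison are assumptions inferred from the excerpts.

```python
from typing import List, Tuple


def count_mismatches(subseq: str, pattern: str) -> Tuple[int, List[int]]:
    """Count mismatches between a window and a pattern, skipping 'n' wildcards.

    Returns the mismatch count and the 0-based positions that differ
    (the indexing convention is an assumption of this sketch).
    """
    if len(subseq) != len(pattern):
        raise ValueError("subseq and pattern must have the same length")
    positions = []
    for index, (seq_char, pat_char) in enumerate(zip(subseq, pattern)):
        if pat_char.lower() == 'n':
            # Wildcard always matches and is never counted as a mismatch
            continue
        if seq_char.lower() != pat_char.lower():
            positions.append(index)
    return len(positions), positions


count, positions = count_mismatches("ACGTTA", "ACnTGA")
print(count, positions)  # 1 [4]
```

Returning the positions alongside the count is what lets a downstream step highlight the specific mismatched bases for manual review.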

3. Gap/Spacer Analysis: Unique Advantage

Implementation (sequence_finder.py:86-98):

from typing import List

def build_pattern_with_gaps(patterns: List[str], gap_size: int) -> str:
    """Inserts uniform gaps between all sub-patterns"""
    # Each gap is a run of 'n' wildcards, which never count as mismatches
    gap = 'n' * gap_size
    return gap.join(patterns)

Biological rationale:

  • TF dimers/heterodimers have flexible spacing requirements (±1-2 bp typical)
  • DNA helical geometry accommodates slight spacing variations
  • Evolutionary drift can shift spacing while maintaining function
  • Systematically tests 0-3 bp gaps between motifs

Advantage: Most database tools do not systematically test spacer variations; this implementation does it automatically.
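
As a rough illustration of the systematic 0-3 bp sweep, the sketch below builds one composite pattern per gap size. `build_pattern_with_gaps` mirrors the excerpt above; the wrapper `candidate_patterns` and its `max_gap` parameter are hypothetical names introduced only for this example.

```python
from typing import Dict, List


def build_pattern_with_gaps(patterns: List[str], gap_size: int) -> str:
    """Join sub-patterns with a uniform run of 'n' wildcards."""
    gap = 'n' * gap_size
    return gap.join(patterns)


def candidate_patterns(patterns: List[str], max_gap: int = 3) -> Dict[int, str]:
    """Hypothetical helper: map each gap size 0..max_gap to its composite pattern."""
    return {gap: build_pattern_with_gaps(patterns, gap)
            for gap in range(max_gap + 1)}


variants = candidate_patterns(["ACGT", "TGCA"])
for gap, pattern in variants.items():
    print(gap, pattern)
# 0 ACGTTGCA
# 1 ACGTnTGCA
# 2 ACGTnnTGCA
# 3 ACGTnnnTGCA
```

Each composite pattern can then be passed to the same sliding-window search, so comparing hit counts across gap sizes addresses spacing-preference questions directly.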


4. TSS-Relative Positioning

Implementation (sequence_finder.py:137):

tss_position = start_pos - seq_length  # Negative = upstream

Why this matters:

  • Enables functional interpretation (proximal vs. distal elements)
  • Normalizes position across sequences of different lengths
  • Standard in promoter literature
  • Position-binned analysis identifies regulatory "zones"

Positional Categories:

  • Very Upstream: < -1000 bp
  • Upstream: -1000 to -500 bp
  • Near End: -500 to -100 bp
  • Very Near End: -100 to 0 bp
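
The coordinate conversion and the four bins can be sketched as below. Two assumptions are made that the text does not state: each input sequence is taken to end at the TSS, and a position that falls exactly on a boundary (-1000, -500, -100) is assigned to the bin closer to the TSS. The function names are hypothetical.

```python
def tss_relative_position(start_pos: int, seq_length: int) -> int:
    """Convert a 0-based start offset to a TSS-relative coordinate.

    Assumes the sequence ends at the TSS, so results are negative (upstream).
    """
    return start_pos - seq_length


def position_category(tss_position: int) -> str:
    """Assign a TSS-relative position to one of the four bins.

    Boundary handling is an assumption: -1000, -500 and -100 go to the
    bin closer to the TSS.
    """
    if tss_position < -1000:
        return "Very Upstream"
    if tss_position < -500:
        return "Upstream"
    if tss_position < -100:
        return "Near End"
    return "Very Near End"


pos = tss_relative_position(start_pos=1350, seq_length=1500)
print(pos, position_category(pos))  # -150 Near End
```

Because the coordinate is measured back from the end of each sequence, a hit at -150 means the same thing in a 1500 bp promoter as in a 2000 bp one.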

5. Validation Against False Positives

Built-In Controls:

  1. Exact Sequence Reporting: Full subsequence stored, not just position

    • Allows manual BLAST verification
    • Enables structure prediction
  2. Mismatch Visualization: Red-highlighted bases in Excel

    • Rapid assessment of biological plausibility
    • Identifies systematic artifacts
  3. Statistical Stratification: Results broken down by mismatch count

    • Can analyze 0-mismatch hits separately (highest confidence)
    • Gradient of stringency for follow-up experiments
  4. Sequence-Specific Analysis: Per-sequence breakdown

    • Identifies outlier sequences (potential artifacts)
    • Enables species-specific or allele-specific interpretation
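
The stratified and per-sequence breakdowns listed above amount to two tallies over the hit list. The following is a minimal sketch, assuming each hit is a `(sequence_id, start_pos, mismatch_count)` tuple; that layout, the function name `stratify_hits`, and the example data are all invented for illustration.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical hit record: (sequence_id, start_pos, mismatch_count)
Hit = Tuple[str, int, int]


def stratify_hits(hits: List[Hit]) -> Tuple[Counter, Counter]:
    """Tally hits by mismatch count and by source sequence."""
    by_mismatch = Counter(mismatches for _, _, mismatches in hits)
    by_sequence = Counter(seq_id for seq_id, _, _ in hits)
    return by_mismatch, by_sequence


example_hits = [
    ("promA", 120, 0),
    ("promA", 480, 2),
    ("promB", 33, 1),
    ("promB", 910, 2),
    ("promB", 1204, 2),
]
by_mismatch, by_sequence = stratify_hits(example_hits)
print(by_mismatch[0], by_mismatch[1], by_mismatch[2])  # 1 1 3
print(by_sequence["promA"], by_sequence["promB"])      # 2 3
```

Exact matches (`by_mismatch[0]`) can then be reported separately from relaxed matches, and a sequence with an unusually high tally stands out for closer inspection.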

Sequence Finder (This Tool) - Best for:

  1. Novel composite element discovery with user-defined motifs
  2. Architectural flexibility testing (systematic gap size optimization)
  3. Hypothesis-driven searches from ChIP-seq/DAP-seq experiments
  4. Spacing variant analysis (e.g., "2bp vs. 3bp spacing preference?")
  5. Manual validation workflow (red-highlighted mismatches)
  6. Comparative promoter analysis with sequence-stratified statistics
  7. Publication-ready statistical analysis and figures

Unique Advantages:

  • Not restricted to database motifs - accepts ANY pattern
  • Explicitly tests spacing variations (0-3bp gaps, configurable)
  • Visual mismatch tracking enables manual curation
  • Statistical output designed for manuscript methods sections

Critical Advantages Over Existing Tools

vs. PlantCARE:

  • ✅ Tests spacing flexibility (gap analysis)
  • ✅ Finds novel composite elements (not database-limited)
  • ✅ Statistical comparison across sequences
  • ✅ Hypothesis testing capability

vs. PlantPAN 4.0:

  • ✅ User-defined motifs (not restricted to TF families)
  • ✅ Systematic gap/spacer testing (0-3bp automatic)
  • ✅ Visual mismatch validation (red highlighting)
  • ✅ Explicit architectural flexibility analysis

Bottom Line:

  • PlantCARE identifies known standard elements
  • PlantPAN 4.0 leverages conservation and ChIP-seq evidence across species
  • Sequence Finder discovers novel composite architectures and tests spacing hypotheses

Comparison to Established Bioinformatics Tools

Feature Sequence Finder FIMO (MEME) PlantPAN 4.0 PlantCARE
Exhaustive Search ⚠️ PWM-based
Gap/Spacer Analysis ✅ Systematic
Mismatch Tracking ✅ Position-specific ⚠️ PWM scoring ⚠️ Exact match
TSS-Relative Coords ⚠️ Manual
Multi-Sequence Stats ✅ Built-in ⚠️ Separate
Visual Validation ✅ Red highlights
Custom Motifs ✅ Any pattern ✅ User PWM ⚠️ TF families ❌ Database only
Conservation Analysis ✅ CNS (115 species)
ChIP-seq Integration ✅ 18,305 TFs