- Guarantees completeness: Identifies every match within specified parameters
- Deterministic & reproducible: No probabilistic elements or training data required
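- Precedent: Same exhaustive, position-by-position scanning strategy as FIMO (MEME Suite) and other standard motif scanners (FIMO scores each window against a position weight matrix rather than counting mismatches)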

Implementation (sequence_finder.py:130):

    # Exhaustive search with Hamming distance calculation
    for i in range(len(sequence) - pattern_length + 1):
        subseq = sequence[i:i + pattern_length]
        mismatch_count, mismatch_positions = count_mismatches(subseq, full_pattern)

Degenerate Base Support (sequence_finder.py:76-78):

    if pat_char.lower() == 'n':
        # Wildcard always matches - does NOT count as mismatch
        continue

Why critical:
- Reflects real TF binding site flexibility (positional tolerance)
- Uses IUPAC degenerate base notation standards
- 2 mismatches / 15bp = 13.3% tolerance (conservative vs. typical 15-20%)
- Red-highlighted mismatches in Excel enable manual validation of biological plausibility
Position Tracking (sequence_finder.py:58-83):
- Tracks WHICH bases mismatch for manual validation
- Distinguishes critical from peripheral mismatches
- Allows researchers to assess biological plausibility
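
The scan, wildcard handling, and position tracking described above can be sketched end to end as follows. This is a minimal illustration, not the contents of sequence_finder.py: the body of `count_mismatches` is inferred from the excerpted call and the wildcard snippet, case-insensitive comparison is an assumption, and `find_matches` with its `max_mismatches` parameter is a hypothetical helper introduced here.

```python
from typing import List, Tuple


def count_mismatches(subseq: str, pattern: str) -> Tuple[int, List[int]]:
    """Return the mismatch count and the 0-based mismatch positions.

    Assumed behaviour: comparison is case-insensitive, and 'n' in the
    pattern is a wildcard that never counts as a mismatch.
    """
    positions = []
    for idx, (seq_char, pat_char) in enumerate(zip(subseq, pattern)):
        if pat_char.lower() == 'n':
            continue  # wildcard always matches
        if seq_char.lower() != pat_char.lower():
            positions.append(idx)
    return len(positions), positions


def find_matches(sequence: str, pattern: str, max_mismatches: int = 2):
    """Hypothetical helper: test every window of len(pattern) bases."""
    hits = []
    pattern_length = len(pattern)
    for i in range(len(sequence) - pattern_length + 1):
        subseq = sequence[i:i + pattern_length]
        count, positions = count_mismatches(subseq, pattern)
        if count <= max_mismatches:
            hits.append((i, subseq, count, positions))
    return hits


# Toy input: one exact match at index 2, one single-mismatch match at index 10.
hits = find_matches("ttACGTGGcaACGAGG", "ACGTnG", max_mismatches=1)
for start, subseq, count, positions in hits:
    print(start, subseq, count, positions)
```

Because every window is visited and the comparison has no random component, repeated runs on the same input return the same hit list. The recorded positions are what a downstream report would need in order to highlight individual mismatched bases.
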

Implementation (sequence_finder.py:86-98):

    def build_pattern_with_gaps(patterns: List[str], gap_size: int) -> str:
        """Inserts uniform gaps between all sub-patterns"""
        gap = 'n' * gap_size
        return gap.join(patterns)

Biological rationale:
- TF dimers/heterodimers have flexible spacing requirements (±1-2 bp typical)
- DNA helical geometry accommodates slight spacing variations
- Evolutionary drift can shift spacing while maintaining function
- Systematically tests 0-3 bp gaps between motifs
Advantage: Most database tools do not systematically test spacer variations; this implementation does it automatically.
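
As a rough illustration of the systematic gap test, the sketch below builds one composite pattern per gap size from 0 to 3 bp. `build_pattern_with_gaps` follows the excerpt above; `patterns_for_gap_range`, its `max_gap` parameter, and the two example motifs are hypothetical and chosen only for demonstration.

```python
from typing import Dict, List


def build_pattern_with_gaps(patterns: List[str], gap_size: int) -> str:
    """Join sub-patterns with a run of 'n' wildcards of length gap_size."""
    return ('n' * gap_size).join(patterns)


def patterns_for_gap_range(patterns: List[str], max_gap: int = 3) -> Dict[int, str]:
    """Hypothetical helper: one composite pattern per gap size 0..max_gap."""
    return {gap: build_pattern_with_gaps(patterns, gap) for gap in range(max_gap + 1)}


# Example motifs are arbitrary placeholders, not taken from the tool.
composite = patterns_for_gap_range(["CACGTG", "TGAC"])
for gap, pattern in composite.items():
    print(gap, pattern)
```

Each composite pattern can then be scanned like any other pattern. Since the `n` spacer positions are wildcards, only the motif bases contribute to the mismatch count.
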

Implementation (sequence_finder.py:137):

    tss_position = start_pos - seq_length  # Negative = upstream

Why this matters:
- Enables functional interpretation (proximal vs. distal elements)
- Normalizes position across sequences of different lengths
- Standard in promoter literature
- Position-binned analysis identifies regulatory "zones"
Positional Categories:
- Very Upstream: < -1000 bp
- Upstream: -1000 to -500 bp
- Near End: -500 to -100 bp
- Very Near End: -100 to 0 bp
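
The coordinate conversion and the four categories above could be combined roughly as follows. This is a sketch under two assumptions that the source does not state: each input sequence ends at the TSS (which the `start_pos - seq_length` formula implies), and each bin includes its lower (more upstream) boundary, so that -500 falls in "Near End" and -100 in "Very Near End". The function names are hypothetical.

```python
def tss_relative_position(start_pos: int, seq_length: int) -> int:
    """Convert a 0-based start index to a TSS-relative coordinate.

    Assumes the sequence ends at the TSS, so results are <= 0 for
    positions inside the sequence (negative = upstream).
    """
    return start_pos - seq_length


def position_category(tss_position: int) -> str:
    """Assign a positional bin; lower boundaries are inclusive (assumption)."""
    if tss_position < -1000:
        return "Very Upstream"
    if tss_position < -500:
        return "Upstream"
    if tss_position < -100:
        return "Near End"
    return "Very Near End"


# A hit starting at index 1200 of a 1500 bp promoter lies 300 bp upstream.
position = tss_relative_position(1200, 1500)
print(position, position_category(position))
```

Because the coordinate is measured from the sequence end, promoters of different lengths map onto the same scale before binning.
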
Built-In Controls:
- Exact Sequence Reporting: Full subsequence stored, not just position
  - Allows manual BLAST verification
  - Enables structure prediction
- Mismatch Visualization: Red-highlighted bases in Excel
  - Rapid assessment of biological plausibility
  - Identifies systematic artifacts
- Statistical Stratification: Results broken down by mismatch count
  - Can analyze 0-mismatch hits separately (highest confidence)
  - Gradient of stringency for follow-up experiments
- Sequence-Specific Analysis: Per-sequence breakdown
  - Identifies outlier sequences (potential artifacts)
  - Enables species-specific or allele-specific interpretation
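
The stratification and per-sequence breakdown listed above might look like the following sketch. The hit records are invented purely for illustration, and the tuple layout `(sequence_name, tss_position, mismatch_count)` is an assumption rather than the tool's actual output format.

```python
from collections import Counter, defaultdict

# Made-up hit records: (sequence_name, tss_position, mismatch_count)
hits = [
    ("promoter_A", -120, 0),
    ("promoter_A", -640, 2),
    ("promoter_B", -85, 1),
    ("promoter_B", -910, 0),
    ("promoter_B", -1300, 2),
]

# Stratify all hits by mismatch count.
by_mismatch = Counter(mismatches for _, _, mismatches in hits)

# Break the same counts down per sequence to spot outlier sequences.
per_sequence = defaultdict(Counter)
for name, _, mismatches in hits:
    per_sequence[name][mismatches] += 1

# Exact (0-mismatch) hits can be analysed separately as the strictest tier.
exact_hits = [hit for hit in hits if hit[2] == 0]

for mismatches in sorted(by_mismatch):
    print(f"{mismatches} mismatches: {by_mismatch[mismatches]} hits")
for name, counts in per_sequence.items():
    print(name, dict(counts))
```

Keeping the tiers separate lets follow-up experiments start from the exact matches and relax stringency step by step.
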
- ✅ Novel composite element discovery with user-defined motifs
- ✅ Architectural flexibility testing (systematic gap size optimization)
- ✅ Hypothesis-driven searches from ChIP-seq/DAP-seq experiments
- ✅ Spacing variant analysis (e.g., "2bp vs. 3bp spacing preference?")
- ✅ Manual validation workflow (red-highlighted mismatches)
- ✅ Comparative promoter analysis with sequence-stratified statistics
- ✅ Publication-ready statistical analysis and figures
Unique Advantages:
- Not restricted to database motifs - accepts ANY pattern
- Explicitly tests spacing variations (0-3bp gaps, configurable)
- Visual mismatch tracking enables manual curation
- Statistical output designed for manuscript methods sections
- ✅ Tests spacing flexibility (gap analysis)
- ✅ Finds novel composite elements (not database-limited)
- ✅ Statistical comparison across sequences
- ✅ Hypothesis testing capability
- ✅ User-defined motifs (not restricted to TF families)
- ✅ Systematic gap/spacer testing (0-3bp automatic)
- ✅ Visual mismatch validation (red highlighting)
- ✅ Explicit architectural flexibility analysis
- PlantCARE identifies known standard elements
- PlantPAN 4.0 leverages conservation and ChIP-seq evidence across species
- Sequence Finder discovers novel composite architectures and tests spacing hypotheses
| Feature | Sequence Finder | FIMO (MEME) | PlantPAN 4.0 | PlantCARE |
|---|---|---|---|---|
| Exhaustive Search | ✅ | ✅ | ✅ | |
| Gap/Spacer Analysis | ✅ Systematic | ❌ | ❌ | ❌ |
| Mismatch Tracking | ✅ Position-specific | ✅ | | |
| TSS-Relative Coords | ✅ | ❌ | ❌ | |
| Multi-Sequence Stats | ✅ Built-in | ✅ | ❌ | |
| Visual Validation | ✅ Red highlights | ❌ | ❌ | ❌ |
| Custom Motifs | ✅ Any pattern | ✅ User PWM | ❌ Database only | |
| Conservation Analysis | ❌ | ❌ | ✅ CNS (115 species) | ❌ |
| ChIP-seq Integration | ❌ | ❌ | ✅ 18,305 TFs | ❌ |