A Python tool for searching specific patterns in DNA/RNA sequences with support for mismatches, variable gap sizes, and comprehensive statistical analysis. Designed for promoter analysis, cis-regulatory element discovery, and transcription factor binding site identification.
| Feature | Description |
|---|---|
| Pattern Matching | Search for multiple sub-patterns with IUPAC wildcard support (n matches any nucleotide) |
| Flexible Gap Analysis | Systematically test spacer sizes (0-3 bp, configurable) between composite motifs |
| Mismatch Tolerance | Find matches with 0-2 mismatches (configurable), with position-specific tracking |
| Visual Validation | Excel output with red-highlighted mismatches for manual curation |
| TSS-Relative Positioning | Coordinates relative to transcription start site for functional interpretation |
| Batch Processing | Run multiple search tasks with different parameters and merge results |
| Statistical Analysis | 4D analysis (Pattern × Mismatch × Gap × Sequence) with publication-quality figures |
| Publication-Ready Output | Nature-quality visualizations (300 DPI, colorblind-friendly) |
# 1. Clone repository
git clone https://github.com/qiandemoni/HSE_sequence_finder.git
cd HSE_sequence_finder
# 2. Install dependencies
pip install -r requirements.txt
# 3. Run example search
python3 sequence_finder.py examples/sample_input.fasta output.xlsx \
--patterns "nGAAn" "nTTCn" "nGAAn" \
--gap-sizes 0 1 2 \
--max-mismatches 2
# 4. Analyze results
python3 analyze_results.py output.xlsx analysis.xlsx
# 5. View results
# - output.xlsx: Search results with red-highlighted mismatches
# - analysis.xlsx: Statistical analysis
# - output_charts/: Publication-quality figures

- Python 3.7 or higher
- pip package manager

pip install -r requirements.txt

biopython>=1.79 # FASTA parsing
openpyxl>=3.0.0 # Excel output
pandas>=1.3.0 # Analysis module
matplotlib>=3.3.0 # Visualization
numpy>=1.21.0 # Statistical analysis
There are three main tools:
| Tool | Purpose | Input | Output |
|---|---|---|---|
| sequence_finder.py | Single pattern search | FASTA + patterns | Excel with matches |
| batch_sequence_finder.py | Multiple searches | FASTA + JSON config | Merged Excel results |
| analyze_results.py | Statistical analysis | Excel results | Analysis + figures |
Search for a single pattern combination across all sequences.
python3 sequence_finder.py <input.fasta> <output.xlsx> \
--patterns "nGAAn" "nTTCn" "nGAAn"

| Argument | Required | Default | Description |
|---|---|---|---|
| input_fasta | ✅ | - | Path to input FASTA file |
| output_excel | ✅ | - | Path to output Excel file |
| --patterns | ✅ | - | Space-separated pattern sequences (use n as wildcard) |
| --gap-sizes | ❌ | 0 1 2 3 | Space-separated gap sizes to test (in bp) |
| --max-mismatches | ❌ | 2 | Maximum allowed mismatches per match |
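The argument table above can be mirrored with Python's standard `argparse` module. The following is only a sketch of how such an interface could be declared, not the tool's actual source: the function name `build_parser` is hypothetical, and the defaults are copied from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented arguments (not the tool's real code)."""
    parser = argparse.ArgumentParser(
        description="Search FASTA sequences for composite patterns."
    )
    parser.add_argument("input_fasta", help="Path to input FASTA file")
    parser.add_argument("output_excel", help="Path to output Excel file")
    # One or more sub-patterns; 'n' acts as a wildcard.
    parser.add_argument("--patterns", nargs="+", required=True,
                        help="Space-separated pattern sequences")
    # Defaults taken from the argument table.
    parser.add_argument("--gap-sizes", nargs="+", type=int, default=[0, 1, 2, 3],
                        help="Space-separated gap sizes to test (in bp)")
    parser.add_argument("--max-mismatches", type=int, default=2,
                        help="Maximum allowed mismatches per match")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["input.fasta", "output.xlsx",
     "--patterns", "nGAAn", "nTTCn", "nGAAn",
     "--gap-sizes", "0", "1"]
)
print(args.patterns, args.gap_sizes, args.max_mismatches)
```

Because `--gap-sizes` uses `nargs="+"` with `type=int`, the values arrive as a list of integers, while `--max-mismatches` falls back to 2 when omitted.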
Example 1: Default settings (0-3 gaps, up to 2 mismatches)
python3 sequence_finder.py examples/sample_input.fasta output.xlsx \
--patterns "nGAAn" "nTTCn" "nGAAn"

Example 2: Strict matching (at most 1 mismatch, gaps 0-1 bp)
python3 sequence_finder.py input.fasta output.xlsx \
--patterns "nGAAn" "nTTCn" "nGAAn" \
--gap-sizes 0 1 \
--max-mismatches 1

Example 3: Custom patterns (longer motifs, no wildcards)
python3 sequence_finder.py input.fasta output.xlsx \
--patterns "ACGTACGT" "TGCATGCA" \
--gap-sizes 0 1 2

Run multiple pattern searches with different parameters and merge results.
python3 batch_sequence_finder.py <config.json>

Create a JSON file (e.g., my_config.json):
{
"input_fasta": "examples/sample_input.fasta",
"output_excel": "batch_results.xlsx",
"output_mode": "single",
"tasks": [
{
"name": "Task1_GAA_TTC_pattern",
"patterns": ["nGAAn", "nTTCn", "nGAAn"],
"gap_sizes": [0, 1, 2],
"max_mismatches": 2
},
{
"name": "Task2_strict_matching",
"patterns": ["nGAAn", "nTTCn", "nGAAn"],
"gap_sizes": [0, 1],
"max_mismatches": 1
}
]
}

Top-level fields:
- input_fasta (required): Path to input FASTA file
- output_excel (required): Path to output Excel file
- output_mode (optional): "single" or "multi" (default: "single")
  - "single": All results in one sheet with Task Name column
  - "multi": Separate sheet per task
- tasks (required): Array of task definitions

Task fields:
- name (optional): Task identifier (default: "Task_1", "Task_2", etc.)
- patterns (required): Array of pattern strings
- gap_sizes (optional): Array of gap sizes to test (default: [0, 1, 2])
- max_mismatches (optional): Maximum allowed mismatches (default: 2)
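To check a config file before a long batch run, the field rules above can be expressed as a short validation step. This is a minimal sketch using only the standard library. `load_batch_config` is a hypothetical helper, not part of the tool, and the defaults it fills in are the ones listed above.

```python
import json


def load_batch_config(text: str) -> dict:
    """Parse a batch config and fill in the documented defaults (illustrative helper)."""
    config = json.loads(text)
    # Required top-level fields.
    for key in ("input_fasta", "output_excel", "tasks"):
        if key not in config:
            raise ValueError(f"missing required field: {key}")
    config.setdefault("output_mode", "single")
    if config["output_mode"] not in ("single", "multi"):
        raise ValueError('output_mode must be "single" or "multi"')
    # Per-task defaults; only "patterns" is required.
    for index, task in enumerate(config["tasks"], start=1):
        if "patterns" not in task:
            raise ValueError(f"task {index} is missing required field: patterns")
        task.setdefault("name", f"Task_{index}")
        task.setdefault("gap_sizes", [0, 1, 2])
        task.setdefault("max_mismatches", 2)
    return config


example = """
{
  "input_fasta": "examples/sample_input.fasta",
  "output_excel": "batch_results.xlsx",
  "tasks": [{"patterns": ["nGAAn", "nTTCn", "nGAAn"]}]
}
"""
cfg = load_batch_config(example)
print(cfg["output_mode"], cfg["tasks"][0])
```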
python3 batch_sequence_finder.py examples/config_example.json

Analyze search results to generate statistics and publication-quality figures.
python3 analyze_results.py <input.xlsx> <output_analysis.xlsx>

The analysis generates 4 analysis sheets + 3 visualization charts:
- Summary statistics (total matches, unique patterns, unique sequences)
- Matches by sequence ID (with percentages)
- Pattern × Sequence cross-tabulation
- Mismatch × Sequence distribution
- Gap × Sequence distribution
- 4D Analysis: Pattern × Mismatch × Gap × Sequence (complete parameter interaction)
- TSS position distribution (binned)
- TSS position by pattern and sequence
- TSS Range Categories (with legend):
- Very Upstream (< -1000 bp)
- Upstream (-1000 to -500 bp)
- Near End (-500 to -100 bp)
- Very Near End (-100 to 0 bp)
- Gap size summary (match counts and percentages)
- Gap preferences by pattern
- Optimal gap per pattern (identifies best spacing)
- Gap × Sequence comparison table
- Mismatch distribution
- Mismatch tolerance by pattern and sequence
- Gap × Mismatch interaction effects
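The TSS range categories and position bins listed above can be illustrated with a short sketch. The function names are hypothetical, and the handling of the boundaries (for example, whether -500 falls into "Upstream" or "Near End") is an assumption, since the list does not say which side is inclusive. The bin size of 100 matches the documented default of --position-bin.

```python
from collections import Counter


def tss_category(tss_position: int) -> str:
    """Map a TSS-relative position to a range label (boundary handling is assumed)."""
    if tss_position < -1000:
        return "Very Upstream"
    if tss_position < -500:
        return "Upstream"
    if tss_position < -100:
        return "Near End"
    return "Very Near End"


def bin_start(tss_position: int, bin_size: int = 100) -> int:
    """Lower edge of the bin containing the position; floor division also handles negatives."""
    return (tss_position // bin_size) * bin_size


positions = [-1500, -750, -250, -40]
category_counts = Counter(tss_category(p) for p in positions)
bin_counts = Counter(bin_start(p) for p in positions)
print(category_counts)
print(bin_counts)
```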
Three publication-ready charts are saved to the output_charts/ folder:
- Gap Size Distribution by Sequence (stacked bar chart)
  - File: output_charts/3_Gap_Size_Distribution_by_Sequence.jpg
- Pattern × Sequence Heatmap (with count annotations)
  - File: output_charts/4_Pattern_Sequence_Heatmap.jpg
- TSS Range Categories by Sequence (grouped bar chart)
  - File: output_charts/7_TSS_Range_Categories_by_Sequence.jpg
Chart features:
- 300 DPI resolution
- Colorblind-friendly palettes
- Arial/Helvetica fonts (Nature standard)
- Clean styling
python3 analyze_results.py <input.xlsx> <output.xlsx> [OPTIONS]

Options:
- --position-bin SIZE: Position bin size for TSS analysis (default: 100)
- --no-charts: Skip chart generation (faster processing)

Example 1: Full analysis with charts
python3 analyze_results.py results.xlsx analyzed.xlsx

Example 2: Custom bin size
python3 analyze_results.py results.xlsx analyzed.xlsx --position-bin 50

Example 3: Skip charts (faster)
python3 analyze_results.py results.xlsx analyzed.xlsx --no-charts

Excel file with 7 columns:
| Column | Description |
|---|---|
| Sequence ID | Identifier from FASTA file |
| Found Subsequence | Matched sequence (red-highlighted mismatches) |
| Start Position | 1-based position in sequence |
| Transcript Start Site Position | Position relative to sequence end (negative = upstream) |
| Mismatch Count | Number of mismatches (0-2) |
| Gap Size | Number of nucleotides between patterns |
| Pattern | Input patterns used (e.g., nGAAn, nTTCn, nGAAn) |
Note: Red-highlighted bases in "Found Subsequence" indicate mismatch positions for easy manual validation.
Single Sheet Mode (output_mode: "single"): 8 columns (adds "Task Name" column)
Multi-Sheet Mode (output_mode: "multi"): Separate sheets per task (7 columns each)
- Pattern Assembly: Patterns are concatenated with the specified gap size between them
  - Example: nGAAn + (gap=2) + nTTCn + (gap=2) + nGAAn → nGAAnnnnTTCnnnnGAAn
- Sliding Window: The combined pattern slides across the entire sequence
  - Evaluates every possible starting position
- Mismatch Calculation:
  - Wildcards (n) match any nucleotide (do NOT count as mismatches)
  - Specific bases (A, T, C, G) must match exactly or count as mismatches
  - Tracks mismatch positions for visualization
- Gap Testing: Process repeats for each gap size (e.g., 0, 1, 2 bp)
- TSS Coordinates: Start positions converted to TSS-relative coordinates
  - Formula: TSS Position = Start Position - Sequence Length
  - Negative values indicate upstream positions
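The steps above can be sketched in a few lines of plain Python. This is an illustration of the described procedure, not the tool's actual implementation: `assemble_pattern` and `find_matches` are hypothetical names, only the n wildcard is handled (no other IUPAC codes), and gaps are modelled as runs of n.

```python
def assemble_pattern(patterns, gap_size):
    """Join sub-patterns with a spacer of `gap_size` wildcards."""
    return ("n" * gap_size).join(patterns)


def find_matches(sequence, patterns, gap_size, max_mismatches):
    """Slide the combined pattern over the sequence and collect tolerated matches."""
    combined = assemble_pattern(patterns, gap_size).upper()
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence) - len(combined) + 1):
        window = sequence[start:start + len(combined)]
        # Wildcards (N) never count; any other differing base is a mismatch.
        # Positions are 0-based offsets within the matched window.
        mismatches = [
            i for i, (p, base) in enumerate(zip(combined, window))
            if p != "N" and p != base
        ]
        if len(mismatches) <= max_mismatches:
            hits.append({
                "start": start + 1,                   # 1-based start position
                "tss": start + 1 - len(sequence),     # Start Position - Sequence Length
                "subsequence": window,
                "mismatch_positions": mismatches,
                "gap_size": gap_size,
            })
    return hits


patterns = ["nGAAn", "nTTCn", "nGAAn"]
sequence = "CCAGAATCTTCGAGAACGG"
all_hits = []
for gap in (0, 1, 2):  # repeat the search for each gap size
    all_hits.extend(find_matches(sequence, patterns, gap, max_mismatches=0))
for hit in all_hits:
    print(hit)
```

Recording mismatch offsets rather than only a count is what makes it possible to highlight the affected bases later, as the Excel output does.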
- Gap flexibility: Transcription factor dimers/heterodimers exhibit spacing tolerance (±1-2 bp typical)
- Mismatch tolerance: Accounts for natural sequence variation and imperfect binding
- TSS positioning: Enables functional interpretation (proximal vs. distal elements)
- Composite element search: Many regulatory modules require multiple motifs in proximity
If you use this software in your research, please cite:
@software{sequence_finder,
title = {DNA/RNA Sequence Pattern Finder},
author = {Qiande, Moni},
year = {2025},
url = {https://github.com/qiandemoni/HSE_sequence_finder},
version = {1.0.0},
license = {Apache-2.0}
}

Also see: CITATION.cff for machine-readable citation information.
For methodology details, see: docs/Methodology.md
We welcome contributions! Please see CONTRIBUTING.md for:
- How to report issues
- How to suggest features
- Code style guidelines
- Pull request process
1. "ModuleNotFoundError: No module named 'Bio'"
# Solution: Install biopython (quote the requirement so the shell does not treat > as a redirect)
pip install "biopython>=1.79"

2. "Excel file output is empty"
- Check that your FASTA file is properly formatted
- Verify patterns are correctly specified (use n for wildcards)
- Try increasing --max-mismatches or widening the --gap-sizes range
3. "Charts not generating"
# Solution: Install matplotlib (quote the requirement so the shell does not treat > as a redirect)
pip install "matplotlib>=3.3.0"
# Or skip charts:
python3 analyze_results.py input.xlsx output.xlsx --no-charts

4. "No matches found"
- Your patterns may be too strict
- Try: Increase --max-mismatches to 2 or 3
- Try: Expand --gap-sizes range (e.g., 0 1 2 3 4 5)
- Verify your patterns are biologically plausible
5. "Sequence too short for pattern"
- Ensure your sequences are at least this long: total_pattern_length + (num_patterns - 1) * max_gap_size
- Example: 3 patterns of 5 bp each with max gap = 3 requires sequences ≥ 21 bp (15 + 2 × 3)
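The length requirement can be checked before running a search. The helper below is a hypothetical convenience function, not part of the tool. It assumes, consistent with the 21 bp example, that the pattern-length term is the summed length of all sub-patterns.

```python
def min_sequence_length(patterns, gap_sizes):
    """Shortest sequence that can hold the combined pattern at the largest gap size."""
    total_pattern_length = sum(len(p) for p in patterns)
    return total_pattern_length + (len(patterns) - 1) * max(gap_sizes)


required = min_sequence_length(["nGAAn", "nTTCn", "nGAAn"], [0, 1, 2, 3])
print(required)  # 15 + 2 * 3 = 21
```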
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: mqiande@ufl.edu
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Copyright 2025 Qiande, Moni
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
- Built using Biopython for FASTA parsing
- Visualization powered by Matplotlib
- Excel output via openpyxl
Last Updated: October 2025
Version: 1.0.0