|
1 | | -<div align='center'> |
| 1 | +# Tracebound - Asynchronous Web Phrase Scanner |
2 | 2 |
|
3 | | -<img src="https://fled.dev/assets/tracebound-banner.png" alt="Tracebound" width="700px"> |
4 | | -<br><br> |
5 | | -<p>Sitemap-based web crawler that efficiently searches for specific phrases across a website and logs results.</p> |
| 3 | +## Overview |
| 4 | +Tracebound is an asynchronous web scanner that searches a domain's pages for a given phrase. It discovers URLs through the site's sitemap, fetches them concurrently with `aiohttp`, and writes matches to TXT, JSON, or CSV. Structured logging and retry logic with exponential backoff handle transient network errors. |
6 | 5 |
|
7 | | -<h4><a href="https://github.com/fled-dev/Tracebound/blob/master/README.md"> Documentation </a> <span> · </span> <a href="https://github.com/fled-dev/Tracebound/issues"> Report Bug </a> <span> · </span> <a href="https://github.com/fled-dev/Tracebound/issues"> Request Feature </a> </h4> |
| 6 | +## Features |
| 7 | +### 🚀 Performance & Speed Enhancements |
| 8 | +- **Asynchronous networking** using `aiohttp` to eliminate blocking calls |
| 9 | +- **Multi-threaded URL scanning** for parallel execution |
| 10 | +- **Connection pooling** to reduce network latency |
8 | 11 |
|
| 12 | +### ⚙️ Robust Error Handling & Logging |
| 13 | +- **Centralized error handling** with structured logging |
| 14 | +- **Retry logic with exponential backoff** for transient network errors |
| 15 | +- **Logging verbosity control** (silent mode, minimal logs, debug mode) |
9 | 16 |
|
10 | | -</div> |
| 17 | +### 🔍 Advanced Web Scraping |
| 18 | +- **Recursive sitemap parsing** to discover hidden URLs |
| 19 | +- **Structured data extraction** for better accuracy |
| 20 | +- **Optimized HTML parsing** using `BeautifulSoup` |
11 | 21 |
|
12 | | -# Table of Contents |
13 | | -- [About the Project](#about-the-project) |
14 | | -- [Roadmap](#roadmap) |
15 | | -- [License](#license) |
16 | | -- [Contact](#contact) |
| 22 | +### 🛡️ Security & Compliance |
| 23 | +- **Secure request headers** to minimize detection by anti-scraping mechanisms |
| 24 | +- **Rate-limiting & request throttling** to prevent being blocked |
| 25 | +- **Defensive coding** with safe XML parsing using `defusedxml` |
17 | 26 |
|
| 27 | +### 📊 Efficient Data Storage & Output Options |
| 28 | +- **Supports multiple output formats**: TXT, JSON, CSV |
| 29 | +- **Batch file I/O operations** to reduce the number of disk writes |
| 30 | +- **Database storage support (future release)** |
18 | 31 |
|
19 | | -## About the Project |
20 | | -Sitemaps are fantastic resources, but manually combing through them is tedious. I wanted a quick way to find specific content patterns within a website's structure. Tracebound does just that. It leverages the sitemap to crawl all linked pages efficiently, hunting for any phrase or keyword I specify. It's been a fun little experiment in focused web crawling! |
| 32 | +### 🛠️ Configurability & Ease of Use |
| 33 | +- **Command-line arguments** for flexible scanning options |
| 34 | +- **Real-time progress tracking** with a progress bar (`tqdm`) |
| 35 | +- **Automatic domain protocol detection** |
21 | 36 |
|
22 | | -### Screenshots |
23 | | -<a href=""><img src="https://fled.dev/assets/tracebound-demo.png" alt='image' width='700px'></a> |
| 37 | +## Installation |
| 38 | +### Prerequisites |
| 39 | +Tracebound requires Python 3.7 or later. Install the dependencies with: |
| 40 | +```sh |
| 41 | +pip install -r requirements.txt |
| 42 | +``` |
| 43 | + |
| 44 | +### Required Dependencies |
| 45 | +- `aiohttp` (Asynchronous HTTP requests) |
| 46 | +- `async_timeout` (Timeout management for async requests) |
| 47 | +- `beautifulsoup4` (HTML parsing) |
| 48 | +- `defusedxml` (Secure XML parsing) |
| 49 | +- `tqdm` (Progress tracking) |
| 50 | +- `pyfiglet` (Fancy ASCII banner, optional) |
24 | 51 |
|
25 | | -## Getting Started |
| 52 | +## Usage |
| 53 | +### Basic Command |
| 54 | +```sh |
| 55 | +python tracebound.py <domain> <phrase> |
| 56 | +``` |
26 | 57 |
|
27 | | -### Installation |
28 | | -```bash |
29 | | -pip install requirements.txt |
| 58 | +### Example |
| 59 | +```sh |
| 60 | +python tracebound.py example.com "contact us" |
30 | 61 | ``` |
| 62 | +This scans `example.com` for occurrences of "contact us" across all pages listed in the site's sitemap. |
31 | 63 |
|
32 | | -### Run Locally |
33 | | -```bash |
34 | | -python3 main.py |
| 64 | +### Advanced Options |
| 65 | +| Option | Description | |
| 66 | +|--------|-------------| |
| 67 | +| `--regex` | Enable regex pattern matching instead of simple text search | |
| 68 | +| `--concurrency N` | Set the number of concurrent requests (default: 10) | |
| 69 | +| `--timeout N` | Set request timeout in seconds (default: 10) | |
| 70 | +| `--output txt/json/csv` | Specify the output format (default: TXT) | |
| 71 | +| `--debug` | Enable verbose logging for debugging | |
| 72 | + |
| 73 | +Example with advanced options: |
| 74 | +```sh |
| 75 | +python tracebound.py example.com "data privacy" --regex --concurrency 20 --output json |
35 | 76 | ``` |
36 | 77 |
|
37 | | -## Roadmap |
38 | | -* [ ] Regular Expression Support |
39 | | -* [ ] Fuzzy Search |
40 | | -* [ ] CSV Export |
41 | | -* [ ] Web Interface |
| 78 | +## How It Works |
| 79 | +1. **Domain Validation**: Ensures a valid URL and auto-detects HTTP/HTTPS. |
| 80 | +2. **Sitemap Discovery**: Extracts all indexed URLs via `/sitemap.xml`. |
| 81 | +3. **Asynchronous Scanning**: Fetches and scans pages concurrently. |
| 82 | +4. **Phrase Matching**: Performs case-insensitive or regex-based search. |
| 83 | +5. **Logging & Output**: Saves results in TXT, JSON, or CSV format. |
| 84 | + |
| 85 | +## Contribution |
| 86 | +Want to contribute? Open a pull request! Feel free to improve performance, add new features, or fix bugs. |
42 | 87 |
|
43 | 88 | ## License |
44 | | -This project is licensed under the MIT License. This means you are free to use, copy, modify, and distribute the software for any purpose, even commercial ones, as long as you include the copyright notice and license information. |
| 89 | +This project is licensed under the MIT License. |
| 90 | + |
| 91 | +## Author |
| 92 | +Tracebound is developed and maintained by fled-dev. |
45 | 93 |
|
46 | | -## Contact |
47 | | -Paul - - mail@fled.dev |
|
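
Step 4 of "How It Works" in the new README describes case-insensitive or regex-based phrase matching. The following is a minimal sketch of how that step could look; it is not taken from `tracebound.py`. The function name `find_matches` and the `use_regex` parameter are hypothetical, and keeping the search case-insensitive in regex mode is an assumption.

```python
import re
from typing import List


def find_matches(text: str, phrase: str, use_regex: bool = False) -> List[str]:
    """Return every match of `phrase` in `text`, ignoring case.

    Hypothetical helper: with use_regex=False the phrase is escaped so that
    characters such as "." or "?" are matched literally.
    """
    pattern = phrase if use_regex else re.escape(phrase)
    # Assumption: regex mode is also case-insensitive.
    return [m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


page = "Contact us today. To CONTACT US by phone, call 555-0100."
print(find_matches(page, "contact us"))                    # ['Contact us', 'CONTACT US']
print(find_matches(page, r"\d{3}-\d{4}", use_regex=True))  # ['555-0100']
```

Escaping the phrase in plain mode means a search for `a.b` only matches the literal text `a.b`, while the same input with `use_regex=True` would also match `axb`.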