Commit e9cea75

Update README.md
1 parent 512de98 commit e9cea75

1 file changed

Lines changed: 76 additions & 30 deletions

README.md

@@ -1,47 +1,93 @@
1-
<div align='center'>
1+
# Tracebound - Asynchronous Web Phrase Scanner
22

3-
<img src="https://fled.dev/assets/tracebound-banner.png" alt="Tracebound" width="700px">
4-
<br><br>
5-
<p>Sitemap-based web crawler that efficiently searches for specific phrases across a website and logs results.</p>
3+
## Overview
4+
Tracebound is an asynchronous web scanner that searches the pages of a domain for specific phrases. It combines concurrent requests, structured logging, and retry-based error handling, and is designed to scan thousands of URLs concurrently.
65

7-
<h4><a href="https://github.com/fled-dev/Tracebound/blob/master/README.md"> Documentation </a> <span> · </span> <a href="https://github.com/fled-dev/Tracebound/issues"> Report Bug </a> <span> · </span> <a href="https://github.com/fled-dev/Tracebound/issues"> Request Feature </a> </h4>
6+
## Features
7+
### 🚀 Performance & Speed Enhancements
8+
- **Asynchronous networking** using `aiohttp` to eliminate blocking calls
9+
- **Multi-threaded URL scanning** for parallel execution
10+
- **Connection pooling** to reduce network latency
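
A minimal sketch of bounded concurrent scanning, using only `asyncio` so it runs without network access. The names `fetch_stub` and `scan_all` are hypothetical and not taken from Tracebound's source; a real scanner would presumably replace the stub with a request made through a shared `aiohttp.ClientSession`, which also provides the connection pooling mentioned above.

```python
import asyncio


async def fetch_stub(url: str) -> str:
    """Stand-in for a real HTTP request (hypothetical; performs no network I/O)."""
    await asyncio.sleep(0.01)
    return f"<html>page at {url}</html>"


async def scan_all(urls, fetch=fetch_stub, concurrency=10):
    """Fetch every URL with at most `concurrency` requests in flight at once."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded_fetch(url):
        # The semaphore blocks here once `concurrency` fetches are running.
        async with semaphore:
            return url, await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


if __name__ == "__main__":
    demo_urls = [f"https://example.com/page{i}" for i in range(5)]
    pages = asyncio.run(scan_all(demo_urls, concurrency=2))
    print(len(pages), "pages fetched")
```

The semaphore is what keeps a large sitemap from opening thousands of connections at once; the default of 10 mirrors the documented `--concurrency` default.
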
811

12+
### ⚙️ Robust Error Handling & Logging
13+
- **Centralized error handling** with structured logging
14+
- **Retry logic with exponential backoff** for transient network errors
15+
- **Logging verbosity control** (silent mode, minimal logs, debug mode)
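
Retry with exponential backoff could look roughly like the sketch below. The helper names, the default delays, and the choice of which exceptions count as transient are all assumptions made for illustration; the README does not specify them.

```python
import asyncio
import random


def backoff_delays(retries, base=0.5, factor=2.0, cap=30.0):
    """Return the wait before each retry: base, base*factor, ... capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(retries)]


async def with_retries(operation, retries=3, base=0.5, jitter=0.1,
                       transient=(asyncio.TimeoutError, ConnectionError)):
    """Await `operation()`; on a transient error, wait and try again."""
    delays = backoff_delays(retries, base)
    for attempt in range(retries + 1):
        try:
            return await operation()
        except transient:
            if attempt == retries:
                raise  # out of retries: surface the last error
            # A little random jitter keeps parallel workers from retrying in lockstep.
            await asyncio.sleep(delays[attempt] + random.uniform(0, jitter))
```

With the defaults, a failing request is retried after roughly 0.5, 1, and 2 seconds before the error is raised to the caller.
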
916

10-
</div>
17+
### 🔍 Advanced Web Scraping
18+
- **Recursive sitemap parsing** to discover hidden URLs
19+
- **Structured data extraction** for better accuracy
20+
- **Optimized HTML parsing** using `BeautifulSoup`
1121

12-
# Table of Contents
13-
- [About the Project](#about-the-project)
14-
- [Roadmap](#roadmap)
15-
- [License](#license)
16-
- [Contact](#contact)
22+
### 🛡️ Security & Compliance
23+
- **Secure request headers** to minimize detection by anti-scraping mechanisms
24+
- **Rate-limiting & request throttling** to prevent being blocked
25+
- **Defensive coding** with safe XML parsing using `defusedxml`
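
As an illustration of safe sitemap parsing, the sketch below reads both regular sitemaps and sitemap index files, which is what recursive discovery needs. `parse_sitemap` is a hypothetical helper, not Tracebound's actual function. It prefers `defusedxml.ElementTree.fromstring` and falls back to the standard library parser only so the example runs where `defusedxml` is not installed; the fallback is not hardened against malicious XML.

```python
try:
    # defusedxml is the parser the README names; it rejects entity-expansion attacks.
    from defusedxml.ElementTree import fromstring
except ImportError:
    # Fallback for this sketch only; the stdlib parser is not hardened.
    from xml.etree.ElementTree import fromstring

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return (page_urls, nested_sitemap_urls) from a sitemap or sitemap index."""
    root = fromstring(xml_text)
    pages = [loc.text.strip()
             for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]
    nested = [loc.text.strip()
              for loc in root.findall("sm:sitemap/sm:loc", SITEMAP_NS) if loc.text]
    return pages, nested
```

A crawler would scan the first list and fetch each entry of the second list as a further sitemap, repeating until no nested sitemaps remain.
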
1726

27+
### 📊 Efficient Data Storage & Output Options
28+
- **Supports multiple output formats**: TXT, JSON, CSV
29+
- **Batch file I/O operations** to minimize disk writes
30+
- **Database storage support (future release)**
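
One way the three output formats might be produced, written in a single batch rather than once per match. The record layout (`url` and `matches` keys) and the function names are assumptions for this sketch; the README does not document the actual file layout.

```python
import csv
import io
import json


def render_results(results, fmt="txt"):
    """Serialize a list of {"url": ..., "matches": ...} dicts as txt, json, or csv."""
    fmt = fmt.lower()
    if fmt == "json":
        return json.dumps(results, indent=2)
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=["url", "matches"])
        writer.writeheader()
        writer.writerows(results)
        return buffer.getvalue()
    if fmt == "txt":
        return "".join(f"{r['url']}\t{r['matches']}\n" for r in results)
    raise ValueError(f"unsupported output format: {fmt}")


def save_results(results, path, fmt="txt"):
    """Write all results in one call instead of one write per match."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(render_results(results, fmt))
```

Collecting results in memory and writing once keeps disk access to a single operation per run, at the cost of holding all matches until the scan finishes.
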
1831

19-
## About the Project
20-
Sitemaps are fantastic resources, but manually combing through them is tedious. I wanted a quick way to find specific content patterns within a website's structure. Tracebound does just that. It leverages the sitemap to crawl all linked pages efficiently, hunting for any phrase or keyword I specify. It's been a fun little experiment in focused web crawling!
32+
### 🛠️ Configurability & Ease of Use
33+
- **Command-line arguments** for flexible scanning options
34+
- **Real-time progress tracking** with a progress bar (`tqdm`)
35+
- **Automatic domain protocol detection**
2136

22-
### Screenshots
23-
<a href=""><img src="https://fled.dev/assets/tracebound-demo.png" alt='image' width='700px'></a>
37+
## Installation
38+
### Prerequisites
39+
Tracebound requires Python 3.7 or later. Install the required dependencies with:
40+
```sh
41+
pip install -r requirements.txt
42+
```
43+
44+
### Required Dependencies
45+
- `aiohttp` (Asynchronous HTTP requests)
46+
- `async_timeout` (Timeout management for async requests)
47+
- `beautifulsoup4` (HTML parsing)
48+
- `defusedxml` (Secure XML parsing)
49+
- `tqdm` (Progress tracking)
50+
- `pyfiglet` (Fancy ASCII banner, optional)
2451

25-
## Getting Started
52+
## Usage
53+
### Basic Command
54+
```sh
55+
python tracebound.py <domain> <phrase>
56+
```
2657

27-
### Installation
28-
```bash
29-
pip install requirements.txt
58+
### Example
59+
```sh
60+
python tracebound.py example.com "contact us"
3061
```
62+
This scans every page listed in the sitemap of `example.com` for occurrences of "contact us".
3163

32-
### Run Locally
33-
```bash
34-
python3 main.py
64+
### Advanced Options
65+
| Option | Description |
66+
|--------|-------------|
67+
| `--regex` | Enable regex pattern matching instead of simple text search |
68+
| `--concurrency N` | Set the number of concurrent requests (default: 10) |
69+
| `--timeout N` | Set request timeout in seconds (default: 10) |
70+
| `--output txt/json/csv` | Specify the output format (default: TXT) |
71+
| `--debug` | Enable verbose logging for debugging |
72+
73+
Example with advanced options:
74+
```sh
75+
python tracebound.py example.com "data privacy" --regex --concurrency 20 --output json
3576
```
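
For reference, the documented command line can be modeled with `argparse` as in the sketch below. The flags and defaults come from the options table; the function name, help strings, and the use of integer seconds for `--timeout` are assumptions, and the real `tracebound.py` may define its parser differently.

```python
import argparse


def build_parser():
    """Build a parser matching the documented positional arguments and options."""
    parser = argparse.ArgumentParser(
        prog="tracebound.py",
        description="Search the pages of a domain for a phrase.",
    )
    parser.add_argument("domain", help="domain to scan, e.g. example.com")
    parser.add_argument("phrase", help="phrase or pattern to search for")
    parser.add_argument("--regex", action="store_true",
                        help="treat the phrase as a regular expression")
    parser.add_argument("--concurrency", type=int, default=10, metavar="N",
                        help="number of concurrent requests")
    parser.add_argument("--timeout", type=int, default=10, metavar="N",
                        help="request timeout in seconds")
    parser.add_argument("--output", choices=["txt", "json", "csv"], default="txt",
                        help="output format")
    parser.add_argument("--debug", action="store_true",
                        help="enable verbose logging")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```
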
3677

37-
## Roadmap
38-
* [ ] Regular Expression Support
39-
* [ ] Fuzzy Search
40-
* [ ] CSV Export
41-
* [ ] Web Interface
78+
## How It Works
79+
1. **Domain Validation**: Ensures a valid URL and auto-detects HTTP/HTTPS.
80+
2. **Sitemap Discovery**: Extracts all indexed URLs via `/sitemap.xml`.
81+
3. **Asynchronous Scanning**: Fetches and scans pages concurrently.
82+
4. **Phrase Matching**: Performs case-insensitive or regex-based search.
83+
5. **Logging & Output**: Saves results in TXT, JSON, or CSV format.
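
The phrase matching step can be sketched with the standard `re` module. This is an illustration under stated assumptions, not the project's code: `build_matcher` and `count_matches` are hypothetical names, and the sketch assumes that plain phrases are matched literally and case-insensitively while `--regex` patterns are compiled exactly as written.

```python
import re


def build_matcher(phrase, use_regex=False):
    """Compile the search phrase into a pattern object."""
    if use_regex:
        # Regex mode: use the pattern as given (assumed behavior).
        return re.compile(phrase)
    # Plain mode: escape metacharacters and ignore case.
    return re.compile(re.escape(phrase), re.IGNORECASE)


def count_matches(text, matcher):
    """Count non-overlapping occurrences of the pattern in `text`."""
    return sum(1 for _ in matcher.finditer(text))
```

Escaping in plain mode matters because a phrase such as `a.b` would otherwise match `axb` as well.
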
84+
85+
## Contribution
86+
Want to contribute? Open a pull request! Feel free to improve performance, add new features, or fix bugs.
4287

4388
## License
44-
This project is licensed under the MIT License. This means you are free to use, copy, modify, and distribute the software for any purpose, even commercial ones, as long as you include the copyright notice and license information.
89+
This project is licensed under the MIT License.
90+
91+
## Author
92+
Tracebound is developed and maintained by fled-dev.
4593

46-
## Contact
47-
Paul - - mail@fled.dev
