feat: establish project vision and craftsmanship principles #66
This is not just an update; it's a complete transformation from a static archive into an intelligent, automated, quality-driven toolkit.

## The Vision: From Museum to Living System

The repository has been frozen since 2017 with only maintenance commits. This transformation brings it into 2025 with modern automation, quality control, and comprehensive documentation.

## What's Changed

### 🤖 Intelligent Automation

- **validate.py**: Comprehensive validation system
  - Encoding detection & verification
  - Duplicate detection across 11.5M entries
  - SHA256 checksums for integrity
  - Statistics generation (min/max/avg lengths)
  - Manifest generation with full metadata
- **deduplicate.py**: Smart deduplication tool
  - Order-preserving duplicate removal
  - Batch processing capability
  - Detailed statistics reporting
- **manifest.json**: Auto-generated metadata
  - 70 wordlists validated
  - 11.5M entries, 11.4M unique
  - 102MB total size
  - Complete provenance tracking

### ✅ Real CI/CD Pipeline

- **Replaced** placeholder "Hello, world!" workflow
- **Added** comprehensive validation suite
  - File encoding checks
  - Corruption detection
  - Sensitive data scanning
  - Integrity verification
  - Statistics generation
- **Automated** quality assurance on every commit/PR

### 📖 Documentation Excellence

- **CLAUDE.md**: Project philosophy & guiding principles
  - Quality over quantity
  - Ethical use standards
  - Technical philosophy
  - Community-first approach
- **CONTRIBUTING.md**: Comprehensive contribution guide
  - Step-by-step contribution process
  - Quality standards & requirements
  - Ethical use guidelines
  - Testing procedures
- **CHANGELOG.md**: Transparent version history
  - Full project timeline (2015-2025)
  - Future roadmap
  - Semantic versioning strategy
- **README.md**: Transformed from catalog to guide
  - Quick start examples
  - Use case decision matrix
  - Tool compatibility guide
  - 70 wordlists with ratings & recommendations
  - Real-world usage examples
  - Ethical use guidelines

### 🗂️ Better Organization

- **.gitignore**: Prevent Python cache pollution
- **scripts/README.md**: Tool documentation
- **Removed**: Old blank.yml placeholder workflow

## The Numbers

- **11,572,279** total entries validated
- **11,490,579** unique entries (99.3% unique rate)
- **70** wordlist files checked
- **0** validation errors
- **8** warnings (duplicates flagged for review)
- **102 MB** of curated security data

## Philosophy

"Quality is not an act, it's a habit." - Aristotle

This transformation embodies that philosophy:

- Automation ensures consistent quality
- Validation prevents regressions
- Documentation guides contributors
- Manifest provides transparency
- CI/CD enforces standards

## What This Means

The repository is no longer just a collection of text files; it's a **living, breathing toolkit** that:

✅ Validates itself automatically
✅ Knows its own metadata
✅ Guides users to the right wordlist
✅ Welcomes contributions with clear standards
✅ Maintains quality through automation
✅ Stays transparent through comprehensive docs

From static archive to intelligent system. From maintenance mode to evolution. From good to insanely great.

---

Co-authored-by: Claude (Anthropic) <noreply@anthropic.com>
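For readers who want a concrete picture of the checks the description attributes to validate.py (SHA256 checksum, duplicate count, min/max/avg entry length), here is a minimal sketch. It is not the script from this PR: the function name `summarize_wordlist`, the returned keys, and the choice to skip blank lines are all assumptions made for illustration.

```python
import hashlib


def summarize_wordlist(data: bytes) -> dict:
    """Summarize one wordlist from its raw bytes (hypothetical helper)."""
    # Lenient decoding keeps this sketch from raising; a real validator
    # would more likely detect the encoding and report failures.
    text = data.decode("utf-8", errors="replace")
    # Split on "\n" only, so unusual characters inside an entry are kept.
    entries = [line.rstrip("\r") for line in text.split("\n")]
    entries = [entry for entry in entries if entry]  # assumption: skip blanks
    unique = set(entries)
    lengths = [len(entry) for entry in entries]
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "entries": len(entries),
        "unique": len(unique),
        "duplicates": len(entries) - len(unique),
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "avg_length": sum(lengths) / len(lengths) if lengths else 0.0,
    }
```

The checksum is taken over the raw bytes rather than the decoded text, so it stays comparable with the output of standard tools such as `sha256sum`.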
## Summary of Changes

Hello @duyet, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request fundamentally transforms the repository from a static collection of wordlists into a dynamic, intelligent, and quality-driven toolkit. The core purpose is to establish a clear project vision and craftsmanship principles, ensuring that the wordlists are not just stored, but actively curated, validated, and documented. This shift aims to make the repository a living system that automatically maintains quality, guides users effectively, and fosters community contributions with clear standards.
## Code Review
This is an excellent and transformative pull request that establishes a strong vision for the project. The addition of automation scripts, a CI/CD pipeline, and comprehensive documentation is a massive step forward. My review focuses on improving the robustness of the new Python scripts and ensuring data consistency in the generated manifest. Overall, fantastic work!

    removed = original_count - unique_count

    print(f" Unique: {unique_count:,} lines")
    print(f" Removed: {removed:,} duplicates ({removed/original_count*100:.1f}%)")

This line can cause a ZeroDivisionError if the input file is empty, as original_count would be 0. It's safer to calculate the percentage with a check to prevent the script from crashing.
Suggested change:

    - print(f" Removed: {removed:,} duplicates ({removed/original_count*100:.1f}%)")
    + print(f" Removed: {removed:,} duplicates ({(removed / original_count * 100) if original_count > 0 else 0:.1f}%)")

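An alternative to the inline conditional is to move the percentage into a small helper, which keeps the f-string readable. The sketch below also shows one common way to do order-preserving deduplication. The names `deduplicate`, `removed_percentage`, and `report` are hypothetical and not taken from deduplicate.py.

```python
def deduplicate(lines):
    """Drop repeated lines, keeping the first occurrence of each."""
    # dict keys keep insertion order (Python 3.7+), so order is preserved.
    return list(dict.fromkeys(lines))


def removed_percentage(original_count: int, unique_count: int) -> float:
    """Share of lines removed, in percent; 0.0 for an empty input."""
    if original_count == 0:
        return 0.0  # avoids ZeroDivisionError on empty files
    return (original_count - unique_count) / original_count * 100


def report(lines) -> str:
    """Build the summary line for one list of entries."""
    unique = deduplicate(lines)
    removed = len(lines) - len(unique)
    pct = removed_percentage(len(lines), len(unique))
    return f"Removed: {removed:,} duplicates ({pct:.1f}%)"
```

With this shape, `report([])` yields a 0.0% line instead of crashing, which is the behavior the review comment asks for.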

    """Main entry point for validation tool."""
    validator = WordlistValidator()

    if len(sys.argv) > 1 and sys.argv[1] == "--file":

Running the script with python3 scripts/validate.py --file without a filename will cause an IndexError on line 191 because sys.argv[2] will be out of bounds. You should check that len(sys.argv) is greater than 2 for this case.
Suggested change:

    - if len(sys.argv) > 1 and sys.argv[1] == "--file":
    + if len(sys.argv) > 2 and sys.argv[1] == "--file":

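The length check removes the crash, but a bare `--file` would then be treated like any other invocation rather than reported as a mistake. One possible alternative is to let the standard-library `argparse` module handle the option. This is only a sketch: the `parse_args` wrapper, the `PATH` metavar, and the help text are assumptions, and only the `--file` option mentioned in the review is modeled.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments (sketch; only --file is modeled)."""
    parser = argparse.ArgumentParser(description="Validate wordlists")
    # --file takes exactly one value, so a bare "--file" makes argparse
    # print a usage error and exit instead of raising IndexError later.
    parser.add_argument(
        "--file",
        metavar="PATH",
        default=None,
        help="validate a single wordlist instead of all of them",
    )
    return parser.parse_args(argv)
```

Calling `parse_args()` with no arguments reads from `sys.argv[1:]`. The returned namespace has `file` set to `None` when the option is absent.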

    print(f"📋 Processing {filepath.name}...")

    # Read file
    with open(filepath, 'r', encoding='utf-8', errors='ignore') as f:

Using errors='ignore' might silently discard data if there are encoding issues in a file. Given the project's emphasis on quality and the fact that validate.py already handles multiple encodings, it would be more robust to adopt a similar strategy here. At a minimum, consider changing to errors='replace' to make encoding issues visible, or errors='strict' to fail fast on malformed files.
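One way to follow the "fail fast" option while still accepting non-UTF-8 files is to try a short list of encodings in strict mode and record which one succeeded. The sketch below is an assumption about how that could look. The function name `read_wordlist` and the candidate list `("utf-8", "latin-1")` are illustrative and not taken from validate.py.

```python
def read_wordlist(path, encodings=("utf-8", "latin-1")):
    """Read a wordlist strictly, trying each candidate encoding in order.

    Returns (lines, encoding_used). Raises ValueError if none succeeds.
    """
    for encoding in encodings:
        try:
            with open(path, "r", encoding=encoding, errors="strict") as f:
                # Decoding happens while iterating, inside the try block.
                return [line.rstrip("\n") for line in f], encoding
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"could not decode {path} with any of {encodings}")
```

Note that `latin-1` accepts every byte sequence, so when it is in the list it acts as a catch-all and should come last. Leaving it out makes malformed files fail loudly instead.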

    for txt_file in self.root_dir.rglob('*.txt'):
        if not any(skip in txt_file.parts for skip in skip_dirs):
            wordlists.append(txt_file)

    for lst_file in self.root_dir.rglob('*.lst'):
        if not any(skip in lst_file.parts for skip in skip_dirs):
            wordlists.append(lst_file)

This logic for finding wordlists has duplicated code for .txt and .lst files. You can refactor this to be more concise and maintainable by iterating over a tuple of extensions. This avoids repetition and makes it easier to add more file types in the future.
Suggested change:

    - for txt_file in self.root_dir.rglob('*.txt'):
    -     if not any(skip in txt_file.parts for skip in skip_dirs):
    -         wordlists.append(txt_file)
    - for lst_file in self.root_dir.rglob('*.lst'):
    -     if not any(skip in lst_file.parts for skip in skip_dirs):
    -         wordlists.append(lst_file)
    + for ext in ('*.txt', '*.lst'):
    +     for path in self.root_dir.rglob(ext):
    +         if not any(skip in path.parts for skip in skip_dirs):
    +             wordlists.append(path)


          "avg_length": 5.295072115384615
        },
        {
          "path": "uniqpass-v16-passwords.txt",

…tion

The previous grep-based null byte detection was producing false positives in the GitHub Actions environment. Files like 2151220-passwords.txt were incorrectly flagged as corrupted when they contained zero null bytes.

Changes:
- Replaced shell grep command with Python-based integrity verification
- Uses Python's binary file reading for accurate null byte detection
- Verified locally: 70/70 files pass (0 null bytes detected)
- More reliable across different shell environments

The new check:
✓ Properly detects actual binary corruption (null bytes)
✓ Works consistently across platforms
✓ Provides clear error messages with file paths
✓ No false positives

Tested locally with Python verification - all files are clean.
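A null byte check based on binary reads, as this commit message describes, could look roughly like the following. It is a sketch under assumptions, not the code from the commit: the function names `find_null_byte` and `check_wordlists`, the 1 MiB chunk size, and the returned format are all hypothetical.

```python
def find_null_byte(path, chunk_size=1 << 20):
    """Return the offset of the first NUL byte in the file, or -1 if none."""
    offset = 0
    with open(path, "rb") as f:  # binary mode: no decoding, no shell quoting
        while chunk := f.read(chunk_size):
            index = chunk.find(b"\x00")
            if index != -1:
                return offset + index
            offset += len(chunk)
    return -1


def check_wordlists(paths):
    """Return (path, offset) pairs for every file containing a NUL byte."""
    problems = []
    for path in paths:
        offset = find_null_byte(path)
        if offset != -1:
            problems.append((str(path), offset))
    return problems
```

Reading in chunks keeps memory use flat for the larger lists, and returning the offset gives an error message something concrete to point at. A CI step could exit non-zero whenever `check_wordlists` returns a non-empty list.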
- Restored original project image from blogspot
- Removed overly promotional language
- Simplified descriptions (removed star ratings, 'Best for', etc.)
- Changed to direct 'Use:' statements
- Removed excessive bold text and emoji
- Made tone more straightforward and technical
- Kept all information but with cleaner presentation